Search | BVS - MINISTÉRIO DA SAÚDE

Using residual regressions to quantify and map signal leakage in genomic prediction.

Valente, Bruno D; de Los Campos, Gustavo; Grueneberg, Alexander; Chen, Ching-Yi; Ros-Freixedes, Roger; Herring, William O.

Genet Sel Evol ; 55(1): 57, 2023 Aug 07.

Article in English | MEDLINE | ID: mdl-37550618

ABSTRACT

BACKGROUND: Most genomic prediction applications in animal breeding use genotypes with tens of thousands of single nucleotide polymorphisms (SNPs). However, modern sequencing technologies and imputation algorithms can generate ultra-high-density genotypes (including millions of SNPs) at an affordable cost. Empirical studies have not produced clear evidence that using ultra-high-density genotypes can significantly improve prediction accuracy. However, (whole-genome) prediction accuracy is not very informative about the ability of a model to capture the genetic signals from specific genomic regions. To address this problem, we propose a simple methodology that detects chromosome regions for which a specific model (e.g., single-step genomic best linear unbiased prediction (ssGBLUP)) may fail to fully capture the genetic signal present in such segments, a phenomenon that we refer to as signal leakage. We propose to detect regions with evidence of signal leakage by testing the association of residuals from a pedigree or a genomic model with SNP genotypes. We discuss how this approach can be used to map regions with signals that are poorly captured by a model and to identify strategies to fix those problems (e.g., using a different prior or increasing marker density). Finally, we explored the proposed approach to scan for signal leakage of different models (pedigree-based, ssGBLUP, and various Bayesian models) applied to growth-related phenotypes (average daily gain and backfat thickness) in pigs. RESULTS: We report widespread evidence of signal leakage for pedigree-based models. Including a percentage of animals with SNP data in ssGBLUP reduced the extent of signal leakage. However, local peaks of missed signals remained in some regions, even when all animals were genotyped. Using variable selection priors solves leakage points that are caused by excessive shrinkage of marker effects. Nevertheless, these models still miss signals in some regions due to low linkage disequilibrium between the SNPs on the array used and causal variants. Thus, we discuss how such problems could be addressed by adding sequence SNPs from those regions to the prediction model. CONCLUSIONS: Residual single-marker regression analysis is a simple approach that can be used to detect regional genomic signals that are poorly captured by a model and to indicate ways to fix such problems.
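The residual single-marker regression idea described in this abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the simulated genotypes, the ridge regression used as a stand-in for a genomic model, the shrinkage value `lam=10.0`, and the helper names `ridge_fit_predict` and `residual_scan` are all assumptions made for illustration. The fitted model deliberately omits one causal SNP (mimicking a variant that the marker panel does not tag), and the scan then regresses the model residuals on every SNP, one at a time.

```python
import numpy as np


def ridge_fit_predict(X, y, lam):
    """Fit ridge regression on centered markers and return fitted values.

    Ridge is only a simple stand-in for a genomic model here (assumption).
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    return y.mean() + Xc @ beta


def residual_scan(residuals, genotypes):
    """Return one t-statistic per SNP from regressing residuals on that SNP.

    Assumes every SNP is polymorphic (non-zero variance) in the sample.
    """
    n = residuals.shape[0]
    e = residuals - residuals.mean()
    G = genotypes - genotypes.mean(axis=0)
    sxx = (G ** 2).sum(axis=0)               # per-SNP sum of squares
    slope = (G.T @ e) / sxx                  # per-SNP least-squares slope
    rss = (e ** 2).sum() - slope ** 2 * sxx  # per-SNP residual sum of squares
    se = np.sqrt(rss / (n - 2) / sxx)
    return slope / se


# Simulated example (hypothetical data): 500 animals, 50 SNPs, one causal SNP.
rng = np.random.default_rng(1)
n, p, causal = 500, 50, 20
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)
y = 1.0 * G[:, causal] + rng.normal(size=n)

# The fitted model only sees the SNPs other than the causal one.
array_idx = np.delete(np.arange(p), causal)
fitted = ridge_fit_predict(G[:, array_idx], y, lam=10.0)

# Scan all SNPs for association with the residuals of that model.
t = residual_scan(y - fitted, G)
top = int(np.argmax(np.abs(t)))
print("SNP with strongest residual association:", top)
```

In this toy setting the largest residual association should fall on the omitted SNP, which is the pattern the abstract calls signal leakage. A real analysis would use residuals from the actual pedigree or genomic model and would need to account for relatedness and multiple testing.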

Subjects

Genome, Genomics, Animals, Swine, Bayes Theorem, Genomics/methods, Genotype, Phenotype, Single Nucleotide Polymorphism, Pedigree, Genetic Models

Fine mapping and accurate prediction of complex traits using Bayesian Variable Selection models applied to biobank-size data.

de Los Campos, Gustavo; Grueneberg, Alexander; Funkhouser, Scott; Pérez-Rodríguez, Paulino; Samaddar, Anirban.

Eur J Hum Genet ; 31(3): 313-320, 2023 03.

Article in English | MEDLINE | ID: mdl-35853950

ABSTRACT

Modern GWAS studies use an enormous sample size and ultra-high density SNP genotypes. These conditions reduce the mapping resolution of marginal association tests, the method most often used in GWAS. Multi-locus Bayesian Variable Selection (BVS) offers a one-stop solution for powerful and precise mapping of risk variants and polygenic risk score (PRS) prediction. We show (with an extensive simulation) that multi-locus BVS methods can achieve high power with a low false discovery rate and a much better mapping resolution than marginal association tests. We demonstrate the performance of BVS for mapping and PRS prediction using data from blood biomarkers from the UK-Biobank (~300,000 samples and ~5.5 million SNPs). The article is accompanied by open-source R software that implements the methods used in the study and scales to biobank-sized data.

Subjects

Biological Specimen Banks, Multifactorial Inheritance, Humans, Bayes Theorem, Software, Computer Simulation, Genome-Wide Association Study/methods, Single Nucleotide Polymorphism

MOSS: multi-omic integration with sparse value decomposition.

Gonzalez-Reymundez, Agustin; Grueneberg, Alexander; Lu, Guanqi; Alves, Filipe Couto; Rincon, Gonzalo; Vazquez, Ana I.

Bioinformatics ; 38(10): 2956-2958, 2022 05 13.

Article in English | MEDLINE | ID: mdl-35561193

ABSTRACT

SUMMARY: This article presents multi-omic integration with sparse value decomposition (MOSS), a free and open-source R package for integration and feature selection in multiple large omics datasets. This package is computationally efficient and offers biological insight through capabilities, such as cluster analysis and identification of informative omic features. AVAILABILITY AND IMPLEMENTATION: https://CRAN.R-project.org/package=MOSS. SUPPLEMENTARY INFORMATION: Supplementary information can be found at https://github.com/agugonrey/GonzalezReymundez2021.

Subjects

Software, Cluster Analysis

BGData - A Suite of R Packages for Genomic Analysis with Big Data.

Grueneberg, Alexander; de Los Campos, Gustavo.

G3 (Bethesda) ; 9(5): 1377-1383, 2019 05 07.

Article in English | MEDLINE | ID: mdl-30894453

ABSTRACT

We created a suite of packages to enable analysis of extremely large genomic data sets (potentially millions of individuals and millions of molecular markers) within the R environment. The package offers: a matrix-like interface for .bed files (PLINK's binary format for genotype data), a novel class of linked arrays that allows linking data stored in multiple files to form a single array accessible from the R computing environment, methods for parallel computing capabilities that can carry out computations on very large data sets without loading the entire data into memory and a basic set of methods for statistical genetic analyses. The package is accessible through CRAN and GitHub. In this note, we describe the classes and methods implemented in each of the packages that make the suite and illustrate the use of the packages using data from the UK Biobank.
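The idea of computing on genotype data without loading it all into memory can be sketched in Python. This is only an analogy and does not reproduce the BGData API, which is an R suite: the file here is a raw `int8` matrix rather than PLINK's packed .bed format, and the helper name `chunked_column_means`, the file name, and the chunk size are hypothetical choices for illustration.

```python
import os
import tempfile

import numpy as np


def chunked_column_means(matrix, chunk_size):
    """Compute column means by reading a block of columns at a time."""
    n_cols = matrix.shape[1]
    means = np.empty(n_cols)
    for start in range(0, n_cols, chunk_size):
        stop = min(start + chunk_size, n_cols)
        # Only this block of columns is materialized in memory.
        block = np.asarray(matrix[:, start:stop], dtype=float)
        means[start:stop] = block.mean(axis=0)
    return means


# Hypothetical genotype matrix: 200 individuals x 1000 SNPs coded 0/1/2.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(200, 1000), dtype=np.int8)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "genotypes.dat")
    genotypes.tofile(path)  # raw int8 dump, not a real .bed file
    # Memory-mapped, matrix-like view of the file on disk.
    mapped = np.memmap(path, dtype=np.int8, mode="r", shape=genotypes.shape)
    allele_freq = chunked_column_means(mapped, chunk_size=128) / 2.0
    del mapped  # release the mapping before the temporary file is removed

print(allele_freq[:5])
```

The same pattern (a matrix-like view over a file plus a loop over chunks) extends to other per-marker summaries, and the chunks could be dispatched to parallel workers.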

Subjects

Big Data, Computational Biology/methods, Genomics/methods, Software, Algorithms

Integrated landscape of copy number variation and RNA expression associated with nodal metastasis in invasive ductal breast carcinoma.

Behring, Michael; Shrestha, Sadeep; Manne, Upender; Cui, Xiangqin; Gonzalez-Reymundez, Agustin; Grueneberg, Alexander; Vazquez, Ana I.

Oncotarget ; 9(96): 36836-36848, 2018 Dec 07.

Article in English | MEDLINE | ID: mdl-30627325

ABSTRACT

BACKGROUND: Lymph node metastasis (NM) in breast cancer is a clinical predictor of patient outcomes, but how its genetic underpinnings contribute to aggressive phenotypes is unclear. Our objective was to create the first landscape analysis of CNV-associated NM in ductal breast cancer. To assess the role of copy number variations (CNVs) in NM, we compared CNVs and/or associated mRNA expression in primary tumors of patients with NM to those without metastasis. RESULTS: We found CNV loss in chromosomes 1, 3, 9, 18, and 19 and gains in chromosomes 5, 8, 12, 14, 16-17, and 20 that were associated with NM and replicated in both databases. In primary tumors, per-gene CNVs associated with NM were ten times more frequent than mRNA expression; however, there were few CNV-driven changes in mRNA expression that differed by nodal status. Overlapping regions of CNV changes and mRNA expression were evident for the CTAGE5 gene. In 8q12, 11q13-14, 20q1, and 17q14-24 regions, there were gene-specific gains in CNV-driven mRNA expression associated with NM. METHODS: Data on CNV and mRNA expression from the TCGA and the METABRIC consortium of breast ductal carcinoma were utilized to identify CNV-based features associated with NM. Within each dataset, associations were compared across omic platforms to identify CNV-driven variations in gene expression. Only replications across both datasets were considered as determinants of NM. CONCLUSIONS: Gains in CTAGE5, NDUFC2, EIF4EBP1, and PSCA genes and their expression may aid in early diagnosis of metastatic breast carcinoma and have potential as therapeutic targets.

Will Big Data Close the Missing Heritability Gap?

Kim, Hwasoon; Grueneberg, Alexander; Vazquez, Ana I; Hsu, Stephen; de Los Campos, Gustavo.

Genetics ; 207(3): 1135-1145, 2017 11.

Article in English | MEDLINE | ID: mdl-28893854

ABSTRACT

Despite the important discoveries reported by genome-wide association (GWA) studies, for most traits and diseases the prediction R-squared (R-sq.) achieved with genetic scores remains considerably lower than the trait heritability. Modern biobanks will soon deliver unprecedentedly large biomedical data sets: Will the advent of big data close the gap between the trait heritability and the proportion of variance that can be explained by a genomic predictor? We addressed this question using Bayesian methods and a data analysis approach that produces a surface response relating prediction R-sq. with sample size and model complexity (e.g., number of SNPs). We applied the methodology to data from the interim release of the UK Biobank. Focusing on human height as a model trait and using 80,000 records for model training, we achieved a prediction R-sq. in testing (n = 22,221) of 0.24 (95% C.I.: 0.23-0.25). Our estimates show that prediction R-sq. increases with sample size, reaching an estimated plateau at values that ranged from 0.1 to 0.37 for models using 500 and 50,000 (GWA-selected) SNPs, respectively. Soon much larger data sets will become available. Using the estimated surface response, we forecast that larger sample sizes will lead to further improvements in prediction R-sq. We conclude that big data will lead to a substantial reduction of the gap between trait heritability and the proportion of interindividual differences that can be explained with a genomic predictor. However, even with the power of big data, for complex traits we anticipate that the gap between prediction R-sq. and trait heritability will not be fully closed.

Subjects

Factual Databases/standards, Genetic Variation, Genome-Wide Association Study/standards, Heritable Quantitative Trait, Adult, Aged, Body Height/genetics, Data Accuracy, Female, Humans, Male, Middle Aged, Genetic Models, Sample Size

Distribution of cold adaptation proteins in microbial mats in Lake Joyce, Antarctica: Analysis of metagenomic data by using two bioinformatics tools.

Koo, Hyunmin; Hakim, Joseph A; Fisher, Phillip R E; Grueneberg, Alexander; Andersen, Dale T; Bej, Asim K.

J Microbiol Methods ; 120: 23-8, 2016 Jan.

Article in English | MEDLINE | ID: mdl-26578243

ABSTRACT

In this study, we report the distribution and abundance of cold-adaptation proteins in microbial mat communities in the perennially ice-covered Lake Joyce, located in the McMurdo Dry Valleys, Antarctica. We have used MG-RAST and R code bioinformatics tools on Illumina HiSeq2000 shotgun metagenomic data and compared the filtering efficacy of these two methods on cold-adaptation proteins. Overall, the abundance of cold-shock DEAD-box protein A (CSDA), antifreeze proteins (AFPs), fatty acid desaturase (FAD), trehalose synthase (TS), and cold-shock family of proteins (CSPs) were present in all mat samples at high, moderate, or low levels, whereas the ice nucleation protein (INP) was present only in the ice and bulbous mat samples at insignificant levels. Considering the near homogeneous temperature profile of Lake Joyce (0.08-0.29 °C), the distribution and abundance of these proteins across various mat samples predictively correlated with known functional attributes necessary for microbial communities to thrive in this ecosystem. The comparison of the MG-RAST and the R code methods showed dissimilar occurrences of the cold-adaptation protein sequences, though with insignificant ANOSIM (R = 0.357; p-value = 0.012), ADONIS (R² = 0.274; p-value = 0.03) and STAMP (p-values = 0.521-0.984) statistical analyses. Furthermore, filtering targeted sequences using the R code accounted for taxonomic groups by avoiding sequence redundancies, whereas the MG-RAST provided total counts resulting in a higher sequence output. The results from this study revealed for the first time the distribution of cold-adaptation proteins in six different types of microbial mats in Lake Joyce, while suggesting a simpler and more manageable user-defined method of R code, as compared to a web-based MG-RAST pipeline.

Subjects

Antifreeze Proteins/analysis, Lakes/microbiology, Metagenomics/methods, Water Microbiology, Antarctic Regions, Antifreeze Proteins/genetics, Bacterial Outer Membrane Proteins/analysis, Bacterial Outer Membrane Proteins/genetics, Bacterial Proteins/analysis, Bacterial Proteins/genetics, Base Sequence, Cold Temperature, Computational Biology/methods, Ecosystem, Fatty Acid Desaturases/analysis, Glucosyltransferases/analysis, Ice/analysis
