Results 1 - 20 of 2,016
1.
Cell ; 186(17): 3659-3673.e23, 2023 08 17.
Article in English | MEDLINE | ID: mdl-37527660

ABSTRACT

Many regions in the human genome vary in length among individuals due to variable numbers of tandem repeats (VNTRs). To assess the phenotypic impact of VNTRs genome-wide, we applied a statistical imputation approach to estimate the lengths of 9,561 autosomal VNTR loci in 418,136 unrelated UK Biobank participants and 838 GTEx participants. Association and statistical fine-mapping analyses identified 58 VNTRs that appeared to influence a complex trait in UK Biobank, 18 of which also appeared to modulate expression or splicing of a nearby gene. Non-coding VNTRs at TMCO1 and EIF3H appeared to generate the largest known contributions of common human genetic variation to risk of glaucoma and colorectal cancer, respectively. Each of these two VNTRs associated with a >2-fold range of risk across individuals. These results reveal a substantial and previously unappreciated role of non-coding VNTRs in human health and gene regulation.


Subject(s)
Calcium Channels , Colorectal Neoplasms , Eukaryotic Initiation Factor-3 , Glaucoma , Minisatellite Repeats , Humans , Calcium Channels/genetics , Colorectal Neoplasms/genetics , Genome, Human , Glaucoma/genetics , Polymorphism, Genetic , Eukaryotic Initiation Factor-3/genetics
2.
Cell ; 185(18): 3426-3440.e19, 2022 09 01.
Article in English | MEDLINE | ID: mdl-36055201

ABSTRACT

The 1000 Genomes Project (1kGP) is the largest fully open resource of whole-genome sequencing (WGS) data consented for public distribution without access or use restrictions. The final, phase 3 release of the 1kGP included 2,504 unrelated samples from 26 populations and was based primarily on low-coverage WGS. Here, we present a high-coverage 3,202-sample WGS 1kGP resource, which now includes 602 complete trios, sequenced to a depth of 30X using Illumina. We performed single-nucleotide variant (SNV) and short insertion and deletion (INDEL) discovery and generated a comprehensive set of structural variants (SVs) by integrating multiple analytic methods through a machine learning model. We show gains in sensitivity and precision of variant calls compared to phase 3, especially among rare SNVs as well as INDELs and SVs spanning the frequency spectrum. We also generated an improved reference imputation panel, making variants discovered here accessible for association studies.


Subject(s)
Genome, Human , Whole Genome Sequencing , Female , High-Throughput Nucleotide Sequencing/methods , Humans , INDEL Mutation , Male , Polymorphism, Single Nucleotide
3.
Cell ; 179(3): 736-749.e15, 2019 10 17.
Article in English | MEDLINE | ID: mdl-31626772

ABSTRACT

Underrepresentation of Asian genomes has hindered population and medical genetics research on Asians, leading to population disparities in precision medicine. By whole-genome sequencing of 4,810 Singapore Chinese, Malays, and Indians, we found 98.3 million SNPs and small insertions or deletions, over half of which are novel. Population structure analysis demonstrated great representation of Asian genetic diversity by three ethnicities in Singapore and revealed a Malay-related novel ancestry component. Furthermore, demographic inference suggested that Malays split from Chinese ∼24,800 years ago and experienced significant admixture with East Asians ∼1,700 years ago, coinciding with the Austronesian expansion. Additionally, we identified 20 candidate loci for natural selection, 14 of which harbored robust associations with complex traits and diseases. Finally, we show that our data can substantially improve genotype imputation in diverse Asian and Oceanian populations. These results highlight the value of our data as a resource to empower human genetics discovery across broad geographic regions.


Subject(s)
Genetics, Population , Genome, Human/genetics , Selection, Genetic , Whole Genome Sequencing , Asian People/genetics , Female , Genotype , Humans , Malaysia/epidemiology , Male , Polymorphism, Single Nucleotide/genetics , Singapore/epidemiology
4.
Cell ; 174(3): 716-729.e27, 2018 07 26.
Article in English | MEDLINE | ID: mdl-29961576

ABSTRACT

Single-cell RNA sequencing technologies suffer from many sources of technical noise, including under-sampling of mRNA molecules, often termed "dropout," which can severely obscure important gene-gene relationships. To address this, we developed MAGIC (Markov affinity-based graph imputation of cells), a method that shares information across similar cells, via data diffusion, to denoise the cell count matrix and fill in missing transcripts. We validate MAGIC on several biological systems and find it effective at recovering gene-gene relationships and additional structures. Applied to the epithelial-to-mesenchymal transition, MAGIC reveals a phenotypic continuum, with the majority of cells residing in intermediate states that display stem-like signatures, and infers known and previously uncharacterized regulatory interactions, demonstrating that our approach can successfully uncover regulatory relations without perturbations.
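
The data-diffusion idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the fixed Gaussian bandwidth `sigma`, the step count `t`, the function name `diffuse`, and the toy matrix are all assumptions chosen for illustration, and the sketch omits whatever kernel adaptation, dimensionality reduction, or rescaling the published method may use. It only shows the core step of building a Markov matrix over cells and averaging each cell with its neighbors.

```python
import numpy as np

def diffuse(X, sigma=2.0, t=3):
    """Smooth a cells x genes matrix with t steps of a random walk over cells."""
    # Pairwise squared Euclidean distances between cells (rows of X).
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    # Gaussian affinity; sigma is a fixed, hypothetical bandwidth.
    A = np.exp(-d2 / (2.0 * sigma**2))
    # Row-normalize so every row is a probability distribution (Markov matrix).
    M = A / A.sum(axis=1, keepdims=True)
    # Apply the t-step transition matrix: each cell becomes a weighted
    # average of the cells it can reach in t steps.
    return np.linalg.matrix_power(M, t) @ X

# Toy data: cells 0-2 form one group, cells 3-4 another.
# Cell 2 has lost its count for gene 1 (a simulated dropout).
X = np.array([[4.0, 4.0, 0.0],
              [4.0, 4.0, 0.0],
              [4.0, 0.0, 0.0],
              [0.0, 0.0, 6.0],
              [0.0, 0.0, 6.0]])
X_smooth = diffuse(X)
print(np.round(X_smooth, 2))
```

In this toy case the missing gene 1 value of cell 2 is pulled toward the values of its two close neighbors, while cells in the other group stay near zero for that gene because their affinity to the first group is negligible.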


Subject(s)
Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Algorithms , Cell Line , Epistasis, Genetic/genetics , Gene Regulatory Networks/genetics , Humans , Markov Chains , MicroRNAs/genetics , RNA, Messenger/genetics , Software
5.
Am J Hum Genet ; 111(5): 990-995, 2024 May 02.
Article in English | MEDLINE | ID: mdl-38636510

ABSTRACT

Since genotype imputation was introduced, researchers have been relying on the estimated imputation quality from imputation software to perform post-imputation quality control (QC). However, this quality estimate (denoted as Rsq) performs less well for lower-frequency variants. We recently published MagicalRsq, a machine-learning-based imputation quality calibration, which leverages additional typed markers from the same cohort and outperforms Rsq as a QC metric. In this work, we extended the original MagicalRsq to allow cross-cohort model training and named the new model MagicalRsq-X. We removed the cohort-specific estimated minor allele frequency and included linkage disequilibrium scores and recombination rates as additional features. Leveraging whole-genome sequencing data from TOPMed, specifically participants in the BioMe, JHS, WHI, and MESA studies, we performed comprehensive cross-cohort evaluations for predominantly European and African ancestry individuals based on their inferred global ancestry with the 1000 Genomes and Human Genome Diversity Project data as reference. Our results suggest MagicalRsq-X outperforms Rsq in almost every setting, with 7.3%-14.4% improvement in squared Pearson correlation with true R², corresponding to gains of 85K-218K variants. We further developed a metric to quantify the genetic distances of a target cohort relative to a reference cohort and showed that this metric largely explained the performance of MagicalRsq-X models. Finally, we found MagicalRsq-X saved up to 53 known genome-wide significant variants in one of the largest blood cell trait GWASs that would be missed using the original Rsq for QC. In conclusion, MagicalRsq-X shows superiority for post-imputation QC and benefits genetic studies by distinguishing well and poorly imputed lower-frequency variants.


Subject(s)
Gene Frequency , Genotype , Polymorphism, Single Nucleotide , Software , Humans , Cohort Studies , Linkage Disequilibrium , Genome-Wide Association Study/methods , Genome, Human , Quality Control , Machine Learning , Whole Genome Sequencing/standards , Whole Genome Sequencing/methods
6.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38706317

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) enables the exploration of cellular heterogeneity by analyzing gene expression profiles in complex tissues. However, scRNA-seq data often suffer from technical noise, dropout events and sparsity, hindering downstream analyses. Although existing works attempt to mitigate these issues by utilizing graph structures for data denoising, they involve the risk of propagating noise and fall short of fully leveraging the inherent data relationships, relying mainly on one of cell-cell or gene-gene associations and graphs constructed by initial noisy data. To this end, this study presents single-cell bilevel feature propagation (scBFP), a two-step graph-based feature propagation method. It initially imputes zero values using non-zero values, ensuring that the imputation process does not affect the non-zero values due to dropout. Subsequently, it denoises the entire dataset by leveraging gene-gene and cell-cell relationships in the respective steps. Extensive experimental results on scRNA-seq data demonstrate the effectiveness of scBFP in various downstream tasks, uncovering valuable biological insights.
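
The first step described in this abstract, filling zeros from non-zero values without altering the non-zero values, resembles generic feature propagation with clamping. The sketch below illustrates that generic technique only; it is not scBFP itself. The function name `propagate_zeros`, the hand-written cell graph `A`, the iteration count, and the choice to treat every zero as missing are assumptions, and the subsequent denoising step is not shown.

```python
import numpy as np

def propagate_zeros(X, A, n_iter=50):
    """Fill zero entries of X (cells x genes) by averaging over a cell graph A."""
    known = X > 0  # assumption: every zero is treated as missing
    # Row-normalize the adjacency so each update is a neighbor average.
    deg = A.sum(axis=1, keepdims=True)
    P = A / np.where(deg == 0, 1.0, deg)
    Y = X.astype(float).copy()
    for _ in range(n_iter):
        Y = P @ Y
        # Clamp: observed non-zero values are restored after every step,
        # so propagation can only change the originally zero entries.
        Y[known] = X[known]
    return Y

# Toy chain graph of three cells: 0 - 1 - 2.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
X = np.array([[2.0, 0.0],
              [0.0, 6.0],
              [4.0, 0.0]])
Y = propagate_zeros(X, A)
print(Y)
```

Clamping after each step is what keeps the observed counts fixed; without it, repeated averaging would smooth the measured values as well.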


Subject(s)
Sequence Analysis, RNA , Single-Cell Analysis , Single-Cell Analysis/methods , Sequence Analysis, RNA/methods , Humans , Algorithms , Gene Expression Profiling/methods , Computational Biology/methods , RNA-Seq/methods
7.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38628114

ABSTRACT

Spatial transcriptomics (ST) has become a powerful tool for exploring the spatial organization of gene expression in tissues. Imaging-based methods, though offering superior spatial resolutions at the single-cell level, are limited in either the number of imaged genes or the sensitivity of gene detection. Existing approaches for enhancing ST rely on the similarity between ST cells and reference single-cell RNA sequencing (scRNA-seq) cells. In contrast, we introduce stDiff, which leverages relationships between gene expression abundance in scRNA-seq data to enhance ST. stDiff employs a conditional diffusion model, capturing gene expression abundance relationships in scRNA-seq data through two Markov processes: one introducing noise to transcriptomics data and the other denoising to recover them. The missing portion of ST is predicted by incorporating the original ST data into the denoising process. In our comprehensive performance evaluation across 16 datasets, utilizing multiple clustering and similarity metrics, stDiff stands out for its exceptional ability to preserve topological structures among cells, positioning itself as a robust solution for cell population identification. Moreover, stDiff's enhancement outcomes closely mirror the actual ST data within the batch space. Across diverse spatial expression patterns, our model accurately reconstructs them, delineating distinct spatial boundaries. This highlights stDiff's capability to unify the observed and predicted segments of ST data for subsequent analysis. We anticipate that stDiff, with its innovative approach, will contribute to advancing ST imputation methodologies.


Subject(s)
Benchmarking , Gene Expression Profiling , Cluster Analysis , Diffusion , Markov Chains , Sequence Analysis, RNA , Transcriptome
8.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38349062

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool to gain biological insights at the cellular level. However, due to technical limitations of the existing sequencing technologies, low gene expression values are often omitted, leading to inaccurate gene counts. Existing methods, including advanced deep learning techniques, struggle to reliably impute gene expressions due to a lack of mechanisms that explicitly consider the underlying biological knowledge of the system. In reality, it has long been recognized that gene-gene interactions may serve as reflective indicators of underlying biological processes, presenting discriminative signatures of the cells. A genomic data analysis framework that is capable of leveraging the underlying gene-gene interactions is thus highly desirable and could allow for more reliable identification of distinctive patterns of the genomic data through extraction and integration of intricate biological characteristics of the genomic data. Here we tackle the problem in two steps to exploit the gene-gene interactions of the system. We first reposition the genes into a 2D grid such that their spatial configuration reflects their interactive relationships. To alleviate the need for labeled ground truth gene expression datasets, a self-supervised 2D convolutional neural network is employed to extract the contextual features of the interactions from the spatially configured genes and impute the omitted values. Extensive experiments with both simulated and experimental scRNA-seq datasets are carried out to demonstrate the superior performance of the proposed strategy against the existing imputation methods.


Subject(s)
Deep Learning , Epistasis, Genetic , Data Analysis , Genomics , Gene Expression , Gene Expression Profiling , Sequence Analysis, RNA
9.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38600665

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) facilitates the study of cell type heterogeneity and the construction of cell atlases. However, due to its limitations, many genes may be detected as having zero expression, i.e. dropout events, leading to bias in downstream analyses and hindering the identification and characterization of cell types and cell functions. Although many imputation methods have been developed, their performances are generally lower than expected across different kinds and dimensions of data and application scenarios. Therefore, developing an accurate and robust single-cell gene expression data imputation method is still essential. To maintain the original cell-cell and gene-gene correlations and to leverage bulk RNA sequencing (bulk RNA-seq) data, we propose scINRB, a single-cell gene expression imputation method with network regularization and bulk RNA-seq data. scINRB adopts network-regularized non-negative matrix factorization to ensure that the imputed data maintains the cell-cell and gene-gene similarities and also approaches the gene average expression calculated from bulk RNA-seq data. To evaluate the performance, we test scINRB on simulated and experimental datasets and compare it with other commonly used imputation methods. The results show that scINRB recovers gene expression accurately even in the case of high dropout rates and dimensions, preserves cell-cell and gene-gene similarities and improves various downstream analyses including visualization, clustering and trajectory inference.


Subject(s)
Algorithms , Single-Cell Analysis , RNA-Seq , Single-Cell Analysis/methods , Sequence Analysis, RNA/methods , Cluster Analysis , Gene Expression , Gene Expression Profiling , Software
10.
Mol Biol Evol ; 41(5)2024 May 03.
Article in English | MEDLINE | ID: mdl-38662789

ABSTRACT

Ancient genomic analyses are often restricted to utilizing pseudohaploid data due to low genome coverage. Leveraging low-coverage data by imputation to calculate phased diploid genotypes that enable haplotype-based interrogation and single nucleotide polymorphism (SNP) calling at unsequenced positions is highly desirable. This has not been investigated for ancient cattle genomes despite these being compelling subjects for archeological, evolutionary, and economic reasons. Here, we test this approach by sequencing a Mesolithic European aurochs (18.49×; 9,852 to 9,376 calBCE) and an Early Medieval European cow (18.69×; 427 to 580 calCE) and combine these with published individuals: two ancient and three modern. We downsample these genomes (0.25×, 0.5×, 1.0×, and 2.0×) and impute diploid genotypes, utilizing a reference panel of 171 published modern cattle genomes that we curated for 21.7 million (Mn) phased SNPs. We recover high densities of correct calls with an accuracy of >99.1% at variant sites for the lowest downsample depth of 0.25×, increasing to >99.5% for 2.0× (transversions only, minor allele frequency [MAF] ≥ 2.5%). The recovery of SNPs correlates with coverage; on average, 58% of sites are recovered for 0.25× increasing to 87% for 2.0×, utilizing an average of 3.5 Mn transversions (MAF ≥ 2.5%), even in the aurochs, despite the greatest temporal distance from the modern reference panel. Our imputed genomes behave similarly to directly called data in allele frequency-based analyses, for example consistently identifying runs of homozygosity >2 Mb, including a long homozygous region in the Mesolithic European aurochs.


Subject(s)
Genome , Polymorphism, Single Nucleotide , Animals , Cattle/genetics , DNA, Ancient/analysis , Haplotypes , Genotype , Genomics/methods
11.
Am J Hum Genet ; 109(11): 1986-1997, 2022 11 03.
Article in English | MEDLINE | ID: mdl-36198314

ABSTRACT

Whole-genome sequencing (WGS) is the gold standard for fully characterizing genetic variation but is still prohibitively expensive for large samples. To reduce costs, many studies sequence only a subset of individuals or genomic regions, and genotype imputation is used to infer genotypes for the remaining individuals or regions without sequencing data. However, not all variants can be well imputed, and the current state-of-the-art imputation quality metric, denoted as standard Rsq, is poorly calibrated for lower-frequency variants. Here, we propose MagicalRsq, a machine-learning-based method that integrates variant-level imputation and population genetics statistics, to provide a better calibrated imputation quality metric. Leveraging WGS data from the Cystic Fibrosis Genome Project (CFGP), and whole-exome sequence data from UK BioBank (UKB), we performed comprehensive experiments to evaluate the performance of MagicalRsq compared to standard Rsq for partially sequenced studies. We found that MagicalRsq aligns better with true R² than standard Rsq in almost every situation evaluated, for both European and African ancestry samples. For example, when applying models trained from 1,992 CFGP sequenced samples to an independent 3,103 samples with no sequencing but TOPMed imputation from array genotypes, MagicalRsq, compared to standard Rsq, achieved net gains of 1.4 million rare, 117k low-frequency, and 18k common variants, where the net gain is the number of additional variants correctly distinguished by MagicalRsq relative to standard Rsq. MagicalRsq can serve as an improved post-imputation quality metric and will benefit downstream analysis by better distinguishing well-imputed variants from those poorly imputed. MagicalRsq is freely available on GitHub.
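
The benchmark that this abstract calls true R2 is conventionally computed, for one variant, as the squared Pearson correlation between imputed dosages and genotypes observed by sequencing in the same individuals. The sketch below shows that calculation under stated assumptions: the function name `true_r2`, the toy dosages, and the 0.8 cutoff are illustrative and are not taken from the MagicalRsq software.

```python
import numpy as np

def true_r2(dosage, genotype):
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    dosage = np.asarray(dosage, dtype=float)
    genotype = np.asarray(genotype, dtype=float)
    # The correlation is undefined when either vector has no variation
    # (for example, a monomorphic variant in this sample).
    if dosage.std() == 0 or genotype.std() == 0:
        return float("nan")
    r = np.corrcoef(dosage, genotype)[0, 1]
    return float(r**2)

# Toy variant: sequenced genotypes (0, 1, or 2 alternate alleles) and dosages.
genotype = [0, 1, 2, 0, 1, 2]
dosage = [0.1, 0.9, 1.8, 0.3, 1.2, 1.6]
r2 = true_r2(dosage, genotype)
THRESHOLD = 0.8  # hypothetical QC cutoff for calling a variant well imputed
print(r2, r2 > THRESHOLD)
```

A software-reported quality estimate such as Rsq is useful precisely because this quantity cannot be computed when no sequenced genotypes exist for the imputed samples.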


Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Humans , Genome-Wide Association Study/methods , Polymorphism, Single Nucleotide/genetics , Calibration , Genotype , Machine Learning
12.
Am J Hum Genet ; 109(6): 1007-1015, 2022 06 02.
Article in English | MEDLINE | ID: mdl-35508176

ABSTRACT

Genotype imputation is an integral tool in genome-wide association studies, in which it facilitates meta-analysis, increases power, and enables fine-mapping. With the increasing availability of whole-genome-sequence datasets, investigators have access to a multitude of reference-panel choices for genotype imputation. In principle, combining all sequenced whole genomes into a single large panel would provide the best imputation performance, but this is often cumbersome or impossible due to privacy restrictions. Here, we describe meta-imputation, a method that allows imputation results generated using different reference panels to be combined into a consensus imputed dataset. Our meta-imputation method requires small changes to the output of existing imputation tools to produce necessary inputs, which are then combined using dynamically estimated weights that are tailored to each individual and genome segment. In the scenarios we examined, the method consistently outperforms imputation using a single reference panel and achieves accuracy comparable to imputation using a combined reference panel.


Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Genome , Genome-Wide Association Study/methods , Genotype , Humans , Polymorphism, Single Nucleotide/genetics , Research Design
13.
Am J Hum Genet ; 109(9): 1620-1637, 2022 09 01.
Article in English | MEDLINE | ID: mdl-36055211

ABSTRACT

Genetically informed drug development and repurposing is an attractive prospect for improving patient outcomes in psychiatry; however, the effectiveness of these endeavors is confounded by heterogeneity. We propose an approach that links interventions implicated by disorder-associated genetic risk, at the population level, to a framework that can target these compounds to individuals. Specifically, results from genome-wide association studies are integrated with expression data to prioritize individual "directional anchor" genes for which the predicted risk-increasing direction of expression could be counteracted by an existing drug. While these compounds represent plausible therapeutic candidates, they are not likely to be equally efficacious for all individuals. To account for this heterogeneity, we constructed polygenic scores restricted to variants annotated to the network of genes that interact with each directional anchor gene. These metrics, which we call a pharmagenic enrichment score (PES), identify individuals with a higher burden of genetic risk, localized in biological processes related to the candidate drug target, to inform precision drug repurposing. We used this approach to investigate schizophrenia and bipolar disorder and reveal several compounds targeting specific directional anchor genes that could be plausibly repurposed. These genetic risk scores, mapped to the networks associated with target genes, revealed biological insights that cannot be observed in an undifferentiated genome-wide polygenic risk score (PRS). For example, we observed an enrichment of these partitioned scores in schizophrenia cases with otherwise low PRS. In summary, genetic risk could be used more specifically to direct drug repurposing candidates that target particular genes implicated in psychiatric and other complex disorders.


Subject(s)
Bipolar Disorder , Schizophrenia , Bipolar Disorder/drug therapy , Bipolar Disorder/genetics , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Multifactorial Inheritance/genetics , Risk Factors , Schizophrenia/drug therapy , Schizophrenia/genetics
14.
Am J Hum Genet ; 109(9): 1653-1666, 2022 09 01.
Article in English | MEDLINE | ID: mdl-35981533

ABSTRACT

Understanding the genetic basis of human diseases and traits is dependent on the identification and accurate genotyping of genetic variants. Deep whole-genome sequencing (WGS), the gold standard technology for SNP and indel identification and genotyping, remains very expensive for most large studies. Here, we quantify the extent to which array genotyping followed by genotype imputation can approximate WGS in studies of individuals of African, Hispanic/Latino, and European ancestry in the US and of Finnish ancestry in Finland (a population isolate). For each study, we performed genotype imputation by using the genetic variants present on the Illumina Core, OmniExpress, MEGA, and Omni 2.5M arrays with the 1000G, HRC, and TOPMed imputation reference panels. Using the Omni 2.5M array and the TOPMed panel, ≥90% of bi-allelic single-nucleotide variants (SNVs) are well imputed (r² > 0.8) down to minor-allele frequencies (MAFs) of 0.14% in African, 0.11% in Hispanic/Latino, 0.35% in European, and 0.85% in Finnish ancestries. There was little difference in TOPMed-based imputation quality among the arrays with >700k variants. Individual-level imputation quality varied widely between and within the three US studies. Imputation quality also varied across genomic regions, producing regions where even common (MAF > 5%) variants were consistently not well imputed across ancestries. The extent to which array genotyping and imputation can approximate WGS therefore depends on reference panel, genotype array, sample ancestry, and genomic location. Imputation quality by variant or genomic region can be queried with our new tool, RsqBrowser, now deployed on the Michigan Imputation Server.


Subject(s)
High-Throughput Nucleotide Sequencing , Polymorphism, Single Nucleotide , Gene Frequency/genetics , Genome-Wide Association Study , Genotype , Humans , Polymorphism, Single Nucleotide/genetics , Whole Genome Sequencing
15.
Am J Hum Genet ; 109(2): 299-310, 2022 02 03.
Article in English | MEDLINE | ID: mdl-35090584

ABSTRACT

Spontaneous clearance of acute hepatitis C virus (HCV) infection is associated with single nucleotide polymorphisms (SNPs) on the MHC class II. We fine-mapped the MHC region in European (n = 1,600; 594 HCV clearance/1,006 HCV persistence) and African (n = 1,869; 340 HCV clearance/1,529 HCV persistence) ancestry individuals and evaluated HCV peptide binding affinity of classical alleles. In both populations, HLA-DQβ1 Leu26 (meta-analysis p value = 1.24 × 10⁻¹⁴) located in pocket 4 was negatively associated with HCV spontaneous clearance and HLA-DQβ1 Pro55 (meta-analysis p value = 8.23 × 10⁻¹¹) located in the peptide binding region was positively associated, independently of HLA-DQβ1 Leu26. These two amino acids are not in linkage disequilibrium (r² < 0.1) and explain the SNPs and classical allele associations represented by rs2647011, rs9274711, HLA-DQB1*03:01, and HLA-DRB1*01:01. Additionally, HCV persistence classical alleles tagged by HLA-DQβ1 Leu26 had fewer HCV binding epitopes and lower predicted binding affinities compared to clearance alleles (geometric mean of combined IC50 nM of persistence versus clearance; 2,321 nM versus 761.7 nM, p value = 1.35 × 10⁻³⁸). In summary, MHC class II fine-mapping revealed key amino acids in HLA-DQβ1 explaining allelic and SNP associations with HCV outcomes. This mechanistic advance in understanding of natural recovery and immunogenetics of HCV might set the stage for much needed enhancement and design of vaccine to promote spontaneous clearance of HCV infection.


Subject(s)
HLA-DQ beta-Chains/genetics , Hepacivirus/pathogenicity , Hepatitis C/genetics , Host-Pathogen Interactions/genetics , Polymorphism, Single Nucleotide , Acute Disease , Alleles , Amino Acid Substitution , Black People , Female , Gene Expression , Genome-Wide Association Study , Genotype , HLA-DQ beta-Chains/immunology , Hepacivirus/growth & development , Hepacivirus/immunology , Hepatitis C/ethnology , Hepatitis C/immunology , Hepatitis C/virology , Host-Pathogen Interactions/immunology , Humans , Leucine/immunology , Leucine/metabolism , Male , Proline/immunology , Proline/metabolism , Protein Isoforms/genetics , Protein Isoforms/immunology , Remission, Spontaneous , White People
16.
Biostatistics ; 2024 Jun 07.
Article in English | MEDLINE | ID: mdl-38850151

ABSTRACT

DNA methylation is an important epigenetic mark that modulates gene expression through the inhibition of transcriptional proteins binding to DNA. As in many other omics experiments, the issue of missing values is an important one, and appropriate imputation techniques are important in avoiding an unnecessary sample size reduction and in optimally leveraging the information collected. We consider the case where relatively few samples are processed via an expensive high-density whole genome bisulfite sequencing (WGBS) strategy and a larger number of samples is processed using more affordable low-density, array-based technologies. In such cases, one can impute the low-coverage (array-based) methylation data using the high-density information provided by the WGBS samples. In this paper, we propose an efficient Linear Model of Coregionalisation with informative Covariates (LMCC) to predict missing values based on observed values and covariates. Our model assumes that at each site, the methylation vector of all samples is linked to the set of fixed factors (covariates) and a set of latent factors. Furthermore, we exploit the functional nature of the data and the spatial correlation across sites by assuming some Gaussian processes on the fixed and latent coefficient vectors, respectively. Our simulations show that the use of covariates can significantly improve the accuracy of imputed values, especially in cases where missing data contain some relevant information about the explanatory variable. We also showed that our proposed model is particularly efficient when the number of columns is much greater than the number of rows, which is usually the case in methylation data analysis. Finally, we apply and compare our proposed method with alternative approaches on two real methylation datasets, showing how covariates such as cell type, tissue type or age can enhance the accuracy of imputed values.

17.
Biostatistics ; 25(2): 306-322, 2024 Apr 15.
Article in English | MEDLINE | ID: mdl-37230469

ABSTRACT

Measurement error is common in environmental epidemiologic studies, but methods for correcting measurement error in regression models with multiple environmental exposures as covariates have not been well investigated. We consider a multiple imputation approach, combining external or internal calibration samples that contain information on both true and error-prone exposures with the main study data of multiple exposures measured with error. We propose a constrained chained equations multiple imputation (CEMI) algorithm that places constraints on the imputation model parameters in the chained equations imputation based on the assumptions of strong nondifferential measurement error. We also extend the constrained CEMI method to accommodate nondetects in the error-prone exposures in the main study data. We estimate the variance of the regression coefficients using the bootstrap with two imputations of each bootstrapped sample. The constrained CEMI method is shown by simulations to outperform existing methods, namely the method that ignores measurement error, classical calibration, and regression prediction, yielding estimated regression coefficients with smaller bias and confidence intervals with coverage close to the nominal level. We apply the proposed method to the Neighborhood Asthma and Allergy Study to investigate the associations between the concentrations of multiple indoor allergens and the fractional exhaled nitric oxide level among asthmatic children in New York City. The constrained CEMI method can be implemented by imposing constraints on the imputation matrix using the mice and bootImpute packages in R.


Subject(s)
Algorithms , Environmental Exposure , Child , Humans , Animals , Mice , Environmental Exposure/adverse effects , Epidemiologic Studies , Calibration , Bias
18.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36715274

ABSTRACT

The advance in single-cell RNA-sequencing (scRNA-seq) sheds light on cell-specific transcriptomic studies of cell developments, complex diseases and cancers. Nevertheless, scRNA-seq techniques suffer from 'dropout' events, and imputation tools are proposed to address the sparsity. Here, rather than imputation, we propose a tool, SMURF, to extract the low-dimensional embeddings from cells and genes utilizing matrix factorization with a mixture of Poisson-Gamma divergence as objective while preserving self-consistency. SMURF exhibits feasible cell subpopulation discovery efficacy with obtained cell embeddings on replicated in silico and eight wet-lab scRNA datasets with ground truth cell types. Furthermore, SMURF can reduce the cell embedding to a 1D-oval space to recover the time course of cell cycle. SMURF can also serve as an imputation tool; the in silico data assessment shows that SMURF displays the most robust gene expression recovery power with low root mean square error and high Pearson correlation. Moreover, SMURF recovers the gene distribution for the WM989 Drop-seq data. SMURF is available at https://github.com/deepomicslab/SMURF.


Subject(s)
Single-Cell Gene Expression Analysis , Software , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Gene Expression Profiling , Cluster Analysis
19.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36653906

ABSTRACT

Spatially resolved transcriptomics technologies enable comprehensive measurement of gene expression patterns in the context of intact tissues. However, existing technologies suffer from either low resolution or shallow sequencing depth. Here, we present DIST, a deep learning-based method that imputes the gene expression profiles on unmeasured locations and enhances the gene expression for both original measured spots and imputed spots by self-supervised learning and transfer learning. We evaluate the performance of DIST for imputation, clustering, differential expression analysis and functional enrichment analysis. The results show that DIST can impute the gene expression accurately, enhance the gene expression for low-quality data, and help detect more biologically meaningful differentially expressed genes and pathways, and therefore allows for deeper insights into the biological processes.


Subject(s)
Deep Learning , Transcriptome , Gene Expression Profiling/methods , Cluster Analysis
20.
Brief Bioinform ; 25(1)2023 11 22.
Article in English | MEDLINE | ID: mdl-38084919

ABSTRACT

Single-cell ATAC-seq (scATAC-seq) is a recently developed approach that provides means to investigate open chromatin at the single-cell level, to assess epigenetic regulation and transcription factor binding landscapes. The sparsity of the scATAC-seq data calls for imputation. Similarly, preprocessing (filtering) may be required to reduce computational load due to the large number of open regions. However, optimal strategies for both imputation and preprocessing have not yet been evaluated together. We present SAPIEnS (scATAC-seq Preprocessing and Imputation Evaluation System), a benchmark for scATAC-seq imputation frameworks, a combination of state-of-the-art imputation methods with commonly used preprocessing techniques. We assess different types of scATAC-seq analysis, i.e. clustering, visualization and digital genomic footprinting, and attain optimal preprocessing-imputation strategies. We discuss the benefits of the imputation framework depending on the task and the number of the dataset features (peaks). We conclude that the preprocessing with the Boruta method is beneficial for the majority of tasks, while imputation is helpful mostly for small datasets. We also implement a SAPIEnS database with pre-computed transcription factor footprints based on imputed data with their activity scores in a specific cell type. SAPIEnS is published at: https://github.com/lab-medvedeva/SAPIEnS. SAPIEnS database is available at: https://sapiensdb.com.


Subject(s)
Epigenesis, Genetic , Genomics , Genomics/methods , Transcription Factors/genetics , Transcription Factors/metabolism , Gene Expression Regulation , Cluster Analysis