1.
PLoS Genet ; 17(9): e1009772, 2021 09.
Article in English | MEDLINE | ID: mdl-34516545

ABSTRACT

Late-onset Alzheimer's disease (LOAD) is the most common type of dementia causing irreversible brain damage to the elderly and presents a major public health challenge. Clinical research and genome-wide association studies have suggested a potential contribution of the endocytic pathway to AD, with an emphasis on common loci. However, the contribution of rare variants in this pathway to AD has not been thoroughly investigated. In this study, we focused on the effect of rare variants on AD by first applying a rare-variant gene-set burden analysis using genes in the endocytic pathway on over 3,000 individuals with European ancestry from three large whole-genome sequencing (WGS) studies. We identified significant associations of rare-variant burden within the endocytic pathway with AD, which were successfully replicated in independent datasets. We further demonstrated that this endocytic rare-variant enrichment is associated with neurofibrillary tangles (NFTs) and age-related phenotypes, increasing the risk of more severe brain damage, earlier age-at-onset, and earlier age-of-death. Next, by aggregating rare variants within each gene, we sought to identify single endocytic genes associated with AD and NFTs. Careful examination using NFTs revealed one significantly associated gene, ANKRD13D. To identify functional associations, we integrated bulk RNA-Seq data from over 600 brain tissues and found two endocytic expression genes (eGenes), HLA-A and SLC26A7, that displayed significant influences on their gene expression. Differential expression of these three identified genes between AD patients and controls was further examined by incorporating scRNA-Seq data from 48 post-mortem brain samples, which demonstrated distinct expression patterns across cell types. Taken together, our results demonstrated a strong rare-variant effect in the endocytic pathway on AD risk and progression and a functional effect of gene expression alteration at both bulk and single-cell resolution, which may bring more insight and serve as valuable resources for future AD genetic studies, clinical research, and therapeutic targeting.


Subjects
Alzheimer Disease/pathology , Endocytosis , Phenotype , Alzheimer Disease/genetics , Genome-Wide Association Study , Humans , Polymorphism, Single Nucleotide , Whole Genome Sequencing
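
The rare-variant burden aggregation described in entry 1 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the data layout (a dict of per-person minor-allele counts), the 1% frequency cutoff, the function names, and the toy values are all assumptions made for the example, and the comparison of group means stands in for the formal association test, which the abstract does not specify.

```python
from statistics import mean


def rare_variant_burden(genotypes, allele_freq, gene_set_variants, maf_cutoff=0.01):
    """Per-person count of minor alleles at rare variants in the gene set."""
    # Keep only variants whose minor-allele frequency is below the cutoff.
    rare = [v for v in gene_set_variants if allele_freq[v] < maf_cutoff]
    return {
        person: sum(calls.get(v, 0) for v in rare)
        for person, calls in genotypes.items()
    }


def mean_burden_difference(burden, is_case):
    """Mean burden in cases minus mean burden in controls."""
    cases = [b for person, b in burden.items() if is_case[person]]
    controls = [b for person, b in burden.items() if not is_case[person]]
    return mean(cases) - mean(controls)


# Toy data, entirely made up for illustration (hypothetical variant IDs).
allele_freq = {"v1": 0.002, "v2": 0.30, "v3": 0.008}
genotypes = {
    "case1": {"v1": 1, "v2": 2, "v3": 1},
    "case2": {"v1": 0, "v2": 1, "v3": 1},
    "ctrl1": {"v1": 0, "v2": 2, "v3": 0},
    "ctrl2": {"v1": 1, "v2": 0, "v3": 0},
}
is_case = {"case1": True, "case2": True, "ctrl1": False, "ctrl2": False}

burden = rare_variant_burden(genotypes, allele_freq, ["v1", "v2", "v3"])
difference = mean_burden_difference(burden, is_case)
print(burden, difference)
```

The common variant `v2` is excluded by the cutoff, so only `v1` and `v3` contribute to each person's score.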
2.
PLoS Genet ; 16(9): e1009018, 2020 09.
Article in English | MEDLINE | ID: mdl-32925908

ABSTRACT

Reverse causality has made it difficult to establish the causal directions between obesity and prediabetes and between obesity and insulin resistance. To disentangle whether obesity causally drives prediabetes and insulin resistance already in non-diabetic individuals, we utilized the UK Biobank and METSIM cohorts to perform Mendelian randomization (MR) analyses in non-diabetic individuals. Our results suggest that both prediabetes and systemic insulin resistance are caused by obesity (p = 1.2×10⁻³ and p = 3.1×10⁻²⁴). As obesity reflects the amount of body fat, we next studied how adipose tissue affects insulin resistance. We performed both bulk RNA-sequencing and single nucleus RNA sequencing on frozen human subcutaneous adipose biopsies to assess adipose cell-type heterogeneity and mitochondrial (MT) gene expression in insulin resistance. We discovered that the adipose MT gene expression and body fat percent are both independently associated with insulin resistance (p≤0.05 for each) when adjusting for the decomposed adipose cell-type proportions. Next, we showed that these 3 factors, adipose MT gene expression, body fat percent, and adipose cell types, explain a substantial amount (44.39%) of variance in insulin resistance and can be used to predict it (p≤2.64×10⁻⁵ in 3 independent human cohorts). In summary, we demonstrated that obesity is a strong determinant of both prediabetes and insulin resistance, and discovered that individuals' adipose cell-type composition, adipose MT gene expression, and body fat percent predict their insulin resistance, emphasizing the critical role of adipose tissue in systemic insulin resistance.


Subjects
Adipose Tissue/metabolism , Insulin Resistance/physiology , Obesity/genetics , Adipocytes/metabolism , Adiposity , Adult , Body Mass Index , Cohort Studies , Diabetes Mellitus, Type 2/metabolism , Female , Humans , Insulin Resistance/genetics , Male , Middle Aged , Obesity/physiopathology , Prediabetic State/metabolism , Prediabetic State/physiopathology , Subcutaneous Fat/metabolism
3.
Bioinformatics ; 37(1): 9-16, 2021 Apr 09.
Article in English | MEDLINE | ID: mdl-33416856

ABSTRACT

MOTIVATION: Since the first human genome was sequenced in 2001, there has been a rapid growth in the number of bioinformatic methods to process and analyze next-generation sequencing (NGS) data for research and clinical studies that aim to identify genetic variants influencing diseases and traits. To achieve this goal, one first needs to call genetic variants from NGS data, which requires multiple computationally intensive analysis steps. Unfortunately, there is a lack of an open-source pipeline that can perform all these steps on NGS data in a manner that is fully automated, efficient, rapid, scalable, modular, user-friendly and fault tolerant. To address this, we introduce xGAP, an extensible Genome Analysis Pipeline, which implements modified GATK best practice to analyze DNA-seq data with the aforementioned functionalities. RESULTS: xGAP implements massive parallelization of the modified GATK best practice pipeline by splitting a genome into many smaller regions with efficient load-balancing to achieve high scalability. It can process 30× coverage whole-genome sequencing (WGS) data in ∼90 min. In terms of accuracy of discovered variants, xGAP achieves average F1 scores of 99.37% for single nucleotide variants and 99.20% for insertion/deletions across seven benchmark WGS datasets. We achieve highly consistent results across multiple on-premises (SGE & SLURM) high-performance clusters. Compared to the Churchill pipeline, with similar parallelization, xGAP is 20% faster when analyzing 50× coverage WGS on Amazon Web Service. Finally, xGAP is user-friendly and fault tolerant: it can automatically re-initiate failed processes to minimize required user intervention. AVAILABILITY AND IMPLEMENTATION: xGAP is available at https://github.com/Adigorla/xgap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
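
The region-splitting and load-balancing idea in entry 3 can be sketched roughly as below. The abstract does not describe xGAP's actual scheme, so this is an assumed stand-in: fixed-size regions, region length as a proxy for processing cost, and a greedy longest-first assignment to the least-loaded worker. The function names, chromosome names, and lengths are hypothetical.

```python
import heapq


def split_into_regions(chrom_lengths, region_size):
    """Return half-open (chrom, start, end) regions covering every chromosome."""
    regions = []
    for chrom, length in chrom_lengths.items():
        for start in range(0, length, region_size):
            regions.append((chrom, start, min(start + region_size, length)))
    return regions


def balance(regions, n_workers):
    """Greedy longest-first assignment of regions to the least-loaded worker."""
    heap = [(0, worker) for worker in range(n_workers)]  # (current load, worker id)
    heapq.heapify(heap)
    assignment = {worker: [] for worker in range(n_workers)}
    # Largest regions first, so small ones fill in the remaining gaps.
    for region in sorted(regions, key=lambda r: r[2] - r[1], reverse=True):
        load, worker = heapq.heappop(heap)
        assignment[worker].append(region)
        heapq.heappush(heap, (load + region[2] - region[1], worker))
    return assignment


# Toy genome with made-up chromosome lengths.
chrom_lengths = {"chrA": 250, "chrB": 120}
regions = split_into_regions(chrom_lengths, region_size=100)
assignment = balance(regions, n_workers=2)
loads = {
    worker: sum(end - start for _, start, end in assigned)
    for worker, assigned in assignment.items()
}
print(loads)
```

In a real pipeline the cost of a region depends on read depth and variant density rather than length alone, so a cost estimate would replace `end - start` in the sort key and the load update.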

4.
PLoS Genet ; 15(12): e1008481, 2019 12.
Article in English | MEDLINE | ID: mdl-31834882

ABSTRACT

Many disease risk loci identified in genome-wide association studies are present in non-coding regions of the genome. Previous studies have found enrichment of expression quantitative trait loci (eQTLs) in disease risk loci, indicating that identifying causal variants for gene expression is important for elucidating the genetic basis of not only gene expression but also complex traits. However, detecting causal variants is challenging due to complex genetic correlation among variants known as linkage disequilibrium (LD) and the presence of multiple causal variants within a locus. Although several fine-mapping approaches have been developed to overcome these challenges, they may produce large sets of putative causal variants when true causal variants are in high LD with many non-causal variants. In eQTL studies, there is an additional source of information that can be used to improve fine-mapping: allelic imbalance (AIM), which measures imbalance in gene expression on two chromosomes of a diploid organism. In this work, we develop a novel statistical method that leverages both AIM and total expression data to detect causal variants that regulate gene expression. We illustrate through simulations and application to 10 tissues of the Genotype-Tissue Expression (GTEx) dataset that our method identifies the true causal variants with higher specificity than an approach that uses only eQTL information. Across all tissues and genes, our method achieves a median reduction rate of 11% in the number of putative causal variants. We use chromatin state data from the Roadmap Epigenomics Consortium to show that the putative causal variants identified by our method are enriched for active regions of the genome, providing orthogonal support that our method identifies causal variants with increased specificity.


Subjects
Allelic Imbalance , Chromatin/genetics , Chromosome Mapping/methods , Quantitative Trait Loci , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Linkage Disequilibrium , Multifactorial Inheritance , Polymorphism, Single Nucleotide
5.
PLoS Comput Biol ; 15(12): e1007556, 2019 12.
Article in English | MEDLINE | ID: mdl-31851693

ABSTRACT

Next-generation sequencing (NGS) technology enables the discovery of nearly all genetic variants present in a genome. A subset of these variants, however, may have poor sequencing quality due to limitations in NGS or variant callers. In genetic studies that analyze a large number of sequenced individuals, it is critical to detect and remove those variants with poor quality as they may cause spurious findings. In this paper, we present ForestQC, a statistical tool for performing quality control on variants identified from NGS data by combining a traditional filtering approach and a machine learning approach. Our software uses information on sequencing quality, such as sequencing depth, genotyping quality, and GC content, to predict whether a particular variant is likely to be false-positive. To evaluate ForestQC, we applied it to two whole-genome sequencing datasets where one dataset consists of related individuals from families while the other consists of unrelated individuals. Results indicate that ForestQC outperforms widely used methods for performing quality control on variants such as VQSR of GATK by considerably improving the quality of variants to be included in the analysis. ForestQC is also very efficient, and hence can be applied to large sequencing datasets. We conclude that combining a machine learning algorithm trained with sequencing quality information and the filtering approach is a practical approach to perform quality control on genetic variants from sequencing data.


Subjects
Genetic Variation , High-Throughput Nucleotide Sequencing/statistics & numerical data , Software , Algorithms , Computational Biology , Databases, Genetic/statistics & numerical data , High-Throughput Nucleotide Sequencing/standards , Humans , Machine Learning , Polymorphism, Single Nucleotide , Quality Control , Whole Genome Sequencing/standards , Whole Genome Sequencing/statistics & numerical data
6.
Br J Anaesth ; 123(6): 877-886, 2019 12.
Article in English | MEDLINE | ID: mdl-31627890

ABSTRACT

BACKGROUND: Rapid, preoperative identification of patients with the highest risk for medical complications is necessary to ensure that limited infrastructure and human resources are directed towards those most likely to benefit. Existing risk scores either lack specificity at the patient level or utilise the American Society of Anesthesiologists (ASA) physical status classification, which requires a clinician to review the chart. METHODS: We report on the use of machine learning algorithms, specifically random forests, to create a fully automated score that predicts postoperative in-hospital mortality based solely on structured data available at the time of surgery. Electronic health record data from 53 097 surgical patients (2.01% mortality rate) who underwent general anaesthesia between April 1, 2013 and December 10, 2018 in a large US academic medical centre were used to extract 58 preoperative features. RESULTS: Using a random forest classifier, we found that automatically obtained preoperative features (area under the curve [AUC] of 0.932, 95% confidence interval [CI] 0.910-0.951) outperform Preoperative Score to Predict Postoperative Mortality (POSPOM) scores (AUC of 0.660, 95% CI 0.598-0.722), Charlson comorbidity scores (AUC of 0.742, 95% CI 0.658-0.812), and ASA physical status (AUC of 0.866, 95% CI 0.829-0.897). Including the ASA physical status with the preoperative features achieves an AUC of 0.936 (95% CI 0.917-0.955). CONCLUSIONS: This automated score outperforms the ASA physical status score, the Charlson comorbidity score, and the POSPOM score for predicting in-hospital mortality. Additionally, we integrate this score with a previously published postoperative score to demonstrate the extent to which patient risk changes during the perioperative period.


Subjects
Electronic Health Records/statistics & numerical data , Health Status , Hospital Mortality , Machine Learning , Postoperative Complications/diagnosis , Adolescent , Adult , Aged , Aged, 80 and over , California , Comorbidity , Female , Humans , Male , Middle Aged , Preoperative Period , Risk Assessment , Risk Factors , Young Adult
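
Entry 6 compares risk scores by area under the curve (AUC). As a reminder of what that number measures, the sketch below computes AUC directly from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as half. The function name and the toy scores are assumptions for illustration and are unrelated to the study's data.

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties = 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up risk scores: label 1 = died in hospital, 0 = survived.
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(auc(toy_scores, toy_labels))
```

The pairwise loop is quadratic in the number of patients; for cohorts of the size reported in the abstract a sort-based rank computation would be the practical choice.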
7.
Front Bioinform ; 1: 792605, 2021.
Article in English | MEDLINE | ID: mdl-36303752

ABSTRACT

Calling differential methylation at a cell-type level from tissue-level bulk data is a fundamental challenge in genomics that has recently received more attention. These studies most often aim at identifying statistical associations rather than causal effects. However, existing methods typically make an implicit assumption about the direction of effects, and thus far, little to no attention has been given to the fact that this directionality assumption may not hold and can consequently affect statistical power and control for false positives. We demonstrate that misspecification of the model directionality can lead to a drastic decrease in performance and increase in risk of spurious findings in cell-type-specific differential methylation analysis, and we discuss the need to carefully consider model directionality before choosing a statistical method for analysis.

8.
Nat Commun ; 11(1): 1971, 2020 04 24.
Article in English | MEDLINE | ID: mdl-32332754

ABSTRACT

We present Bisque, a tool for estimating cell type proportions in bulk expression. Bisque implements a regression-based approach that utilizes single-cell RNA-seq (scRNA-seq) or single-nucleus RNA-seq (snRNA-seq) data to generate a reference expression profile and learn gene-specific bulk expression transformations to robustly decompose RNA-seq data. These transformations significantly improve decomposition performance compared to existing methods when there is significant technical variation in the generation of the reference profile and observed bulk expression. Importantly, compared to existing methods, our approach is extremely efficient, making it suitable for the analysis of large genomic datasets that are becoming ubiquitous. When applied to subcutaneous adipose and dorsolateral prefrontal cortex expression datasets with both bulk RNA-seq and snRNA-seq data, Bisque replicates previously reported associations between cell type proportions and measured phenotypes across abundant and rare cell types. We further propose an additional mode of operation that merely requires a set of known marker genes.


Subjects
Computational Biology/methods , RNA-Seq/methods , Single-Cell Analysis/methods , Adipose Tissue/metabolism , Algorithms , Gene Expression Profiling/methods , Gene Expression Regulation , Genomics , Humans , Prefrontal Cortex/metabolism , RNA, Small Cytoplasmic , Software , Transcriptome
9.
Nat Commun ; 11(1): 2891, 2020 06 03.
Article in English | MEDLINE | ID: mdl-32493922

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

10.
Sci Rep ; 10(1): 11019, 2020 07 03.
Article in English | MEDLINE | ID: mdl-32620816

ABSTRACT

Single-nucleus RNA sequencing (snRNA-seq) measures gene expression in individual nuclei instead of cells, allowing for unbiased cell type characterization in solid tissues. We observe that snRNA-seq is commonly subject to contamination by high amounts of ambient RNA, which can lead to biased downstream analyses, such as identification of spurious cell types if overlooked. We present a novel approach to quantify contamination and filter droplets in snRNA-seq experiments, called Debris Identification using Expectation Maximization (DIEM). Our likelihood-based approach models the gene expression distribution of debris and cell types, which are estimated using EM. We evaluated DIEM using three snRNA-seq data sets: (1) human differentiating preadipocytes in vitro, (2) fresh mouse brain tissue, and (3) human frozen adipose tissue (AT) from six individuals. All three data sets showed evidence of extranuclear RNA contamination, and we observed that existing methods failed to account for contaminated droplets and led to spurious cell types. When compared to filtering using these state-of-the-art methods, DIEM better removed droplets containing high levels of extranuclear RNA and led to higher quality clusters. Although DIEM was designed for snRNA-seq, our clustering strategy also successfully filtered single-cell RNA-seq data. To conclude, our novel method DIEM removes debris-contaminated droplets from single-cell-based data quickly and effectively, leading to cleaner downstream analysis. Our code is freely available for use at https://github.com/marcalva/diem.


Subjects
Adipose Tissue/metabolism , Brain/metabolism , Sequence Analysis, RNA/methods , Animals , Gene Expression Profiling , Humans , Likelihood Functions , Mice , Single-Cell Analysis , Supervised Machine Learning
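
Entry 10 describes a likelihood model for debris and cell types fitted with EM. The sketch below shows the general mechanics on toy data. It assumes a plain mixture of multinomials over gene counts with a small pseudocount and a hand-chosen soft initialization; these modeling choices, the function name, and the counts are illustrative assumptions, and the actual DIEM model should be taken from the repository linked in the abstract.

```python
import math


def em_multinomial_mixture(counts, init_resp, n_iter=50, pseudo=0.01):
    """Fit a mixture of multinomials; returns (responsibilities, profiles, weights)."""
    n, n_genes, k = len(counts), len(counts[0]), len(init_resp[0])
    resp = [row[:] for row in init_resp]
    for _ in range(n_iter):
        # M-step: mixing weights and per-component gene profiles (smoothed).
        weights = [
            (sum(resp[i][c] for i in range(n)) + pseudo) / (n + k * pseudo)
            for c in range(k)
        ]
        profiles = []
        for c in range(k):
            totals = [
                pseudo + sum(resp[i][c] * counts[i][g] for i in range(n))
                for g in range(n_genes)
            ]
            norm = sum(totals)
            profiles.append([t / norm for t in totals])
        # E-step: posterior membership of each droplet, computed in log space.
        for i in range(n):
            log_post = [
                math.log(weights[c])
                + sum(counts[i][g] * math.log(profiles[c][g]) for g in range(n_genes))
                for c in range(k)
            ]
            top = max(log_post)
            unnorm = [math.exp(lp - top) for lp in log_post]
            total = sum(unnorm)
            resp[i] = [u / total for u in unnorm]
    return resp, profiles, weights


# Toy droplets x genes counts (made up): gene 0 is ambient-dominated, gene 1 is nuclear.
toy_counts = [[9, 1], [8, 2], [10, 0], [1, 9], [2, 8], [0, 10]]
# Column 0 = debris, column 1 = nucleus; only the first and last droplets get a prior guess.
toy_init = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 0.5], [0.1, 0.9]]
resp, profiles, weights = em_multinomial_mixture(toy_counts, toy_init)
debris = [i for i, r in enumerate(resp) if r[0] > 0.5]
print(debris)
```

Droplets whose posterior debris probability exceeds a chosen threshold (0.5 here, an arbitrary choice) would be filtered before clustering.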
11.
PLoS One ; 15(9): e0239474, 2020.
Article in English | MEDLINE | ID: mdl-32960917

ABSTRACT

Worldwide, testing capacity for SARS-CoV-2 is limited and bottlenecks in the scale up of polymerase chain reaction (PCR)-based testing exist. Our aim was to develop and evaluate a machine learning algorithm to diagnose COVID-19 in the inpatient setting. The algorithm was based on basic demographic and laboratory features to serve as a screening tool at hospitals where testing is scarce or unavailable. We used retrospectively collected data from the UCLA Health System in Los Angeles, California. We included all emergency room or inpatient cases receiving SARS-CoV-2 PCR testing who also had a set of ancillary laboratory features (n = 1,455) between 1 March 2020 and 24 May 2020. We tested seven machine learning models and used a combination of those models for the final diagnostic classification. In the test set (n = 392), our combined model had an area under the receiver operator curve of 0.91 (95% confidence interval 0.87-0.96). The model achieved a sensitivity of 0.93 (95% CI 0.85-0.98) and a specificity of 0.64 (95% CI 0.58-0.69). We found that our machine learning algorithm had excellent diagnostic metrics compared to SARS-CoV-2 PCR. This ensemble machine learning algorithm to diagnose COVID-19 has the potential to be used as a screening tool in hospital settings where PCR testing is scarce or unavailable.


Subjects
Betacoronavirus , Clinical Laboratory Techniques/methods , Coronavirus Infections/diagnosis , Inpatients , Machine Learning , Pneumonia, Viral/diagnosis , Adult , Aged , Area Under Curve , COVID-19 , COVID-19 Testing , Clinical Laboratory Techniques/standards , Humans , Los Angeles , Mass Screening/methods , Mass Screening/standards , Middle Aged , Pandemics , Polymerase Chain Reaction , Retrospective Studies , SARS-CoV-2
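
Entry 11 reports sensitivity and specificity for a combination of models. The abstract does not say how the seven models were combined, so the sketch below assumes the simplest option, averaging predicted probabilities and applying a 0.5 threshold, and then derives both metrics from the resulting confusion counts. The function names, the three hypothetical models, and all numbers are made up for illustration.

```python
def ensemble_average(model_probs):
    """Average the predicted probabilities of several models, patient by patient."""
    return [sum(per_patient) / len(per_patient) for per_patient in zip(*model_probs)]


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: 1 = PCR-positive, 0 = PCR-negative; three hypothetical models.
y_true = [1, 1, 1, 0, 0, 0]
model_probs = [
    [0.9, 0.6, 0.4, 0.7, 0.2, 0.1],
    [0.8, 0.7, 0.3, 0.4, 0.3, 0.2],
    [0.7, 0.5, 0.2, 0.6, 0.1, 0.3],
]
averaged = ensemble_average(model_probs)
y_pred = [1 if p >= 0.5 else 0 for p in averaged]
sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(y_pred, sensitivity, specificity)
```

For a screening tool, lowering the threshold trades specificity for sensitivity, which matches the pattern of high sensitivity and moderate specificity reported in the abstract.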
12.
Emerg Top Life Sci ; 3(4): 399-409, 2019 Aug 16.
Article in English | MEDLINE | ID: mdl-33523207

ABSTRACT

Next-generation sequencing has allowed genetic studies to collect genome sequencing data from a large number of individuals. However, raw sequencing data are not usually interpretable due to fragmentation of the genome and technical biases; therefore, analysis of these data requires many computational approaches. First, for each sequenced individual, sequencing data are aligned and further processed to account for technical biases. Then, variant calling is performed to obtain information on the positions of genetic variants and their corresponding genotypes. Quality control (QC) is applied to identify individuals and genetic variants with sequencing errors. These procedures are necessary to generate accurate variant calls from sequencing data, and many computational approaches have been developed for these tasks. This review will focus on current widely used approaches for variant calling and QC.

13.
Leibniz Int Proc Inform ; 20162016 Dec.
Article in English | MEDLINE | ID: mdl-34335990

ABSTRACT

Linear mixed models (LMMs) can be applied in the meta-analyses of responses from individuals across multiple contexts, increasing power to detect associations while accounting for confounding effects arising from within-individual variation. However, traditional approaches to fitting these models can be computationally intractable. Here, we describe an efficient and exact method for fitting a multiple-context linear mixed model. Whereas existing exact methods may be cubic in their time complexity with respect to the number of individuals, our approach for multiple-context LMMs (mcLMM) is linear. These improvements allow for large-scale analyses requiring orders of magnitude less computing time and memory than existing methods. As examples, we apply our approach to identify expression quantitative trait loci from large-scale gene expression data measured across multiple tissues as well as joint analyses of multiple phenotypes in genome-wide association studies at biobank scale.
