Results 1 - 8 of 8
1.
Cell ; 184(8): 2068-2083.e11, 2021 04 15.
Article in English | MEDLINE | ID: mdl-33861964

ABSTRACT

Understanding population health disparities is an essential component of equitable precision health efforts. Epidemiology research often relies on definitions of race and ethnicity, but these population labels may not adequately capture disease burdens and environmental factors impacting specific sub-populations. Here, we propose a framework for repurposing data from electronic health records (EHRs) in concert with genomic data to explore the demographic ties that can impact disease burdens. Using data from a diverse biobank in New York City, we identified 17 communities sharing recent genetic ancestry. We observed 1,177 health outcomes that were statistically associated with a specific group and demonstrated significant differences in the segregation of genetic variants contributing to Mendelian diseases. We also demonstrated that fine-scale population structure can impact the prediction of complex disease risk within groups. This work reinforces the utility of linking genomic data to EHRs and provides a framework toward fine-scale monitoring of population health.


Subject(s)
Ethnicity/genetics , Population Health , Genetic Databases , Electronic Health Records , Genomics , Humans , Self Report
2.
Bioinformatics ; 37(19): 3372-3373, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33774671

ABSTRACT

SUMMARY: Finding informative predictive features in high-dimensional biological case-control datasets is challenging. The Extreme Pseudo-Sampling (EPS) algorithm offers a solution to the challenge of feature selection via a combination of deep learning and linear regression models. First, using a variational autoencoder, it generates complex latent representations for the samples. Second, it classifies the latent representations of cases and controls via logistic regression. Third, it generates new samples (pseudo-samples) around the extreme cases and controls in the regression model. Finally, it trains a new regression model over the upsampled space. The most significant variables in this regression are selected. We present an open-source implementation of the algorithm that is easy to set up, use and customize. Our package enhances the original algorithm by providing new features and customizability for data preparation, model training and classification functionalities. We believe the new features will enable the adoption of the algorithm for a diverse range of datasets. AVAILABILITY AND IMPLEMENTATION: The software package for Python is available online at https://github.com/roohy/eps. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
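The four steps listed in the summary can be sketched roughly as follows. This is a minimal illustration, not the code from the roohy/eps package: the variational autoencoder is replaced by an identity mapping (an assumption made only to keep the sketch dependency-free), the logistic regression is a plain gradient-descent fit, and all function names, parameters and the toy data are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def eps_rank(X, y, n_extreme=10, n_pseudo=20, noise=0.3, seed=0):
    """Rank feature indices from most to least important (hypothetical helper)."""
    rng = random.Random(seed)
    # Step 1 (stand-in): identity "encoder". The abstract describes a VAE here.
    latent = X
    # Step 2: classify cases (1) and controls (0) with logistic regression.
    w, b = fit_logistic(latent, y)
    score = lambda x: b + sum(wj * xj for wj, xj in zip(w, x))
    cases = sorted((x for x, t in zip(latent, y) if t == 1), key=score, reverse=True)
    controls = sorted((x for x, t in zip(latent, y) if t == 0), key=score)
    # Step 3: draw pseudo-samples around the most extreme cases and controls.
    pseudo_X, pseudo_y = [], []
    for label, extremes in ((1, cases[:n_extreme]), (0, controls[:n_extreme])):
        for x in extremes:
            for _ in range(n_pseudo):
                pseudo_X.append([v + rng.gauss(0.0, noise) for v in x])
                pseudo_y.append(label)
    # Step 4: refit on the upsampled space and rank features by |coefficient|.
    w2, _ = fit_logistic(pseudo_X, pseudo_y)
    return sorted(range(len(w2)), key=lambda j: abs(w2[j]), reverse=True)

# Toy data (assumed): feature 0 separates cases from controls, features 1-2 are noise.
data_rng = random.Random(42)
X = [[data_rng.gauss(3.0 if label else 0.0, 1.0), data_rng.gauss(0, 1), data_rng.gauss(0, 1)]
     for label in (1, 0) for _ in range(100)]
y = [1] * 100 + [0] * 100
ranking = eps_rank(X, y)
print(ranking)
```

With a real VAE, the pseudo-samples would be drawn in the latent space rather than directly in feature space, so this sketch only conveys the control flow of the pipeline.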

3.
Pac Symp Biocomput ; 28: 121-132, 2023.
Article in English | MEDLINE | ID: mdl-36540970

ABSTRACT

Groups of distantly related individuals who share a short segment of their genome identical-by-descent (IBD) can provide insights about rare traits and diseases in massive biobanks using IBD mapping. Clustering algorithms play an important role in finding these groups accurately and at scale. We set out to analyze the fitness of commonly used, fast and scalable clustering algorithms for IBD mapping applications. We designed a realistic benchmark for local IBD graphs and utilized it to compare the statistical power of clustering algorithms via simulating 2.3 million clusters across 850 experiments. We found Infomap and Markov Clustering (MCL) community detection methods to have high statistical power in most of the scenarios. They yield a 30% increase in power compared to the current state-of-the-art approach, with a runtime 3 orders of magnitude lower. We also found that standard clustering metrics, such as modularity, cannot predict statistical power of algorithms in IBD mapping applications. We extend our findings to real datasets by analyzing the Population Architecture using Genomics and Epidemiology (PAGE) Study dataset with 51,000 samples and 2 million shared segments on Chromosome 1, resulting in the extraction of 39 million local IBD clusters. We demonstrate the power of our approach by recovering signals of rare genetic variation in the Whole-Exome Sequence data of 200,000 individuals in the UK Biobank. We provide an efficient implementation to enable clustering at scale for IBD mapping for various populations and scenarios. Supplementary Information: The code, along with supplementary methods and figures, is available at https://github.com/roohy/localIBDClustering.
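To make the notion of a local IBD graph concrete, the sketch below builds the graph at a single genomic position and groups individuals that are linked by shared segments. It is an illustration under stated assumptions: connected components (via union-find) stand in for the Infomap and MCL methods benchmarked in the abstract, and the segment tuple layout, function name and toy coordinates are hypothetical rather than taken from the localIBDClustering repository.

```python
from collections import defaultdict

def local_ibd_clusters(segments, position):
    """Cluster individuals linked by IBD segments that cover `position`.

    segments: iterable of (id_a, id_b, start, end) tuples (assumed layout).
    Returns a list of sorted clusters, ordered by their first member.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, start, end in segments:
        if start <= position <= end:  # the segment is an edge of the local graph
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])

# Toy segments (assumed): pairs of individuals with start/end coordinates.
segments = [
    ("A", "B", 100, 500),
    ("B", "C", 300, 900),
    ("D", "E", 350, 450),
    ("A", "F", 600, 800),
]
clusters_at_400 = local_ibd_clusters(segments, 400)
clusters_at_700 = local_ibd_clusters(segments, 700)
print(clusters_at_400, clusters_at_700)
```

Connected components merge any two individuals joined by a chain of segments, so a single spurious edge can fuse unrelated groups; community detection methods such as those named in the abstract are designed to be less sensitive to that.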


Subject(s)
Algorithms , Computational Biology , Humans , Genomics , Cluster Analysis
4.
Nat Med ; 29(7): 1845-1856, 2023 07.
Article in English | MEDLINE | ID: mdl-37464048

ABSTRACT

An individual's disease risk is affected by the populations that they belong to, due to shared genetics and environmental factors. The study of fine-scale populations in clinical care is important for identifying and reducing health disparities and for developing personalized interventions. To assess patterns of clinical diagnoses and healthcare utilization by fine-scale populations, we leveraged genetic data and electronic medical records from 35,968 patients as part of the UCLA ATLAS Community Health Initiative. We defined clusters of individuals using identity by descent, a form of genetic relatedness that utilizes shared genomic segments arising due to a common ancestor. In total, we identified 376 clusters, including clusters with patients of Afro-Caribbean, Puerto Rican, Lebanese Christian, Iranian Jewish and Gujarati ancestry. Our analysis uncovered 1,218 significant associations between disease diagnoses and clusters and 124 significant associations with specialty visits. We also examined the distribution of pathogenic alleles and found 189 significant alleles at elevated frequency in particular clusters, including many that are not regularly included in population screening efforts. Overall, this work progresses the understanding of health in understudied communities and can provide the foundation for further study into health inequities.


Subject(s)
Delivery of Health Care , Patient Acceptance of Health Care , Humans , Los Angeles , Iran , Ethnicity
5.
medRxiv ; 2023 Mar 29.
Article in English | MEDLINE | ID: mdl-37034679

ABSTRACT

Peripheral artery disease (PAD) is a form of atherosclerotic cardiovascular disease, affecting ∼8 million Americans, and is known to have racial and ethnic disparities. PAD has been reported to have significantly higher prevalence in African Americans (AAs) compared to non-Hispanic European Americans (EAs). Hispanic/Latinos (HLs) have been reported to have lower or similar rates of PAD compared to EAs, despite having a paradoxically high burden of PAD risk factors; however, recent work suggests prevalence may differ between sub-groups. Here we examined a large cohort of diverse adults in the BioMe biobank in New York City (NYC). We observed the prevalence of PAD at 1.7% in EAs vs 8.5% and 9.4% in AAs and HLs, respectively; and among HL sub-groups, at 11.4% and 11.5% in Puerto Rican and Dominican populations, respectively. Follow-up analysis that adjusted for common risk factors demonstrated that Dominicans had the highest increased risk for PAD relative to EAs (OR=3.15 (95% CI 2.33-4.25), P < 6.44×10^-14). To investigate whether genetic factors may explain this increased risk, we performed admixture mapping by testing the association between local ancestry (LA) and PAD in Dominican BioMe participants (N=1,940) separately for European (EUR), African (AFR) and Native American (NAT) continental ancestry tracts. We identified a NAT ancestry tract at chromosome 2q35 that was significantly associated with PAD (OR=2.05 (95% CI 1.51-2.78), P < 4.06×10^-6) with 22.5% vs 12.5% PAD prevalence in heterozygous NAT tract carriers versus non-carriers, respectively. Fine-mapping at this locus implicated tag SNP rs78529201 located within a long intergenic non-coding RNA (lincRNA) LINC00607, a gene expression regulator of key genes related to thrombosis and extracellular remodeling of endothelial cells, suggesting a putative link of the 2q35 locus to PAD etiology.
In summary, we showed how leveraging health systems data helped understand nuances of PAD risk across HL sub-groups and admixture mapping approaches elucidated a novel risk locus in a Dominican population.
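As a rough consistency check on the adjusted odds ratio for Dominicans relative to EAs (OR=3.15, 95% CI 2.33-4.25), the sketch below recovers a standard error, z statistic and two-sided P value from the interval. It assumes a Wald interval that is symmetric on the log-odds scale, which the abstract does not state; the function name is hypothetical.

```python
import math

def wald_from_ci(odds_ratio, ci_low, ci_high, z_crit=1.959964):
    """Back out (log-scale SE, z, two-sided p) from an OR and its 95% CI.

    Assumes the interval was built as exp(log(OR) +/- z_crit * SE).
    """
    log_se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(odds_ratio) / log_se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return log_se, z, p

log_se, z, p = wald_from_ci(3.15, 2.33, 4.25)
print(f"SE(log OR) = {log_se:.3f}, z = {z:.2f}, p = {p:.2e}")
```

Under that assumption the recovered P value is of the same order of magnitude as the reported bound, which suggests the interval and the P value come from the same model fit.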

6.
Front Genet ; 14: 1181167, 2023.
Article in English | MEDLINE | ID: mdl-37600667

ABSTRACT

Peripheral artery disease (PAD) is a form of atherosclerotic cardiovascular disease, affecting ∼8 million Americans, and is known to have racial and ethnic disparities. PAD has been reported to have a significantly higher prevalence in African Americans (AAs) compared to non-Hispanic European Americans (EAs). Hispanic/Latinos (HLs) have been reported to have lower or similar rates of PAD compared to EAs, despite having a paradoxically high burden of PAD risk factors; however, recent work suggests prevalence may differ between sub-groups. Here, we examined a large cohort of diverse adults in the BioMe biobank in New York City. We observed the prevalence of PAD at 1.7% in EAs vs. 8.5% and 9.4% in AAs and HLs, respectively, and among HL sub-groups, the prevalence was found at 11.4% and 11.5% in Puerto Rican and Dominican populations, respectively. Follow-up analysis that adjusted for common risk factors demonstrated that Dominicans had the highest increased risk for PAD relative to EAs [OR = 3.15 (95% CI 2.33-4.25), p < 6.44 × 10^-14]. To investigate whether genetic factors may explain this increased risk, we performed admixture mapping by testing the association between local ancestry and PAD in Dominican BioMe participants (N = 1,813) separately for European, African, and Native American (NAT) continental ancestry tracts. The top association with PAD was an NAT ancestry tract at chromosome 2q35 [OR = 1.96 (SE = 0.16), p < 2.75 × 10^-05] with 22.6% vs. 12.9% PAD prevalence in heterozygous NAT tract carriers versus non-carriers, respectively. Fine-mapping at this locus implicated tag SNP rs78529201 located within a long intergenic non-coding RNA (lincRNA) LINC00607, a gene expression regulator of key genes related to thrombosis and extracellular remodeling of endothelial cells, suggesting a putative link of the 2q35 locus to PAD etiology. Efforts to reproduce the signal in other Hispanic cohorts were unsuccessful.
In summary, we showed how leveraging health system data helped understand nuances of PAD risk across HL sub-groups and admixture mapping approaches elucidated a putative risk locus in a Dominican population.
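The carrier and non-carrier prevalences quoted for the 2q35 tract (22.6% vs. 12.9%) imply a crude odds ratio that can be compared with the model-based estimate. The sketch below does that arithmetic; it is an unadjusted calculation from the two percentages only, not the admixture-mapping model used in the study, and the function name is hypothetical.

```python
def odds_ratio_from_prevalence(p_exposed, p_unexposed):
    """Unadjusted odds ratio from two prevalences given as fractions in (0, 1)."""
    odds_exposed = p_exposed / (1.0 - p_exposed)
    odds_unexposed = p_unexposed / (1.0 - p_unexposed)
    return odds_exposed / odds_unexposed

# Prevalence of PAD in heterozygous NAT tract carriers vs. non-carriers.
crude_or = odds_ratio_from_prevalence(0.226, 0.129)
print(round(crude_or, 2))
```

The crude value lands close to the reported OR of 1.96, which is what one would expect if covariate adjustment changes the estimate only slightly.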

7.
Nat Commun ; 12(1): 3546, 2021 06 10.
Article in English | MEDLINE | ID: mdl-34112768

ABSTRACT

The ability to identify segments of genomes identical-by-descent (IBD) is a part of standard workflows in both statistical and population genetics. However, traditional methods for finding local IBD across all pairs of individuals scale poorly, leading to a lack of adoption in very large-scale datasets. Here, we present iLASH, an algorithm based on similarity detection techniques that shows equal or improved accuracy in simulations compared to current leading methods and speeds up analysis by several orders of magnitude on genomic datasets, making IBD estimation tractable for millions of individuals. We apply iLASH to the PAGE dataset of ~52,000 multi-ethnic participants, including several founder populations with elevated IBD sharing, identifying IBD segments in ~3 minutes per chromosome compared to over 6 days for a state-of-the-art algorithm. iLASH enables efficient analysis of very large-scale datasets, as we demonstrate by computing IBD across the UK Biobank (~500,000 individuals), detecting 12.9 billion pairwise connections.
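The scaling argument above (avoiding a comparison of every pair of individuals) can be illustrated with a hashing-based sketch. This is not the iLASH implementation: it uses exact-match bucketing of fixed windows as a simplified stand-in for the similarity detection the abstract mentions, and the haplotype encoding, window size and function names are all assumptions.

```python
from collections import defaultdict
from itertools import combinations

def candidate_windows(haplotypes, window):
    """Map each pair of haplotype names to the window indices where they match.

    haplotypes: dict of name -> allele string (assumed encoding, e.g. "0101...").
    Only haplotypes that fall into the same bucket are ever paired.
    """
    length = min(len(h) for h in haplotypes.values())
    matches = defaultdict(list)
    for w, start in enumerate(range(0, length - window + 1, window)):
        buckets = defaultdict(list)
        for name, hap in haplotypes.items():
            buckets[hap[start:start + window]].append(name)
        for names in buckets.values():
            for pair in combinations(sorted(names), 2):
                matches[pair].append(w)
    return dict(matches)

def merge_runs(window_ids, min_windows):
    """Collapse consecutive window indices into (first, last) runs of sufficient length."""
    runs, start, prev = [], None, None
    for w in window_ids:
        if start is None:
            start = prev = w
        elif w == prev + 1:
            prev = w
        else:
            if prev - start + 1 >= min_windows:
                runs.append((start, prev))
            start = prev = w
    if start is not None and prev - start + 1 >= min_windows:
        runs.append((start, prev))
    return runs

# Toy haplotypes (assumed), four windows of four sites each.
haplotypes = {
    "h1": "0101" "1100" "0011" "1010",
    "h2": "0101" "1100" "0011" "0000",
    "h3": "1111" "1100" "0000" "1010",
}
matches = candidate_windows(haplotypes, window=4)
segments = {pair: merge_runs(ws, min_windows=2) for pair, ws in matches.items()}
print(segments)
```

Exact matching is intolerant of genotyping or phasing errors, so a practical method would bucket on approximate similarity instead; the sketch only shows why bucketing keeps the number of compared pairs small.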


Subject(s)
Population Genetics/methods , Genomics/methods , Algorithms , Computer Simulation , Genetic Databases , Human Genome , Haplotypes , Humans , Pedigree , Single Nucleotide Polymorphism , Quality Control , United Kingdom/epidemiology , United Kingdom/ethnology
8.
Front Genet ; 9: 297, 2018.
Article in English | MEDLINE | ID: mdl-30123241

ABSTRACT

Whole transcriptome studies typically yield large amounts of data, with expression values for all genes or transcripts of the genome. The search for genes of interest in a particular study setting can thus be a daunting task, usually relying on automated computational methods. Moreover, most biological questions imply that such a search should be performed in a multivariate setting, to take into account the relationships between genes. Differential expression analysis commonly yields large lists of genes deemed significant, even after adjustment for multiple testing, making the subsequent study possibilities extensive. Here, we explore the use of supervised learning methods to rank large ensembles of genes defined by their expression values measured with RNA-Seq in a typical two-class sample set. First, we use one of the variable importance measures generated by the random forests classification algorithm as a metric to rank genes. Second, we define the EPS (extreme pseudo-samples) pipeline, making use of VAEs (Variational Autoencoders) and regressors to extract a ranking of genes while leveraging the feature space of both virtual and comparable samples. We show that, on 12 cancer RNA-Seq data sets ranging from 323 to 1,210 samples, using either a random forests-based gene selection method or the EPS pipeline outperforms differential expression analysis for 9 and 8 out of the 12 datasets respectively, in terms of identifying subsets of genes associated with survival. These results demonstrate the potential of supervised learning-based gene selection methods in RNA-Seq studies and highlight the need to use such multivariate gene selection methods alongside the widely used differential expression analysis.
