Search | BVS Search Portal

IGD: high-performance search for large-scale genomic interval datasets.

Feng, Jianglin; Sheffield, Nathan C.

Bioinformatics ; 37(1): 118-120, 2021 Apr 09.

Article in English | MEDLINE | ID: mdl-33367484

ABSTRACT

SUMMARY: Databases of large-scale genome projects now contain thousands of genomic interval datasets. These data are a critical resource for understanding the function of DNA. However, our ability to examine and integrate interval data of this scale is limited. Here, we introduce the integrated genome database (IGD), a method and tool for searching genome interval datasets more than three orders of magnitude faster than existing approaches, while using only one hundredth of the memory. IGD uses a novel linear binning method that allows us to scale analysis to billions of genomic regions. AVAILABILITY AND IMPLEMENTATION: https://github.com/databio/IGD. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
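The abstract does not spell out the linear binning scheme, so the following is only a minimal sketch of the general idea under stated assumptions: the coordinate axis is cut into fixed-width bins, each interval is stored in every bin it touches, and a query inspects only the bins it spans. The bin width, the half-open coordinate convention, and the function names `build_bins` and `query_bins` are illustrative assumptions, not the IGD implementation or its on-disk format.

```python
from collections import defaultdict

BIN_SIZE = 16_384  # hypothetical bin width in base pairs


def build_bins(intervals, bin_size=BIN_SIZE):
    """Store each half-open (start, end) interval in every bin it touches."""
    bins = defaultdict(list)
    for start, end in intervals:
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            bins[b].append((start, end))
    return bins


def query_bins(bins, q_start, q_end, bin_size=BIN_SIZE):
    """Return the stored intervals overlapping the half-open query [q_start, q_end)."""
    hits = []
    first = q_start // bin_size
    last = (q_end - 1) // bin_size
    for b in range(first, last + 1):
        bin_start = b * bin_size
        for start, end in bins.get(b, ()):
            if start < q_end and end > q_start:
                # An interval spanning several bins is stored in each of them;
                # report it only in the first queried bin that contains it.
                if b == first or start >= bin_start:
                    hits.append((start, end))
    return hits


intervals = [(100, 200), (16_000, 17_000), (40_000, 50_000), (10, 20)]
bins = build_bins(intervals)
print(query_bins(bins, 150, 16_500))
```

Because only the bins spanned by the query are visited, the cost of a lookup depends on the local interval density rather than on the total number of stored intervals, which is the property that makes binning attractive at large scale.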

Augmented Interval List: a novel data structure for efficient genomic interval search.

Feng, Jianglin; Ratan, Aakrosh; Sheffield, Nathan C.

Bioinformatics ; 35(23): 4907-4911, 2019 Dec 01.

Article in English | MEDLINE | ID: mdl-31150060

ABSTRACT

MOTIVATION: Genomic data is frequently stored as segments or intervals. Because this data type is so common, interval-based comparisons are fundamental to genomic analysis. As the volume of available genomic data grows, developing efficient and scalable methods for searching interval data is necessary. RESULTS: We present a new data structure, the Augmented Interval List (AIList), to enumerate intersections between a query interval q and an interval set R. An AIList is constructed by first sorting R as a list by the interval start coordinate, then decomposing it into a few approximately flattened components (sublists), and then augmenting each sublist with the running maximum interval end. The query time for AIList is O(log₂N + n + m), where n is the number of overlaps between R and q, N is the number of intervals in the set R and m is the average number of extra comparisons required to find the n overlaps. Tested on real genomic interval datasets, AIList code runs 5-18 times faster than standard high-performance code based on augmented interval-trees, nested containment lists or R-trees (BEDTools). For large datasets, the memory-usage for AIList is 4-60% of other methods. The AIList data structure, therefore, provides a significantly improved fundamental operation for highly scalable genomic data analysis. AVAILABILITY AND IMPLEMENTATION: An implementation of the AIList data structure with both construction and search algorithms is available at http://ailist.databio.org. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subjects

Genomics, Software, Algorithms, Genome
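To make the construction in the AIList abstract above concrete, here is a simplified sketch that keeps a single sorted list augmented with the running maximum end and deliberately omits the decomposition into sublists. It is not the published implementation; the function names `build` and `query` and the half-open coordinate convention are assumptions made for illustration.

```python
from bisect import bisect_left


def build(intervals):
    """Sort half-open (start, end) intervals by start and attach the running maximum end."""
    ordered = sorted(intervals)
    starts = [start for start, _ in ordered]
    max_ends = []
    running = float("-inf")
    for _, end in ordered:
        running = max(running, end)
        max_ends.append(running)
    return ordered, starts, max_ends


def query(index, q_start, q_end):
    """Return the intervals overlapping the half-open query [q_start, q_end)."""
    ordered, starts, max_ends = index
    hits = []
    # Binary search: everything from this position onward starts at or after q_end.
    i = bisect_left(starts, q_end) - 1
    # Walk left; once the running maximum end is <= q_start, nothing earlier can overlap.
    while i >= 0 and max_ends[i] > q_start:
        start, end = ordered[i]
        if end > q_start:
            hits.append((start, end))
        i -= 1
    return hits


index = build([(1, 5), (2, 100), (6, 8), (10, 12), (50, 60)])
print(query(index, 7, 11))
```

The binary search accounts for the logarithmic term in the stated query time and the backward scan for the rest. A long interval such as (2, 100) keeps the running maximum high and can force the scan past intervals that do not overlap; the decomposition into approximately flattened sublists described in the abstract is aimed at limiting those extra comparisons.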

Visualization and Quantification of the Association Between Breast Cancer and Cholesterol in the All of Us Research Program.

Feng, Jianglin; Symonds, Esteban Astiazaran; Karnes, Jason H.

Cancer Inform ; 22: 11769351221144132, 2023.

Article in English | MEDLINE | ID: mdl-36654923

ABSTRACT

Epidemiologic evidence for the association of cholesterol and breast cancer is inconsistent. Several factors may contribute to this inconsistency, including limited sample sizes, confounding effects of antihyperlipidemic treatment, age, and body mass index, and the assumption that the association follows a simple linear function. Here, we aimed to address these factors by combining visualization and quantification in a large-scale contemporary electronic health record database (the All of Us Research Program). We find clear visual and quantitative evidence that breast cancer is strongly, positively, and near-linearly associated with total cholesterol and low-density lipoprotein cholesterol, but not associated with triglycerides. The association of breast cancer with high-density lipoprotein cholesterol was non-linear and age dependent. Standardized odds ratios were 2.12 (95% confidence interval 1.9-2.48), P = 5.6 × 10⁻³¹ for total cholesterol; 1.99 (1.75-2.26), P = 2.6 × 10⁻²⁶ for low-density lipoprotein cholesterol; 1.69 (1.3-2.2), P = 9.0 × 10⁻⁵ for high-density lipoprotein cholesterol at age < 56; and 0.65 (0.55-0.78), P = 1.2 × 10⁻⁶ for high-density lipoprotein cholesterol at age ≥ 56. Including lipid levels measured after antihyperlipidemic treatment in the analysis results in erroneous associations. We demonstrate that using logistic regression without inspecting risk variable linearity and accounting for confounding effects may lead to inconsistent results.

Racial, ethnic, and gender differences in obesity and body fat distribution: An All of Us Research Program demonstration project.

Karnes, Jason H; Arora, Amit; Feng, Jianglin; Steiner, Heidi E; Sulieman, Lina; Boerwinkle, Eric; Clark, Cheryl; Cicek, Mine; Cohn, Elizabeth; Gebo, Kelly; Loperena-Cortes, Roxana; Ohno-Machado, Lucila; Mayo, Kelsey; Mockrin, Steve; Ramirez, Andrea; Schully, Sheri; Klimentidis, Yann C.

PLoS One ; 16(8): e0255583, 2021.

Article in English | MEDLINE | ID: mdl-34358277

ABSTRACT

Differences in obesity and body fat distribution across gender and race/ethnicity have been extensively described. We sought to replicate these differences and evaluate newly emerging data from the All of Us Research Program (AoU). We compared body mass index (BMI), waist circumference, and waist-to-hip ratio from the baseline physical examination, and alanine aminotransferase (ALT) from the electronic health record in up to 88,195 Non-Hispanic White (NHW), 40,770 Non-Hispanic Black (NHB), 35,640 Hispanic, and 5,648 Asian participants. We compared AoU sociodemographic variable distribution to National Health and Nutrition Examination Survey (NHANES) data and applied the pseudo-weighting method for adjusting selection biases of AoU recruitment. Our findings replicate previous observations with respect to gender differences in BMI. In particular, we replicate the large gender disparity in obesity rates among NHB participants, in which obesity and mean BMI are much higher in NHB women than NHB men (33.34 kg/m² versus 28.40 kg/m² respectively; p < 2.22 × 10⁻³⁰⁸). The age-adjusted obesity prevalence in AoU participants is similar overall to the prevalence found in NHANES, but lower for NHW participants. ALT was higher in men than women, and lower among NHB participants compared to other racial/ethnic groups, consistent with previous findings. Our data suggest consistency of AoU with national averages related to obesity and suggest this resource is likely to be a major source of scientific inquiry and discovery in diverse populations.

Subjects

Body Fat Distribution, Body Mass Index, Ethnicity/statistics & numerical data, Obesity/physiopathology, Patient Care Planning/organization & administration, Racial Groups/statistics & numerical data, Adolescent, Adult, Aged, Female, Humans, Male, Middle Aged, Nutrition Surveys, Obesity/epidemiology, Sex Factors, United States/epidemiology, Waist Circumference, Young Adult

Machine Learning for Prediction of Stable Warfarin Dose in US Latinos and Latin Americans.

Steiner, Heidi E; Giles, Jason B; Patterson, Hayley Knight; Feng, Jianglin; El Rouby, Nihal; Claudio, Karla; Marcatto, Leiliane Rodrigues; Tavares, Leticia Camargo; Galvez, Jubby Marcela; Calderon-Ospina, Carlos-Alberto; Sun, Xiaoxiao; Hutz, Mara H; Scott, Stuart A; Cavallari, Larisa H; Fonseca-Mendoza, Dora Janeth; Duconge, Jorge; Botton, Mariana Rodrigues; Santos, Paulo Caleb Junior Lima; Karnes, Jason H.

Front Pharmacol ; 12: 749786, 2021.

Article in English | MEDLINE | ID: mdl-34776967

ABSTRACT

Populations used to create warfarin dose prediction algorithms largely lacked participants reporting Hispanic or Latino ethnicity. While previous research suggests nonlinear modeling improves warfarin dose prediction, this research has mainly focused on populations with primarily European ancestry. We compare the accuracy of stable warfarin dose prediction using linear and nonlinear machine learning models in a large cohort enriched for US Latinos and Latin Americans (ULLA). Each model was tested using the same variables as published by the International Warfarin Pharmacogenetics Consortium (IWPC) and using an expanded set of variables including ethnicity and warfarin indication. We utilized a multiple linear regression model and three nonlinear regression models: Bayesian Additive Regression Trees, Multivariate Adaptive Regression Splines, and Support Vector Regression. We compared each model's ability to predict stable warfarin dose within 20% of actual stable dose, confirming trained models in a 30% testing dataset with 100 rounds of resampling. In all patients (n = 7,030), inclusion of additional predictor variables led to a small but significant improvement in prediction of dose relative to the IWPC algorithm (47.8 versus 46.7% in IWPC, p = 1.43 × 10⁻¹⁵). Nonlinear models using IWPC variables did not significantly improve prediction of dose over the linear IWPC algorithm. In ULLA patients alone (n = 1,734), IWPC performed similarly to all other linear and nonlinear pharmacogenetic algorithms. Our results reinforce the validity of IWPC in a large, ethnically diverse population and suggest that additional variables that capture warfarin dose variability may improve warfarin dose prediction algorithms.
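The accuracy criterion named in the abstract above, the share of patients whose predicted dose falls within 20% of the actual stable dose, can be sketched as follows. The function name `fraction_within` and the dose values are made up for illustration and are not study data; the study's resampling procedure is not reproduced here.

```python
def fraction_within(actual, predicted, tolerance=0.20):
    """Fraction of predictions within `tolerance` (as a proportion) of the actual dose."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one dose is required")
    close = sum(
        1 for a, p in zip(actual, predicted) if abs(p - a) <= tolerance * a
    )
    return close / len(actual)


# Toy doses (arbitrary units), invented purely to exercise the function.
actual_doses = [5.0, 4.0, 10.0, 2.0]
predicted_doses = [5.5, 5.0, 8.5, 2.1]
print(fraction_within(actual_doses, predicted_doses))
```

A threshold metric of this kind rewards predictions that are close enough to be clinically usable and ignores how far off the remaining predictions are, which is why it is often reported alongside error measures such as mean absolute error.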

Seqpare: a novel metric of similarity between genomic interval sets.

Feng, Selena C; Sheffield, Nathan C; Feng, Jianglin.

F1000Res ; 9: 581, 2020.

Article in English | MEDLINE | ID: mdl-33500773

ABSTRACT

Searching genomic interval sets produced by sequencing methods has been widely and routinely performed; however, existing metrics for quantifying similarities among interval sets are inconsistent. Here we introduce Seqpare, a self-consistent and effective metric of similarity and tool for comparing sequences based on their interval sets. With this metric, the similarity of two interval sets is quantified by a single index, the ratio of their effective overlap over the union: an index of zero indicates unrelated interval sets, and an index of one means that the interval sets are identical. Analysis and tests confirm the effectiveness and self-consistency of the Seqpare metric.

Subjects

Algorithms, Genomics, Sequence Analysis/methods
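The Seqpare abstract above does not define how "effective overlap" is computed, so the sketch below uses a simpler stand-in: a base-pair Jaccard index, the covered length shared by two interval sets divided by the length covered by either. It has the same end points (zero for disjoint sets, one for identical sets) but should not be read as the Seqpare metric itself. The function names and the half-open coordinate convention are assumptions.

```python
def merge(intervals):
    """Merge half-open (start, end) intervals into sorted, non-overlapping blocks."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def covered(blocks):
    """Total length covered by non-overlapping blocks."""
    return sum(end - start for start, end in blocks)


def jaccard_index(a, b):
    """Shared covered length divided by the length covered by either set."""
    union = covered(merge(list(a) + list(b)))
    if union == 0:
        return 0.0
    # Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|.
    overlap = covered(merge(a)) + covered(merge(b)) - union
    return overlap / union


print(jaccard_index([(0, 10)], [(5, 15)]))
```

A single index bounded between zero and one makes pairwise comparisons across many interval sets directly comparable, which is the design goal the abstract describes.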

A novel iterative solution to the phase problem.

Feng, Jianglin.

Acta Crystallogr A ; 68(Pt 2): 298-300, 2012 Mar.

Article in English | MEDLINE | ID: mdl-22338665

ABSTRACT

A new Fourier cycling phasing method is proposed based on the mathematical principle of global minimization. In reciprocal space, the Fourier coefficient is a mixed form of the normalized structure factors, (2E_o² - E_c²)E_c, while in direct space the Fourier map is modified with a peak-picking procedure. This method does not use any preliminary information and does not rely on any critical parameter; it can start with either randomly assigned phases or fixed phases (all zeros). This method performs significantly better than the commonly used forms of Fourier cycling.

Deconvolution of the interatomic vector set using a convolution table.

Feng, Jianglin.

Acta Crystallogr A ; 65(Pt 1): 48-50, 2009 Jan.

Article in English | MEDLINE | ID: mdl-19092177

ABSTRACT

The deconvolution of the interatomic vector set (the ideal Patterson function) with the superposition technique is not complete because of the vector overlaps: multiple images and false peaks usually exist in the superposition map. Here, a new method for the deconvolution of the interatomic vector set is presented. This method involves constructing a table termed the 'convolution table' from vectors in a superposition map and then sorting the table so that vectors belonging to different images are separated, and thus the overlaps are naturally resolved. This method does not use the symmetry information.

Automated electron tomography with scanning transmission electron microscopy.

Feng, Jianglin; Somlyo, Andrew P; Somlyo, Avril V; Shao, Zhifeng.

J Microsc ; 228(Pt 3): 406-12, 2007 Dec.

Article in English | MEDLINE | ID: mdl-18045335

ABSTRACT

We report the successful implementation of a fully automated tomographic data collection system in scanning transmission electron microscopy (STEM) mode. Autotracking is carried out by combining mechanical and electronic corrections for specimen movement. Autofocusing is based on contrast difference of a focus series of a small sample area. The focus gradient that exists in normal images due to specimen tilt is effectively removed by using dynamic focusing. An advantage of STEM tomography with dynamic focusing over TEM tomography is its ability to reconstruct large objects with a potentially higher resolution.
