ABSTRACT
OBJECTIVE: To investigate explainable deep learning methods for graph neural networks that predict HIV infection from social network information, and to perform domain adaptation to evaluate model transferability across datasets. METHODS: Network data from two cohorts of younger sexual minority men (SMM) from two U.S. cities (Chicago, IL, and Houston, TX) were collected between 2014 and 2016. Feature importance for graph attention network (GAT) models was determined using GNNExplainer. Domain adaptation was performed to examine model transferability from one city dataset to the other, training on 100% of the source dataset plus 30% of the target dataset and predicting on the remaining 70% of the target dataset. RESULTS: Domain adaptation showed the ability of GAT to improve prediction over training on single-city datasets. Feature importance analysis of GAT models trained on a single city indicated similar features across cities, reinforcing the potential application of GAT models in predicting HIV infections through domain adaptation. CONCLUSION: GAT models can be used to address the data sparsity issue in HIV study populations. They are powerful tools for predicting individual risk of HIV that can be further explored for better understanding of HIV transmission.
In this study, we conducted domain adaptation between two urban areas to predict HIV status by incorporating social network data. We employed GNNExplainer to elucidate the model's predictions on each city dataset, aligning them with knowledge of HIV risk factors. Domain adaptation resulted in better model performance than individual city training and has great potential for applications in modeling other sexually transmitted infections.
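The train/test protocol described in the abstract (all of the source city plus 30% of the target city for training, the remaining 70% of the target city for evaluation) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the record format, and the random shuffling are hypothetical, and a real graph dataset would also need a policy for edges that cross the split, which is not modeled here.

```python
import random

def domain_adaptation_split(source, target, target_train_frac=0.3, seed=0):
    """Return (train, test) following a 100% source + 30% target / 70% target scheme.

    `source` and `target` are plain lists of records (hypothetical format).
    """
    rng = random.Random(seed)
    shuffled = list(target)
    rng.shuffle(shuffled)                      # random choice of target training records
    n_train = round(len(shuffled) * target_train_frac)
    train = list(source) + shuffled[:n_train]  # every source record is used for training
    test = shuffled[n_train:]                  # held-out target records only
    return train, test

# Toy records standing in for the two city cohorts.
chicago = [("chicago", i) for i in range(10)]
houston = [("houston", i) for i in range(10)]
train, test = domain_adaptation_split(chicago, houston)
```

Swapping the two arguments gives the reverse direction of transfer (Houston as source, Chicago as target).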
Subject(s)
Deep Learning; HIV Infections; Neural Networks, Computer; Humans; HIV Infections/epidemiology; Male; Adult; Sexual and Gender Minorities/statistics & numerical data; Artificial Intelligence; Young Adult; Homosexuality, Male/statistics & numerical data
ABSTRACT
Purpose: Radiation-induced lymphopenia is a common immune toxicity that adversely impacts treatment outcomes. We report here our approach to translating a deep-learning (DL) model, developed to predict severe lymphopenia risk among esophageal cancer patients, into a strategy for incorporating the immune system as an organ-at-risk (iOAR) to mitigate that risk. Materials and Methods: We conducted "virtual clinical trials" utilizing retrospective data for 10 intensity-modulated radiation therapy (IMRT) and 10 passively-scattered proton therapy (PSPT) esophageal cancer patients. For each patient, additional treatment plans of the modality other than the original were created employing standard-of-care (SOC) dose constraints. Predicted values of absolute lymphocyte count (ALC) nadir for all plans were estimated using a previously-developed DL model. The model also yielded the relative magnitudes of the contributions of iOAR dosimetric factors to ALC nadir. These were used to compute iOAR dose-volume constraints, which were then incorporated into the optimization criteria to produce "IMRT-enhanced" and "intensity-modulated proton therapy (IMPT)-enhanced" plans. Results: Model-predicted ALC nadir for the original IMRT (IMRT-SOC) and PSPT plans agreed well with actual values. IMPT-SOC showed greater immune sparing vs IMRT and PSPT. The average mean body doses were 13.10 Gy vs 7.62 Gy for IMRT-SOC vs IMPT-SOC for patients treated with IMRT-SOC; and 8.08 Gy vs 6.68 Gy for PSPT vs IMPT-SOC for patients treated with PSPT. For IMRT patients, the average predicted ALC nadir of IMRT-SOC, IMRT-enhanced, IMPT-SOC, and IMPT-enhanced was 281, 327, 351, and 392 cells/µL, respectively. For PSPT patients, the average predicted ALC nadir of PSPT, IMPT-SOC, and IMPT-enhanced was 258, 316, and 350 cells/µL, respectively. Enhanced plans achieved higher predicted ALC nadir, with an average improvement of 40.8 cells/µL (20.6%).
Conclusion: The proposed DL model-guided strategy to incorporate the immune system as iOAR in IMRT and IMPT optimization has the potential for radiation-induced lymphopenia mitigation. A prospective clinical trial is planned.
ABSTRACT
We are in a booming era of artificial intelligence, particularly with the increased availability of technologies that can help generate content, such as ChatGPT. Healthcare institutions are discussing or have started utilizing these innovative technologies within their workflow. Major electronic health record vendors have begun to leverage large language models to process and analyze vast amounts of clinical natural language text, performing a wide range of tasks in healthcare settings to help alleviate clinicians' burden. Although such technologies can be helpful in applications such as patient education, drafting responses to patient questions and emails, medical record summarization, and medical research facilitation, there are concerns about the tools' readiness for use within the healthcare domain and acceptance by the current workforce. The goal of this article is to provide nurses with an understanding of the currently available foundation models and artificial intelligence tools, enabling them to evaluate the need for such tools and assess how they can impact current clinical practice. This will help nurses efficiently assess, implement, and evaluate these tools to ensure these technologies are ethically and effectively integrated into healthcare systems, while also rigorously monitoring their performance and impact on patient care.
Subject(s)
Artificial Intelligence; Electronic Health Records; Humans; Natural Language Processing; Nursing Informatics
ABSTRACT
Runs-of-homozygosity (ROH) segments, contiguous homozygous regions in a genome, were traditionally linked to families and inbred populations. However, a growing literature suggests that ROHs are ubiquitous in outbred populations. Still, most existing genetic studies of ROH in populations are limited to aggregated ROH content across the genome, which does not offer the resolution for mapping causal loci. This limitation is mainly due to a lack of methods for the efficient identification of shared ROH diplotypes. Here, we present a new method, ROH-DICE (runs-of-homozygous diplotype cluster enumerator), to find large ROH diplotype clusters, that is, sufficiently long ROHs shared by a sufficient number of individuals, in large cohorts. ROH-DICE identified over 1 million ROH diplotypes that span over 100 single nucleotide polymorphisms (SNPs) and are shared by more than 100 UK Biobank participants. Moreover, we found significant associations of clustered ROH diplotypes across the genome with various self-reported diseases, with the strongest associations found between the extended human leukocyte antigen (HLA) region and autoimmune disorders. We found an association between a diplotype covering the homeostatic iron regulator (HFE) gene and hemochromatosis, even though the well-known causal SNP was not directly genotyped or imputed. Using a genome-wide scan, we identified a putative association between carriers of an ROH diplotype in chromosome 4 and an increase in mortality among COVID-19 patients (p-value = 1.82 × 10⁻¹¹). In summary, our ROH-DICE method, by calling out large ROH diplotypes in a large outbred population, enables further population genetics research into the demographic history of large populations. More importantly, our method enables a new genome-wide mapping approach for finding disease-causing loci with multi-marker recessive effects at a population scale.
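To make the notion of an ROH diplotype cluster concrete, the sketch below finds homozygous runs in one individual and then groups individuals that carry the same homozygous diplotype over a given window. It is a naive illustration, not the ROH-DICE algorithm: the genotype coding (0 or 2 for homozygous, 1 for heterozygous), the function names, and the thresholds are all assumptions, and the quadratic scan would not scale to biobank cohorts.

```python
from collections import defaultdict

def find_roh(genotypes, min_sites):
    """Return [start, end) site ranges where every genotype is homozygous (not 1)."""
    runs, start = [], None
    for i, g in enumerate(genotypes):
        if g != 1:
            if start is None:
                start = i                      # a new homozygous run begins
        else:
            if start is not None and i - start >= min_sites:
                runs.append((start, i))        # keep runs that are long enough
            start = None
    if start is not None and len(genotypes) - start >= min_sites:
        runs.append((start, len(genotypes)))   # run reaching the last site
    return runs

def cluster_diplotypes(panel, start, end, min_carriers):
    """Group individuals sharing an identical, fully homozygous diplotype in [start, end)."""
    clusters = defaultdict(list)
    for name, genotypes in panel.items():
        window = tuple(genotypes[start:end])
        if 1 not in window:                    # heterozygous sites break the ROH
            clusters[window].append(name)
    return {d: ids for d, ids in clusters.items() if len(ids) >= min_carriers}

runs = find_roh([0, 2, 2, 1, 0, 0, 0, 0, 1, 2], min_sites=3)
panel = {"a": [0, 2, 2, 0], "b": [0, 2, 2, 0], "c": [0, 2, 1, 0], "d": [2, 2, 2, 0]}
shared = cluster_diplotypes(panel, 0, 4, min_carriers=2)
```

In the toy panel, individuals "a" and "b" form the only cluster; "c" is excluded by its heterozygous site and "d" carries a different diplotype.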
Subject(s)
Biological Specimen Banks; COVID-19; Homozygote; Polymorphism, Single Nucleotide; Humans; United Kingdom; Polymorphism, Single Nucleotide/genetics; COVID-19/genetics; SARS-CoV-2/genetics; Genetic Predisposition to Disease; Genome-Wide Association Study; Genome, Human; UK Biobank
ABSTRACT
PURPOSE: Radiation-induced lymphopenia (RIL) is common among patients undergoing radiation therapy (RT). Severe RIL has been linked to adverse outcomes. The severity and risk of RIL can be predicted from baseline clinical characteristics and dosimetric parameters. However, dosimetric parameters, e.g. dose-volume (DV) indices, are highly correlated with one another and are only weakly associated with RIL. Here we introduce the novel concept of "composite dosimetric score" (CDS) as the index that condenses the dose distribution in immune tissues of interest to study the dosimetric dependence of RIL. We derived an improved multivariate classification scheme for risk of grade 4 RIL (G4RIL), based on this novel RT dosimetric feature, for patients receiving chemo RT for esophageal cancer. METHODS AND MATERIALS: DV indices were extracted for 734 patients who received chemo RT for biopsy-proven esophageal cancer. Nonnegative matrix factorization was used to project the DV indices of lung, heart, and spleen into a single CDS; XGBoost was employed to explore significant interactions among predictors; and logistic regression was applied to combine the resultant CDS with baseline clinical factors and interaction terms to facilitate individualized prediction of immunotoxicity. Five-fold cross-validation was applied to evaluate the model performance. RESULTS: The CDS for selected immune organs at risk (ie, heart, lung, and spleen) (OR 1.791; 95% CI [1.350, 2.377]) was a statistically significant risk determinant for G4RIL. Pearson correlation coefficients for CDS versus G4RIL risk for individual immune organs at risk were greater than those for any single DV index. Personalized prediction of G4RIL based on CDS and 4 clinical risk factors yielded an area under the curve value of 0.78. Interaction between age and CDS revealed that G4RIL risk increased more sharply with increasing CDS for patients aged ≥65 years.
CONCLUSIONS: Risk of immunotoxicity for patients undergoing chemo RT for esophageal cancer can be predicted by CDS. The CDS concept can be extended to immunotoxicity in other cancer types and in dose-response models currently based on DV indices. Personalized treatment planning should leverage composite dosimetric scoring methods rather than using individual or subsets of DV indices.
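The idea of condensing many correlated DV indices into one score per patient can be illustrated with a rank-1 nonnegative factorization. The sketch below is an assumption-laden toy, not the study's pipeline: it uses alternating least-squares updates (which stay nonnegative for nonnegative input), the function name and the data matrix are hypothetical, and the input is assumed to contain at least one positive entry.

```python
def rank1_nmf(V, iters=50):
    """Approximate a nonnegative matrix V (patients x DV indices) as outer(w, h).

    w holds one composite score per patient; h holds one loading per DV index.
    """
    n, m = len(V), len(V[0])
    h = [1.0] * m                                   # start from equal loadings
    w = [0.0] * n
    for _ in range(iters):
        hh = sum(x * x for x in h)
        w = [sum(V[i][j] * h[j] for j in range(m)) / hh for i in range(n)]
        ww = sum(x * x for x in w)
        h = [sum(V[i][j] * w[i] for i in range(n)) / ww for j in range(m)]
    return w, h

# Toy matrix: three patients, three DV indices, exactly rank 1.
V = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [3.0, 6.0, 9.0]]
scores, loadings = rank1_nmf(V)
```

The per-patient entries of `scores` play the role of a composite score; in a fuller analysis they would be passed, together with clinical covariates, to a classifier such as logistic regression.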
Subject(s)
Esophageal Neoplasms; Lymphopenia; Machine Learning; Humans; Esophageal Neoplasms/radiotherapy; Lymphopenia/etiology; Male; Female; Aged; Middle Aged; Radiotherapy Dosage; Organs at Risk/radiation effects; Radiation Injuries; Adult; Aged, 80 and over; Spleen/radiation effects; Precision Medicine; Lung/radiation effects; Logistic Models
ABSTRACT
Existing imaging genetics studies have been mostly limited in scope by using imaging-derived phenotypes defined by human experts. Here, leveraging new breakthroughs in self-supervised deep representation learning, we propose a new approach, image-based genome-wide association study (iGWAS), for identifying genetic factors associated with phenotypes discovered from medical images using contrastive learning. Using retinal fundus photos, our model extracts a 128-dimensional vector representing features of the retina as phenotypes. After training the model on 40,000 images from the EyePACS dataset, we generated phenotypes from 130,329 images of 65,629 British White participants in the UK Biobank. We conducted GWAS on these phenotypes and identified 14 loci with genome-wide significance (p < 5 × 10⁻⁸ and intersection of hits from left and right eyes). We also conducted GWAS on retina color, the average color of the center region of the retinal fundus photos. The GWAS of retina color identified 34 loci, 7 of which overlap with the GWAS of the raw image phenotypes. Our results establish the feasibility of this new framework of genomic study based on self-supervised phenotyping of medical images.
Subject(s)
Fundus Oculi; Genome-Wide Association Study; Phenotype; Retina; Humans; Genome-Wide Association Study/methods; Retina/diagnostic imaging; Male; Polymorphism, Single Nucleotide; Female; Image Processing, Computer-Assisted/methods
ABSTRACT
Understanding the genetic architecture of brain structure is challenging, partly due to difficulties in designing robust, non-biased descriptors of brain morphology. Until recently, brain measures for genome-wide association studies (GWAS) consisted of traditionally expert-defined or software-derived image-derived phenotypes (IDPs) that are often based on theoretical preconceptions or computed from limited amounts of data. Here, we present an approach to derive brain imaging phenotypes using unsupervised deep representation learning. We train a 3-D convolutional autoencoder model with reconstruction loss on 6130 UK Biobank (UKBB) participants' T1 or T2-FLAIR (T2) brain MRIs to create a 128-dimensional representation known as Unsupervised Deep learning derived Imaging Phenotypes (UDIPs). GWAS of these UDIPs in held-out UKBB subjects (n = 22,880 discovery and n = 12,359/11,265 replication cohorts for T1/T2) identified 9457 significant SNPs organized into 97 independent genetic loci, of which 60 loci were replicated. Twenty-six loci were not reported in earlier T1 and T2 IDP-based UK Biobank GWAS. We developed a perturbation-based decoder interpretation approach to show that these loci are associated with UDIPs mapped to multiple relevant brain regions. Our results establish that unsupervised deep learning can derive robust, unbiased, heritable, and interpretable brain imaging phenotypes.
Subject(s)
Genetic Loci; Genome-Wide Association Study; Humans; Genome-Wide Association Study/methods; Phenotype; Brain/diagnostic imaging; Neuroimaging
ABSTRACT
Methicillin-resistant Staphylococcus aureus (MRSA) causes significant morbidity and mortality in hospitals. Rapid, accurate risk stratification of MRSA is crucial for optimizing antibiotic therapy. Our study introduced a deep learning model, PyTorch_EHR, which leverages electronic health record (EHR) time-series data, including a wide variety of patient-specific data, to predict MRSA culture positivity within two weeks. A total of 8,164 MRSA and 22,393 non-MRSA patient events from Memorial Hermann Hospital System, Houston, Texas, were used for model development. PyTorch_EHR outperforms logistic regression (LR) and light gradient boosting machine (LGBM) models in accuracy (AUROC: PyTorch_EHR = 0.911, LR = 0.857, LGBM = 0.892). External validation with 393,713 patient events from the Medical Information Mart for Intensive Care (MIMIC)-IV dataset in Boston confirms its superior accuracy (AUROC: PyTorch_EHR = 0.859, LR = 0.816, LGBM = 0.838). Our model effectively stratifies patients into high-, medium-, and low-risk categories, potentially optimizing antimicrobial therapy and reducing unnecessary MRSA-specific antimicrobials. This highlights the advantage of deep learning models in predicting MRSA-positive cultures, surpassing traditional machine learning models and supporting clinicians' judgments.
Subject(s)
Deep Learning; Methicillin-Resistant Staphylococcus aureus; Humans; Electronic Health Records; Methicillin-Resistant Staphylococcus aureus/genetics; Critical Care; Hospitals
ABSTRACT
BACKGROUND: The rapid evolution of artificial intelligence (AI) in conjunction with recent updates in dual antiplatelet therapy (DAPT) management guidelines emphasizes the necessity for innovative models to predict ischemic or bleeding events after drug-eluting stent implantation. Leveraging AI for dynamic prediction has the potential to revolutionize risk stratification and provide personalized decision support for DAPT management. METHODS AND RESULTS: We developed and validated a new AI-based pipeline using retrospective data of drug-eluting stent-treated patients, sourced from the Cerner Health Facts data set (n=98 236) and Optum's de-identified Clinformatics Data Mart Database (n=9978). The 36 months following drug-eluting stent implantation were designated as our primary forecasting interval, further segmented into 6 sequential prediction windows. We evaluated 5 distinct AI algorithms for their precision in predicting ischemic and bleeding risks. Model discriminative accuracy was assessed using the area under the receiver operating characteristic curve, among other metrics. The weighted light gradient boosting machine stood out as the preeminent model, thus earning its place as our AI-DAPT model. The AI-DAPT demonstrated peak accuracy in the 30 to 36 months window, charting an area under the receiver operating characteristic curve of 90% [95% CI, 88%-92%] for ischemia and 84% [95% CI, 82%-87%] for bleeding predictions. CONCLUSIONS: Our AI-DAPT excels in formulating iterative, refined dynamic predictions by assimilating ongoing updates from patients' clinical profiles, holding value as a novel smart clinical tool to facilitate optimal DAPT duration management with high accuracy and adaptability.
Subject(s)
Coronary Artery Disease; Drug-Eluting Stents; Myocardial Infarction; Percutaneous Coronary Intervention; Humans; Platelet Aggregation Inhibitors/adverse effects; Myocardial Infarction/etiology; Coronary Artery Disease/diagnosis; Coronary Artery Disease/surgery; Drug-Eluting Stents/adverse effects; Artificial Intelligence; Retrospective Studies; Treatment Outcome; Risk Factors; Drug Therapy, Combination; Hemorrhage/chemically induced; Prognosis; Percutaneous Coronary Intervention/adverse effects
ABSTRACT
Although genome-wide association studies (GWAS) have identified tens of thousands of genetic loci, the genetic architecture is still not fully understood for many complex traits. Most GWAS and sequencing association studies have focused on single nucleotide polymorphisms or copy number variations, including common and rare genetic variants. However, phased haplotype information is often ignored in GWAS or variant set tests for rare variants. Here we leverage the identity-by-descent (IBD) segments inferred from a random projection-based IBD detection algorithm in the mapping of genetic associations with complex traits, to develop a computationally efficient statistical test for IBD mapping in biobank-scale cohorts. We used sparse linear algebra and random matrix algorithms to speed up the computation, and a genome-wide IBD mapping scan of more than 400,000 samples finished within a few hours. Simulation studies showed that our new method had well-controlled type I error rates under the null hypothesis of no genetic association in large biobank-scale cohorts, and outperformed traditional GWAS single-variant tests when the causal variants were untyped and rare, or in the presence of haplotype effects. We also applied our method to IBD mapping of six anthropometric traits using the UK Biobank data and identified a total of 3,442 associations, 2,131 (62%) of which remained significant after conditioning on suggestive tag variants in the ± 3 centimorgan flanking regions from GWAS.
Subject(s)
Biological Specimen Banks; Genome-Wide Association Study; Humans; Genome-Wide Association Study/methods; DNA Copy Number Variations; Haplotypes/genetics; Phenotype; Polymorphism, Single Nucleotide/genetics
ABSTRACT
The availability of large genotyped cohorts brings new opportunities for revealing high-resolution genetic structure of admixed populations, via local ancestry inference (LAI), the process of identifying the ancestry of each segment of an individual haplotype. Though current methods achieve high accuracy in standard cases, LAI is still challenging when reference populations are more similar (e.g., intra-continental), when the reference populations are too numerous, or when the admixture events are deep in time, all of which are increasingly unavoidable in large biobanks. Here, we present a new LAI method, Recomb-Mix. Adopting the commonly used site-based formulation based on the classic Li and Stephens model, Recomb-Mix integrates the elements of existing methods and introduces a new graph-collapsing step to simplify counting paths with the same ancestry label readout. Through comprehensive benchmarking on various simulated datasets, we show that Recomb-Mix is more accurate than existing methods in diverse sets of scenarios while being competitive in terms of resource efficiency. We expect that Recomb-Mix will be a useful method for advancing genetics studies of admixed populations.
ABSTRACT
Genome-wide association studies (GWAS) are used to identify relationships between genetic variations and specific traits. When applied to high-dimensional medical imaging data, a key step is to extract lower-dimensional, yet informative representations of the data as traits. Representation learning for imaging genetics is largely under-explored due to the unique challenges posed by GWAS in comparison to typical visual representation learning. In this study, we tackle this problem from the mutual information (MI) perspective by identifying key limitations of existing methods. We introduce a trans-modal learning framework Genetic InfoMax (GIM), including a regularized MI estimator and a novel genetics-informed transformer to address the specific challenges of GWAS. We evaluate GIM on human brain 3D MRI data and establish standardized evaluation protocols to compare it to existing approaches. Our results demonstrate the effectiveness of GIM and a significantly improved performance on GWAS.
ABSTRACT
The Li and Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel. For small panels, the probabilistic parameterization of LS enables modeling the uncertainties of such mosaics. However, LS becomes inefficient when sample size is large, because of its linear time complexity. Recently the PBWT, an efficient data structure capturing the local haplotype matching among haplotypes, was proposed to offer a fast method for giving some optimal solution (Viterbi) to the LS HMM. Previously, we introduced the minimal positional substring cover (MPSC) problem as an alternative formulation of LS whose objective is to cover a query haplotype by a minimum number of segments from haplotypes in a reference panel. The MPSC formulation allows the generation of a haplotype threading in time constant to sample size (O(N)). This allows haplotype threading on very large biobank-scale panels on which the LS model is infeasible. Here, we present new results on the solution space of the MPSC. In addition, we derived a number of optimal algorithms for MPSC, including solution enumerations, the length maximal MPSC, and h-MPSC solutions. In doing so, our algorithms reveal the solution space of LS for large panels. We show that our method is informative in terms of revealing the characteristics of biobank-scale data sets and can improve genotype imputation.
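The covering objective behind the MPSC problem can be illustrated with a naive greedy scan: at each site, take the panel haplotype that matches the query for the longest stretch, then continue from where that match ends. Because any sub-stretch of a match is itself a match, this farthest-reach choice should produce a cover with the fewest segments. The sketch below is only an illustration: it runs in O(MN) per query rather than using PBWT as the abstract describes, and the function name and the string encoding of haplotypes are assumptions.

```python
def greedy_mpsc(query, panel):
    """Cover `query` with the fewest positional substrings taken from `panel`.

    Returns a list of (start, end, haplotype_index) with half-open site ranges.
    """
    n = len(query)
    cover, pos = [], 0
    while pos < n:
        best_end, best_hap = pos, None
        for idx, hap in enumerate(panel):
            end = pos
            while end < n and hap[end] == query[end]:
                end += 1                       # extend the match site by site
            if end > best_end:
                best_end, best_hap = end, idx  # remember the farthest-reaching haplotype
        if best_hap is None:
            raise ValueError(f"no panel haplotype matches the query at site {pos}")
        cover.append((pos, best_end, best_hap))
        pos = best_end
    return cover

panel = ["00000", "11111", "00111"]
threading = greedy_mpsc("00110", panel)
```

For the toy panel, the query is threaded through haplotype 2 for sites 0 to 3 and haplotype 0 for the last site, giving a cover of two segments.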
Subject(s)
Algorithms; Software; Humans; Haplotypes; Genotype; Ethnicity
ABSTRACT
Although rates of recombination events across the genome (genetic maps) are fundamental to genetic research, the majority of current studies only use one standard map. There is evidence suggesting population differences in genetic maps, and thus estimating population-specific maps is of interest. Although the recent availability of biobank-scale data offers such opportunities, current methods are not efficient at leveraging very large sample sizes. The most accurate methods are still linkage disequilibrium (LD)-based methods that are only tractable for a few hundred samples. In this work, we propose a fast and memory-efficient method for estimating genetic maps from population genotyping data. Our method, FastRecomb, leverages the efficient positional Burrows-Wheeler transform (PBWT) data structure for counting IBD segment boundaries as potential recombination events. We used PBWT blocks to avoid redundant counting of pairwise matches. Moreover, we used a panel-smoothing technique to reduce the noise from errors and recent mutations. Using simulation, we found that FastRecomb achieves state-of-the-art performance at 10-kb resolution, in terms of correlation coefficients between the estimated map and the ground truth. This is mainly because FastRecomb can effectively take advantage of large panels comprising more than hundreds of thousands of haplotypes. At the same time, other methods lack the efficiency to handle such data. We believe further refinement of FastRecomb would deliver more accurate genetic maps for the genetics community.
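The core counting idea, treating IBD segment boundaries as evidence of past recombination, can be sketched by binning boundary positions at 10-kb resolution. This is a simplified illustration and not FastRecomb itself: segments are assumed to be given as (start, end) base-pair pairs, there is no PBWT block handling or panel smoothing, and the output is a relative share of boundaries per bin, not a calibrated map in centimorgans. The function name and parameters are hypothetical.

```python
def boundary_rates(segments, genome_length, bin_size=10_000):
    """Return the fraction of IBD segment boundaries that fall in each bin."""
    n_bins = -(-genome_length // bin_size)          # ceiling division
    counts = [0] * n_bins
    for start, end in segments:
        for pos in (start, end):
            if 0 < pos < genome_length:             # chromosome ends are not recombination events
                counts[pos // bin_size] += 1
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]

# Toy input: three IBD segments on a 40-kb region.
segments = [(0, 15_000), (15_000, 40_000), (5_000, 25_000)]
rates = boundary_rates(segments, genome_length=40_000)
```

In the toy input, half of the interior boundaries fall in the second 10-kb bin, so that bin receives the largest relative rate.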
Subject(s)
Biological Specimen Banks; Genome; Haplotypes; Linkage Disequilibrium; Polymorphism, Single Nucleotide; Recombination, Genetic
ABSTRACT
MOTIVATION: Due to the rapid growth of genetic database sizes, genealogical search, a process of inferring familial relatedness by identifying DNA matches, has become a viable approach to help individuals find missing family members or law enforcement agencies locate suspects. A fast and accurate method is needed to search an out-of-database individual against millions of individuals. Most existing approaches only offer all-versus-all within-panel matching. Some prototype algorithms offer one-versus-all queries from an out-of-panel individual, but they do not tolerate errors. RESULTS: A new method, random projection-based identity-by-descent (IBD) detection (RaPID) query, is introduced to make fast genealogical search possible. RaPID-Query identifies IBD segments between a query haplotype and a panel of haplotypes. By integrating matches over multiple PBWT indexes, RaPID-Query manages to locate IBD segments quickly with a given cutoff length while allowing mismatched sites. A single query against all UK Biobank autosomal chromosomes was completed within 2.76 seconds on average, with a minimum length of 7 cM and 700 markers. RaPID-Query achieved a 0.016 false negative rate and a 0.012 false positive rate simultaneously on a chromosome 20 sequencing panel having 86 265 sites. This is comparable to the state-of-the-art IBD detection methods TPBWT (out-of-sample) and Hap-IBD. The high-quality IBD segments yielded by RaPID-Query were able to distinguish up to the fourth degree of familial relatedness for a given individual pair, and the area under the receiver operating characteristic curve values are at least 97.28%. AVAILABILITY AND IMPLEMENTATION: The RaPID-Query program is available at https://github.com/ucfcbb/RaPID-Query.
Subject(s)
Algorithms; Chromosomes; Humans; Haplotypes; Sequence Analysis
ABSTRACT
PURPOSE: Early detection of brain metastases (BMs) is critical for prompt treatment and optimal control of the disease. In this study, we seek to predict the risk of developing BM among patients diagnosed with lung cancer on the basis of electronic health record (EHR) data and, through explainable artificial intelligence approaches, to understand which factors are important for the model to predict BM development accurately. MATERIALS AND METHODS: We trained a recurrent neural network model, REverse Time AttentIoN (RETAIN), to predict the risk of developing BM using structured EHR data. To interpret the model's decision process, we analyzed the attention weights in the RETAIN model and the SHAP values from a feature attribution method, Kernel SHAP, to identify the factors contributing to BM prediction. RESULTS: We developed a high-quality cohort with 4,466 patients with BM from the Cerner Health Facts database, which contains over 70 million patients from more than 600 hospitals. RETAIN uses this data set to achieve the best area under the receiver operating characteristic curve at 0.825, a significant improvement over the baseline model. We also extended a feature attribution method, Kernel SHAP, to structured EHR data for model interpretation. Both RETAIN and Kernel SHAP can identify important features related to BM prediction. CONCLUSION: To the best of our knowledge, this is the first study to predict BM using structured EHR data. We achieved decent prediction performance for BM prediction and identified factors highly relevant to BM development. The sensitivity analysis demonstrated that both RETAIN and Kernel SHAP could discriminate unrelated features and put more weight on the features important to BM. Our study explored the potential of applying explainable artificial intelligence for future clinical applications.
Subject(s)
Brain Neoplasms; Lung Neoplasms; Humans; Artificial Intelligence; Electronic Health Records; Early Detection of Cancer; Brain Neoplasms/secondary
ABSTRACT
The Li & Stephens (LS) hidden Markov model (HMM) models the process of reconstructing a haplotype as a mosaic copy of haplotypes in a reference panel (haplotype threading). For small panels the probabilistic parameterization of LS enables modeling the uncertainties of such mosaics, and has been the foundational model for haplotype phasing and imputation. However, LS becomes inefficient when sample size is large (tens of thousands to millions), because of its linear time complexity (O(MN), where M is the number of haplotypes and N is the number of sites in the panel). Recently the PBWT, an efficient data structure capturing the local haplotype matching among haplotypes, was proposed to offer fast methods for giving some optimal solution (Viterbi) to the LS HMM. But the solution space of the LS for large panels is still elusive. Previously we introduced the Minimal Positional Substring Cover (MPSC) problem as an alternative formulation of LS whose objective is to cover a query haplotype by a minimum number of segments from haplotypes in a reference panel. The MPSC formulation allows the generation of a haplotype threading in time constant to sample size (O(N)). This allows haplotype threading on very large biobank-scale panels on which the LS model is infeasible. Here we present new results on the solution space of the MPSC by first identifying a property that any MPSC will have a set of required regions, and then proposing an MPSC graph. In addition, we derived a number of optimal algorithms for MPSC, including solution enumerations, the Length Maximal MPSC, and h-MPSC solutions. In doing so, our algorithms reveal the solution space of LS for large panels. Even though we only solved an extreme case of LS where the emission probability is 0, our algorithms can be made more robust by PBWT smoothing. We show that our method is informative in terms of revealing the characteristics of biobank-scale data sets and can improve genotype imputation.
ABSTRACT
While rates of recombination events across the genome (genetic maps) are fundamental to genetic research, the majority of current studies only use one standard map. There is evidence suggesting population differences in genetic maps, and thus estimating population-specific maps is of interest. While the recent availability of biobank-scale data offers such opportunities, current methods are not efficient at leveraging very large sample sizes. The most accurate methods are still linkage-disequilibrium (LD)-based methods that are only tractable for a few hundred samples. In this work, we propose a fast and memory-efficient method for estimating genetic maps from population genotyping data. Our method, FastRecomb, leverages the efficient positional Burrows-Wheeler transform (PBWT) data structure for counting IBD segment boundaries as potential recombination events. We used PBWT blocks to avoid redundant counting of pairwise matches. Moreover, we used a panel smoothing technique to reduce the noise from errors and recent mutations. Using simulation, we found that FastRecomb achieves state-of-the-art performance at 10-kb resolution, in terms of correlation coefficients between the estimated map and the ground truth. This is mainly because FastRecomb can effectively take advantage of large panels comprising more than hundreds of thousands of haplotypes. At the same time, other methods lack the efficiency to handle such data. We believe further refinement of FastRecomb would deliver more accurate genetic maps for the genetics community.
ABSTRACT
MOTIVATION: The positional Burrows-Wheeler transform (PBWT) has led to tremendous strides in haplotype matching on biobank-scale data. For genetic genealogical search, PBWT-based methods have optimized the asymptotic runtime of finding long matches between a query haplotype and a predefined panel of haplotypes. However, to enable fast query searches, the full-sized panel and PBWT data structures must be kept in memory, preventing existing algorithms from scaling up to modern biobank panels consisting of millions of haplotypes. In this work, we propose a space-efficient variation of PBWT named Syllable-PBWT, which divides every haplotype into syllables, builds the PBWT positional prefix arrays on the compressed syllabic panel, and leverages the polynomial rolling hash function for positional substring comparison. With the Syllable-PBWT data structures, we then present a long match query algorithm named Syllable-Query. RESULTS: Compared to the most time- and space-efficient previously published solution to the long match query problem, Syllable-Query reduced the memory use by a factor of over 100 on both the UK Biobank genotype data and the 1000 Genomes Project sequence data. Surprisingly, the smaller size of our syllabic data structures allows for more efficient iteration and CPU cache usage, granting Syllable-Query even faster runtime than existing solutions. AVAILABILITY AND IMPLEMENTATION: https://github.com/ZhiGroup/Syllable-PBWT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
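The combination of syllables and a polynomial rolling hash can be illustrated as follows: each haplotype is cut into fixed-width syllables, prefix hashes are computed over the syllable sequence, and any range of syllables can then be compared between two haplotypes in constant time. This is a minimal sketch, not the Syllable-PBWT implementation: the modulus, base, zero-padding of the last syllable, and all function names are assumptions, and equal hashes indicate equality only up to a small collision probability.

```python
MOD = (1 << 61) - 1      # large prime modulus (assumed choice)
BASE = 1_000_003         # hash base (assumed choice)

def to_syllables(haplotype, width):
    """Cut a 0/1 string into integers of `width` bits; the tail is zero-padded."""
    return [int(haplotype[i:i + width].ljust(width, "0"), 2)
            for i in range(0, len(haplotype), width)]

def prefix_hashes(syllables):
    """hashes[k] is the polynomial hash of the first k syllables."""
    hashes = [0]
    for s in syllables:
        hashes.append((hashes[-1] * BASE + s + 1) % MOD)
    return hashes

def range_hash(hashes, lo, hi):
    """Hash of syllables lo..hi-1, derived from two prefix hashes."""
    return (hashes[hi] - hashes[lo] * pow(BASE, hi - lo, MOD)) % MOD

def same_syllables(hashes_a, hashes_b, lo, hi):
    """Compare the same positional syllable range of two haplotypes by hash."""
    return range_hash(hashes_a, lo, hi) == range_hash(hashes_b, lo, hi)

a = prefix_hashes(to_syllables("0101101011110000", 4))
b = prefix_hashes(to_syllables("0101101011110001", 4))
match_first_three = same_syllables(a, b, 0, 3)
match_all_four = same_syllables(a, b, 0, 4)
```

The two toy haplotypes agree on their first three syllables and differ in the fourth, so the first comparison succeeds and the second fails.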
Subject(s)
Algorithms; Genome; Haplotypes; Genotype; Software; Sequence Analysis, DNA/methods
ABSTRACT
In the recent biobank era of genetics, the problem of identical-by-descent (IBD) segment detection has received renewed interest, as IBD segments in large cohorts offer unprecedented opportunities in the study of population and genealogical history, as well as genetic association of long haplotypes. While a new generation of efficient methods for IBD segment detection has become available, direct comparison of these methods is difficult: existing benchmarks were often evaluated on different datasets, with some not openly accessible; methods benchmarked were run under suboptimal parameters; and benchmark performance metrics were not defined consistently. Here, we developed a comprehensive and completely open-source evaluation of the power, accuracy, and resource consumption of these IBD segment detection methods using realistic population genetic simulations with various settings. Our results pave the way for fair evaluation of IBD segment detection methods and provide a practical guide for users.