Results 1 - 20 of 21
1.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38836403

ABSTRACT

In precision medicine, both predicting the disease susceptibility of an individual and forecasting its disease-free survival are areas of key research. Besides the classical epidemiological predictor variables, data from multiple (omic) platforms are increasingly available. To integrate this wealth of information, we propose new methodology to combine both cooperative learning, a recent approach to leverage the predictive power of several datasets, and polygenic hazard score models. Polygenic hazard score models provide a practitioner with a more differentiated view of the predicted disease-free survival than the one given by merely a point estimate, for instance computed with a polygenic risk score. Our aim is to leverage the advantages of cooperative learning for the computation of polygenic hazard score models via Cox's proportional hazard model, thereby improving the prediction of the disease-free survival. In our experimental study, we apply our methodology to forecast the disease-free survival for Alzheimer's disease (AD) using three layers of data. One layer contains epidemiological variables such as sex, APOE (apolipoprotein E, a genetic risk factor for AD) status and 10 leading principal components. Another layer contains selected genomic loci, and the last layer contains methylation data for selected CpG sites. We demonstrate that the survival curves computed via cooperative learning yield an AUC of around 0.7, above the state-of-the-art performance of its competitors. Importantly, the proposed methodology returns (1) a linear score that can be easily interpreted (in contrast to machine learning approaches), and (2) a weighting of the predictive power of the involved data layers, allowing for an assessment of the importance of each omic (or other) platform. Similarly to polygenic hazard score models, our methodology also allows one to compute individual survival curves for each patient.


Subject(s)
Alzheimer Disease , Precision Medicine , Humans , Precision Medicine/methods , Alzheimer Disease/genetics , Alzheimer Disease/mortality , Disease-Free Survival , Machine Learning , Proportional Hazards Models , Multifactorial Inheritance , Male , Female , Multiomics
2.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36585781

ABSTRACT

Genetic similarity matrices are commonly used to assess population substructure (PS) in genetic studies. Through simulation studies and by the application to whole-genome sequencing (WGS) data, we evaluate the performance of three genetic similarity matrices: the unweighted and weighted Jaccard similarity matrices and the genetic relationship matrix. We describe different scenarios that can create numerical pitfalls and lead to incorrect conclusions in some instances. We consider scenarios in which PS is assessed based on loci that are located across the genome ('globally') and based on loci from a specific genomic region ('locally'). We also compare scenarios in which PS is evaluated based on loci from different minor allele frequency bins: common (>5%), low-frequency (5-0.5%) and rare (<0.5%) single-nucleotide variations (SNVs). Overall, we observe that all approaches provide the best clustering performance when computed based on rare SNVs. The performance of the similarity matrices is very similar for common and low-frequency variants, but for rare variants, the unweighted Jaccard matrix provides preferable clustering features. Based on visual inspection and in terms of standard clustering metrics, its clusters are the densest and the best separated in the principal component analysis of variants with rare SNVs compared with the other methods and different allele frequency cutoffs. In an application, we assessed the role of rare variants on local and global PS, using WGS data from multiethnic Alzheimer's disease data sets and European or East Asian populations from the 1000 Genome Project.


Subject(s)
Genome , Genomics , Principal Component Analysis , Gene Frequency , Computer Simulation , Genome-Wide Association Study , Polymorphism, Single Nucleotide
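The unweighted Jaccard similarity matrix and its principal components, as discussed in the abstract above, can be illustrated with a minimal sketch. It assumes a binary genotype matrix X (rows are individuals, columns are loci, 1 marks the presence of a minor allele); the function names jaccard_matrix and top_components are hypothetical and are not taken from the article or from any package. Pairs of individuals that carry no variants at all are assigned similarity 0 here, which is a convention chosen for this sketch only.

```python
import numpy as np

def jaccard_matrix(X):
    """Unweighted Jaccard similarity between the rows of a binary matrix X."""
    X = np.asarray(X, dtype=np.int64)
    inter = X @ X.T                      # shared variants for every pair of rows
    counts = X.sum(axis=1)               # number of variants carried by each row
    union = counts[:, None] + counts[None, :] - inter
    J = np.zeros(inter.shape, dtype=float)
    # Divide only where the union is non-empty; other entries stay 0 (sketch convention).
    np.divide(inter, union, out=J, where=union > 0)
    return J

def top_components(S, k=2):
    """Leading k eigenvalues and eigenvectors of a symmetric similarity matrix S."""
    vals, vecs = np.linalg.eigh(S)       # eigh returns eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]   # indices of the k largest eigenvalues
    return vals[order], vecs[:, order]
```

Plotting the first two columns returned by top_components against each other is the usual way to inspect clustering by population; restricting the columns of X to a minor allele frequency bin before calling jaccard_matrix would mimic the common, low-frequency, and rare comparisons described above.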
3.
BMC Bioinformatics ; 25(1): 43, 2024 Jan 25.
Article in English | MEDLINE | ID: mdl-38273228

ABSTRACT

The computation of a similarity measure for genomic data is a standard tool in computational genetics. The principal components of such matrices are routinely used to correct for biases due to confounding by population stratification, for instance in linear regressions. However, the calculation of both a similarity matrix and its singular value decomposition (SVD) are computationally intensive. The contribution of this article is threefold. First, we demonstrate that the calculation of three matrices (called the covariance matrix, the weighted Jaccard matrix, and the genomic relationship matrix) can be reformulated in a unified way which allows for the application of a randomized SVD algorithm, which is faster than the traditional computation. The fast SVD algorithm we present is adapted from an existing randomized SVD algorithm and ensures that all computations are carried out in sparse matrix algebra. The algorithm only assumes that row-wise and column-wise subtraction and multiplication of a vector with a sparse matrix is available, an operation that is efficiently implemented in common sparse matrix packages. An exception is the so-called Jaccard matrix, which does not have a structure applicable for the fast SVD algorithm. Second, an approximate Jaccard matrix is introduced to which the fast SVD computation is applicable. Third, we establish guaranteed theoretical bounds on the accuracy (in [Formula: see text] norm and angle) between the principal components of the Jaccard matrix and the ones of our proposed approximation, thus putting the proposed Jaccard approximation on a solid mathematical foundation, and derive the theoretical runtime of our algorithm. We illustrate that the approximation error is low in practice and empirically verify the theoretical runtime scalings on both simulated data and data of the 1000 Genome Project.


Subject(s)
Genome , Genomics , Algorithms , Linear Models
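For orientation, the generic randomized SVD idea referred to in the abstract above can be sketched as follows. This is the textbook range-finder scheme with a few power iterations, written for dense NumPy arrays; it is not the sparse-algebra variant developed in the article, and the function name randomized_svd as well as the parameters oversample, n_iter, and seed are assumptions made for illustration.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate rank-k SVD of A via a random projection (generic sketch)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    l = min(n, k + oversample)           # sketch width, slightly larger than k
    Omega = rng.standard_normal((n, l))  # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)       # orthonormal basis for the sampled range of A
    for _ in range(n_iter):              # power iterations sharpen the basis
        Z, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Z)
    B = Q.T @ A                          # small matrix holding A restricted to the basis
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub                           # lift left singular vectors back to the original space
    return U[:, :k], s[:k], Vt[:k, :]
```

The only operations applied to A are products with thin matrices, which is why a scheme of this kind can in principle be run with sparse matrix algebra when A is a sparse genotype matrix.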
4.
Alzheimers Dement ; 20(5): 3397-3405, 2024 May.
Article in English | MEDLINE | ID: mdl-38563508

ABSTRACT

INTRODUCTION: Genome-wide association studies have identified numerous disease susceptibility loci (DSLs) for Alzheimer's disease (AD). However, only a limited number of studies have investigated the dependence of the genetic effect size of established DSLs on genetic ancestry. METHODS: We utilized the whole genome sequencing data from the Alzheimer's Disease Sequencing Project (ADSP) including 35,569 participants. A total of 25,459 subjects in four distinct populations (African ancestry, non-Hispanic White, admixed Hispanic, and Asian) were analyzed. RESULTS: We found that nine DSLs showed significant heterogeneity across populations. Single nucleotide polymorphism (SNP) rs2075650 in translocase of outer mitochondrial membrane 40 (TOMM40) showed the largest heterogeneity (Cochran's Q = 0.00, I² = 90.08), followed by other SNPs in apolipoprotein C1 (APOC1) and apolipoprotein E (APOE). Two additional loci, signal-induced proliferation-associated 1 like 2 (SIPA1L2) and solute carrier 24 member 4 (SLC24A4), showed significant heterogeneity across populations. DISCUSSION: We observed substantial heterogeneity for the APOE-harboring 19q13.32 region with TOMM40/APOE/APOC1 genes. The largest risk effect was seen among African Americans, while Asians showed a surprisingly small risk effect.


Subject(s)
Alzheimer Disease , Genetic Predisposition to Disease , Genome-Wide Association Study , Mitochondrial Precursor Protein Import Complex Proteins , Polymorphism, Single Nucleotide , Humans , Alzheimer Disease/genetics , Genetic Predisposition to Disease/genetics , Polymorphism, Single Nucleotide/genetics , Apolipoproteins E/genetics , Female , Male , Apolipoprotein C-I/genetics , Aged , Membrane Transport Proteins/genetics , Genetic Loci/genetics
5.
BMC Bioinformatics ; 23(1): 547, 2022 Dec 19.
Article in English | MEDLINE | ID: mdl-36536276

ABSTRACT

As of June 2022, the GISAID database contains more than 11 million SARS-CoV-2 genomes, including several thousand nucleotide sequences for the most common variants such as delta or omicron. These SARS-CoV-2 strains have been collected from patients around the world since the beginning of the pandemic. We start by assessing the similarity of all pairs of nucleotide sequences using the Jaccard index and principal component analysis. As shown previously in the literature, an unsupervised cluster analysis applied to the SARS-CoV-2 genomes results in clusters of sequences according to certain characteristics such as their strain or their clade. Importantly, we observe that nucleotide sequences of common variants are often outliers in clusters of sequences stemming from variants identified earlier on during the pandemic. Motivated by this finding, we are interested in applying outlier detection to nucleotide sequences. We demonstrate that nucleotide sequences of common variants (such as alpha, delta, or omicron) can be identified solely based on a statistical outlier criterion. We argue that outlier detection might be a useful surveillance tool to identify emerging variants in real time as the pandemic progresses.


Subject(s)
COVID-19 , Humans , Base Sequence , SARS-CoV-2 , Cluster Analysis , Databases, Factual
6.
Genet Epidemiol ; 45(3): 316-323, 2021 04.
Article in English | MEDLINE | ID: mdl-33415739

ABSTRACT

Over 10,000 viral genome sequences of the SARS-CoV-2 virus have been made readily available during the ongoing coronavirus pandemic since the initial genome sequence of the virus was released on the open access Virological website (http://virological.org/) early on January 11. We utilize the published data on the single stranded RNAs of 11,132 SARS-CoV-2 patients in the GISAID database, which contains fully or partially sequenced SARS-CoV-2 samples from laboratories around the world. Among many important research questions which are currently being investigated, one aspect pertains to the genetic characterization/classification of the virus. We analyze data on the nucleotide sequencing of the virus and geographic information of a subset of 7640 SARS-CoV-2 patients without missing entries that are available in the GISAID database. Instead of modeling the mutation rate, applying phylogenetic tree approaches, and so forth, we here utilize a model-free clustering approach that compares the viruses at a genome-wide level. We apply principal component analysis to a similarity matrix that compares all pairs of these SARS-CoV-2 nucleotide sequences at all loci simultaneously, using the Jaccard index. Our analysis results of the SARS-CoV-2 genome data illustrate the geographic and chronological progression of the virus, starting from the first cases that were observed in China to the current wave of cases in Europe and North America. This is in line with a phylogenetic analysis which we use to contrast our results. We also observe that, based on their sequence data, the SARS-CoV-2 viruses cluster in distinct genetic subgroups. It is the subject of ongoing research to examine whether the genetic subgroup could be related to disease outcome and its potential implications for vaccine development.


Subject(s)
COVID-19/virology , Cluster Analysis , Genome, Viral/genetics , Geographic Mapping , SARS-CoV-2/classification , SARS-CoV-2/genetics , COVID-19/epidemiology , China/epidemiology , Databases, Genetic , Europe/epidemiology , Humans , Molecular Epidemiology , North America/epidemiology , Pandemics , Phylogeny , Principal Component Analysis , Prognosis , SARS-CoV-2/isolation & purification , SARS-CoV-2/pathogenicity , Spatio-Temporal Analysis
7.
Genet Epidemiol ; 45(1): 82-98, 2021 02.
Article in English | MEDLINE | ID: mdl-32929743

ABSTRACT

locStra is an R package for the analysis of regional and global population stratification in whole-genome sequencing (WGS) studies, where regional stratification refers to the substructure defined by the loci in a particular region on the genome. Population substructure can be assessed based on the genetic covariance matrix, the genomic relationship matrix, and the unweighted/weighted genetic Jaccard similarity matrix. Using a sliding window approach, the regional similarity matrices are compared with the global ones, based on user-defined window sizes and metrics, for example, the correlation between regional and global eigenvectors. An algorithm for the specification of the window size is provided. As the implementation fully exploits sparse matrix algebra and is written in C++, the analysis is highly efficient. Even on single cores, for realistic study sizes (several thousand subjects, several million rare variants per subject), the runtime for the genome-wide computation of all regional similarity matrices typically does not exceed one hour, enabling an unprecedented investigation of regional stratification across the entire genome. The package is applied to three WGS studies, illustrating the varying patterns of regional substructure across the genome and its beneficial effects on association testing.


Subject(s)
Genome-Wide Association Study , Genome , Algorithms , Genomics , Humans , Polymorphism, Single Nucleotide , Whole Genome Sequencing
8.
Genet Epidemiol ; 45(7): 685-693, 2021 10.
Article in English | MEDLINE | ID: mdl-34159627

ABSTRACT

SARS-CoV-2 mortality has been extensively studied in relation to host susceptibility. How sequence variations in the SARS-CoV-2 genome affect pathogenicity is poorly understood. Starting in October 2020, using the methodology of genome-wide association studies (GWAS), we looked at the association between whole-genome sequencing (WGS) data of the virus and COVID-19 mortality as a potential method of early identification of highly pathogenic strains to target for containment. Although continuously updating our analysis, in December 2020, we analyzed 7548 single-stranded SARS-CoV-2 genomes of COVID-19 patients in the GISAID database and associated variants with mortality using a logistic regression. In total, evaluating 29,891 sequenced loci of the viral genome for association with patient/host mortality, two loci, at 12,053 and 25,088 bp, achieved genome-wide significance (p values of 4.09e-09 and 4.41e-23, respectively), though only 25,088 bp remained significant in follow-up analyses. Our association findings were exclusively driven by the samples that were submitted from Brazil (p value of 4.90e-13 for 25,088 bp). The mutation frequency of 25,088 bp in the Brazilian samples on GISAID has rapidly increased from about 0.4 in October/December 2020 to 0.77 in March 2021. Although GWAS methodology is suitable for samples in which mutation frequencies vary between geographical regions, it cannot account for mutation frequencies that change rapidly over time, rendering a GWAS follow-up analysis of the GISAID samples that have been submitted after December 2020 invalid. The locus at 25,088 bp is located in the P.1 strain, which later (April 2021) became one of the distinguishing loci (precisely, substitution V1176F) of the Brazilian strain as defined by the Centers for Disease Control. Specifically, the mutations at 25,088 bp occur in the S2 subunit of the SARS-CoV-2 spike protein, which plays a key role in viral entry of target host cells. Since the mutations alter amino acid coding sequences, they potentially impose structural changes that could enhance viral infectivity and symptom severity. Our analysis suggests that GWAS methodology can provide suitable analysis tools for the real-time detection of new more transmissible and pathogenic viral strains in databases such as GISAID, though new approaches are needed to accommodate rapidly changing mutation frequencies over time, in the presence of simultaneously changing case/control ratios. Improvements of the associated metadata/patient information in terms of quality and availability will also be important to fully utilize the potential of GWAS methodology in this field.


Subject(s)
COVID-19 , Spike Glycoprotein, Coronavirus , Brazil , Genome-Wide Association Study , Humans , Mutation , Phylogeny , SARS-CoV-2 , Spike Glycoprotein, Coronavirus/genetics
9.
Ecol Evol ; 14(6): e11530, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38895566

ABSTRACT

The capacity of forests to sequester carbon in both above- and belowground compartments is a crucial tool to mitigate rising atmospheric carbon concentrations. Belowground carbon storage in forests is strongly linked to soil microbial communities that are the key drivers of soil heterotrophic respiration, organic matter decomposition and thus nutrient cycling. However, the relationships between tree diversity and soil microbial properties such as biomass and respiration remain unclear with inconsistent findings among studies. It is unknown so far how the spatial configuration and soil depth affect the relationship between tree richness and microbial properties. Here, we studied the spatial distribution of soil microbial properties in the context of a tree diversity experiment by measuring soil microbial biomass and respiration in subtropical forests (BEF-China experiment). We sampled soil cores at two depths at five locations along a spatial transect between the trees in mono- and hetero-specific tree pairs of the native deciduous species Liquidambar formosana and Sapindus saponaria. Our analyses showed decreasing soil microbial biomass and respiration with increasing soil depth and distance from the tree in mono-specific tree pairs. We calculated belowground overyielding of soil microbial biomass and respiration, that is, higher microbial biomass or respiration than expected from the monocultures, and analysed the distribution patterns along the transect. We found no general overyielding across all sampling positions and depths. Yet, we encountered a spatial pattern of microbial overyielding with a significant microbial overyielding close to L. formosana trees and microbial underyielding close to S. saponaria trees. We found similar spatial patterns across microbial properties and depths that only differed in the strength of their effects. Our results highlight the importance of small-scale variations of tree-tree interaction effects on soil microbial communities and functions and call for better integration of within-plot variability to understand biodiversity-ecosystem functioning relationships.

10.
Genes (Basel) ; 15(5)2024 04 27.
Article in English | MEDLINE | ID: mdl-38790194

ABSTRACT

Depression is heritable, differs by sex, and has environmental risk factors such as cigarette smoking. However, the effect of single nucleotide polymorphisms (SNPs) on depression through cigarette smoking and the role of sex is unclear. In order to examine the association of SNPs with depression and smoking in the UK Biobank with replication in the COPDGene study, we used counterfactual-based mediation analysis to test the indirect or mediated effect of SNPs on broad depression through the log of pack-years of cigarette smoking, adjusting for age, sex, current smoking status, and genetic ancestry (via principal components). In secondary analyses, we adjusted for age, sex, current smoking status, genetic ancestry (via principal components), income, education, and living status (urban vs. rural). In addition, we examined sex-stratified mediation models and sex-moderated mediation models. For both analyses, we adjusted for age, current smoking status, and genetic ancestry (via principal components). In the UK Biobank, rs6424532 [LOC105378800] had a statistically significant indirect effect on broad depression through the log of pack-years of cigarette smoking (p = 4.0 × 10⁻⁴) among all participants and a marginally significant indirect effect among females (p = 0.02) and males (p = 4.0 × 10⁻³). Moreover, rs10501696 [GRM5] had a marginally significant indirect effect on broad depression through the log of pack-years of cigarette smoking (p = 0.01) among all participants and a significant indirect effect among females (p = 2.2 × 10⁻³). In the secondary analyses, the sex-moderated indirect effect was marginally significant for rs10501696 [GRM5] on broad depression through the log of pack-years of cigarette smoking (p = 0.01). In the COPDGene study, the effect of an SNP (rs10501696) in GRM5 on depressive symptoms and medication was mediated by log of pack-years (p = 0.02); however, no SNPs had a sex-moderated mediated effect on depressive symptoms. In the UK Biobank, we found SNPs in two genes [LOC105378800, GRM5] with an indirect effect on broad depression through the log of pack-years of cigarette smoking. In addition, the indirect effect for GRM5 on broad depression through smoking may be moderated by sex. These results suggest that genetic regions associated with broad depression may be mediated by cigarette smoking and this relationship may be moderated by sex.


Subject(s)
Depression , Polymorphism, Single Nucleotide , Humans , Male , Female , Depression/genetics , Depression/epidemiology , Middle Aged , Aged , Smoking/genetics , Sex Factors , Genetic Predisposition to Disease , United Kingdom/epidemiology , Cigarette Smoking/genetics , Cigarette Smoking/adverse effects , Risk Factors
11.
ArXiv ; 2023 Sep 25.
Article in English | MEDLINE | ID: mdl-37808094

ABSTRACT

Principal components computed via PCA (principal component analysis) are traditionally used to reduce dimensionality in genomic data or to correct for population stratification. In this paper, we explore the penalized eigenvalue problem (PEP) which reformulates the computation of the first eigenvector as an optimization problem and adds an L1 penalty constraint. The contribution of our article is threefold. First, we extend PEP by applying Nesterov smoothing to the original LASSO-type L1 penalty. This allows one to compute analytical gradients which enable faster and more efficient minimization of the objective function associated with the optimization problem. Second, we demonstrate how higher order eigenvectors can be calculated with PEP using established results from singular value decomposition (SVD). Third, using data from the 1000 Genome Project dataset, we empirically demonstrate that our proposed smoothed PEP allows one to increase numerical stability and obtain meaningful eigenvectors. We further investigate the utility of the penalized eigenvector approach over traditional PCA.
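To make the role of the smoothing concrete, the following minimal sketch shows one standard instance of Nesterov smoothing applied to the absolute value, the building block of an L1 penalty. It assumes the quadratic prox function, which yields a Huber-type surrogate; the abstract does not state which prox function the article uses, so this choice, the function names smoothed_abs and smoothed_abs_grad, and the parameter name mu are assumptions made for illustration.

```python
import numpy as np

def smoothed_abs(x, mu):
    """Nesterov-smoothed |x| with quadratic prox function and smoothing parameter mu > 0."""
    x = np.asarray(x, dtype=float)
    # Quadratic near zero, shifted absolute value elsewhere (Huber-type function).
    return np.where(np.abs(x) <= mu, x**2 / (2.0 * mu), np.abs(x) - mu / 2.0)

def smoothed_abs_grad(x, mu):
    """Analytical gradient of smoothed_abs; defined everywhere, including at x = 0."""
    return np.clip(np.asarray(x, dtype=float) / mu, -1.0, 1.0)
```

The surrogate satisfies |x| - mu/2 <= smoothed_abs(x, mu) <= |x|, so a smaller mu gives a closer approximation of the original penalty at the cost of a less smooth objective. Summing smoothed_abs over the entries of a vector gives a differentiable replacement for its L1 norm that a gradient-based minimizer can handle.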

12.
Article in English | MEDLINE | ID: mdl-38179578

ABSTRACT

Quantum annealing is a specialized type of quantum computation that aims to use quantum fluctuations in order to obtain global minimum solutions of combinatorial optimization problems. Programmable D-Wave quantum annealers are available as cloud computing resources, which allow users low-level access to quantum annealing control features. In this article, we are interested in improving the quality of the solutions returned by a quantum annealer by encoding an initial state into the annealing process. We explore two D-Wave features that allow one to encode such an initial state: the reverse annealing (RA) and the h-gain (HG) features. RA aims to refine a known solution following an anneal path starting with a classical state representing a good solution, going backward to a point where a transverse field is present, and then finishing the annealing process with a forward anneal. The HG feature allows one to put a time-dependent weighting scheme on linear (h) biases of the Hamiltonian, and we demonstrate that this feature likewise can be used to bias the annealing to start from an initial state. We also consider a hybrid method consisting of a backward phase resembling RA and a forward phase using the HG initial state encoding. Importantly, we investigate the idea of iteratively applying RA and HG to a problem, with the goal of monotonically improving on an initial state that is not optimal. The HG encoding technique is evaluated on a variety of input problems including the edge-weighted maximum cut problem and the vertex-weighted maximum clique problem, demonstrating that the HG technique is a viable alternative to RA for some problems. We also investigate how the iterative procedures perform for both RA and HG initial state encodings on random whole-chip spin glasses with the native hardware connectivity of the D-Wave Chimera and Pegasus chips.

13.
Genes (Basel) ; 14(6)2023 05 24.
Article in English | MEDLINE | ID: mdl-37372314

ABSTRACT

We are interested in detecting a departure from the baseline in a longitudinal analysis in the context of multiple organ dysfunction syndrome (MODS). In particular, we are given gene expression reads at two time points for a fixed number of genes and individuals. The individuals can be subdivided into two groups, denoted as groups A and B. Using the two time points, we compute a contrast of gene expression reads per individual and gene. The age of each individual is known and it is used to compute, for each gene separately, a linear regression of the gene expression contrasts on the individual's age. Looking at the intercept of the linear regression to detect a departure from the baseline, we aim to reliably single out those genes for which there is a difference in the intercept among those individuals in group A and not in group B. In this work, we develop testing methodology for this setting based on two hypothesis tests: one under the null and one under an appropriately formulated alternative. We demonstrate the validity of our approach using a dataset created by bootstrapping from a real data application in the context of multiple organ dysfunction syndrome (MODS).


Subject(s)
Multiple Organ Failure , Humans , Multiple Organ Failure/genetics , Multiple Organ Failure/diagnosis , Linear Models , Gene Expression
14.
Epigenetics ; 18(1): 2257437, 2023 12.
Article in English | MEDLINE | ID: mdl-37731367

ABSTRACT

Background: Recent studies have identified thousands of associations between DNA methylation CpGs and complex diseases/traits, emphasizing the critical role of epigenetics in understanding disease aetiology and identifying biomarkers. However, association analyses based on methylation array data are susceptible to batch/slide effects, which can lead to inflated false positive rates or reduced statistical power. Results: We use multiple DNA methylation datasets based on the popular Illumina Infinium MethylationEPIC BeadChip array to describe consistent patterns and the joint distribution of slide effects across CpGs, confirming and extending previous results. The susceptible CpGs overlap with the Illumina Infinium HumanMethylation450 BeadChip array content. Conclusions: Our findings reveal systematic patterns in slide effects. The observations provide further insights into the characteristics of these effects and can improve existing adjustment approaches.


Subject(s)
DNA Methylation , Epigenesis, Genetic , Epigenomics , Multifactorial Inheritance
15.
Front Immunol ; 14: 1220028, 2023.
Article in English | MEDLINE | ID: mdl-37533854

ABSTRACT

Background: Influenza virus is responsible for a large global burden of disease, especially in children. Multiple Organ Dysfunction Syndrome (MODS) is a life-threatening and fatal complication of severe influenza infection. Methods: We measured RNA expression of 469 biologically plausible candidate genes in children admitted to North American pediatric intensive care units with severe influenza virus infection with and without MODS. Whole blood samples from 191 influenza-infected children (median age 6.4 years, IQR: 2.2, 11) were collected a median of 27 hours following admission; for 45 children a second blood sample was collected approximately seven days later. Extracted RNA was hybridized to NanoString mRNA probes, counts normalized, and analyzed using linear models controlling for age and bacterial co-infections (FDR q<0.05). Results: Comparing pediatric samples collected near admission, children with Prolonged MODS for ≥7 days (n=38; 9 deaths) had significant upregulation of nine mRNA transcripts associated with neutrophil degranulation (RETN, TCN1, OLFM4, MMP8, LCN2, BPI, LTF, S100A12, GUSB) compared to those who recovered more rapidly from MODS (n=27). These neutrophil transcripts present in early samples predicted Prolonged MODS or death when compared to patients who recovered, however in paired longitudinal samples, they were not differentially expressed over time. Instead, five genes involved in protein metabolism and/or adaptive immunity signaling pathways (RPL3, MRPL3, HLA-DMB, EEF1G, CD8A) were associated with MODS recovery within a week. Conclusion: Thus, early increased expression of neutrophil degranulation genes indicated worse clinical outcomes in children with influenza infection, consistent with reports in adult cohorts with influenza, sepsis, and acute respiratory distress syndrome.


Subject(s)
Bacterial Infections , Influenza, Human , Humans , Multiple Organ Failure/genetics , Influenza, Human/genetics , Influenza, Human/complications , Transcriptome , Phenotype , Hospitalization , Bacterial Infections/complications
16.
Sci Rep ; 12(1): 4499, 2022 Mar 16.
Article in English | MEDLINE | ID: mdl-35296721

ABSTRACT

Quantum annealers of D-Wave Systems, Inc., offer an efficient way to compute high quality solutions of NP-hard problems. This is done by mapping a problem onto the physical qubits of the quantum chip, from which a solution is obtained after quantum annealing. However, since the connectivity of the physical qubits on the chip is limited, a minor embedding of the problem structure onto the chip is required. In this process, and especially for smaller problems, many qubits will stay unused. We propose a novel method, called parallel quantum annealing, to make better use of available qubits, wherein either the same or several independent problems are solved in the same annealing cycle of a quantum annealer, assuming enough physical qubits are available to embed more than one problem. Although the individual solution quality may be slightly decreased when solving several problems in parallel (as opposed to solving each problem separately), we demonstrate that our method may give dramatic speed-ups in terms of the Time-To-Solution (TTS) metric for solving instances of the Maximum Clique problem when compared to solving each problem sequentially on the quantum annealer. Additionally, we show that solving a single Maximum Clique problem using parallel quantum annealing reduces the TTS significantly.
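As background for the Maximum Clique instances mentioned above, the sketch below builds a commonly used QUBO formulation of the problem: each selected vertex is rewarded with weight a, and each selected pair of vertices that is not joined by an edge is penalized with weight b > a. The abstract does not spell out the formulation or the weights used in the article, so the defaults a = 1 and b = 2 and the function names are assumptions. The brute-force minimizer merely stands in for the annealer and is only feasible for tiny graphs.

```python
from itertools import combinations, product

def max_clique_qubo(n, edges, a=1.0, b=2.0):
    """QUBO coefficients for Maximum Clique on vertices 0..n-1 (sketch with b > a)."""
    edge_set = {frozenset(e) for e in edges}
    Q = {(i, i): -a for i in range(n)}            # reward for selecting a vertex
    for i, j in combinations(range(n), 2):
        if frozenset((i, j)) not in edge_set:
            Q[(i, j)] = b                         # penalty for selecting a non-adjacent pair
    return Q

def qubo_energy(Q, x):
    """Energy of the binary assignment x under the coefficients Q."""
    return sum(w * x[i] * x[j] for (i, j), w in Q.items())

def brute_force_minimum(Q, n):
    """Exhaustive minimizer over all 2**n assignments; a classical stand-in for an annealer."""
    return min(product((0, 1), repeat=n), key=lambda x: qubo_energy(Q, x))
```

Because the coefficients of several independent problems share no variables, their dictionaries can be merged after relabeling the vertices, which is the property that makes it possible to place more than one problem on the chip in a single annealing cycle.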

17.
Sci Rep ; 12(1): 8539, 2022 May 20.
Article in English | MEDLINE | ID: mdl-35595786

ABSTRACT

Quantum annealers manufactured by D-Wave Systems, Inc., are computational devices capable of finding high-quality heuristic solutions of NP-hard problems. In this contribution, we explore the potential and effectiveness of such quantum annealers for computing Boolean tensor networks. Tensors offer a natural way to model high-dimensional data commonplace in many scientific fields, and representing a binary tensor as a Boolean tensor network is the task of expressing a tensor containing categorical (i.e., [Formula: see text]) values as a product of low dimensional binary tensors. A Boolean tensor network is computed by Boolean tensor decomposition, and it is usually not exact. The aim of such decomposition is to minimize the given distance measure between the high-dimensional input tensor and the product of lower-dimensional (usually three-dimensional) tensors and matrices representing the tensor network. In this paper, we introduce and analyze three general algorithms for Boolean tensor networks: Tucker, Tensor Train, and Hierarchical Tucker networks. The computation of a Boolean tensor network is reduced to a sequence of Boolean matrix factorizations, which we show can be expressed as a quadratic unconstrained binary optimization problem suitable for solving on a quantum annealer. By using a novel method we introduce called parallel quantum annealing, we demonstrate that Boolean tensors with up to millions of elements can be decomposed efficiently using a D-Wave 2000Q quantum annealer.

18.
Genes (Basel) ; 13(1)2022 01 06.
Article in English | MEDLINE | ID: mdl-35052450

ABSTRACT

Polygenic risk scores are a popular means to predict the disease risk or disease susceptibility of an individual based on their genotype information. When adding other important epidemiological covariates such as age or sex, we speak of an integrated risk model. Methodological advances for fitting more accurate integrated risk models are of immediate importance to improve the precision of risk prediction, thereby potentially identifying patients at high risk early on when they are still able to benefit from preventive steps/interventions targeted at increasing their odds of survival, or at reducing their chance of getting a disease in the first place. This article proposes a smoothed version of the "Lassosum" penalty used to fit polygenic risk scores and integrated risk models using either summary statistics or raw data. The smoothing allows one to obtain explicit gradients everywhere for efficient minimization of the Lassosum objective function while guaranteeing bounds on the accuracy of the fit. An experimental section on both Alzheimer's disease and COPD (chronic obstructive pulmonary disease) demonstrates the increased accuracy of the proposed smoothed Lassosum penalty compared to the original Lassosum algorithm (for the datasets under consideration), allowing it to draw level with state-of-the-art methodology such as LDpred2 when evaluated via the AUC (area under the ROC curve) metric.
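
The abstract does not state which smoothing is used. As an illustration of how a smoothed $L_1$ penalty can have a gradient everywhere together with a bound on the approximation error, the sketch below uses a Huber-type smoothing of the absolute value with smoothing parameter `mu`. This choice and the function names are assumptions, not necessarily the article's construction.

```python
import numpy as np


def smoothed_abs(x, mu):
    """Huber-type smoothing of |x|: quadratic on [-mu, mu], linear outside."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) <= mu, x**2 / (2 * mu), np.abs(x) - mu / 2)


def smoothed_abs_grad(x, mu):
    """Gradient of smoothed_abs: x / mu inside [-mu, mu], sign(x) outside."""
    x = np.asarray(x, dtype=float)
    return np.clip(x / mu, -1.0, 1.0)


mu = 0.5
xs = np.linspace(-2.0, 2.0, 41)

# Approximation gap |x| - smoothed_abs(x); it lies in [0, mu / 2] for every x.
gap = np.abs(xs) - smoothed_abs(xs, mu)
```

For this particular smoothing the gap never exceeds `mu / 2`, so shrinking `mu` tightens the approximation of the original penalty while the gradient stays defined at zero.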


Subject(s)
Algorithms , Alzheimer Disease/genetics , Genetic Predisposition to Disease , Models, Genetic , Multifactorial Inheritance , Polymorphism, Single Nucleotide , Pulmonary Disease, Chronic Obstructive/genetics , Aged , Aged, 80 and over , Alzheimer Disease/pathology , Case-Control Studies , Female , Genome-Wide Association Study , Humans , Middle Aged , Pulmonary Disease, Chronic Obstructive/pathology
19.
PLoS One ; 17(5): e0266752, 2022.
Article in English | MEDLINE | ID: mdl-35544468

ABSTRACT

To increase power and minimize bias in statistical analyses, quantitative outcomes are often adjusted for precision and confounding variables using standard regression approaches. The outcome is modeled as a linear function of the precision variables and confounders; however, for many complex phenotypes, the assumptions of the linear regression models are not always met. As an alternative, we used neural networks for the modeling of complex phenotypes and covariate adjustments. We compared the prediction accuracy of the neural network models to that of classical approaches based on linear regression. Using data from the UK Biobank, COPDGene study, and Childhood Asthma Management Program (CAMP), we examined the features of neural networks in this context and compared them with traditional regression approaches for prediction of three outcomes: forced expiratory volume in one second (FEV1), age at smoking cessation, and log transformation of age at smoking cessation (due to age at smoking cessation being right-skewed). We used mean squared error to compare neural network and regression models, and found the models performed similarly unless the observed distribution of the phenotype was skewed, in which case the neural network had smaller mean squared error. Our results suggest neural network models have an advantage over standard regression approaches when the phenotypic distribution is skewed. However, when the distribution is not skewed, the approaches performed similarly. Our findings are relevant to studies that analyze phenotypes that are skewed by nature or where the phenotype of interest is skewed as a result of the ascertainment condition.
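
The classical baseline described above, a linear covariate adjustment evaluated by mean squared error, can be sketched as follows. The data are synthetic and generated purely for illustration, the function names are hypothetical, and the neural network counterpart is not shown.

```python
import numpy as np


def adjust_for_covariates(outcome, covariates):
    """Fit outcome ~ intercept + covariates by least squares; return residuals and fitted values."""
    design = np.column_stack([np.ones(len(outcome)), covariates])
    coefficients, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    fitted = design @ coefficients
    return outcome - fitted, fitted


def mean_squared_error(observed, predicted):
    """Average squared difference between observed and predicted values."""
    return float(np.mean((observed - predicted) ** 2))


# Synthetic example: two covariates and a linear outcome with small Gaussian noise.
rng = np.random.default_rng(0)
covariates = rng.normal(size=(200, 2))
outcome = 1.0 + 2.0 * covariates[:, 0] - covariates[:, 1] + rng.normal(scale=0.1, size=200)

residuals, fitted = adjust_for_covariates(outcome, covariates)
mse = mean_squared_error(outcome, fitted)
```

The residuals play the role of the adjusted outcome. Any competing model can be compared against this baseline by passing its predictions to the same `mean_squared_error` function.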


Subject(s)
Neural Networks, Computer , Smoking , Forced Expiratory Volume/genetics , Phenotype , Spirometry
20.
bioRxiv ; 2020 Nov 20.
Article in English | MEDLINE | ID: mdl-32637949

ABSTRACT

Over 10,000 viral genome sequences of the SARS-CoV-2 virus have been made readily available during the ongoing coronavirus pandemic since the initial genome sequence of the virus was released on the open access Virological website ( http://virological.org/ ) early on January 11. We utilize the published data on the single-stranded RNAs of 11,132 SARS-CoV-2 patients in the GISAID (Elbe and Buckland-Merrett, 2017; Shu and McCauley, 2017) database, which contains fully or partially sequenced SARS-CoV-2 samples from laboratories around the world. Among many important research questions which are currently being investigated, one aspect pertains to the genetic characterization/classification of the virus. We analyze data on the nucleotide sequencing of the virus and geographic information of a subset of 7,640 SARS-CoV-2 patients without missing entries that are available in the GISAID database. Instead of modelling the mutation rate, applying phylogenetic tree approaches, etc., we here utilize a model-free clustering approach that compares the viruses at a genome-wide level. We apply principal component analysis to a similarity matrix that compares all pairs of these SARS-CoV-2 nucleotide sequences at all loci simultaneously, using the Jaccard index (Jaccard, 1901; Tan et al., 2005; Prokopenko et al., 2016; Schlauch et al., 2017). Our analysis of the SARS-CoV-2 genome data illustrates the geographic and chronological progression of the virus, starting from the first cases that were observed in China to the current wave of cases in Europe and North America. This is in line with a phylogenetic analysis which we use to contrast our results. We also observe that, based on their sequence data, the SARS-CoV-2 viruses cluster in distinct genetic subgroups. It is the subject of ongoing research to examine whether the genetic subgroup could be related to disease outcome and its potential implications for vaccine development.
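
A minimal sketch of the pipeline described above, a pairwise Jaccard similarity matrix followed by principal component analysis, is given here. It assumes that each sequence has already been encoded as a binary vector (for instance, 1 where a locus differs from a reference sequence); that encoding, the function names and the toy data are assumptions for illustration.

```python
import numpy as np


def jaccard_matrix(x):
    """Pairwise Jaccard index between the rows of a binary matrix."""
    x = np.asarray(x, dtype=int)
    intersection = x @ x.T
    counts = x.sum(axis=1)
    union = counts[:, None] + counts[None, :] - intersection
    # Two all-zero rows have an empty union; define their similarity as 1.
    similarity = np.ones(union.shape, dtype=float)
    np.divide(intersection, union, out=similarity, where=union > 0)
    return similarity


def principal_components(similarity, k=2):
    """Scores on the first k principal components of the column-centered similarity matrix."""
    centered = similarity - similarity.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :k] * s[:k]


# Toy data: four sequences, four loci (illustrative values only).
x = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])
sim = jaccard_matrix(x)
scores = principal_components(sim, k=2)
```

Plotting the rows of `scores` against each other gives the kind of scatter plot in which clusters of genetically similar sequences become visible. Identical sequences, such as the first two rows here, map to the same point.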
