ABSTRACT
Genome-wide association study (GWAS) methods have identified individual single-nucleotide polymorphisms (SNPs) significantly associated with specific phenotypes. Nonetheless, many complex diseases are polygenic and are controlled by multiple genetic variants that are usually non-linearly dependent. These variants individually have weak marginal effects and remain undetected in GWAS analysis. Kernel-based tests (KBT), which evaluate the joint effect of a group of genetic variants, are therefore critical for complex disease analysis. However, the choice of kernel function in KBT can significantly influence type I error control and power, and selecting the optimal kernel remains a statistically challenging task. The few existing methods suffer from inflated type I errors, limited scalability, inferior power or ambiguous conclusions. Here, we present a new Bayesian framework, BayesKAT (https://github.com/wangjr03/BayesKAT), which overcomes these kernel specification issues by selecting the optimal composite kernel adaptively from the data while testing genetic associations simultaneously. Furthermore, BayesKAT implements a scalable computational strategy to boost its applicability, especially for high-dimensional cases where other methods become less effective. Based on a series of performance comparisons using both simulated and real large-scale genetics data, BayesKAT outperforms the available methods in detecting complex group-level associations and controlling type I errors simultaneously. Applied to a variety of groups of functionally related genetic variants based on biological pathways, co-expression gene modules and protein complexes, BayesKAT deciphers the complex genetic basis and provides mechanistic insights into human diseases.
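To make the idea of a composite kernel concrete, the following is a minimal sketch, not BayesKAT's implementation. It assumes a composite kernel formed as a convex combination of a linear and a Gaussian kernel over genotype rows, and a quadratic-form statistic Q = r'Kr computed from phenotype residuals r. All function names, the bandwidth and the fixed weights are illustrative assumptions; the abstract describes choosing the weights adaptively from the data, which this sketch does not do.

```python
import math


def linear_kernel(G):
    """Linear kernel: K[i][j] is the inner product of genotype rows i and j."""
    return [[sum(a * b for a, b in zip(gi, gj)) for gj in G] for gi in G]


def gaussian_kernel(G, bandwidth=1.0):
    """Gaussian kernel: K[i][j] = exp(-||g_i - g_j||^2 / (2 * bandwidth^2))."""
    return [
        [
            math.exp(-sum((a - b) ** 2 for a, b in zip(gi, gj)) / (2.0 * bandwidth ** 2))
            for gj in G
        ]
        for gi in G
    ]


def composite_kernel(kernels, weights):
    """Convex combination of kernel matrices (weights >= 0 and summing to 1)."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    n = len(kernels[0])
    return [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


def kernel_score_statistic(K, residuals):
    """Quadratic form Q = r' K r used by kernel association tests."""
    n = len(residuals)
    return sum(residuals[i] * K[i][j] * residuals[j] for i in range(n) for j in range(n))


# Toy data (hypothetical): 3 individuals x 3 SNPs, coded as allele counts 0/1/2.
genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1]]
K_lin = linear_kernel(genotypes)
K_gau = gaussian_kernel(genotypes, bandwidth=2.0)
K_mix = composite_kernel([K_lin, K_gau], [0.5, 0.5])  # fixed weights, for illustration only
Q = kernel_score_statistic(K_mix, [1.0, -1.0, 0.0])
```

Different weightings give different values of Q for the same data, which is why the choice of kernel affects both power and type I error control.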
Subjects
Bayes Theorem, Genome-Wide Association Study, Single-Nucleotide Polymorphism, Humans, Genome-Wide Association Study/methods, Genetic Predisposition to Disease, Algorithms, Software, Computational Biology/methods, Genetic Association Studies/methods
ABSTRACT
Many complex diseases exhibit pronounced sex differences that can affect both the initial risk of developing the disease, as well as clinical disease symptoms, molecular manifestations, disease progression, and the risk of developing comorbidities. Despite this, computational studies of molecular data for complex diseases often treat sex as a confounding variable, aiming to filter out sex-specific effects rather than attempting to interpret them. A more systematic, in-depth exploration of sex-specific disease mechanisms could significantly improve our understanding of pathological and protective processes with sex-dependent profiles. This survey discusses dedicated bioinformatics approaches for the study of molecular sex differences in complex diseases. It highlights that, beyond classical statistical methods, approaches are needed that integrate prior knowledge of relevant hormone signaling interactions, gene regulatory networks, and sex linkage of genes to provide a mechanistic interpretation of sex-dependent alterations in disease. The review examines and compares the advantages, pitfalls and limitations of various conventional statistical and systems-level mechanistic analyses for this purpose, including tailored pathway and network analysis techniques. Overall, this survey highlights the potential of specialized bioinformatics techniques to systematically investigate molecular sex differences in complex diseases, to inform biomarker signature modeling, and to guide more personalized treatment approaches.
Subjects
Computational Biology, Sex Characteristics, Humans, Computational Biology/methods, Male, Female, Gene Regulatory Networks
ABSTRACT
Heritability is a fundamental concept in genetic studies, measuring the genetic contribution to complex traits and bringing insights about disease mechanisms. The advance of high-throughput technologies has provided many resources for heritability estimation. Linkage disequilibrium (LD) score regression (LDSC) estimates both heritability and confounding biases, such as cryptic relatedness and population stratification, among single-nucleotide polymorphisms (SNPs) by using only summary statistics released from genome-wide association studies. However, only partial information in the LD matrix is utilized in LDSC, leading to loss in precision. In this study, we propose LD eigenvalue regression (LDER), an extension of LDSC, by making full use of the LD information. Compared to state-of-the-art heritability estimating methods, LDER provides more accurate estimates of SNP heritability and better distinguishes the inflation caused by polygenicity and confounding effects. We demonstrate the advantages of LDER both theoretically and with extensive simulations. We applied LDER to 814 complex traits from UK Biobank, and LDER identified 363 significantly heritable phenotypes, among which 97 were not identified by LDSC.
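As an illustration of the LDSC idea that LDER extends, here is a minimal sketch under stated assumptions. It uses the standard LD score regression relation E[chi2_j] = 1 + N*a + (N*h2/M)*l_j, where l_j is the LD score of SNP j, N the GWAS sample size, M the number of SNPs, h2 the SNP heritability and a the confounding term. The sketch fits this line by unweighted ordinary least squares on noise-free toy data; production LDSC uses weighted regression with block-jackknife standard errors, and LDER itself (which uses the eigenvalues of the LD matrix) is not shown. Function names are hypothetical.

```python
def ols_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def ld_score_regression(chi2, ld_scores, n_samples, n_snps):
    """Return (h2, intercept) from regressing chi-square statistics on LD scores.

    The slope estimates N * h2 / M, so h2 = slope * M / N.
    The intercept estimates 1 + N * a (inflation from confounding).
    """
    intercept, slope = ols_line(ld_scores, chi2)
    return slope * n_snps / n_samples, intercept


# Toy, noise-free summary statistics (hypothetical): true h2 = 0.4, intercept = 1.1.
ld_scores = [1.0, 2.0, 3.0, 4.0]
chi2 = [1.1 + 100.0 * l for l in ld_scores]  # slope 100 = N * h2 / M = 1000 * 0.4 / 4
h2, intercept = ld_score_regression(chi2, ld_scores, n_samples=1000, n_snps=4)
```

An intercept above 1 points to confounding, while a positive slope points to polygenic signal; this is the distinction between inflation sources that the abstract refers to.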
Subjects
Genome-Wide Association Study, Single-Nucleotide Polymorphism, Genome-Wide Association Study/methods, Humans, Linkage Disequilibrium, Genetic Models, Multifactorial Inheritance/genetics, Phenotype, Single-Nucleotide Polymorphism/genetics
ABSTRACT
Complex traits are influenced by genetic risk factors, lifestyle, and environmental variables, so-called exposures. Some exposures, e.g., smoking or lipid levels, have common genetic modifiers identified in genome-wide association studies. Because direct measurements are often infeasible, exposure polygenic risk scores (ExPRSs) offer an alternative way to study the influence of exposures on various phenotypes. Here, we collected publicly available summary statistics for 28 exposures and applied four common PRS methods to generate ExPRSs in two large biobanks: the Michigan Genomics Initiative and the UK Biobank. We established ExPRSs for 27 exposures and demonstrated their applicability in phenome-wide association studies and as predictors for common chronic conditions. In particular, adding multiple ExPRSs improved prediction for several chronic conditions compared with models that included only traditional, disease-focused PRSs. To facilitate follow-up studies, we share all ExPRS constructs and generated results via an online repository called ExPRSweb.
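The scoring step common to PRS methods can be sketched as a weighted sum of allele dosages. This is an assumption-labelled illustration only: the effect sizes, individuals and function name below are hypothetical, and the four PRS methods mentioned in the abstract differ mainly in how the per-variant weights are derived from summary statistics, which is not shown here.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages: score = sum_j weight_j * dosage_j."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must cover the same variants")
    return sum(d * w for d, w in zip(dosages, weights))


# Hypothetical per-variant weights (e.g. GWAS effect sizes for an exposure).
effect_sizes = [0.12, -0.05, 0.30]

# Hypothetical genotypes: effect-allele counts (0, 1 or 2) per variant.
cohort = {
    "person_a": [0, 1, 2],
    "person_b": [2, 2, 0],
}

scores = {name: polygenic_score(g, effect_sizes) for name, g in cohort.items()}
```

In a prediction model, several such exposure scores would enter as covariates alongside a disease-focused PRS.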
Subjects
Genetic Predisposition to Disease, Genome-Wide Association Study, Humans, Lipids, Multifactorial Inheritance/genetics, Risk Factors
ABSTRACT
Currently, there are no generally accepted strategies for evaluating computational models for microRNA-disease associations (MDAs). Though K-fold cross validation and case studies seem to be must-have procedures, the value of K, the evaluation metrics, the choice of query diseases and the inclusion of other procedures (such as parameter sensitivity tests, ablation studies and computational cost reports) are all determined on a case-by-case basis and depend on the researchers' choices. In the current review, we include a comprehensive analysis of how 29 state-of-the-art models for predicting MDAs were evaluated. Based on the analytical results, we recommend a feasible evaluation workflow that would suit any future model to facilitate fair and systematic assessment of predictive performance.
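For readers unfamiliar with the procedure, K-fold cross validation can be sketched as follows. This is a generic illustration, not the workflow recommended by the review; the function name, the seed and the choice of K = 5 are assumptions. Each item (for example, a known miRNA-disease pair) is held out exactly once while the model is trained on the remaining folds.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Split item indices into k folds; return a list of (train, test) index lists."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, test))
    return splits


# Example: 10 known associations evaluated with 5-fold cross validation.
splits = k_fold_indices(10, 5, seed=42)
```

Reporting the seed and K alongside the metric is one simple way to make evaluations comparable across studies.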
Subjects
MicroRNAs, MicroRNAs/genetics, Computational Biology/methods, Algorithms, Computer Simulation
ABSTRACT
MicroRNAs (miRNAs) are gene regulators involved in the pathogenesis of complex diseases such as cancers, and thus serve as potential diagnostic markers and therapeutic targets. The prerequisite for designing effective miRNA therapies is accurate discovery of miRNA-disease associations (MDAs), which has attracted substantial research interest during the last 15 years, as reflected by more than 55 000 related entries available on PubMed. Abundant experimental data gathered from the wealth of literature could effectively support the development of computational models for predicting novel associations. In 2017, Chen et al. published the first-ever comprehensive review on MDA prediction, presenting various relevant databases, 20 representative computational models, and suggestions for building more powerful ones. In the current review, as the continuation of the previous study, we revisit miRNA biogenesis, detection techniques and functions; summarize recent experimental findings related to common miRNA-associated diseases; introduce recent updates of miRNA-relevant databases and novel database releases since 2017; present mainstream webservers and new webserver releases since 2017; and finally elaborate on how fusion of diverse data sources has contributed to accurate MDA prediction.
Subjects
MicroRNAs, Neoplasms, Humans, MicroRNAs/genetics, Genetic Databases, Neoplasms/genetics, PubMed, Computational Biology/methods, Genetic Predisposition to Disease, Algorithms
ABSTRACT
Since the problem was first proposed in the late 2000s, microRNA-disease association (MDA) prediction has been implemented based on the data fusion paradigm. Integrating diverse data sources offers a more comprehensive research perspective, but it also challenges algorithm design to generate accurate, concise and consistent representations of the fused data. After more than a decade of research progress, a relatively simple algorithm like the score function or a single computation layer may no longer be sufficient for further improving predictive performance. Advanced model design has become more frequent in recent years, particularly in the form of reasonably combining multiple algorithms, a process known as model fusion. In the current review, we present 29 state-of-the-art models and introduce a taxonomy of computational models for MDA prediction based on model fusion and non-fusion. The new taxonomy exhibits notable changes in the algorithmic architecture of models, compared with that of earlier ones in the 2017 review by Chen et al. Moreover, we discuss the progress that has been made towards overcoming the obstacles to effective MDA prediction since 2017 and elaborate on how future models can be designed according to a set of new schemas. Lastly, we analyse the strengths and weaknesses of each model category in the proposed taxonomy and propose future research directions from diverse perspectives for enhancing model performance.
Subjects
MicroRNAs, Algorithms, Computational Biology, Computer Simulation, MicroRNAs/genetics
ABSTRACT
Circular RNAs (circRNAs) are a category of newly discovered competing endogenous non-coding RNAs that have been shown to be implicated in many complex human diseases. A large number of circRNAs have been confirmed to be involved in cancer progression and are expected to become promising biomarkers for tumor diagnosis and targeted therapy. Deciphering the underlying relationships between circRNAs and diseases may provide new insights into the pathogenesis of complex diseases and further characterize the biological functions of circRNAs. As traditional experimental methods are usually time-consuming and laborious, computational models have made significant progress in systematically exploring potential circRNA-disease associations, which not only creates new opportunities for investigating pathogenic mechanisms at the level of circRNAs, but also helps to significantly improve the efficiency of clinical trials. In this review, we first summarize the functions and characteristics of circRNAs and introduce some representative circRNAs related to tumorigenesis. Then, we survey the available databases and tools dedicated to circRNA and disease studies. Next, we present a comprehensive review of computational methods for predicting circRNA-disease associations and classify them into five categories: network propagation-based, path-based, matrix factorization-based, deep learning-based and other machine learning methods. Finally, we discuss the challenges and future research directions in this field.
Subjects
Neoplasms, Circular RNA, Algorithms, Computational Biology/methods, Humans, Machine Learning, Neoplasms/genetics
ABSTRACT
OBJECTIVES: Systemic sclerosis (SSc) is a heterogeneous disease, complicating its management. Its complexity and the insufficiency of clinical manifestations alone to delineate homogeneous patient groups further challenge this task. However, autoantibodies could serve as relevant markers for the pathophysiological mechanisms driving the disease. Identifying specific immunological mechanisms based on patients' serological statuses might facilitate a deeper understanding of the diversity of the disease. METHODS: A cohort of 206 patients with SSc enrolled in the PRECISESADS cross-sectional study was examined. Patients were stratified based on their anti-centromere (ACA) and anti-SCL70 (SCL70) antibody statuses. Comprehensive omics analyses including transcriptomic, flow cytometric, cytokine and metabolomic data were analysed to characterise the differences between these patient groups. RESULTS: Patients with SCL70 antibodies showed severe clinical features such as diffuse cutaneous sclerosis and pulmonary fibrosis and were biologically distinguished by unique transcriptomic profiles. They exhibit a pro-inflammatory and fibrotic signature associated with impaired tissue remodelling and increased carnitine metabolism. Conversely, ACA-positive patients exhibited an immunomodulation and tissue homeostasis signature and increased phospholipid metabolism. CONCLUSIONS: Patients with SSc display varying biological profiles based on their serological status. The findings highlight the potential utility of serological status as a discriminating factor in disease severity and suggest its relevance in tailoring treatment strategies and future research directions.
ABSTRACT
OBJECTIVES: Hypocomplementaemia is common in patients with IgG4-related disease (IgG4-RD). We aimed to determine the IgG4-RD features associated with hypocomplementaemia and investigate mechanisms of complement activation in this disease. METHODS: We performed a single-centre cross-sectional study of 279 patients who fulfilled the IgG4-RD classification criteria, using unadjusted and multivariable-adjusted logistic regression to identify factors associated with hypocomplementaemia. RESULTS: Hypocomplementaemia was observed in 90 (32%) patients. In the unadjusted model, the number of organs involved (OR 1.42, 95% CI 1.23 to 1.63) and involvement of the lymph nodes (OR 3.87, 95% CI 2.19 to 6.86), lungs (OR 3.81, 95% CI 2.10 to 6.89), pancreas (OR 1.66, 95% CI 1.001 to 2.76), liver (OR 2.73, 95% CI 1.17 to 6.36) and kidneys (OR 2.48, 95% CI 1.47 to 4.18) were each associated with hypocomplementaemia. After adjusting for age, sex and number of organs involved, only lymph node (OR 2.59, 95% CI 1.36 to 4.91) and lung (OR 2.56, 95% CI 1.35 to 4.89) involvement remained associated with hypocomplementaemia while the association with renal involvement was attenuated (OR 1.6, 95% CI 0.92 to 2.98). Fibrotic disease manifestations (OR 0.43, 95% CI 0.21 to 0.87) and lacrimal gland involvement (OR 0.53, 95% CI 0.28 to 0.999) were inversely associated with hypocomplementaemia in the adjusted analysis. Hypocomplementaemia was associated with higher concentrations of all IgG subclasses and IgE (all p<0.05). After adjusting for serum IgG1 and IgG3, only IgG1 but not IgG4 remained strongly associated with hypocomplementaemia. CONCLUSIONS: Hypocomplementaemia in IgG4-RD is not unique to patients with renal involvement and may reflect the extent of disease. IgG1 independently correlates with hypocomplementaemia in IgG4-RD, but IgG4 does not. Complement activation is likely involved in IgG4-RD pathophysiology.
ABSTRACT
A hallmark of rheumatoid arthritis (RA) is the increased levels of autoantibodies preceding the onset and contributing to the classification of the disease. These autoantibodies, mainly anti-citrullinated protein antibody (ACPA) and rheumatoid factor, have been assumed to be pathogenic and many attempts have been made to link them to the development of bone erosion, pain and arthritis. We and others have recently discovered that most cloned ACPA protect against experimental arthritis in the mouse. In addition, we have identified suppressor B cells in healthy individuals, selected in response to collagen type II, and these cells decrease in numbers in RA. These findings provide a new angle on how to explain the development of RA and maybe also other complex autoimmune diseases preceded by an increased autoimmune response.
Subjects
Rheumatoid Arthritis, Autoimmune Diseases, Animals, Mice, Autoimmunity, Autoantibodies, Anti-Citrullinated Protein Antibodies
ABSTRACT
BACKGROUND: Patient heterogeneity poses significant challenges for managing individuals and designing clinical trials, especially in complex diseases. Existing classifications rely on outcome-predicting scores, potentially overlooking crucial elements contributing to heterogeneity without necessarily impacting prognosis. METHODS: To address patient heterogeneity, we developed ClustALL, a computational pipeline that simultaneously faces diverse clinical data challenges like mixed types, missing values, and collinearity. ClustALL enables the unsupervised identification of patient stratifications while filtering for stratifications that are robust against minor variations in the population (population-based) and against limited adjustments in the algorithm's parameters (parameter-based). RESULTS: Applied to a European cohort of patients with acutely decompensated cirrhosis (n = 766), ClustALL identified five robust stratifications, using only data at hospital admission. All stratifications included markers of impaired liver function and number of organ dysfunction or failure, and most included precipitating events. When focusing on one of these stratifications, patients were categorized into three clusters characterized by typical clinical features; notably, the 3-cluster stratification showed a prognostic value. Re-assessment of patient stratification during follow-up delineated patients' outcomes, with further improvement of the prognostic value of the stratification. We validated these findings in an independent prospective multicentre cohort of patients from Latin America (n = 580). CONCLUSIONS: By applying ClustALL to patients with acutely decompensated cirrhosis, we identified three patient clusters. Following these clusters over time offers insights that could guide future clinical trial design. ClustALL is a novel and robust stratification method capable of addressing the multiple challenges of patient stratification in most complex diseases.
Subjects
Liver Cirrhosis, Humans, Male, Female, Cluster Analysis, Middle Aged, Prognosis, Acute Disease, Algorithms, Aged, Cohort Studies
ABSTRACT
Single-cell multimodal omics (scMulti-omics) technologies have made it possible to trace cellular lineages during differentiation and to identify new cell types in heterogeneous cell populations. The derived information is especially promising for computing cell-type-specific biological networks encoded in complex diseases and improving our understanding of the underlying gene regulatory mechanisms. The integration of these networks could, therefore, give rise to a heterogeneous regulatory landscape (HRL) in support of disease diagnosis and drug therapeutics. In this review, we provide an overview of this field and pay particular attention to how diverse biological networks can be inferred in a specific cell type based on integrative methods. Then, we discuss how HRL can advance our understanding of regulatory mechanisms underlying complex diseases and aid in the prediction of prognosis and therapeutic responses. Finally, we outline challenges and future trends that will be central to bringing the field of HRL in complex diseases forward.
Subjects
Computational Biology/methods, Disease/genetics, Gene Regulatory Networks, Single-Cell Analysis/methods, Animals, Humans
ABSTRACT
Complex diseases are caused by a combination of genetic, lifestyle, and environmental factors and comprise common noncommunicable diseases, including allergies, cardiovascular disease, and psychiatric and metabolic disorders. More than 25% of Europeans suffer from a complex disease, and together these diseases account for 70% of all deaths. The use of genomic, molecular, or imaging data to develop accurate diagnostic tools for treatment recommendations and preventive strategies, and for disease prognosis and prediction, is an important step toward precision medicine. However, for complex diseases, precision medicine is associated with several challenges. There is significant heterogeneity between patients with a specific disease, both with regard to symptoms and underlying causal mechanisms, and the number of underlying genetic and nongenetic risk factors is often high. Here, we summarize precision medicine approaches for complex diseases and highlight the current breakthroughs as well as the challenges. We conclude that genomic-based precision medicine has been used mainly for patients with highly penetrant monogenic disease forms, such as cardiomyopathies. For most complex diseases, including psychiatric disorders and allergies, available polygenic risk scores are more probabilistic than deterministic and have not yet been validated for clinical utility. Nevertheless, subclassifying patients with a specific disease into discrete homogeneous subtypes based on molecular or phenotypic data is a promising strategy for improving diagnosis, prediction, treatment, prevention, and prognosis. The availability of high-throughput molecular technologies, together with large collections of health data and novel data-driven approaches, offers promise toward improved individual health through precision medicine.
Subjects
Mental Disorders, Precision Medicine, Humans, Precision Medicine/methods, Genomics/methods, Risk Factors
ABSTRACT
Immune deposits/complexes are detected in a multitude of tissues in autoimmune disorders, but no organ has attracted as much attention as the kidney. Several kidney diseases are characterised by the presence of specific configurations of such deposits, and many of them are under a 'shared care' between rheumatologists and nephrologists. This review focuses on five different diseases commonly encountered in rheumatological and nephrological practice, namely IgA vasculitis, lupus nephritis, cryoglobulinaemia, anti-glomerular basement membrane disease and anti-neutrophil cytoplasm-antibody glomerulonephritis. They differ in disease aetiopathogenesis, but also the potential speed of kidney function decline, the responsiveness to immunosuppression/immunomodulation and the deposition of immune deposits/complexes. To date, it remains unclear if deposits are causing a specific disease or aim to abrogate inflammatory cascades responsible for tissue damage, such as neutrophil extracellular traps or the complement system. In principle, immunosuppressive therapies have not been developed to tackle immune deposits/complexes, and repeated kidney biopsy studies found persistence of deposits despite reduction of active inflammation, again highlighting the uncertainty about their involvement in tissue damage. In these studies, a progression of active lesions to chronic changes such as glomerulosclerosis was frequently reported. Novel therapeutic approaches aim to mitigate these changes more efficiently and rapidly. Several new agents, such as avacopan, an oral C5aR1 inhibitor, or imlifidase, that dissolves IgG within minutes, are more specifically reducing inflammatory cascades in the kidney and repeat tissue sampling might help to understand their impact on immune cell deposition and finally kidney function recovery and potential impact of immune complexes/deposits.
Subjects
Glomerulonephritis, Kidney Diseases, Lupus Nephritis, Humans, Kidney/pathology, Kidney Diseases/diagnosis, Kidney Diseases/etiology, Lupus Nephritis/pathology, Glomerulonephritis/pathology, Antigen-Antibody Complex
ABSTRACT
OBJECTIVE: Calcium pyrophosphate deposition (CPPD) disease is prevalent and has diverse presentations, but there are no validated classification criteria for this symptomatic arthritis. The American College of Rheumatology (ACR) and EULAR have developed the first-ever validated classification criteria for symptomatic CPPD disease. METHODS: Supported by the ACR and EULAR, a multinational group of investigators followed established methodology to develop these disease classification criteria. The group generated lists of candidate items and refined their definitions, collected de-identified patient profiles, evaluated strengths of associations between candidate items and CPPD disease, developed a classification criteria framework, and used multi-criterion decision analysis to define criteria weights and a classification threshold score. The criteria were validated in an independent cohort. RESULTS: Among patients with joint pain, swelling, or tenderness (entry criterion) whose symptoms are not fully explained by an alternative disease (exclusion criterion), the presence of crowned dens syndrome or calcium pyrophosphate crystals in synovial fluid is sufficient to classify a patient as having CPPD disease. In the absence of these findings, a score >56 points using weighted criteria, comprising clinical features, associated metabolic disorders, and results of laboratory and imaging investigations, can be used to classify a patient as having CPPD disease. These criteria had a sensitivity of 92.2% and specificity of 87.9% in the derivation cohort (190 CPPD cases, 148 mimickers), whereas sensitivity was 99.2% and specificity was 92.5% in the validation cohort (251 CPPD cases, 162 mimickers). CONCLUSION: The 2023 ACR/EULAR CPPD disease classification criteria have excellent performance characteristics and will facilitate research in this field.
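The decision logic described in the RESULTS section can be sketched as a small function. This is an illustrative reading of the abstract only, not the published criteria: the function and parameter names are hypothetical, and the weighted score is taken as an input because the individual item weights are not given in the abstract.

```python
def classify_cppd(
    meets_entry_criterion,
    fully_explained_by_alternative,
    crowned_dens_syndrome,
    synovial_cpp_crystals,
    weighted_score,
):
    """Return True if a patient would be classified as having CPPD disease.

    Order of checks, following the abstract:
      1. entry criterion (joint pain, swelling, or tenderness) must be met;
      2. exclusion criterion (symptoms fully explained by another disease) must not apply;
      3. either sufficient criterion classifies the patient directly;
      4. otherwise the weighted score must exceed 56 points.
    """
    if not meets_entry_criterion or fully_explained_by_alternative:
        return False
    if crowned_dens_syndrome or synovial_cpp_crystals:
        return True
    return weighted_score > 56


# Hypothetical patients.
by_crystals = classify_cppd(True, False, False, True, weighted_score=10)
by_score = classify_cppd(True, False, False, False, weighted_score=57)
at_threshold = classify_cppd(True, False, False, False, weighted_score=56)
excluded = classify_cppd(True, True, True, True, weighted_score=80)
```

Note that the threshold is strict: a score of exactly 56 does not classify a patient as having CPPD disease.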
Subjects
Calcinosis, Chondrocalcinosis, Rheumatology, Humans, United States, Chondrocalcinosis/diagnostic imaging, Calcium Pyrophosphate, Syndrome
ABSTRACT
OBJECTIVE: Whereas genetic susceptibility for systemic lupus erythematosus (SLE) has been well explored, the triggers for clinical disease flares remain elusive. To investigate relationships between microbiota community resilience and disease activity, we performed the first longitudinal analyses of lupus gut-microbiota communities. METHODS: In an observational study, taxonomic analyses, including multivariate analysis of β-diversity, assessed time-dependent alterations in faecal communities from patients and healthy controls. From gut blooms, strains were isolated, with genomes and associated glycans analysed. RESULTS: Multivariate analyses documented that, unlike in healthy controls, significant temporal community-wide ecological microbiota instability was common in SLE patients, and transient intestinal growth spikes of several pathogenic species were documented. Expansions of only the anaerobic commensal Ruminococcus (Blautia) gnavus (RG) occurred at times of high disease activity, and were detected in almost half of patients during lupus nephritis (LN) disease flares. Whole genome sequence analysis of RG strains isolated during these flares documented 34 genes postulated to aid adaptation and expansion within a host with an inflammatory condition. Yet, the most specific feature of strains found during lupus flares was the common expression of a novel type of cell membrane-associated lipoglycan. These lipoglycans share conserved structural features documented by mass spectroscopy, and highly immunogenic repetitive antigenic determinants, recognised by high-level serum IgG2 antibodies that spontaneously arose concurrent with RG blooms and lupus flares. CONCLUSIONS: Our findings rationalise how blooms of the RG pathobiont may be common drivers of clinical flares of often remitting-relapsing lupus disease, and highlight the potential pathogenic properties of specific strains isolated from active LN patients.
Subjects
Gastrointestinal Microbiome, Systemic Lupus Erythematosus, Lupus Nephritis, Microbiota, Humans, Gastrointestinal Microbiome/genetics, Symptom Flare Up, Feces, Lupus Nephritis/genetics
ABSTRACT
Complex diseases are caused by a variety of factors, and their diagnosis, treatment and prognosis are usually difficult. Proteins play an indispensable role in living organisms and perform specific biological functions by interacting with other proteins or biomolecules. Because their dysfunction may lead to disease, mining disease-related biomarkers from protein-protein interaction networks is a natural approach. AUC, the area under the receiver operating characteristic (ROC) curve, is regarded as a gold standard for evaluating the effectiveness of a binary classifier because it measures the classification ability of an algorithm under arbitrary class distributions and misclassification costs. In this study, we propose a network-based multi-biomarker identification method by AUC optimization (NetAUC), which integrates gene expression and network information to identify biomarkers for complex disease analysis. The main purpose is to optimize two objectives simultaneously: maximizing AUC and minimizing the number of selected features. We have applied NetAUC to two types of disease analysis: (1) prognosis of breast cancer and (2) classification of similar diseases. The results show that NetAUC can identify a small panel of disease-related biomarkers with strong classification ability and functional interpretability.
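As background for the AUC objective, the following minimal sketch computes AUC through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. It illustrates the metric only and is not the NetAUC optimization procedure; the function name and the toy scores are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy classifier scores (hypothetical) for two negatives and two positives.
example_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

Because the value depends only on how samples are ranked, it does not change with the decision threshold or with the ratio of positive to negative samples.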
Subjects
Algorithms, Breast Neoplasms, Area Under Curve, Biomarkers, Breast Neoplasms/diagnosis, Breast Neoplasms/genetics, Female, Humans, ROC Curve
ABSTRACT
Genome-wide association studies have proven their ability to improve human health outcomes by identifying genotypes associated with phenotypes. Various works have attempted to predict the risk of diseases for individuals based on genotype data. This prediction can either be considered as an analysis model that can lead to a better understanding of gene functions that underlie human disease or as a black box in order to be used in decision support systems and in early disease detection. Deep learning techniques have gained more popularity recently. In this work, we propose a deep-learning framework for disease risk prediction. The proposed framework employs a multilayer perceptron (MLP) in order to predict individuals' disease status. The proposed framework was applied to the Wellcome Trust Case-Control Consortium (WTCCC), the UK National Blood Service (NBS) Control Group, and the 1958 British Birth Cohort (58C) datasets. The performance comparison of the proposed framework showed that the proposed approach outperformed the other methods in predicting disease risk, achieving an area under the curve (AUC) up to 0.94.
Subjects
Deep Learning, Humans, Genome-Wide Association Study, Computer Neural Networks, Genotype, Genomics
ABSTRACT
Though additive forms of heritability are primarily studied in genetics, nonlinear, non-additive gene-gene interactions, that is, epistasis, could explain a portion of the missing heritability in complex human diseases including cancer. In recent years, powerful computational methods have been introduced to understand multivariable genetic factors of these complex human diseases in extremely high-dimensional genome-wide data. In this study, we investigated the performance of three powerful methods, BOolean Operation-based Screening and Testing (BOOST), FastEpistasis, and Tree-based Epistasis Association Mapping (TEAM) to identify interacting genetic risk factors of colorectal cancer (CRC) for genome-wide association studies (GWAS). After quality-control based data preprocessing, we applied these three algorithms to a CRC GWAS data set, and selected the top-ranked 100 single-nucleotide polymorphism (SNP) pairs identified by each method (251 SNPs in total), among which 74 pairs were common between FastEpistasis and BOOST. The identified SNPs by BOOST, FastEpistasis, and TEAM mapped to 58, 57, and 62 genes, respectively. Some genes highlighted by our study, including MACF1, USP49, SMAD2, SMAD3, TGFBR1, and RHOA, have been detected in previous CRC-related research. We also identified some new genes with potential biological relevance to CRC such as CCDC32. Furthermore, we constructed the network of these top SNP pairs for three methods, and the patterns identified in the networks show that some SNPs including rs2412531, rs349699, and rs17142011 play a crucial role in the classification of disease status in our study.