ABSTRACT
Understanding the genetic background of complex diseases and disorders plays an essential role in realizing the promise of precision medicine. The evaluation of candidate genes, however, requires time-consuming and expensive experiments given the large number of possibilities. Thus, computational methods have seen increasing application in predicting gene-disease associations. We proposed a bioinformatics framework, Prioritization of Autism-genes using Network-based Deep-learning Approach (PANDA). Our approach aims to identify autism genes across the human genome based on patterns of gene-gene interactions and the topological similarity of genes in the interaction network. PANDA trains a graph deep learning classifier on the human molecular interaction network, then predicts and ranks the probability of autism association for every node (gene) in the network. PANDA achieved a high classification accuracy of 89%, outperforming three other commonly used machine learning algorithms. Moreover, the gene prioritization ranking produced by PANDA was evaluated and validated using an independent large-scale exome-sequencing study. The top 10% of PANDA-ranked genes were found to be significantly enriched for autism association.
Subject(s)
Autistic Disorder/genetics; Deep Learning; Gene Regulatory Networks; Autistic Disorder/pathology; Genetic Association Studies; Genome, Human; Humans
ABSTRACT
Links between environmental conditions (e.g., meteorological factors and air quality) and COVID-19 severity have been reported worldwide. However, existing data analysis frameworks are insufficient or inefficient for investigating the potential causality behind associations that involve multidimensional factors and complicated interrelationships. Thus, a causal inference framework based on the structural causal model and aided by machine learning methods was proposed and applied to examine the potential causal relationships between COVID-19 severity and 10 environmental factors (NO2, O3, PM2.5, PM10, SO2, CO, average air temperature, atmospheric pressure, relative humidity, and wind speed) in 166 Chinese cities. The cities were grouped into three clusters based on their socio-economic features. Time-series data from the cities in each cluster were analyzed in different pandemic phases. The robustness check refuted most of the estimated potential causal relationships (89 out of 90). Only one potential relationship, involving air temperature, passed the final test, with a causal effect of 0.041 under a specific cluster-phase condition. The results indicate that the environmental factors are unlikely to cause noticeable aggravation of the COVID-19 pandemic. This study also demonstrated the high value and potential of the proposed method for investigating causal problems with observational data in environmental and other fields.
Subject(s)
Air Pollution; COVID-19; Humans; Machine Learning; Pandemics; SARS-CoV-2
ABSTRACT
The heritability of complex diseases, including cancer, is often attributed to multiple interacting genetic alterations. Such a non-linear, non-additive gene-gene interaction effect, that is, epistasis, renders univariable analysis methods ineffective for genome-wide association studies. In recent years, network science has seen increasing application in modeling epistasis to characterize the complex relationships between a large number of genetic variations and the phenotypic outcome. In this study, by constructing a statistical epistasis network of colorectal cancer (CRC), we proposed using multiple network measures to prioritize genes that influence the disease risk of CRC through synergistic interaction effects. We computed and analyzed several global and local properties of the large CRC epistasis network. We utilized topological properties of network vertices, such as edge strength, vertex centrality, and occurrence in different graphlets, to identify genes that may be of potential biological relevance to CRC. We found 512 top-ranked single-nucleotide polymorphisms, among which COL22A1, RGS7, WWOX, and CELF2 were the four susceptibility genes prioritized by all described metrics as the most influential on CRC.
Subject(s)
Gene Regulatory Networks; Genetic Predisposition to Disease; Genome-Wide Association Study; Epistasis, Genetic; Humans; Machine Learning; Polymorphism, Single Nucleotide/genetics; Reproducibility of Results; Statistics as Topic
ABSTRACT
Covert timing channels are an important alternative for transmitting information in the Internet of Things (IoT). In covert timing channels, data are encoded in the inter-arrival times between consecutive packets by modifying the transmission time of legitimate traffic. Typically, the timing is modified by delaying the transmitted packets on the sender side. A key aspect of covert timing channels is finding the packet-delay threshold that can accurately distinguish covert traffic from legitimate traffic. Based on that threshold, we can assess the danger level of security threats or the quality of secretly transferred sensitive information. In this paper, we study the inter-arrival time behavior of covert timing channels in two different network configurations based on statistical metrics. In addition, we investigate the packet-delay threshold value. Our experiments show that the threshold is approximately equal to or greater than double the mean of legitimate inter-arrival times. In this case, covert timing channels become detectable as strong anomalies.
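As an illustration of the encoding and of the reported threshold, the following Python sketch delays packets that carry a 1 bit and checks whether the added delay reaches twice the mean legitimate inter-arrival time. The function names, the fixed sample gaps, the midpoint decoding rule, and the `factor` parameter are assumptions made for this example, not details taken from the paper.

```python
import statistics

def encode_bits(legit_gaps, bits, delay):
    """Sender side: add `delay` to the gap before each packet that carries a 1 bit."""
    return [gap + delay if bit else gap for gap, bit in zip(legit_gaps, bits)]

def decode_bits(gaps, legit_mean, delay):
    """Receiver side (assumed rule): read a gap above the midpoint as a 1 bit."""
    cut = legit_mean + delay / 2
    return [1 if gap > cut else 0 for gap in gaps]

def delay_reaches_threshold(delay, legit_gaps, factor=2.0):
    """True when the delay is at least `factor` times the mean legitimate gap."""
    return delay >= factor * statistics.mean(legit_gaps)

# Hypothetical legitimate inter-arrival times in milliseconds (their mean is 10 ms).
legit_gaps = [10, 12, 9, 11, 8, 10]
bits = [1, 0, 1, 1, 0, 0]
delay = 20  # twice the legitimate mean

covert_gaps = encode_bits(legit_gaps, bits, delay)
decoded = decode_bits(covert_gaps, statistics.mean(legit_gaps), delay)
print(covert_gaps, decoded, delay_reaches_threshold(delay, legit_gaps))
```

With a delay this large the receiver can separate the two symbols easily, but the same gap between delayed and undelayed packets is what would let a monitor flag the traffic as anomalous.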
ABSTRACT
The nonlinear interaction effect among multiple genetic factors, i.e., epistasis, has been recognized as a key component in understanding the underlying genetic basis of complex human diseases and phenotypic traits. Due to the statistical and computational complexity, most epistasis studies are limited to interactions of order two. We developed ViSEN to analyze and visualize both two-way and three-way epistatic interactions. ViSEN not only identifies strong interactions among pairs or trios of genetic attributes, but also provides a global interaction map that shows neighborhood and clustering structures. This visualized information could be very helpful for inferring the underlying genetic architecture of complex diseases and for generating plausible hypotheses for further biological validation. ViSEN is implemented in Java and freely available at https://sourceforge.net/projects/visen/.
Subject(s)
Computer Graphics; Epistasis, Genetic; Models, Statistical; Software; Humans; Phenotype; Programming Languages
ABSTRACT
This study introduces a novel approach to transport modelling by integrating experimentally derived causal priors into neural networks. We illustrate this paradigm using a case study of metformin, a ubiquitous pharmaceutical emerging pollutant, and its transport behaviour in sandy media. Specifically, data from a sandy column transport experiment with metformin were used to estimate unobservable parameters through the physics-based model Hydrus-1D, followed by data augmentation to produce a more comprehensive dataset. A causal graph incorporating key variables was constructed, aiding in identifying impactful variables and estimating their causal dynamics, or "causal prior." The causal priors extracted from the augmented dataset included underexplored system parameters such as the type-1 sorption fraction F, the first-order reaction rate coefficient α, and the transport system scale. Their moderate impact on the transport process was quantitatively evaluated (normalized causal effects of 0.0423, -0.1447, and -0.0351, respectively), with adequate confounders considered for the first time. The prior was later embedded into multilayer neural networks via two methods: causal weight initialization and causal prior regularization. Based on the results of AutoML hyperparameter tuning experiments, using the two embedding methods simultaneously emerged as the more advantageous practice, since our proposed causal weight initialization technique can enhance model stability, particularly when used in conjunction with causal prior regularization. Among the experiments that used both techniques, the R-squared values peaked at 0.881. This study demonstrates a balanced approach between expert knowledge and data-driven methods, providing enhanced interpretability in black-box models such as neural networks for environmental modelling.
Subject(s)
Metformin; Neural Networks, Computer; Porosity; Water Pollutants, Chemical/chemistry
ABSTRACT
Motivation: The interaction between genetic variables is one of the major barriers to characterizing the genetic architecture of complex traits. To account for epistasis, network science approaches are increasingly being used to elucidate the genetic architecture of complex diseases. Network science approaches associate the disease susceptibility of genetic variables with their topological importance in the network. However, such a network only represents genetic interactions and does not describe how these interactions contribute to disease association at the subject scale. We propose the Network-based Subject Portrait Approach (NSPA) and an accompanying feature transformation method to determine the collective risk impact of multiple genetic interactions for each subject. Results: The feature transformation method converts the genetic variants of subjects into new values that capture how genetic variables interact with others to contribute to a subject's disease association. We apply this approach to synthetic and genetic datasets and learn that (1) the disease association can be captured using multiple disjoint sets of genetic interactions and (2) the feature transformation method based on NSPA improves predictive performance compared with using the original genetic variables. Our findings confirm the role of genetic interaction in complex disease and provide a novel approach for gene-disease association studies to identify genetic architecture in the context of epistasis. Availability and implementation: The code for NSPA is available at https://github.com/MIB-Lab/Network-based-Subject-Portrait-Approach. Contact: ting.hu@queensu.ca. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
ABSTRACT
Walkability is an important measure with strong ties to our health. However, there are existing gaps in the literature. Our previous work proposed new approaches to address existing limitations. This paper explores new ways of applying transferability using transfer learning. Road networks, points of interest (POIs), and road-related characteristics grow and change over time. Moreover, calculating walkability for all locations in all cities is very time-consuming. Transferability enables the reuse of already-learned knowledge for continued learning, reducing training time, resource consumption, and the number of training labels while improving prediction accuracy. We propose ALF-Score++, which reuses trained models to generate transferable models capable of predicting walkability scores for cities not seen during training. We trained transfer-learned models for St. John's NL and Montréal QC and used them to predict walkability scores for Kingston ON and Vancouver BC. A mean absolute error (MAE) of 13.87 units (on a 0-100 scale) was achieved for transfer learning using MLP, and 4.56 units for direct training (random forest) on personalized clusters.
Subject(s)
Residence Characteristics; Cities
ABSTRACT
Walkability is a term that describes various aspects of the built and social environment and has been associated with physical activity and public health. Walkability is subjective, and although multiple definitions of walkability exist, there is no single agreed-upon definition. Road networks are integral to mobility and should be an important part of walkability. However, using the road structure as nodes is not widely discussed in existing methods. Most walkability measures only provide area-based scores with low spatial resolution, take a one-size-fits-all approach, and do not consider individuals' opinions. Active Living Feature Score (ALF-Score) is a network-based walkability measure that incorporates road network structures as a core component. It also utilizes user opinion to build a high-confidence ground truth that is used in our machine learning pipeline to generate models capable of estimating walkability. We found that combining network features with road embedding and points of interest features creates a complementary feature set, enabling us to train our models with an accuracy of over 87% while maintaining a conversion consistency of over 98%. Our proposed approach outperforms existing measures by introducing a novel method to estimate walkability scores that are representative of users' opinions with a high spatial resolution, for any point on the road.
Subject(s)
Residence Characteristics; Walking; Environment Design; Exercise; Humans
ABSTRACT
Complex networks in the real world often have heterogeneous degree distributions. The structure and function of nodes can vary significantly, with vital nodes playing a crucial role in information spread and other spreading phenomena. Identifying and acting on vital nodes makes it possible to change the network's structure and function more efficiently. Previous work either redefines the metrics used to measure node importance or focuses on developing algorithms to find vital nodes efficiently. These approaches typically rely on global knowledge of the network and assume that the structure of the network does not change over time, both of which are difficult to achieve in the real world. In this paper, we propose a localized strategy that can find vital nodes without global knowledge of the network. Our joint nomination (JN) strategy selects a random set of nodes along with a set of nodes connected to those nodes, and together they nominate the vital node set. Experiments are conducted on 12 network datasets that include synthetic and real-world networks, both undirected and directed. Results show that the average degree of the identified node set is about 3-8 times higher than that of the full node set, and higher-degree nodes make up larger proportions of the degree distribution of the identified vital node set. Removal of vital nodes increases the average shortest path length by 20-70% over the original network, or about 8-15% longer than the other decentralized strategies. Immunization based on JN is more efficient than other strategies, consuming around 12-40% fewer immunization resources to raise the epidemic threshold to [Formula: see text]. Susceptible-infected-recovered simulations on networks with 30% vital nodes removed using JN significantly delay the arrival time of the infection peak and reduce the total infection scale to 15%.
The proposed strategy can effectively identify vital nodes using only local information and is feasible to implement in the real world to cope with time-critical scenarios such as the sudden outbreak of COVID-19.
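The abstract does not spell out the nomination rule, so the Python sketch below only illustrates the general idea of a local strategy: each randomly sampled node, together with its direct neighbours, nominates the highest-degree member of that small group. The function name `local_nomination_sketch`, the tie-breaking rule, and the star-shaped toy network are assumptions for illustration and should not be read as the JN algorithm itself.

```python
import random

def local_nomination_sketch(adj, n_seeds, rng):
    """Nominate nodes using only each seed's own neighbourhood (assumed rule)."""
    seeds = rng.sample(sorted(adj), n_seeds)
    nominated = set()
    for seed in seeds:
        # The seed and its neighbours are the only nodes consulted.
        candidates = [seed, *adj[seed]]
        # Pick the candidate with the largest degree; ties broken by node label.
        best = max(candidates, key=lambda node: (len(adj[node]), str(node)))
        nominated.add(best)
    return nominated

def average_degree(adj, nodes):
    """Mean degree over the given collection of nodes."""
    nodes = list(nodes)
    return sum(len(adj[node]) for node in nodes) / len(nodes)

# Toy undirected star network: hub 0 connected to leaves 1..5.
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}

vital = local_nomination_sketch(star, n_seeds=3, rng=random.Random(0))
print(vital, average_degree(star, vital), average_degree(star, star))
```

On the star network every seed nominates the hub, so the nominated set has a much higher average degree than the full node set, which is the kind of effect the abstract reports on a larger scale.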
ABSTRACT
BACKGROUND: Epistasis has been historically used to describe the phenomenon that the effect of a given gene on a phenotype can be dependent on one or more other genes, and is an essential element for understanding the association between genetic and phenotypic variations. Quantifying epistasis of orders higher than two is very challenging due to both the computational complexity of enumerating all possible combinations in genome-wide data and the lack of efficient and effective methodologies. OBJECTIVES: In this study, we propose a fast, non-parametric, and model-free measure for three-way epistasis. METHODS: Such a measure is based on information gain, and is able to separate all lower order effects from pure three-way epistasis. RESULTS: Our method was verified on synthetic data and applied to real data from a candidate-gene study of tuberculosis in a West African population. In the tuberculosis data, we found a statistically significant pure three-way epistatic interaction effect that was stronger than any lower-order associations. CONCLUSION: Our study provides a methodological basis for detecting and characterizing high-order gene-gene interactions in genetic association studies.
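To make the idea of separating lower-order effects concrete, the Python sketch below computes a three-way information gain by subtracting all main effects and pairwise gains from the joint mutual information between three attributes and the phenotype. The abstract gives no formula, so this particular decomposition, the function names, and the XOR toy data are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(*columns):
    """Shannon entropy in bits of the joint distribution of the given columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((count / n) * log2(count / n) for count in Counter(rows).values())

def mutual_info(attrs, d):
    """I(attrs; D), with the attribute columns treated as one joint variable."""
    return entropy(*attrs) + entropy(d) - entropy(*attrs, d)

def pairwise_gain(a, b, d):
    """Two-way gain: joint information minus the two main effects."""
    return mutual_info([a, b], d) - mutual_info([a], d) - mutual_info([b], d)

def three_way_gain(a, b, c, d):
    """Three-way gain: joint information minus all pairwise gains and main effects."""
    main = mutual_info([a], d) + mutual_info([b], d) + mutual_info([c], d)
    pairs = pairwise_gain(a, b, d) + pairwise_gain(a, c, d) + pairwise_gain(b, c, d)
    return mutual_info([a, b, c], d) - pairs - main

# Toy data: the phenotype is the XOR of three binary attributes, so no single
# attribute or pair carries information and the full 1 bit is purely three-way.
rows = list(product([0, 1], repeat=3))
a, b, c = (list(column) for column in zip(*rows))
d = [x ^ y ^ z for x, y, z in rows]

print(pairwise_gain(a, b, d), three_way_gain(a, b, c, d))
```

The XOR case is deliberately extreme; on real genotype data the main and pairwise terms would generally be non-zero, and the subtraction is what isolates the pure three-way component.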