ABSTRACT
Interactions between protein kinases and their substrates are critical for the modulation of complex signaling pathways. Currently, there is a large amount of information available about kinases and their substrates in disparate public databases. However, these data are difficult to interpret in the context of cellular systems, which can be facilitated by examining interactions among multiple proteins at once, such as the network of interactions that constitute a signaling pathway. We present KiNet, a user-friendly web portal that integrates and shares information about kinase-substrate interactions from multiple databases of post-translational modifications. KiNet enables the visual exploration of these interactions in systems contexts, such as pathways, domain families, and custom protein set inputs, in an interactive fashion. We expect KiNet to be useful as a knowledge discovery tool for kinase-substrate interactions, and the aggregated KiNet dataset to be useful for protein kinase studies and systems-level analyses. The portal is available at https://kinet.kinametrix.com/ .
Subject(s)
Protein Databases, Internet, Protein Kinases, Signal Transduction, Protein Kinases/metabolism, Signal Transduction/physiology, Humans, Post-Translational Protein Processing, Substrate Specificity, Software, Computational Biology/methods, User-Computer Interface, Phosphorylation
ABSTRACT
Protein kinase function and interactions with drugs are controlled in part by the movement of the DFG and αC-helix motifs that are related to the catalytic activity of the kinase. Small molecule ligands elicit therapeutic effects with distinct selectivity profiles and residence times that often depend on the active or inactive kinase conformation(s) they bind. Modern AI-based structural modeling methods have the potential to expand upon the limited availability of experimentally determined kinase structures in inactive states. Here, we first explored the conformational space of kinases in the PDB and models generated by AlphaFold2 (AF2) and ESMFold, two prominent AI-based protein structure prediction methods. Our investigation of AF2's ability to explore the conformational diversity of the kinome at various multiple sequence alignment (MSA) depths showed a bias within the predicted structures of kinases in DFG-in conformations, particularly those controlled by the DFG motif, based on their overabundance in the PDB. We demonstrate that predicting kinase structures using AF2 at lower MSA depths explored these alternative conformations more extensively, including identifying previously unobserved conformations for 398 kinases. Ligand enrichment analyses for 23 kinases showed that, on average, docked models distinguished between active molecules and decoys better than random (average AUC (avgAUC) of 64.58), but select models performed well (e.g., avgAUCs for PTK2 and JAK2 were 79.28 and 80.16, respectively). Further analysis explained the ligand enrichment discrepancy between low- and high-performing kinase models as binding site occlusions that would preclude docking.
The overall results of our analyses suggested that, although AF2 explored previously uncharted regions of the kinase conformational space and select models exhibited enrichment scores suitable for rational drug discovery, rigorous refinement of AF2 models is likely still necessary for drug discovery campaigns.
Subject(s)
Computational Biology, Protein Conformation, Protein Kinases, Protein Kinases/chemistry, Protein Kinases/metabolism, Molecular Models, Ligands, Protein Kinase Inhibitors/chemistry, Protein Kinase Inhibitors/pharmacology, Protein Databases, Humans, Sequence Alignment
ABSTRACT
BACKGROUND: Allergic rhinitis is a common inflammatory condition of the nasal mucosa that imposes a considerable health burden. Air pollution has been observed to increase the risk of developing allergic rhinitis. We addressed the hypotheses that early life exposure to air toxics is associated with developing allergic rhinitis, and that these effects are mediated by DNA methylation and gene expression in the nasal mucosa. METHODS: In a case-control cohort of 505 participants, we geocoded participants' early life exposure to air toxics using data from the US Environmental Protection Agency, assessed physician diagnosis of allergic rhinitis by questionnaire, and collected nasal brushings for whole-genome DNA methylation and transcriptome profiling. We then performed a series of analyses including differential expression, Mendelian randomization, and causal mediation analyses to characterize relationships between early life air toxics, nasal DNA methylation, nasal gene expression, and allergic rhinitis. RESULTS: Among the 505 participants, 275 had allergic rhinitis. The mean age of the participants was 16.4 years (standard deviation = 9.5 years). Early life exposure to air toxics such as acrylic acid, phosphine, antimony compounds, and benzyl chloride was associated with developing allergic rhinitis. These air toxics exerted their effects by altering the nasal DNA methylation and nasal gene expression levels of genes involved in respiratory ciliary function, mast cell activation, pro-inflammatory TGF-β1 signaling, and the regulation of myeloid immune cell function. CONCLUSIONS: Our results expand the range of air pollutants implicated in allergic rhinitis and shed light on their underlying biological mechanisms in nasal mucosa.
Subject(s)
Air Pollutants, DNA Methylation, Nasal Mucosa, Allergic Rhinitis, Humans, Nasal Mucosa/immunology, Nasal Mucosa/metabolism, Allergic Rhinitis/etiology, Allergic Rhinitis/immunology, Male, Female, Air Pollutants/adverse effects, Adolescent, Case-Control Studies, Young Adult, Environmental Exposure/adverse effects, Gene Expression Profiling, Adult, Transcriptome, Child, Genomics/methods, Multiomics
ABSTRACT
BACKGROUND: The prevalence of type 2 diabetes mellitus (DM) and pre-diabetes mellitus (pre-DM) has been increasing among youth in recent decades in the United States, prompting an urgent need for understanding and identifying their associated risk factors. Such efforts, however, have been hindered by the lack of easily accessible youth pre-DM/DM data. OBJECTIVE: We aimed to first build a high-quality, comprehensive epidemiological data set focused on youth pre-DM/DM. Subsequently, we aimed to make these data accessible by creating a user-friendly web portal to share them and the corresponding codes. Through this, we hope to address this significant gap and facilitate youth pre-DM/DM research. METHODS: Building on data from the National Health and Nutrition Examination Survey (NHANES) from 1999 to 2018, we cleaned and harmonized hundreds of variables relevant to pre-DM/DM (fasting plasma glucose level ≥100 mg/dL or glycated hemoglobin ≥5.7%) for youth aged 12-19 years (N=15,149). We identified individual factors associated with pre-DM/DM risk using bivariate statistical analyses and predicted pre-DM/DM status using our Ensemble Integration (EI) framework for multidomain machine learning. We then developed a user-friendly web portal named Prediabetes/diabetes in youth Online Dashboard (POND) to share the data and codes. RESULTS: We extracted 95 variables potentially relevant to pre-DM/DM risk organized into 4 domains (sociodemographic, health status, diet, and other lifestyle behaviors). The bivariate analyses identified 27 significant correlates of pre-DM/DM (P<.001, Bonferroni adjusted), including race or ethnicity, health insurance, BMI, added sugar intake, and screen time. Among these factors, 16 factors were also identified based on the EI methodology (Fisher P of overlap=7.06×10⁻⁶).
In addition to those, the EI approach identified 11 additional predictive variables, including some known (eg, meat and fruit intake and family income) and less recognized factors (eg, number of rooms in homes). The factors identified in both analyses spanned across all 4 of the domains mentioned. These data and results, as well as other exploratory tools, can be accessed on POND. CONCLUSIONS: Using NHANES data, we built one of the largest public epidemiological data sets for studying youth pre-DM/DM and identified potential risk factors using complementary analytical approaches. Our results align with the multifactorial nature of pre-DM/DM with correlates across several domains. Also, our data-sharing platform, POND, facilitates a wide range of applications to inform future youth pre-DM/DM studies.
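The case definition quoted in the METHODS of this abstract (fasting plasma glucose ≥100 mg/dL or glycated hemoglobin ≥5.7%) can be sketched as a small labeling function. This is an illustrative sketch only, not the authors' code: the function name, the constant names, and the handling of missing measurements (return `None` when neither biomarker is available) are assumptions that the abstract does not specify.

```python
from typing import Optional

# Cutoffs taken from the abstract's case definition.
FPG_THRESHOLD_MG_DL = 100.0  # fasting plasma glucose, mg/dL
HBA1C_THRESHOLD_PCT = 5.7    # glycated hemoglobin, percent


def has_predm_or_dm(fpg_mg_dl: Optional[float],
                    hba1c_pct: Optional[float]) -> Optional[bool]:
    """Label one participant as pre-DM/DM (True) or not (False).

    Assumption (not from the abstract): a missing biomarker is passed as
    None; if both are missing the label is None (unknown).
    """
    if fpg_mg_dl is None and hba1c_pct is None:
        return None
    meets_fpg = fpg_mg_dl is not None and fpg_mg_dl >= FPG_THRESHOLD_MG_DL
    meets_hba1c = hba1c_pct is not None and hba1c_pct >= HBA1C_THRESHOLD_PCT
    # The definition is a logical OR: either biomarker at or above its cutoff.
    return meets_fpg or meets_hba1c
```

Because the rule is an OR, a participant with a normal fasting glucose but an HbA1c of 5.7% is still labeled pre-DM/DM.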
Subject(s)
Type 2 Diabetes Mellitus, Internet, Nutrition Surveys, Humans, Adolescent, Child, Female, Male, Type 2 Diabetes Mellitus/epidemiology, United States/epidemiology, Young Adult, Prediabetic State/epidemiology, Risk Factors, Datasets as Topic, Prevalence
ABSTRACT
Air toxics are atmospheric pollutants with hazardous effects on health and the environment. Although methodological constraints have limited the number of air toxics assessed for associations with health and disease, advances in machine learning (ML) enable the assessment of a much larger set of environmental exposures. We used ML methods to conduct a retrospective study to identify combinations of 109 air toxics associated with asthma symptoms among 269 elementary school students in Spokane, Washington. Data on the frequency of asthma symptoms for these children were obtained from Spokane Public Schools. Their exposure to air toxics was estimated by using the Environmental Protection Agency's Air Toxics Screening Assessment and National Air Toxics Assessment. We defined three exposure periods: the most recent year (2019), the last three years (2017-2019), and the last five years (2014-2019). We analyzed the data using the ML-based Data-driven ExposurE Profile (DEEP) extraction method. DEEP identified 25 air toxic combinations associated with asthma symptoms in at least one exposure period. Three combinations (1,1,1-trichloroethane, 2-nitropropane, and 2,4,6-trichlorophenol) were significantly associated with asthma symptoms in all three exposure periods. Four air toxics (1,1,1-trichloroethane, 1,1,2,2-tetrachloroethane, bis(2-ethylhexyl) phthalate (DEHP), and 2,4-dinitrophenol) were associated only in combination with other toxics, and would not have been identified by traditional statistical methods. The application of DEEP also identified a vulnerable subpopulation of children who were exposed to 13 of the 25 significant combinations in at least one exposure period. On average, these children experienced the largest number of asthma symptoms in our sample. By providing evidence on air toxic combinations associated with childhood asthma, our findings may contribute to the regulation of these toxics to improve children's respiratory health.
Subject(s)
Air Pollutants, Air Pollution, Asthma, Trichloroethanes, Child, Humans, Air Pollutants/toxicity, Air Pollutants/analysis, Washington/epidemiology, Retrospective Studies, Asthma/chemically induced, Asthma/epidemiology, Environmental Exposure
ABSTRACT
Protein kinase function and interactions with drugs are controlled in part by the movement of the DFG and αC-helix motifs, which enable kinases to adopt various conformational states. Small molecule ligands elicit therapeutic effects with distinct selectivity profiles and residence times that often depend on the kinase conformation(s) they bind. However, the limited availability of experimentally determined structural data for kinases in inactive states restricts drug discovery efforts for this major protein family. Modern AI-based structural modeling methods hold potential for exploring the previously experimentally uncharted druggable conformational space for kinases. Here, we first evaluated the currently explored conformational space of kinases in the PDB and models generated by AlphaFold2 (AF2) (1) and ESMFold (2), two prominent AI-based structure prediction methods. We then investigated AF2's ability to predict kinase structures in different conformations at various multiple sequence alignment (MSA) depths, based on this parameter's ability to explore conformational diversity. Our results showed a bias within the PDB and predicted structural models generated by AF2 and ESMFold toward structures of kinases in the active state over alternative conformations, particularly those conformations controlled by the DFG motif. Finally, we demonstrate that predicting kinase structures using AF2 at lower MSA depths allows the exploration of the space of these alternative conformations, including identifying previously unobserved conformations for 398 kinases. The results of our analysis of structural modeling by AF2 create a new avenue for the pursuit of new therapeutic agents against a notoriously difficult-to-target family of proteins.
Significance Statement: Greater abundance of kinase structural data in inactive conformations, currently lacking in structural databases, would improve our understanding of how protein kinases function and expand drug discovery and development for this family of therapeutic targets. Modern approaches utilizing artificial intelligence and machine learning have potential for efficiently capturing novel protein conformations. We provide evidence for a bias within AlphaFold2 and ESMFold to predict structures of kinases in their active states, similar to their overrepresentation in the PDB. We show that lowering the AlphaFold2 algorithm's multiple sequence alignment depth can help explore kinase conformational space more broadly. It can also enable the prediction of hundreds of kinase structures in novel conformations, many of whose models are likely viable for drug discovery.
ABSTRACT
The prevalence of type 2 diabetes mellitus (DM) and prediabetes (preDM) is rapidly increasing among youth, posing significant health and economic consequences. To address this growing concern, we created the most comprehensive youth-focused diabetes dataset to date derived from National Health and Nutrition Examination Survey (NHANES) data from 1999 to 2018. The dataset, consisting of 15,149 youth aged 12 to 19 years, encompasses preDM/DM relevant variables from sociodemographic, health status, diet, and other lifestyle behavior domains. An interactive web portal, POND (Prediabetes/diabetes in youth ONline Dashboard), was developed to provide public access to the dataset, allowing users to explore variables potentially associated with youth preDM/DM. Leveraging statistical and machine learning methods, we conducted two case studies, revealing established and lesser-known variables linked to youth preDM/DM. This dataset and portal can facilitate future studies to inform prevention and management strategies for youth prediabetes and diabetes.
ABSTRACT
One of the promising opportunities of digital health is its potential to lead to more holistic understandings of diseases by interacting with the daily life of patients and through the collection of large amounts of real-world data. Validating and benchmarking indicators of disease severity in the home setting is difficult, however, given the large number of confounders present in the real world and the challenges in collecting ground truth data in the home. Here we leverage two datasets collected from patients with Parkinson's disease, which couple continuous wrist-worn accelerometer data with frequent symptom reports in the home setting, to develop digital biomarkers of symptom severity. Using these data, we performed a public benchmarking challenge in which participants were asked to build measures of severity across 3 symptoms (on/off medication, dyskinesia, and tremor). In total, 42 teams participated, and performance was improved over baseline models for each subchallenge. Additional ensemble modeling across submissions further improved performance, and the top models were validated in a subset of patients whose symptoms were observed and rated by trained clinicians.
ABSTRACT
Motivation: Integrating multimodal data represents an effective approach to predicting biomedical characteristics, such as protein functions and disease outcomes. However, existing data integration approaches do not sufficiently address the heterogeneous semantics of multimodal data. In particular, early and intermediate approaches that rely on a uniform integrated representation reinforce the consensus among the modalities but may lose exclusive local information. The alternative late integration approach that can address this challenge has not been systematically studied for biomedical problems. Results: We propose Ensemble Integration (EI) as a novel systematic implementation of the late integration approach. EI infers local predictive models from the individual data modalities using appropriate algorithms and uses heterogeneous ensemble algorithms to integrate these local models into a global predictive model. We also propose a novel interpretation method for EI models. We tested EI on the problems of predicting protein function from multimodal STRING data and mortality due to coronavirus disease 2019 (COVID-19) from multimodal data in electronic health records. We found that EI accomplished its goal of producing significantly more accurate predictions than each individual modality. It also performed better than several established early integration methods for each of these problems. The interpretation of a representative EI model for COVID-19 mortality prediction identified several disease-relevant features, such as laboratory test (blood urea nitrogen and calcium) and vital sign measurements (minimum oxygen saturation) and demographics (age). These results demonstrated the effectiveness of the EI framework for biomedical data integration and predictive modeling. Availability and implementation: Code and data are available at https://github.com/GauravPandeyLab/ensemble_integration. 
Supplementary information: Supplementary data are available at Bioinformatics Advances online.
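The late integration idea described in this abstract (fit one local model per data modality, then combine the local predictions into a global prediction) can be illustrated with a minimal standard-library sketch. Everything below is an assumption made for illustration and is not the EI implementation from the linked repository: the local models are simple nearest-centroid scorers, the ensemble step is a plain mean of local scores standing in for the heterogeneous ensemble algorithms the abstract mentions, and the modality names and toy values are invented.

```python
from math import dist
from statistics import fmean


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [fmean(col) for col in zip(*rows)]


def fit_local_model(rows, labels):
    """Fit a nearest-centroid scorer on ONE modality (illustrative stand-in)."""
    pos = centroid([r for r, y in zip(rows, labels) if y == 1])
    neg = centroid([r for r, y in zip(rows, labels) if y == 0])

    def score(row):
        d_pos, d_neg = dist(row, pos), dist(row, neg)
        total = d_pos + d_neg
        # Score in [0, 1]; closer to the positive centroid gives a higher score.
        return 0.5 if total == 0 else d_neg / total

    return score


def fit_late_integration(modalities, labels):
    """Fit one local model per modality, then average their scores."""
    local = {name: fit_local_model(rows, labels)
             for name, rows in modalities.items()}

    def predict(sample):
        # Late integration: modalities are never merged into one feature
        # matrix; only the local predictions are combined.
        return fmean(local[name](sample[name]) for name in local)

    return predict


# Hypothetical toy data: two modalities measured on the same four patients.
modalities = {
    "labs": [[1.0, 2.0], [1.2, 1.8], [3.0, 4.0], [3.2, 4.1]],
    "vitals": [[98.0], [97.0], [88.0], [86.0]],
}
labels = [0, 0, 1, 1]

predict = fit_late_integration(modalities, labels)
high_risk = predict({"labs": [3.1, 4.0], "vitals": [87.0]})
low_risk = predict({"labs": [1.1, 1.9], "vitals": [97.5]})
```

Because each modality keeps its own model, information that appears in only one modality can still influence the combined score, which is the motivation for late integration given in the abstract.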
ABSTRACT
Motivation: Integrating multimodal data represents an effective approach to predicting biomedical characteristics, such as protein functions and disease outcomes. However, existing data integration approaches do not sufficiently address the heterogeneous semantics of multimodal data. In particular, early and intermediate approaches that rely on a uniform integrated representation reinforce the consensus among the modalities, but may lose exclusive local information. The alternative late integration approach that can address this challenge has not been systematically studied for biomedical problems. Results: We propose Ensemble Integration (EI) as a novel systematic implementation of the late integration approach. EI infers local predictive models from the individual data modalities using appropriate algorithms, and uses effective heterogeneous ensemble algorithms to integrate these local models into a global predictive model. We also propose a novel interpretation method for EI models. We tested EI on the problems of predicting protein function from multimodal STRING data, and mortality due to COVID-19 from multimodal data in electronic health records. We found that EI accomplished its goal of producing significantly more accurate predictions than each individual modality. It also performed better than several established early integration methods for each of these problems. The interpretation of a representative EI model for COVID-19 mortality prediction identified several disease-relevant features, such as laboratory test (blood urea nitrogen (BUN) and calcium) and vital sign measurements (minimum oxygen saturation) and demographics (age). These results demonstrated the effectiveness of the EI framework for biomedical data integration and predictive modeling. Availability: Code and data are available at https://github.com/GauravPandeyLab/ensemble_integration . Contact: gaurav.pandey@mssm.edu.
ABSTRACT
Air pollution is a well-known contributor to asthma. Air toxics are hazardous air pollutants that cause or may cause serious health effects. Although individual air toxics have been associated with asthma, only a limited number of studies have specifically examined combinations of air toxics associated with the disease. We geocoded air toxic levels from the US National Air Toxics Assessment (NATA) to residential locations for participants of our AiRway in Asthma (ARIA) study. We then applied Data-driven ExposurE Profile extraction (DEEP), a machine learning-based method, to discover combinations of early-life air toxics associated with current use of daily asthma controller medication, lifetime emergency department visit for asthma, and lifetime overnight hospitalization for asthma. We discovered 20 multi-air toxic combinations and 18 single air toxics associated with at least 1 outcome. The multi-air toxic combinations included those containing acrylic acid, ethylidene dichloride, and hydroquinone, and they were significantly associated with asthma outcomes. Several air toxic members of the combinations would not have been identified by single air toxic analyses, supporting the use of machine learning-based methods designed to detect combinatorial effects. Our findings provide knowledge about air toxic combinations associated with childhood asthma.
Subject(s)
Air Pollutants/adverse effects, Asthma/etiology, Machine Learning, Acrylates/adverse effects, Adolescent, Air Pollutants/analysis, Child, Ethyl Chloride/adverse effects, Female, Humans, Hydroquinones/adverse effects, Male, Risk Factors
ABSTRACT
Background: The COVID-19 pandemic has affected millions of individuals and caused hundreds of thousands of deaths worldwide. Predicting mortality among patients with COVID-19 who present with a spectrum of complications is very difficult, hindering the prognostication and management of the disease. We aimed to develop an accurate prediction model of COVID-19 mortality using unbiased computational methods, and identify the clinical features most predictive of this outcome. Methods: In this prediction model development and validation study, we applied machine learning techniques to clinical data from a large cohort of patients with COVID-19 treated at the Mount Sinai Health System in New York City, NY, USA, to predict mortality. We analysed patient-level data captured in the Mount Sinai Data Warehouse database for individuals with a confirmed diagnosis of COVID-19 who had a health system encounter between March 9 and April 6, 2020. For initial analyses, we used patient data from March 9 to April 5, and randomly assigned (80:20) the patients to the development dataset or test dataset 1 (retrospective). Patient data for those with encounters on April 6, 2020, were used in test dataset 2 (prospective). We designed prediction models based on clinical features and patient characteristics during health system encounters to predict mortality using the development dataset. We assessed the resultant models in terms of the area under the receiver operating characteristic curve (AUC) score in the test datasets. Findings: Using the development dataset (n=3841) and a systematic machine learning framework, we developed a COVID-19 mortality prediction model that showed high accuracy (AUC=0·91) when applied to test datasets of retrospective (n=961) and prospective (n=249) patients. 
This model was based on three clinical features: patient's age, minimum oxygen saturation over the course of their medical encounter, and type of patient encounter (inpatient vs outpatient and telehealth visits). Interpretation: An accurate and parsimonious COVID-19 mortality prediction model based on three features might have utility in clinical settings to guide the management and prognostication of patients affected by this disease. External validation of this prediction model in other populations is needed. Funding: National Institutes of Health.
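The evaluation design in this abstract (a random 80:20 assignment to development and test data, with models scored by AUC) can be sketched with the Python standard library. This is a hedged illustration, not the study's code: the function names, the fixed seed, and the toy labels and scores are assumptions, and the AUC is computed with the standard pairwise-ranking definition (the probability that a randomly chosen positive case is scored above a randomly chosen negative case, counting ties as one half).

```python
import random


def split_development_test(records, test_fraction=0.2, seed=0):
    """Randomly assign records to (development, test) sets, e.g. 80:20."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for the positive class (e.g. deceased), 0 for the negative.
    scores: higher values mean the model considers the case more positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive-negative pairs are ranked correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise form is quadratic in the number of cases, which is acceptable for a sketch but not for large cohorts.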
Subject(s)
COVID-19/mortality, Clinical Decision Rules, Age Factors, Aged, COVID-19/pathology, Datasets as Topic, Female, Humans, Logistic Models, Male, Middle Aged, Statistical Models, New York City/epidemiology, ROC Curve, Reproducibility of Results, Risk Factors
ABSTRACT
BACKGROUND: The coronavirus disease 2019 (COVID-19) pandemic has affected millions of individuals and caused hundreds of thousands of deaths worldwide. It can be difficult to accurately predict mortality among COVID-19 patients presenting with a spectrum of complications, hindering the prognostication and management of the disease. METHODS: We applied machine learning techniques to clinical data from a large cohort of 5,051 COVID-19 patients treated at the Mount Sinai Health System in New York City, the global COVID-19 epicenter, to predict mortality. Predictors were designed to classify patients into Deceased or Alive mortality classes and were evaluated in terms of the area under the receiver operating characteristic (ROC) curve (AUC score). FINDINGS: Using a development cohort (n=3,841) and a systematic machine learning framework, we identified a COVID-19 mortality predictor that demonstrated high accuracy (AUC=0.91) when applied to test sets of retrospective (n=961) and prospective (n=249) patients. This mortality predictor was based on five clinical features: age, minimum O2 saturation during encounter, type of patient encounter (inpatient vs. various types of outpatient and telehealth encounters), hydroxychloroquine use, and maximum body temperature. INTERPRETATION: An accurate and parsimonious COVID-19 mortality predictor based on five features may have utility in clinical settings to guide the management and prognostication of patients affected by this disease.