Abstract
INTRODUCTION: Intracerebral hemorrhage represents 15% of all strokes and is associated with a high risk of post-stroke epilepsy. However, there are no reliable methods to accurately predict which patients are at higher risk of developing seizures, despite the importance of such predictions in planning treatments, allocating resources, and advancing post-stroke seizure research. Existing risk models have limitations and have not taken advantage of readily available real-world data and artificial intelligence. This study aims to evaluate the performance of machine-learning-based models to predict post-stroke seizures at 1 year and 5 years after an intracerebral hemorrhage in unselected patients across multiple healthcare organizations. DESIGN/METHODS: We identified patients with intracerebral hemorrhage (ICH) without a prior diagnosis of seizures from 2015 until inception (11/01/22) in the TriNetX Diamond Network, using the International Classification of Diseases, Tenth Revision (ICD-10) code I61 (I61.0, I61.1, I61.2, I61.3, I61.4, I61.5, I61.6, I61.8, and I61.9). The outcome of interest was any ICD-10 diagnosis of seizures (G40/G41) at 1 year and 5 years following the first occurrence of the diagnosis of intracerebral hemorrhage. We applied a conventional logistic regression and a Light Gradient Boosted Machine (LGBM) algorithm, and the performance of the models was assessed using the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), the F1 statistic, model accuracy, balanced accuracy, precision, and recall, with and without seizure medication use in the models. RESULTS: A total of 85,679 patients had an ICD-10 code of intracerebral hemorrhage and no prior diagnosis of seizures, constituting our study cohort. Seizures were present in 4.57% and 6.27% of patients within 1 and 5 years after ICH, respectively.
At 1 year, the AUROC, AUPRC, F1 statistic, accuracy, balanced accuracy, precision, and recall were respectively 0.7051 (standard error: 0.0132), 0.1143 (0.0068), 0.1479 (0.0055), 0.6708 (0.0076), 0.6491 (0.0114), 0.0839 (0.0032), and 0.6253 (0.0216). Corresponding metrics at 5 years were 0.694 (0.009), 0.1431 (0.0039), 0.1859 (0.0064), 0.6603 (0.0059), 0.6408 (0.0119), 0.1094 (0.0037) and 0.6186 (0.0264). These numerical values indicate that the statistical models fit the data very well. CONCLUSION: Machine learning models applied to electronic health records can improve the prediction of post-hemorrhagic stroke epilepsy, presenting a real opportunity to incorporate risk assessments into clinical decision-making in post-stroke clinical care and to improve patient selection for post-stroke epilepsy research.
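The threshold-based metrics reported above are all derived from the same confusion matrix, and AUROC can be read as a pairwise ranking probability. The following is a minimal sketch of those definitions in plain Python; the toy labels, scores, and the 0.5 decision threshold are illustrative assumptions and are not taken from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def threshold_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy, precision, recall, and F1 from hard predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "balanced_accuracy": (recall + specificity) / 2,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


def auroc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy, imbalanced example (2 positives out of 10); values are made up for illustration.
y_true = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.7, 0.2, 0.1, 0.4, 0.5, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

metrics = threshold_metrics(y_true, y_pred)
ranking_quality = auroc(y_true, scores)
```

As a consistency check on the reported figures, `f1_score(0.0839, 0.6253)` evaluates to roughly 0.148, in line with the reported 1-year F1 of 0.1479. With outcome prevalence near 5%, high recall paired with low precision is the expected pattern, which is why balanced accuracy and AUPRC are reported alongside plain accuracy.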
Subjects
Cerebral Hemorrhage; Machine Learning; Seizures; Humans; Cerebral Hemorrhage/complications; Cerebral Hemorrhage/diagnosis; Seizures/diagnosis; Seizures/etiology; Male; Female; Aged; Middle Aged; Aged, 80 and over
Abstract
MOTIVATION: Protein complexes play critical roles in many aspects of biological functions. Three-dimensional (3D) structures of protein complexes are critical for gaining insights into structural bases of interactions and their roles in the biomolecular pathways that orchestrate key cellular processes. Because of the expense and effort associated with experimental determinations of 3D protein complex structures, computational docking has evolved as a valuable tool to predict 3D structures of biomolecular complexes. Despite recent progress, reliably distinguishing near-native docking conformations from a large number of candidate conformations, the so-called scoring problem, remains a major challenge. RESULTS: Here we present iScore, a novel approach to scoring docked conformations that combines HADDOCK energy terms with a score obtained using a graph representation of the protein-protein interfaces and a measure of evolutionary conservation. It achieves a scoring performance competitive with, or superior to, that of state-of-the-art scoring functions on two independent datasets: (i) docking software-specific models and (ii) the CAPRI score set generated by a wide variety of docking approaches (i.e. docking software-non-specific). iScore ranks among the top scoring approaches on the CAPRI score set (13 targets) when compared with the 37 scoring groups in CAPRI. The results demonstrate the utility of combining evolutionary, topological and energetic information for scoring docked conformations. This work represents the first successful application of graph kernels to protein interfaces for effective discrimination of near-native and non-native conformations of protein complexes. AVAILABILITY AND IMPLEMENTATION: The iScore code is freely available from GitHub: https://github.com/DeepRank/iScore (DOI: 10.5281/zenodo.2630567). The docking models used are available from SBGrid: https://data.sbgrid.org/dataset/684.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Algorithms; Computational Biology; Molecular Docking Simulation; Proteins; Computational Biology/methods; Molecular Docking Simulation/methods; Protein Binding; Protein Conformation; Proteins/chemistry; Proteins/metabolism; Software
Abstract
RNA-protein interactions play essential roles in regulating gene expression. While some RNA-protein interactions are "specific", that is, the RNA-binding proteins preferentially bind to particular RNA sequence or structural motifs, others are "non-RNA specific." Deciphering the protein-RNA recognition code is essential for comprehending the functional implications of these interactions and for developing new therapies for many diseases. Because of the high cost of experimental determination of protein-RNA interfaces, there is a need for computational methods to identify RNA-binding residues in proteins. While most of the existing computational methods for predicting RNA-binding residues in RNA-binding proteins are oblivious to the characteristics of the partner RNA, there is growing interest in methods for partner-specific prediction of RNA binding sites in proteins. In this work, we assess the performance of two recently published partner-specific protein-RNA interface prediction tools, PS-PRIP, and PRIdictor, along with our own new tools. Specifically, we introduce a novel metric, RNA-specificity metric (RSM), for quantifying the RNA-specificity of the RNA binding residues predicted by such tools. Our results show that the RNA-binding residues predicted by previously published methods are oblivious to the characteristics of the putative RNA binding partner. Moreover, when evaluated using partner-agnostic metrics, RNA partner-specific methods are outperformed by the state-of-the-art partner-agnostic methods. We conjecture that either (a) the protein-RNA complexes in PDB are not representative of the protein-RNA interactions in nature, or (b) the current methods for partner-specific prediction of RNA-binding residues in proteins fail to account for the differences in RNA partner-specific versus partner-agnostic protein-RNA interactions, or both.
Subjects
Computational Biology; Proteins/chemistry; RNA-Binding Proteins/genetics; RNA/genetics; Amino Acid Sequence/genetics; Base Sequence/genetics; Binding Sites/genetics; Models, Molecular; Protein Binding/genetics; Protein Conformation; Proteins/genetics; RNA/chemistry; RNA-Binding Motifs/genetics; RNA-Binding Proteins/chemistry; Sequence Analysis, Protein; Software
Abstract
Although many advanced and sophisticated ab initio approaches for modeling protein-protein complexes have been proposed in past decades, template-based modeling (TBM) remains the most accurate and widely used approach, given a reliable template is available. However, there are many different ways to exploit template information in the modeling process. Here, we systematically evaluate and benchmark a TBM method that uses conserved interfacial residue pairs as docking distance restraints [referred to as alpha carbon-alpha carbon (CA-CA)-guided docking]. We compare it with two other template-based protein-protein modeling approaches, including a conserved non-pairwise interfacial residue restrained docking approach [referred to as the ambiguous interaction restraint (AIR)-guided docking] and a simple superposition-based modeling approach. Our results show that, for most cases, the CA-CA-guided docking method outperforms both superposition with refinement and the AIR-guided docking method. We emphasize the superiority of the CA-CA-guided docking on cases with medium to large conformational changes, and interactions mediated through loops, tails or disordered regions. Our results also underscore the importance of a proper refinement of superimposition models to reduce steric clashes. In summary, we provide a benchmarked TBM protocol that uses conserved pairwise interface distance as restraints in generating realistic 3D protein-protein interaction models, when reliable templates are available. The described CA-CA-guided docking protocol is based on the HADDOCK platform, which allows users to incorporate additional prior knowledge of the target system to further improve the quality of the resulting models.
Subjects
Proteins/metabolism; Models, Molecular; Protein Binding
Abstract
Accurate and comprehensive identification of surface-exposed proteins (SEPs) in parasites is a key step in developing novel subunit vaccines. However, the reliability of MS-based high-throughput methods for proteome-wide mapping of SEPs continues to be limited due to high rates of false positives (i.e., proteins mistakenly identified as surface exposed) as well as false negatives (i.e., SEPs not detected due to low expression or other technical limitations). We propose a framework called PlasmoSEP for the reliable identification of SEPs using a novel semisupervised learning algorithm that combines SEPs identified by high-throughput experiments and expert annotation of high-throughput data to augment labeled data for training a predictive model. Our experiments using high-throughput data from the Plasmodium falciparum surface-exposed proteome provide several novel high-confidence predictions of SEPs in P. falciparum and also confirm expert annotations for several others. Furthermore, PlasmoSEP predicts that 25 of 37 experimentally identified SEPs in Plasmodium yoelii salivary gland sporozoites are likely to be SEPs. Finally, PlasmoSEP predicts several novel SEPs in P. yoelii and Plasmodium vivax malaria parasites that can be validated for further vaccine studies. Our computational framework can be easily adapted to improve the interpretation of data from high-throughput studies.
Subjects
Algorithms; Membrane Proteins/analysis; Plasmodium falciparum/chemistry; Proteomics/methods; Protozoan Proteins/analysis; High-Throughput Screening Assays/methods; Humans; Membrane Proteins/metabolism; Models, Theoretical; Plasmodium vivax/metabolism; Plasmodium vivax/pathogenicity; Plasmodium yoelii/chemistry; Protozoan Proteins/metabolism; Salivary Glands/metabolism
Abstract
Selecting near-native conformations from the immense number of conformations generated by docking programs remains a major challenge in molecular docking. We introduce DockRank, a novel approach to scoring docked conformations based on the degree to which the interface residues of the docked conformation match a set of predicted interface residues. DockRank uses interface residues predicted by partner-specific sequence homology-based protein-protein interface predictor (PS-HomPPI), which predicts the interface residues of a query protein with a specific interaction partner. We compared the performance of DockRank with several state-of-the-art docking scoring functions using Success Rate (the percentage of cases that have at least one near-native conformation among the top m conformations) and Hit Rate (the percentage of near-native conformations that are included among the top m conformations). In cases where it is possible to obtain partner-specific (PS) interface predictions from PS-HomPPI, DockRank consistently outperforms both (i) ZRank and IRAD, two state-of-the-art energy-based scoring functions (improving Success Rate by up to 4-fold); and (ii) Variants of DockRank that use predicted interface residues obtained from several protein interface predictors that do not take into account the binding partner in making interface predictions (improving success rate by up to 39-fold). The latter result underscores the importance of using partner-specific interface residues in scoring docked conformations. We show that DockRank, when used to re-rank the conformations returned by ClusPro, improves upon the original ClusPro rankings in terms of both Success Rate and Hit Rate. DockRank is available as a server at http://einstein.cs.iastate.edu/DockRank/.
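The two evaluation measures defined above, and the idea of scoring a conformation by how well its interface matches a predicted interface, can be sketched in a few lines of Python. This is an illustration only: the abstract does not state which similarity function DockRank uses, so the Jaccard overlap below is an assumption, and the function names and toy rankings are hypothetical.

```python
def interface_overlap(docked_interface, predicted_interface):
    """Jaccard overlap between two sets of interface residue identifiers.

    Assumption: Jaccard is used here only as a stand-in similarity measure.
    """
    docked, predicted = set(docked_interface), set(predicted_interface)
    union = docked | predicted
    return len(docked & predicted) / len(union) if union else 0.0


def success_rate(cases, m):
    """Percentage of cases with at least one near-native conformation in the top m.

    `cases` is a list of rankings; each ranking is a list of booleans ordered
    best-first, where True marks a near-native conformation.
    """
    successes = sum(1 for ranked in cases if any(ranked[:m]))
    return 100.0 * successes / len(cases)


def hit_rate(ranked, m):
    """Percentage of a case's near-native conformations found in the top m."""
    total_near_native = sum(ranked)
    if total_near_native == 0:
        return 0.0
    return 100.0 * sum(ranked[:m]) / total_near_native


# Hypothetical rankings for three docking cases (True = near-native).
cases = [
    [False, True, False, True],
    [False, False, False, True],
    [True, False, False, False],
]
top2_success = success_rate(cases, 2)
first_case_hit_rate = hit_rate(cases[0], 2)
```

Success Rate is a per-benchmark number, whereas the hit rate above is computed per case; how per-case hit rates are aggregated across a benchmark is not specified in the abstract.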
Subjects
Molecular Docking Simulation; Software; Ligands; Protein Interaction Domains and Motifs; Protein Structure, Quaternary; Receptors, Cell Surface/chemistry; Sequence Homology, Amino Acid; Structural Homology, Protein; Thermodynamics
Abstract
Objective: To develop an artificial intelligence, machine learning prediction model for estimating the risk of seizures 1 year and 5 years after ischemic stroke (IS) using a large dataset from Electronic Health Records. Background: Seizures are frequent after ischemic strokes and are associated with increased mortality, poor functional outcomes, and lower quality of life. Separating patients at high risk of seizures from those at low risk of seizures is needed for treatment and clinical trial planning, but remains challenging. Machine learning (ML) is a potential approach to address this problem. Design/Methods: We identified patients (aged ≥18 years) with IS without a prior diagnosis of seizures from 2015 until inception (08/09/22) in the TriNetX Research Network, using the International Classification of Diseases, Tenth Revision (ICD-10) code I63, excluding I63.6 (venous infarction). The outcome of interest was any ICD-10 diagnosis of seizures (G40/G41) at 1 year and 5 years following the index IS. We applied a conventional logistic regression and a Light Gradient Boosted Machine algorithm to predict the risk of seizures at 1 year and 5 years. The performance of the models was assessed using the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), F1 statistic, model accuracy, balanced accuracy, precision, and recall, with and without anti-seizure medication use in the models. Results: Our study cohort included 430,254 IS patients. Seizures were present in 18,502 (4.3%) and (5.3%) patients within 1 and 5 years after IS, respectively. At 1 year, the AUROC, AUPRC, F1 statistic, accuracy, balanced accuracy, precision, and recall were respectively 0.7854 (standard error: 0.0038), 0.2426 (0.0048), 0.2299 (0.0034), 0.8236 (0.001), 0.7226 (0.0049), 0.1415 (0.0021), and 0.6122 (0.0095).
Corresponding metrics at 5 years were 0.7607 (0.0031), 0.247 (0.0064), 0.2441 (0.0032), 0.8125 (0.0013), 0.7001 (0.0045), 0.155 (0.002) and 0.5745 (0.0095). Conclusion: Our findings suggest that ML models show good model performance for predicting seizures after IS.
Abstract
The Protein-RNA Interface Database (PRIDB) is a comprehensive database of protein-RNA interfaces extracted from complexes in the Protein Data Bank (PDB). It is designed to facilitate detailed analyses of individual protein-RNA complexes and their interfaces, in addition to automated generation of user-defined data sets of protein-RNA interfaces for statistical analyses and machine learning applications. For any chosen PDB complex or list of complexes, PRIDB rapidly displays interfacial amino acids and ribonucleotides within the primary sequences of the interacting protein and RNA chains. PRIDB also identifies ProSite motifs in protein chains and FR3D motifs in RNA chains and provides links to these external databases, as well as to structure files in the PDB. An integrated JMol applet is provided for visualization of interacting atoms and residues in the context of the 3D complex structures. The current version of PRIDB contains structural information regarding 926 protein-RNA complexes available in the PDB (as of 10 October 2010). Atomic- and residue-level contact information for the entire data set can be downloaded in a simple machine-readable format. Also, several non-redundant benchmark data sets of protein-RNA complexes are provided. The PRIDB database is freely available online at http://bindr.gdcb.iastate.edu/PRIDB.
Subjects
Databases, Protein; RNA-Binding Proteins/chemistry; RNA/chemistry; Amino Acids/chemistry; Binding Sites; Nucleic Acid Conformation; Protein Conformation; Ribonucleotides/chemistry; User-Computer Interface
Abstract
Protein-protein interactions play a ubiquitous role in biological function. Knowledge of the three-dimensional (3D) structures of the complexes they form is essential for understanding the structural basis of those interactions and how they orchestrate key cellular processes. Computational docking has become an indispensable alternative to the expensive and time-consuming experimental approaches for determining the 3D structures of protein complexes. Despite recent progress, identifying near-native models from a large set of conformations sampled by docking (the so-called scoring problem) still has considerable room for improvement. We present MetaScore, a new machine-learning-based approach to improve the scoring of docked conformations. MetaScore utilizes a random forest (RF) classifier trained to distinguish near-native from non-native conformations using their protein-protein interfacial features. The features include physicochemical properties, energy terms, interaction-propensity-based features, geometric properties, interface topology features, evolutionary conservation, and also scores produced by traditional scoring functions (SFs). MetaScore scores docked conformations by simply averaging the score produced by the RF classifier with that produced by any traditional SF. We demonstrate that (i) MetaScore consistently outperforms each of the nine traditional SFs included in this work in terms of success rate and hit rate evaluated over conformations ranked among the top 10; (ii) an ensemble method, MetaScore-Ensemble, that combines 10 variants of MetaScore obtained by combining the RF score with each of the traditional SFs outperforms each of the MetaScore variants. We conclude that the performance of traditional SFs can be improved upon by using machine learning to judiciously leverage protein-protein interfacial features and by using ensemble methods to combine multiple scoring functions.
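The averaging step described above can be illustrated with a short Python sketch. The abstract does not say how the RF score and a traditional SF score are brought onto a common scale, so the min-max normalization, the score orientation flag, and the function names below are assumptions made for illustration; the toy scores are invented.

```python
def min_max(values, higher_is_better=True):
    """Rescale scores to [0, 1] so that 1 always means 'best'.

    Assumption: min-max scaling is one simple way to make scores comparable.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def meta_score(rf_probabilities, sf_scores, sf_higher_is_better=False):
    """Average an RF near-native probability with a normalized SF score."""
    sf_normalized = min_max(sf_scores, sf_higher_is_better)
    return [(rf + sf) / 2 for rf, sf in zip(rf_probabilities, sf_normalized)]


# Three hypothetical docked conformations.
rf_probabilities = [0.9, 0.2, 0.6]      # RF estimate of being near-native
sf_energies = [-30.0, -10.0, -20.0]     # energy-like SF: lower is better

combined = meta_score(rf_probabilities, sf_energies)
# Indices of conformations ordered from best to worst combined score.
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
```

An ensemble in the spirit of MetaScore-Ensemble could then combine the `combined` lists obtained with different traditional SFs, although the combination rule is not given in the abstract.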
Subjects
Machine Learning; Proteins; Proteins/chemistry; Protein Binding; Ligands; Protein Conformation
Abstract
GOAL AND AIMS: Commonly used actigraphy algorithms are designed to operate within a known in-bed interval. However, in free-living scenarios this interval is often unknown. We trained and evaluated a sleep/wake classifier that operates on actigraphy over ~24-hour intervals, without knowledge of in-bed timing. FOCUS TECHNOLOGY: Actigraphy counts from ActiWatch Spectrum devices. REFERENCE TECHNOLOGY: Sleep staging derived from polysomnography, supplemented by observation of wakefulness outside of the staged interval. Classifications from the Oakley actigraphy algorithm were additionally used as performance reference. SAMPLE: Adults, sleeping in either a home or laboratory environment. DESIGN: Machine learning was used to train and evaluate a sleep/wake classifier in a supervised learning paradigm. The classifier is a temporal convolutional network, a form of deep neural network. CORE ANALYTICS: Performance was evaluated across ~24 hours, and additionally restricted to only in-bed intervals, both in terms of epoch-by-epoch performance, and the discrepancy of summary statistics within the intervals. ADDITIONAL ANALYTICS AND EXPLORATORY ANALYSES: Performance of the trained model applied to the Multi-Ethnic Study of Atherosclerosis dataset. CORE OUTCOMES: Over ~24 hours, the temporal convolutional network classifier performed as well as or better than the Oakley classifier on all measures tested. When restricting analysis to the in-bed interval, the temporal convolutional network remained favorable on several metrics. IMPORTANT SUPPLEMENTAL OUTCOMES: Performance decreased on the Multi-Ethnic Study of Atherosclerosis dataset, especially when restricting analysis to the in-bed interval. CORE CONCLUSION: A classifier using data labeled over ~24-hour intervals allows for the continuous classification of sleep/wake without knowledge of in-bed intervals. Further development should focus on improving generalization performance.
Subjects
Actigraphy; Atherosclerosis; Adult; Humans; Sleep; Polysomnography; Rest
Abstract
The increasing reliance of patients and caregivers on online communities for healthcare information has increased the spread of misinformation and of subjective, anecdotal, inaccurate, or non-specific recommendations, which, if acted on, could cause serious harm to patients. Hence, there is an urgent need to connect users with accurate and tailored health information in a timely manner to prevent such harm. This article proposes an innovative approach to suggesting reliable information to participants in online communities as they move through different stages in their disease or treatment. We hypothesize that patients with similar histories of disease progression or course of treatment would have similar information needs at comparable stages. Specifically, we pose the problem of predicting topic tags or keywords that describe the future information needs of users based on their profiles, traces of their online interactions within the community (past posts, replies) and the profiles and traces of online interactions of other users with similar profiles and similar traces of past interaction with the target users. The result is a variant of the collaborative information filtering or recommendation system tailored to the needs of users of online health communities. We report results of our experiments on two unique datasets from two different social media platforms, which demonstrate the superiority of the proposed approach over state-of-the-art baselines with respect to accurate and timely prediction of topic tags (and hence information sources of interest).
Subjects
Consumer Health Information; Social Media; Humans
Abstract
BACKGROUND: Identification of the residues in protein-protein interaction sites has a significant impact in problems such as drug discovery. Motivated by the observation that the set of interface residues of a protein tends to be conserved even among remote structural homologs, we introduce PrISE, a family of local structural similarity-based computational methods for predicting protein-protein interface residues. RESULTS: We present a novel representation of the surface residues of a protein in the form of structural elements. Each structural element consists of a central residue and its surface neighbors. The PrISE family of interface prediction methods uses a representation of structural elements that captures the atomic composition and accessible surface area of the residues that make up each structural element. Each member of the PrISE family identifies, for each structural element in the query protein, a collection of similar structural elements in its repository of structural elements and weights them according to their similarity with the structural element of the query protein. PrISEL relies on the similarity between structural elements (i.e. local structural similarity). PrISEG relies on the similarity between protein surfaces (i.e. general structural similarity). PrISEC combines local structural similarity and general structural similarity to predict interface residues. These predictors label the central residue of a structural element in a query protein as an interface residue if a weighted majority of the structural elements that are similar to it are interface residues, and as a non-interface residue otherwise. The results of our experiments using three representative benchmark datasets show that PrISEC outperforms PrISEL and PrISEG, and that PrISEC is highly competitive with state-of-the-art structure-based methods for predicting protein-protein interface residues.
Our comparison of PrISEC with PredUs, a recently developed method for predicting interface residues of a query protein based on the known interface residues of its (global) structural homologs, shows that performance superior or comparable to that of PredUs can be obtained using only local surface structural similarity. PrISEC is available as a Web server at http://prise.cs.iastate.edu/. CONCLUSIONS: Local surface structural similarity-based methods offer a simple, efficient, and effective approach to predicting protein-protein interface residues.
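The weighted-majority labeling rule described in the RESULTS can be sketched as follows. This is a minimal Python illustration under stated assumptions: the similarity weights are taken as given (how PrISE computes them is not detailed in the abstract), and the function name and toy neighbor lists are hypothetical.

```python
def weighted_majority_label(similar_elements):
    """Label a central residue as interface (True) or non-interface (False).

    `similar_elements` is a list of (weight, is_interface) pairs, one per
    structural element retrieved as similar to the query element. The residue
    is called an interface residue only if interface elements carry more than
    half of the total similarity weight.
    """
    total_weight = sum(weight for weight, _ in similar_elements)
    interface_weight = sum(weight for weight, is_interface in similar_elements if is_interface)
    return total_weight > 0 and interface_weight > total_weight / 2


# One highly similar interface element outweighs two weaker non-interface ones.
label_a = weighted_majority_label([(0.9, True), (0.4, False), (0.3, False)])

# Here the non-interface element carries slightly more weight.
label_b = weighted_majority_label([(0.5, True), (0.6, False)])
```

Because the vote is weighted, a single close structural match can decide the label even when it is outnumbered, which is the practical difference from a plain majority vote.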
Subjects
Protein Interaction Domains and Motifs; Proteins/chemistry; Software; Algorithms; Models, Molecular; Protein Conformation; Proteins/metabolism
Abstract
BACKGROUND: RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition 'code' that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction. RESULTS: We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naïve Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. 
Our results show that (i) Sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); (ii) Structure-based classifiers that use smoothed PSSM representation (SmoPSSMStr) outperform those that use PSSM (PSSMStr) as well as sequence identity based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues. CONCLUSIONS: Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. 
These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets that are constructed based on several alternative definitions of interface residues and redundancy cutoffs as well as including evaluations on independent test sets into the comparisons.
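The performance measures named in this abstract (Matthews Correlation Coefficient, Specificity, and Sensitivity) follow directly from confusion-matrix counts, using the standard definitions of Sensitivity as the true-positive rate and Specificity as the true-negative rate. The Python sketch below shows these definitions and the kind of Specificity/Sensitivity trade-off the abstract describes; the two sets of counts are invented for illustration and do not come from the study.

```python
import math


def sensitivity(tp, fn):
    """Fraction of true interface residues that are predicted as interface."""
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tn, fp):
    """Fraction of true non-interface residues that are predicted as non-interface."""
    return tn / (tn + fp) if tn + fp else 0.0


def matthews_corrcoef(tp, fp, tn, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical residue-level counts for two predictors on the same 100 residues.
balanced = {"tp": 30, "fp": 10, "tn": 50, "fn": 10}
conservative = {"tp": 20, "fp": 2, "tn": 58, "fn": 20}

mcc_balanced = matthews_corrcoef(**balanced)
mcc_conservative = matthews_corrcoef(**conservative)
```

In this toy case the conservative predictor gains Specificity and loses Sensitivity while its MCC stays close to that of the balanced predictor, which illustrates why reporting several measures together is more informative than any single one.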
Subjects
Artificial Intelligence; RNA-Binding Proteins/chemistry; RNA/chemistry; Algorithms; Amino Acids/chemistry; Bayes Theorem; Humans; Position-Specific Scoring Matrices; Protein Conformation; RNA/metabolism; RNA-Binding Proteins/metabolism; Sequence Analysis, Protein; Support Vector Machine
Abstract
Biweekly county COVID-19 data were linked with Longitudinal Employer-Household Dynamics data to analyze population risk exposures enabled by pre-pandemic, country-wide commuter networks. Results from fixed-effects, spatial, and computational statistical approaches showed that commuting network exposure to COVID-19 predicted an area's COVID-19 cases and deaths, indicating spillovers. Commuting spillovers between counties were independent from geographic contiguity, pandemic-time mobility, or social media ties. Results suggest that commuting connections form enduring social linkages with effects on health that can withstand mobility disruptions. Findings contribute to a growing relational view of health and place, with implications for neighborhood effects research and place-based policies.
Subjects
COVID-19; Social Media; COVID-19/epidemiology; Humans; Pandemics; Residence Characteristics; Transportation
Abstract
In this critical review, we examine the application of predictive models, for example, classifiers, trained using machine learning (ML) to assist in interpretation of functional neuroimaging data. Our primary goal is to summarize how ML is being applied and critically assess common practices. Our review covers 250 studies published using ML and resting-state functional MRI (fMRI) to infer various dimensions of the human functional connectome. Holdout ("lockbox") performance was, on average, ~13% less accurate than performance measured through cross-validation alone, highlighting the importance of lockbox data, which were included in only 16% of the studies. There was also a concerning lack of transparency across the key steps in training and evaluating predictive models. The summary of this literature underscores the importance of the use of a lockbox and highlights several methodological pitfalls that can be addressed by the imaging community. We argue that, ideally, studies are motivated both by the reproducibility and generalizability of findings as well as the potential clinical significance of the insights. We offer recommendations for principled integration of machine learning into the clinical neurosciences with the goal of advancing imaging biomarkers of brain disorders, understanding causative determinants for health risks, and parsing heterogeneous patient outcomes.
ABSTRACT
BACKGROUND: RNA-protein interactions (RPIs) play important roles in a wide variety of cellular processes, ranging from transcriptional and post-transcriptional regulation of gene expression to host defense against pathogens. High-throughput experiments to identify RNA-protein interactions are beginning to provide valuable information about the complexity of RNA-protein interaction networks, but are expensive and time-consuming. Hence, there is a need for reliable computational methods for predicting RNA-protein interactions. RESULTS: We propose RPISeq, a family of classifiers for predicting RNA-protein interactions using only sequence information. Given the sequences of an RNA and a protein as input, RPISeq predicts whether or not the RNA-protein pair interacts. The RNA sequence is encoded as a normalized vector of its ribonucleotide 4-mer composition, and the protein sequence is encoded as a normalized vector of its 3-mer composition, based on a 7-letter reduced alphabet representation. Two variants of RPISeq are presented: RPISeq-SVM, which uses a Support Vector Machine (SVM) classifier, and RPISeq-RF, which uses a Random Forest classifier. On two non-redundant benchmark datasets extracted from the Protein-RNA Interface Database (PRIDB), RPISeq achieved an AUC (Area Under the Receiver Operating Characteristic (ROC) curve) of 0.96 and 0.92. On a third dataset containing only mRNA-protein interactions, the performance of RPISeq was competitive with that of a published method that requires information regarding many different features (e.g., mRNA half-life, GO annotations) of the putative RNA and protein partners. In addition, RPISeq classifiers trained using the PRIDB data correctly predicted the majority (57-99%) of non-coding RNA-protein interactions in NPInter-derived networks from E. coli, S. cerevisiae, D. melanogaster, M. musculus, and H. sapiens.
CONCLUSIONS: Our experiments with RPISeq demonstrate that RNA-protein interactions can be reliably predicted using only sequence-derived information. RPISeq offers an inexpensive method for computational construction of RNA-protein interaction networks, and should provide useful insights into the function of non-coding RNAs. RPISeq is freely available as a web-based server at http://pridb.gdcb.iastate.edu/RPISeq/.
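The sequence encoding described in the abstract can be sketched as follows. Two details are assumptions because the abstract does not state them: the seven amino-acid groups (a conjoint-triad style grouping is used here for illustration) and the normalization (k-mer counts divided by their total). The function names are hypothetical and this is not the RPISeq implementation.

```python
from itertools import product

# Assumed 7-group reduced alphabet; the abstract does not list the groups.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}


def kmer_composition(sequence, alphabet, k):
    """Return k-mer frequencies in a fixed lexicographic order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # windows containing unknown symbols are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


def encode_rna(rna):
    """4-mer composition over A, C, G, U (4**4 = 256 features)."""
    return kmer_composition(rna.upper().replace("T", "U"), "ACGU", 4)


def encode_protein(protein):
    """3-mer composition over the 7-letter alphabet (7**3 = 343 features)."""
    reduced = "".join(AA_TO_GROUP.get(aa, "X") for aa in protein.upper())
    return kmer_composition(reduced, "0123456", 3)


def encode_pair(rna, protein):
    """Concatenate both encodings into one feature vector for a classifier."""
    return encode_rna(rna) + encode_protein(protein)


features = encode_pair("AUGGCUACGUAGC", "MKVLAAGIRDE")
print(len(features))  # 599
```

The resulting fixed-length vector (256 + 343 = 599 values) is the kind of input an SVM or Random Forest classifier can consume regardless of sequence length.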
Subjects
Algorithms , Protein Interaction Maps , Proteins/metabolism , RNA-Binding Proteins/metabolism , RNA/chemistry , RNA Sequence Analysis , Animals , Protein Databases , Drosophila melanogaster/metabolism , Escherichia coli/metabolism , Humans , Mice , RNA/metabolism , RNA Stability , Saccharomyces cerevisiae/metabolism , Software , Support Vector Machine
ABSTRACT
BACKGROUND: Although homology-based methods are among the most widely used methods for predicting the structure and function of proteins, the question as to whether interface sequence conservation can be effectively exploited in predicting protein-protein interfaces has been a subject of debate. RESULTS: We studied more than 300,000 pair-wise alignments of protein sequences from structurally characterized protein complexes, including both obligate and transient complexes. We identified sequence similarity criteria required for accurate homology-based inference of interface residues in a query protein sequence. Based on these analyses, we developed HomPPI, a class of sequence homology-based methods for predicting protein-protein interface residues. We present two variants of HomPPI: (i) NPS-HomPPI (Non partner-specific HomPPI), which can be used to predict interface residues of a query protein in the absence of knowledge of the interaction partner; and (ii) PS-HomPPI (Partner-specific HomPPI), which can be used to predict the interface residues of a query protein with a specific target protein. Our experiments on a benchmark dataset of obligate homodimeric complexes show that NPS-HomPPI can reliably predict protein-protein interface residues in a given protein, with an average correlation coefficient (CC) of 0.76, sensitivity of 0.83, and specificity of 0.78, when sequence homologs of the query protein can be reliably identified. NPS-HomPPI also reliably predicts the interface residues of intrinsically disordered proteins. Our experiments suggest that NPS-HomPPI is competitive with several state-of-the-art interface prediction servers, including those that exploit the structure of the query proteins.
On a large dataset of transient complexes, the partner-specific classifier, PS-HomPPI, can predict the interface residues of a query protein with a specific target, with a CC of 0.65, sensitivity of 0.69, and specificity of 0.70, when homologs of both the query and the target can be reliably identified. The HomPPI web server is available at http://homppi.cs.iastate.edu/. CONCLUSIONS: Sequence homology-based methods offer a class of computationally efficient and reliable approaches for predicting the protein-protein interface residues that participate in either obligate or transient interactions. For query proteins involved in transient interactions, the reliability of interface residue prediction can be improved by exploiting knowledge of putative interaction partners.
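The abstract reports a correlation coefficient, sensitivity, and specificity without defining them. The sketch below computes the textbook per-residue versions (Matthews correlation coefficient, TP/(TP+FN), TN/(TN+FP)) and also precision, TP/(TP+FP), since some interface-prediction papers report that quantity under the name "specificity". Which definitions the authors used is an assumption to check against the full paper; the function name and the example labels are hypothetical.

```python
from math import sqrt


def interface_metrics(predicted, actual):
    """Per-residue metrics from equal-length 0/1 label sequences."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p == 1 and a == 1)
    tn = sum(1 for p, a in pairs if p == 0 and a == 0)
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "cc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical 8-residue protein: 1 = interface residue, 0 = non-interface.
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
actual = [1, 0, 0, 0, 1, 1, 0, 0]
metrics = interface_metrics(predicted, actual)
print(metrics)
```

Here TP = 2, FP = 1, FN = 1, and TN = 4, so sensitivity and precision are both 2/3, specificity is 0.8, and the correlation coefficient is 7/15.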
Subjects
Proteins/metabolism , Protein Sequence Analysis/methods , Software , Algorithms , Amino Acid Sequence , Humans , Proteins/chemistry , Sequence Homology
ABSTRACT
BACKGROUND: Determination of protein subcellular localization plays an important role in understanding protein function. Knowledge of the subcellular localization is also essential for genome annotation and drug discovery. Supervised machine learning methods for predicting the localization of a protein in a cell rely on the availability of large amounts of labeled data. However, because of the high cost and effort involved in labeling the data, the amount of labeled data is quite small compared to the amount of unlabeled data. Hence, there is a growing interest in developing semi-supervised methods for predicting protein subcellular localization from large amounts of unlabeled data together with small amounts of labeled data. RESULTS: In this paper, we present an Abstraction Augmented Markov Model (AAMM) based approach to the semi-supervised protein subcellular localization prediction problem. We investigate the effectiveness of AAMMs in exploiting unlabeled data. We compare semi-supervised AAMMs with: (i) Markov models (MMs), which do not take advantage of unlabeled data; (ii) expectation maximization (EM) based semi-supervised training of MMs; and (iii) co-training based semi-supervised training of MMs, both of which make use of unlabeled data. CONCLUSIONS: The results of our experiments on three protein subcellular localization data sets show that semi-supervised AAMMs: (i) can effectively exploit unlabeled data; (ii) are more accurate than both the MMs and the EM based semi-supervised MMs; and (iii) are comparable in performance to, and in some cases outperform, the co-training based semi-supervised MMs.
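For orientation, the supervised Markov model baseline mentioned in the abstract can be sketched as one model per localization class, with a sequence assigned to the class whose model gives it the highest log-likelihood. This is a generic first-order Markov classifier with add-one smoothing, not the AAMM method; the class labels, training sequences, and the choice to ignore class priors are illustrative assumptions.

```python
from collections import defaultdict
from math import log

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


class MarkovModel:
    """k-th order Markov model over sequences with pseudocount smoothing."""

    def __init__(self, order=1, alphabet=AMINO_ACIDS, pseudocount=1.0):
        self.order = order
        self.alphabet = alphabet
        self.pseudocount = pseudocount
        self.counts = defaultdict(lambda: defaultdict(float))

    def fit(self, sequences):
        for seq in sequences:
            for i in range(self.order, len(seq)):
                self.counts[seq[i - self.order:i]][seq[i]] += 1.0
        return self

    def log_likelihood(self, seq):
        """Sum of smoothed log transition probabilities (start symbols ignored)."""
        total = 0.0
        for i in range(self.order, len(seq)):
            row = self.counts.get(seq[i - self.order:i], {})
            numerator = row.get(seq[i], 0.0) + self.pseudocount
            denominator = sum(row.values()) + self.pseudocount * len(self.alphabet)
            total += log(numerator / denominator)
        return total


def classify(seq, models):
    """Pick the class whose model scores the sequence highest (uniform priors)."""
    return max(models, key=lambda label: models[label].log_likelihood(seq))


# Toy, hypothetical training data for two made-up classes.
models = {
    "nuclear": MarkovModel().fit(["KRKRKKRK", "RKKRKRKK"]),
    "membrane": MarkovModel().fit(["LLVLIVLL", "VLLILLVI"]),
}
print(classify("KRKK", models), classify("LLVL", models))
```

A model of this kind uses only labeled sequences, which is the limitation the semi-supervised approaches in the abstract aim to address.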
Subjects
Computational Biology/methods , Markov Chains , Proteins/classification , Proteins/metabolism , Subcellular Fractions/metabolism , Algorithms , Artificial Intelligence , Cluster Analysis , Protein Databases , Biological Models , Plant Proteins/chemistry , Plant Proteins/metabolism , Proteins/chemistry , Reproducibility of Results , Subcellular Fractions/chemistry
ABSTRACT
BACKGROUND: Ortholog detection methods present a powerful approach for finding genes that participate in similar biological processes across different organisms, extending our understanding of interactions between genes across different pathways, and understanding the evolution of gene families. RESULTS: We exploit features derived from the alignment of protein-protein interaction networks and gene-coexpression networks to reconstruct KEGG orthologs using decision tree, Naive Bayes, and Support Vector Machine classification algorithms. The protein-protein interaction networks for Drosophila melanogaster, Saccharomyces cerevisiae, Mus musculus, and Homo sapiens were extracted from the DIP repository, and the gene coexpression networks for Mus musculus, Homo sapiens, and Sus scrofa were extracted from NCBI's Gene Expression Omnibus. CONCLUSIONS: The performance of our classifiers in reconstructing KEGG orthologs is compared against a basic reciprocal BLAST hit approach. We provide implementations of the resulting algorithms as part of BiNA, an open source biomolecular network alignment toolkit.
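The reciprocal BLAST hit baseline can be sketched in a few lines, assuming the two all-against-all BLAST runs have already been parsed into score dictionaries. The input layout, gene identifiers, and scores below are hypothetical, and tie-breaking is left to dictionary order.

```python
def best_hits(scores):
    """Map each query to its highest-scoring target.

    scores: {(query_gene, target_gene): bit_score}
    """
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return gene pairs that are each other's best hit in both directions."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Hypothetical scores for species A genes (a*) and species B genes (b*).
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50, ("a2", "b2"): 120, ("a3", "b2"): 150}
b_vs_a = {("b1", "a1"): 190, ("b2", "a3"): 140, ("b2", "a2"): 110}
print(reciprocal_best_hits(a_vs_b, b_vs_a))  # [('a1', 'b1'), ('a3', 'b2')]
```

Gene a2 is excluded because its best hit, b2, prefers a3 in the reverse search; this strictness is what makes the approach a conservative baseline.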
Subjects
Algorithms , Genetic Databases , Gene Expression , Genomics/methods , Protein Interaction Mapping/methods , Animals , Artificial Intelligence , Bayes Theorem , Decision Trees , Drosophila melanogaster , Humans , Mice , Proteins/genetics , ROC Curve , Saccharomyces cerevisiae , Sequence Alignment , Swine
ABSTRACT
Computational docking is a promising tool to model three-dimensional (3D) structures of protein-protein complexes, which provide fundamental insights into protein functions in cellular life. Singling out near-native models from the huge pool of generated docking models (referred to as the scoring problem) remains a major challenge in computational docking. We recently published iScore, a novel graph-kernel-based scoring function. iScore ranks docking models based on their interface graph similarities to the training interface graph set. iScore uses a support vector machine approach with random-walk graph kernels to classify and rank protein-protein interfaces. Here, we present the software for iScore. The software provides executable scripts that fully automate the computational workflow. In addition, the creation and analysis of the interface graph can be distributed across different processes using Message Passing Interface (MPI) and can be offloaded to GPUs thanks to dedicated CUDA kernels.
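To give a sense of what a random-walk graph kernel measures, the sketch below counts label-matching walks shared by two small graphs, down-weighting longer walks. It is a simplified, truncated variant with discrete node labels; iScore's actual kernel, node features, and parameters are not described in the abstract, so the labels, the decay factor, and the step limit here are assumptions.

```python
def neighbours(edges):
    """Build an undirected adjacency map from a set of (u, v) edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def random_walk_kernel(graph_a, graph_b, max_steps=3, decay=0.5):
    """Weighted count of common walks, via the direct product graph.

    Each graph is (labels, edges): labels maps node -> label, edges is a
    set of undirected (u, v) pairs. Walks of length k get weight decay**k.
    """
    labels_a, edges_a = graph_a
    labels_b, edges_b = graph_b
    adj_a, adj_b = neighbours(edges_a), neighbours(edges_b)
    # Product-graph nodes pair up nodes that carry the same label.
    nodes = {(u, v) for u in labels_a for v in labels_b
             if labels_a[u] == labels_b[v]}
    walks = {node: 1.0 for node in nodes}  # walks of length 0
    kernel, weight = float(len(nodes)), 1.0
    for _ in range(max_steps):
        weight *= decay
        extended = {node: 0.0 for node in nodes}
        for (u, v), count in walks.items():
            for u2 in adj_a.get(u, ()):
                for v2 in adj_b.get(v, ()):
                    if (u2, v2) in nodes:
                        extended[(u2, v2)] += count
        walks = extended
        kernel += weight * sum(walks.values())
    return kernel


# Hypothetical interface graphs: nodes are residues, labels are residue classes.
g1 = ({1: "H", 2: "P"}, {(1, 2)})
g2 = ({1: "H", 2: "P", 3: "H"}, {(1, 2), (2, 3)})
print(random_walk_kernel(g1, g1))  # 3.75
```

A kernel value like this can serve as the similarity function of a support vector machine, so graphs that share many labeled walks with near-native training interfaces score higher.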