ABSTRACT
Toxicological information as needed for risk assessments of chemical compounds is often sparse. Unfortunately, gathering new toxicological information experimentally often involves animal testing. Simulated alternatives, e.g., quantitative structure-activity relationship (QSAR) models, are preferred to infer the toxicity of new compounds. Aquatic toxicity data collections consist of many related tasks, each predicting the toxicity of new compounds for a given species. Since many of these tasks are inherently low-resource, i.e., involve few associated compounds, modelling them is challenging. Meta-learning is a subfield of artificial intelligence that can lead to more accurate models by enabling the utilization of information across tasks. In our work, we benchmark various state-of-the-art meta-learning techniques for building QSAR models, focusing on knowledge sharing between species. Specifically, we employ and compare transformational machine learning, model-agnostic meta-learning, fine-tuning, and multi-task models. Our experiments show that established knowledge-sharing techniques outperform single-task approaches. We recommend the use of multi-task random forest models for aquatic toxicity modeling, which matched or exceeded the performance of other approaches and robustly produced good results in the low-resource settings we studied. This model functions on a species level, predicting toxicity for multiple species across various phyla, with flexible exposure duration and on a large chemical applicability domain.
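As an illustration of the recommended multi-task approach, the sketch below trains a single random forest on compounds from all species at once, with the species identity one-hot encoded as extra input features. The descriptor columns, species names, and toy data are assumptions for illustration, not the paper's actual pipeline.

```python
# Minimal multi-task random forest sketch: one model shared across species,
# with species identity encoded as additional input features (assumed setup).
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame(rng.normal(size=(n, 8)),
                 columns=[f"desc{i}" for i in range(8)])  # molecular descriptors
X["species"] = rng.choice(["D. magna", "O. mykiss", "L. macrochirus"], size=n)
y = X["desc0"] + 0.5 * (X["species"] == "O. mykiss") + rng.normal(0, 0.1, n)

pre = ColumnTransformer([("species", OneHotEncoder(handle_unknown="ignore"),
                          ["species"])], remainder="passthrough")
model = Pipeline([("pre", pre),
                  ("rf", RandomForestRegressor(n_estimators=300, random_state=0))])
model.fit(X, y)                        # one forest trained on all tasks jointly
print(model.predict(X.head()))
```

An exposure-duration column could be added to the feature frame in the same way, which is how a single model can serve flexible durations.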
Subjects
Artificial Intelligence, Quantitative Structure-Activity Relationship, Animals, Fishes
ABSTRACT
Given the common problem of missing data in real-world applications from various fields, such as remote sensing, ecology and meteorology, the interpolation of missing spatial and spatio-temporal data can be of tremendous value. Existing methods for spatial interpolation, most notably Gaussian processes and spatial autoregressive models, tend to suffer from (a) a trade-off between modelling local or global spatial interaction, (b) the assumption that there is only one possible path between two points, and (c) the assumption of homogeneity of intermediate locations between points. Addressing these issues, we propose a value propagation-based spatial interpolation method called VPint, inspired by Markov reward processes (MRPs), and introduce two variants thereof: (i) a static discount (SD-MRP) and (ii) a data-driven weight prediction (WP-MRP) variant. Both interpolation variants operate locally, while implicitly accounting for global spatial relationships in the entire system through recursion. We evaluated our proposed methods by comparing the mean absolute error, root mean squared error, peak signal-to-noise ratio and structural similarity of interpolated grid cells to those of 8 common baselines. Our analysis involved detailed experiments on a synthetic and two real-world datasets, as well as experiments on convergence and scalability. Empirical results demonstrate the competitive advantage of VPint on randomly missing data, where it performed better than baselines in terms of mean absolute error and structural similarity, as well as on spatially clustered missing data, where it performed best on 2 out of 3 datasets.
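The value-propagation idea behind the static-discount variant can be sketched in a few lines: every missing grid cell repeatedly takes a discounted mean of its neighbours' current values, so information from observed cells spreads recursively through the grid. The discount value and update rule below are a simplified reading of the abstract, not VPint's exact formulation.

```python
# Simplified static-discount value propagation in the spirit of SD-MRP:
# missing cells converge to discounted neighbourhood means (assumed rule).
import numpy as np

def sd_mrp_interpolate(grid, gamma=0.9, n_iter=100):
    known = ~np.isnan(grid)
    values = np.where(known, grid, np.nanmean(grid))  # initialise missing cells
    for _ in range(n_iter):
        padded = np.pad(values, 1, mode="edge")
        neighbour_mean = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                          padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
        values = np.where(known, grid, gamma * neighbour_mean)
    return values

grid = np.array([[1.0, np.nan, 3.0],
                 [np.nan, np.nan, np.nan],
                 [2.0, np.nan, 4.0]])
print(sd_mrp_interpolate(grid).round(2))
```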
ABSTRACT
BACKGROUND: Predicting the onset and course of mood and anxiety disorders is of clinical importance but remains difficult. We compared the predictive performances of traditional logistic regression, basic probabilistic machine learning (ML) methods, and automated ML (Auto-sklearn). METHODS: Data were derived from the Netherlands Study of Depression and Anxiety. We compared how well multinomial logistic regression, a naïve Bayes classifier, and Auto-sklearn predicted depression and anxiety diagnoses at a 2-, 4-, 6-, and 9-year follow-up, operationalized as binary or categorical variables. Predictor sets included demographic and self-report data, which can be easily collected in clinical practice at two initial time points (baseline and 1-year follow-up). RESULTS: At baseline, participants were 42.2 years old on average, 66.5% were women, and 53.6% had a current mood or anxiety disorder. The three methods were similarly successful in predicting (mental) health status, with correct predictions for up to 79% (95% CI 75-81%) of participants. However, Auto-sklearn was superior when assessing a more complex dataset with individual item scores. CONCLUSIONS: Automated ML methods added only limited value compared to traditional data modelling when predicting the onset and course of depression and anxiety. However, they hold potential for automatization and may be better suited for complex datasets.
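A minimal sketch of this kind of three-way comparison, using scikit-learn for the two classical baselines and, optionally, auto-sklearn for the automated search. The feature matrix below is a synthetic stand-in, since the NESDA data cannot be reproduced here.

```python
# Compare multinomial logistic regression and naive Bayes on toy data;
# auto-sklearn is sketched in comments since it is an optional dependency.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))          # stand-in for self-report/demographic items
y = rng.integers(0, 3, size=500)        # stand-in for diagnosis category

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("naive Bayes", GaussianNB())]:
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.2f}")

# auto-sklearn, if installed, searches over whole pipelines automatically:
# from autosklearn.classification import AutoSklearnClassifier
# automl = AutoSklearnClassifier(time_left_for_this_task=300)
# automl.fit(X, y)
```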
Subjects
Anxiety Disorders, Machine Learning, Adult, Anxiety/diagnosis, Anxiety Disorders/diagnosis, Bayes Theorem, Female, Humans, Logistic Models
ABSTRACT
A volunteer effort by Artificial Intelligence (AI) researchers has shown it can deliver significant research outcomes rapidly to help tackle COVID-19. Within two months, CLAIRE's self-organising volunteers delivered the world's first comprehensive curated repository of COVID-19-related datasets useful for drug repurposing, drafted review papers on the role CT/X-ray scan analysis and robotics could play, and progressed research in other areas. Given the pace required and the nature of voluntary efforts, the teams faced a number of challenges. These offer insights into how to better prepare for future volunteer scientific efforts and large-scale, data-dependent AI collaborations in general. We offer seven recommendations on how to best leverage such efforts and collaborations in the context of managing future crises.
ABSTRACT
Kinases are frequently studied in the context of anticancer drugs. Their involvement in cell responses, such as proliferation, differentiation, and apoptosis, makes them interesting subjects in multitarget drug design. In this study, a workflow is presented that models the bioactivity spectra for two panels of kinases: (1) inhibition of RET, BRAF, SRC, and S6K, while avoiding inhibition of MKNK1, TTK, ERK8, PDK1, and PAK3, and (2) inhibition of AURKA, PAK1, FGFR1, and LKB1, while avoiding inhibition of PAK3, TAK1, and PIK3CA. Both statistical and structure-based models were included, which were thoroughly benchmarked and optimized. A virtual screening was performed to test the workflow for one of the main targets, RET kinase. This resulted in 5 novel and chemically dissimilar RET inhibitors with remaining RET activity of <60% (at a concentration of 10 µM) and similarities to known RET inhibitors ranging from 0.18 to 0.29 (Tanimoto, ECFP6). The four most potent inhibitors were further assessed over a concentration range and proved to be modestly active, with a pIC50 value of 5.1 for the most active compound. The experimental validation of inhibitors for RET strongly indicates that the multitarget workflow is able to detect novel inhibitors for kinases, and hence, this workflow can potentially be applied in polypharmacology modeling. We conclude that this approach can identify new chemical matter for existing targets. Moreover, this workflow can easily be applied to other targets as well.
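The dissimilarity criterion above can be checked with a few lines of RDKit: ECFP6-style Morgan fingerprints (radius 3) and Tanimoto similarity between a candidate and known inhibitors. The SMILES strings below are placeholders, not the actual screened compounds.

```python
# Tanimoto similarity on ECFP6-like (radius-3) Morgan fingerprints;
# all molecules here are illustrative placeholders.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

known_ret_inhibitors = ["CCOc1ccc2ncc(C#N)c(Nc3ccc(F)cc3)c2c1",  # placeholder
                        "Cc1ccc(Nc2ncccn2)cc1"]                  # placeholder
candidate = "COc1ccccc1NC(=O)c1ccco1"                            # placeholder

def ecfp6(smiles, n_bits=2048):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=n_bits)

cand_fp = ecfp6(candidate)
for smi in known_ret_inhibitors:
    sim = DataStructs.TanimotoSimilarity(cand_fp, ecfp6(smi))
    print(f"Tanimoto(candidate, known) = {sim:.2f}")  # novelty: e.g. below ~0.3
```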
Subjects
Antineoplastic Agents, Proto-Oncogene Proteins c-ret, Antineoplastic Agents/pharmacology, Drug Design, Polypharmacology, Protein Kinase Inhibitors/pharmacology
ABSTRACT
Automatic algorithm configuration (AAC) is becoming a key ingredient in the design of high-performance solvers for challenging optimisation problems. However, most existing work on AAC deals with configuration procedures that optimise a single performance metric of a given, single-objective algorithm. Of course, these configurators can also be used to optimise the performance of multi-objective algorithms, as measured by a single performance indicator. In this work, we demonstrate that better results can be obtained by using a native, multi-objective algorithm configuration procedure. Specifically, we compare three AAC approaches: one considering only the hypervolume indicator, a second optimising the weighted sum of hypervolume and spread, and a third that simultaneously optimises these complementary indicators, using a genuinely multi-objective approach. We assess these approaches by applying them to a highly parametric local search framework for two widely studied multi-objective optimisation problems, the bi-objective permutation flowshop and travelling salesman problems. Our results show that multi-objective algorithms are indeed best configured using a multi-objective configurator.
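For concreteness, the single metric optimised by the first AAC approach, the hypervolume indicator, can be computed for a bi-objective minimisation front as follows; the front and reference point are toy values.

```python
# Hypervolume of a bi-objective minimisation front: area dominated by the
# front up to a reference point, computed with a simple sweep.
def hypervolume_2d(front, ref):
    pts = sorted(front)                      # ascending in first objective
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        if f2 < prev_f2:                     # only non-dominated slices add area
            hv += (ref[0] - f1) * (prev_f2 - f2)
            prev_f2 = f2
    return hv

front = [(0.2, 0.9), (0.4, 0.5), (0.7, 0.3)]
print(hypervolume_2d(front, ref=(1.0, 1.0)))  # 0.38; larger is better
```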
Subjects
Algorithms, Computer Simulation, Information Storage and Retrieval/methods, Theoretical Models, Automated Pattern Recognition/methods, Problem Solving, Humans
ABSTRACT
It has long been observed that for practically any computational problem that has been intensely studied, different instances are best solved using different algorithms. This is particularly pronounced for computationally hard problems, where in most cases, no single algorithm defines the state of the art; instead, there is a set of algorithms with complementary strengths. This performance complementarity can be exploited in various ways, one of which is based on the idea of selecting, from a set of given algorithms, for each problem instance to be solved the one expected to perform best. The task of automatically selecting an algorithm from a given set is known as the per-instance algorithm selection problem and has been intensely studied over the past 15 years, leading to major improvements in the state of the art in solving a growing number of discrete combinatorial problems, including propositional satisfiability and AI planning. Per-instance algorithm selection also shows much promise for boosting performance in solving continuous and mixed discrete/continuous optimisation problems. This survey provides an overview of research in automated algorithm selection, ranging from early and seminal works to recent and promising application areas. Different from earlier work, it covers applications to discrete and continuous problems, and discusses algorithm selection in context with conceptually related approaches, such as algorithm configuration, scheduling, or portfolio selection. Since informative and cheaply computable problem instance features provide the basis for effective per-instance algorithm selection systems, we also provide an overview of such features for discrete and continuous problems. Finally, we provide perspectives on future work in the area and discuss a number of open research challenges.
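A minimal per-instance algorithm selector in the empirical-performance-model style: one regression model per algorithm predicts its runtime from instance features, and the selector picks the algorithm with the lowest prediction. Features and runtimes below are synthetic stand-ins.

```python
# Per-instance algorithm selection via per-algorithm runtime regression.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_train, n_feat = 300, 10
X = rng.normal(size=(n_train, n_feat))           # instance features
runtimes = {                                      # complementary toy solvers
    "algo_A": np.exp(X[:, 0]) + rng.gamma(1.0, 0.1, n_train),
    "algo_B": np.exp(-X[:, 0]) + rng.gamma(1.0, 0.1, n_train),
}

models = {name: RandomForestRegressor(random_state=0).fit(X, t)
          for name, t in runtimes.items()}

def select(features):
    preds = {name: m.predict(features.reshape(1, -1))[0]
             for name, m in models.items()}
    return min(preds, key=preds.get)              # lowest predicted runtime wins

print(select(rng.normal(size=n_feat)))            # name of the chosen solver
```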
Subjects
Algorithms, Computer Simulation, Information Storage and Retrieval/methods, Automated Pattern Recognition/methods, Decision Support Techniques, Humans, Surveys and Questionnaires
ABSTRACT
The Travelling Salesperson Problem (TSP) is one of the best-studied NP-hard problems. Over the years, many different solution approaches and solvers have been developed. For the first time, we directly compare five state-of-the-art inexact solvers, namely LKH, EAX, restart variants of these two, and MAOS, on a large set of well-known benchmark instances and demonstrate complementary performance, in that different instances may be solved most effectively by different algorithms. We leverage this complementarity to build an algorithm selector, which selects the best TSP solver on a per-instance basis and thus achieves significantly improved performance compared to the single best solver, representing an advance in the state of the art in solving the Euclidean TSP. Our in-depth analysis of the selectors provides insight into what drives this performance improvement.
ABSTRACT
We present a significantly improved version of the flowType and RchyOptimyx Bioconductor-based pipeline that is both 14 times faster and able to accommodate multiple levels of biomarker expression for up to 96 markers. With these improvements, the pipeline is positioned to be an integral part of data analysis for high-throughput experiments on high-dimensional single-cell assay platforms, including flow cytometry, mass cytometry and single-cell RT-qPCR.
Subjects
Flow Cytometry/methods, CD Antigens/analysis, Biomarkers/analysis, Software
ABSTRACT
Our homes and workspaces are filled with collections of dozens of artifacts laid out on surfaces such as shelves, counters, and mantles. The content and layout of these arrangements reflect both context, e.g., kitchen or living room, and style, e.g., neat or messy. Manually assembling such arrangements in virtual scenes is highly time-consuming, especially when one needs to generate multiple diverse arrangements for numerous support surfaces and living spaces. We present a data-driven method especially designed for artifact arrangement which automatically populates empty surfaces with diverse, believable arrangements of artifacts in a given style. The input to our method is an annotated photograph or a 3D model of an exemplar arrangement that reflects the desired context and style. Our method leverages this exemplar to generate diverse arrangements reflecting the exemplar style for arbitrary furniture setups and layout dimensions. To simultaneously achieve scalability, diversity and style preservation, we define a valid solution space of arrangements that reflect the input style. We obtain solutions within this space using barrier functions and stochastic optimization.
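A deliberately simplified 1D stand-in for the optimisation scheme just described: objects are placed on a shelf so that the gaps match an exemplar-derived target (the "style" term), while a log-barrier keeps the layout valid (inside the shelf, no overlaps) and a stochastic local search explores the space. All constants are illustrative assumptions, not the paper's 2D formulation.

```python
# Barrier-penalised stochastic search for a toy 1D shelf arrangement.
import numpy as np

rng = np.random.default_rng(0)
shelf, target_gap = 6.0, 0.5          # exemplar-derived gap is an assumption

def loss(x):                          # x: centres of three unit-width objects
    order = np.sort(x)
    gaps = np.diff(order) - 1.0       # clearance between neighbours
    margins = np.concatenate([gaps, [order[0] - 0.5, shelf - order[-1] - 0.5]])
    if np.any(margins <= 0):
        return np.inf                 # barrier: reject invalid layouts outright
    style = np.sum((gaps - target_gap) ** 2)
    return style - 0.01 * np.sum(np.log(margins))

x = np.array([1.0, 3.0, 5.0])
for _ in range(5000):                 # simple stochastic local search
    cand = x + rng.normal(0, 0.1, size=3)
    if loss(cand) < loss(x):
        x = cand
print(np.sort(x).round(2))
```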
ABSTRACT
BACKGROUND: Accurate structure prediction methods play an important role for the understanding of RNA function. Energy-based, pseudoknot-free secondary structure prediction is one of the most widely used and versatile approaches, and improved methods for this task have received much attention over the past five years. Despite the impressive progress that has been achieved in this area, existing evaluations of the prediction accuracy achieved by various algorithms do not provide a comprehensive, statistically sound assessment. Furthermore, while there is increasing evidence that no prediction algorithm consistently outperforms all others, no work has been done to exploit the complementary strengths of multiple approaches. RESULTS: In this work, we present two contributions to the area of RNA secondary structure prediction. Firstly, we use state-of-the-art, resampling-based statistical methods together with a previously published and increasingly widely used dataset of high-quality RNA structures to conduct a comprehensive evaluation of existing RNA secondary structure prediction procedures. The results from this evaluation clarify the performance relationship between ten well-known existing energy-based pseudoknot-free RNA secondary structure prediction methods and clearly demonstrate the progress that has been achieved in recent years. Secondly, we introduce AveRNA, a generic and powerful method for combining a set of existing secondary structure prediction procedures into an ensemble-based method that achieves significantly higher prediction accuracies than obtained from any of its component procedures. CONCLUSIONS: Our new, ensemble-based method, AveRNA, improves the state of the art for energy-based, pseudoknot-free RNA secondary structure prediction by exploiting the complementary strengths of multiple existing prediction procedures, as demonstrated using a state-of-the-art statistical resampling approach. In addition, AveRNA allows an intuitive and effective control of the trade-off between false negative and false positive base pair predictions. Finally, AveRNA can make use of arbitrary sets of secondary structure prediction procedures and can therefore be used to leverage improvements in prediction accuracy offered by algorithms and energy models developed in the future. Our data, MATLAB software and a web-based version of AveRNA are publicly available at http://www.cs.ubc.ca/labs/beta/Software/AveRNA.
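The ensemble principle behind AveRNA can be illustrated with plain base-pair voting: each component predictor contributes its predicted pairs, and a pair is kept when its vote share exceeds a threshold, which is also the knob that trades false negatives against false positives. The real method additionally guarantees a consistent output structure; the toy predictions below do not.

```python
# Threshold-based base-pair voting over an ensemble of predictors.
from collections import Counter

predictions = [                       # base-pair sets from three toy predictors
    {(0, 20), (1, 19), (2, 18)},
    {(0, 20), (1, 19), (5, 12), (7, 10)},
    {(0, 20), (2, 18), (5, 12)},
]

def ensemble_pairs(predictions, t=0.5):
    votes = Counter(p for pred in predictions for p in pred)
    n = len(predictions)
    return {pair for pair, v in votes.items() if v / n > t}

print(sorted(ensemble_pairs(predictions, t=0.5)))  # majority vote: 4 pairs
print(sorted(ensemble_pairs(predictions, t=0.2)))  # looser: 5 pairs, more FPs
```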
Subjects
Algorithms, RNA/chemistry, Nucleic Acid Conformation, Software
ABSTRACT
Traditional methods for flow cytometry (FCM) data processing rely on subjective manual gating. Recently, several groups have developed computational methods for identifying cell populations in multidimensional FCM data. The Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP) challenges were established to compare the performance of these methods on two tasks: (i) mammalian cell population identification, to determine whether automated algorithms can reproduce expert manual gating and (ii) sample classification, to determine whether analysis pipelines can identify characteristics that correlate with external variables (such as clinical outcome). This analysis presents the results of the first FlowCAP challenges. Several methods performed well as compared to manual gating or external variables using statistical performance measures, which suggests that automated methods have reached a sufficient level of maturity and accuracy for reliable use in FCM data analysis.
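One way such a comparison against manual gating can be scored (a simplification of the FlowCAP-style F-measure evaluation, with toy labels): match predicted clusters to manual populations by Hungarian assignment on pairwise F-measures and average the matched scores.

```python
# Score automated cluster labels against manual gates via matched F-measures.
import numpy as np
from scipy.optimize import linear_sum_assignment

def mean_matched_f1(manual, predicted):
    m_ids, p_ids = np.unique(manual), np.unique(predicted)
    f1 = np.zeros((len(m_ids), len(p_ids)))
    for i, m in enumerate(m_ids):
        for j, p in enumerate(p_ids):
            tp = np.sum((manual == m) & (predicted == p))
            prec = tp / max(np.sum(predicted == p), 1)
            rec = tp / max(np.sum(manual == m), 1)
            f1[i, j] = 0 if tp == 0 else 2 * prec * rec / (prec + rec)
    rows, cols = linear_sum_assignment(-f1)        # maximise total F-measure
    return f1[rows, cols].mean()

manual = np.array([0, 0, 0, 1, 1, 2, 2, 2])        # toy manual gates
predicted = np.array([0, 0, 1, 1, 1, 2, 2, 0])     # toy automated clusters
print(round(mean_matched_f1(manual, predicted), 3))
```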
Subjects
Computational Biology, Flow Cytometry/methods, Computer-Assisted Image Processing, Algorithms, Animals, Cluster Analysis, Statistical Data Interpretation, Flow Cytometry/standards, Flow Cytometry/statistics & numerical data, Graft-versus-Host Disease/blood, Graft-versus-Host Disease/pathology, Humans, Mononuclear Leukocytes/pathology, Mononuclear Leukocytes/virology, Diffuse Large B-Cell Lymphoma/blood, Diffuse Large B-Cell Lymphoma/pathology, Reproducibility of Results, Sensitivity and Specificity, Software, West Nile Fever/blood, West Nile Fever/pathology, West Nile Fever/virology
ABSTRACT
Analysis of high-dimensional flow cytometry datasets can reveal novel cell populations with poorly understood biology. Following discovery, characterization of these populations in terms of the critical markers involved is an important step, as this can help to both better understand the biology of these populations and aid in designing simpler marker panels to identify them on simpler instruments and with fewer reagents (i.e., in resource-poor or highly regulated clinical settings). However, current tools for designing panels work exclusively from technical parameters (e.g., instrument configurations, spectral overlap, and reagent availability) rather than the biological characteristics of the target cell populations. To address this shortcoming, we developed RchyOptimyx (cellular hieraRCHY OPTIMization), a computational tool that constructs cellular hierarchies by combining automated gating with dynamic programming and graph theory to provide the best gating strategies to identify a target population to a desired level of purity or correlation with a clinical outcome, using the simplest possible marker panels. RchyOptimyx can assess and graphically present the trade-offs between marker choice and population specificity in high-dimensional flow or mass cytometry datasets. We present three proof-of-concept use cases for RchyOptimyx that involve 1) designing a panel of surface markers for identification of rare populations that are primarily characterized using their intracellular signature; 2) simplifying the gating strategy for identification of a target cell population; and 3) identifying a non-redundant marker set to identify a target cell population.
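The dynamic-programming core can be sketched at toy scale: phenotypes are subsets of marker gates arranged in a lattice where each edge adds one gate, each node is scored, and the best-scoring path from "no gates" to the full target phenotype is recovered by DP over subsets. Gates, cells, and the purity score below are synthetic assumptions; the real tool also optimises correlation with clinical outcomes.

```python
# DP over the lattice of gate subsets: best cumulative-purity gating chain.
import itertools
import numpy as np

rng = np.random.default_rng(0)
cells = rng.random((1000, 3)) > 0.5       # 1000 cells x 3 binary marker gates
target = cells.all(axis=1)                # cells positive for all markers

def purity(gate_idx):
    mask = cells[:, list(gate_idx)].all(axis=1) if gate_idx else np.ones(len(cells), bool)
    return target[mask].mean()

n = 3
best = {(): (purity(()), None)}           # subset -> (path score, predecessor)
for size in range(1, n + 1):
    for s in itertools.combinations(range(n), size):
        # best predecessor: drop one gate, take the highest-scoring path so far
        pred = max((tuple(g for g in s if g != d) for d in s),
                   key=lambda p: best[p][0])
        best[s] = (best[pred][0] + purity(s), pred)

path, node = [], tuple(range(n))
while node is not None:                   # backtrack the optimal hierarchy
    path.append(node)
    node = best[node][1]
print(list(reversed(path)))               # best gating chain, coarse to fine
```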
Subjects
Bone Marrow Cells/cytology, Flow Cytometry/methods, Software, Algorithms, CD Antigens/analysis, CD Antigens/immunology, Biomarkers/analysis, Bone Marrow Cells/immunology, Computational Biology/methods, HIV Infections/immunology, Humans, Immunophenotyping/methods, Interleukin-7/immunology, Lipopolysaccharides/immunology, Phenotype, Staining and Labeling, T-Lymphocytes/cytology, T-Lymphocytes/immunology
ABSTRACT
MOTIVATION: Polychromatic flow cytometry (PFC) has enormous power as a tool to dissect complex immune responses (such as those observed in HIV disease) at a single cell level. However, analysis tools are severely lacking. Although high-throughput systems allow rapid data collection from large cohorts, manual data analysis can take months. Moreover, identification of cell populations can be subjective and analysts rarely examine the entirety of the multidimensional dataset (focusing instead on a limited number of subsets, the biology of which has usually already been well-described). Thus, the value of PFC as a discovery tool is largely wasted. RESULTS: To address this problem, we developed a computational approach that automatically reveals all possible cell subsets. From tens of thousands of subsets, those that correlate strongly with clinical outcome are selected and grouped. Within each group, markers that have minimal relevance to the biological outcome are removed, thereby distilling the complex dataset into the simplest, most clinically relevant subsets. This allows complex information from PFC studies to be translated into clinical or resource-poor settings, where multiparametric analysis is less feasible. We demonstrate the utility of this approach in a large (n=466), retrospective, 14-parameter PFC study of early HIV infection, where we identify three T-cell subsets that strongly predict progression to AIDS (only one of which was identified by an initial manual analysis). AVAILABILITY: The 'flowType: Phenotyping Multivariate PFC Assays' package is available through Bioconductor. Additional documentation and examples are available at: www.terryfoxlab.ca/flowsite/flowType/ SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. CONTACT: rbrinkman@bccrc.ca.
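The exhaustive enumeration at the heart of this approach is easy to sketch: each marker can be left unconstrained, required positive, or required negative, giving 3^n candidate phenotypes whose cell counts can then be screened against outcomes. Thresholds and data below are synthetic stand-ins for real gates.

```python
# Enumerate all 3^n marker phenotypes and count matching cells.
import itertools
import numpy as np

rng = np.random.default_rng(0)
markers = ["CD4", "CD8", "CD45RA"]
expr = rng.normal(size=(1000, 3))
positive = expr > 0.0                      # per-marker gate (toy threshold)

counts = {}
for states in itertools.product([None, True, False], repeat=len(markers)):
    mask = np.ones(len(expr), dtype=bool)
    name = ""
    for m, s, pos in zip(markers, states, positive.T):
        if s is None:
            continue                       # marker left out of this phenotype
        mask &= pos if s else ~pos
        name += m + ("+" if s else "-")
    counts[name or "all"] = int(mask.sum())

print(counts["CD4+CD8-"])                  # one of the 27 enumerated phenotypes
```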
Subjects
Computational Biology/methods, Flow Cytometry, HIV Infections/immunology, T-Lymphocyte Subsets/immunology, Biomarkers/analysis, Humans, Immunophenotyping/methods, Predictive Value of Tests, Proportional Hazards Models, Retrospective Studies, T-Lymphocyte Subsets/cytology
ABSTRACT
BACKGROUND: RNA molecules play critical roles in the cells of organisms, including roles in gene regulation, catalysis, and synthesis of proteins. Since RNA function depends in large part on its folded structures, much effort has been invested in developing accurate methods for prediction of RNA secondary structure from the base sequence. Minimum free energy (MFE) predictions are widely used, based on nearest neighbor thermodynamic parameters of Mathews, Turner et al. or those of Andronescu et al. Some recently proposed alternatives that leverage partition function calculations find the structure with maximum expected accuracy (MEA) or pseudo-expected accuracy (pseudo-MEA) methods. Advances in prediction methods are typically benchmarked using sensitivity, positive predictive value and their harmonic mean, namely F-measure, on datasets of known reference structures. Since such benchmarks document progress in improving accuracy of computational prediction methods, it is important to understand how measures of accuracy vary as a function of the reference datasets and whether advances in algorithms or thermodynamic parameters yield statistically significant improvements. Our work advances such understanding for the MFE and (pseudo-)MEA-based methods, with respect to the latest datasets and energy parameters. RESULTS: We present three main findings. First, using the bootstrap percentile method, we show that the average F-measure accuracy of the MFE and (pseudo-)MEA-based algorithms, as measured on our largest datasets with over 2000 RNAs from diverse families, is a reliable estimate (within a 2% range with high confidence) of the accuracy of a population of RNA molecules represented by this set. However, average accuracy on smaller classes of RNAs such as a class of 89 Group I introns used previously in benchmarking algorithm accuracy is not reliable enough to draw meaningful conclusions about the relative merits of the MFE and MEA-based algorithms. Second, on our large datasets, the algorithm with best overall accuracy is a pseudo-MEA-based algorithm of Hamada et al. that uses a generalized centroid estimator of base pairs. However, between MFE and other MEA-based methods, there is no clear winner in the sense that the relative accuracy of the MFE versus MEA-based algorithms changes depending on the underlying energy parameters. Third, of the four parameter sets we considered, the best accuracy for the MFE-, MEA-based, and pseudo-MEA-based methods is 0.686, 0.680, and 0.711, respectively (on a scale from 0 to 1, with 1 meaning perfect structure predictions) and is obtained with a thermodynamic parameter set obtained by Andronescu et al. called BL* (named after the Boltzmann likelihood method by which the parameters were derived). CONCLUSIONS: Large datasets should be used to obtain reliable measures of the accuracy of RNA structure prediction algorithms, and average accuracies on specific classes (such as Group I introns and Transfer RNAs) should be interpreted with caution, considering the relatively small size of currently available datasets for such classes. The accuracy of the MEA-based methods is significantly higher when using the BL* parameter set of Andronescu et al. than when using the parameters of Mathews and Turner, and there is no significant difference between the accuracy of MEA-based methods and MFE when using the BL* parameters. The pseudo-MEA-based method of Hamada et al. with the BL* parameter set significantly outperforms all other MFE and MEA-based algorithms on our large datasets.
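The bootstrap percentile method referred to above reduces to a short computation: resample the per-RNA F-measures with replacement many times and read confidence bounds off the percentiles of the resampled means. The scores below are synthetic stand-ins for per-molecule accuracies.

```python
# Bootstrap percentile confidence interval for a mean F-measure.
import numpy as np

rng = np.random.default_rng(0)
f_measures = rng.beta(7, 3, size=2000)          # per-RNA F-measures (toy)

boot_means = [rng.choice(f_measures, size=f_measures.size, replace=True).mean()
              for _ in range(10000)]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"mean F = {f_measures.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```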
Subjects
Algorithms, RNA/chemistry, Nucleic Acid Conformation, Predictive Value of Tests, Ribonuclease P/chemistry, Thermodynamics
ABSTRACT
We have developed flowMeans, a time-efficient and accurate method for automated identification of cell populations in flow cytometry (FCM) data based on K-means clustering. Unlike traditional K-means, flowMeans can identify concave cell populations by modelling a single population with multiple clusters. flowMeans uses a change point detection algorithm to determine the number of sub-populations, enabling the method to be used in high throughput FCM data analysis pipelines. Our approach compares favorably to manual analysis by human experts and current state-of-the-art automated gating algorithms. flowMeans is freely available as an open source R package through Bioconductor.
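A simplified sketch of the flowMeans strategy: over-cluster with k-means, merge the closest cluster centres step by step, and use the change point in the merge-distance curve to decide how many populations remain; this is what lets one concave population be covered by several merged k-means clusters. The single-linkage merging and change-point rule here are simplifying assumptions.

```python
# Over-cluster, merge centres, and pick the population count at the change point.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.3, (300, 2)),      # two toy populations
               rng.normal((3, 3), 0.3, (300, 2))])

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)   # over-cluster
Z = linkage(km.cluster_centers_, method="single")      # merge closest centres
jump = int(np.argmax(np.diff(Z[:, 2]))) + 1            # change point in distances
n_pop = len(km.cluster_centers_) - jump                # populations remaining
merged = fcluster(Z, t=n_pop, criterion="maxclust")    # label per k-means cluster
labels = merged[km.labels_]                            # label per cell
print("populations:", n_pop, "sizes:", np.bincount(labels)[1:])
```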
Subjects
Flow Cytometry/methods, Flow Cytometry/statistics & numerical data, Algorithms, Automation, Cluster Analysis, Graft-versus-Host Disease/blood, Humans, Diffuse Large B-Cell Lymphoma/pathology, Statistical Models
ABSTRACT
Methods for efficient and accurate prediction of RNA structure are increasingly valuable, given the current rapid advances in understanding the diverse functions of RNA molecules in the cell. To enhance the accuracy of secondary structure predictions, we developed and refined optimization techniques for the estimation of energy parameters. We build on two previous approaches to RNA free-energy parameter estimation: (1) the Constraint Generation (CG) method, which iteratively generates constraints that enforce known structures to have energies lower than other structures for the same molecule; and (2) the Boltzmann Likelihood (BL) method, which infers a set of RNA free-energy parameters that maximize the conditional likelihood of a set of reference RNA structures. Here, we extend these approaches in two main ways: we propose (1) a max-margin extension of CG, and (2) a novel linear Gaussian Bayesian network that models feature relationships, which effectively makes use of sparse data by sharing statistical strength between parameters. We obtain significant improvements in the accuracy of RNA minimum free-energy pseudoknot-free secondary structure prediction when measured on a comprehensive set of 2518 RNA molecules with reference structures. Our parameters can be used in conjunction with software that predicts RNA secondary structures, RNA hybridization, or ensembles of structures. Our data, software, results, and parameter sets in various formats are freely available at http://www.cs.ubc.ca/labs/beta/Projects/RNA-Params.
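In the abstract, the max-margin extension of CG amounts to requiring each reference structure to have lower energy than competing structures by a margin under a linear energy model E(s) = w · f(s), which yields hinge-style constraints that can be handled by subgradient descent, as in the sketch below. The feature vectors are random stand-ins for real structural feature counts, and the optimiser is a simplification of the paper's procedure.

```python
# Max-margin flavour of constraint generation with a linear energy model.
import numpy as np

rng = np.random.default_rng(0)
n_feat = 20
f_ref = rng.poisson(2, size=(50, n_feat)).astype(float)   # reference structures
f_alt = f_ref + rng.poisson(1, size=(50, n_feat))         # competing structures

w, lr, margin = np.zeros(n_feat), 0.01, 1.0
for _ in range(200):
    viol = (f_ref @ w) - (f_alt @ w) + margin > 0          # E(ref) not low enough
    grad = (f_ref[viol] - f_alt[viol]).sum(axis=0) + 0.1 * w   # hinge + L2 reg
    w -= lr * grad
print("violated constraints:", int(viol.sum()))
```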
Subjects
Computational Biology/methods, Energy Metabolism/physiology, RNA/chemistry, RNA/metabolism, Statistics as Topic/methods, Algorithms, Animals, Base Composition, Base Sequence, Computational Biology/statistics & numerical data, Humans, Theoretical Models, Molecular Sequence Data, Nucleic Acid Conformation, Reproducibility of Results, Sensitivity and Specificity, RNA Sequence Analysis
ABSTRACT
BACKGROUND: The ability to access, search and analyse secondary structures of a large set of known RNA molecules is very important for deriving improved RNA energy models, for evaluating computational predictions of RNA secondary structures and for a better understanding of RNA folding. Currently there is no database that can easily provide these capabilities for almost all RNA molecules with known secondary structures. RESULTS: In this paper we describe RNA STRAND - the RNA secondary STRucture and statistical ANalysis Database, a curated database containing known secondary structures of any type and organism. Our new database provides a wide collection of known RNA secondary structures drawn from public databases, searchable and downloadable in a common format. Comprehensive statistical information on the secondary structures in our database is provided using the RNA Secondary Structure Analyser, a new tool we have developed to analyse RNA secondary structures. The information thus obtained is valuable for understanding to what extent and with what probability certain structural motifs appear. We outline several ways in which the data provided in RNA STRAND can facilitate research on RNA structure, including the improvement of RNA energy models and evaluation of secondary structure prediction programs. In order to keep up-to-date with new RNA secondary structure experiments, we offer the necessary tools to add solved RNA secondary structures to our database and invite researchers to contribute to RNA STRAND. CONCLUSION: RNA STRAND is a carefully assembled database of trusted RNA secondary structures, with easy on-line tools for searching, analyzing and downloading user selected entries, and is publicly available at http://www.rnasoft.ca/strand.
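A flavour of the statistics such an analyser computes can be given with a stack-based dot-bracket parser that recovers base pairs and counts simple motifs; the input format and the example structure are illustrative, not RNA STRAND's internal representation.

```python
# Parse a dot-bracket string into base pairs and count hairpin loops.
def parse_dot_bracket(db):
    stack, pairs = [], []
    for i, ch in enumerate(db):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)

structure = "((((...))))..((...))"
pairs = parse_dot_bracket(structure)
hairpins = sum(1 for i, j in pairs if "(" not in structure[i+1:j]
               and ")" not in structure[i+1:j])
print(len(pairs), "pairs,", hairpins, "hairpin loops")
```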
Subjects
Database Management Systems, Genetic Databases, Chemical Models, Molecular Models, RNA/chemistry, RNA/ultrastructure, User-Computer Interface, Computer Graphics, Computer Simulation, Information Storage and Retrieval/methods, Nucleic Acid Conformation
ABSTRACT
BACKGROUND: Secondary structure interactions within introns have been shown to be essential for efficient splicing of several yeast genes. The nature of these base-pairing interactions and their effect on splicing efficiency were most extensively studied in the ribosomal protein gene RPS17B (previously known as RP51B). It was determined that complementary pairing between two sequence segments located downstream of the 5' splice site and upstream of the branchpoint sequence promotes efficient splicing of the RPS17B pre-mRNA, presumably by shortening the branchpoint distance. However, no attempts were made to compute a shortened, 'structural' branchpoint distance and thus the functional relationship between this distance and the splicing efficiency remains unknown. RESULTS: In this paper we use computational RNA secondary structure prediction to analyze the secondary structure of the RPS17B intron. We show that it is necessary to consider suboptimal structure predictions and to compute the structural branchpoint distances in order to explain previously published splicing efficiency results. Our study reveals that there is a tight correlation between this distance and splicing efficiency levels of intron mutants described in the literature. We experimentally test this correlation on additional RPS17B mutants and intron mutants within two other yeast genes. CONCLUSION: The proposed model of secondary structure requirements for efficient splicing is the first attempt to specify the functional relationship between pre-mRNA secondary structure and splicing. Our findings provide further insights into the role of pre-mRNA secondary structure in gene splicing in yeast and also offer a basis for improving computational methods for splice site identification and gene finding.
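One plausible way to compute such a "structural" branchpoint distance (an assumption based on the abstract, not necessarily the paper's exact definition) is a shortest path in a graph whose edges are the backbone links between consecutive nucleotides plus the predicted base pairs, so that a stable stem shortens the effective distance:

```python
# Structural distance: BFS over backbone edges plus base-pair shortcut edges.
from collections import deque

def structural_distance(structure, start, end):
    n = len(structure)
    adj = {i: set() for i in range(n)}
    for i in range(n - 1):                       # backbone edges
        adj[i].add(i + 1); adj[i + 1].add(i)
    stack = []
    for i, ch in enumerate(structure):           # base-pair shortcut edges
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            adj[i].add(j); adj[j].add(i)
    dist, queue = {start: 0}, deque([start])
    while queue:                                 # breadth-first search
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist[end]

structure = ".((((.........)))).."                # toy intron stem
print(structural_distance(structure, 0, len(structure) - 1))  # 4, not 19
```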
Subjects
Introns, RNA Precursors/genetics, RNA Splicing, Messenger RNA/genetics, Saccharomyces cerevisiae/genetics, Algorithms, Base Pairing, Computational Biology, Fungal Genes, Fungal Genome, Mutation, Nucleic Acid Conformation, Fungal RNA/genetics, Ribosomal Proteins/genetics, Saccharomyces cerevisiae Proteins/genetics
ABSTRACT
BACKGROUND: The ab initio protein folding problem consists of predicting protein tertiary structure from a given amino acid sequence by minimizing an energy function; it is one of the most important and challenging problems in biochemistry, molecular biology and biophysics. The ab initio protein folding problem is computationally challenging and has been shown to be NP-hard even when conformations are restricted to a lattice. In this work, we implement and evaluate the replica exchange Monte Carlo (REMC) method, which has already been applied very successfully to more complex protein models and other optimization problems with complex energy landscapes, in combination with the highly effective pull move neighbourhood in two widely studied Hydrophobic Polar (HP) lattice models. RESULTS: We demonstrate that REMC is highly effective for solving instances of the square (2D) and cubic (3D) HP protein folding problem. When using the pull move neighbourhood, REMC outperforms current state-of-the-art algorithms for most benchmark instances. Additionally, we show that this new algorithm provides a larger ensemble of ground-state structures than the existing state-of-the-art methods. Furthermore, it scales well with sequence length, and it finds significantly better conformations on long biological sequences and sequences with a provably unique ground-state structure, which is believed to be a characteristic of real proteins. We also present evidence that our REMC algorithm can fold sequences which exhibit significant interaction between termini in the hydrophobic core relatively easily. CONCLUSION: We demonstrate that REMC utilizing the pull move neighbourhood significantly outperforms current state-of-the-art methods for protein structure prediction in the HP model on 2D and 3D lattices. This is particularly noteworthy, since so far, the state-of-the-art methods for 2D and 3D HP protein folding, in particular the pruned-enriched Rosenbluth method (PERM) and, to some extent, Ant Colony Optimisation (ACO), were based on chain growth mechanisms. To the best of our knowledge, this is the first application of REMC to HP protein folding on the cubic lattice, and the first extension of the pull move neighbourhood to a 3D lattice.
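A compact sketch of REMC on the 2D HP model is given below. For brevity it uses pivot moves rather than the paper's pull move neighbourhood, and the temperature ladder, sequence, and iteration counts are arbitrary choices; the energy is the standard HP count of non-consecutive H-H lattice contacts.

```python
# Replica exchange Monte Carlo on the 2D HP model with pivot moves.
import math, random

random.seed(0)
SEQ = "HPHPPHHPHPPHPHHPPHPH"          # benchmark-style toy sequence
ROT = [lambda x, y: (y, -x), lambda x, y: (-x, -y), lambda x, y: (-y, x)]

def energy(coords):
    pos = {c: i for i, c in enumerate(coords)}
    e = 0
    for i, (x, y) in enumerate(coords):
        if SEQ[i] != "H":
            continue
        for nb in ((x + 1, y), (x, y + 1)):      # count each contact once
            j = pos.get(nb)
            if j is not None and SEQ[j] == "H" and abs(i - j) > 1:
                e -= 1
    return e

def pivot(coords):
    k = random.randrange(1, len(coords) - 1)     # rotate the tail about residue k
    rot = random.choice(ROT)
    px, py = coords[k]
    new = coords[:k + 1] + [(px + rot(x - px, y - py)[0],
                             py + rot(x - px, y - py)[1])
                            for x, y in coords[k + 1:]]
    return new if len(set(new)) == len(new) else None  # keep self-avoiding

temps = [0.5, 1.0, 2.0, 4.0]
replicas = [[(i, 0) for i in range(len(SEQ))] for _ in temps]  # straight chains
best = 0
for sweep in range(2000):
    for r, T in enumerate(temps):                # Metropolis step per replica
        cand = pivot(replicas[r])
        if cand is not None:
            dE = energy(cand) - energy(replicas[r])
            if dE <= 0 or random.random() < math.exp(-dE / T):
                replicas[r] = cand
        best = min(best, energy(replicas[r]))
    i = random.randrange(len(temps) - 1)         # attempt a replica swap
    d = (1 / temps[i] - 1 / temps[i + 1]) * (energy(replicas[i]) - energy(replicas[i + 1]))
    if d >= 0 or random.random() < math.exp(d):
        replicas[i], replicas[i + 1] = replicas[i + 1], replicas[i]
print("best energy found:", best)
```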