Results 1 - 20 of 4,246
1.
Mol Cell ; 82(9): 1708-1723.e10, 2022 05 05.
Article in English | MEDLINE | ID: mdl-35320755

ABSTRACT

7SK is a conserved noncoding RNA that regulates transcription by sequestering the transcription factor P-TEFb. 7SK function entails complex changes in RNA structure, but characterizing RNA dynamics in cells remains an unsolved challenge. We developed a single-molecule chemical probing strategy, DANCE-MaP (deconvolution and annotation of ribonucleic conformational ensembles), that defines per-nucleotide reactivity, direct base pairing interactions, tertiary interactions, and thermodynamic populations for each state in RNA structural ensembles from a single experiment. DANCE-MaP reveals that 7SK RNA encodes a large-scale structural switch that couples dissolution of the P-TEFb binding site to structural remodeling at distal release factor binding sites. The 7SK structural equilibrium shifts in response to cell growth and stress and can be targeted to modulate expression of P-TEFb-responsive genes. Our study reveals that RNA structural dynamics underlie 7SK function as an integrator of diverse cellular signals to control transcription and establishes the power of DANCE-MaP to define RNA dynamics in cells.


Subjects
Positive Transcriptional Elongation Factor B , RNA-Binding Proteins , Binding Sites/genetics , HeLa Cells , Humans , Positive Transcriptional Elongation Factor B/genetics , RNA, Small Nuclear/genetics , RNA, Untranslated , RNA-Binding Proteins/genetics
2.
Am J Hum Genet ; 111(7): 1431-1447, 2024 07 11.
Article in English | MEDLINE | ID: mdl-38908374

ABSTRACT

Methods of estimating polygenic scores (PGSs) from genome-wide association studies are increasingly utilized. However, independent method evaluation is lacking, and method comparisons are often limited. Here, we evaluate polygenic scores derived via seven methods in five biobank studies (totaling about 1.2 million participants) across 16 diseases and quantitative traits, building on a reference-standardized framework. We conducted meta-analyses to quantify the effects of method choice, hyperparameter tuning, method ensembling, and the target biobank on PGS performance. We found that no single method consistently outperformed all others. PGS effect sizes were more variable between biobanks than between methods within biobanks when methods were well tuned. Differences between methods were largest for the two investigated autoimmune diseases, seropositive rheumatoid arthritis and type 1 diabetes. For most methods, cross-validation was more reliable for tuning hyperparameters than automatic tuning (without the use of target data). For a given target phenotype, elastic net models combining PGS across methods (ensemble PGS) tuned in the UK Biobank provided consistent, high, and cross-biobank transferable performance, increasing PGS effect sizes (β coefficients) by a median of 5.0% relative to LDpred2 and MegaPRS (the two best-performing single methods when tuned with cross-validation). Our interactively browsable online results and open-source workflow prspipe provide a rich resource and reference for the analysis of polygenic scoring methods across biobanks.


Subjects
Biological Specimen Banks , Genome-Wide Association Study , Multifactorial Inheritance , Humans , Multifactorial Inheritance/genetics , Phenotype , Diabetes Mellitus, Type 1/genetics , Polymorphism, Single Nucleotide , Machine Learning
3.
Proc Natl Acad Sci U S A ; 121(33): e2403210121, 2024 Aug 13.
Article in English | MEDLINE | ID: mdl-39110727

ABSTRACT

Polygenic risk scores (PRS) enhance population risk stratification and advance personalized medicine, but existing methods face several limitations, encompassing issues related to computational burden, predictive accuracy, and adaptability to a wide range of genetic architectures. To address these issues, we propose Aggregated L0Learn using Summary-level data (ALL-Sum), a fast and scalable ensemble learning method for computing PRS using summary statistics from genome-wide association studies (GWAS). ALL-Sum leverages an L0L2 penalized regression and ensemble learning across tuning parameters to flexibly model traits with diverse genetic architectures. In extensive large-scale simulations across a wide range of polygenicity and GWAS sample sizes, ALL-Sum consistently outperformed popular alternative methods in terms of prediction accuracy, runtime, and memory usage by 10%, 20-fold, and threefold, respectively, and demonstrated robustness to diverse genetic architectures. We validated the performance of ALL-Sum in real data analysis of 11 complex traits using GWAS summary statistics from nine data sources, including the Global Lipids Genetics Consortium, Breast Cancer Association Consortium, and FinnGen Biobank, with validation in the UK Biobank. Our results show that, on average, ALL-Sum obtained PRS with 25% higher accuracy, 15 times faster computation, and half the memory of the current state-of-the-art methods, and had robust performance across a wide range of traits and diseases. Furthermore, our method demonstrates stable prediction when using linkage disequilibrium computed from different data sources. ALL-Sum is available as a user-friendly R software package with publicly available reference data for streamlined analysis.


Subjects
Genome-Wide Association Study , Multifactorial Inheritance , Humans , Multifactorial Inheritance/genetics , Genome-Wide Association Study/methods , Machine Learning , Genetic Predisposition to Disease , Polymorphism, Single Nucleotide
4.
Proc Natl Acad Sci U S A ; 121(15): e2312573121, 2024 Apr 09.
Article in English | MEDLINE | ID: mdl-38557185

ABSTRACT

Predicting the temporal and spatial patterns of South Asian monsoon rainfall within a season is of critical importance due to its impact on agriculture, water availability, and flooding. The monsoon intraseasonal oscillation (MISO) is a robust northward-propagating mode that determines the active and break phases of the monsoon and much of the regional distribution of rainfall. However, dynamical atmospheric forecast models predict this mode poorly. Data-driven methods for MISO prediction have shown more skill, but only predict the portion of the rainfall corresponding to MISO rather than the full rainfall signal. Here, we combine state-of-the-art ensemble precipitation forecasts from a high-resolution atmospheric model with data-driven forecasts of MISO. The ensemble members of the detailed atmospheric model are projected onto a lower-dimensional subspace corresponding to the MISO dynamics and are then weighted according to their distance from the data-driven MISO forecast in this subspace. We thereby achieve improvements in rainfall forecasts over India, as well as the broader monsoon region, at 10- to 30-d lead times, an interval that is generally considered to be a predictability gap. The temporal correlation of rainfall forecasts is improved by up to 0.28 in this time range. Our results demonstrate the potential of leveraging the predictability of intraseasonal oscillations to improve extended-range forecasts; more generally, they point toward a future of combining dynamical and data-driven forecasts for Earth system prediction.
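
The weighting step described in this abstract (project each ensemble member onto a low-dimensional subspace, then weight it by its distance from a data-driven forecast) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the function names, the Gaussian kernel used to turn distance into weight, and the `scale` parameter. The abstract does not state which weighting function the authors use.

```python
import math


def project(field, basis):
    """Project a flattened field onto each basis vector via dot products."""
    return [sum(f * b for f, b in zip(field, vec)) for vec in basis]


def member_weights(member_coords, target_coords, scale=1.0):
    """Turn distances to the target into weights that sum to 1.

    Assumption: a Gaussian kernel on squared Euclidean distance; the
    abstract only says members are weighted "according to their distance".
    """
    raw = []
    for coords in member_coords:
        d2 = sum((c - t) ** 2 for c, t in zip(coords, target_coords))
        raw.append(math.exp(-d2 / (2.0 * scale ** 2)))
    total = sum(raw)
    return [r / total for r in raw]


def weighted_forecast(members, weights):
    """Weighted mean of the full ensemble fields, grid point by grid point."""
    n_points = len(members[0])
    return [sum(w * m[i] for w, m in zip(weights, members)) for i in range(n_points)]
```

In this sketch, members whose subspace coordinates lie closer to the data-driven target contribute more to the combined field, while the full (unprojected) fields are what get averaged.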

5.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38557674

ABSTRACT

Quality control in quantitative proteomics is a persistent challenge, particularly in identifying and managing outliers. Unsupervised learning models, which rely on data structure rather than predefined labels, offer potential solutions. However, without clear labels, their effectiveness might be compromised. Single models are susceptible to the randomness of parameters and initialization, which can result in a high rate of false positives. Ensemble models, on the other hand, have shown capabilities in effectively mitigating the impacts of such randomness and assisting in accurately detecting true outliers. Therefore, we introduced SEAOP, a Python toolbox that utilizes an ensemble mechanism by integrating multi-round data management and a statistics-based decision pipeline with multiple models. Specifically, SEAOP uses multi-round resampling to create diverse sub-data spaces and employs outlier detection methods to identify candidate outliers in each space. Candidates are then aggregated as confirmed outliers via a chi-square test, adhering to a 95% confidence level, to ensure the precision of the unsupervised approaches. Additionally, SEAOP introduces a visualization strategy, specifically designed to intuitively and effectively display the distribution of both outlier and non-outlier samples. Optimal hyperparameter models of SEAOP for outlier detection were identified by using a gradient-simulated standard dataset and Mann-Kendall trend test. The performance of the SEAOP toolbox was evaluated using three experimental datasets, confirming its reliability and accuracy in handling quantitative proteomics.


Subjects
Data Management , Proteomics , Reproducibility of Results , Quality Control , Data Interpretation, Statistical
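
The resample, detect, then aggregate idea in this abstract can be sketched as follows. This is not SEAOP's code or API. The detector (a robust z-score based on the median absolute deviation), the assumed null flag rate `base_rate`, and the one-degree-of-freedom chi-square goodness-of-fit check are all stand-ins chosen for illustration; the abstract states only that candidates are confirmed via a chi-square test at a 95% confidence level.

```python
import random
import statistics

# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CHI2_CRIT_1DF_95 = 3.841


def flag_outliers(values, z_cut=2.5):
    """Return positions in `values` whose MAD-based robust z-score exceeds z_cut."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return set()
    return {i for i, v in enumerate(values) if 0.6745 * abs(v - med) / mad > z_cut}


def ensemble_outliers(values, rounds=200, frac=0.8, base_rate=0.05, seed=0):
    """Resample repeatedly, flag candidates per round, confirm with a chi-square test.

    base_rate is a hypothetical null probability that a normal sample is flagged.
    """
    rng = random.Random(seed)
    n = len(values)
    k = max(2, int(frac * n))
    drawn = [0] * n
    flagged = [0] * n
    for _ in range(rounds):
        idx = rng.sample(range(n), k)          # one sub-data space
        hits = flag_outliers([values[i] for i in idx])
        for pos, i in enumerate(idx):
            drawn[i] += 1
            if pos in hits:
                flagged[i] += 1
    confirmed = []
    for i in range(n):
        if drawn[i] == 0:
            continue
        exp_flag = drawn[i] * base_rate
        exp_clean = drawn[i] - exp_flag
        obs_clean = drawn[i] - flagged[i]
        chi2 = ((flagged[i] - exp_flag) ** 2 / exp_flag
                + (obs_clean - exp_clean) ** 2 / exp_clean)
        # Confirm only samples flagged more often than the null rate predicts.
        if flagged[i] > exp_flag and chi2 > CHI2_CRIT_1DF_95:
            confirmed.append(i)
    return confirmed
```

Aggregating over many resampling rounds is what reduces sensitivity to any single random draw, which is the motivation the abstract gives for using an ensemble.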
6.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38701413

ABSTRACT

With the emergence of large amounts of single-cell RNA sequencing (scRNA-seq) data, the exploration of computational methods has become critical in revealing biological mechanisms. Clustering is a representative approach for deciphering cellular heterogeneity embedded in scRNA-seq data. However, due to the diversity of datasets, none of the existing single-cell clustering methods shows overwhelming performance on all datasets. Weighted ensemble methods are proposed to integrate multiple results to improve heterogeneity analysis performance. These methods are usually weighted by considering the reliability of the base clustering results, ignoring the performance difference of the same base clustering on different cells. In this paper, we propose a high-order element-wise weighting strategy based self-representative ensemble learning framework: scEWE. By assigning different base clustering weights to individual cells, we construct and optimize the consensus matrix in a careful and exquisite way. In addition, we extracted the high-order information between cells, which enhanced the ability to represent the similarity relationship between cells. scEWE is experimentally shown to significantly outperform the state-of-the-art methods, which strongly demonstrates the effectiveness of the method and supports the potential applications in complex single-cell data analytical problems.


Subjects
Sequence Analysis, RNA , Single-Cell Analysis , Single-Cell Analysis/methods , Cluster Analysis , Sequence Analysis, RNA/methods , Algorithms , Computational Biology/methods , Humans , RNA-Seq/methods
7.
Brief Bioinform ; 25(4)2024 May 23.
Article in English | MEDLINE | ID: mdl-38842509

ABSTRACT

Peptide- and protein-based therapeutics are becoming a promising treatment regimen for myriad diseases. Toxicity of proteins is the primary hurdle for protein-based therapies. Thus, there is an urgent need for accurate in silico methods for determining toxic proteins to filter the pool of potential candidates. At the same time, it is imperative to precisely identify non-toxic proteins to expand the possibilities for protein-based biologics. To address this challenge, we proposed an ensemble framework, called VISH-Pred, comprising models built by fine-tuning ESM2 transformer models on a large, experimentally validated, curated dataset of protein and peptide toxicities. The primary steps in the VISH-Pred framework are to efficiently estimate protein toxicities taking just the protein sequence as input, employing an undersampling technique to handle the severe class imbalance in the data and learning representations from fine-tuned ESM2 protein language models which are then fed to machine learning techniques such as LightGBM and XGBoost. The VISH-Pred framework is able to correctly identify both peptides/proteins with potential toxicity and non-toxic proteins, achieving a Matthews correlation coefficient of 0.737, 0.716 and 0.322 and F1-score of 0.759, 0.696 and 0.713 on three non-redundant blind tests, respectively, outperforming other methods by over 10% on these quality metrics. Moreover, VISH-Pred achieved the best accuracy and area under receiver operating curve scores on these independent test sets, highlighting the robustness and generalization capability of the framework. By making VISH-Pred available as an easy-to-use web server, we expect it to serve as a valuable asset for future endeavors aimed at discerning the toxicity of peptides and enabling efficient protein-based therapeutics.


Subjects
Proteins , Proteins/metabolism , Proteins/chemistry , Machine Learning , Databases, Protein , Computational Biology/methods , Humans , Peptides/toxicity , Peptides/chemistry , Computer Simulation , Algorithms , Software
8.
Proc Natl Acad Sci U S A ; 120(38): e2308338120, 2023 09 19.
Article in English | MEDLINE | ID: mdl-37695919

ABSTRACT

Allostery is a major driver of biological processes requiring coordination. Thus, it is one of the most fundamental and remarkable phenomena in nature, and there is motivation to understand and manipulate it to a multitude of ends. Today, it is often described in terms of two phenomenological models proposed more than a half-century ago involving only T(tense) or R(relaxed) conformations. Here, methyl-based NMR provides extensive detail on a dynamic T to R switch in the classical dimeric allosteric protein, yeast chorismate mutase (CM), that occurs in the absence of substrate, but only with the activator bound. Switching of individual subunits is uncoupled based on direct observation of mixed TR states in the dimer. This unique finding excludes both classic models and solves the paradox of a coexisting hyperbolic binding curve and highly skewed substrate-free T-R equilibrium. Surprisingly, structures of the activator-bound and effector-free forms of CM appear the same by NMR, providing another example of the need to account for dynamic ensembles. The apo enzyme, which has a sigmoidal activity profile, is shown to switch, not to R, but to a related high-energy state. Thus, the conformational repertoire of CM does not just change as a matter of degree depending on the allosteric input, be it effector and/or substrate. Rather, the allosteric model appears to completely change in different contexts, which is only consistent with modern ensemble-based frameworks.


Subjects
Motivation , Polymers , Saccharomyces cerevisiae
9.
J Neurosci ; 44(19)2024 May 08.
Article in English | MEDLINE | ID: mdl-38561224

ABSTRACT

Coordinated neuronal activity has been identified to play an important role in information processing and transmission in the brain. However, current research predominantly focuses on understanding the properties and functions of neuronal coordination in hippocampal and cortical areas, leaving subcortical regions relatively unexplored. In this study, we use single-unit recordings in female Sprague Dawley rats to investigate the properties and functions of groups of neurons exhibiting coordinated activity in the auditory thalamus, the medial geniculate body (MGB). We reliably identify coordinated neuronal ensembles (cNEs), which are groups of neurons that fire synchronously, in the MGB. cNEs are shown not to be the result of false-positive detections or by-products of slow-state oscillations in anesthetized animals. We demonstrate that cNEs in the MGB have enhanced information-encoding properties over individual neurons. Their neuronal composition is stable between spontaneous and evoked activity, suggesting limited stimulus-induced ensemble dynamics. These MGB cNE properties are similar to what is observed in cNEs in the primary auditory cortex (A1), suggesting that ensembles serve as a ubiquitous mechanism for organizing local networks and play a fundamental role in sensory processing within the brain.


Subjects
Acoustic Stimulation , Geniculate Bodies , Neurons , Rats, Sprague-Dawley , Animals , Female , Rats , Neurons/physiology , Geniculate Bodies/physiology , Acoustic Stimulation/methods , Auditory Pathways/physiology , Action Potentials/physiology , Auditory Cortex/physiology , Auditory Cortex/cytology , Thalamus/physiology , Thalamus/cytology , Evoked Potentials, Auditory/physiology
10.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36681902

ABSTRACT

Identification of potential targets for known bioactive compounds and novel synthetic analogs is of considerable significance. In silico target fishing (TF) has become an alternative strategy because of the expensive and laborious wet-lab experiments, explosive growth of bioactivity data and rapid development of high-throughput technologies. However, these TF methods are based on different algorithms, molecular representations and training datasets, which may lead to different results when predicting the same query molecules. This can be confusing for practitioners in practical applications. Therefore, this study systematically evaluated nine popular ligand-based TF methods based on target and ligand-target pair statistical strategies, which will help practitioners make choices among multiple TF methods. The evaluation results showed that SwissTargetPrediction was the best method to produce the most reliable predictions while enriching more targets. High-recall similarity ensemble approach (SEA) was able to find real targets for more compounds compared with other TF methods. Therefore, SwissTargetPrediction and SEA can be considered as primary selection methods in future studies. In addition, the results showed that k = 5 was the optimal number of experimental candidate targets. Finally, a novel ensemble TF method based on consensus voting is proposed to improve the prediction performance. The precision of the ensemble TF method outperforms the individual TF method, indicating that the ensemble TF method can more effectively identify real targets within a given top-k threshold. The results of this study can be used as a reference to guide practitioners in selecting the most effective methods in computational drug discovery.


Subjects
Algorithms , Ligands
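
Consensus voting over several target-fishing methods within a top-k cutoff, as proposed at the end of this abstract, might look like the sketch below. The function name, the input layout (a dict mapping each method to its ranked target list), and the tie-break by mean rank are assumptions for illustration. The abstract does not specify how ties are resolved or whether votes are weighted.

```python
from collections import defaultdict


def consensus_top_k(rankings, k=5):
    """Combine ranked target lists from several methods by simple voting.

    rankings: dict mapping a method name to its ranked list of targets.
    Each method votes for its own top-k targets. Targets are ordered by
    vote count, then by mean rank among the methods that voted for them
    (an assumed tie-break), then alphabetically for determinism.
    """
    votes = defaultdict(int)
    rank_sum = defaultdict(int)
    for ranked in rankings.values():
        for pos, target in enumerate(ranked[:k], start=1):
            votes[target] += 1
            rank_sum[target] += pos
    ordered = sorted(votes, key=lambda t: (-votes[t], rank_sum[t] / votes[t], t))
    return ordered[:k]
```

The default `k=5` mirrors the number of candidate targets the abstract reports as optimal; the method names passed in are up to the caller.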
11.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37405873

ABSTRACT

Nucleic acid-binding proteins are proteins that interact with DNA and RNA to regulate gene expression and transcriptional control. The pathogenesis of many human diseases is related to abnormal gene expression. Therefore, recognizing nucleic acid-binding proteins accurately and efficiently has important implications for disease research. To address this question, some scientists have proposed the method of using sequence information to identify nucleic acid-binding proteins. However, different types of nucleic acid-binding proteins have different subfunctions, and these methods ignore their internal differences, so the performance of the predictor can be further improved. In this study, we proposed a new method, called iDRPro-SC, to predict the type of nucleic acid-binding proteins based on the sequence information. iDRPro-SC considers the internal differences of nucleic acid-binding proteins and combines their subfunctions to build a complete dataset. Additionally, we used ensemble learning to characterize and predict nucleic acid-binding proteins. The results of the test dataset showed that iDRPro-SC achieved the best prediction performance and was superior to the other existing nucleic acid-binding protein prediction methods. We have established a web server that can be accessed online: http://bliulab.net/iDRPro-SC.


Subjects
DNA-Binding Proteins , RNA-Binding Proteins , Humans , DNA-Binding Proteins/metabolism , RNA-Binding Proteins/genetics , DNA/chemistry , Algorithms
12.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36611253

ABSTRACT

Although previous studies have revealed that synonymous mutations contribute to various human diseases, distinguishing deleterious synonymous mutations from benign ones is still a challenge in medical genomics. Recently, computational tools have been introduced to predict the harmfulness of synonymous mutations. However, most of these computational tools rely on balanced training sets without considering abundant negative samples that could result in deficient performance. In this study, we propose a computational model that uses a selective ensemble to predict deleterious synonymous mutations (seDSM). We construct several candidate base classifiers for the ensemble using balanced training subsets randomly sampled from the imbalanced benchmark training sets. The diversity measures of the base classifiers are calculated by the pairwise diversity metrics, and the classifiers with the highest diversities are selected for integration using soft voting for synonymous mutation prediction. We also design two strategies for filling in missing values in the imbalanced dataset and constructing models using different pairwise diversity metrics. The experimental results show that a selective ensemble based on double fault with the ensemble strategy EKNNI for filling in missing values is the most effective scheme. Finally, using 40-dimensional biology features, we propose a novel model based on a selective ensemble for predicting deleterious synonymous mutations (seDSM). seDSM outperformed other state-of-the-art methods on the independent test sets according to multiple evaluation indicators, indicating that it has an outstanding predictive performance for deleterious synonymous mutations. We hope that seDSM will be useful for studying deleterious synonymous mutations and advancing our understanding of synonymous mutations. The source code of seDSM is freely accessible at https://github.com/xialab-ahu/seDSM.git.


Subjects
Genomics , Silent Mutation , Humans , Genomics/methods , Software , Algorithms
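
Two ingredients named in this abstract, the double-fault pairwise diversity measure and soft voting, are standard and can be sketched generically. This is not the seDSM source code. The selection rule in `select_diverse` (keep the classifiers with the lowest mean pairwise double fault) is an assumed reading of "the classifiers with the highest diversities are selected", and all function names are hypothetical.

```python
def double_fault(pred_a, pred_b, truth):
    """Fraction of samples misclassified by both classifiers (lower = more diverse)."""
    both_wrong = sum(1 for a, b, y in zip(pred_a, pred_b, truth) if a != y and b != y)
    return both_wrong / len(truth)


def select_diverse(preds, truth, m):
    """Return indices of the m classifiers with the lowest mean pairwise double fault."""
    n = len(preds)
    scores = []
    for i in range(n):
        others = [double_fault(preds[i], preds[j], truth) for j in range(n) if j != i]
        scores.append((sum(others) / len(others), i))
    return [i for _, i in sorted(scores)[:m]]


def soft_vote(prob_lists, threshold=0.5):
    """Average positive-class probabilities across classifiers, then threshold."""
    n_samples = len(prob_lists[0])
    avg = [sum(p[i] for p in prob_lists) / len(prob_lists) for i in range(n_samples)]
    return [1 if a >= threshold else 0 for a in avg]
```

A typical use would call `select_diverse` on validation-set predictions and then pass only the selected classifiers' probabilities to `soft_vote`.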
13.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37742051

ABSTRACT

Single-base substitution (SBS) mutational signatures have become standard practice in cancer genomics. In lieu of de novo signature extraction, reference signature assignment allows users to estimate the activities of pre-established SBS signatures within individual malignancies. Several tools have been developed for this purpose, each with differing methodologies. However, due to a lack of standardization, there may be inter-tool variability in signature assignment. We deeply characterized three assignment strategies and five SBS signature assignment tools. We observed that assignment strategy choice can significantly influence results and interpretations. Despite varying recommendations by tools, Refit performed best by reducing overfitting and maximizing reconstruction of the original mutational spectra. Even after uniform application of Refit, tools varied remarkably in signature assignments both qualitatively (Jaccard index = 0.38-0.83) and quantitatively (Kendall tau-b = 0.18-0.76). This phenomenon was exacerbated for 'flat' signatures such as the homologous recombination deficiency signature SBS3. An ensemble approach (EnsembleFit), which leverages output from all five tools, increased SBS3 assignment accuracy in BRCA1/2-deficient breast carcinomas. After generating synthetic mutational profiles for thousands of pan-cancer tumors, EnsembleFit reduced signature activity assignment error 15.9-24.7% on average using Catalogue of Somatic Mutations In Cancer and non-standard reference signature sets. We have also released the EnsembleFit web portal (https://www.ensemblefit.pittlabgenomics.com) for users to generate or download ensemble-based SBS signature assignments using any strategy and combination of tools. Overall, we show that signature assignment heterogeneity across tools and strategies is non-negligible and propose a viable, ensemble solution.


Subjects
BRCA1 Protein , BRCA2 Protein , BRCA1 Protein/genetics , BRCA2 Protein/genetics , Mutation
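
The qualitative agreement measure quoted in this abstract, the Jaccard index, compares the sets of signatures that two tools assign to the same sample. A minimal sketch follows; the function name and the convention of returning 1.0 for two empty sets are illustrative choices and are not taken from the paper.

```python
def jaccard(assigned_a, assigned_b):
    """Jaccard index between two collections of assigned signature names."""
    a, b = set(assigned_a), set(assigned_b)
    if not a and not b:
        return 1.0  # assumed convention: two empty assignments agree fully
    return len(a & b) / len(a | b)
```

For instance, if one tool assigns SBS1, SBS3 and SBS5 to a tumor and another assigns SBS1, SBS5 and SBS13, two of the four distinct signatures are shared, giving an index of 0.5.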
14.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37193676

ABSTRACT

Protein-deoxyribonucleic acid (DNA) interactions are important in a variety of biological processes. Accurately predicting protein-DNA binding affinity has been one of the most attractive and challenging issues in computational biology. However, the existing approaches still have much room for improvement. In this work, we propose an ensemble model for Protein-DNA Binding Affinity prediction (emPDBA), which combines six base models with one meta-model. The complexes are classified into four types based on the DNA structure (double-stranded or other forms) and the percentage of interface residues. For each type, emPDBA is trained with the sequence-based, structure-based and energy features from binding partners and complex structures. Through feature selection by the sequential forward selection method, it is found that there do exist considerable differences in the key factors contributing to intermolecular binding affinity. The complex classification is beneficial for the important feature extraction for binding affinity prediction. The performance comparison of our method with other peer ones on the independent testing dataset shows that emPDBA outperforms the state-of-the-art methods with the Pearson correlation coefficient of 0.53 and the mean absolute error of 1.11 kcal/mol. The comprehensive results demonstrate that our method has a good performance for protein-DNA binding affinity prediction. Availability and implementation: The source code is available at https://github.com/ChunhuaLiLab/emPDBA/.


Subjects
Proteins , Software , Proteins/chemistry , Computational Biology/methods , DNA/genetics , Protein Binding
15.
Brief Bioinform ; 24(1)2023 01 19.
Article in English | MEDLINE | ID: mdl-36403184

ABSTRACT

The prediction of peptide and protein function is important for research and industrial applications, and many machine learning methods have been developed for this purpose. The existing models have encountered many challenges, including the lack of effective and comprehensive features and the limited applicability of each model. Here, we introduce an Integrated Peptide and Protein function prediction Framework based on Fused features and Ensemble models (IPPF-FE), which can accurately capture the relationship between features and labels. The results indicated that IPPF-FE outperformed existing state-of-the-art (SOTA) models on more than 8 different categories of peptide and protein tasks. In addition, t-distributed Stochastic Neighbour Embedding demonstrated the advantages of IPPF-FE. We anticipate that our method will become a versatile tool for peptide and protein prediction tasks and shed light on the future development of related models. The model is open source and available in the GitHub repository https://github.com/Luo-SynBioLab/IPPF-FE.


Subjects
International Planned Parenthood Federation , Proteins , Peptides , Machine Learning
16.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36752363

ABSTRACT

Incorporating the genotypic and phenotypic information of the correlated traits into the multi-trait model can significantly improve the prediction accuracy of the target trait in animal and plant breeding, as well as human genetics. However, in most cases, the phenotypic information of the correlated and target trait of the individual to be evaluated was null simultaneously, particularly for the newborn. Therefore, we propose a machine learning framework, MAK, to improve the prediction accuracy of the target trait by constructing the multi-target ensemble regression chains and selecting the assistant trait automatically, which predicted the genomic estimated breeding values of the target trait using genotypic information only. The prediction ability of MAK was significantly more robust than the genomic best linear unbiased prediction, BayesB, BayesRR and the multi-trait Bayesian method in the four real animal and plant datasets, and MAK was roughly 100 times faster than BayesB and BayesRR.


Subjects
Models, Genetic , Plant Breeding , Animals , Humans , Infant, Newborn , Bayes Theorem , Phenotype , Genomics/methods , Genotype , Machine Learning
17.
Brief Bioinform ; 24(3)2023 05 19.
Article in English | MEDLINE | ID: mdl-37150785

ABSTRACT

A-to-I editing is the most prevalent RNA editing event, which refers to the change of adenosine (A) bases to inosine (I) bases in double-stranded RNAs. Several studies have revealed that A-to-I editing can regulate cellular processes and is associated with various human diseases. Therefore, accurate identification of A-to-I editing sites is crucial for understanding RNA-level (i.e. transcriptional) modifications and their potential roles in molecular functions. To date, various computational approaches for A-to-I editing site identification have been developed; however, their performance is still unsatisfactory and needs further improvement. In this study, we developed a novel stacked-ensemble learning model, ATTIC (A-To-I ediTing predICtor), to accurately identify A-to-I editing sites across three species, including Homo sapiens, Mus musculus and Drosophila melanogaster. We first comprehensively evaluated 37 RNA sequence-derived features combined with 14 popular machine learning algorithms. Then, we selected the optimal base models to build a series of stacked ensemble models. The final ATTIC framework was developed based on the optimal models improved by the feature selection strategy for specific species. Extensive cross-validation and independent tests illustrate that ATTIC outperforms state-of-the-art tools for predicting A-to-I editing sites. We also developed a web server for ATTIC, which is publicly available at http://web.unimelb-bioinfortools.cloud.edu.au/ATTIC/. We anticipate that ATTIC can be utilized as a useful tool to accelerate the identification of A-to-I RNA editing events and help characterize their roles in post-transcriptional regulation.


Subjects
Drosophila melanogaster , RNA Editing , Animals , Mice , Humans , Drosophila melanogaster/genetics , Drosophila melanogaster/metabolism , RNA/genetics , Adenosine/genetics , Adenosine/metabolism , Inosine/genetics , Inosine/metabolism
18.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36892153

ABSTRACT

Accurate and effective drug-target interaction (DTI) prediction can greatly shorten the drug development lifecycle and reduce the cost of drug development. In the deep-learning-based paradigm for predicting DTI, robust drug and protein feature representations and their interaction features play a key role in improving the accuracy of DTI prediction. Additionally, the class imbalance problem and the overfitting problem in the drug-target dataset can also affect the prediction accuracy, and reducing the consumption of computational resources and speeding up the training process are also critical considerations. In this paper, we propose shared-weight-based MultiheadCrossAttention, a precise and concise attention mechanism that can establish the association between target and drug, making our models more accurate and faster. Then, we use the cross-attention mechanism to construct two models: MCANet and MCANet-B. In MCANet, the cross-attention mechanism is used to extract the interaction features between drugs and proteins for improving the feature representation ability of drugs and proteins, and the PolyLoss loss function is applied to alleviate the overfitting problem and the class imbalance problem in the drug-target dataset. In MCANet-B, the robustness of the model is improved by combining multiple MCANet models and prediction accuracy further increases. We train and evaluate our proposed methods on six public drug-target datasets and achieve state-of-the-art results. In comparison with other baselines, MCANet saves considerable computational resources while maintaining accuracy in the leading position; however, MCANet-B greatly improves prediction accuracy by combining multiple models while maintaining a balance between computational resource consumption and prediction accuracy.


Subjects
Drug Development , Drug Discovery , Drug Discovery/methods , Proteins/metabolism , Drug Delivery Systems , Protein Domains
19.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37889118

ABSTRACT

Selecting informative features, such as accurate biomarkers for disease diagnosis, prognosis and response to treatment, is an essential task in the field of bioinformatics. Medical data often contain thousands of features and identifying potential biomarkers is challenging due to the small number of samples in the data, method dependence and non-reproducibility. This paper proposes a novel ensemble feature selection method, named Filter and Wrapper Stacking Ensemble (FWSE), to identify reproducible biomarkers from high-dimensional omics data. In FWSE, filter feature selection methods are run on numerous subsets of the data to eliminate irrelevant features, and then wrapper feature selection methods are applied to rank the top features. The method was validated on four high-dimensional medical datasets related to mental illnesses and cancer. The results indicate that the features selected by FWSE are stable and statistically more significant than the ones obtained by existing methods while also demonstrating biological relevance. Furthermore, FWSE is a generic method, applicable to various high-dimensional datasets in the fields of machine intelligence and bioinformatics.


Subjects
Mental Disorders , Neoplasms , Humans , Algorithms , Artificial Intelligence , Biomarkers , Neoplasms/diagnosis , Neoplasms/genetics
20.
Brief Bioinform ; 25(1)2023 11 22.
Article in English | MEDLINE | ID: mdl-38205965

ABSTRACT

DNA methylation profiling is a useful tool to increase the accuracy of a cancer diagnosis. However, a comprehensive R package specially for it is lacking. Hence, we developed the R package methylClass for methylation-based classification. Within it, we provide the eSVM (ensemble-based support vector machine) model to achieve much higher accuracy in methylation data classification than the popular random forest model and overcome the time-consuming problem of the traditional SVM. In addition, some novel feature selection methods are included in the package to improve the classification. Furthermore, because methylation data can be converted to other omics, such as copy number variation data, we also provide functions for multi-omics studies. The testing of this package on four datasets shows the accurate performance of our package, especially eSVM, which can be used in both methylation and multi-omics models and outperforms other methods in both cases. methylClass is available at: https://github.com/yuabrahamliu/methylClass.


Subjects
DNA Copy Number Variations , DNA Methylation , Protein Processing, Post-Translational , Support Vector Machine