Results 1 - 20 of 1,139
1.
Am J Hum Genet ; 111(7): 1431-1447, 2024 07 11.
Article in English | MEDLINE | ID: mdl-38908374

ABSTRACT

Methods of estimating polygenic scores (PGSs) from genome-wide association studies are increasingly utilized. However, independent method evaluation is lacking, and method comparisons are often limited. Here, we evaluate polygenic scores derived via seven methods in five biobank studies (totaling about 1.2 million participants) across 16 diseases and quantitative traits, building on a reference-standardized framework. We conducted meta-analyses to quantify the effects of method choice, hyperparameter tuning, method ensembling, and the target biobank on PGS performance. We found that no single method consistently outperformed all others. PGS effect sizes were more variable between biobanks than between methods within biobanks when methods were well tuned. Differences between methods were largest for the two investigated autoimmune diseases, seropositive rheumatoid arthritis and type 1 diabetes. For most methods, cross-validation was more reliable for tuning hyperparameters than automatic tuning (without the use of target data). For a given target phenotype, elastic net models combining PGS across methods (ensemble PGS) tuned in the UK Biobank provided consistent, high, and cross-biobank transferable performance, increasing PGS effect sizes (β coefficients) by a median of 5.0% relative to LDpred2 and MegaPRS (the two best-performing single methods when tuned with cross-validation). Our interactively browsable online results and open-source workflow prspipe provide a rich resource and reference for the analysis of polygenic scoring methods across biobanks.


Subject(s)
Biological Specimen Banks , Genome-Wide Association Study , Multifactorial Inheritance , Humans , Multifactorial Inheritance/genetics , Phenotype , Type 1 Diabetes Mellitus/genetics , Single Nucleotide Polymorphism , Machine Learning
2.
Proc Natl Acad Sci U S A ; 121(33): e2403210121, 2024 Aug 13.
Article in English | MEDLINE | ID: mdl-39110727

ABSTRACT

Polygenic risk scores (PRS) enhance population risk stratification and advance personalized medicine, but existing methods face several limitations, encompassing issues related to computational burden, predictive accuracy, and adaptability to a wide range of genetic architectures. To address these issues, we propose Aggregated L0Learn using Summary-level data (ALL-Sum), a fast and scalable ensemble learning method for computing PRS using summary statistics from genome-wide association studies (GWAS). ALL-Sum leverages an L0L2 penalized regression and ensemble learning across tuning parameters to flexibly model traits with diverse genetic architectures. In extensive large-scale simulations across a wide range of polygenicity and GWAS sample sizes, ALL-Sum consistently outperformed popular alternative methods in terms of prediction accuracy, runtime, and memory usage by 10%, 20-fold, and threefold, respectively, and demonstrated robustness to diverse genetic architectures. We validated the performance of ALL-Sum in real data analysis of 11 complex traits using GWAS summary statistics from nine data sources, including the Global Lipids Genetics Consortium, Breast Cancer Association Consortium, and FinnGen Biobank, with validation in the UK Biobank. Our results show that, on average, ALL-Sum obtained PRS with 25% higher accuracy, 15 times faster computation, and half the memory of the current state-of-the-art methods, and had robust performance across a wide range of traits and diseases. Furthermore, our method demonstrates stable prediction when using linkage disequilibrium computed from different data sources. ALL-Sum is available as a user-friendly R software package with publicly available reference data for streamlined analysis.
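
For orientation, a generic L0L2-penalized least-squares objective can be written as below. This is the textbook individual-level form, stated here as an assumption: the abstract does not give the summary-statistic formulation that ALL-Sum actually optimizes, and the symbols $y$, $X$, $\beta$, $\lambda_0$ and $\lambda_2$ are illustrative.

```latex
\hat{\beta} \;=\; \arg\min_{\beta} \; \frac{1}{2}\,\lVert y - X\beta \rVert_2^2 \;+\; \lambda_0\,\lVert \beta \rVert_0 \;+\; \lambda_2\,\lVert \beta \rVert_2^2
```

In this form, $\lVert \beta \rVert_0$ counts the nonzero effects and so favors sparse solutions, while the ridge term shrinks the effects that are retained. Fitting over a grid of $(\lambda_0, \lambda_2)$ values yields the family of models that an ensemble across tuning parameters would then combine.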


Subject(s)
Genome-Wide Association Study , Multifactorial Inheritance , Humans , Multifactorial Inheritance/genetics , Genome-Wide Association Study/methods , Machine Learning , Genetic Predisposition to Disease , Single Nucleotide Polymorphism
3.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37405873

ABSTRACT

Nucleic acid-binding proteins are proteins that interact with DNA and RNA to regulate gene expression and transcriptional control. The pathogenesis of many human diseases is related to abnormal gene expression. Therefore, recognizing nucleic acid-binding proteins accurately and efficiently has important implications for disease research. To address this question, some scientists have proposed using sequence information to identify nucleic acid-binding proteins. However, different types of nucleic acid-binding proteins have different subfunctions, and these methods ignore their internal differences, so the performance of the predictors can be further improved. In this study, we propose a new method, called iDRPro-SC, to predict the type of nucleic acid-binding protein based on sequence information. iDRPro-SC considers the internal differences of nucleic acid-binding proteins and combines their subfunctions to build a complete dataset. Additionally, we used an ensemble learning approach to characterize and predict nucleic acid-binding proteins. The results on the test dataset showed that iDRPro-SC achieved the best prediction performance and was superior to the other existing nucleic acid-binding protein prediction methods. We have established a web server that can be accessed online: http://bliulab.net/iDRPro-SC.


Subject(s)
DNA-Binding Proteins , RNA-Binding Proteins , Humans , DNA-Binding Proteins/metabolism , RNA-Binding Proteins/genetics , DNA/chemistry , Algorithms
4.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37889118

ABSTRACT

Selecting informative features, such as accurate biomarkers for disease diagnosis, prognosis and response to treatment, is an essential task in the field of bioinformatics. Medical data often contain thousands of features, and identifying potential biomarkers is challenging due to the small number of samples in the data, method dependence and non-reproducibility. This paper proposes a novel ensemble feature selection method, named Filter and Wrapper Stacking Ensemble (FWSE), to identify reproducible biomarkers from high-dimensional omics data. In FWSE, filter feature selection methods are run on numerous subsets of the data to eliminate irrelevant features, and then wrapper feature selection methods are applied to rank the top features. The method was validated on four high-dimensional medical datasets related to mental illnesses and cancer. The results indicate that the features selected by FWSE are stable and statistically more significant than those obtained by existing methods, while also demonstrating biological relevance. Furthermore, FWSE is a generic method, applicable to various high-dimensional datasets in the fields of machine intelligence and bioinformatics.


Subject(s)
Mental Disorders , Neoplasms , Humans , Algorithms , Artificial Intelligence , Biomarkers , Neoplasms/diagnosis , Neoplasms/genetics
5.
Brief Bioinform ; 24(3)2023 05 19.
Article in English | MEDLINE | ID: mdl-37150785

ABSTRACT

A-to-I editing is the most prevalent RNA editing event, which refers to the change of adenosine (A) bases to inosine (I) bases in double-stranded RNAs. Several studies have revealed that A-to-I editing can regulate cellular processes and is associated with various human diseases. Therefore, accurate identification of A-to-I editing sites is crucial for understanding RNA-level (i.e. transcriptional) modifications and their potential roles in molecular functions. To date, various computational approaches for A-to-I editing site identification have been developed; however, their performance is still unsatisfactory and needs further improvement. In this study, we developed a novel stacked-ensemble learning model, ATTIC (A-To-I ediTing predICtor), to accurately identify A-to-I editing sites across three species, including Homo sapiens, Mus musculus and Drosophila melanogaster. We first comprehensively evaluated 37 RNA sequence-derived features combined with 14 popular machine learning algorithms. Then, we selected the optimal base models to build a series of stacked ensemble models. The final ATTIC framework was developed based on the optimal models improved by the feature selection strategy for specific species. Extensive cross-validation and independent tests illustrate that ATTIC outperforms state-of-the-art tools for predicting A-to-I editing sites. We also developed a web server for ATTIC, which is publicly available at http://web.unimelb-bioinfortools.cloud.edu.au/ATTIC/. We anticipate that ATTIC can be utilized as a useful tool to accelerate the identification of A-to-I RNA editing events and help characterize their roles in post-transcriptional regulation.


Subject(s)
Drosophila melanogaster , RNA Editing , Animals , Mice , Humans , Drosophila melanogaster/genetics , Drosophila melanogaster/metabolism , RNA/genetics , Adenosine/genetics , Adenosine/metabolism , Inosine/genetics , Inosine/metabolism
6.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37874948

ABSTRACT

Proteases contribute to a broad spectrum of cellular functions. Given a relatively limited amount of experimental data, developing accurate sequence-based predictors of substrate cleavage sites facilitates a better understanding of protease functions and substrate specificity. While many protease-specific predictors of substrate cleavage sites were developed, these efforts are outpaced by the growth of the protease substrate cleavage data. In particular, since data for 100+ protease types are available and this number continues to grow, it becomes impractical to publish predictors for new protease types, and instead it might be better to provide a computational platform that helps users to quickly and efficiently build predictors that address their specific needs. To this end, we conceptualized, developed, tested and released a versatile bioinformatics platform, ProsperousPlus, that empowers users, even those with no programming or little bioinformatics background, to build fast and accurate predictors of substrate cleavage sites. ProsperousPlus facilitates the use of the rapidly accumulating substrate cleavage data to train, empirically assess and deploy predictive models for user-selected substrate types. Benchmarking tests on test datasets show that our platform produces predictors that on average exceed the predictive performance of current state-of-the-art approaches. ProsperousPlus is available as a webserver and a stand-alone software package at http://prosperousplus.unimelb-biotools.cloud.edu.au/.


Subject(s)
Machine Learning , Peptide Hydrolases , Peptide Hydrolases/metabolism , Substrate Specificity , Algorithms
7.
BMC Bioinformatics ; 25(1): 120, 2024 Mar 21.
Article in English | MEDLINE | ID: mdl-38515026

ABSTRACT

BACKGROUND: Whole-genome variants offer sufficient information for genetic prediction of human disease risk and for prediction of animal and plant breeding values. Many sophisticated statistical methods have been developed to enhance predictive ability. However, each method has its own advantages and disadvantages, and so far no single method outperforms all others. RESULTS: We herein propose an Ensemble Learning method for Prediction of Genetic Values (ELPGV), which assembles predictions from several basic methods, such as GBLUP, BayesA, BayesB and BayesCπ, to produce more accurate predictions. We validated ELPGV with a variety of well-known datasets and a series of simulated datasets. All showed that ELPGV significantly enhanced predictive ability relative to any of the basic methods; for instance, the comparison p-values of ELPGV over the basic methods ranged from 4.853E-118 to 9.640E-20 for the WTCCC dataset. CONCLUSIONS: ELPGV integrates the merits of each method to produce significantly higher predictive ability than any of the basic methods, and it is simple to implement and fast to run, without using genotype data. It is promising for wide application in genetic predictions.


Subject(s)
Genome , Plant Breeding , Animals , Humans , Genotype , Genomics , Machine Learning , Genetic Models , Phenotype , Single Nucleotide Polymorphism , Bayes Theorem
8.
J Proteome Res ; 23(8): 3088-3095, 2024 Aug 02.
Article in English | MEDLINE | ID: mdl-38690713

ABSTRACT

Spatial segmentation is an essential processing method for image analysis aiming to identify the characteristic suborgans or microregions from mass spectrometry imaging (MSI) data, which is critical for understanding the spatial heterogeneity of biological information and function and the underlying molecular signatures. Due to the intrinsic characteristics of MSI data including spectral nonlinearity, high-dimensionality, and large data size, the common segmentation methods lack the capability for capturing the accurate microregions associated with biological functions. Here we proposed an ensemble learning-based spatial segmentation strategy, named eLIMS, that combines a randomized unified manifold approximation and projection (r-UMAP) dimensionality reduction module for extracting significant features and an ensemble pixel clustering module for aggregating the clustering maps from r-UMAP. Three MSI datasets are used to evaluate the performance of eLIMS, including mouse fetus, human adenocarcinoma, and mouse brain. Experimental results demonstrate that the proposed method has potential in partitioning the heterogeneous tissues into several subregions associated with anatomical structure, i.e., the suborgans of the brain region in mouse fetus data are identified as dorsal pallium, midbrain, and brainstem. Furthermore, it effectively discovers critical microregions related to physiological and pathological variations offering new insight into metabolic heterogeneity.


Subject(s)
Brain , Computer-Assisted Image Processing , Mice , Animals , Humans , Brain/metabolism , Brain/diagnostic imaging , Computer-Assisted Image Processing/methods , Mass Spectrometry/methods , Fetus/metabolism , Algorithms , Cluster Analysis , Adenocarcinoma/metabolism , Adenocarcinoma/pathology , Machine Learning
9.
J Cell Mol Med ; 28(7): e18180, 2024 04.
Article in English | MEDLINE | ID: mdl-38506066

ABSTRACT

Circular RNA (circRNA) is a common non-coding RNA that plays an important role in the diagnosis and therapy of human diseases. Predicting circRNA-disease associations with computational methods can provide a new route to better clinical diagnosis. In this article, we propose a novel ensemble learning method for circRNA-disease association prediction, named ELCDA. First, a heterogeneous association network is constructed by collecting multiple sources of information on circRNAs and diseases, with multiple similarity measures adopted. Then, metapath, matrix factorization and GraphSAGE-based models are used to extract node features from different views, and the final comprehensive features of circRNAs and diseases are obtained via ensemble learning. Finally, a soft voting ensemble strategy is used to integrate the predicted results of all classifiers. The performance of ELCDA is evaluated by fivefold cross-validation and compared with other state-of-the-art methods; the experimental results show that ELCDA outperforms the others. Furthermore, three common diseases are used as case studies, which also demonstrate that ELCDA is an effective method for predicting circRNA-disease associations.
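
The soft voting step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the usual meaning of soft voting (unweighted averaging of per-class probabilities, then taking the most probable class); the function name `soft_vote` and the probabilities are hypothetical and are not taken from ELCDA.

```python
from typing import List, Sequence, Tuple


def soft_vote(prob_sets: Sequence[Sequence[float]]) -> Tuple[List[float], int]:
    """Average class probabilities across classifiers and return (mean, argmax)."""
    if not prob_sets:
        raise ValueError("need at least one classifier output")
    n_classes = len(prob_sets[0])
    if any(len(p) != n_classes for p in prob_sets):
        raise ValueError("all classifiers must report the same number of classes")
    # Unweighted mean of each class probability over all classifiers.
    mean = [sum(p[k] for p in prob_sets) / len(prob_sets) for k in range(n_classes)]
    # Predicted label is the class with the highest averaged probability.
    label = max(range(n_classes), key=mean.__getitem__)
    return mean, label


# Hypothetical outputs of three classifiers for one circRNA-disease pair,
# given as [P(no association), P(association)].
probs = [[0.40, 0.60], [0.55, 0.45], [0.20, 0.80]]
mean_probs, label = soft_vote(probs)
```

Unlike hard (majority) voting, this keeps each classifier's confidence, so one strongly confident model can outweigh a weakly dissenting one.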


Subject(s)
Machine Learning , Circular RNA , Humans , Circular RNA/genetics , Computational Biology/methods
10.
Genet Epidemiol ; 47(1): 26-44, 2023 02.
Article in English | MEDLINE | ID: mdl-36349692

ABSTRACT

Using high-dimensional genetic variants such as single nucleotide polymorphisms (SNP) to predict complex diseases and traits has important applications in basic research and other clinical settings. For example, predicting gene expression is a necessary first step to identify (putative) causal genes in transcriptome-wide association studies. Due to weak signals, high-dimensionality, and linkage disequilibrium (correlation) among SNPs, building such a prediction model is challenging. However, functional annotations at the SNP level (e.g., as epigenomic data across multiple cell- or tissue-types) are available and could be used to inform predictor importance and aid in outcome prediction. Existing approaches to incorporate annotations have been based mainly on (generalized) linear models. Bayesian additive regression trees (BART), in contrast, is a reliable method to obtain high-quality nonlinear out of sample predictions without overfitting. Unfortunately, the default prior from BART may be too inflexible to handle sparse situations where the number of predictors approaches or surpasses the number of observations. Motivated by our real data application, this article proposes an alternative prior based on the logit normal distribution because it provides a framework that is adaptive to sparsity and can model informative functional annotations. It also provides a framework to incorporate prior information about the between SNP correlations. Computational details for carrying out inference are presented along with the results from a simulation study and a genome-wide prediction analysis of the Alzheimer's Disease Neuroimaging Initiative data.


Subject(s)
Algorithms , Genetic Models , Humans , Bayes Theorem , Neuroimaging/methods , Computer Simulation , Single Nucleotide Polymorphism , Genome-Wide Association Study/methods
11.
Proteins ; 92(1): 60-75, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37638618

ABSTRACT

Proteins play key roles in many biological functions. The functional roles of a protein are enhanced through interaction compared with the protein acting individually. Identifying the essential proteins of an organism in the wet lab is time-consuming and costly, although wet-lab results offer high reliability and accuracy on biological grounds. Essential protein prediction using computational approaches is an alternative that has rapidly proved its significance and effectively reduces the experimental cost of wet-lab work. Existing computational methods have been implemented using protein-protein interaction networks (PPIN), sequence, Gene Expression Dataset (GED), Gene Ontology (GO), orthologous groups, and subcellular localization datasets. Machine learning can draw on diverse categories of features to model and predict essential macromolecules of understudied organisms. A novel methodology, MEM-FET (membership feature), is proposed that predicts essential proteins based on features, namely edge clustering coefficient, average clustering coefficient, subcellular localization, and Gene Ontology within a compartment of common neighbors. The accuracy (ACC) values of the predicted true positive (TP) essential proteins are 0.79, 0.74, 0.78, and 0.71 for the YHQ, YMIPS, YDIP, and YMBD datasets, respectively. An enriched set of essential proteins is also predicted using the MEM-FET algorithm. Ensemble ML also validated the proposed model with an accuracy of 60%. MEM-FET is reported to outperform other existing algorithms, with an ACC value of 80% for the yeast dataset.
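
One of the features named in this abstract, the edge clustering coefficient, is commonly defined for an edge (u, v) as the number of triangles containing that edge divided by min(deg(u) - 1, deg(v) - 1). The sketch below assumes that common definition; the abstract does not state which variant MEM-FET uses, and the toy network and node names are hypothetical.

```python
from typing import Dict, Set


def edge_clustering_coefficient(adj: Dict[str, Set[str]], u: str, v: str) -> float:
    """Triangles through edge (u, v) divided by min(deg(u) - 1, deg(v) - 1)."""
    if v not in adj[u] or u not in adj[v]:
        raise ValueError("(u, v) is not an edge of the network")
    # Each common neighbor of u and v closes one triangle on the edge.
    triangles = len(adj[u] & adj[v])
    # Maximum number of triangles the edge could possibly take part in.
    max_triangles = min(len(adj[u]) - 1, len(adj[v]) - 1)
    return triangles / max_triangles if max_triangles > 0 else 0.0


# Hypothetical undirected protein interaction network as an adjacency map.
network = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
ecc_ab = edge_clustering_coefficient(network, "A", "B")
ecc_ad = edge_clustering_coefficient(network, "A", "D")
```

Edges inside densely connected modules score close to 1, while edges leading to a degree-1 node score 0 under this convention.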


Subject(s)
Computational Biology , Proteins , Humans , Reproducibility of Results , Computational Biology/methods , Proteins/genetics , Proteins/metabolism , Algorithms , Machine Learning , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae/metabolism
12.
Neuroimage ; 299: 120825, 2024 Aug 29.
Article in English | MEDLINE | ID: mdl-39214438

ABSTRACT

As an important biomarker of neural aging, brain age reflects the integrity and health of the human brain. Accurate prediction of brain age could help to understand the underlying mechanism of neural aging. In this study, a cross-stratified ensemble learning algorithm with a stacking strategy was proposed to obtain brain age and the derived predicted age difference (PAD) using T1-weighted magnetic resonance imaging (MRI) data. The approach comprises two modules: three base learners (3D-DenseNet, 3D-ResNeXt and 3D-Inception-v4) and 14 secondary learners based on linear regression. To evaluate performance, our method was compared with single base learners, regular ensemble learning algorithms, and state-of-the-art (SOTA) methods. The results demonstrated that our proposed model outperformed the other models, with a mean absolute error (MAE), root mean-squared error (RMSE), and coefficient of determination (R²) of 2.9405 years, 3.9458 years, and 0.9597, respectively. Furthermore, there were significant differences in PAD among the three groups of normal control (NC), mild cognitive impairment (MCI) and Alzheimer's disease (AD), with an increasing trend across NC, MCI, and AD. It was concluded that the proposed algorithm could be effectively used to compute brain age and PAD, offering potential for early diagnosis and assessment of normal brain aging and AD.
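
The three reported metrics and the PAD can be made concrete with a short sketch. It uses the standard definitions of MAE, RMSE and R², and it assumes PAD is predicted age minus chronological age, which the abstract does not spell out. The ages are hypothetical toy values, not study data.

```python
import math
from typing import List, Sequence, Tuple


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Tuple[float, float, float]:
    """Return (MAE, RMSE, R^2) for paired true and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(ss_res / n)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


def predicted_age_difference(chronological: Sequence[float], predicted: Sequence[float]) -> List[float]:
    """PAD per subject, assumed here to be predicted minus chronological age."""
    return [p - c for c, p in zip(chronological, predicted)]


# Hypothetical chronological and predicted brain ages (years).
ages = [60.0, 70.0, 80.0]
brain_ages = [62.0, 69.0, 84.0]
mae, rmse, r2 = regression_metrics(ages, brain_ages)
pad = predicted_age_difference(ages, brain_ages)
```

Under this sign convention, a positive PAD means the brain looks older than the subject's chronological age, which is the direction consistent with the increasing trend reported across NC, MCI and AD.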

13.
J Comput Chem ; 45(13): 953-968, 2024 May 15.
Article in English | MEDLINE | ID: mdl-38174739

ABSTRACT

In the pursuit of novel antiretroviral therapies targeting human immunodeficiency virus type-1 (HIV-1) proteases (PRs), recent improvements in drug discovery have embraced machine learning (ML) techniques to guide the design process. This study employs ensemble learning models to identify crucial substructures as significant features for drug development. A collection of 160 darunavir (DRV) analogs was designed based on these key substructures and subsequently screened using molecular docking techniques. Chemical structures with high fitness scores were selected, combined, and subjected to one-dimensional (1D) screening based on beyond Lipinski's rule of five (bRo5) and ADME (absorption, distribution, metabolism, and excretion) prediction, as implemented in the Combined Analog generator Tool (CAT) program. A total of 473 screened analogs were subjected to docking analysis through a convolutional neural network scoring function against both the wild-type (WT) and 12 major mutated PRs. DRV analogs with negative changes in binding free energy (ΔΔG_bind) compared to DRV could be categorized into four attractive groups based on their interactions with the majority of vital PRs. The analysis of interaction profiles revealed that potent designed analogs, targeting both WT and mutant PRs, exhibited interactions with common key amino acid residues. This observation further confirms that the ML model-guided approach effectively identified the substructures that play a crucial role in potent analogs. It is expected to function as a powerful computational tool, offering valuable guidance in the identification of chemical substructures for synthesis and subsequent experimental testing.


Subject(s)
HIV Infections , HIV Protease Inhibitors , HIV-1 , Humans , Darunavir/pharmacology , HIV Protease Inhibitors/pharmacology , HIV Protease Inhibitors/chemistry , Peptide Hydrolases/pharmacology , Molecular Docking Simulation , HIV Protease/chemistry , Drug Discovery
14.
Small ; : e2402756, 2024 Jun 20.
Article in English | MEDLINE | ID: mdl-39031869

ABSTRACT

In traditional machine learning (ML)-based material design, the defects of low prediction accuracy, overfitting and low generalization ability are mainly caused by the training of a single ML model. Here, a Soft Voting Ensemble Learning (SVEL) approach is proposed to solve the above issues by integrating multiple ML models in the same scene, thus pursuing more stable and reliable prediction. As a case study, SVEL is applied to develop the broad chemical space of novel pyrochlore electrocatalysts with the molecular formula of A2B2O7, to explore promising pyrochlore oxides and accelerate predictions of unknown pyrochlore in the periodic table. The model successfully established the structure-property relationship of pyrochlore, and selected six cost-effective pyrochlore from the periodic table with a high prediction accuracy of 91.7%, all of which showed good electrocatalytic performance. SVEL not only effectively avoids the high costs of experimentation and lengthy computations, but also addresses biases arising from data scarcity in single models. Furthermore, it has significantly reduced the research cycle of pyrochlore by ≈ 22 years, offering broad prospects for accelerating the development of materials genomics. SVEL method is intended to integrate multiple AI models to provide broader model training clues for the AI material design community.

15.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-36088543

ABSTRACT

Ensemble learning is a kind of machine learning method that integrates multiple basic learners to achieve higher accuracy. Recently, single machine learning methods have been established to predict survival for patients with cancer. However, a robust, highly accurate ensemble learning model for picking out high-risk patients is still lacking. To address this, we proposed a novel genetic algorithm-aided three-stage ensemble learning method (3S score) for survival prediction. During construction of the 3S score, double training sets were used to avoid over-fitting; the gene-pairing method was applied to reduce batch effect; and a genetic algorithm was employed to select the best combination of basic learners. When used to predict the survival state of glioma patients, this model achieved the highest C-index (0.697) as well as the highest areas under the receiver operating characteristic curve (ROC-AUCs) (first year = 0.705, third year = 0.825 and fifth year = 0.839) in the combined test set (n = 1191), compared with 12 other baseline models. Furthermore, the 3S score distinguished survival significantly in eight of the nine independent test cohorts (P < 0.05), achieving significant improvement of ROC-AUCs. Notably, ablation experiments demonstrated that the gene-pairing method, double training sets and genetic algorithm ensure the robustness and effectiveness of the 3S score. The performance exploration on pan-cancer showed that the 3S score has excellent ability in survival prediction for five kinds of cancer, which was verified by Cox regression, survival curves and ROC curves together. To enable its clinical adoption, we implemented the 3S score and two other clinical factors as an easy-to-use web tool for risk scoring and therapy stratification in glioma patients.
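
The C-index reported in this abstract can be illustrated with a minimal sketch of Harrell's concordance index for right-censored data. That choice of variant is an assumption, since the abstract does not specify one; the function name and the toy cohort are hypothetical.

```python
from typing import Sequence


def concordance_index(times: Sequence[float], events: Sequence[int], risks: Sequence[float]) -> float:
    """Harrell's C: fraction of comparable pairs ranked correctly by risk score."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # earlier event has the higher predicted risk
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical cohort: follow-up time (months), event flag (1 = death), risk score.
times = [5.0, 10.0, 15.0, 20.0]
events = [1, 0, 1, 1]
risks = [0.9, 0.4, 0.1, 0.2]
c_index = concordance_index(times, events, risks)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the reported 0.697.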


Subject(s)
Glioma , Machine Learning , Glioma/genetics , Humans , ROC Curve , Risk Factors
16.
Brief Bioinform ; 23(2)2022 03 10.
Article in English | MEDLINE | ID: mdl-35183059

ABSTRACT

Mass spectrometry-based proteomic technique has become indispensable in current exploration of complex and dynamic biological processes. Instrument development has largely ensured the effective production of proteomic data, which necessitates commensurate advances in statistical framework to discover the optimal proteomic signature. Current framework mainly emphasizes the generalizability of the identified signature in predicting the independent data but neglects the reproducibility among signatures identified from independently repeated trials on different sub-dataset. These problems seriously restricted the wide application of the proteomic technique in molecular biology and other related directions. Thus, it is crucial to enable the generalizable and reproducible discovery of the proteomic signature with the subsequent indication of phenotype association. However, no such tool has been developed and available yet. Herein, an online tool, POSREG, was therefore constructed to identify the optimal signature for a set of proteomic data. It works by (i) identifying the proteomic signature of good reproducibility and aggregating them to ensemble feature ranking by ensemble learning, (ii) assessing the generalizability of ensemble feature ranking to acquire the optimal signature and (iii) indicating the phenotype association of discovered signature. POSREG is unique in its capacity of discovering the proteomic signature by simultaneously optimizing its reproducibility and generalizability. It is now accessible free of charge without any registration or login requirement at https://idrblab.org/posreg/.


Subject(s)
Proteomics , Proteomics/methods , Reproducibility of Results
17.
Brief Bioinform ; 23(5)2022 09 20.
Article in English | MEDLINE | ID: mdl-36007240

ABSTRACT

Natural products (NPs) and their derivatives are important resources for drug discovery. Many in silico target prediction methods have been reported; however, very few of them distinguish NPs from synthetic molecules. Because NPs and synthetic molecules differ in many characteristics, it is necessary to build NP-specific target prediction models. Therefore, we collected the activity data of NPs and their derivatives from public databases and constructed four datasets: the NP dataset, the NPs and their first-class derivatives dataset, the NPs and all their derivatives dataset, and the ChEMBL26 compounds dataset. Conditions, including activity thresholds and input features, were explored to assess the performance of eight machine learning methods for target prediction of NPs: support vector machines (SVM), extreme gradient boosting, random forests, K-nearest neighbor, naive Bayes, feedforward neural networks (FNN), convolutional neural networks and recurrent neural networks. As a result, the NPs and all their derivatives dataset was selected to build the best NP-specific models. Furthermore, consensus models and voting models were additionally applied to improve the prediction performance. Further evaluations on the external validation set demonstrated that (1) the NP-specific model performed better on the target prediction of NPs than the traditional models trained on all ChEMBL26 compounds, and (2) the consensus model of FNN + SVM had the best overall performance, while the voting model can significantly improve recall and specificity.


Subject(s)
Biological Products , Algorithms , Bayes Theorem , Machine Learning , Computer Neural Networks , Support Vector Machine
18.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34929742

ABSTRACT

MOTIVATION: Accumulating evidence has indicated that microRNA (miRNA) plays a crucial role in the pathogenesis and progression of various complex diseases. Inferring disease-associated miRNAs is significant for exploring the etiology, diagnosis and treatment of human diseases. As biological experiments are time-consuming and labor-intensive, developing effective computational methods has become indispensable for identifying associations between miRNAs and diseases. RESULTS: We present an Ensemble learning framework with Resampling method for MiRNA-Disease Association (ERMDA) prediction to discover potential disease-related miRNAs. Firstly, a resampling strategy is proposed for building multiple different balanced training subsets to address the challenge of sample imbalance within the database. Then, ERMDA extracts miRNA and disease feature representations by integrating miRNA-miRNA similarities, disease-disease similarities and experimentally verified miRNA-disease association information. Next, a feature selection approach is applied to reduce redundant information and increase the diversity among these subsets. Lastly, ERMDA constructs an individual learner on each subset to yield primitive outcomes, and the soft voting method is introduced to make the final decision based on the prediction results of the individual learners. A series of experimental results demonstrates that ERMDA outperforms other state-of-the-art methods on both balanced and unbalanced testing sets. Besides, case studies conducted on three human diseases further confirm ERMDA's capability for identifying potential disease-related miRNAs. In conclusion, these experimental results demonstrate that our method can serve as an effective and reliable tool for researchers to explore the regulatory role of miRNAs in complex diseases.
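
The resampling step described in this abstract can be sketched as follows. The sketch assumes random undersampling of the majority (unverified) pairs so that each subset holds as many negatives as positives; the abstract does not give ERMDA's exact scheme, and the function name, pair identifiers and seed are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]  # (miRNA, disease)


def balanced_subsets(
    positives: Sequence[Pair],
    negatives: Sequence[Pair],
    n_subsets: int,
    seed: int = 0,
) -> List[Tuple[List[Pair], List[Pair]]]:
    """Build n_subsets training sets, each with all positives and an equal-size
    random draw (without replacement) from the negatives."""
    if len(positives) > len(negatives):
        raise ValueError("expected negatives to be the majority class")
    rng = random.Random(seed)  # seeded for reproducible subsets
    subsets = []
    for _ in range(n_subsets):
        sampled = rng.sample(list(negatives), len(positives))
        subsets.append((list(positives), sampled))
    return subsets


# Hypothetical verified (positive) and unverified (treated as negative) pairs.
pos = [("miR-a", "D1"), ("miR-b", "D2")]
neg = [("miR-a", "D2"), ("miR-b", "D1"), ("miR-c", "D1"),
       ("miR-c", "D2"), ("miR-d", "D1"), ("miR-d", "D2")]
subsets = balanced_subsets(pos, neg, n_subsets=3, seed=1)
```

Each subset would then train one individual learner, and because the negative draws differ between subsets, the learners see different data, which is what gives the ensemble its diversity.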


Subject(s)
Disease/genetics , Genetic Association Studies , Machine Learning , MicroRNAs/genetics , Algorithms , Computational Biology , Genetic Predisposition to Disease/genetics , Humans
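
The resample-then-vote idea described in the ERMDA abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy two-dimensional data, and the centroid-based learner are all hypothetical stand-ins, and the similarity-based feature extraction and feature selection steps are omitted. It only shows how balanced subsets can be drawn from an imbalanced set and how soft voting averages the per-learner probabilities.

```python
import math
import random


def balanced_subsets(positives, negatives, n_subsets, rng):
    """Build balanced training subsets by undersampling the majority class.

    Each subset keeps every positive sample and adds an equally sized random
    draw (without replacement) from the negatives.
    """
    k = len(positives)
    return [
        [(x, 1) for x in positives] + [(x, 0) for x in rng.sample(negatives, k)]
        for _ in range(n_subsets)
    ]


def fit_centroid_learner(samples):
    """Toy stand-in learner (an assumption, not ERMDA's learner).

    It scores a point by comparing its squared distance to the positive and
    negative class centroids and squashing the difference with a sigmoid.
    """
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    pos = centroid([x for x, y in samples if y == 1])
    neg = centroid([x for x, y in samples if y == 0])

    def predict_proba(x):
        d_pos = sum((a - b) ** 2 for a, b in zip(x, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(x, neg))
        z = max(-50.0, min(50.0, d_pos - d_neg))  # clamp to keep exp() finite
        return 1.0 / (1.0 + math.exp(z))

    return predict_proba


def soft_vote(learners, x):
    """Soft voting: average the positive-class probabilities of all learners."""
    return sum(learner(x) for learner in learners) / len(learners)


# Hypothetical imbalanced toy data: 5 positives near (1, 1), 50 negatives near (0, 0).
rng = random.Random(0)
positives = [[rng.gauss(1.0, 0.1), rng.gauss(1.0, 0.1)] for _ in range(5)]
negatives = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(50)]

subsets = balanced_subsets(positives, negatives, n_subsets=7, rng=rng)
learners = [fit_centroid_learner(s) for s in subsets]

score_near_positive = soft_vote(learners, [1.0, 1.0])
score_near_negative = soft_vote(learners, [0.0, 0.0])
```

Because every subset reuses all positives but a different draw of negatives, the learners differ mainly in which negatives they saw, and averaging their probabilities smooths out that sampling variation.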
19.
Brief Bioinform ; 23(6), 2022 11 19.
Article in English | MEDLINE | ID: mdl-36184189

ABSTRACT

Short hairpin RNA (shRNA)-mediated gene silencing is an important technology for achieving RNA interference, in which the design of potent and reliable shRNA molecules plays a crucial role. However, efficient shRNA target selection through biological technology is expensive and time-consuming. Hence, it is crucial to develop a more precise and efficient computational method to design potent and reliable shRNA molecules. In this work, we present ILGBMSH, an interpretable classification model for shRNA target prediction that uses the Light Gradient Boosting Machine algorithm. Rather than utilizing only the shRNA sequence feature, we extracted 554 biological and deep learning features, which were not considered in previous shRNA prediction research. We compared the performance of our model with that of state-of-the-art shRNA target prediction models. In addition, we investigated feature explanations derived from the model's parameters and from the interpretability method Shapley Additive Explanations, which provided biological insights from the model. We used independent shRNA experiment data from other resources to demonstrate the predictive ability and robustness of our model. Finally, we used our model to design miR30-shRNA sequences and conducted a gene knockdown experiment. The experimental result closely matched our expectation, with a Pearson's correlation coefficient of 0.985. In summary, the ILGBMSH model can achieve state-of-the-art shRNA prediction performance and give biological insights from the machine learning model parameters.


Subject(s)
Algorithms , Machine Learning , RNA, Small Interfering/genetics
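
The ILGBMSH abstract above reports agreement between predicted and experimental knockdown as a Pearson's correlation coefficient. As a reminder of what that statistic measures, here is a minimal sketch of its computation. The function name and the toy values are hypothetical and are not the study's data.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical predicted vs. observed knockdown efficiencies (illustration only).
predicted = [0.1, 0.4, 0.5, 0.9]
observed = [0.2, 0.35, 0.6, 0.85]
r_demo = pearson_r(predicted, observed)
```

A value close to 1 indicates a strong positive linear relationship between predictions and measurements. It does not by itself show that the predicted values are correct in absolute terms.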
20.
Brief Bioinform ; 23(2), 2022 03 10.
Article in English | MEDLINE | ID: mdl-34981111

ABSTRACT

Large metabolomics datasets inevitably contain unwanted technical variations, which can obscure meaningful biological signals and affect how this information is applied to personalized healthcare. Many methods have been developed to handle unwanted variations. However, the underlying assumptions of many existing methods hold only for a few specific scenarios. Some tools remove technical variations with models trained on quality control (QC) samples, which may not generalize well to subject samples. Additionally, almost none of the existing methods supports datasets with multiple types of QC samples, which greatly limits their performance and flexibility. To address these issues, a non-parametric method, TIGER (Technical variation elImination with ensemble learninG architEctuRe), is developed in this study and released as an R package (https://CRAN.R-project.org/package=TIGERr). TIGER integrates the random forest algorithm into an adaptable ensemble learning architecture. Evaluation results show that TIGER outperforms four popular methods with respect to robustness and reliability on three human cohort datasets constructed with targeted or untargeted metabolomics data. Additionally, a case study aiming to identify age-associated metabolites illustrates how TIGER can be used for cross-kit adjustment in a longitudinal analysis with experimental data from three time-points generated by different analytical kits. A dynamic website is provided to help evaluate the performance of TIGER and examine the patterns revealed in our longitudinal analysis (https://han-siyu.github.io/TIGER_web/). Overall, TIGER is expected to be a powerful tool for metabolomics data analysis.


Subject(s)
Algorithms , Metabolomics , Humans , Machine Learning , Metabolomics/methods , Reproducibility of Results , Research Design