1.
Brief Bioinform ; 24(6)2023 09 22.
Article in English | MEDLINE | ID: mdl-37874948

ABSTRACT

Proteases contribute to a broad spectrum of cellular functions. Given a relatively limited amount of experimental data, developing accurate sequence-based predictors of substrate cleavage sites facilitates a better understanding of protease functions and substrate specificity. While many protease-specific predictors of substrate cleavage sites were developed, these efforts are outpaced by the growth of the protease substrate cleavage data. In particular, since data for 100+ protease types are available and this number continues to grow, it becomes impractical to publish predictors for new protease types, and instead it might be better to provide a computational platform that helps users to quickly and efficiently build predictors that address their specific needs. To this end, we conceptualized, developed, tested and released a versatile bioinformatics platform, ProsperousPlus, that empowers users, even those with no programming or little bioinformatics background, to build fast and accurate predictors of substrate cleavage sites. ProsperousPlus facilitates the use of the rapidly accumulating substrate cleavage data to train, empirically assess and deploy predictive models for user-selected substrate types. Benchmarking tests on test datasets show that our platform produces predictors that on average exceed the predictive performance of current state-of-the-art approaches. ProsperousPlus is available as a webserver and a stand-alone software package at http://prosperousplus.unimelb-biotools.cloud.edu.au/.


Subject(s)
Machine Learning , Peptide Hydrolases , Peptide Hydrolases/metabolism , Substrate Specificity , Algorithms
2.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37291763

ABSTRACT

BACKGROUND: Promoters are DNA regions that initiate the transcription of specific genes near the transcription start sites. In bacteria, promoters are recognized by RNA polymerases and associated sigma factors. Effective promoter recognition is essential for bacteria to synthesize gene-encoded products, grow and adapt to different environmental conditions. A variety of machine learning-based predictors for bacterial promoters have been developed; however, most of them were designed for a particular species. To date, only a few predictors are available for identifying general bacterial promoters, and their predictive performance is limited. RESULTS: In this study, we developed TIMER, a Siamese neural network-based approach for identifying both general and species-specific bacterial promoters. Specifically, TIMER uses DNA sequences as the input and employs three Siamese neural networks with attention layers to train and optimize the models for a total of 13 species-specific and general bacterial promoters. Extensive 10-fold cross-validation and independent tests demonstrated that TIMER achieves a competitive performance and outperforms several existing methods on both general and species-specific promoter prediction. As an implementation of the proposed method, the web server of TIMER is publicly accessible at http://web.unimelb-bioinfortools.cloud.edu.au/TIMER/.


Subject(s)
Bacteria , Neural Networks, Computer , Bacteria/genetics , Bacteria/metabolism , DNA-Directed RNA Polymerases/genetics , DNA-Directed RNA Polymerases/metabolism , Base Sequence , Promoter Regions, Genetic
3.
Brief Bioinform ; 23(2)2022 03 10.
Article in English | MEDLINE | ID: mdl-35176756

ABSTRACT

Protein secretion has a pivotal role in many biological processes and is particularly important for intercellular communication, from the cytoplasm to the host or external environment. Gram-positive bacteria can secrete proteins through multiple secretion pathways. Among these, the non-classical secretion pathway has recently received increasing attention, but its exact mechanism remains unclear. Non-classical secreted proteins (NCSPs) are a class of secreted proteins lacking signal peptides and motifs. Several NCSP predictors have been proposed, and most of them use the whole amino acid sequence of NCSPs to construct the model. However, the sequence length of different proteins varies greatly. In addition, not all regions of the protein are equally important, and some local regions are not relevant to secretion. The functional regions of the protein, particularly in the N- and C-terminal regions, contain important determinants for secretion. In this study, we propose a new hybrid deep learning-based framework, referred to as ASPIRER, which improves the prediction of NCSPs from amino acid sequences. More specifically, it combines a whole sequence-based XGBoost model and an N-terminal sequence-based convolutional neural network model; 5-fold cross-validation and independent tests demonstrate that ASPIRER outperforms existing state-of-the-art approaches. The source code and curated datasets of ASPIRER are publicly available at https://github.com/yanwu20/ASPIRER/. ASPIRER is anticipated to be a useful tool for improved prediction of novel putative NCSPs from sequence information and prioritization of candidate proteins for follow-up experimental validation.


Subject(s)
Deep Learning , Amino Acid Sequence , Computational Biology , Neural Networks, Computer , Proteins/chemistry , Software
4.
Brief Bioinform ; 23(2)2022 03 10.
Article in English | MEDLINE | ID: mdl-35021193

ABSTRACT

Promoters are crucial regulatory DNA regions for gene transcriptional activation. Rapid advances in next-generation sequencing technologies have accelerated the accumulation of genome sequences, providing increased training data to inform computational approaches for both prokaryotic and eukaryotic promoter prediction. However, it remains a significant challenge to accurately identify species-specific promoter sequences using computational approaches. To advance computational support for promoter prediction, in this study, we curated 58 comprehensive, up-to-date, benchmark datasets for 7 different species (i.e. Escherichia coli, Bacillus subtilis, Homo sapiens, Mus musculus, Arabidopsis thaliana, Zea mays and Drosophila melanogaster) to assist the research community to assess the relative functionality of alternative approaches and support future research on both prokaryotic and eukaryotic promoters. We revisited 106 predictors published since 2000 for promoter identification (40 for prokaryotic promoter, 61 for eukaryotic promoter, and 5 for both). We systematically evaluated their training datasets, computational methodologies, calculated features, performance and software usability. On the basis of these benchmark datasets, we benchmarked 19 predictors with functioning webservers/local tools and assessed their prediction performance. We found that deep learning and traditional machine learning-based approaches generally outperformed scoring function-based approaches. Taken together, the curated benchmark dataset repository and the benchmarking analysis in this study serve to inform the design and implementation of computational approaches for promoter prediction and facilitate more rigorous comparison of new techniques in the future.


Subject(s)
Drosophila melanogaster , Eukaryota , Animals , Computational Biology/methods , Drosophila melanogaster/genetics , Eukaryotic Cells , Mice , Prokaryotic Cells , Promoter Regions, Genetic
5.
Brief Bioinform ; 23(6)2022 11 19.
Article in English | MEDLINE | ID: mdl-36341591

ABSTRACT

Subcellular localization of messenger RNAs (mRNAs) plays a key role in the spatial regulation of gene activity. The functions of mRNAs have been shown to be closely linked with their localizations. As such, understanding the subcellular localizations of mRNAs can help elucidate gene regulatory networks. Although several computational methods have been developed to predict mRNA localizations within cells, there is still much room for improvement in predictive performance, especially for multiple-location prediction. In this study, we proposed a novel multi-label multi-class predictor, termed Clarion, for mRNA subcellular localization prediction. Clarion was developed based on a manually curated benchmark dataset and leveraged the weighted series method for multi-label transformation. Extensive benchmarking tests demonstrated that Clarion achieved competitive predictive performance and that the weighted series method plays a crucial role in securing the superior performance of Clarion. In addition, the independent test results indicate that Clarion outperformed the state-of-the-art methods and can secure accuracy of 81.47, 91.29, 79.77, 92.10, 89.15, 83.74, 80.74, 79.23 and 84.74% for chromatin, cytoplasm, cytosol, exosome, membrane, nucleolus, nucleoplasm, nucleus and ribosome, respectively. The webserver and local stand-alone tool of Clarion are freely available at http://monash.bioweb.cloud.edu.au/Clarion/.


Subject(s)
Cell Nucleus , Proteins , RNA, Messenger/genetics , Cell Nucleus/genetics , Computational Biology/methods , Databases, Protein
6.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34729589

ABSTRACT

Conventional supervised binary classification algorithms have been widely applied to address significant research questions using biological and biomedical data. This classification scheme requires two fully labeled classes of data (e.g. positive and negative samples) to train a classification model. However, in many bioinformatics applications, labeling data is laborious, and the negative samples might be mislabeled due to the limited sensitivity of the experimental equipment. The positive unlabeled (PU) learning scheme was therefore proposed to enable the classifier to learn directly from limited positive samples and a large number of unlabeled samples (i.e. a mixture of positive and negative samples). To date, several PU learning algorithms have been developed to address various biological questions, such as sequence identification, functional site characterization and interaction prediction. In this paper, we revisit a collection of 29 state-of-the-art PU learning bioinformatic applications that address various biological questions. Various important aspects are extensively discussed, including PU learning methodology, biological application, classifier design and evaluation strategy. We also comment on the existing issues of PU learning and offer our perspectives for the future development of PU learning applications. We anticipate that our work serves as an instrumental guideline for a better understanding of the PU learning framework in bioinformatics and for further developing next-generation PU learning frameworks for critical biological applications.


Subject(s)
Algorithms , Computational Biology , Computational Biology/methods , Supervised Machine Learning
7.
Bioinformatics ; 39(3)2023 03 01.
Article in English | MEDLINE | ID: mdl-36794913

ABSTRACT

MOTIVATION: The rapid accumulation of high-throughput sequence data demands the development of effective and efficient data-driven computational methods to functionally annotate proteins. However, most current approaches used for functional annotation simply focus on the use of protein-level information but ignore inter-relationships among annotations. RESULTS: Here, we established PFresGO, an attention-based deep-learning approach that incorporates hierarchical structures in Gene Ontology (GO) graphs and advances in natural language processing algorithms for the functional annotation of proteins. PFresGO employs a self-attention operation to capture the inter-relationships of GO terms, updates its embedding accordingly and uses a cross-attention operation to project protein representations and GO embedding into a common latent space to identify global protein sequence patterns and local functional residues. We demonstrate that PFresGO consistently achieves superior performance across GO categories when compared with 'state-of-the-art' methods. Importantly, we show that PFresGO can identify functionally important residues in protein sequences by assessing the distribution of attention weightings. PFresGO should serve as an effective tool for the accurate functional annotation of proteins and functional domains within proteins. AVAILABILITY AND IMPLEMENTATION: PFresGO is available for academic purposes at https://github.com/BioColLab/PFresGO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Deep Learning , Molecular Sequence Annotation , Gene Ontology , Computational Biology/methods , Algorithms , Proteins/metabolism
8.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34058752

ABSTRACT

Understanding how a mutation might affect protein stability is of significant importance to protein engineering and for understanding protein evolution and genetic diseases. While a number of computational tools have been developed to predict the effect of missense mutations on protein stability, they are known to exhibit large biases imparted in part by the data used to train and evaluate them. Here, we provide a comprehensive overview of predictive tools, which has provided an evolving insight into the importance and relevance of features that can discern the effects of mutations on protein stability. A diverse selection of these freely available tools was benchmarked using a large mutation-level blind dataset of 1342 experimentally characterised mutations across 130 proteins from ThermoMutDB, a second test dataset encompassing 630 experimentally characterised mutations across 39 proteins from iStable2.0 and a third blind test dataset consisting of 268 mutations in 27 proteins from the newly published ProThermDB. The performance of the methods was further evaluated with respect to the site of mutation, the type of mutant residue, and across ranges of pH and temperature. Additionally, the classification performance was evaluated by classifying the mutations as stabilizing (∆∆G ≥ 0) or destabilizing (∆∆G < 0). The results reveal that the performance of the predictors is affected by the site of mutation and the type of mutant residue. Further, the results show very low performance for pH values 6-8 and temperature higher than 65 for all predictors except iStable2.0 on the S630 dataset. To illustrate how stability and structure change upon single point mutation, we considered four stabilizing, two destabilizing and two stabilizing mutations from two proteins, namely the toxin protein and bovine liver cytochrome.
Overall, the results on S268, S630 and S1342 datasets show that the performance of the integrated predictors is better than the mechanistic or individual machine learning predictors. We expect that this paper will provide useful guidance for the design and development of next-generation bioinformatic tools for predicting protein stability changes upon mutations.


Subject(s)
Computational Biology/methods , Mutation, Missense , Protein Stability , Proteins/chemistry , Proteins/genetics , Software , Algorithms , Databases, Protein , Evolution, Molecular , Machine Learning , Models, Molecular , Protein Conformation , Proteins/metabolism , Reproducibility of Results , Structure-Activity Relationship
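
The classification rule quoted in the abstract above (stabilizing when ∆∆G ≥ 0, destabilizing when ∆∆G < 0) can be sketched in a few lines of Python. This is only an illustration of that sign convention: the function name `classify_ddg` and the sample values are hypothetical and are not taken from the paper or from any of the benchmarked tools.

```python
def classify_ddg(ddg: float) -> str:
    """Label a mutation by the sign of its stability change (ddG)."""
    # Convention quoted in the abstract: ddG >= 0 is stabilizing,
    # ddG < 0 is destabilizing. Other sources may use the opposite sign.
    return "stabilizing" if ddg >= 0 else "destabilizing"


# Hypothetical ddG values, for illustration only.
examples = [1.3, 0.0, -0.7]
labels = [classify_ddg(value) for value in examples]
```

Because sign conventions for ∆∆G differ between databases, a check like this is mainly useful for confirming that predictions and reference values follow the same convention before they are compared.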
9.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33454737

ABSTRACT

Neopeptide-based immunotherapy has been recognised as a promising approach for the treatment of cancers. For neopeptides to be recognised by CD8+ T cells and induce an immune response, their binding to human leukocyte antigen class I (HLA-I) molecules is a necessary first step. Most epitope prediction tools thus rely on the prediction of such binding. With the use of mass spectrometry, the scale of naturally presented HLA ligands that could be used to develop such predictors has been expanded. However, few efforts have focused on integrating these experimental data with computational algorithms to efficiently develop up-to-date predictors. Here, we present Anthem for accurate HLA-I binding prediction. In particular, we have developed a user-friendly framework to support the development of customisable HLA-I binding prediction models to meet challenges associated with the rapidly increasing availability of large amounts of immunopeptidomic data. Our extensive evaluation, using both independent and experimental datasets, shows that Anthem achieves an overall similar or higher area under curve value compared with other contemporary tools. It is anticipated that Anthem will provide a unique opportunity for the non-expert user to analyse and interpret their own in-house or publicly deposited datasets.


Subject(s)
Algorithms , Databases, Protein , Epitopes , Histocompatibility Antigens Class I , Peptides , Software , Epitopes/chemistry , Epitopes/immunology , Histocompatibility Antigens Class I/chemistry , Histocompatibility Antigens Class I/immunology , Humans , Immunotherapy , Neoplasms/immunology , Neoplasms/therapy , Peptides/chemistry , Peptides/immunology
10.
Brief Bioinform ; 22(4)2021 07 20.
Article in English | MEDLINE | ID: mdl-33212503

ABSTRACT

Beta-lactamases (BLs) are enzymes localized in the periplasmic space of bacterial pathogens, where they confer resistance to beta-lactam antibiotics. Experimental identification of BLs is costly yet crucial to understand beta-lactam resistance mechanisms. To address this issue, we present DeepBL, a deep learning-based approach that incorporates sequence-derived features to enable high-throughput prediction of BLs. Specifically, DeepBL is implemented based on the Small VGGNet architecture and the TensorFlow deep learning library. Furthermore, the performance of DeepBL models is investigated in relation to the sequence redundancy level and negative sample selection in the benchmark dataset. The models are trained on datasets of varying sequence redundancy thresholds, and the model performance is evaluated by extensive benchmarking tests. Using the optimized DeepBL model, we perform proteome-wide screening for all reviewed bacterial protein sequences available from the UniProt database. These results are freely accessible at the DeepBL webserver at http://deepbl.erc.monash.edu.au/.


Subject(s)
Computational Biology , Databases, Protein , Deep Learning , Proteome , Software , beta-Lactamases/genetics
11.
Bioinformatics ; 38(17): 4206-4213, 2022 09 02.
Article in English | MEDLINE | ID: mdl-35801909

ABSTRACT

MOTIVATION: The molecular subtyping of gastric cancer (adenocarcinoma) into four main subtypes based on integrated multiomics profiles, as proposed by The Cancer Genome Atlas (TCGA) initiative, represents an effective strategy for patient stratification. However, this approach requires the use of multiple technological platforms, and is quite expensive and time-consuming to perform. A computational approach that uses histopathological image data to infer molecular subtypes could be a practical, cost- and time-efficient complementary tool for prognostic and clinical management purposes. RESULTS: Here, we propose a deep learning ensemble approach (called DEMoS) capable of predicting the four recognized molecular subtypes of gastric cancer directly from histopathological images. On an independent test dataset, DEMoS achieved tile-level area under the receiver-operating characteristic curve (AUROC) values of 0.785, 0.668, 0.762 and 0.811, respectively, for the prediction of these four subtypes of gastric cancer [i.e. (i) Epstein-Barr (EBV)-infected, (ii) microsatellite instability (MSI), (iii) genomically stable (GS) and (iv) chromosomally unstable tumors (CIN)]. At the patient-level, it achieved AUROC values of 0.897, 0.764, 0.890 and 0.898, respectively. Thus, these four subtypes are well-predicted by DEMoS. Benchmarking experiments further suggest that DEMoS is able to achieve an improved classification performance for image-based subtyping and prevent model overfitting. This study highlights the feasibility of using a deep learning ensemble-based method to rapidly and reliably subtype gastric cancer (adenocarcinoma) solely using features from histopathological images. AVAILABILITY AND IMPLEMENTATION: All whole slide images used in this study were collected from the TCGA database. This study builds upon our previously published HEAL framework, with related documentation and tutorials available at http://heal.erc.monash.edu.au.
The source code and related models are freely accessible at https://github.com/Docurdt/DEMoS.git. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Adenocarcinoma , Deep Learning , Stomach Neoplasms , Humans , Stomach Neoplasms/diagnostic imaging , Stomach Neoplasms/genetics , Adenocarcinoma/diagnostic imaging , Adenocarcinoma/genetics , Microsatellite Instability
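
AUROC, the metric reported in the abstract above at tile and patient level, equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition. It is not DEMoS code: the function name `auroc` and the toy scores are hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


# Toy scores for a one-vs-rest subtype call (illustrative values only).
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_labels = [0, 0, 1, 1]
toy_auroc = auroc(toy_scores, toy_labels)
```

For a four-subtype problem such as the one described, a one-vs-rest AUROC would be computed once per subtype by treating that subtype as the positive class.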
12.
Br J Clin Pharmacol ; 89(2): 914-920, 2023 02.
Article in English | MEDLINE | ID: mdl-36301837

ABSTRACT

The COVID-19 pandemic has disrupted the seeking and delivery of healthcare. Different Australian jurisdictions implemented different COVID-19 restrictions. We used Australian national pharmacy dispensing data to conduct interrupted time series analyses to examine the incidence and prevalence of opioid dispensing in different jurisdictions. Following nationwide COVID-19 restrictions, the incidence changed by -0.40 (95% confidence interval [CI]: -0.50, -0.31), -0.33 (95% CI: -0.46, -0.21) and -0.21 (95% CI: -0.37, -0.04) per 1000 people per week and the prevalence changed by -0.85 (95% CI: -1.39, -0.31), -0.54 (95% CI: -1.01, -0.07) and -0.62 (95% CI: -0.99, -0.25) per 1000 people per week in Victoria, New South Wales and other jurisdictions, respectively. Incidence and prevalence increased by 0.29 (95% CI: 0.13, 0.44) and 0.72 (95% CI: 0.11, 1.33) per 1000 people per week, respectively, in Victoria post-lockdown; no significant changes were observed in other jurisdictions. No significant changes were observed in the initiation of long-term opioid use in any jurisdiction. More stringent restrictions coincided with more pronounced reductions in overall opioid initiation, but initiation of long-term opioid use did not change.


Subject(s)
COVID-19 , Opioid-Related Disorders , Humans , Analgesics, Opioid/therapeutic use , Australia/epidemiology , Prevalence , Incidence , Pandemics , COVID-19/epidemiology , Communicable Disease Control , Opioid-Related Disorders/epidemiology , Opioid-Related Disorders/prevention & control , Opioid-Related Disorders/drug therapy , Drug Prescriptions
13.
J Biomed Inform ; 147: 104509, 2023 11.
Article in English | MEDLINE | ID: mdl-37827477

ABSTRACT

The adoption of electronic health records (EHRs) has created opportunities to analyse historical data for predicting clinical outcomes and improving patient care. However, non-standardised data representations and anomalies pose major challenges to the use of EHRs in digital health research. To address these challenges, we have developed EHR-QC, a tool comprising two modules: the data standardisation module and the preprocessing module. The data standardisation module migrates source EHR data to a standard format using advanced concept mapping techniques, surpassing expert curation in benchmarking analysis. The preprocessing module includes several functions designed specifically to handle healthcare data subtleties. We provide automated detection of data anomalies and solutions to handle those anomalies. We believe that the development and adoption of tools like EHR-QC is critical for advancing digital health. Our ultimate goal is to accelerate clinical research by enabling rapid experimentation with data-driven observational research to generate robust, generalisable biomedical knowledge.


Subject(s)
Benchmarking , Electronic Health Records , Humans , Empirical Research , Research Design
14.
Nucleic Acids Res ; 49(10): e60, 2021 06 04.
Article in English | MEDLINE | ID: mdl-33660783

ABSTRACT

Sequence-based analysis and prediction are fundamental bioinformatic tasks that facilitate understanding of the sequence(-structure)-function paradigm for DNAs, RNAs and proteins. Rapid accumulation of sequences requires equally pervasive development of new predictive models, which depends on the availability of effective tools that support these efforts. We introduce iLearnPlus, the first machine-learning platform with graphical- and web-based interfaces for the construction of machine-learning pipelines for analysis and predictions using nucleic acid and protein sequences. iLearnPlus provides a comprehensive set of algorithms and automates sequence-based feature extraction and analysis, construction and deployment of models, assessment of predictive performance, statistical analysis, and data visualization; all without programming. iLearnPlus includes a wide range of feature sets which encode information from the input sequences and over twenty machine-learning algorithms that cover several deep-learning approaches, outnumbering the current solutions by a wide margin. Our solution caters to experienced bioinformaticians, given the broad range of options, and biologists with no programming background, given the point-and-click interface and easy-to-follow design process. We showcase iLearnPlus with two case studies concerning prediction of long noncoding RNAs (lncRNAs) from RNA transcripts and prediction of crotonylation sites in protein chains. iLearnPlus is an open-source platform available at https://github.com/Superzchen/iLearnPlus/ with the webserver at http://ilearnplus.erc.monash.edu/.


Subject(s)
Computational Biology/methods , Machine Learning , Sequence Analysis/methods , Software , Amino Acid Sequence , Animals , Base Sequence , Humans
15.
Brief Bioinform ; 21(5): 1676-1696, 2020 09 25.
Article in English | MEDLINE | ID: mdl-31714956

ABSTRACT

RNA post-transcriptional modifications play a crucial role in a myriad of biological processes and cellular functions. To date, more than 160 RNA modifications have been discovered; therefore, accurate identification of RNA-modification sites is fundamental for a better understanding of RNA-mediated biological functions and mechanisms. However, due to limitations in experimental methods, systematic identification of different types of RNA-modification sites remains a major challenge. Recently, more than 20 computational methods have been developed to identify RNA-modification sites in tandem with high-throughput experimental methods, with most of these capable of predicting only single types of RNA-modification sites. These methods show high diversity in their dataset size, data quality, core algorithms, features extracted and feature selection techniques and evaluation strategies. Therefore, there is an urgent need to revisit these methods and summarize their methodologies, in order to improve and further develop computational techniques to identify and characterize RNA-modification sites from the large amounts of sequence data. With this goal in mind, first, we provide a comprehensive survey on a large collection of 27 state-of-the-art approaches for predicting N1-methyladenosine and N6-methyladenosine sites. We cover a variety of important aspects that are crucial for the development of successful predictors, including the dataset quality, operating algorithms, sequence and genomic features, feature selection, model performance evaluation and software utility. In addition, we also provide our thoughts on potential strategies to improve the model performance. Second, we propose a computational approach called DeepPromise based on deep learning techniques for simultaneous prediction of N1-methyladenosine and N6-methyladenosine. 
To extract the sequence context surrounding the modification sites, three feature encodings, including enhanced nucleic acid composition, one-hot encoding, and RNA embedding, were used as the input to seven consecutive layers of convolutional neural networks (CNNs), respectively. Moreover, DeepPromise further combined the prediction score of the CNN-based models and achieved around 43% higher area under receiver-operating curve (AUROC) for m1A site prediction and 2-6% higher AUROC for m6A site prediction, respectively, when compared with several existing state-of-the-art approaches on the independent test. In-depth analyses of characteristic sequence motifs identified from the convolution-layer filters indicated that nucleotide presentation at proximal positions surrounding the modification sites contributed most to the classification, whereas those at distal positions also affected classification but to different extents. To maximize user convenience, a web server was developed as an implementation of DeepPromise and made publicly available at http://DeepPromise.erc.monash.edu/, with the server accepting both RNA sequences and genomic sequences to allow prediction of two types of putative RNA-modification sites.


Subject(s)
Computational Biology/methods , RNA Processing, Post-Transcriptional , RNA/genetics , Sequence Analysis, RNA/methods , Algorithms , Deep Learning
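
One-hot encoding, one of the three feature encodings named in the abstract above, turns each nucleotide in a sequence window into a four-element indicator vector. The sketch below shows the general technique only and is not DeepPromise code. The function name `one_hot_rna`, the A/C/G/U column order, the mapping of T to U for genomic input, and the all-zero row for unrecognised symbols such as N are all assumptions made for illustration.

```python
from typing import List

ALPHABET = "ACGU"  # assumed column order


def one_hot_rna(seq: str) -> List[List[int]]:
    """Encode a nucleotide sequence as one indicator row per position."""
    rows = []
    # Assumption: genomic (DNA) input is handled by treating T as U.
    for base in seq.upper().replace("T", "U"):
        # Symbols outside ALPHABET (e.g. N) produce an all-zero row.
        rows.append([1 if base == letter else 0 for letter in ALPHABET])
    return rows


# Hypothetical 5-nt window, for illustration only.
window = one_hot_rna("GGACU")
```

The resulting matrix has one row per position and four columns, which is the shape a one-dimensional convolutional layer would typically consume.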
16.
Brief Bioinform ; 21(3): 1069-1079, 2020 05 21.
Article in English | MEDLINE | ID: mdl-31161204

ABSTRACT

Post-translational modifications (PTMs) play very important roles in various cell signaling pathways and biological processes. Because of these roles, many major PTMs have been studied, and their functional and mechanical characterization is well documented in several databases. However, most currently available databases mainly focus on protein sequences, while the real 3D structures of PTMs have been largely ignored. Therefore, studies of PTM 3D structural signatures have been severely limited by the deficiency of the data. Here, we develop PRISMOID, a novel publicly available and free 3D structure database for a wide range of PTMs. PRISMOID represents an up-to-date and interactive online knowledge base with specific focus on 3D structural contexts of PTM sites and on mutations that occur on PTMs and in the close proximity of PTM sites with functional impact. The first version of PRISMOID encompasses 17 145 non-redundant modification sites on 3919 related protein 3D structure entries pertaining to 37 different types of PTMs. Our entry web page is organized in a comprehensive manner, including detailed PTM annotation on the 3D structure and biological information in terms of mutations affecting PTMs, secondary structure features and per-residue solvent accessibility features of PTM sites, domain context, predicted natively disordered regions and sequence alignments. In addition, high-definition JavaScript packages are employed to enhance information visualization in PRISMOID. PRISMOID offers a variety of interactive and customizable search options and data browsing functions; these capabilities allow users to access data via keyword, ID and advanced options combination search in an efficient and user-friendly way. A download page is also provided to enable users to download the SQL file, computational structural features and PTM sites' data.
We anticipate PRISMOID will swiftly become an invaluable online resource, assisting both biologists and bioinformaticians to conduct experiments and develop applications supporting discovery efforts in the sequence-structural-functional relationship of PTMs and providing important insight into mutations and PTM sites interaction mechanisms. The PRISMOID database is freely accessible at http://prismoid.erc.monash.edu/. The database and web interface are implemented in MySQL, JSP, JavaScript and HTML with all major browsers supported.


Subject(s)
Protein Databases , Mutation , Post-Translational Protein Processing , Proteins/chemistry , Protein Conformation
17.
Brief Bioinform ; 21(3): 1047-1057, 2020 05 21.
Article in English | MEDLINE | ID: mdl-31067315

ABSTRACT

With the explosive growth of biological sequences generated in the post-genomic era, one of the most challenging problems in bioinformatics and computational biology is to computationally characterize sequences, structures and functions in an efficient, accurate and high-throughput manner. A number of online web servers and stand-alone tools have been developed to address this problem; however, all of these tools have limitations in terms of their effectiveness, user-friendliness and capacity. Here, we present iLearn, a comprehensive and versatile Python-based toolkit that integrates feature extraction, clustering, normalization, selection, dimensionality reduction, predictor construction, best descriptor/model selection, ensemble learning and results visualization for DNA, RNA and protein sequences. iLearn was designed for users who only want to upload their data set and select the functions they need calculated from it, while all necessary procedures and optimal settings are completed automatically by the software. iLearn includes a variety of descriptors for DNA, RNA and proteins, and four feature output formats are supported to facilitate direct use of the output or communication with other computational tools. In total, iLearn encompasses 16 different types of feature clustering, selection, normalization and dimensionality reduction algorithms, and five commonly used machine-learning algorithms, thereby greatly facilitating feature analysis and predictor construction. iLearn is made freely available via an online web server and a stand-alone toolkit.
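
To make the feature extraction and normalization steps more concrete, the following is a minimal sketch of one possible sequence descriptor (k-mer frequencies) followed by min-max normalization. It is not iLearn's API: the function names `kmer_frequencies` and `min_max_normalize`, the choice of descriptor, and the example sequences are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=2, alphabet="ACGT"):
    """Return k-mer frequencies in a fixed (lexicographic) order.

    Hypothetical helper: one simple descriptor of the kind a
    sequence-analysis toolkit might compute, not iLearn's own code.
    """
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only k-mers built from the given alphabet are counted.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def min_max_normalize(rows):
    """Scale each feature column to [0, 1]; constant columns map to 0."""
    scaled_columns = []
    for column in zip(*rows):
        low, high = min(column), max(column)
        span = high - low
        scaled_columns.append(
            [(value - low) / span if span else 0.0 for value in column]
        )
    # Transpose back to one row per sequence.
    return [list(row) for row in zip(*scaled_columns)]


# Illustrative DNA sequences (made up for this example).
sequences = ["ACGTACGT", "AAAACCCC", "GGGTTTAA"]
features = [kmer_frequencies(s, k=2) for s in sequences]
normalized = min_max_normalize(features)
```

Each sequence becomes a fixed-length vector (16 values for DNA 2-mers), which is the form that downstream clustering, selection or machine-learning steps generally expect.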


Subject(s)
DNA/chemistry , Machine Learning , Proteins/chemistry , RNA/chemistry , Sequence Analysis/methods , Algorithms , Internet
18.
Bioinformatics ; 37(21): 3986-3988, 2021 11 05.
Article in English | MEDLINE | ID: mdl-34061168

ABSTRACT

MOTIVATION: Tumor tile selection is a necessary prerequisite in patch-based cancer whole slide image analysis, but it is labor-intensive and requires expertise. Whole slides are annotated as tumor or tumor-free, but tiles within a tumor slide are not. As all tiles within a tumor-free slide are tumor-free, these can be used to capture tumor-free patterns using the one-class learning strategy. RESULTS: We present a Python package, termed OCTID, which combines a pretrained convolutional neural network (CNN) model, Uniform Manifold Approximation and Projection (UMAP) and a one-class support vector machine to achieve accurate tumor tile classification using a training set of tumor-free tiles. Benchmarking experiments on four H&E image datasets achieved remarkable performance in terms of F1-score (0.90 ± 0.06), Matthews correlation coefficient (0.93 ± 0.05) and accuracy (0.94 ± 0.03). AVAILABILITY AND IMPLEMENTATION: Detailed information can be found in the Supplementary File. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
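
The three reported metrics can all be derived from a binary confusion matrix. The sketch below shows the standard definitions of accuracy, F1-score and Matthews correlation coefficient for tile labels; it is not OCTID code, and the function names and the toy labels (1 = tumor tile, 0 = tumor-free tile) are assumptions chosen for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def tile_metrics(y_true, y_pred):
    """Return accuracy, F1-score and MCC (standard definitions)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # By convention, MCC is set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Toy labels, made up for this example.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = tile_metrics(y_true, y_pred)
```

For these toy labels the counts are 3 true positives, 1 false positive, 3 true negatives and 1 false negative, giving an accuracy of 0.75, an F1-score of 0.75 and an MCC of 0.5.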


Subject(s)
Computer-Assisted Image Processing , Neoplasms , Neural Networks (Computer) , Programming Languages , Neoplasms/diagnostic imaging , Humans , Computer-Assisted Image Processing/methods , Machine Learning , Datasets as Topic
19.
Bioinformatics ; 37(22): 4291-4295, 2021 11 18.
Article in English | MEDLINE | ID: mdl-34009289

ABSTRACT

MOTIVATION: Digital pathology supports analysis of histopathological images using deep learning methods at a large scale. However, applications of deep learning in this area have been limited by the complexity of configuring the computational environment and of optimizing hyperparameters, which hinders deployment and reduces reproducibility. RESULTS: Here, we propose HEAL, a deep learning-based automated framework for easy, flexible and multi-faceted histopathological image analysis. We demonstrate its utility and functionality by performing two case studies on lung cancer and one on colon cancer. Leveraging the capability of Docker, HEAL represents an ideal end-to-end tool to conduct complex histopathological analysis and enables deep learning in a broad range of applications for cancer image analysis. AVAILABILITY AND IMPLEMENTATION: The Docker image of HEAL is available at https://hub.docker.com/r/docurdt/heal and related documentation and datasets are available at http://heal.erc.monash.edu.au. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Colonic Neoplasms , Deep Learning , Humans , Software , Reproducibility of Results
20.
J Chem Inf Model ; 62(17): 4270-4282, 2022 09 12.
Article in English | MEDLINE | ID: mdl-35973091

ABSTRACT

An essential step in engineering proteins and understanding disease-causing missense mutations is to accurately model protein stability changes when such mutations occur. Here, we developed a new sequence-based predictor for the protein stability (PROST) change (Gibbs free energy change, ΔΔG) upon a single-point missense mutation. PROST extracts multiple descriptors from the most promising sequence-based predictors, such as BoostDDG, SAAFEC-SEQ, and DDGun. PROST also extracts descriptors from iFeature and AlphaFold2. The extracted descriptors include sequence-based features, physicochemical properties, evolutionary information, evolutionary-based physicochemical properties, and predicted structural features. The PROST predictor is a weighted average ensemble model based on extreme gradient boosting (XGBoost) decision trees and an extra-trees regressor; PROST is trained on both direct and hypothetical reverse mutations using the S5294 data set (S2647 direct mutations + S2647 inverse mutations). The parameters for the PROST model are optimized using grid searching with 5-fold cross-validation, and feature importance analysis unveils the most relevant features. The performance of PROST is evaluated in a blinded manner, employing nine distinct data sets and existing state-of-the-art sequence-based and structure-based predictors. This method consistently performs well on frataxin, S217, S349, Ssym, S669, Myoglobin, and CAGI5 data sets in blind tests and similarly to the state-of-the-art predictors for p53 and S276 data sets. When the performance of PROST is compared with the latest predictors such as BoostDDG, SAAFEC-SEQ, ACDC-NN-seq, and DDGun, PROST dominates these predictors. A case study of mutation scanning of the frataxin protein for nine wild-type residues demonstrates the utility of PROST. Taken together, these findings indicate that PROST is a well-suited predictor when no protein structural information is available. The source code of PROST, data sets, examples, and pretrained models along with how to use PROST are available at https://github.com/ShahidIqb/PROST and https://prost.erc.monash.edu/seq.
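
Two ideas in this abstract lend themselves to a short sketch: building a training set from direct and hypothetical reverse mutations, and combining two regressors by a weighted average. The code below is illustrative only and is not taken from the PROST repository. It assumes the usual antisymmetry convention, in which the reverse mutation swaps wild-type and mutant residues and flips the sign of ΔΔG; the function names, record layout, ΔΔG values and ensemble weights are all hypothetical.

```python
def add_reverse_mutations(records):
    """Append one hypothetical reverse mutation per direct mutation.

    Each record is (wild_type, position, mutant, ddg). Assumption:
    ddg(mutant -> wild_type) = -ddg(wild_type -> mutant). In a real
    pipeline the sequence context would also switch to the mutant
    sequence; that step is omitted here.
    """
    augmented = list(records)
    for wild_type, position, mutant, ddg in records:
        augmented.append((mutant, position, wild_type, -ddg))
    return augmented


def weighted_average(predictions, weights):
    """Combine per-model predictions for one sample by weighted average."""
    if len(predictions) != len(weights):
        raise ValueError("predictions and weights must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(p * w for p, w in zip(predictions, weights))
    return weighted_sum / total_weight


# Made-up direct mutations with made-up ddg values (kcal/mol).
direct = [("A", 10, "G", -1.2), ("L", 42, "P", -2.5)]
training_set = add_reverse_mutations(direct)

# Hypothetical outputs of two regressors for one mutation, with
# hypothetical ensemble weights.
ensemble_ddg = weighted_average([1.0, 2.0], [0.6, 0.4])
```

Doubling the data in this way mirrors how 2647 direct mutations plus 2647 inverse mutations yield the 5294 entries of S5294, and it encourages a model to predict opposite-signed values for a mutation and its reverse.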


Subject(s)
Missense Mutation , Zygote Intrafallopian Transfer , Protein Stability , Proteins/chemistry , Software