Search | BVS CLAP/SMR-OPS/OMS

1.

DeepMHCI: an anchor position-aware deep interaction model for accurate MHC-I peptide binding affinity prediction.

Qu, Wei; You, Ronghui; Mamitsuka, Hiroshi; Zhu, Shanfeng.

Bioinformatics ; 39(9)2023 09 02.

Article in English | MEDLINE | ID: mdl-37669154

ABSTRACT

MOTIVATION: Computationally predicting major histocompatibility complex class I (MHC-I) peptide binding affinity is an important problem in immunological bioinformatics, which is also crucial for the identification of neoantigens for personalized therapeutic cancer vaccines. Recent cutting-edge deep learning-based methods for this problem cannot achieve satisfactory performance, especially for non-9-mer peptides. This is because such methods generate the input by simply concatenating the two given sequences: a peptide and (the pseudo sequence of) an MHC class I molecule, which cannot precisely capture the anchor positions of the MHC binding motif for the peptides with variable lengths. We thus developed an anchor position-aware and high-performance deep model, DeepMHCI, with a position-wise gated layer and a residual binding interaction convolution layer. This allows the model to control the information flow in peptides to be aware of anchor positions and model the interactions between peptides and the MHC pseudo (binding) sequence directly with multiple convolutional kernels. RESULTS: The performance of DeepMHCI has been thoroughly validated by extensive experiments on four benchmark datasets under various settings, such as 5-fold cross-validation, validation with the independent testing set, external HPV vaccine identification, and external CD8+ epitope identification. Experimental results with visualization of binding motifs demonstrate that DeepMHCI outperformed all competing methods, especially on non-9-mer peptides binding prediction. AVAILABILITY AND IMPLEMENTATION: DeepMHCI is publicly available at https://github.com/ZhuLab-Fudan/DeepMHCI.

Subject(s)

Algorithms; Benchmarking; Computational Biology; Epitopes; Peptides

2.

Sc2Mol: a scaffold-based two-step molecule generator with variational autoencoder and transformer.

Liao, Zhirui; Xie, Lei; Mamitsuka, Hiroshi; Zhu, Shanfeng.

Bioinformatics ; 39(1)2023 01 01.

Article in English | MEDLINE | ID: mdl-36576008

ABSTRACT

MOTIVATION: Finding molecules with desired pharmaceutical properties is crucial in drug discovery. Generative models can be an efficient tool to find desired molecules through the distribution learned by the model to approximate given training data. Existing generative models (i) do not consider backbone structures (scaffolds), resulting in inefficiency or (ii) need prior patterns for scaffolds, causing bias. Scaffolds are reasonable to use, and it is imperative to design a generative model without any prior scaffold patterns. RESULTS: We propose a generative model-based molecule generator, Sc2Mol, without any prior scaffold patterns. Sc2Mol uses SMILES strings for molecules. It consists of two steps: scaffold generation and scaffold decoration, which are carried out by a variational autoencoder and a transformer, respectively. The two steps are powerful for implementing random molecule generation and scaffold optimization. Our empirical evaluation using drug-like molecule datasets confirmed the success of our model in distribution learning and molecule optimization. Also, our model could automatically learn the rules to transform coarse scaffolds into sophisticated drug candidates. These rules were consistent with those for current lead optimization. AVAILABILITY AND IMPLEMENTATION: The code is available at https://github.com/zhiruiliao/Sc2Mol. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Drug Discovery; Machine Learning

3.

Improving drug response prediction by integrating multiple data sources: matrix factorization, kernel and network-based approaches.

Güvenç Paltun, Betül; Mamitsuka, Hiroshi; Kaski, Samuel.

Brief Bioinform ; 22(1): 346-359, 2021 01 18.

Article in English | MEDLINE | ID: mdl-31838491

ABSTRACT

Predicting the response of cancer cell lines to specific drugs is one of the central problems in personalized medicine, where the cell lines show diverse characteristics. Researchers have developed a variety of computational methods to discover associations between drugs and cell lines, and improved drug sensitivity analyses by integrating heterogeneous biological data. However, choosing informative data sources and methods that can incorporate multiple sources efficiently is the challenging part of successful analysis in personalized medicine. The reason is that finding decisive factors of cancer and developing methods that can overcome the problems of integrating data, such as differences in data structures and data complexities, are difficult. In this review, we summarize recent advances in data integration-based machine learning for drug response prediction, by categorizing methods as matrix factorization-based, kernel-based and network-based methods. We also present a short description of relevant databases used as a benchmark in drug response prediction analyses, followed by providing a brief discussion of challenges faced in integrating and interpreting data from multiple sources. Finally, we address the advantages of combining multiple heterogeneous data sources on drug sensitivity analysis by showing an experimental comparison. Contact: betul.guvenc@aalto.fi.

Subject(s)

Drug Resistance, Neoplasm; Genomics/methods; Precision Medicine/methods; Humans; Machine Learning; Pharmacogenomic Variants

4.

A survey on adverse drug reaction studies: data, tasks and machine learning methods.

Nguyen, Duc Anh; Nguyen, Canh Hao; Mamitsuka, Hiroshi.

Brief Bioinform ; 22(1): 164-177, 2021 01 18.

Article in English | MEDLINE | ID: mdl-31838499

ABSTRACT

MOTIVATION: Adverse drug reaction (ADR) or drug side effect studies play a crucial role in drug discovery. Recently, with the rapid increase of both clinical and non-clinical data, machine learning methods have emerged as prominent tools to support analyzing and predicting ADRs. Nonetheless, challenges remain in ADR studies. RESULTS: In this paper, we summarize ADR data sources and review ADR studies in three tasks: drug-ADR benchmark data creation, drug-ADR prediction and ADR mechanism analysis. We focus on the machine learning methods used in each task and then compare the performance of these methods on the drug-ADR prediction task. Finally, we discuss open problems for further ADR studies. AVAILABILITY: Data and code are available at https://github.com/anhnda/ADRPModels.

Subject(s)

Computational Biology/methods; Drug-Related Side Effects and Adverse Reactions/etiology; Machine Learning; Humans

5.

Machine learning approaches for drug combination therapies.

Güvenç Paltun, Betül; Kaski, Samuel; Mamitsuka, Hiroshi.

Brief Bioinform ; 22(6)2021 11 05.

Article in English | MEDLINE | ID: mdl-34368832

ABSTRACT

Drug combination therapy is a promising strategy to treat complex diseases such as cancer and infectious diseases. However, current knowledge of drug combination therapies, especially in cancer patients, is limited because of adverse drug effects, toxicity and cell line heterogeneity. Screening new drug combinations requires substantial efforts since considering all possible combinations between drugs is infeasible and expensive. Therefore, building computational approaches, particularly machine learning methods, could provide an effective strategy to overcome drug resistance and improve therapeutic efficacy. In this review, we group the state-of-the-art machine learning approaches to analyze personalized drug combination therapies into three categories and discuss each method in each category. We also present a short description of relevant databases used as a benchmark in drug combination therapies and provide a list of well-known, publicly available interactive data analysis portals. We highlight the importance of data integration on the identification of drug combinations. Finally, we address the advantages of combining multiple data sources on drug combination analysis by showing an experimental comparison.

Subject(s)

Machine Learning; Antineoplastic Combined Chemotherapy Protocols/administration & dosage; Computational Biology/methods; Humans; Neoplasms/drug therapy; Precision Medicine

6.

XGSEA: CROSS-species gene set enrichment analysis via domain adaptation.

Cai, Menglan; Hao Nguyen, Canh; Mamitsuka, Hiroshi; Li, Limin.

Brief Bioinform ; 22(5)2021 09 02.

Article in English | MEDLINE | ID: mdl-33515011

ABSTRACT

MOTIVATION: Gene set enrichment analysis (GSEA) has been widely used to identify gene sets with a statistically significant difference between cases and controls against a large gene set. GSEA needs both phenotype labels and expression of genes. However, gene expression is assessed more often for model organisms than for minor species. Also, importantly, gene expression is not measured well under specific conditions for humans, owing to the high risk of direct experiments such as non-approved treatment or gene knockout, and is then often substituted by mouse data. Thus, predicting enrichment significance (on a phenotype) of a given gene set of a species (target, say human), by using gene expression measured under the same phenotype of the other species (source, say mouse), is a vital and challenging problem, which we call the CROSS-species gene set enrichment problem (XGSEP). RESULTS: For XGSEP, we propose the CROSS-species gene set enrichment analysis (XGSEA), with three steps: (1) running GSEA for a source species to obtain enrichment scores and $p$-values of source gene sets; (2) representing the relation between source and target gene sets by domain adaptation; and (3) using regression to predict $p$-values of target gene sets, based on the representation in (2). We extensively validated XGSEA by using five regression measures and one classification measure on four real data sets under various settings, showing that XGSEA significantly outperformed three baseline methods in most cases. A case study of identifying important human pathways for T-cell dysfunction and reprogramming from mouse ATAC-Seq data further confirmed the reliability of XGSEA. AVAILABILITY: Source code of XGSEA is available through https://github.com/LiminLi-xjtu/XGSEA.

Subject(s)

Brain Neoplasms/genetics; Machine Learning; Melanoma/genetics; Ovarian Neoplasms/genetics; Skin Neoplasms/genetics; Animals; Brain Neoplasms/immunology; Brain Neoplasms/pathology; Computational Biology/methods; Datasets as Topic; Embryo, Mammalian; Female; Gene Expression Regulation, Neoplastic; Humans; Melanoma/immunology; Melanoma/pathology; Mice; Ovarian Neoplasms/immunology; Ovarian Neoplasms/pathology; Skin Neoplasms/immunology; Skin Neoplasms/pathology; T-Lymphocytes/immunology; T-Lymphocytes/pathology; Zebrafish

7.

HPODNets: deep graph convolutional networks for predicting human protein-phenotype associations.

Liu, Lizhi; Mamitsuka, Hiroshi; Zhu, Shanfeng.

Bioinformatics ; 38(3): 799-808, 2022 01 12.

Article in English | MEDLINE | ID: mdl-34672333

ABSTRACT

MOTIVATION: Deciphering the relationship between human genes/proteins and abnormal phenotypes is of great importance in the prevention, diagnosis and treatment of diseases. The Human Phenotype Ontology (HPO) is a standardized vocabulary that describes the phenotype abnormalities encountered in human disorders. However, the current HPO annotations are still incomplete. Thus, it is necessary to computationally predict human protein-phenotype associations. In terms of current, cutting-edge computational methods for annotating proteins (such as functional annotation), three important features are (i) multiple network input, (ii) semi-supervised learning and (iii) deep graph convolutional network (GCN), but no existing method combines all of these features for predicting HPO annotations of human proteins. RESULTS: We develop HPODNets with all three of these features for predicting human protein-phenotype associations. HPODNets adopts a deep GCN with eight layers, which allows it to capture high-order topological information from multiple interaction networks. Empirical results with both cross-validation and temporal validation demonstrate that HPODNets outperforms seven competing state-of-the-art methods for protein function prediction. HPODNets with the architecture of deep GCNs is confirmed to be effective for predicting HPO annotations of human proteins and, more generally, for node label ranking problems with multiple biomolecular networks as input in bioinformatics. AVAILABILITY AND IMPLEMENTATION: https://github.com/liulizhi1996/HPODNets. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms; Computational Biology; Humans; Computational Biology/methods; Phenotype

8.

DeepMHCII: a novel binding core-aware deep interaction model for accurate MHC-II peptide binding affinity prediction.

You, Ronghui; Qu, Wei; Mamitsuka, Hiroshi; Zhu, Shanfeng.

Bioinformatics ; 38(Suppl 1): i220-i228, 2022 06 24.

Article in English | MEDLINE | ID: mdl-35758790

ABSTRACT

MOTIVATION: Computationally predicting major histocompatibility complex (MHC)-peptide binding affinity is an important problem in immunological bioinformatics. Recent cutting-edge deep learning-based methods for this problem are unable to achieve satisfactory performance for MHC class II molecules. This is because such methods generate the input by simply concatenating the two given sequences: (the estimated binding core of) a peptide and (the pseudo sequence of) an MHC class II molecule, ignoring biological knowledge behind the interactions of the two molecules. We thus propose a binding core-aware deep learning-based model, DeepMHCII, with a binding interaction convolution layer, which integrates all potential binding cores (in a given peptide) with the MHC pseudo (binding) sequence by modeling the interaction with multiple convolutional kernels. RESULTS: Extensive empirical experiments with four large-scale datasets demonstrate that DeepMHCII significantly outperformed four state-of-the-art methods under numerous settings, such as 5-fold cross-validation, leave-one-molecule-out validation, validation with independent testing sets and binding core prediction. All these results and visualization of the predicted binding cores indicate the effectiveness of our model, DeepMHCII, and the importance of properly modeling biological facts in deep learning for high predictive performance and efficient knowledge discovery. AVAILABILITY AND IMPLEMENTATION: DeepMHCII is publicly available at https://github.com/yourh/DeepMHCII. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Histocompatibility Antigens Class II; Peptides; Algorithms; Histocompatibility Antigens Class II/metabolism; Peptides/chemistry; Protein Binding; Protein Transport
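
The DeepMHCII abstract above refers to "all potential binding cores (in a given peptide)". As a rough illustration only, the sketch below enumerates every contiguous window of a peptide as a candidate core. The function name `candidate_binding_cores`, the default core length of 9 residues, and the example peptide are assumptions made for this sketch; they are not taken from the abstract or from the DeepMHCII code.

```python
def candidate_binding_cores(peptide, core_length=9):
    """List every contiguous window of `core_length` residues in `peptide`.

    Hypothetical helper: core_length=9 is an assumption for illustration,
    not a value stated in the abstract.
    Returns (start_index, core_sequence) pairs, in order of start index.
    """
    if core_length <= 0:
        raise ValueError("core_length must be positive")
    if len(peptide) < core_length:
        return []  # peptide too short to contain any core of this length
    last_start = len(peptide) - core_length
    return [(start, peptide[start:start + core_length])
            for start in range(last_start + 1)]


if __name__ == "__main__":
    # Arbitrary 15-residue string used only to show the windowing.
    for start, core in candidate_binding_cores("ACDEFGHIKLMNPQR"):
        print(start, core)
```

A peptide of length L yields L - core_length + 1 candidates, so the number of windows a model would have to score grows only linearly with peptide length.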

9.

SPARSE: a sparse hypergraph neural network for learning multiple types of latent combinations to accurately predict drug-drug interactions.

Nguyen, Duc Anh; Nguyen, Canh Hao; Petschner, Peter; Mamitsuka, Hiroshi.

Bioinformatics ; 38(Suppl 1): i333-i341, 2022 06 24.

Article in English | MEDLINE | ID: mdl-35758803

ABSTRACT

MOTIVATION: Predicting side effects of drug-drug interactions (DDIs) is an important task in pharmacology. The state-of-the-art methods for DDI prediction use hypergraph neural networks to learn latent representations of drugs and side effects to express high-order relationships among two interacting drugs and a side effect. The idea of these methods is that each side effect is caused by a unique combination of latent features of the corresponding interacting drugs. However, in reality, a side effect might have multiple, different mechanisms that cannot be represented by a single combination of latent features of drugs. Moreover, DDI data are sparse, suggesting that using a sparsity regularization would help to learn better latent representations to improve prediction performances. RESULTS: We propose SPARSE, which encodes the DDI hypergraph and drug features to latent spaces to learn multiple types of combinations of latent features of drugs and side effects, controlling the model sparsity by a sparse prior. Our extensive experiments using both synthetic and three real-world DDI datasets showed the clear predictive performance advantage of SPARSE over cutting-edge competing methods. Also, latent feature analysis over unknown top predictions by SPARSE demonstrated the interpretability advantage contributed by the model sparsity. AVAILABILITY AND IMPLEMENTATION: Code and data can be accessed at https://github.com/anhnda/SPARSE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Drug-Related Side Effects and Adverse Reactions; Neural Networks, Computer; Drug Interactions; Humans

10.

DeepGraphGO: graph neural network for large-scale, multispecies protein function prediction.

You, Ronghui; Yao, Shuwei; Mamitsuka, Hiroshi; Zhu, Shanfeng.

Bioinformatics ; 37(Suppl_1): i262-i271, 2021 07 12.

Article in English | MEDLINE | ID: mdl-34252926

ABSTRACT

MOTIVATION: Automated function prediction (AFP) of proteins is a large-scale multi-label classification problem. Two limitations of most network-based methods for AFP are (i) a single model must be trained for each species and (ii) protein sequence information is totally ignored. These limitations cause weaker performance than sequence-based methods. Thus, the challenge is how to develop a powerful network-based method for AFP to overcome these limitations. RESULTS: We propose DeepGraphGO, an end-to-end, multispecies graph neural network-based method for AFP, which makes the most of both protein sequence and high-order protein network information. Our multispecies strategy allows one single model to be trained for all species, indicating a larger number of training samples than existing methods. Extensive experiments with a large-scale dataset show that DeepGraphGO outperforms a number of competing state-of-the-art methods significantly, including DeepGOPlus and three representative network-based methods: GeneMANIA, deepNF and clusDCA. We further confirm the effectiveness of our multispecies strategy and the advantage of DeepGraphGO over so-called difficult proteins. Finally, we integrate DeepGraphGO into the state-of-the-art ensemble method, NetGO, as a component and achieve a further performance improvement. AVAILABILITY AND IMPLEMENTATION: https://github.com/yourh/DeepGraphGO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Neural Networks, Computer; Proteins; Amino Acid Sequence

11.

HPOFiller: identifying missing protein-phenotype associations by graph convolutional network.

Liu, Lizhi; Mamitsuka, Hiroshi; Zhu, Shanfeng.

Bioinformatics ; 37(19): 3328-3336, 2021 Oct 11.

Article in English | MEDLINE | ID: mdl-33822886

ABSTRACT

MOTIVATION: Exploring the relationship between human proteins and abnormal phenotypes is of great importance in the prevention, diagnosis and treatment of diseases. The human phenotype ontology (HPO) is a standardized vocabulary that describes the phenotype abnormalities encountered in human diseases. However, the current HPO annotations of proteins are not complete. Thus, it is important to identify missing protein-phenotype associations. RESULTS: We propose HPOFiller, a graph convolutional network (GCN)-based approach, for predicting missing HPO annotations. HPOFiller has two key GCN components for capturing embeddings from complex network structures: (i) S-GCN for both the protein-protein interaction network and the HPO semantic similarity network, to utilize network weights; (ii) Bi-GCN for the protein-phenotype bipartite graph, to conduct message passing between proteins and phenotypes. The core idea of HPOFiller is to run these two GCN modules repeatedly, one after the other, over the three networks to refine the embeddings. Empirical results of extremely stringent evaluation avoiding potential information leakage, including cross-validation and temporal validation, demonstrate that HPOFiller significantly outperforms all other state-of-the-art methods. In particular, the ablation study shows that batch normalization contributes the most to the performance. Further examination offers literature evidence for highly ranked predictions. Finally, using known disease-HPO term associations, HPOFiller could suggest promising, unknown disease-gene associations, presenting possible genetic causes of human disorders. AVAILABILITY AND IMPLEMENTATION: https://github.com/liulizhi1996/HPOFiller. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

12.

BERTMeSH: deep contextual representation learning for large-scale high-performance MeSH indexing with full text.

You, Ronghui; Liu, Yuxuan; Mamitsuka, Hiroshi; Zhu, Shanfeng.

Bioinformatics ; 37(5): 684-692, 2021 05 05.

Article in English | MEDLINE | ID: mdl-32976559

ABSTRACT

MOTIVATION: With the rapid increase of biomedical articles, large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only method for large-scale MeSH indexing with full text, suffers from three major drawbacks: FullMeSH (i) uses Learning To Rank, which is time-consuming, (ii) can capture only some pre-defined sections in full text and (iii) ignores the whole MEDLINE database. RESULTS: We propose a computationally lighter, full text and deep-learning-based MeSH indexing method, BERTMeSH, which is flexible for section organization in full text. BERTMeSH has two technologies: (i) the state-of-the-art pre-trained deep contextual representation, Bidirectional Encoder Representations from Transformers (BERT), which makes BERTMeSH capture deep semantics of full text; and (ii) a transfer learning strategy for using both full text in PubMed Central (PMC) and title and abstract (only and no full text) in MEDLINE, to take advantage of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on ∼1.5 million full texts in PMC. BERTMeSH outperformed various cutting-edge baselines. For example, for 20 K test articles of PMC, BERTMeSH achieved a Micro F-measure of 69.2%, which was 6.3% higher than FullMeSH with the difference being statistically significant. Also, prediction of 20 K test articles needed 5 min by BERTMeSH, while it took more than 10 h by FullMeSH, proving the computational efficiency of BERTMeSH. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Abstracting and Indexing; Medical Subject Headings; MEDLINE; PubMed; Semantics
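
The BERTMeSH abstract above reports its headline result as a Micro F-measure. For readers unfamiliar with the metric, the following is a minimal sketch of the standard micro-averaged F-measure for multi-label indexing, where true positives, false positives and false negatives are pooled over all articles before precision and recall are computed. The function name `micro_f_measure` and the toy label sets are illustrative assumptions, not the evaluation code used in the paper.

```python
def micro_f_measure(true_labels, predicted_labels):
    """Micro-averaged F-measure over a collection of multi-label articles.

    `true_labels` and `predicted_labels` are parallel sequences; each item is
    an iterable of label strings for one article. Counts are pooled across
    all articles before precision and recall are computed.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        truth, pred = set(truth), set(pred)
        tp += len(truth & pred)   # labels correctly predicted
        fp += len(pred - truth)   # labels predicted but not annotated
        fn += len(truth - pred)   # annotated labels that were missed
    if tp == 0:
        return 0.0                # avoids division by zero when nothing matches
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy example with made-up annotations for two articles.
    truth = [{"Humans", "Semantics"}, {"PubMed"}]
    pred = [{"Humans", "MEDLINE"}, {"PubMed"}]
    print(micro_f_measure(truth, pred))
```

Because counts are pooled, frequent headings dominate the score; a macro-averaged variant would instead average per-label F-measures.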

13.

Recent advances and prospects of computational methods for metabolite identification: a review with emphasis on machine learning approaches.

Nguyen, Dai Hai; Nguyen, Canh Hao; Mamitsuka, Hiroshi.

Brief Bioinform ; 20(6): 2028-2043, 2019 11 27.

Article in English | MEDLINE | ID: mdl-30099485

ABSTRACT

MOTIVATION: Metabolomics involves studies of a great number of metabolites, which are small molecules present in biological systems. They serve many important functions, such as energy transport, signaling, acting as building blocks of cells, and inhibition/catalysis. Understanding biochemical characteristics of the metabolites is an essential and significant part of metabolomics to enlarge the knowledge of biological systems. It is also the key to the development of many applications and areas such as biotechnology, biomedicine or pharmaceuticals. However, the identification of the metabolites remains a challenging task in metabolomics with a huge number of potentially interesting but unknown metabolites. The standard method for identifying metabolites is based on mass spectrometry (MS) preceded by a separation technique. Over many decades, many techniques with different approaches have been proposed for the MS-based metabolite identification task, which can be divided into the following four groups: mass spectra database, in silico fragmentation, fragmentation tree and machine learning. In this review paper, we thoroughly survey currently available tools for metabolite identification with a focus on in silico fragmentation and machine learning-based approaches. We also give an intensive discussion on advanced machine learning methods, which can lead to further improvement on this task.

Subject(s)

Computational Biology/methods; Machine Learning; Metabolomics; Computer Simulation; Magnetic Resonance Spectroscopy; Mass Spectrometry

14.

HPOLabeler: improving prediction of human protein-phenotype associations by learning to rank.

Liu, Lizhi; Huang, Xiaodi; Mamitsuka, Hiroshi; Zhu, Shanfeng.

Bioinformatics ; 36(14): 4180-4188, 2020 08 15.

Article in English | MEDLINE | ID: mdl-32379868

ABSTRACT

MOTIVATION: Annotating human proteins by abnormal phenotypes has become an important topic. Human Phenotype Ontology (HPO) is a standardized vocabulary of phenotypic abnormalities encountered in human diseases. As of November 2019, only <4000 proteins have been annotated with HPO. Thus, a computational approach for accurately predicting protein-HPO associations would be important, whereas no methods have outperformed a simple Naive approach in the second Critical Assessment of Functional Annotation, 2013-2014 (CAFA2). RESULTS: We present HPOLabeler, which is able to use a wide variety of evidence, such as protein-protein interaction (PPI) networks, Gene Ontology, InterPro, trigram frequency and HPO term frequency, in the framework of learning to rank (LTR). LTR has been proved to be powerful for solving large-scale, multi-label ranking problems in bioinformatics. Given an input protein, LTR outputs the ranked list of HPO terms from a series of input scores given to the candidate HPO terms by component learning models (logistic regression, nearest neighbor and a Naive method), which are trained from given multiple evidence. We empirically evaluate HPOLabeler extensively through mainly two experiments of cross validation and temporal validation, for which HPOLabeler significantly outperformed all component models and competing methods including the current state-of-the-art method. We further found that (i) PPI is most informative for prediction among diverse data sources and (ii) low prediction performance of temporal validation might be caused by incomplete annotation of new proteins. AVAILABILITY AND IMPLEMENTATION: http://issubmission.sjtu.edu.cn/hpolabeler/. CONTACT: zhusf@fudan.edu.cn. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Computational Biology; Protein Interaction Maps; Gene Ontology; Humans; Phenotype; Proteins/metabolism

15.

FullMeSH: improving large-scale MeSH indexing with full text.

Dai, Suyang; You, Ronghui; Lu, Zhiyong; Huang, Xiaodi; Mamitsuka, Hiroshi; Zhu, Shanfeng.

Bioinformatics ; 36(5): 1533-1541, 2020 03 01.

Article in English | MEDLINE | ID: mdl-31596475

ABSTRACT

MOTIVATION: With the rapidly growing biomedical literature, automatically indexing biomedical articles by Medical Subject Heading (MeSH), namely MeSH indexing, has become increasingly important for facilitating hypothesis generation and knowledge discovery. Over the past years, many large-scale MeSH indexing approaches have been proposed, such as Medical Text Indexer, MeSHLabeler, DeepMeSH and MeSHProbeNet. However, the performance of these methods is hampered by using limited information, i.e. only the title and abstract of biomedical articles. RESULTS: We propose FullMeSH, a large-scale MeSH indexing method taking advantage of the recent increase in the availability of full text articles. Compared to DeepMeSH and other state-of-the-art methods, FullMeSH has three novelties: (i) Instead of using a full text as a whole, FullMeSH segments it into several sections with their normalized titles in order to distinguish their contributions to the overall performance. (ii) FullMeSH integrates the evidence from different sections in a 'learning to rank' framework by combining the sparse and deep semantic representations. (iii) FullMeSH trains an Attention-based Convolutional Neural Network for each section, which achieves better performance on infrequent MeSH headings. FullMeSH has been developed and empirically trained on the entire set of 1.4 million full-text articles in the PubMed Central Open Access subset. It achieved a Micro F-measure of 66.76% on a test set of 10 000 articles, which was 3.3% and 6.4% higher than DeepMeSH and MeSHLabeler, respectively. Furthermore, FullMeSH demonstrated an average improvement of 4.7% over DeepMeSH for indexing Check Tags, a set of most frequently indexed MeSH headings. AVAILABILITY AND IMPLEMENTATION: The software is available upon request. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Abstracting and Indexing; Medical Subject Headings; MEDLINE; PubMed; Semantics; Software

16.

NetGO: improving large-scale protein function prediction with massive network information.

You, Ronghui; Yao, Shuwei; Xiong, Yi; Huang, Xiaodi; Sun, Fengzhu; Mamitsuka, Hiroshi; Zhu, Shanfeng.

Nucleic Acids Res ; 47(W1): W379-W387, 2019 07 02.

Article in English | MEDLINE | ID: mdl-31106361

ABSTRACT

Automated function prediction (AFP) of proteins is of great significance in biology. AFP can be regarded as a large-scale multi-label classification problem where a protein can be associated with multiple gene ontology terms as its labels. Based on our GOLabeler, a state-of-the-art method for the third critical assessment of functional annotation (CAFA3), in this paper we propose NetGO, a web server that is able to further improve the performance of large-scale AFP by incorporating massive protein-protein network information. Specifically, the advantages of NetGO are threefold in using network information: (i) NetGO relies on a powerful learning to rank framework from machine learning to effectively integrate both sequence and network information of proteins; (ii) NetGO uses the massive network information of all species (>2000) in STRING (rather than only some specific species) and (iii) NetGO can still use network information to annotate a protein by homology transfer, even if it is not contained in STRING. Separating training and testing data with the same time-delayed settings of CAFA, we comprehensively examined the performance of NetGO. Experimental results have clearly demonstrated that NetGO significantly outperforms GOLabeler and other competing methods. The NetGO web server is freely available at http://issubmission.sjtu.edu.cn/netgo/.

Subject(s)

Computational Biology/methods; Machine Learning; Molecular Sequence Annotation; Proteins/chemistry; Software; Amino Acid Sequence; Animals; Benchmarking; Databases, Protein; Gene Ontology; Humans; Internet; Models, Molecular; Plants/genetics; Prokaryotic Cells/metabolism; Protein Interaction Mapping; Proteins/physiology; Sequence Alignment; Sequence Analysis, Protein; Sequence Homology, Amino Acid; Structure-Activity Relationship

17.

ADAPTIVE: leArning DAta-dePendenT, concIse molecular VEctors for fast, accurate metabolite identification from tandem mass spectra.

Nguyen, Dai Hai; Nguyen, Canh Hao; Mamitsuka, Hiroshi.

Bioinformatics ; 35(14): i164-i172, 2019 07 15.

Article in English | MEDLINE | ID: mdl-31510641

ABSTRACT

MOTIVATION: Metabolite identification is an important task in metabolomics to enhance the knowledge of biological systems. A number of machine learning-based methods have been proposed for this task, which predict the chemical structure for a given spectrum via an intermediate (chemical structure) representation called molecular fingerprints. They usually have two steps: (i) predicting fingerprints from spectra; (ii) searching a database for chemical compounds corresponding to the predicted fingerprints. Fingerprints are feature vectors, which are usually very large in order to cover all possible substructures and chemical properties. They are therefore heavily redundant, in the sense of encoding many molecular (sub)structures irrelevant to the task, which limits predictive performance and slows prediction. RESULTS: We propose ADAPTIVE, which has two parts: learning two mappings (i) from structures to molecular vectors and (ii) from spectra to molecular vectors. The first part learns molecular vectors for metabolites from given data, to be consistent with both the spectra and the chemical structures of metabolites. In more detail, molecular vectors are generated by a model parameterized by a message passing neural network, and its parameters are estimated by maximizing the correlation between molecular vectors and the corresponding spectra, as measured by the Hilbert-Schmidt Independence Criterion. Molecular vectors generated by this model are compact and, importantly, adaptive (specific) to both the given data and the task of metabolite identification. The second part uses input output kernel regression (IOKR), the current cutting-edge method of metabolite identification. We empirically confirmed the effectiveness of ADAPTIVE on benchmark data, where ADAPTIVE outperformed the original IOKR in both predictive performance and computational efficiency. AVAILABILITY AND IMPLEMENTATION: The code will be accessible at http://www.bic.kyoto-u.ac.jp/pathway/tools/ADAPTIVE after the acceptance of this article.
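
The Hilbert-Schmidt Independence Criterion named in this abstract can be illustrated with a minimal sketch. The block below computes the standard biased empirical estimate, tr(KHLH)/(n-1)^2, from two kernel matrices. The Gaussian kernel, the `gamma` value, and the function names `rbf_kernel` and `hsic` are assumptions made for illustration; the abstract does not say which kernels or estimator ADAPTIVE uses, and this is not the authors' code.

```python
import numpy as np

def rbf_kernel(x, gamma=1.0):
    """Gaussian kernel matrix for the rows of x (kernel choice is an assumption)."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    # Clip tiny negative distances caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(d2, 0.0))

def hsic(k, l):
    """Biased empirical HSIC from two n-by-n kernel matrices."""
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(k @ h @ l @ h) / (n - 1) ** 2
```

A larger value indicates stronger statistical dependence between the two samples, which is why a dependence measure of this kind can serve as a training objective that ties learned vectors to spectra.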

Subject(s)

Metabolomics , Tandem Mass Spectrometry , Benchmarking , Databases, Factual , Machine Learning

18.

Modelling G×E with historical weather information improves genomic prediction in new environments.

Gillberg, Jussi; Marttinen, Pekka; Mamitsuka, Hiroshi; Kaski, Samuel.

Bioinformatics ; 35(20): 4045-4052, 2019 10 15.

Article in English | MEDLINE | ID: mdl-30977782

ABSTRACT

MOTIVATION: Interaction between the genotype and the environment (G×E) has a strong impact on the yield of major crop plants. Although influential, taking G×E explicitly into account in plant breeding has remained difficult. Recently G×E has been predicted from environmental and genomic covariates, but existing work has not shown that generalization to new environments and years is possible without access to in-season data, so practical applicability remains unclear. Using data from a barley breeding programme in Finland, we construct an in silico experiment to study the viability of G×E prediction under practical constraints. RESULTS: We show that the response to the environment of a new generation of untested barley cultivars can be predicted in new locations and years using genomic data, machine learning and historical weather observations for the new locations. Our results highlight the need for models of G×E: non-linear effects clearly dominate linear ones, and the interaction between the soil type and daily rain is identified as the main driver for G×E for barley in Finland. Our study implies that genomic selection can be used to capture the yield potential in G×E effects for future growth seasons, providing a possible means to achieve the yield improvements needed for feeding the growing population. AVAILABILITY AND IMPLEMENTATION: The data, accompanied by the method code (http://research.cs.aalto.fi/pml/software/gxe/bioinformatics_codes.zip), are available in the form of kernels so that the results can be reproduced. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Genomics , Models, Genetic , Gene-Environment Interaction , Genotype , Phenotype , Weather

19.

Scaled Coupled Norms and Coupled Higher-Order Tensor Completion.

Wimalawarne, Kishan; Yamada, Makoto; Mamitsuka, Hiroshi.

Neural Comput ; 32(2): 447-484, 2020 02.

Article in English | MEDLINE | ID: mdl-31835002

ABSTRACT

Recently, a set of tensor norms known as coupled norms has been proposed as a convex solution to coupled tensor completion. Coupled norms have been designed by combining low-rank inducing tensor norms with the matrix trace norm. Though coupled norms have shown good performance, they have two major limitations: they do not have a method to control the regularization of coupled modes and uncoupled modes, and they are not optimal for couplings among higher-order tensors. In this letter, we propose a method that scales the regularization of coupled components against uncoupled components to properly induce low-rankness on the coupled mode. We also propose coupled norms for higher-order tensors by combining the square norm with coupled norms. Using excess risk-bound analysis, we demonstrate that our proposed methods lead to lower risk bounds compared to existing coupled norms. We demonstrate the robustness of our methods through simulation and real-data experiments.
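
As background for the building blocks this abstract mentions, the sketch below computes the matrix trace norm (the sum of singular values) and, as one well-known example of a low-rank inducing tensor norm, the overlapped trace norm, which sums the trace norms of all mode unfoldings. The choice of the overlapped trace norm and the function names `trace_norm`, `unfold`, and `overlapped_trace_norm` are assumptions for illustration. The block does not implement the coupled or scaled norms proposed in the paper, whose definitions are not given in the abstract.

```python
import numpy as np

def trace_norm(matrix):
    """Matrix trace (nuclear) norm: the sum of the singular values."""
    return np.linalg.svd(matrix, compute_uv=False).sum()

def unfold(tensor, mode):
    """Mode unfolding: rows index the chosen mode, columns index all others."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def overlapped_trace_norm(tensor):
    """Sum of the trace norms of every mode unfolding (illustrative choice)."""
    return sum(trace_norm(unfold(tensor, k)) for k in range(tensor.ndim))
```

Penalizing a trace norm favors solutions with few nonzero singular values, which is the sense in which such norms induce low rank.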

20.

Computational recognition for long non-coding RNA (lncRNA): Software and databases.

Yotsukura, Sohiya; duVerle, David; Hancock, Timothy; Natsume-Kitatani, Yayoi; Mamitsuka, Hiroshi.

Brief Bioinform ; 18(1): 9-27, 2017 01.

Article in English | MEDLINE | ID: mdl-26839320

ABSTRACT

Since the completion of the Human Genome Project, it has been widely established that most DNA does not code for proteins. These non-protein-coding regions are believed to be moderators within transcriptional and post-transcriptional processes, which play key roles in the onset of diseases. Long non-coding RNAs (lncRNAs) generally lack the conserved motifs typically used for detection and are thus hard to identify, but they nonetheless present certain characteristic features that can be exploited by bioinformatics methods. By combining lncRNA detection with known miRNA, RNA-binding protein and chromatin interactions, current tools are able to recognize and functionally annotate a large number of lncRNAs. This review discusses databases and platforms dedicated to cataloging and annotating lncRNAs, as well as tools geared toward discovering novel sequences. We emphasize the issues posed by the diversity of lncRNAs and their complex interaction mechanisms, as well as technical issues such as the lack of a unified nomenclature. We hope that this wide overview of existing platforms and databases might help guide biologists toward the tools they need to analyze their experimental data, while our discussion of limitations and of current lncRNA-related methods may assist in the development of new computational tools.

Subject(s)

RNA, Long Noncoding/genetics , Computational Biology , Databases, Genetic , Databases, Nucleic Acid , Humans , Software
