Results 1 - 20 of 111
1.
ACS Biomater Sci Eng ; 10(4): 2165-2176, 2024 04 08.
Article in English | MEDLINE | ID: mdl-38546298

ABSTRACT

Manipulating the three-dimensional (3D) structures of cells is important for facilitating tissue repair and regeneration. A self-assembly system of cells with cellulose nanofibers (CNFs) and concentrated polymer brushes (CPBs) has been developed to fabricate various cell 3D structures. To further generate tissues at an implantable level, it is necessary to carry out a large number of experiments using different cell culture conditions and material properties; however, this is practically intractable. To address this issue, we present a graph-neural network-based simulator (GNS) that can be trained by using assembly process images to predict the assembly status of future time steps. A total of 24 time-series image sets (25 steps each) were recorded (four repeats for each of six different conditions), and each image was transformed into a graph by regarding the cells as nodes and the connections between neighboring cells as edges. Using the obtained data, the performance of the GNS was examined under three scenarios (i.e., varying the pairing of training and testing data) to verify the possibility of using the GNS as a predictor for further time steps. It was confirmed that the GNS could reasonably reproduce the assembly process, even under the toughest scenario, in which the experimental conditions differed between the training and testing data. Practically, this means that the GNS trained by the first 24 h images could predict the cell types obtained 3 weeks later. This result could reduce the number of experiments required to find the optimal conditions for generating cells with desired 3D structures. Ultimately, our approach could accelerate progress in regenerative medicine.


Subject(s)
Nanofibers , Polymers , Nanofibers/chemistry , Cellulose/chemistry
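
The image-to-graph step described in the abstract of record 1 (cells as nodes, neighboring cells as edges) can be illustrated with a minimal Python sketch. The abstract does not say how neighbors are defined, so the distance threshold `radius`, the function name `cells_to_graph`, and the toy centroids are all assumptions made for illustration, not the authors' procedure.

```python
import math
from itertools import combinations


def cells_to_graph(centroids, radius):
    """Build (nodes, edges) from cell centroids.

    Assumption: two cells are "neighbors" when their centroids lie
    within `radius` of each other. The source does not specify the rule.
    """
    nodes = list(range(len(centroids)))
    edges = []
    for i, j in combinations(nodes, 2):
        # math.dist gives the Euclidean distance between two points.
        if math.dist(centroids[i], centroids[j]) <= radius:
            edges.append((i, j))
    return nodes, edges


# Hypothetical centroids (x, y) extracted from one image.
centroids = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
nodes, edges = cells_to_graph(centroids, radius=1.5)
print(nodes, edges)
```

One such graph per time step would give a sequence that a graph-based simulator could be trained on; any node features (for example cell size) would be a further assumption.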
2.
Bioinformatics ; 39(9)2023 09 02.
Article in English | MEDLINE | ID: mdl-37669154

ABSTRACT

MOTIVATION: Computationally predicting major histocompatibility complex class I (MHC-I) peptide binding affinity is an important problem in immunological bioinformatics, which is also crucial for the identification of neoantigens for personalized therapeutic cancer vaccines. Recent cutting-edge deep learning-based methods for this problem cannot achieve satisfactory performance, especially for non-9-mer peptides. This is because such methods generate the input by simply concatenating the two given sequences: a peptide and (the pseudo sequence of) an MHC class I molecule, which cannot precisely capture the anchor positions of the MHC binding motif for the peptides with variable lengths. We thus developed an anchor position-aware and high-performance deep model, DeepMHCI, with a position-wise gated layer and a residual binding interaction convolution layer. This allows the model to control the information flow in peptides to be aware of anchor positions and model the interactions between peptides and the MHC pseudo (binding) sequence directly with multiple convolutional kernels. RESULTS: The performance of DeepMHCI has been thoroughly validated by extensive experiments on four benchmark datasets under various settings, such as 5-fold cross-validation, validation with the independent testing set, external HPV vaccine identification, and external CD8+ epitope identification. Experimental results with visualization of binding motifs demonstrate that DeepMHCI outperformed all competing methods, especially on non-9-mer peptides binding prediction. AVAILABILITY AND IMPLEMENTATION: DeepMHCI is publicly available at https://github.com/ZhuLab-Fudan/DeepMHCI.


Subject(s)
Algorithms , Benchmarking , Computational Biology , Epitopes , Peptides
3.
Article in English | MEDLINE | ID: mdl-37018091

ABSTRACT

Predicting drug-drug interactions (DDIs) is the problem of predicting side effects (unwanted outcomes) of a pair of drugs using drug information and known side effects of many pairs. This problem can be formulated as predicting labels (i.e., side effects) for each pair of nodes in a DDI graph, whose nodes are drugs and whose edges connect interacting drugs with known labels. State-of-the-art methods for this problem are graph neural networks (GNNs), which leverage neighborhood information in the graph to learn node representations. For DDI, however, there are many labels with complicated relationships due to the nature of side effects. Standard GNNs often fix labels as one-hot vectors that do not reflect label relationships and potentially do not obtain the highest performance in the difficult cases of infrequent labels. In this brief, we formulate DDI as a hypergraph where each hyperedge is a triple: two nodes for drugs and one node for a label. We then present CentSmoothie, a hypergraph neural network (HGNN) that learns representations of nodes and labels together with a novel "central-smoothing" formulation. We empirically demonstrate the performance advantages of CentSmoothie in simulations as well as real datasets.
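
The hypergraph formulation in record 3, where each hyperedge is a triple of two drug nodes and one label node, can be sketched as a plain data structure. This is only an illustrative sketch: the drug and side-effect names are made up, `build_hypergraph` is a hypothetical helper, and nothing here reproduces CentSmoothie itself.

```python
from collections import defaultdict

# Hypothetical toy data: each hyperedge is (drug, drug, side-effect label).
triples = [
    ("drugA", "drugB", "nausea"),
    ("drugA", "drugC", "headache"),
    ("drugB", "drugC", "nausea"),
]


def build_hypergraph(triples):
    """Map every node (drug or label) to the hyperedge indices containing it."""
    incidence = defaultdict(set)
    for idx, (drug_1, drug_2, label) in enumerate(triples):
        for node in (drug_1, drug_2, label):
            incidence[node].add(idx)
    return dict(incidence)


incidence = build_hypergraph(triples)
print(sorted(incidence["nausea"]))
```

Treating labels as nodes, rather than as fixed one-hot vectors, is what lets a model learn label representations jointly with drug representations, as the abstract describes.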

4.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36576008

ABSTRACT

MOTIVATION: Finding molecules with desired pharmaceutical properties is crucial in drug discovery. Generative models can be an efficient tool to find desired molecules through the distribution learned by the model to approximate given training data. Existing generative models (i) do not consider backbone structures (scaffolds), resulting in inefficiency or (ii) need prior patterns for scaffolds, causing bias. Scaffolds are reasonable to use, and it is imperative to design a generative model without any prior scaffold patterns. RESULTS: We propose a generative model-based molecule generator, Sc2Mol, without any prior scaffold patterns. Sc2Mol uses SMILES strings for molecules. It consists of two steps: scaffold generation and scaffold decoration, which are carried out by a variational autoencoder and a transformer, respectively. The two steps are powerful for implementing random molecule generation and scaffold optimization. Our empirical evaluation using drug-like molecule datasets confirmed the success of our model in distribution learning and molecule optimization. Also, our model could automatically learn the rules to transform coarse scaffolds into sophisticated drug candidates. These rules were consistent with those for current lead optimization. AVAILABILITY AND IMPLEMENTATION: The code is available at https://github.com/zhiruiliao/Sc2Mol. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Drug Discovery , Machine Learning
5.
Bioinformatics ; 38(Suppl 1): i220-i228, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758790

ABSTRACT

MOTIVATION: Computationally predicting major histocompatibility complex (MHC)-peptide binding affinity is an important problem in immunological bioinformatics. Recent cutting-edge deep learning-based methods for this problem are unable to achieve satisfactory performance for MHC class II molecules. This is because such methods generate the input by simply concatenating the two given sequences: (the estimated binding core of) a peptide and (the pseudo sequence of) an MHC class II molecule, ignoring biological knowledge behind the interactions of the two molecules. We thus propose a binding core-aware deep learning-based model, DeepMHCII, with a binding interaction convolution layer, which integrates all potential binding cores (in a given peptide) with the MHC pseudo (binding) sequence by modeling the interaction with multiple convolutional kernels. RESULTS: Extensive empirical experiments with four large-scale datasets demonstrate that DeepMHCII significantly outperformed four state-of-the-art methods under numerous settings, such as 5-fold cross-validation, leave one molecule out, validation with independent testing sets and binding core prediction. All these results and visualization of the predicted binding cores indicate the effectiveness of our model, DeepMHCII, and the importance of properly modeling biological facts in deep learning for high predictive performance and efficient knowledge discovery. AVAILABILITY AND IMPLEMENTATION: DeepMHCII is publicly available at https://github.com/yourh/DeepMHCII. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Histocompatibility Antigens Class II , Peptides , Algorithms , Histocompatibility Antigens Class II/metabolism , Peptides/chemistry , Protein Binding , Protein Transport
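
The phrase "all potential binding cores (in a given peptide)" in record 5 can be made concrete with a sliding-window enumeration. The sketch below assumes a core length of 9 residues, a common convention for MHC class II that the abstract does not state; `candidate_cores` and the example peptide are hypothetical, and this is not the DeepMHCII code.

```python
def candidate_cores(peptide, core_len=9):
    """Return every contiguous window of `core_len` residues.

    Assumption: the binding core is a contiguous 9-mer. A peptide
    shorter than `core_len` yields an empty list.
    """
    return [peptide[i:i + core_len]
            for i in range(len(peptide) - core_len + 1)]


# Hypothetical 12-residue peptide.
cores = candidate_cores("ACDEFGHIKLMN")
for core in cores:
    print(core)
```

A model that scores each candidate window against the MHC pseudo sequence, instead of committing to a single estimated core up front, is the general idea the abstract contrasts with simple concatenation.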
6.
Bioinformatics ; 38(Suppl 1): i333-i341, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758803

ABSTRACT

MOTIVATION: Predicting side effects of drug-drug interactions (DDIs) is an important task in pharmacology. The state-of-the-art methods for DDI prediction use hypergraph neural networks to learn latent representations of drugs and side effects to express high-order relationships among two interacting drugs and a side effect. The idea of these methods is that each side effect is caused by a unique combination of latent features of the corresponding interacting drugs. However, in reality, a side effect might have multiple, different mechanisms that cannot be represented by a single combination of latent features of drugs. Moreover, DDI data are sparse, suggesting that using a sparsity regularization would help to learn better latent representations to improve prediction performances. RESULTS: We propose SPARSE, which encodes the DDI hypergraph and drug features to latent spaces to learn multiple types of combinations of latent features of drugs and side effects, controlling the model sparsity by a sparse prior. Our extensive experiments using both synthetic and three real-world DDI datasets showed the clear predictive performance advantage of SPARSE over cutting-edge competing methods. Also, latent feature analysis over unknown top predictions by SPARSE demonstrated the interpretability advantage contributed by the model sparsity. AVAILABILITY AND IMPLEMENTATION: Code and data can be accessed at https://github.com/anhnda/SPARSE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Drug-Related Side Effects and Adverse Reactions , Drug Interactions , Humans
7.
IEEE/ACM Trans Comput Biol Bioinform ; 19(4): 2197-2207, 2022.
Article in English | MEDLINE | ID: mdl-33705322

ABSTRACT

Detecting predictive biomarkers from multi-omics data is important for precision medicine, both to improve diagnostics of complex diseases and to enable better treatments. This requires substantial experimental effort, made difficult by the heterogeneity of cell lines and the huge cost. An effective solution is to build a computational model over the diverse omics data, including genomic, molecular, and environmental information. However, choosing informative and reliable data sources from among the different types of data is a challenging problem. We propose DIVERSE, a framework of Bayesian importance-weighted tri- and bi-matrix factorization (DIVERSE3 or DIVERSE2) to predict drug responses from data of cell lines, drugs, and gene interactions. DIVERSE integrates the data sources systematically, in a step-wise manner, examining the importance of each added data set in turn. More specifically, we sequentially integrate five different data sets, which have not all been combined in earlier bioinformatic methods for predicting drug responses. Empirical experiments show that DIVERSE clearly outperformed five other methods including three state-of-the-art approaches, under cross-validation, particularly in out-of-matrix prediction, which is closer to the setting of real use cases and more challenging than simpler in-matrix prediction. Additionally, case studies for discovering new drugs further confirmed the performance advantage of DIVERSE.


Subject(s)
Computational Biology , Precision Medicine , Bayes Theorem , Computational Biology/methods , Precision Medicine/methods
8.
Bioinformatics ; 38(3): 799-808, 2022 01 12.
Article in English | MEDLINE | ID: mdl-34672333

ABSTRACT

MOTIVATION: Deciphering the relationship between human genes/proteins and abnormal phenotypes is of great importance in the prevention, diagnosis and treatment of diseases. The Human Phenotype Ontology (HPO) is a standardized vocabulary that describes the phenotype abnormalities encountered in human disorders. However, the current HPO annotations are still incomplete. Thus, it is necessary to computationally predict human protein-phenotype associations. Among current, cutting-edge computational methods for annotating proteins (such as functional annotation), three important features are (i) multiple network input, (ii) semi-supervised learning and (iii) a deep graph convolutional network (GCN), yet no existing method for predicting HPO annotations of human proteins has all of these features. RESULTS: We develop HPODNets, which has all three of the above features, for predicting human protein-phenotype associations. HPODNets adopts a deep GCN with eight layers, which allows it to capture high-order topological information from multiple interaction networks. Empirical results with both cross-validation and temporal validation demonstrate that HPODNets outperforms seven competing state-of-the-art methods for protein function prediction. HPODNets with the architecture of deep GCNs is confirmed to be effective for predicting HPO annotations of human proteins and, more generally, for the node label ranking problem with multiple biomolecular network inputs in bioinformatics. AVAILABILITY AND IMPLEMENTATION: https://github.com/liulizhi1996/HPODNets. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Computational Biology , Humans , Computational Biology/methods , Phenotype
9.
PLoS One ; 16(12): e0251952, 2021.
Article in English | MEDLINE | ID: mdl-34914721

ABSTRACT

Identifying crop loss at field parcel scale using satellite images is challenging: first, crop loss is caused by many factors during the growing season; second, reliable reference data about crop loss are lacking; third, there are many ways to define crop loss. This study investigates the feasibility of using satellite images to train machine learning (ML) models to classify agricultural field parcels into those with and without crop loss. The reference data for this study were provided by the Finnish Food Authority (FFA) and contain crop loss information for approximately 1.4 million field parcels in Finland covering about 3.5 million ha from 2000 to 2015. These reference data were combined with the Normalised Difference Vegetation Index (NDVI) derived from Landsat 7 images, in which more than 80% of the possible data are missing. Despite the hard problem with extremely noisy data, among the four ML models we tested, random forest (with mean imputation and missing value indicators) achieved an average AUC (area under the ROC curve) of 0.688±0.059 over all 16 years, with the range [0.602, 0.795], in identifying new crop-loss fields based on reference fields of the same year. To our knowledge, this is one of the first large-scale benchmark studies of using machine learning for crop loss classification at field parcel scale. The classification setting and trained models have numerous potential applications, for example, allowing government agencies or insurance companies to verify crop-loss claims by farmers and realise efficient agricultural monitoring.


Subject(s)
Crops, Agricultural/growth & development , Machine Learning , Satellite Imagery , Seasons , Finland
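
Record 9 mentions NDVI time series with many missing values, handled by mean imputation plus missing-value indicators. A minimal sketch of both ideas follows. The NDVI formula, (NIR - Red) / (NIR + Red), is the standard definition; the function names, the use of `None` for missing observations, and the per-series mean are assumptions for illustration rather than the study's actual pipeline.

```python
def ndvi(nir, red):
    """Standard NDVI: (NIR - Red) / (NIR + Red); None if undefined."""
    denom = nir + red
    return None if denom == 0 else (nir - red) / denom


def impute_with_indicators(series):
    """Replace missing values (None) by the mean of the observed ones
    and append one 0/1 indicator per position marking what was missing.

    Assumption: the mean is taken over this series alone; a real
    pipeline would more likely use training-set statistics.
    """
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed) if observed else 0.0
    values = [mean if v is None else v for v in series]
    flags = [1 if v is None else 0 for v in series]
    return values + flags


# Hypothetical NDVI series for one parcel; None marks a missing image.
features = impute_with_indicators([0.2, None, 0.6, None])
print(features)
```

The indicator columns let a classifier such as a random forest distinguish an imputed value from a genuinely observed one, which matters when most of the data are missing.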
10.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34368832

ABSTRACT

Drug combination therapy is a promising strategy to treat complex diseases such as cancer and infectious diseases. However, current knowledge of drug combination therapies, especially in cancer patients, is limited because of adverse drug effects, toxicity and cell line heterogeneity. Screening new drug combinations requires substantial efforts since considering all possible combinations between drugs is infeasible and expensive. Therefore, building computational approaches, particularly machine learning methods, could provide an effective strategy to overcome drug resistance and improve therapeutic efficacy. In this review, we group the state-of-the-art machine learning approaches to analyze personalized drug combination therapies into three categories and discuss each method in each category. We also present a short description of relevant databases used as a benchmark in drug combination therapies and provide a list of well-known, publicly available interactive data analysis portals. We highlight the importance of data integration on the identification of drug combinations. Finally, we address the advantages of combining multiple data sources on drug combination analysis by showing an experimental comparison.


Subject(s)
Machine Learning , Antineoplastic Combined Chemotherapy Protocols/administration & dosage , Computational Biology/methods , Humans , Neoplasms/drug therapy , Precision Medicine
11.
Bioinformatics ; 37(Suppl_1): i262-i271, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252926

ABSTRACT

MOTIVATION: Automated function prediction (AFP) of proteins is a large-scale multi-label classification problem. Two limitations of most network-based methods for AFP are (i) a single model must be trained for each species and (ii) protein sequence information is totally ignored. These limitations cause weaker performance than sequence-based methods. Thus, the challenge is how to develop a powerful network-based method for AFP to overcome these limitations. RESULTS: We propose DeepGraphGO, an end-to-end, multispecies graph neural network-based method for AFP, which makes the most of both protein sequence and high-order protein network information. Our multispecies strategy allows one single model to be trained for all species, indicating a larger number of training samples than existing methods. Extensive experiments with a large-scale dataset show that DeepGraphGO outperforms a number of competing state-of-the-art methods significantly, including DeepGOPlus and three representative network-based methods: GeneMANIA, deepNF and clusDCA. We further confirm the effectiveness of our multispecies strategy and the advantage of DeepGraphGO over so-called difficult proteins. Finally, we integrate DeepGraphGO into the state-of-the-art ensemble method, NetGO, as a component and achieve a further performance improvement. AVAILABILITY AND IMPLEMENTATION: https://github.com/yourh/DeepGraphGO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Proteins , Amino Acid Sequence
12.
Bioinformatics ; 37(19): 3328-3336, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33822886

ABSTRACT

MOTIVATION: Exploring the relationship between human proteins and abnormal phenotypes is of great importance in the prevention, diagnosis and treatment of diseases. The human phenotype ontology (HPO) is a standardized vocabulary that describes the phenotype abnormalities encountered in human diseases. However, the current HPO annotations of proteins are not complete. Thus, it is important to identify missing protein-phenotype associations. RESULTS: We propose HPOFiller, a graph convolutional network (GCN)-based approach, for predicting missing HPO annotations. HPOFiller has two key GCN components for capturing embeddings from complex network structures: (i) S-GCN for both the protein-protein interaction network and the HPO semantic similarity network, to utilize network weights; (ii) Bi-GCN for the protein-phenotype bipartite graph, to conduct message passing between proteins and phenotypes. The core idea of HPOFiller is to run these two GCN modules repeatedly and consecutively over the three networks to refine the embeddings. Empirical results of extremely stringent evaluation avoiding potential information leakage, including cross-validation and temporal validation, demonstrate that HPOFiller significantly outperforms all other state-of-the-art methods. In particular, the ablation study shows that batch normalization contributes the most to the performance. Further examination offers literature evidence for highly ranked predictions. Finally, using known disease-HPO term associations, HPOFiller could suggest promising, unknown disease-gene associations, presenting possible genetic causes of human disorders. AVAILABILITY AND IMPLEMENTATION: https://github.com/liulizhi1996/HPOFiller. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

13.
iScience ; 24(1): 102002, 2021 Jan 22.
Article in English | MEDLINE | ID: mdl-33490910

ABSTRACT

The biological carbon pump, in which carbon fixed by photosynthesis is exported to the deep ocean through sinking, is a major process in Earth's carbon cycle. The proportion of primary production that is exported is termed the carbon export efficiency (CEE). Based on in-lab or regional scale observations, viruses were previously suggested to affect the CEE (i.e., viral "shunt" and "shuttle"). In this study, we tested associations between viral community composition and CEE measured at a global scale. A regression model based on relative abundance of viral marker genes explained 67% of the variation in CEE. Viruses with high importance in the model were predicted to infect ecologically important hosts. These results are consistent with the view that the viral shunt and shuttle function at a large scale and further imply that viruses likely act in this process in a way dependent on their hosts and ecosystem dynamics.

14.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33515011

ABSTRACT

MOTIVATION: Gene set enrichment analysis (GSEA) has been widely used to identify gene sets with statistically significant differences between cases and controls against a large gene set. GSEA needs both phenotype labels and expression of genes. However, gene expression is assessed more often for model organisms than for minor species. Also, importantly, gene expression is not measured well under specific conditions for humans, due to the high risk of direct experiments such as non-approved treatment or gene knockout, and is then often substituted by mouse data. Thus, predicting enrichment significance (on a phenotype) of a given gene set of a species (target, say human), by using gene expression measured under the same phenotype of the other species (source, say mouse), is a vital and challenging problem, which we call the CROSS-species gene set enrichment problem (XGSEP). RESULTS: For XGSEP, we propose the CROSS-species gene set enrichment analysis (XGSEA), with three steps: (1) running GSEA for a source species to obtain enrichment scores and p-values of source gene sets; (2) representing the relation between source and target gene sets by domain adaptation; and (3) using regression to predict p-values of target gene sets, based on the representation in (2). We extensively validated the XGSEA by using five regression measures and one classification measure on four real data sets under various settings, showing that the XGSEA significantly outperformed three baseline methods in most cases. A case study of identifying important human pathways for T-cell dysfunction and reprogramming from mouse ATAC-Seq data further confirmed the reliability of the XGSEA. AVAILABILITY: Source code of the XGSEA is available through https://github.com/LiminLi-xjtu/XGSEA.


Subject(s)
Brain Neoplasms/genetics , Machine Learning , Melanoma/genetics , Ovarian Neoplasms/genetics , Skin Neoplasms/genetics , Animals , Brain Neoplasms/immunology , Brain Neoplasms/pathology , Computational Biology/methods , Datasets as Topic , Embryo, Mammalian , Female , Gene Expression Regulation, Neoplastic , Humans , Melanoma/immunology , Melanoma/pathology , Mice , Ovarian Neoplasms/immunology , Ovarian Neoplasms/pathology , Skin Neoplasms/immunology , Skin Neoplasms/pathology , T-Lymphocytes/immunology , T-Lymphocytes/pathology , Zebrafish
15.
Brief Bioinform ; 22(1): 346-359, 2021 01 18.
Article in English | MEDLINE | ID: mdl-31838491

ABSTRACT

Predicting the response of cancer cell lines to specific drugs is one of the central problems in personalized medicine, where the cell lines show diverse characteristics. Researchers have developed a variety of computational methods to discover associations between drugs and cell lines, and improved drug sensitivity analyses by integrating heterogeneous biological data. However, choosing informative data sources and methods that can incorporate multiple sources efficiently is the challenging part of successful analysis in personalized medicine. The reason is that finding decisive factors of cancer and developing methods that can overcome the problems of integrating data, such as differences in data structures and data complexities, are difficult. In this review, we summarize recent advances in data integration-based machine learning for drug response prediction, by categorizing methods as matrix factorization-based, kernel-based and network-based methods. We also present a short description of relevant databases used as a benchmark in drug response prediction analyses, followed by providing a brief discussion of challenges faced in integrating and interpreting data from multiple sources. Finally, we address the advantages of combining multiple heterogeneous data sources on drug sensitivity analysis by showing an experimental comparison. Contact:  betul.guvenc@aalto.fi.


Subject(s)
Drug Resistance, Neoplasm , Genomics/methods , Precision Medicine/methods , Humans , Machine Learning , Pharmacogenomic Variants
16.
Brief Bioinform ; 22(1): 164-177, 2021 01 18.
Article in English | MEDLINE | ID: mdl-31838499

ABSTRACT

MOTIVATION: Adverse drug reaction (ADR) or drug side effect studies play a crucial role in drug discovery. Recently, with the rapid increase of both clinical and non-clinical data, machine learning methods have emerged as prominent tools to support analyzing and predicting ADRs. Nonetheless, challenges remain in ADR studies. RESULTS: In this paper, we summarize ADR data sources and review ADR studies in three tasks: drug-ADR benchmark data creation, drug-ADR prediction and ADR mechanism analysis. We focus on machine learning methods used in each task and then compare the performance of the methods on the drug-ADR prediction task. Finally, we discuss open problems for further ADR studies. AVAILABILITY: Data and code are available at https://github.com/anhnda/ADRPModels.


Subject(s)
Computational Biology/methods , Drug-Related Side Effects and Adverse Reactions/etiology , Machine Learning , Humans
17.
IEEE Trans Pattern Anal Mach Intell ; 43(8): 2710-2722, 2021 Aug.
Article in English | MEDLINE | ID: mdl-32086195

ABSTRACT

A hypergraph is a general way of representing high-order relations on a set of objects. It is a generalization of a graph, in which only pairwise relations can be represented. It finds applications in various domains where relationships among more than two objects are observed. On a hypergraph, as on a graph, one wishes to learn a smooth function with respect to its topology. A fundamental issue is to find suitable smoothness measures of functions on the nodes of a graph/hypergraph. We show a general framework that generalizes previously proposed smoothness measures and also generates new ones. To address the problem of irrelevant or noisy data, we wish to incorporate a sparse learning framework into learning on hypergraphs. We propose sparsely smooth formulations that learn smooth functions and induce sparsity on hypergraphs at both hyperedge and node levels. We show their properties and sparse support recovery results. We conduct experiments to show that our sparsely smooth models are beneficial for learning with irrelevant and noisy data, and usually give similar or improved performance compared to dense models.

18.
Bioinformatics ; 37(5): 684-692, 2021 05 05.
Article in English | MEDLINE | ID: mdl-32976559

ABSTRACT

MOTIVATION: With the rapid increase of biomedical articles, large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only method for large-scale MeSH indexing with full text, suffers from three major drawbacks: FullMeSH (i) uses Learning To Rank, which is time-consuming, (ii) can capture only some pre-defined sections in full text and (iii) ignores the whole MEDLINE database. RESULTS: We propose a computationally lighter, full text and deep-learning-based MeSH indexing method, BERTMeSH, which is flexible for section organization in full text. BERTMeSH has two technologies: (i) the state-of-the-art pre-trained deep contextual representation, Bidirectional Encoder Representations from Transformers (BERT), which lets BERTMeSH capture deep semantics of full text; and (ii) a transfer learning strategy for using both full text in PubMed Central (PMC) and title and abstract (only, with no full text) in MEDLINE, to take advantage of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on ∼1.5 million full texts in PMC. BERTMeSH outperformed various cutting-edge baselines. For example, for 20 K test articles of PMC, BERTMeSH achieved a Micro F-measure of 69.2%, which was 6.3% higher than FullMeSH, with the difference being statistically significant. Also, prediction for 20 K test articles took 5 min with BERTMeSH, while it took more than 10 h with FullMeSH, demonstrating the computational efficiency of BERTMeSH. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Medical Subject Headings , MEDLINE , PubMed , Semantics
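
Record 18 reports a Micro F-measure for multi-label MeSH indexing. As a hedged illustration of how such a score is typically computed, the sketch below pools true positives, false positives and false negatives over all articles before computing precision and recall. The function name `micro_f1` and the toy label sets are assumptions; the abstract does not describe the exact evaluation code.

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F-measure over paired sets of labels.

    Counts are pooled across all articles before computing
    precision and recall, so frequent labels dominate the score.
    """
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted MeSH terms for two articles.
gold = [{"Humans", "Semantics"}, {"PubMed"}]
pred = [{"Humans"}, {"PubMed", "MEDLINE"}]
score = micro_f1(gold, pred)
print(round(score, 3))
```

In this toy case there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3 and the micro F-measure is 2/3 as well.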
19.
Bioinformatics ; 36(14): 4180-4188, 2020 08 15.
Article in English | MEDLINE | ID: mdl-32379868

ABSTRACT

MOTIVATION: Annotating human proteins by abnormal phenotypes has become an important topic. Human Phenotype Ontology (HPO) is a standardized vocabulary of phenotypic abnormalities encountered in human diseases. As of November 2019, only <4000 proteins have been annotated with HPO. Thus, a computational approach for accurately predicting protein-HPO associations would be important, whereas no methods have outperformed a simple Naive approach in the second Critical Assessment of Functional Annotation, 2013-2014 (CAFA2). RESULTS: We present HPOLabeler, which is able to use a wide variety of evidence, such as protein-protein interaction (PPI) networks, Gene Ontology, InterPro, trigram frequency and HPO term frequency, in the framework of learning to rank (LTR). LTR has been proved to be powerful for solving large-scale, multi-label ranking problems in bioinformatics. Given an input protein, LTR outputs the ranked list of HPO terms from a series of input scores given to the candidate HPO terms by component learning models (logistic regression, nearest neighbor and a Naive method), which are trained from given multiple evidence. We empirically evaluate HPOLabeler extensively through mainly two experiments of cross validation and temporal validation, for which HPOLabeler significantly outperformed all component models and competing methods including the current state-of-the-art method. We further found that (i) PPI is most informative for prediction among diverse data sources and (ii) low prediction performance of temporal validation might be caused by incomplete annotation of new proteins. AVAILABILITY AND IMPLEMENTATION: http://issubmission.sjtu.edu.cn/hpolabeler/. CONTACT: zhusf@fudan.edu.cn. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology , Protein Interaction Maps , Gene Ontology , Humans , Phenotype , Proteins/metabolism
20.
Neural Comput ; 32(2): 447-484, 2020 02.
Article in English | MEDLINE | ID: mdl-31835002

ABSTRACT

Recently, a set of tensor norms known as coupled norms has been proposed as a convex solution to coupled tensor completion. Coupled norms have been designed by combining low-rank inducing tensor norms with the matrix trace norm. Though coupled norms have shown good performance, they have two major limitations: they do not have a method to control the regularization of coupled modes and uncoupled modes, and they are not optimal for couplings among higher-order tensors. In this letter, we propose a method that scales the regularization of coupled components against uncoupled components to properly induce low-rankness on the coupled mode. We also propose coupled norms for higher-order tensors by combining the square norm with coupled norms. Using excess risk-bound analysis, we demonstrate that our proposed methods lead to lower risk bounds compared to existing coupled norms. We demonstrate the robustness of our methods through simulation and real-data experiments.
