Results 1 - 20 of 111
1.
ACS Biomater Sci Eng ; 10(4): 2165-2176, 2024 Apr 08.
Article in English | MEDLINE | ID: mdl-38546298

ABSTRACT

Manipulating the three-dimensional (3D) structures of cells is important for facilitating tissue repair or regeneration. A self-assembly system of cells with cellulose nanofibers (CNFs) and concentrated polymer brushes (CPBs) has been developed to fabricate various cell 3D structures. To further generate tissues at an implantable level, it is necessary to carry out a large number of experiments using different cell culture conditions and material properties; however, this is practically intractable. To address this issue, we present a graph-neural network-based simulator (GNS) that can be trained by using assembly process images to predict the assembly status of future time steps. A total of 24 (25 steps) time-series images were recorded (four repeats for each of six different conditions), and each image was transformed into a graph by regarding the cells as nodes and the connections between neighboring cells as edges. Using the obtained data, the performance of the GNS was examined under three scenarios (i.e., changing the pairing of the training and testing data) to verify the possibility of using the GNS as a predictor for further time steps. It was confirmed that the GNS could reasonably reproduce the assembly process, even under the toughest scenario, in which the experimental conditions differed between the training and testing data. Practically, this means that the GNS trained by the first 24 h images could predict the cell types obtained 3 weeks later. This result could reduce the number of experiments required to find the optimal conditions for generating cells with desired 3D structures. Ultimately, our approach could accelerate progress in regenerative medicine.


Subjects
Nanofibers , Polymers , Nanofibers/chemistry , Cellulose/chemistry
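The image-to-graph step mentioned in the abstract above (cells as nodes, neighboring cells joined by edges) can be illustrated with a minimal sketch. It assumes that cell centroids have already been extracted from an image and that two cells count as neighbors when their centers lie within a distance threshold; the function name, the `max_distance` parameter, and the sample coordinates are hypothetical, since the abstract does not state the actual neighbor criterion.

```python
import math
from itertools import combinations


def cells_to_graph(centroids, max_distance):
    """Build an undirected graph: nodes are cell indices, edges join nearby cells.

    `max_distance` is an assumed neighbor criterion, not taken from the paper.
    """
    nodes = list(range(len(centroids)))
    edges = [
        (i, j)
        for i, j in combinations(nodes, 2)
        if math.dist(centroids[i], centroids[j]) <= max_distance
    ]
    return nodes, edges


# Hypothetical centroids (x, y) in pixels for four cells.
centroids = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0), (10.0, 1.0)]
nodes, edges = cells_to_graph(centroids, max_distance=5.0)
```

With these sample points, cells 0 and 1 are exactly 5 pixels apart and cells 2 and 3 are 1 pixel apart, so the graph has two edges. One such graph per time step would form the kind of sequence a graph-based simulator could be trained on.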
2.
Bioinformatics ; 39(9)2023 09 02.
Article in English | MEDLINE | ID: mdl-37669154

ABSTRACT

MOTIVATION: Computationally predicting major histocompatibility complex class I (MHC-I) peptide binding affinity is an important problem in immunological bioinformatics, which is also crucial for the identification of neoantigens for personalized therapeutic cancer vaccines. Recent cutting-edge deep learning-based methods for this problem cannot achieve satisfactory performance, especially for non-9-mer peptides. This is because such methods generate the input by simply concatenating the two given sequences: a peptide and (the pseudo sequence of) an MHC class I molecule, which cannot precisely capture the anchor positions of the MHC binding motif for the peptides with variable lengths. We thus developed an anchor position-aware and high-performance deep model, DeepMHCI, with a position-wise gated layer and a residual binding interaction convolution layer. This allows the model to control the information flow in peptides to be aware of anchor positions and model the interactions between peptides and the MHC pseudo (binding) sequence directly with multiple convolutional kernels. RESULTS: The performance of DeepMHCI has been thoroughly validated by extensive experiments on four benchmark datasets under various settings, such as 5-fold cross-validation, validation with the independent testing set, external HPV vaccine identification, and external CD8+ epitope identification. Experimental results with visualization of binding motifs demonstrate that DeepMHCI outperformed all competing methods, especially on non-9-mer peptides binding prediction. AVAILABILITY AND IMPLEMENTATION: DeepMHCI is publicly available at https://github.com/ZhuLab-Fudan/DeepMHCI.


Subjects
Algorithms , Benchmarking , Computational Biology , Epitopes , Peptides
3.
Article in English | MEDLINE | ID: mdl-37018091

ABSTRACT

Predicting drug-drug interactions (DDIs) is the problem of predicting side effects (unwanted outcomes) of a pair of drugs using drug information and known side effects of many pairs. This problem can be formulated as predicting labels (i.e., side effects) for each pair of nodes in a DDI graph, whose nodes are drugs and whose edges connect interacting drugs with known labels. State-of-the-art methods for this problem are graph neural networks (GNNs), which leverage neighborhood information in the graph to learn node representations. For DDI, however, there are many labels with complicated relationships due to the nature of side effects. Usual GNNs often fix labels as one-hot vectors that do not reflect label relationships and potentially do not obtain the highest performance in the difficult cases of infrequent labels. In this brief, we formulate DDI as a hypergraph where each hyperedge is a triple: two nodes for drugs and one node for a label. We then present CentSmoothie, a hypergraph neural network (HGNN) that learns representations of nodes and labels together with a novel "central-smoothing" formulation. We empirically demonstrate the performance advantages of CentSmoothie in simulations as well as real datasets.
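The hypergraph formulation in the abstract above (each hyperedge is a triple of two drug nodes and one label node) can be sketched as a plain data structure. This is only an illustration of the representation, not CentSmoothie itself; the function name, the incidence index, and the drug and side-effect names are hypothetical.

```python
from collections import defaultdict


def build_ddi_hypergraph(triples):
    """Collect hyperedges (drug, drug, side_effect) and index them by node."""
    hyperedges = []
    seen = set()
    incidence = defaultdict(list)  # node -> indices of hyperedges containing it
    for drug_a, drug_b, side_effect in triples:
        # Store the drug pair in sorted order so (a, b) and (b, a) coincide.
        first, second = sorted((drug_a, drug_b))
        edge = (first, second, side_effect)
        if edge in seen:
            continue
        seen.add(edge)
        index = len(hyperedges)
        hyperedges.append(edge)
        for node in edge:
            incidence[node].append(index)
    return hyperedges, dict(incidence)


# Hypothetical known interactions; the second triple repeats the first, reversed.
triples = [
    ("drug_A", "drug_B", "nausea"),
    ("drug_B", "drug_A", "nausea"),
    ("drug_A", "drug_C", "headache"),
]
hyperedges, incidence = build_ddi_hypergraph(triples)
```

Treating the side effect as a node in its own right is what lets a model learn a representation for each label instead of fixing it as a one-hot vector. The sketch assumes drug names and label names never collide.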

4.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36576008

ABSTRACT

MOTIVATION: Finding molecules with desired pharmaceutical properties is crucial in drug discovery. Generative models can be an efficient tool to find desired molecules through the distribution learned by the model to approximate given training data. Existing generative models (i) do not consider backbone structures (scaffolds), resulting in inefficiency or (ii) need prior patterns for scaffolds, causing bias. Scaffolds are reasonable to use, and it is imperative to design a generative model without any prior scaffold patterns. RESULTS: We propose a generative model-based molecule generator, Sc2Mol, without any prior scaffold patterns. Sc2Mol uses SMILES strings for molecules. It consists of two steps: scaffold generation and scaffold decoration, which are carried out by a variational autoencoder and a transformer, respectively. The two steps are powerful for implementing random molecule generation and scaffold optimization. Our empirical evaluation using drug-like molecule datasets confirmed the success of our model in distribution learning and molecule optimization. Also, our model could automatically learn the rules to transform coarse scaffolds into sophisticated drug candidates. These rules were consistent with those for current lead optimization. AVAILABILITY AND IMPLEMENTATION: The code is available at https://github.com/zhiruiliao/Sc2Mol. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Drug Discovery , Machine Learning
5.
Bioinformatics ; 38(Suppl 1): i220-i228, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758790

ABSTRACT

MOTIVATION: Computationally predicting major histocompatibility complex (MHC)-peptide binding affinity is an important problem in immunological bioinformatics. Recent cutting-edge deep learning-based methods for this problem are unable to achieve satisfactory performance for MHC class II molecules. This is because such methods generate the input by simply concatenating the two given sequences: (the estimated binding core of) a peptide and (the pseudo sequence of) an MHC class II molecule, ignoring biological knowledge behind the interactions of the two molecules. We thus propose a binding core-aware deep learning-based model, DeepMHCII, with a binding interaction convolution layer, which allows all potential binding cores (in a given peptide) to be integrated with the MHC pseudo (binding) sequence by modeling the interaction with multiple convolutional kernels. RESULTS: Extensive empirical experiments with four large-scale datasets demonstrate that DeepMHCII significantly outperformed four state-of-the-art methods under numerous settings, such as 5-fold cross-validation, leave-one-molecule-out validation, validation with independent testing sets and binding core prediction. All these results and visualization of the predicted binding cores indicate the effectiveness of our model, DeepMHCII, and the importance of properly modeling biological facts in deep learning for high predictive performance and efficient knowledge discovery. AVAILABILITY AND IMPLEMENTATION: DeepMHCII is publicly available at https://github.com/yourh/DeepMHCII. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Histocompatibility Antigens Class II , Peptides , Algorithms , Histocompatibility Antigens Class II/metabolism , Peptides/chemistry , Protein Binding , Protein Transport
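The phrase "all potential binding cores (in a given peptide)" in the abstract above can be made concrete with a short sketch that enumerates every contiguous window of a peptide. It assumes a core length of 9 residues, the length commonly used for MHC class II binding cores; the abstract itself does not state the length, and the function name and example peptide are hypothetical. This is not the DeepMHCII code.

```python
def candidate_binding_cores(peptide, core_length=9):
    """Return (offset, window) for every contiguous window of `core_length` residues.

    `core_length=9` is an assumption, not a value given in the abstract.
    """
    if len(peptide) < core_length:
        return []
    return [
        (start, peptide[start:start + core_length])
        for start in range(len(peptide) - core_length + 1)
    ]


peptide = "AAYSDQATPLLLSPR"  # hypothetical 15-mer
cores = candidate_binding_cores(peptide)
```

A 15-residue peptide yields 7 candidate cores (offsets 0 through 6). A convolution slid along the peptide scores each such window in turn, so no single estimated core has to be chosen before the model sees the input.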
6.
Bioinformatics ; 38(Suppl 1): i333-i341, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758803

ABSTRACT

MOTIVATION: Predicting side effects of drug-drug interactions (DDIs) is an important task in pharmacology. The state-of-the-art methods for DDI prediction use hypergraph neural networks to learn latent representations of drugs and side effects to express high-order relationships among two interacting drugs and a side effect. The idea of these methods is that each side effect is caused by a unique combination of latent features of the corresponding interacting drugs. However, in reality, a side effect might have multiple, different mechanisms that cannot be represented by a single combination of latent features of drugs. Moreover, DDI data are sparse, suggesting that using a sparsity regularization would help to learn better latent representations to improve prediction performances. RESULTS: We propose SPARSE, which encodes the DDI hypergraph and drug features to latent spaces to learn multiple types of combinations of latent features of drugs and side effects, controlling the model sparsity by a sparse prior. Our extensive experiments using both synthetic and three real-world DDI datasets showed the clear predictive performance advantage of SPARSE over cutting-edge competing methods. Also, latent feature analysis over unknown top predictions by SPARSE demonstrated the interpretability advantage contributed by the model sparsity. AVAILABILITY AND IMPLEMENTATION: Code and data can be accessed at https://github.com/anhnda/SPARSE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Drug-Related Side Effects and Adverse Reactions , Neural Networks, Computer , Drug Interactions , Humans
7.
IEEE/ACM Trans Comput Biol Bioinform ; 19(4): 2197-2207, 2022.
Article in English | MEDLINE | ID: mdl-33705322

ABSTRACT

Detecting predictive biomarkers from multi-omics data is important for precision medicine, to improve diagnostics of complex diseases and for better treatments. This needs substantial experimental efforts that are made difficult by the heterogeneity of cell lines and huge cost. An effective solution is to build a computational model over the diverse omics data, including genomic, molecular, and environmental information. However, choosing informative and reliable data sources from among the different types of data is a challenging problem. We propose DIVERSE, a framework of Bayesian importance-weighted tri- and bi-matrix factorization (DIVERSE3 or DIVERSE2) to predict drug responses from data of cell lines, drugs, and gene interactions. DIVERSE integrates the data sources systematically, in a step-wise manner, examining the importance of each added data set in turn. More specifically, we sequentially integrate five different data sets, which have not all been combined in earlier bioinformatic methods for predicting drug responses. Empirical experiments show that DIVERSE clearly outperformed five other methods, including three state-of-the-art approaches, under cross-validation, particularly in out-of-matrix prediction, which is closer to the setting of real use cases and more challenging than simpler in-matrix prediction. Additionally, case studies for discovering new drugs further confirmed the performance advantage of DIVERSE.


Subjects
Computational Biology , Precision Medicine , Bayes Theorem , Computational Biology/methods , Precision Medicine/methods
8.
Bioinformatics ; 38(3): 799-808, 2022 01 12.
Article in English | MEDLINE | ID: mdl-34672333

ABSTRACT

MOTIVATION: Deciphering the relationship between human genes/proteins and abnormal phenotypes is of great importance in the prevention, diagnosis and treatment of diseases. The Human Phenotype Ontology (HPO) is a standardized vocabulary that describes the phenotype abnormalities encountered in human disorders. However, the current HPO annotations are still incomplete. Thus, it is necessary to computationally predict human protein-phenotype associations. In terms of current, cutting-edge computational methods for annotating proteins (such as functional annotation), three important features are (i) multiple network input, (ii) semi-supervised learning and (iii) deep graph convolutional network (GCN), but no existing method combines all of these features for predicting HPO annotations of human proteins. RESULTS: We develop HPODNets with all three of the above features for predicting human protein-phenotype associations. HPODNets adopts a deep GCN with eight layers, which allows it to capture high-order topological information from multiple interaction networks. Empirical results with both cross-validation and temporal validation demonstrate that HPODNets outperforms seven competing state-of-the-art methods for protein function prediction. HPODNets with the architecture of deep GCNs is confirmed to be effective for predicting HPO annotations of human proteins and, more generally, for node label ranking problems with multiple biomolecular network inputs in bioinformatics. AVAILABILITY AND IMPLEMENTATION: https://github.com/liulizhi1996/HPODNets. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Computational Biology , Humans , Computational Biology/methods , Phenotype
9.
PLoS One ; 16(12): e0251952, 2021.
Article in English | MEDLINE | ID: mdl-34914721

ABSTRACT

Identifying crop loss at field parcel scale using satellite images is challenging: first, crop loss is caused by many factors during the growing season; second, reliable reference data about crop loss are lacking; third, there are many ways to define crop loss. This study investigates the feasibility of using satellite images to train machine learning (ML) models to classify agricultural field parcels into those with and without crop loss. The reference data for this study were provided by the Finnish Food Authority (FFA) and contain crop loss information for approximately 1.4 million field parcels in Finland covering about 3.5 million ha from 2000 to 2015. These reference data were combined with the Normalised Difference Vegetation Index (NDVI) derived from Landsat 7 images, in which more than 80% of the possible data are missing. Despite the hard problem with extremely noisy data, among the four ML models we tested, random forest (with mean imputation and missing value indicators) achieved an average AUC (area under the ROC curve) of 0.688±0.059 over all 16 years with the range [0.602, 0.795] in identifying new crop-loss fields based on reference fields of the same year. To our knowledge, this is one of the first large-scale benchmark studies of using machine learning for crop loss classification at field parcel scale. The classification setting and trained models have numerous potential applications, for example, allowing government agencies or insurance companies to verify crop-loss claims by farmers and realise efficient agricultural monitoring.


Subjects
Crops, Agricultural/growth & development , Machine Learning , Satellite Imagery , Seasons , Finland
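The preprocessing named in the abstract above, "mean imputation and missing value indicators", can be sketched in a few lines. This is a generic illustration of the technique under stated assumptions, not the study's pipeline: the function name and the small NDVI table are hypothetical, missing values are represented as `None`, and a column with no observations at all is filled with 0.0.

```python
def impute_with_indicators(rows):
    """Replace None with the column mean and append one 0/1 missing flag per column."""
    n_cols = len(rows[0])
    means = []
    for col in range(n_cols):
        observed = [row[col] for row in rows if row[col] is not None]
        # Assumed fallback: a column with no observations is filled with 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    imputed = []
    for row in rows:
        values = [means[c] if row[c] is None else row[c] for c in range(n_cols)]
        flags = [1 if row[c] is None else 0 for c in range(n_cols)]
        imputed.append(values + flags)
    return imputed


# Hypothetical NDVI time series: one row per field parcel, one column per date.
ndvi = [
    [0.25, None, 0.75],
    [None, None, 0.25],
    [0.75, 0.5, None],
]
features = impute_with_indicators(ndvi)
```

Each output row holds the imputed values followed by the indicator flags, so a downstream classifier such as a random forest can still tell an imputed value from an observed one. That matters when, as here, most of the possible observations are missing.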
10.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34368832

ABSTRACT

Drug combination therapy is a promising strategy to treat complex diseases such as cancer and infectious diseases. However, current knowledge of drug combination therapies, especially in cancer patients, is limited because of adverse drug effects, toxicity and cell line heterogeneity. Screening new drug combinations requires substantial efforts since considering all possible combinations between drugs is infeasible and expensive. Therefore, building computational approaches, particularly machine learning methods, could provide an effective strategy to overcome drug resistance and improve therapeutic efficacy. In this review, we group the state-of-the-art machine learning approaches to analyze personalized drug combination therapies into three categories and discuss each method in each category. We also present a short description of relevant databases used as a benchmark in drug combination therapies and provide a list of well-known, publicly available interactive data analysis portals. We highlight the importance of data integration on the identification of drug combinations. Finally, we address the advantages of combining multiple data sources on drug combination analysis by showing an experimental comparison.


Subjects
Machine Learning , Antineoplastic Combined Chemotherapy Protocols/administration & dosage , Computational Biology/methods , Humans , Neoplasms/drug therapy , Precision Medicine
11.
Bioinformatics ; 37(Suppl_1): i262-i271, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252926

ABSTRACT

MOTIVATION: Automated function prediction (AFP) of proteins is a large-scale multi-label classification problem. Two limitations of most network-based methods for AFP are (i) a single model must be trained for each species and (ii) protein sequence information is totally ignored. These limitations cause weaker performance than sequence-based methods. Thus, the challenge is how to develop a powerful network-based method for AFP to overcome these limitations. RESULTS: We propose DeepGraphGO, an end-to-end, multispecies graph neural network-based method for AFP, which makes the most of both protein sequence and high-order protein network information. Our multispecies strategy allows one single model to be trained for all species, indicating a larger number of training samples than existing methods. Extensive experiments with a large-scale dataset show that DeepGraphGO outperforms a number of competing state-of-the-art methods significantly, including DeepGOPlus and three representative network-based methods: GeneMANIA, deepNF and clusDCA. We further confirm the effectiveness of our multispecies strategy and the advantage of DeepGraphGO over so-called difficult proteins. Finally, we integrate DeepGraphGO into the state-of-the-art ensemble method, NetGO, as a component and achieve a further performance improvement. AVAILABILITY AND IMPLEMENTATION: https://github.com/yourh/DeepGraphGO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Neural Networks, Computer , Proteins , Amino Acid Sequence
12.
Bioinformatics ; 37(19): 3328-3336, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33822886

ABSTRACT

MOTIVATION: Exploring the relationship between human proteins and abnormal phenotypes is of great importance in the prevention, diagnosis and treatment of diseases. The human phenotype ontology (HPO) is a standardized vocabulary that describes the phenotype abnormalities encountered in human diseases. However, the current HPO annotations of proteins are not complete. Thus, it is important to identify missing protein-phenotype associations. RESULTS: We propose HPOFiller, a graph convolutional network (GCN)-based approach, for predicting missing HPO annotations. HPOFiller has two key GCN components for capturing embeddings from complex network structures: (i) S-GCN for both the protein-protein interaction network and the HPO semantic similarity network, to utilize network weights; (ii) Bi-GCN for the protein-phenotype bipartite graph, to conduct message passing between proteins and phenotypes. The core idea of HPOFiller is to repeatedly run these two GCN modules consecutively over the three networks, to refine the embeddings. Empirical results of extremely stringent evaluation avoiding potential information leakage, including cross-validation and temporal validation, demonstrate that HPOFiller significantly outperforms all other state-of-the-art methods. In particular, the ablation study shows that batch normalization contributes the most to the performance. Further examination offers literature evidence for highly ranked predictions. Finally, using known disease-HPO term associations, HPOFiller could suggest promising, unknown disease-gene associations, presenting possible genetic causes of human disorders. AVAILABILITY AND IMPLEMENTATION: https://github.com/liulizhi1996/HPOFiller. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

13.
Brief Bioinform ; 22(5)2021 09 02.
Article in English | MEDLINE | ID: mdl-33515011

ABSTRACT

MOTIVATION: Gene set enrichment analysis (GSEA) has been widely used to identify gene sets with statistically significant differences between cases and controls against a large gene set. GSEA needs both phenotype labels and expression of genes. However, gene expression is assessed more often for model organisms than for minor species. Also, importantly, gene expression is not measured well under specific conditions in humans, due to the high risk of direct experiments, such as non-approved treatment or gene knockout, and is then often substituted by mouse data. Thus, predicting enrichment significance (on a phenotype) of a given gene set of a species (target, say human), by using gene expression measured under the same phenotype of the other species (source, say mouse), is a vital and challenging problem, which we call the CROSS-species gene set enrichment problem (XGSEP). RESULTS: For XGSEP, we propose the CROSS-species gene set enrichment analysis (XGSEA), with three steps: (1) running GSEA for a source species to obtain enrichment scores and p-values of source gene sets; (2) representing the relation between source and target gene sets by domain adaptation; and (3) using regression to predict p-values of target gene sets, based on the representation in (2). We extensively validated the XGSEA by using five regression and one classification measurements on four real data sets under various settings, proving that the XGSEA significantly outperformed three baseline methods in most cases. A case study of identifying important human pathways for T-cell dysfunction and reprogramming from mouse ATAC-Seq data further confirmed the reliability of the XGSEA. AVAILABILITY: Source code of the XGSEA is available through https://github.com/LiminLi-xjtu/XGSEA.


Subjects
Brain Neoplasms/genetics , Machine Learning , Melanoma/genetics , Ovarian Neoplasms/genetics , Skin Neoplasms/genetics , Animals , Brain Neoplasms/immunology , Brain Neoplasms/pathology , Computational Biology/methods , Datasets as Topic , Embryo, Mammalian , Female , Gene Expression Regulation, Neoplastic , Humans , Melanoma/immunology , Melanoma/pathology , Mice , Ovarian Neoplasms/immunology , Ovarian Neoplasms/pathology , Skin Neoplasms/immunology , Skin Neoplasms/pathology , T-Lymphocytes/immunology , T-Lymphocytes/pathology , Zebrafish
14.
iScience ; 24(1): 102002, 2021 Jan 22.
Article in English | MEDLINE | ID: mdl-33490910

ABSTRACT

The biological carbon pump, in which carbon fixed by photosynthesis is exported to the deep ocean through sinking, is a major process in Earth's carbon cycle. The proportion of primary production that is exported is termed the carbon export efficiency (CEE). Based on in-lab or regional scale observations, viruses were previously suggested to affect the CEE (i.e., viral "shunt" and "shuttle"). In this study, we tested associations between viral community composition and CEE measured at a global scale. A regression model based on relative abundance of viral marker genes explained 67% of the variation in CEE. Viruses with high importance in the model were predicted to infect ecologically important hosts. These results are consistent with the view that the viral shunt and shuttle function at a large scale and further imply that viruses likely act in this process in a way dependent on their hosts and ecosystem dynamics.

15.
IEEE Trans Pattern Anal Mach Intell ; 43(8): 2710-2722, 2021 Aug.
Article in English | MEDLINE | ID: mdl-32086195

ABSTRACT

A hypergraph is a general way of representing high-order relations on a set of objects. It is a generalization of a graph, in which only pairwise relations can be represented. It finds applications in various domains where relationships of more than two objects are observed. On a hypergraph, as a generalization of a graph, one wishes to learn a smooth function with respect to its topology. A fundamental issue is to find suitable smoothness measures of functions on the nodes of a graph/hypergraph. We show a general framework that generalizes previously proposed smoothness measures and also generates new ones. To address the problem of irrelevant or noisy data, we wish to incorporate a sparse learning framework into learning on hypergraphs. We propose sparsely smooth formulations that learn smooth functions and induce sparsity on hypergraphs at both hyperedge and node levels. We show their properties and sparse support recovery results. We conduct experiments to show that our sparsely smooth models are beneficial for learning with irrelevant and noisy data, and usually give similar or improved performance compared to dense models.

16.
Brief Bioinform ; 22(1): 346-359, 2021 01 18.
Article in English | MEDLINE | ID: mdl-31838491

ABSTRACT

Predicting the response of cancer cell lines to specific drugs is one of the central problems in personalized medicine, where the cell lines show diverse characteristics. Researchers have developed a variety of computational methods to discover associations between drugs and cell lines, and improved drug sensitivity analyses by integrating heterogeneous biological data. However, choosing informative data sources and methods that can incorporate multiple sources efficiently is the challenging part of successful analysis in personalized medicine. The reason is that finding decisive factors of cancer and developing methods that can overcome the problems of integrating data, such as differences in data structures and data complexities, are difficult. In this review, we summarize recent advances in data integration-based machine learning for drug response prediction, by categorizing methods as matrix factorization-based, kernel-based and network-based methods. We also present a short description of relevant databases used as a benchmark in drug response prediction analyses, followed by providing a brief discussion of challenges faced in integrating and interpreting data from multiple sources. Finally, we address the advantages of combining multiple heterogeneous data sources on drug sensitivity analysis by showing an experimental comparison. Contact:  betul.guvenc@aalto.fi.


Subjects
Drug Resistance, Neoplasm , Genomics/methods , Precision Medicine/methods , Humans , Machine Learning , Pharmacogenomic Variants
17.
Brief Bioinform ; 22(1): 164-177, 2021 01 18.
Article in English | MEDLINE | ID: mdl-31838499

ABSTRACT

MOTIVATION: Adverse drug reaction (ADR) or drug side effect studies play a crucial role in drug discovery. Recently, with the rapid increase of both clinical and non-clinical data, machine learning methods have emerged as prominent tools to support analyzing and predicting ADRs. Nonetheless, there are still remaining challenges in ADR studies. RESULTS: In this paper, we summarize ADR data sources and review ADR studies in three tasks: drug-ADR benchmark data creation, drug-ADR prediction and ADR mechanism analysis. We focus on machine learning methods used in each task and then compare the performance of the methods on the drug-ADR prediction task. Finally, we discuss open problems for further ADR studies. AVAILABILITY: Data and code are available at https://github.com/anhnda/ADRPModels.


Subjects
Computational Biology/methods , Drug-Related Side Effects and Adverse Reactions/etiology , Machine Learning , Humans
18.
Bioinformatics ; 37(5): 684-692, 2021 05 05.
Article in English | MEDLINE | ID: mdl-32976559

ABSTRACT

MOTIVATION: With the rapid increase of biomedical articles, large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only method for large-scale MeSH indexing with full text, suffers from three major drawbacks: FullMeSH (i) uses Learning To Rank, which is time-consuming, (ii) can capture some pre-defined sections only in full text and (iii) ignores the whole MEDLINE database. RESULTS: We propose a computationally lighter, full text and deep-learning-based MeSH indexing method, BERTMeSH, which is flexible for section organization in full text. BERTMeSH has two technologies: (i) the state-of-the-art pre-trained deep contextual representation, Bidirectional Encoder Representations from Transformers (BERT), which makes BERTMeSH capture deep semantics of full text; and (ii) a transfer learning strategy for using both full text in PubMed Central (PMC) and title and abstract (only and no full text) in MEDLINE, to take advantage of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on ∼1.5 million full texts in PMC. BERTMeSH outperformed various cutting-edge baselines. For example, for 20 K test articles of PMC, BERTMeSH achieved a Micro F-measure of 69.2%, which was 6.3% higher than FullMeSH, with the difference being statistically significant. Also, prediction of 20 K test articles needed 5 min by BERTMeSH, while it took more than 10 h by FullMeSH, proving the computational efficiency of BERTMeSH. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Abstracting and Indexing , Medical Subject Headings , MEDLINE , PubMed , Semantics
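The Micro F-measure reported in the abstract above is a standard multi-label metric: true positives, false positives and false negatives are pooled over all articles before precision and recall are computed. The sketch below follows that standard definition; it is not taken from the BERTMeSH evaluation code, and the function name and the toy label sets are hypothetical.

```python
def micro_f_measure(true_sets, predicted_sets):
    """Micro-averaged F-measure over per-article sets of labels."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_sets, predicted_sets):
        tp += len(truth & predicted)   # labels correctly predicted
        fp += len(predicted - truth)   # labels predicted but not assigned
        fn += len(truth - predicted)   # labels assigned but not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted MeSH headings for two articles.
truth = [{"Humans", "Semantics"}, {"MEDLINE"}]
predicted = [{"Humans"}, {"MEDLINE", "PubMed"}]
score = micro_f_measure(truth, predicted)
```

In this toy case the pooled counts are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3 and the Micro F-measure is 2/3. Because counts are pooled, frequently assigned headings dominate the score more than rare ones.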
19.
Bioinformatics ; 36(14): 4180-4188, 2020 08 15.
Article in English | MEDLINE | ID: mdl-32379868

ABSTRACT

MOTIVATION: Annotating human proteins by abnormal phenotypes has become an important topic. Human Phenotype Ontology (HPO) is a standardized vocabulary of phenotypic abnormalities encountered in human diseases. As of November 2019, only <4000 proteins have been annotated with HPO. Thus, a computational approach for accurately predicting protein-HPO associations would be important, whereas no methods have outperformed a simple Naive approach in the second Critical Assessment of Functional Annotation, 2013-2014 (CAFA2). RESULTS: We present HPOLabeler, which is able to use a wide variety of evidence, such as protein-protein interaction (PPI) networks, Gene Ontology, InterPro, trigram frequency and HPO term frequency, in the framework of learning to rank (LTR). LTR has been proved to be powerful for solving large-scale, multi-label ranking problems in bioinformatics. Given an input protein, LTR outputs the ranked list of HPO terms from a series of input scores given to the candidate HPO terms by component learning models (logistic regression, nearest neighbor and a Naive method), which are trained from given multiple evidence. We empirically evaluate HPOLabeler extensively through mainly two experiments of cross validation and temporal validation, for which HPOLabeler significantly outperformed all component models and competing methods including the current state-of-the-art method. We further found that (i) PPI is most informative for prediction among diverse data sources and (ii) low prediction performance of temporal validation might be caused by incomplete annotation of new proteins. AVAILABILITY AND IMPLEMENTATION: http://issubmission.sjtu.edu.cn/hpolabeler/. CONTACT: zhusf@fudan.edu.cn. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology , Protein Interaction Maps , Gene Ontology , Humans , Phenotype , Proteins/metabolism
20.
Bioinformatics ; 36(5): 1533-1541, 2020 03 01.
Article in English | MEDLINE | ID: mdl-31596475

ABSTRACT

MOTIVATION: With the rapidly growing biomedical literature, automatically indexing biomedical articles by Medical Subject Heading (MeSH), namely MeSH indexing, has become increasingly important for facilitating hypothesis generation and knowledge discovery. Over the past years, many large-scale MeSH indexing approaches have been proposed, such as Medical Text Indexer, MeSHLabeler, DeepMeSH and MeSHProbeNet. However, the performance of these methods is hampered by using limited information, i.e. only the title and abstract of biomedical articles. RESULTS: We propose FullMeSH, a large-scale MeSH indexing method taking advantage of the recent increase in the availability of full text articles. Compared to DeepMeSH and other state-of-the-art methods, FullMeSH has three novelties: (i) Instead of using a full text as a whole, FullMeSH segments it into several sections with their normalized titles in order to distinguish their contributions to the overall performance. (ii) FullMeSH integrates the evidence from different sections in a 'learning to rank' framework by combining the sparse and deep semantic representations. (iii) FullMeSH trains an Attention-based Convolutional Neural Network for each section, which achieves better performance on infrequent MeSH headings. FullMeSH has been developed and empirically trained on the entire set of 1.4 million full-text articles in the PubMed Central Open Access subset. It achieved a Micro F-measure of 66.76% on a test set of 10 000 articles, which was 3.3% and 6.4% higher than DeepMeSH and MeSHLabeler, respectively. Furthermore, FullMeSH demonstrated an average improvement of 4.7% over DeepMeSH for indexing Check Tags, a set of most frequently indexed MeSH headings. AVAILABILITY AND IMPLEMENTATION: The software is available upon request. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Abstracting and Indexing , Medical Subject Headings , MEDLINE , PubMed , Semantics , Software