Results 1 - 20 of 58
1.
Nat Commun ; 15(1): 2775, 2024 Mar 30.
Article in English | MEDLINE | ID: mdl-38555371

ABSTRACT

Homologous protein search is one of the most commonly used methods for protein annotation and analysis. Compared to structure search, detecting distant evolutionary relationships from sequences alone remains challenging. Here we propose PLMSearch (Protein Language Model), a homologous protein search method with only sequences as input. PLMSearch uses deep representations from a pre-trained protein language model and trains the similarity prediction model with a large number of real structure similarity. This enables PLMSearch to capture the remote homology information concealed behind the sequences. Extensive experimental results show that PLMSearch can search millions of query-target protein pairs in seconds like MMseqs2 while increasing the sensitivity by more than threefold, and is comparable to state-of-the-art structure search methods. In particular, unlike traditional sequence search methods, PLMSearch can recall most remote homology pairs with dissimilar sequences but similar structures. PLMSearch is freely available at https://dmiip.sjtu.edu.cn/PLMSearch .


Subjects
Biological Evolution; Proteins; Proteins/chemistry; Molecular Sequence Annotation; Algorithms; Sequence Analysis, Protein
2.
Nat Commun ; 15(1): 585, 2024 Jan 17.
Article in English | MEDLINE | ID: mdl-38233391

ABSTRACT

Contig binning plays a crucial role in metagenomic data analysis by grouping contigs from the same or closely related genomes. However, existing binning methods face challenges in practical applications due to the diversity of data types and the difficulties in efficiently integrating heterogeneous information. Here, we introduce COMEBin, a binning method based on contrastive multi-view representation learning. COMEBin utilizes data augmentation to generate multiple fragments (views) of each contig and obtains high-quality embeddings of heterogeneous features (sequence coverage and k-mer distribution) through contrastive learning. Experimental results on multiple simulated and real datasets demonstrate that COMEBin outperforms state-of-the-art binning methods, particularly in recovering near-complete genomes from real environmental samples. COMEBin outperforms other binning methods remarkably when integrated into metagenomic analysis pipelines, including the recovery of potentially pathogenic antibiotic-resistant bacteria (PARB) and moderate or higher quality bins containing potential biosynthetic gene clusters (BGCs).
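The k-mer distribution named in the abstract as one of the heterogeneous features can be illustrated with a short sketch. This is not COMEBin's code: the function name `kmer_frequencies`, the default `k = 4`, and the choice to skip windows with ambiguous bases are all assumptions made here for illustration, and reverse-complement merging (common in binning tools) is deliberately left out.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence: str, k: int = 4) -> dict[str, float]:
    """Return normalized k-mer frequencies for a DNA sequence.

    Hypothetical helper for illustration only. Windows that contain
    characters other than A, C, G, T are skipped.
    """
    sequence = sequence.upper()
    alphabet = "ACGT"
    counts = Counter()
    # Slide a window of length k over the contig and count valid k-mers.
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if all(base in alphabet for base in kmer):
            counts[kmer] += 1
    total = sum(counts.values())
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    if total == 0:
        # No valid window: return an all-zero vector of the right size.
        return {kmer: 0.0 for kmer in all_kmers}
    # Every possible k-mer gets an entry, so vectors are comparable across contigs.
    return {kmer: counts[kmer] / total for kmer in all_kmers}
```

Each contig (or each augmented fragment of a contig) would yield one such fixed-length vector, which could then be fed to an embedding model alongside coverage.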


Subjects
Metagenome; Metagenomics; Metagenome/genetics; Metagenomics/methods; Algorithms; Sequence Analysis, DNA/methods
3.
Bioinformatics ; 39(9)2023 09 02.
Article in English | MEDLINE | ID: mdl-37669154

ABSTRACT

MOTIVATION: Computationally predicting major histocompatibility complex class I (MHC-I) peptide binding affinity is an important problem in immunological bioinformatics, which is also crucial for the identification of neoantigens for personalized therapeutic cancer vaccines. Recent cutting-edge deep learning-based methods for this problem cannot achieve satisfactory performance, especially for non-9-mer peptides. This is because such methods generate the input by simply concatenating the two given sequences: a peptide and (the pseudo sequence of) an MHC class I molecule, which cannot precisely capture the anchor positions of the MHC binding motif for the peptides with variable lengths. We thus developed an anchor position-aware and high-performance deep model, DeepMHCI, with a position-wise gated layer and a residual binding interaction convolution layer. This allows the model to control the information flow in peptides to be aware of anchor positions and model the interactions between peptides and the MHC pseudo (binding) sequence directly with multiple convolutional kernels. RESULTS: The performance of DeepMHCI has been thoroughly validated by extensive experiments on four benchmark datasets under various settings, such as 5-fold cross-validation, validation with the independent testing set, external HPV vaccine identification, and external CD8+ epitope identification. Experimental results with visualization of binding motifs demonstrate that DeepMHCI outperformed all competing methods, especially on non-9-mer peptides binding prediction. AVAILABILITY AND IMPLEMENTATION: DeepMHCI is publicly available at https://github.com/ZhuLab-Fudan/DeepMHCI.


Subjects
Algorithms; Benchmarking; Computational Biology; Epitopes; Peptides
4.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37248747

ABSTRACT

Human Phenotype Ontology (HPO)-based approaches have gained popularity in recent times as a tool for genomic diagnostics of rare diseases. However, these approaches do not make full use of the available information on disease and patient phenotypes. We present a new method called Phen2Disease, which utilizes the bidirectional maximum matching semantic similarity between two phenotype sets of patients and diseases to prioritize diseases and genes. Our comprehensive experiments have been conducted on six real data cohorts with 2051 cases (Cohort 1, n = 384; Cohort 2, n = 281; Cohort 3, n = 185; Cohort 4, n = 784; Cohort 5, n = 208; and Cohort 6, n = 209) and two simulated data cohorts with 1000 cases. The results of the experiments showed that Phen2Disease outperforms the three state-of-the-art methods when only phenotype information and HPO knowledge base are used, particularly in cohorts with fewer average numbers of HPO terms. We also observed that patients with higher information content scores have more specific information, leading to more accurate predictions. Moreover, Phen2Disease provides high interpretability with ranked diseases and patient HPO terms presented. Our method provides a novel approach to utilizing phenotype data for genomic diagnostics of rare diseases, with potential for clinical impact. Phen2Disease is freely available on GitHub at https://github.com/ZhuLab-Fudan/Phen2Disease.
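The idea of a bidirectional best-match similarity between a patient's phenotype set and a disease's phenotype set can be sketched as follows. This is a generic illustration, not the Phen2Disease scoring function: the names `best_match_average`, `bidirectional_similarity`, and `toy_similarity` are hypothetical, the two directions are combined by a plain mean (an assumption), and a real system would use an ontology-based term similarity instead of the exact-match stand-in.

```python
from typing import Callable, Iterable


def best_match_average(
    source: Iterable[str],
    target: Iterable[str],
    term_similarity: Callable[[str, str], float],
) -> float:
    """Average, over source terms, of the best similarity to any target term."""
    source, target = list(source), list(target)
    if not source or not target:
        return 0.0
    return sum(
        max(term_similarity(s, t) for t in target) for s in source
    ) / len(source)


def bidirectional_similarity(patient_terms, disease_terms, term_similarity) -> float:
    """Mean of the patient-to-disease and disease-to-patient scores.

    The plain mean is an assumed way of combining the two directions.
    """
    forward = best_match_average(patient_terms, disease_terms, term_similarity)
    backward = best_match_average(disease_terms, patient_terms, term_similarity)
    return (forward + backward) / 2.0


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in: 1.0 for identical terms, 0.0 otherwise."""
    return 1.0 if a == b else 0.0
```

Diseases would then be ranked for a patient by sorting them on `bidirectional_similarity`, highest first.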


Subjects
Biological Ontologies; Rare Diseases; Humans; Semantics; Genomics; Phenotype
5.
ACS Appl Mater Interfaces ; 15(15): 19470-19479, 2023 Apr 19.
Article in English | MEDLINE | ID: mdl-37023404

ABSTRACT

Efficient dispersion of nanoparticles (NPs) is a crucial challenge in the preparation and application of composites that contain NPs, particularly in coatings, inks, and related materials. Physical adsorption and chemical modification are the two common methods used to disperse NPs. However, the former suffers from desorption, and the latter is more specific and has limited versatility. To address these issues, we developed a novel photo-cross-linked polymeric dispersant, comb-shaped benzophenone-containing poly(ether amine) (bPEA), using a one-pot nucleophilic/cyclic-opening addition reaction. The results demonstrated that the bPEA dispersant forms a dense and stable shell on the surface of pigment NPs through physical adsorption and subsequent chemical photo-cross-linking, which effectively overcomes both the desorption that occurs with physical adsorption and the specificity of chemical modification. Owing to the dispersing effect of bPEA, the obtained pigment dispersions show high solvent, thermal, and pH stability without flocculation during storage. Moreover, the NP dispersants show good compatibility with screen printing, coating, and 3D printing, endowing the ornamental products with high uniformity, color fastness, and less color shading. These properties make bPEA dispersants ideal candidates for fabricating dispersions of other NPs.

6.
Genomics Proteomics Bioinformatics ; 21(2): 349-358, 2023 04.
Article in English | MEDLINE | ID: mdl-37075830

ABSTRACT

As one of the state-of-the-art automated function prediction (AFP) methods, NetGO 2.0 integrates multi-source information to improve the performance. However, it mainly utilizes the proteins with experimentally supported functional annotations without leveraging valuable information from a vast number of unannotated proteins. Recently, protein language models have been proposed to learn informative representations [e.g., Evolutionary Scale Modeling (ESM)-1b embedding] from protein sequences based on self-supervision. Here, we represented each protein by ESM-1b and used logistic regression (LR) to train a new model, LR-ESM, for AFP. The experimental results showed that LR-ESM achieved comparable performance with the best-performing component of NetGO 2.0. Therefore, by incorporating LR-ESM into NetGO 2.0, we developed NetGO 3.0 to improve the performance of AFP extensively. NetGO 3.0 is freely accessible at https://dmiip.sjtu.edu.cn/ng3.0.


Subjects
alpha-Fetoproteins; Amino Acid Sequence
7.
Genome Biol ; 24(1): 1, 2023 01 06.
Article in English | MEDLINE | ID: mdl-36609515

ABSTRACT

Binning aims to recover microbial genomes from metagenomic data. For complex metagenomic communities, the available binning methods are far from satisfactory, as they usually do not fully use different types of features and important biological knowledge. We developed a novel ensemble binner, MetaBinner, which generates component results with multiple types of features by k-means and uses single-copy gene information for initialization. It then employs a two-stage ensemble strategy based on single-copy genes to integrate the component results efficiently and effectively. Extensive experimental results on three large-scale simulated datasets and one real-world dataset demonstrate that MetaBinner outperforms the state-of-the-art binners significantly.


Subjects
Algorithms; Microbiota; Microbiota/genetics; Metagenome; Genome, Microbial; Metagenomics/methods; Sequence Analysis, DNA
8.
Bioinformatics ; 39(1)2023 01 01.
Article in English | MEDLINE | ID: mdl-36576008

ABSTRACT

MOTIVATION: Finding molecules with desired pharmaceutical properties is crucial in drug discovery. Generative models can be an efficient tool to find desired molecules through the distribution learned by the model to approximate given training data. Existing generative models (i) do not consider backbone structures (scaffolds), resulting in inefficiency or (ii) need prior patterns for scaffolds, causing bias. Scaffolds are reasonable to use, and it is imperative to design a generative model without any prior scaffold patterns. RESULTS: We propose a generative model-based molecule generator, Sc2Mol, without any prior scaffold patterns. Sc2Mol uses SMILES strings for molecules. It consists of two steps: scaffold generation and scaffold decoration, which are carried out by a variational autoencoder and a transformer, respectively. The two steps are powerful for implementing random molecule generation and scaffold optimization. Our empirical evaluation using drug-like molecule datasets confirmed the success of our model in distribution learning and molecule optimization. Also, our model could automatically learn the rules to transform coarse scaffolds into sophisticated drug candidates. These rules were consistent with those for current lead optimization. AVAILABILITY AND IMPLEMENTATION: The code is available at https://github.com/zhiruiliao/Sc2Mol. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Drug Discovery; Machine Learning
9.
Bioinformatics ; 38(Suppl 1): i220-i228, 2022 06 24.
Article in English | MEDLINE | ID: mdl-35758790

ABSTRACT

MOTIVATION: Computationally predicting major histocompatibility complex (MHC)-peptide binding affinity is an important problem in immunological bioinformatics. Recent cutting-edge deep learning-based methods for this problem are unable to achieve satisfactory performance for MHC class II molecules. This is because such methods generate the input by simply concatenating the two given sequences: (the estimated binding core of) a peptide and (the pseudo sequence of) an MHC class II molecule, ignoring biological knowledge behind the interactions of the two molecules. We thus propose a binding core-aware deep learning-based model, DeepMHCII, with a binding interaction convolution layer, which allows all potential binding cores (in a given peptide) to be integrated with the MHC pseudo (binding) sequence by modeling the interaction with multiple convolutional kernels. RESULTS: Extensive empirical experiments with four large-scale datasets demonstrate that DeepMHCII significantly outperformed four state-of-the-art methods under numerous settings, such as 5-fold cross-validation, leave one molecule out, validation with independent testing sets and binding core prediction. All these results and visualization of the predicted binding cores indicate the effectiveness of our model, DeepMHCII, and the importance of properly modeling biological facts in deep learning for high predictive performance and efficient knowledge discovery. AVAILABILITY AND IMPLEMENTATION: DeepMHCII is publicly available at https://github.com/yourh/DeepMHCII. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Histocompatibility Antigens Class II; Peptides; Algorithms; Histocompatibility Antigens Class II/metabolism; Peptides/chemistry; Protein Binding; Protein Transport
10.
Virol J ; 19(1): 114, 2022 06 28.
Article in English | MEDLINE | ID: mdl-35765099

ABSTRACT

BACKGROUND: Chronic infection with hepatitis B virus (HBV) has been shown to be highly associated with the development of hepatocellular carcinoma (HCC). AIMS: The purpose of the study is to investigate the association between HBV preS region quasispecies and HCC development, as well as to develop an HCC diagnosis model using HBV preS region quasispecies. METHODS: A total of 104 chronic hepatitis B (CHB) patients and 117 HBV-related HCC patients were enrolled. The HBV preS region was sequenced using next-generation sequencing (NGS) and the nucleotide entropy was calculated for quasispecies evaluation. Sparse logistic regression (SLR) was used to predict HCC development and prediction performances were evaluated using receiver operating characteristic curves. RESULTS: Entropy of HBV preS1, preS2 regions and several nucleotide points showed significant divergence between CHB and HCC patients. Using SLR, the classification of HCC/CHB groups achieved a mean area under the receiver operating characteristic curve (AUC) of 0.883 in the training data and 0.795 in the test data. The prediction model was also validated by a completely independent dataset from Hong Kong. The 10 selected nucleotide positions showed significantly different entropy between CHB and HCC patients. The HBV quasispecies also classified three clinical parameters, including HBeAg, HBV DNA, and alkaline phosphatase (ALP), with AUC values greater than 0.6 in the test data. CONCLUSIONS: Using NGS and SLR, the association between HBV preS region nucleotide entropy and HCC development was validated in our study and this could promote the understanding of the mechanism of HCC progression.
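Per-position nucleotide entropy of the kind mentioned in the METHODS can be computed as Shannon entropy over the bases observed at each aligned position. The sketch below assumes reads that are already aligned to equal length and uses base-2 logarithms; the abstract does not state the log base or the preprocessing, and the function names are hypothetical.

```python
import math
from collections import Counter


def position_entropy(column: str) -> float:
    """Shannon entropy (in bits) of the nucleotides observed at one position.

    Characters other than A, C, G, T (gaps, N) are ignored.
    """
    counts = Counter(base for base in column.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropy(reads: list[str]) -> list[float]:
    """Per-position entropy for equal-length, pre-aligned reads."""
    if not reads:
        return []
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("reads must be aligned to the same length")
    # Build one column per position and score each column independently.
    return [
        position_entropy("".join(read[i] for read in reads))
        for i in range(length)
    ]
```

A fully conserved position scores 0 bits, and a position where all four bases are equally frequent scores 2 bits. The resulting vector per patient could serve as the feature input to a sparse classifier.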


Subjects
Carcinoma, Hepatocellular; Liver Neoplasms; Hepatitis B Surface Antigens/genetics; Hepatitis B virus/genetics; Humans; Logistic Models; Nucleotides; Quasispecies
11.
J Transl Med ; 20(1): 193, 2022 05 04.
Article in English | MEDLINE | ID: mdl-35509104

ABSTRACT

PURPOSE: We develop a new risk score to predict stroke-associated pneumonia (SAP) in patients with acute intracranial hemorrhage (ICH). METHOD: We applied logistic regression to develop a new risk score called ICH-LR2S2. It was derived from examining a dataset of 70,540 ICH patients between 2015 and 2018 from the Chinese Stroke Center Alliance (CSCA). During the training of ICH-LR2S2, patients were randomly divided into two groups: 80% for the training set and 20% for model validation. A prospective test set was developed using 12,523 patients recruited in 2019. To further verify its effectiveness, we tested ICH-LR2S2 on an external dataset of 24,860 patients from the China National Stroke Registration Management System II (CNSR II). The performance of ICH-LR2S2 was measured by the area under the receiver operating characteristic curve (AUROC). RESULTS: The incidence of SAP in the dataset was 25.52%. A 24-point ICH-LR2S2 was developed from independent predictors, including age, modified Rankin Scale, fasting blood glucose, National Institutes of Health Stroke Scale admission score, Glasgow Coma Scale score, C-reactive protein, dysphagia, Chronic Obstructive Pulmonary Disease, and current smoking. The results showed that ICH-LR2S2 achieved an AUC = 0.749 [95% CI 0.739-0.759], which outperforms the best baseline ICH-APS (AUC = 0.704) [95% CI 0.694-0.714]. Compared with the previous ICH risk scores, ICH-LR2S2 incorporates fasting blood glucose and C-reactive protein, improving its discriminative ability. Machine learning methods such as XGBoost (AUC = 0.772) [95% CI 0.762-0.782] can further improve our prediction performance. It also performed well when further validated by the external independent cohort of patients (n = 24,860), ICH-LR2S2 AUC = 0.784 [95% CI 0.774-0.794]. CONCLUSION: ICH-LR2S2 accurately distinguishes SAP patients based on easily available clinical features. It can help identify high-risk patients in the early stages of disease.
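The AUROC used to evaluate the score has a simple rank interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes it directly from that definition; the function name `auroc` is hypothetical, the quadratic loop is chosen for readability over speed, and none of this is the study's evaluation code.

```python
def auroc(labels: list[int], scores: list[float]) -> float:
    """AUROC as the probability that a random positive outranks a random negative.

    Labels are 1 (event, e.g. pneumonia) or 0 (no event). Ties between a
    positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    # Compare every positive against every negative (O(n_pos * n_neg)).
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

With an integer point score such as a 24-point scale, ties are frequent, which is why the half-credit rule matters. For cohorts of tens of thousands of patients a sort-based implementation would be preferable.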


Subjects
Pneumonia; Stroke; Blood Glucose; C-Reactive Protein; Cerebral Hemorrhage/complications; Humans; Intracranial Hemorrhages/complications; Pneumonia/complications; Prognosis; Prospective Studies; Risk Factors; Stroke/complications
12.
Bioinformatics ; 38(3): 799-808, 2022 01 12.
Article in English | MEDLINE | ID: mdl-34672333

ABSTRACT

MOTIVATION: Deciphering the relationship between human genes/proteins and abnormal phenotypes is of great importance in the prevention, diagnosis and treatment of diseases. The Human Phenotype Ontology (HPO) is a standardized vocabulary that describes the phenotype abnormalities encountered in human disorders. However, the current HPO annotations are still incomplete. Thus, it is necessary to computationally predict human protein-phenotype associations. In terms of current, cutting-edge computational methods for annotating proteins (such as functional annotation), three important features are (i) multiple network input, (ii) semi-supervised learning and (iii) deep graph convolutional network (GCN), whereas there are no methods with all these features for predicting HPO annotations of human proteins. RESULTS: We develop HPODNets with all three of the above features for predicting human protein-phenotype associations. HPODNets adopts a deep GCN with eight layers, which allows it to capture high-order topological information from multiple interaction networks. Empirical results with both cross-validation and temporal validation demonstrate that HPODNets outperforms seven competing state-of-the-art methods for protein function prediction. HPODNets with the architecture of deep GCNs is confirmed to be effective for predicting HPO annotations of human proteins and, more generally, for the node label ranking problem with multiple biomolecular networks as input in bioinformatics. AVAILABILITY AND IMPLEMENTATION: https://github.com/liulizhi1996/HPODNets. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms; Computational Biology; Humans; Computational Biology/methods; Phenotype
13.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34893793

ABSTRACT

Drug-drug interactions (DDIs) are one of the major concerns in pharmaceutical research, and a number of computational methods have been developed to predict whether two drugs interact or not. Recently, more attention has been paid to events caused by the DDIs, which is more useful for investigating the mechanism hidden behind the combined drug usage or adverse reactions. However, some rare events may only have few examples, hindering them from being precisely predicted. To address the above issues, we present a few-shot computational method named META-DDIE, which consists of a representation module and a comparing module, to predict DDI events. We collect drug chemical structures and DDIs from DrugBank, and categorize DDI events into hundreds of types using a standard pipeline. META-DDIE uses the structures of drugs as input and learns the interpretable representations of DDIs through the representation module. Then, the model uses the comparing module to predict whether two representations are similar, and finally predicts DDI events with few labeled examples. In the computational experiments, META-DDIE outperforms several baseline methods and especially enhances the predictive capability for rare events. Moreover, META-DDIE helps to identify the key factors that may cause DDI events and reveal the relationship among different events.


Subjects
Drug Interactions; Pharmaceutical Preparations; Databases, Factual; Drug-Related Side Effects and Adverse Reactions; Humans; Models, Theoretical
14.
NAR Genom Bioinform ; 3(3): lqab066, 2021 Sep.
Article in English | MEDLINE | ID: mdl-34377977

ABSTRACT

Antibiotic resistance in bacteria limits the effect of corresponding antibiotics, and the classification of antibiotic resistance genes (ARGs) is important for the treatment of bacterial infections and for understanding the dynamics of microbial communities. Although several methods have been developed to classify ARGs, none of them work well when the ARGs diverge from those in the reference ARG databases. We develop a novel method, ARG-SHINE, for ARG classification. ARG-SHINE utilizes state-of-the-art learning to rank machine learning approach to ensemble three component methods with different features, including sequence homology, protein domain/family/motif and raw amino acid sequences for the deep convolutional neural network. Compared with other methods, ARG-SHINE achieves better performance on two benchmark datasets in terms of accuracy, macro-average f1-score and weighted-average f1-score. ARG-SHINE is used to classify newly discovered ARGs through functional screening and achieves high prediction accuracy. ARG-SHINE is freely available at https://github.com/ziyewang/ARG_SHINE.

15.
Bioinformatics ; 37(Suppl_1): i262-i271, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252926

ABSTRACT

MOTIVATION: Automated function prediction (AFP) of proteins is a large-scale multi-label classification problem. Two limitations of most network-based methods for AFP are (i) a single model must be trained for each species and (ii) protein sequence information is totally ignored. These limitations cause weaker performance than sequence-based methods. Thus, the challenge is how to develop a powerful network-based method for AFP to overcome these limitations. RESULTS: We propose DeepGraphGO, an end-to-end, multispecies graph neural network-based method for AFP, which makes the most of both protein sequence and high-order protein network information. Our multispecies strategy allows one single model to be trained for all species, indicating a larger number of training samples than existing methods. Extensive experiments with a large-scale dataset show that DeepGraphGO outperforms a number of competing state-of-the-art methods significantly, including DeepGOPlus and three representative network-based methods: GeneMANIA, deepNF and clusDCA. We further confirm the effectiveness of our multispecies strategy and the advantage of DeepGraphGO over so-called difficult proteins. Finally, we integrate DeepGraphGO into the state-of-the-art ensemble method, NetGO, as a component and achieve a further performance improvement. AVAILABILITY AND IMPLEMENTATION: https://github.com/yourh/DeepGraphGO. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Neural Networks, Computer; Proteins; Amino Acid Sequence
16.
Nucleic Acids Res ; 49(W1): W469-W475, 2021 07 02.
Article in English | MEDLINE | ID: mdl-34038555

ABSTRACT

With the explosive growth of protein sequences, large-scale automated protein function prediction (AFP) is becoming challenging. A protein is usually associated with dozens of gene ontology (GO) terms. Therefore, AFP is regarded as a problem of large-scale multi-label classification. Under the learning to rank (LTR) framework, our previous NetGO tool integrated massive networks and multi-type information about protein sequences to achieve good performance by dealing with all possible GO terms (>44 000). In this work, we propose the updated version as NetGO 2.0, which further improves the performance of large-scale AFP. NetGO 2.0 also incorporates literature information by logistic regression and deep sequence information by recurrent neural network (RNN) into the framework. We generate datasets following the critical assessment of functional annotation (CAFA) protocol. Experimental results show that NetGO 2.0 outperformed NetGO significantly in biological process ontology (BPO) and cellular component ontology (CCO). In particular, NetGO 2.0 achieved a 12.6% improvement over NetGO in terms of area under precision-recall curve (AUPR) in BPO and around 2.6% in terms of $F_{\max}$ in CCO. These results demonstrate the benefits of incorporating text and deep sequence information for the functional annotation of BPO and CCO. The NetGO 2.0 web server is freely available at http://issubmission.sjtu.edu.cn/ng2/.
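The protein-centric F-max measure referred to above is commonly defined as the maximum, over score thresholds, of the harmonic mean of precision (averaged over proteins with at least one prediction at that threshold) and recall (averaged over all evaluated proteins). The sketch below follows that common definition; the function name `f_max`, the dictionary layout, and the default threshold grid of 0.01 steps are assumptions, and term propagation through the ontology is not handled.

```python
from typing import Optional


def f_max(
    predictions: dict[str, dict[str, float]],
    truth: dict[str, set[str]],
    thresholds: Optional[list[float]] = None,
) -> float:
    """Protein-centric F-max over a grid of score thresholds.

    predictions maps protein -> {term: score}; truth maps protein -> set of
    true terms (each set must be non-empty).
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions = []
        recall_sum = 0.0
        for protein, true_terms in truth.items():
            scored = predictions.get(protein, {})
            predicted = {term for term, score in scored.items() if score >= t}
            hits = len(predicted & true_terms)
            # Precision only counts proteins with at least one prediction.
            if predicted:
                precisions.append(hits / len(predicted))
            # Recall is averaged over every protein in the benchmark.
            recall_sum += hits / len(true_terms)
        if not precisions:
            continue
        precision = sum(precisions) / len(precisions)
        recall = recall_sum / len(truth)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Because precision and recall are averaged over different sets of proteins, a method that predicts confidently for only a few proteins can show high precision while its recall, and hence its F-max, stays low.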


Subjects
Proteins/physiology; Software; CCAAT-Binding Factor/chemistry; CCAAT-Binding Factor/metabolism; Caenorhabditis elegans Proteins/chemistry; Caenorhabditis elegans Proteins/metabolism; High-Throughput Nucleotide Sequencing; Neural Networks, Computer; Protein Domains; Proteins/classification; Proteins/metabolism; Sequence Analysis, Protein
17.
Bioinformatics ; 37(19): 3328-3336, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33822886

ABSTRACT

MOTIVATION: Exploring the relationship between human proteins and abnormal phenotypes is of great importance in the prevention, diagnosis and treatment of diseases. The human phenotype ontology (HPO) is a standardized vocabulary that describes the phenotype abnormalities encountered in human diseases. However, the current HPO annotations of proteins are not complete. Thus, it is important to identify missing protein-phenotype associations. RESULTS: We propose HPOFiller, a graph convolutional network (GCN)-based approach, for predicting missing HPO annotations. HPOFiller has two key GCN components for capturing embeddings from complex network structures: (i) S-GCN for both protein-protein interaction network and HPO semantic similarity network to utilize network weights; (ii) Bi-GCN for the protein-phenotype bipartite graph to conduct message passing between proteins and phenotypes. The core idea of HPOFiller is to repeatedly run these two GCN modules consecutively over the three networks, to refine the embeddings. Empirical results from an extremely stringent evaluation that avoids potential information leakage, including cross-validation and temporal validation, demonstrate that HPOFiller significantly outperforms all other state-of-the-art methods. In particular, the ablation study shows that batch normalization contributes the most to the performance. Further examination offers literature evidence for highly ranked predictions. Finally, using known disease-HPO term associations, HPOFiller could suggest promising, unknown disease-gene associations, presenting possible genetic causes of human disorders. AVAILABILITY AND IMPLEMENTATION: https://github.com/liulizhi1996/HPOFiller. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

18.
Article in English | MEDLINE | ID: mdl-31494556

ABSTRACT

Grant support (GS) in the MEDLINE database refers to funding agencies and contract numbers. It is important for funding organizations to track their funding outcomes from the GS information. Accurately and automatically extracting funding information from biomedical literature is, however, challenging. In this paper, we present a pipeline system called GrantExtractor that is able to accurately extract GS information from full-text biomedical literature. GrantExtractor effectively integrates several advanced machine learning techniques. In particular, we first use a sentence classifier to identify funding sentences in articles. A bidirectional LSTM with a CRF layer (BiLSTM-CRF) and pattern matching are then used to extract entities of grant numbers and agencies from these identified funding sentences. After removing noisy numbers by a multi-class model, we finally match each grant number with its corresponding agency. Experimental results on benchmark datasets have demonstrated that GrantExtractor clearly outperforms all baseline methods. Furthermore, GrantExtractor won first place in Task 5C of the 2017 BioASQ challenge, achieving a Micro-recall of 0.9526 on 22,610 articles. Moreover, GrantExtractor achieved a Micro F-measure as high as 0.90 in extracting grant pairs.
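Micro-averaged recall and F-measure, as reported above, pool true positives over all articles before dividing, so articles with many grants weigh more than articles with few. The sketch below shows that pooling on sets of extracted items; the function name `micro_scores` and the grant identifiers in the usage example are hypothetical, and this is not the official BioASQ scorer.

```python
def micro_scores(
    gold: list[set[str]], predicted: list[set[str]]
) -> tuple[float, float, float]:
    """Micro-averaged precision, recall and F-measure over per-article sets.

    gold[i] and predicted[i] hold the true and extracted items for article i.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same articles")
    # Pool counts across all articles before computing any ratio.
    true_positives = sum(len(g & p) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, with gold sets `[{"A1", "B2"}, {"C3"}]` and predictions `[{"A1"}, {"C3", "D4"}]` (made-up identifiers), two of three gold items are recovered and two of three predictions are correct.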


Subjects
Data Mining/methods; Deep Learning; Financing, Organized; MEDLINE; Models, Statistical; National Library of Medicine (U.S.); United States
20.
Phenomics ; 1(4): 171-185, 2021 Aug.
Article in English | MEDLINE | ID: mdl-36939789

ABSTRACT

Deciphering the relationship between human proteins (genes) and phenotypes is one of the fundamental tasks in phenomics research. The Human Phenotype Ontology (HPO) builds upon a standardized logical vocabulary to describe the abnormal phenotypes encountered in human diseases and paves the way towards the computational analysis of their genetic causes. To date, many computational methods have been proposed to predict the HPO annotations of proteins. In this paper, we conduct a comprehensive review of the existing approaches to predicting HPO annotations of novel proteins, identifying missing HPO annotations, and prioritizing candidate proteins with respect to a certain HPO term. For each topic, we first give a formalized description of the problem, and then systematically revisit the published literature, highlighting the advantages and disadvantages of each approach, followed by a discussion of the challenges and promising future directions. In addition, we point out several potential topics worthy of exploration, including the selection of negative HPO annotations and the detection of HPO misannotations. We believe that this review will provide insight to researchers in the field of computational phenotype analysis in terms of comprehending and developing novel prediction algorithms.
