Results 1 - 11 of 11
1.
Comput Biol Med; 168: 107753, 2024 01.
Article in English | MEDLINE | ID: mdl-38039889

ABSTRACT

BACKGROUND: Trans-acting factors, a group of proteins that directly or indirectly recognize or bind the 8-12 bp core sequences of cis-acting elements and regulate the transcription efficiency of target genes, are of special importance in transcription regulation. Progress in high-throughput chromatin capture technology (e.g., Hi-C) enables the identification of chromatin-interacting sequence groups in which trans-acting DNA motif groups can be discovered. The difficulty lies in the combinatorial nature of DNA sequence pattern matching and its underlying sequence pattern search space. METHOD: Here, we propose MotifHub for trans-acting DNA motif group discovery on grouped sequences. The main approach is probabilistic modeling that accommodates the stochastic nature of DNA motif patterns. RESULTS: Based on this modeling, we develop global sampling techniques based on EM and Gibbs sampling to address the global optimization challenge of model fitting with latent variables. The results show that the proposed approaches deliver promising performance with linear time complexity. CONCLUSION: MotifHub is a novel algorithm that identifies both DNA co-binding motif groups and trans-acting TFs. Our study paves the way for identifying hub TFs of stem cell development (OCT4 and SOX2) and determining potential therapeutic targets of prostate cancer (FOXA1 and MYC). To ensure scientific reproducibility and long-term impact, its matrix-algebra-optimized source code is released at http://bioinfo.cs.cityu.edu.hk/MotifHub.
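
For readers unfamiliar with the Gibbs-sampling idea mentioned in the abstract, a minimal single-motif Gibbs sampler is sketched below. It is not the MotifHub implementation (which models motif groups over Hi-C-derived sequence groups); the motif width, scoring, and input sequences are illustrative assumptions only.

# Minimal Gibbs sampler for single-motif discovery (illustrative sketch only).
import random
from collections import Counter

BASES = "ACGT"

def pwm(motifs, w, pseudo=1.0):
    """Per-position base probabilities with pseudocounts."""
    cols = []
    for j in range(w):
        counts = Counter(m[j] for m in motifs)
        total = sum(counts[b] + pseudo for b in BASES)
        cols.append({b: (counts[b] + pseudo) / total for b in BASES})
    return cols

def score(window, cols):
    p = 1.0
    for j, b in enumerate(window):
        p *= cols[j][b]
    return p

def gibbs_motif(seqs, w, iters=1000, seed=0):
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]      # random start sites
    for _ in range(iters):
        i = rng.randrange(len(seqs))                         # hold one sequence out
        others = [s[p:p + w] for k, (s, p) in enumerate(zip(seqs, pos)) if k != i]
        cols = pwm(others, w)
        weights = [score(seqs[i][j:j + w], cols)
                   for j in range(len(seqs[i]) - w + 1)]
        pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return [s[p:p + w] for s, p in zip(seqs, pos)]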


Subject(s)
Algorithms; Software; Nucleotide Motifs/genetics; Reproducibility of Results; Chromatin/genetics
2.
J Biomed Inform; 129: 104059, 2022 05.
Article in English | MEDLINE | ID: mdl-35351638

ABSTRACT

The study aims at developing a neural network model to improve the performance of Human Phenotype Ontology (HPO) concept recognition tools. We used the terms, definitions, and comments about the phenotypic concepts in the HPO database to train our model. The document to be analyzed is first split into sentences and annotated with a base method to generate candidate concepts. The sentences, along with the candidate concepts, are then fed into the pre-trained model for re-ranking. Our model comprises the pre-trained BlueBERT and a feature selection module, followed by a contrastive loss. We re-ranked the results generated by three robust HPO annotation tools and compared the performance against most of the existing approaches. The experimental results show that our model can improve the performance of the existing methods. Significantly, it boosted 3.0% and 5.6% in F1 score on the two evaluated datasets compared with the base methods. It removed more than 80% of the false positives predicted by the base methods, resulting in up to 18% improvement in precision. Our model utilizes the descriptive data in the ontology and the contextual information in the sentences for re-ranking. The results indicate that the additional information and the re-ranking model can significantly enhance the precision of HPO concept recognition compared with the base method.
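
As a rough illustration of re-ranking candidate concepts against a sentence with a BERT-style encoder, a similarity-based sketch follows. It is not the paper's model (which adds a feature-selection module and a contrastive loss), and the BlueBERT checkpoint name is an assumption about the Hugging Face model id.

# Embedding-similarity re-ranking sketch (not the published architecture).
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12"  # assumed model id
tok = AutoTokenizer.from_pretrained(MODEL)
enc = AutoModel.from_pretrained(MODEL)

@torch.no_grad()
def embed(text):
    inputs = tok(text, return_tensors="pt", truncation=True)
    return enc(**inputs).last_hidden_state[:, 0]          # [CLS] embedding

def rerank(sentence, candidates):
    """Order candidate (hpo_id, label) pairs by similarity to the sentence."""
    s = embed(sentence)
    scored = [(torch.cosine_similarity(s, embed(label)).item(), hpo_id, label)
              for hpo_id, label in candidates]
    return sorted(scored, reverse=True)

print(rerank("The patient shows marked muscle weakness.",
             [("HP:0001324", "Muscle weakness"), ("HP:0001250", "Seizure")]))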


Subject(s)
Language; Neural Networks, Computer; Databases, Factual; Humans; Phenotype
3.
Annu Rev Biomed Data Sci; 4: 313-339, 2021 07 20.
Article in English | MEDLINE | ID: mdl-34465169

ABSTRACT

The COVID-19 (coronavirus disease 2019) pandemic has had a significant impact on society, both because of the serious health effects of COVID-19 and because of public health measures implemented to slow its spread. Many of these difficulties are fundamentally information needs; attempts to address these needs have caused an information overload for both researchers and the public. Natural language processing (NLP)-the branch of artificial intelligence that interprets human language-can be applied to address many of the information needs made urgent by the COVID-19 pandemic. This review surveys approximately 150 NLP studies and more than 50 systems and datasets addressing the COVID-19 pandemic. We detail work on four core NLP tasks: information retrieval, named entity recognition, literature-based discovery, and question answering. We also describe work that directly addresses aspects of the pandemic through four additional tasks: topic modeling, sentiment and emotion analysis, caseload forecasting, and misinformation detection. We conclude by discussing observable trends and remaining challenges.


Subject(s)
COVID-19/epidemiology; Information Storage and Retrieval/methods; Natural Language Processing; Communication; Data Mining/methods; Datasets as Topic; Emotions; Humans; Knowledge Discovery; Pandemics; Periodicals as Topic; Software
4.
Bioinformatics; 37(13): 1884-1890, 2021 Jul 27.
Article in English | MEDLINE | ID: mdl-33471061

ABSTRACT

MOTIVATION: Automatic phenotype concept recognition from unstructured text remains a challenging task in biomedical text mining research. Previous works that address the task typically use dictionary-based matching methods, which can achieve high precision but suffer from lower recall. Recently, machine learning-based methods have been proposed to identify biomedical concepts, which can recognize more unseen concept synonyms by automatic feature learning. However, most methods require large corpora of manually annotated data for model training, which is difficult to obtain due to the high cost of human annotation. RESULTS: In this article, we propose PhenoTagger, a hybrid method that combines both dictionary and machine learning-based methods to recognize Human Phenotype Ontology (HPO) concepts in unstructured biomedical text. We first use all concepts and synonyms in HPO to construct a dictionary, which is then used to automatically build a distantly supervised training dataset for machine learning. Next, a cutting-edge deep learning model is trained to classify each candidate phrase (n-gram from the input sentence) into a corresponding concept label. Finally, the dictionary and machine learning-based prediction results are combined for improved performance. Our method is validated with two HPO corpora, and the results show that PhenoTagger compares favorably to previous methods. In addition, to demonstrate the generalizability of our method, we retrained PhenoTagger using the disease ontology MEDIC for disease concept recognition to investigate the effect of training on different ontologies. Experimental results on the NCBI disease corpus show that PhenoTagger, without requiring manually annotated training data, achieves competitive performance compared with state-of-the-art supervised methods. AVAILABILITY AND IMPLEMENTATION: The source code, API information and data for PhenoTagger are freely available at https://github.com/ncbi-nlp/PhenoTagger. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
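
To make the hybrid idea concrete, a sketch of the dictionary half is shown below: candidate n-grams are generated from a sentence and looked up in an HPO name/synonym lexicon. The deep learning classifier and the score combination used by PhenoTagger are omitted, and the two-entry dictionary is a stand-in for the real HPO lexicon.

# Dictionary-based candidate tagging sketch (the ML half is omitted).
HPO_DICT = {
    "muscle weakness": "HP:0001324",
    "short stature": "HP:0004322",
}

def candidate_ngrams(tokens, max_len=4):
    """Yield all token n-grams up to max_len words."""
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
            yield " ".join(tokens[i:j])

def dictionary_tag(sentence):
    tokens = sentence.lower().split()
    return {phrase: HPO_DICT[phrase]
            for phrase in candidate_ngrams(tokens) if phrase in HPO_DICT}

print(dictionary_tag("The patient has short stature and muscle weakness"))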

5.
PLoS Comput Biol; 16(4): e1007617, 2020 04.
Article in English | MEDLINE | ID: mdl-32324731

ABSTRACT

A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. Capturing the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings-which involve learning vector representations of concepts using machine learning models-have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is the largest among the publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks), which is, to the best of our knowledge, by far the most comprehensive evaluation of its kind. The intrinsic evaluation results demonstrate that BioConceptVec consistently outperforms existing concept embeddings, by a large margin, in identifying similar and related concepts. More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec.
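
A typical way to query pre-trained concept embeddings of this kind is sketched below with gensim. The file name and the concept-ID token ("Gene_7157") are assumptions about the released BioConceptVec format; the repository linked above documents the exact identifiers and download links.

# Nearest-neighbour query over concept embeddings (file name and IDs assumed).
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "bioconceptvec_word2vec_skipgram.bin", binary=True)

# Concepts closest to a gene of interest by cosine similarity in embedding space.
for concept, sim in vectors.most_similar("Gene_7157", topn=5):
    print(concept, round(sim, 3))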


Subject(s)
Computational Biology/methods; Data Mining/methods; Deep Learning; Publications; Algorithms; Databases, Protein; Drug Interactions; Electronic Health Records; Humans; Protein Interaction Mapping; PubMed; Semantics
6.
IEEE J Biomed Health Inform; 24(1): 311-318, 2020 01.
Article in English | MEDLINE | ID: mdl-30714937

ABSTRACT

The Gene Expression Omnibus (GEO) repository harbours an exponentially increasing number of gene expression studies. The expression data, as well as the related metadata, provides an abundant resource for knowledge discovery. Each study in GEO focuses on the gene expression perturbation of a specific subject (e.g., a gene, drug, or disease). Identifying those subjects and the associations among them is beneficial for further in-depth studies; however, they cannot be directly inferred from the studies, so a unified representation of those subjects (i.e., gene expression signatures) is desired. We developed GESgnExt for the automatic construction of gene expression signatures. The resulting 6542 signatures are built on 1934 series and 35,919 samples from GEO. To evaluate their significance, we calculated the similarities among those signatures and compared the discovered associations against existing interaction databases. The signatures connect genes, drugs, and diseases, covering most of the experimentally validated interactions. In addition, we discovered 3307 novel signatures and their related associations, complementing existing signature knowledge. The biomedical relevance of GESgnExt is further demonstrated in multiple case studies, providing mechanistic insights into its knowledge discovery process.
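
As a toy illustration of comparing two gene expression signatures, each can be represented as a vector of per-gene log2 fold changes and scored with a rank correlation. This mirrors the general idea of signature similarity; it is not the GESgnExt pipeline, and the gene lists and values are made up.

# Signature similarity sketch with fabricated per-gene log2 fold changes.
import pandas as pd

sig_a = pd.Series({"TP53": 1.8, "MYC": -0.9, "EGFR": 0.4, "BRCA1": 1.1})
sig_b = pd.Series({"TP53": 1.5, "MYC": -1.2, "EGFR": 0.1, "BRCA1": 0.9})

# Align on shared genes and use Spearman (rank) correlation as the similarity.
shared = sig_a.index.intersection(sig_b.index)
similarity = sig_a[shared].corr(sig_b[shared], method="spearman")
print(f"signature similarity: {similarity:.2f}")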


Subject(s)
Database Management Systems; Databases, Genetic; Gene Expression Profiling/methods; Genomics/methods; Antineoplastic Agents/therapeutic use; Data Mining; Drug Resistance, Neoplasm/genetics; Humans; Meta-Analysis as Topic; Neoplasms/drug therapy; Neoplasms/genetics
7.
Article in English | MEDLINE | ID: mdl-30475727

ABSTRACT

Transcription factors (TFs) are major components of human gene regulation. In particular, they bind to specific DNA sequences and regulate neighboring genes in different tissues at different developmental stages. Non-synonymous single nucleotide polymorphisms in their protein-coding sequences can have undesired consequences in humans. Therefore, methods are needed to predict which of those non-synonymous single nucleotide polymorphisms are deleterious. To address this, we developed and compared different strategies to predict deleterious non-synonymous single nucleotide polymorphisms (also known as missense mutations) in the protein-coding sequences of human TFs. Taking advantage of evolutionary conservation signals, we developed and compared classifiers with different feature sets computed from different evolutionarily related sequence collections. The results indicate that the classic ensemble algorithm AdaBoost with decision stumps, trained on the orthologous sequence collection, performed best; we name this classifier TFmedic. We further compared TFmedic with other state-of-the-art methods (i.e., PolyPhen-2 and SIFT) on PolyPhen-2's own datasets, demonstrating that TFmedic can outperform them. As an application, we applied TFmedic to all possible missense mutations in all human transcription factors; the proteome-wide results reveal interesting insights consistent with existing physicochemical knowledge. A case study with an actual 3D structure is conducted, revealing how TFmedic can contribute to studies of protein-DNA binding complexes.
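
AdaBoost over decision stumps is a standard ensemble; a generic scikit-learn version on synthetic data is sketched below. The conservation-based features used by TFmedic are not reproduced here, so this only illustrates the named algorithm, not the published classifier.

# AdaBoost with depth-1 decision trees ("stumps") on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

stump_boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                 n_estimators=200, random_state=0)
print(cross_val_score(stump_boost, X, y, cv=5, scoring="roc_auc").mean())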


Subject(s)
Machine Learning; Polymorphism, Single Nucleotide/genetics; Transcription Factors/genetics; Algorithms; Computational Biology; Data Mining; Humans; Mutation, Missense/genetics
8.
Bioinformatics; 36(2): 637-643, 2020 01 15.
Article in English | MEDLINE | ID: mdl-31392318

ABSTRACT

MOTIVATION: Biomedical event extraction is fundamental for information extraction in molecular biology and biomedical research. The detected events form the central basis for comprehensive biomedical knowledge fusion, facilitating the digestion of the massive influx of information from the literature. Limited by the event context, existing event detection models are mostly applicable to a single task. A general and scalable computational model is needed for biomedical knowledge management. RESULTS: We propose a bottom-up detection framework that identifies events from recognized arguments. To capture the relations between the arguments, we trained a bidirectional long short-term memory network to model their context embedding. Leveraging the compositional attributes, we further derived candidate samples for training the event classifiers. We built our models on datasets from the BioNLP Shared Task for evaluation. Our method achieved average F-scores of 0.81 and 0.92 on the BioNLPST-BGI and BioNLPST-BB datasets, respectively. Compared with seven state-of-the-art methods, our method nearly doubled the existing F-score performance (0.92 versus 0.56) on the BioNLPST-BB dataset. Case studies were conducted to reveal the underlying reasons. AVAILABILITY AND IMPLEMENTATION: https://github.com/cskyan/evntextrc. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
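
A minimal bidirectional LSTM context encoder of the kind the abstract describes is sketched below in PyTorch. The dimensions, vocabulary size, and mean pooling are illustrative choices, not those of the released evntextrc code.

# Bidirectional LSTM encoder producing a pooled context embedding per sentence.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        out, _ = self.bilstm(self.embed(token_ids))   # (batch, seq, 2 * hidden)
        return out.mean(dim=1)                        # pooled context embedding

encoder = ContextEncoder()
tokens = torch.randint(0, 5000, (2, 12))              # two sentences of 12 token ids
print(encoder(tokens).shape)                          # torch.Size([2, 256])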


Subject(s)
Biomedical Research; Publications; Information Storage and Retrieval; Molecular Biology
9.
iScience; 15: 332-341, 2019 May 31.
Article in English | MEDLINE | ID: mdl-31103852

ABSTRACT

The early detection of cancers has the potential to save many lives. A recent attempt has been demonstrated to be successful; however, we note several critical limitations. Given the central importance and broad impact of early cancer detection, we aspire to address those limitations. We explore different supervised learning approaches for multiple cancer type detection and observe significant improvements; for instance, one of our approaches (i.e., CancerA1DE) can double the existing sensitivity from 38% to 77% for the earliest cancer detection (i.e., Stage I) at the 99% specificity level. For Stage II, it can reach up to about 90% across multiple cancer types. In addition, CancerA1DE can also double the existing sensitivity from 30% to 70% for detecting breast cancers at the 99% specificity level. Data and model analyses are conducted to reveal the underlying reasons. A website is built at http://cancer.cs.cityu.edu.hk/.
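
The abstract reports sensitivity at a fixed 99% specificity. A generic way to read that operating point off an ROC curve with scikit-learn is sketched below; the labels and scores are synthetic, not CancerA1DE outputs.

# Sensitivity at 99% specificity from an ROC curve (synthetic scores).
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = np.r_[np.zeros(900), np.ones(100)]            # 0 = healthy, 1 = cancer
scores = np.r_[rng.normal(0, 1, 900), rng.normal(2, 1, 100)]

fpr, tpr, _ = roc_curve(y_true, scores)
sens_at_99_spec = tpr[fpr <= 0.01].max()               # specificity = 1 - FPR
print(f"sensitivity at 99% specificity: {sens_at_99_spec:.2f}")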

10.
J Biomed Inform; 73: 84-94, 2017 09.
Article in English | MEDLINE | ID: mdl-28723579

ABSTRACT

MOTIVATION: Cancer hallmark annotation is a promising technique for discovering novel knowledge about cancer from the biomedical literature. The automated annotation of cancer hallmarks could reveal relevant cancer transformation processes in the literature or extract the articles that correspond to the cancer hallmark of interest. It acts as a complementary approach that can retrieve knowledge from massive amounts of text, advancing numerous focused studies in cancer research. Nonetheless, the high-dimensional nature of cancer hallmark annotation imposes a unique challenge. RESULTS: To address the curse of dimensionality, we compared multiple cancer hallmark annotation methods on 1580 PubMed abstracts. Based on the insights, we propose a novel approach, UDT-RF, which makes use of ontological features. It expands the feature space via the Medical Subject Headings (MeSH) ontology graph and utilizes novel feature selection to elucidate the high-dimensional cancer hallmark annotation space. To demonstrate its effectiveness, state-of-the-art methods are compared and evaluated with a multitude of performance metrics, revealing the full performance spectrum on the complete set of cancer hallmarks. Several case studies are conducted, demonstrating how the proposed approach could reveal novel insights into cancers. AVAILABILITY: https://github.com/cskyan/chmannot.
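
The ontology-expansion idea can be illustrated as follows: each document's MeSH terms are propagated to their ancestors before classification. The tiny parent map, the binary features, and the plain random forest below are illustrative stand-ins, not the UDT-RF feature selection described in the paper.

# MeSH ancestor expansion followed by a random forest (toy example).
from sklearn.ensemble import RandomForestClassifier

PARENTS = {"Breast Neoplasms": ["Neoplasms"], "Neoplasms": []}   # toy MeSH fragment

def expand(terms):
    """Return the terms plus all of their ontology ancestors."""
    seen, stack = set(), list(terms)
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS.get(t, []))
    return seen

docs = [expand({"Breast Neoplasms"}), expand({"Neoplasms"})]
vocab = sorted(set().union(*docs))
X = [[int(t in d) for t in vocab] for d in docs]                 # binary features
y = [1, 0]                                                       # toy hallmark labels
RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)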


Subject(s)
Data Curation; Medical Subject Headings; Neoplasms; PubMed; Biological Ontologies; Humans
11.
IEEE Trans Nanobioscience; 16(1): 43-50, 2017 01.
Article in English | MEDLINE | ID: mdl-27893398

ABSTRACT

Understanding genome-wide protein-DNA interaction signals forms the basis for further focused studies in gene regulation. In particular, chromatin immunoprecipitation with massively parallel DNA sequencing (ChIP-Seq) enables us to measure the in vivo genome-wide occupancy of a DNA-binding protein of interest in a single run. Multiple ChIP-Seq runs thus hold the potential for deciphering the combinatorial occupancies of multiple DNA-binding proteins. To handle the genome-wide signal profiles from those multiple runs, we propose to integrate regularized regression functions (i.e., LASSO, Elastic Net, and Ridge Regression) into the well-established SignalRanker and FullSignalRanker frameworks, resulting in six additional probabilistic models for inference on multiple normalized genome-wide signal profiles. The corresponding model training algorithms are devised with computational complexity analysis. Comprehensive benchmarking is conducted to demonstrate and compare the performance of the nine related probabilistic models on the ENCODE ChIP-Seq datasets. The results indicate that the regularized SignalRanker models, in contrast to the original SignalRanker models, can achieve inference performance comparable to the FullSignalRanker models with low model complexity and time complexity. Such a feature is especially valuable in the context of the rapidly growing volume of genome-wide signal profile data in recent years.
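
The three regularized regression functions named in the abstract are standard; a generic scikit-learn comparison on synthetic regression data is sketched below. It is not the SignalRanker integration itself, and the penalty strengths are arbitrary.

# LASSO, Elastic Net, and Ridge regression compared by cross-validated R^2.
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=50, noise=5.0, random_state=0)

for name, model in [("LASSO", Lasso(alpha=1.0, max_iter=10000)),
                    ("Elastic Net", ElasticNet(alpha=1.0, l1_ratio=0.5, max_iter=10000)),
                    ("Ridge", Ridge(alpha=1.0))]:
    r2 = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: mean CV R^2 = {r2:.3f}")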


Subject(s)
Chromatin Immunoprecipitation/methods; Genomics/methods; Models, Statistical; Oligonucleotide Array Sequence Analysis/methods; Sequence Analysis, DNA/methods; Algorithms; Genome/genetics; High-Throughput Nucleotide Sequencing; Humans; K562 Cells