1.
Comput Struct Biotechnol J ; 24: 322-333, 2024 Dec.
Article in English | MEDLINE | ID: mdl-38690549

ABSTRACT

Data curation for a hospital-based cancer registry heavily relies on the labor-intensive manual abstraction process by cancer registrars to identify cancer-related information from free-text electronic health records. To streamline this process, a natural language processing system incorporating a hybrid of deep learning-based and rule-based approaches for identifying lung cancer registry-related concepts, along with a symbolic expert system that generates registry coding based on weighted rules, was developed. The system is integrated with the hospital information system at a medical center to provide cancer registrars with a patient journey visualization platform. The embedded system offers a comprehensive view of patient reports annotated with significant registry concepts to facilitate the manual coding process and elevate overall quality. Extensive evaluations, including comparisons with state-of-the-art methods, were conducted using a lung cancer dataset comprising 1428 patients from the medical center. The experimental results illustrate the effectiveness of the developed system, consistently achieving F1-scores of 0.85 and 1.00 across 30 coding items. Registrar feedback highlights the system's reliability as a tool for assisting and auditing the abstraction. By presenting key registry items along the timeline of a patient's reports with accurate code predictions, the system improves the quality of registrar outcomes and reduces the labor resources and time required for data abstraction. Our study highlights advancements in cancer registry coding practices, demonstrating that the proposed hybrid weighted neural-symbolic cancer registry system is reliable and efficient for assisting cancer registrars in the coding workflow and contributing to clinical outcomes.

2.
J Med Internet Res ; 25: e48145, 2023 12 06.
Article in English | MEDLINE | ID: mdl-38055317

ABSTRACT

BACKGROUND: Electronic health records (EHRs) in unstructured formats are valuable sources of information for research in both the clinical and biomedical domains. However, before such records can be used for research purposes, sensitive health information (SHI) must be removed in several cases to protect patient privacy. Rule-based and machine learning-based methods have been shown to be effective in deidentification. However, very few studies have investigated the combination of transformer-based language models and rules. OBJECTIVE: The objective of this study is to develop a hybrid deidentification pipeline for Australian EHR text notes using rules and transformers. The study also aims to investigate the impact of pretrained word embeddings and transformer-based language models. METHODS: In this study, we present a hybrid deidentification pipeline called OpenDeID, which is developed using an Australian multicenter EHR-based corpus called the OpenDeID Corpus. The OpenDeID Corpus consists of 2100 pathology reports with 38,414 SHI entities from 1833 patients. The OpenDeID pipeline incorporates a hybrid approach of associative rules, supervised deep learning, and pretrained language models. RESULTS: OpenDeID achieved a best F1-score of 0.9659 by fine-tuning the Discharge Summary BioBERT model and incorporating various preprocessing and postprocessing rules. The OpenDeID pipeline has been deployed at a large tertiary teaching hospital and has processed over 8000 unstructured EHR text notes in real time. CONCLUSIONS: The OpenDeID pipeline is a hybrid deidentification pipeline to deidentify SHI entities in unstructured EHR text notes. The pipeline has been evaluated on a large multicenter corpus. External validation will be undertaken as part of our future work to evaluate the effectiveness of the OpenDeID pipeline.


Subject(s)
Data Anonymization , Electronic Health Records , Humans , Australia , Algorithms , Hospitals, Teaching
3.
Database (Oxford) ; 2023, 2023 02 03.
Article in English | MEDLINE | ID: mdl-36734300

ABSTRACT

This study presents the outcomes of the shared task competition BioCreative VII (Task 3) focusing on the extraction of medication names from a Twitter user's publicly available tweets (the user's 'timeline'). In general, detecting health-related tweets is notoriously challenging for natural language processing tools. The main challenge, aside from the informality of the language used, is that people tweet about any and all topics, and most of their tweets are not related to health. Thus, finding those tweets in a user's timeline that mention specific health-related concepts such as medications requires addressing extreme imbalance. Task 3 called for detecting tweets in a user's timeline that mention a medication name and, for each detected mention, extracting its span. The organizers made available a corpus consisting of 182 049 tweets publicly posted by 212 Twitter users with all medication mentions manually annotated. The corpus exhibits the natural distribution of positive tweets, with only 442 tweets (0.2%) mentioning a medication. This task was an opportunity for participants to evaluate methods that are robust to class imbalance beyond the simple lexical match. A total of 65 teams registered, and 16 teams submitted a system run. This study summarizes the corpus created by the organizers and the approaches taken by the participating teams for this challenge. The corpus is freely available at https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-3/. The methods and the results of the competing systems are analyzed with a focus on the approaches taken for learning from class-imbalanced data.


Subject(s)
Data Mining , Natural Language Processing , Humans , Data Mining/methods
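
The imbalance described in this abstract can be made concrete with a small worked example. The sketch below uses only the two corpus counts quoted above (182 049 tweets, 442 positives) and compares accuracy with the F1-score for a hypothetical baseline that labels every tweet as negative. The function names are illustrative assumptions and are not taken from the shared-task evaluation code.

```python
# Corpus counts quoted in the abstract.
TOTAL_TWEETS = 182_049
POSITIVE_TWEETS = 442


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2PR / (P + R); defined here as 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def all_negative_baseline(total: int, positives: int) -> tuple[float, float]:
    """Accuracy and F1 of a (hypothetical) classifier that never predicts a mention."""
    true_negatives = total - positives
    accuracy = true_negatives / total
    f1 = f1_score(tp=0, fp=0, fn=positives)
    return accuracy, f1


positive_rate = POSITIVE_TWEETS / TOTAL_TWEETS
accuracy, f1 = all_negative_baseline(TOTAL_TWEETS, POSITIVE_TWEETS)
print(f"positive rate: {positive_rate:.2%}")
print(f"all-negative baseline: accuracy={accuracy:.2%}, F1={f1:.2f}")
```

The baseline is almost always "right" in terms of accuracy while finding no medication mentions at all, which is why imbalanced tasks of this kind are usually scored on precision, recall and F1 for the positive class.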
4.
Zhen Ci Yan Jiu ; 46(9): 782-8, 2021 Sep 25.
Article in Chinese | MEDLINE | ID: mdl-34558245

ABSTRACT

OBJECTIVE: To explore the molecular mechanism by which the locus coeruleus (LC) is involved in the anti-myocardial ischemia effect of electroacupuncture (EA). METHODS: Twenty-four SD rats were randomly divided into sham-operation, model, EA and EA + lesion groups, with 6 rats in each group. The acute myocardial ischemia (AMI) model was established by ligation of the left anterior descending branch of the coronary artery. EA (2 Hz/15 Hz, 1 mA) was applied to bilateral "Shenmen" (HT7)-"Tongli" (HT5) and the midpoint between HT7 and HT5 for 30 min, once daily for 3 days. For rats of the EA + lesion group, the virus (300 nL) was injected into the bilateral LC before EA treatment. Serum aspartate aminotransferase (AST) was detected by ELISA. The gene expression profiles of rat hearts were detected by transcriptome sequencing, the differentially expressed genes were screened, and Gene Ontology (GO) functional classification and Kyoto Encyclopedia of Genes and Genomes (KEGG) metabolic pathway enrichment analysis were performed. RESULTS: Compared with the sham-operation group, serum AST content was significantly increased in the model group (P<0.01). Following the intervention, serum AST was significantly reduced in the EA group (P<0.01), while serum AST in the EA + lesion group was significantly higher than in the EA group (P<0.05). Differential expression analysis identified 1 138 differentially expressed genes between the model group and the sham-operation group, 1 330 between the model and EA groups, and 804 between the EA and EA + lesion groups. Among them, 218 differentially expressed genes were involved in the LC-related regulation of the anti-myocardial ischemia effect of EA. GO functional classification analysis showed that, among biological processes, these differentially expressed genes were mainly involved in cell processes, metabolic processes and biological regulation. KEGG pathway analysis showed that these differentially expressed genes were enriched in the sulfur relay system, thiamine metabolism, glutathione metabolism, C5 branch dicarboxylic acid metabolism, cell adhesion molecules and Th1 and Th2 cell differentiation. CONCLUSION: EA intervention has a positive effect against myocardial ischemia, which may be related to the involvement of the LC in the sulfur relay system, thiamine metabolism, glutathione metabolism, C5 branch dicarboxylic acid metabolism, cell adhesion molecules and Th1 and Th2 cell differentiation.


Subject(s)
Electroacupuncture , Myocardial Ischemia , Acupuncture Points , Animals , Locus Coeruleus , Myocardial Ischemia/genetics , Myocardial Ischemia/therapy , Rats , Rats, Sprague-Dawley , Transcriptome
5.
Zhongguo Zhong Yao Za Zhi ; 46(5): 1084-1093, 2021 Mar.
Article in Chinese | MEDLINE | ID: mdl-33787101

ABSTRACT

To enrich the transcriptome data of Fagopyrum dibotrys, analyze the genes encoding key enzymes involved in the flavonoid biosynthesis pathway, and mine their functional genes, we performed RNA sequencing analysis of the rhizomes, roots, flowers, leaves and stems of F. dibotrys on the BGISEQ-500 sequencing platform. After de novo assembly of transcripts, a total of 205 619 unigenes were generated, and 132 372 unigenes were annotated against seven public databases. Of these, 81 327 unigenes were mapped to the GO database, and most of them were annotated under cellular process, biological regulation, binding and catalytic activity. In addition, 86 922 unigenes were enriched in 136 pathways using the KEGG database, and we identified 82 unigenes that encode key enzymes involved in flavonoid biosynthesis. Comparing rhizome with root, flower, leaf or stem in F. dibotrys, 27 962 co-expressed differentially expressed genes (DEGs) were obtained. Among them, 23 515 rhizome-specific DEGs were enriched in 132 pathways, and 13 unigenes were significantly enriched in the biosynthesis of flavone and flavonol. We also identified 3 427 unigenes encoding 60 transcription factor (TF) families, as well as four unigenes encoding bHLH TFs that were enriched in flavonoid biosynthesis. Our results greatly enriched the transcriptome database of plants, provided a reference for the analysis of key enzymes involved in flavonoid biosynthesis in plants, and will facilitate the study of the functions and regulatory mechanisms of key enzymes involved in flavonoid biosynthesis in F. dibotrys at the genetic level.


Subject(s)
Fagopyrum , Biosynthetic Pathways/genetics , Flavonoids , Flowers , Gene Expression Profiling , Gene Expression Regulation, Plant , Humans , Transcriptome/genetics
6.
Zhongguo Zhong Yao Za Zhi ; 45(12): 2847-2857, 2020 Jun.
Article in Chinese | MEDLINE | ID: mdl-32627459

ABSTRACT

Steroidal saponins, which are the characteristic and main active constituents of Polygonatum, exhibit a broad range of pharmacological functions, such as regulating blood sugar, preventing cardiovascular and cerebrovascular diseases, and antitumor activity. In this study, we performed RNA sequencing (RNA-Seq) analysis of the flowers, leaves, roots, and rhizomes of Polygonatum cyrtonema using the BGISEQ-500 platform to understand the biosynthesis pathway of steroidal saponins and study their key enzyme genes. The assembly of transcripts for the four tissues generated 129 989 unigenes, of which 88 958 were mapped to several public databases for functional annotation, 22 813 unigenes were assigned to 53 subcategories and 64 877 unigenes were annotated to 136 pathways in the KEGG database. Furthermore, 502 unigenes involved in the biosynthesis pathway of steroidal saponins were identified, of which 97 unigenes encoded 12 key enzymes. Cycloartenol synthase, the first key enzyme in the pathway of phytosterol biosynthesis, showed a conserved catalytic domain and substrate binding domain based on sequence analysis and homology modeling. Differentially expressed genes (DEGs) were identified in rhizomes as compared to other tissues (flowers, leaves or roots). Of the unigenes annotated by KEGG, 2 437 showed rhizome-specific expression, of which 35 unigenes were involved in the biosynthesis of steroidal saponins. Our results greatly extend the public transcriptome dataset of Polygonatum and provide valuable information for the identification of candidate genes involved in the biosynthesis of steroidal saponins and other important secondary metabolites.


Subject(s)
Polygonatum , Saponins , Biosynthetic Pathways , Gene Expression Profiling , Sequence Analysis, RNA , Transcriptome
7.
Front Psychiatry ; 11: 533949, 2020.
Article in English | MEDLINE | ID: mdl-33584354

ABSTRACT

The introduction of pre-trained language models in natural language processing (NLP) based on deep learning and the availability of electronic health records (EHRs) present a great opportunity to transfer the "knowledge" learned from data in the general domain to enable the analysis of unstructured textual data in clinical domains. This study explored the feasibility of applying NLP to a small EHR dataset to investigate the power of transfer learning to facilitate the process of patient screening in psychiatry. A total of 500 patients were randomly selected from a medical center database. Three annotators with clinical experience reviewed the notes to make diagnoses for major/minor depression, bipolar disorder, schizophrenia, and dementia to form a small and highly imbalanced corpus. Several state-of-the-art NLP methods based on deep learning along with pre-trained models based on shallow or deep transfer learning were adapted to develop models to classify the aforementioned diseases. We hypothesized that the models that rely on transferred knowledge would outperform the models learned from scratch. The experimental results demonstrated that the models with the pre-trained techniques outperformed the models without transferred knowledge by 0.11 and 0.28 in micro-averaged and macro-averaged F-scores, respectively. Our results also suggested that using the feature dependency strategy to build multi-labeling models, instead of problem transformation, is superior considering its higher performance and simplicity in the training process.
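
The gap between the micro-averaged and macro-averaged figures reported above is typical of imbalanced multi-label corpora. As a minimal sketch of how the two averages differ, the Python example below computes both on a toy multi-label dataset. The labels, documents and function names are invented for illustration and are not the study's data or code.

```python
from collections import Counter


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from counts; 0.0 when the label is never predicted or present."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(gold, pred, labels):
    """gold and pred are lists of label sets, one set per document."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_set, pred_set in zip(gold, pred):
        for label in labels:
            if label in pred_set and label in gold_set:
                tp[label] += 1
            elif label in pred_set:
                fp[label] += 1
            elif label in gold_set:
                fn[label] += 1
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Toy data (hypothetical): a frequent label handled well, a rare one missed.
labels = ["depression", "dementia"]
gold = [{"depression"}, {"depression"}, {"depression"}, {"dementia"}]
pred = [{"depression"}, {"depression"}, {"depression"}, set()]

micro, macro = micro_macro_f1(gold, pred, labels)
print(f"micro-F1={micro:.3f}, macro-F1={macro:.3f}")
```

Because the macro average weights every label equally, a model that misses a rare diagnosis is penalized far more under macro-F1 than under micro-F1, so improvements on rare classes show up mainly in the macro figure.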

8.
Int J Med Inform ; 129: 122-132, 2019 09.
Article in English | MEDLINE | ID: mdl-31445246

ABSTRACT

BACKGROUND: Social media are now often used by the general public to create and share public messages related to their health. With the global increase in social media usage, there is a trend of posting information related to adverse drug reactions (ADR). Mining social media data for this type of information will be helpful for pharmacological post-marketing surveillance and monitoring. Although the concept of using social media to facilitate pharmacovigilance is convincing, construction of automatic ADR detection systems remains a challenge because the corpora compiled from social media tend to be highly imbalanced, posing a major obstacle to the development of classifiers with reliable performance. METHODS: Several methods have been proposed to address the challenge of imbalanced corpora. However, we are not aware of any studies that investigated the effectiveness of strategies for dealing with imbalanced data in the context of ADR detection from social media. In light of this, we evaluated a variety of techniques for imbalanced data and proposed a novel word embedding-based synthetic minority over-sampling technique (WESMOTE), which synthesizes new training examples from the sentence representation based on word embeddings. We compared the performance of all methods on two large imbalanced datasets released for the purpose of detecting ADR posts. RESULTS: In comparison with the state-of-the-art approaches, the classifiers that incorporated imbalanced classification techniques achieved comparable or better F-scores. All of our best performing configurations combined random under-sampling with techniques including the proposed WESMOTE, boosting and ensemble, implying that an integration of these approaches with under-sampling provides a reliable solution for large imbalanced social media datasets. Furthermore, ensemble-based methods like vote-based under-sampling (VUE) and random under-sampling boosting can be alternatives to the hybrid synthetic methods because both methods increase the diversity of the created weak classifiers, leading to better recall and overall F-scores for the minority classes. CONCLUSIONS: Data collected from social media are usually very large and highly imbalanced. To maximize the performance of a classifier trained on such data, strategies for handling imbalance are required. We considered several practical methods for handling imbalanced Twitter data along with their performance on the binary classification task with respect to ADRs. In conclusion, the following practical insights are gained: 1) When dealing with text classification, the proposed word embedding-based synthetic minority over-sampling technique is more effective than traditional synthetic-based over-sampling methods. 2) In cases where large amounts of training data are available, the imbalanced strategies combined with under-sampling techniques are preferred. 3) Finally, employment of advanced methods does not guarantee better performance than simpler ones such as VUE, which achieved high performance with advantages like faster building time and ease of development.


Subject(s)
Drug-Related Side Effects and Adverse Reactions , Social Media , Awareness , Pharmacovigilance
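
The two ideas this abstract combines, random under-sampling of the majority class and synthetic over-sampling in a word-embedding space, can be sketched in a few lines of Python. The sketch assumes that a sentence is represented by the mean of its word vectors and that each synthetic example is interpolated between a minority example and its nearest minority neighbor, as in classic SMOTE. Both choices are simplifying assumptions, the toy embeddings and function names are hypothetical, and this is not the authors' WESMOTE implementation.

```python
import random


def sentence_vector(tokens, embeddings):
    """Mean of the word vectors of the known tokens (assumed representation)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(next(iter(embeddings.values())))
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def random_under_sample(majority, n, rng):
    """Keep a random subset of n majority-class examples."""
    return rng.sample(majority, min(n, len(majority)))


def nearest_neighbor(x, candidates):
    """Candidate with the smallest squared Euclidean distance to x."""
    return min(candidates, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, c)))


def synthesize(minority, n_new, rng):
    """SMOTE-style interpolation; needs at least two minority examples."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbor = nearest_neighbor(x, [m for m in minority if m is not x])
        lam = rng.random()  # position on the segment between x and its neighbor
        synthetic.append([a + lam * (b - a) for a, b in zip(x, neighbor)])
    return synthetic


# Toy 2-dimensional embeddings (hypothetical values).
rng = random.Random(0)
embeddings = {"rash": [1.0, 0.0], "nausea": [0.8, 0.2],
              "after": [0.0, 1.0], "pill": [0.5, 0.5]}
minority = [sentence_vector(["rash", "after", "pill"], embeddings),
            sentence_vector(["nausea", "after", "pill"], embeddings)]
majority = [[rng.random(), rng.random()] for _ in range(20)]

balanced_majority = random_under_sample(majority, 4, rng)
new_points = synthesize(minority, 2, rng)
print(len(balanced_majority), len(minority) + len(new_points))
```

Interpolating in embedding space keeps each synthetic example close to real minority sentences, whereas under-sampling shrinks the majority class so that the classifier is not dominated by negative posts.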
9.
Zhongguo Zhong Yao Za Zhi ; 44(9): 1799-1807, 2019 May.
Article in Chinese | MEDLINE | ID: mdl-31342705

ABSTRACT

Chalcone synthase (CHS) and chalcone isomerase (CHI) are key enzymes in the biosynthesis pathway of flavonoids. In this study, unigenes for CHS and CHI were screened from the transcriptome database of Arisaema heterophyllum. The open reading frames (ORFs) of chalcone synthase (AhCHS) and chalcone isomerase (AhCHI) were cloned from the plant by RT-PCR. The physicochemical properties, expression and structural characteristics of the encoded proteins AhCHS and AhCHI were analyzed. The ORFs of AhCHS and AhCHI were 1 176 and 630 bp in length and encoded 392 and 209 amino acids, respectively. AhCHS functioned as a symmetric homodimer. The N-terminal helix of one monomer entwined with the corresponding helix of the other monomer. Each CHS monomer consisted of two structural domains. In particular, four conserved residues defined the active site. The tertiary structure of AhCHI revealed a novel open-faced β-sandwich fold. A large β-sheet (β4-β11) and a layer of α-helices (α1-α7) comprised the core structure. The residues spanning β4, β5, α4, and α6 in the three-dimensional structure were conserved among CHIs from different species. Notably, these structural elements formed the active site on the protein surface, and the topology of the active-site cleft defined the stereochemistry of the cyclization reaction. The homology comparison showed that AhCHS had the highest similarity to the CHS of Anthurium andraeanum, while AhCHI had the highest similarity to the CHI of Paeonia delavayi. This study provides a basis for the functional study of AhCHS and AhCHI and for further study of the plant flavonoid biosynthesis pathway.


Subject(s)
Acyltransferases/genetics , Arisaema/enzymology , Intramolecular Lyases/genetics , Plant Proteins/genetics , Acyltransferases/chemistry , Arisaema/genetics , Cloning, Molecular , Intramolecular Lyases/chemistry , Plant Proteins/chemistry
10.
Database (Oxford) ; 2019, 2019 01 01.
Article in English | MEDLINE | ID: mdl-30809637

ABSTRACT

The detection of microRNA (miRNA) mentions in scientific literature helps researchers find relevant and appropriate literature based on queries formulated using miRNA information. Considering that most published biological studies elaborate on signal transduction pathways or genetic regulatory information in the form of figure captions, the extraction of miRNA from both the main content and the figure captions of a manuscript is useful in aggregate analysis and comparative analysis of the studies published. In this study, we present a statistical principle-based miRNA recognition and normalization method to identify miRNAs and link them to the identifiers in the Rfam database. As one of the core components in the text mining pipeline of the database miRTarBase, the proposed method combined the advantages of previous works relying on pattern, dictionary and supervised learning and provided an integrated solution for the problem of miRNA identification. Furthermore, the knowledge learned from the training data was organized in a human-interpretable manner to explain why the system considers a span of text to be a miRNA mention, and the represented knowledge can be further complemented by domain experts. We studied the ambiguity level of miRNA nomenclature to connect the miRNA mentions to the Rfam database and evaluated the performance of our approach on two datasets: the BioCreative VI Bio-ID corpus and the miRNA interaction corpus, extending the latter with additional Rfam normalization information. Our study highlights, and offers a better understanding of, the challenges associated with miRNA identification and normalization in scientific literature and the research gap that needs to be further explored in prospective studies.


Subject(s)
MicroRNAs/metabolism , Publications , Statistics as Topic , Algorithms , Databases, Genetic , Internet , MicroRNAs/genetics , Molecular Sequence Annotation
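
The abstract mentions pattern-based recognition as one of the ingredients the method builds on. As a rough illustration of that single ingredient only, the Python sketch below finds miRNA-like mentions with a regular expression and reduces each to a family-level key. The pattern, the `family_key` normalization and the example sentence are assumptions made for illustration. They are not the published method, and the keys are not Rfam identifiers.

```python
import re

# Hypothetical pattern for common miRNA spellings such as
# "hsa-miR-155-5p", "miR-21", "miRNA-21" and "let-7a".
MIRNA_PATTERN = re.compile(
    r"\b(?:(?P<species>[a-z]{3,4})-)?"   # optional species prefix, e.g. "hsa-"
    r"(?P<stem>miRNA|miR|mir|let|lin)-?"  # family stem
    r"(?P<number>\d+)(?P<variant>[a-z]?)"  # family number and variant letter
    r"(?:-\d+)?(?:-[35]p|\*)?"            # optional copy number and arm suffix
    r"(?![A-Za-z0-9])"                    # do not stop in the middle of a token
)


def find_mirna_mentions(text: str):
    """Return (start, end, mention) triples for every pattern match."""
    return [(m.start(), m.end(), m.group(0)) for m in MIRNA_PATTERN.finditer(text)]


def family_key(mention: str) -> str:
    """Drop species, variant and arm to get a family-level key (illustrative only)."""
    match = MIRNA_PATTERN.fullmatch(mention)
    if match is None:
        raise ValueError(f"not a recognized miRNA mention: {mention!r}")
    stem = match.group("stem")
    stem = stem if stem in ("let", "lin") else "mir"
    return f"{stem}-{match.group('number')}"


sentence = ("Expression of hsa-miR-155-5p and let-7a was reduced, "
            "whereas miR-21 increased.")
mentions = find_mirna_mentions(sentence)
for start, end, mention in mentions:
    print(start, end, mention, family_key(mention))
```

A pattern like this is brittle on its own, which is consistent with the abstract's point that patterns, dictionaries and supervised learning are combined, and a real normalization step would map each key to a curated database entry instead of a string.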
11.
Database (Oxford) ; 2019, 2019 01 01.
Article in English | MEDLINE | ID: mdl-30689846

ABSTRACT

The Precision Medicine Initiative is a multicenter effort aimed at formulating personalized treatments by leveraging individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs, reporting genetic and molecular interactions that provide the scaffold for the cellular regulatory systems and detailing the influence of genetic variants in these interactions. The goal of the BioCreative VI Precision Medicine Track was to extract this particular type of information, and the track was organized in two tasks: (i) a document triage task, focused on identifying scientific literature containing experimentally verified protein-protein interactions (PPIs) affected by genetic mutations and (ii) a relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When comparing the text-mining system predictions with human annotations, for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when taking homologous genes into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods. Given the level of participation and the individual team results, we find the precision medicine track to be successful in engaging the text-mining research community. In the meantime, the track produced a manually annotated corpus of 5509 PubMed documents developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results of automatically identifying PubMed articles that describe PPIs affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and incorporating the practical benefits of text-mining tools into real-world precision medicine information-related curation.


Subject(s)
Data Mining/methods , Databases, Protein , Mutation , Precision Medicine/methods , Protein Interaction Maps , Software , Computational Biology/methods , Humans , Mutation/genetics , Mutation/physiology , Protein Interaction Mapping , Protein Interaction Maps/genetics , Protein Interaction Maps/physiology