Results 1 - 20 of 44
1.
PLoS Comput Biol ; 17(5): e1008967, 2021 05.
Article in English | MEDLINE | ID: mdl-34043624

ABSTRACT

Antibodies are widely used reagents to test for expression of proteins and other antigens. However, they might not always reliably produce results when they do not specifically bind to the target proteins that their providers designed them for, leading to unreliable research results. While many proposals have been developed to deal with the problem of antibody specificity, it is still challenging to cover the millions of antibodies that are available to researchers. In this study, we investigate the feasibility of automatically generating alerts to users of problematic antibodies by extracting statements about antibody specificity reported in the literature. The extracted alerts can be used to construct an "Antibody Watch" knowledge base containing supporting statements of problematic antibodies. We developed a deep neural network system and tested its performance with a corpus of more than two thousand articles that reported uses of antibodies. We divided the problem into two tasks. Given an input article, the first task is to identify snippets about antibody specificity and classify if the snippets report that any antibody exhibits non-specificity, and thus is problematic. The second task is to link each of these snippets to one or more antibodies mentioned in the snippet. The experimental evaluation shows that our system can accurately perform the classification task with 0.925 weighted F1-score, linking with 0.962 accuracy, and 0.914 weighted F1 when combined to complete the joint task. We leveraged Research Resource Identifiers (RRID) to precisely identify antibodies linked to the extracted specificity snippets. The result shows that it is feasible to construct a reliable knowledge base about problematic antibodies by text mining.


Subject(s)
Antibody Specificity , Data Mining , Animals , Humans , Mice , Neural Networks, Computer
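The abstract above reports a weighted F1-score without defining it. The sketch below assumes the common definition (per-class F1 averaged with weights proportional to each class's number of true examples); the labels and predictions are invented for illustration and are not data from the study.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 (assumed definition)."""
    support = Counter(y_true)          # number of true examples per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n / total        # weight by class support
    return score


# Hypothetical snippet labels: "ns" = reports non-specificity, "ok" = does not.
truth = ["ns", "ns", "ok", "ok", "ok"]
preds = ["ns", "ok", "ok", "ok", "ok"]
print(round(weighted_f1(truth, preds), 3))
```

Weighting by support keeps a rare class from dominating the average, which matters when problematic-antibody snippets are a minority of all snippets.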
2.
J Biomed Inform ; 82: 63-69, 2018 06.
Article in English | MEDLINE | ID: mdl-29679685

ABSTRACT

BACKGROUND: Big clinical note datasets found in electronic health records (EHR) present substantial opportunities to train accurate statistical models that identify patterns in patient diagnosis and outcomes. However, near-to-exact duplication in note texts is a common issue in many clinical note datasets. We aimed to use a scalable algorithm to de-duplicate notes and further characterize the sources of duplication. METHODS: We use an approximation algorithm to minimize pairwise comparisons consisting of three phases: (1) Minhashing with Locality Sensitive Hashing; (2) a clustering method using tree-structured disjoint sets; and (3) classification of near-duplicates (exact copies, common machine output notes, or similar notes) via pairwise comparison of notes in each cluster. We use the Jaccard Similarity (JS) to measure similarity between two documents. We analyzed two big clinical note datasets: our institutional dataset and MIMIC-III. RESULTS: There were 1,528,940 notes analyzed from our institution. The de-duplication algorithm completed in 36.3 h. When the JS threshold was set at 0.7, the total number of clusters was 82,371 (total notes = 304,418). Among all JS thresholds, no clusters contained pairs of notes that were incorrectly clustered. When the JS threshold was set at 0.9 or 1.0, the de-duplication algorithm captured 100% of all random pairs with their JS at least as high as the set thresholds from the validation set. Similar performance was noted when analyzing the MIMIC-III dataset. CONCLUSIONS: We showed that among the EHR from our institution and from the publicly-available MIMIC-III dataset, there were a significant number of near-to-exact duplicated notes.


Subject(s)
Data Collection , Electronic Health Records , Medical Informatics/methods , Algorithms , Cluster Analysis , Computers , Databases, Factual , Datasets as Topic , Humans , Machine Learning , Natural Language Processing , Obesity, Morbid/diagnosis , Obesity, Morbid/epidemiology , Pattern Recognition, Automated
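The three phases named in the abstract above (MinHash with locality-sensitive hashing, disjoint-set clustering, and pairwise Jaccard comparison) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the word-shingle size, the number of hash functions, the band count, and all function names are assumptions, and for brevity the Jaccard check is applied before merging candidates rather than within finished clusters.

```python
import hashlib
from itertools import combinations


def shingles(text, k=3):
    """Set of k-word shingles (k is an assumed parameter)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items, num_hashes=64):
    """One minimum per seeded hash function approximates Jaccard similarity."""
    return tuple(
        min(int.from_bytes(
            hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in items)
        for seed in range(num_hashes))


class DisjointSet:
    """Tree-structured disjoint sets with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def cluster_near_duplicates(notes, threshold=0.7, num_hashes=64, bands=16):
    rows = num_hashes // bands
    sets = [shingles(n) for n in notes]
    sigs = [minhash_signature(s, num_hashes) for s in sets]
    buckets = {}
    for idx, sig in enumerate(sigs):            # LSH: notes sharing a band collide
        for b in range(bands):
            buckets.setdefault((b, sig[b * rows:(b + 1) * rows]), []).append(idx)
    ds = DisjointSet(len(notes))
    for members in buckets.values():            # only candidate pairs are compared
        for i, j in combinations(members, 2):
            if jaccard(sets[i], sets[j]) >= threshold:
                ds.union(i, j)
    clusters = {}
    for idx in range(len(notes)):
        clusters.setdefault(ds.find(idx), []).append(idx)
    return [c for c in clusters.values() if len(c) > 1]
```

The banding step is what avoids comparing all pairs: only notes that agree on every row of at least one band are ever passed to the exact Jaccard check.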
3.
BMC Bioinformatics ; 17 Suppl 1: 1, 2016 Jan 11.
Article in English | MEDLINE | ID: mdl-26817711

ABSTRACT

BACKGROUND: Numerous publicly available biomedical databases derive data by curating from literatures. The curated data can be useful as training examples for information extraction, but curated data usually lack the exact mentions and their locations in the text required for supervised machine learning. This paper describes a general approach to information extraction using curated data as training examples. The idea is to formulate the problem as cost-sensitive learning from noisy labels, where the cost is estimated by a committee of weak classifiers that consider both curated data and the text. RESULTS: We test the idea on two information extraction tasks of Genome-Wide Association Studies (GWAS). The first task is to extract target phenotypes (diseases or traits) of a study and the second is to extract ethnicity backgrounds of study subjects for different stages (initial or replication). Experimental results show that our approach can achieve 87% of Precision-at-2 (P@2) for disease/trait extraction, and 0.83 of F1-Score for stage-ethnicity extraction, both outperforming their cost-insensitive baseline counterparts. CONCLUSIONS: The results show that curated biomedical databases can potentially be reused as training examples to train information extractors without expert annotation or refinement, opening an unprecedented opportunity of using "big data" in biomedical text mining.


Subject(s)
Abstracting and Indexing/methods , Data Curation , Data Mining/methods , Databases, Factual , Disease/genetics , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Risk Assessment
4.
Mol Cell Proteomics ; 12(5): 1335-49, 2013 May.
Article in English | MEDLINE | ID: mdl-23397142

ABSTRACT

Deciphering the network of signaling pathways in cancer via protein-protein interactions (PPIs) at the cellular level is a promising approach but remains incomplete. We used an in situ proximity ligation assay to identify and quantify 67 endogenous PPIs among 21 interlinked pathways in two hepatocellular carcinoma (HCC) cells, Huh7 (minimally migratory cells) and Mahlavu (highly migratory cells). We then applied a differential network biology analysis and determined that the novel interaction, CRKL-FLT1, has a high centrality ranking, and the expression of this interaction is strongly correlated with the migratory ability of HCC and other cancer cell lines. Knockdown of CRKL and FLT1 in HCC cells leads to a decrease in cell migration via ERK signaling and the epithelial-mesenchymal transition process. Our immunohistochemical analysis shows high expression levels of the CRKL and CRKL-FLT1 pair that strongly correlate with reduced disease-free and overall survival in HCC patient samples, and a multivariate analysis further established CRKL and the CRKL-FLT1 as novel prognosis markers. This study demonstrated that functional exploration of a disease network with interlinked pathways via PPIs can be used to discover novel biomarkers.


Subject(s)
Adaptor Proteins, Signal Transducing/metabolism , Biomarkers, Tumor/metabolism , Carcinoma, Hepatocellular/metabolism , Liver Neoplasms/metabolism , Nuclear Proteins/metabolism , Protein Interaction Maps , Adult , Aged , Aged, 80 and over , Carcinoma, Hepatocellular/diagnosis , Carcinoma, Hepatocellular/mortality , Disease-Free Survival , HEK293 Cells , Hep G2 Cells , Humans , Kaplan-Meier Estimate , Liver Neoplasms/diagnosis , Liver Neoplasms/mortality , Middle Aged , Prognosis , Proportional Hazards Models , Retrospective Studies , Signal Transduction , Tissue Array Analysis , Vascular Endothelial Growth Factor Receptor-1/metabolism , Young Adult
5.
Bioinformatics ; 28(12): i106-14, 2012 Jun 15.
Article in English | MEDLINE | ID: mdl-22689749

ABSTRACT

MOTIVATION: The recent development of high-throughput drug profiling (high content screening or HCS) provides a large amount of quantitative multidimensional data. Despite its potential, it poses several challenges for academia and industry analysts alike. This is especially true for ranking the effectiveness of several drugs from many thousands of images directly. This paper introduces, for the first time, a new framework for automatically ordering the performance of drugs, called fractional adjusted bi-partitional score (FABS). This general strategy takes advantage of graph-based formulations and solutions and avoids many shortfalls of traditionally used methods in practice. We experimented with the FABS framework by implementing it with a specific algorithm, a variant of normalized cut called normalized cut prime (FABS-NC′), producing a ranking of drugs. This algorithm is known to run in polynomial time and therefore can scale well in high-throughput applications. RESULTS: We compare the performance of FABS-NC′ to other methods that could be used for drug ranking. We devise two variants of the FABS algorithm: FABS-SVM, which utilizes a support vector machine (SVM) as a black box, and FABS-Spectral, which utilizes the eigenvector technique (spectral) as a black box. We also compare the performance of FABS-NC′ to three other methods that have been previously considered: center ranking (Center), PCA ranking (PCA), and graph transition energy method (GTEM). The conclusion is encouraging: FABS-NC′ consistently outperforms all these five alternatives. FABS-SVM has the second best performance among these six methods, but is far behind FABS-NC′: In some cases FABS-NC′ produces over half correctly predicted ranking experiment trials than FABS-SVM. AVAILABILITY: The system and data for the evaluation reported here will be made available upon request to the authors after this manuscript is accepted for publication.


Subject(s)
Drug Discovery/methods , Pharmaceutical Preparations/analysis , Support Vector Machine , Animals , CHO Cells , Cricetinae
6.
Genomics ; 100(3): 141-8, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22735742

ABSTRACT

Recent genome-wide surveys on ncRNA have revealed that a substantial fraction of miRNA genes is likely to form clusters. However, the evolutionary and biological function implications of clustered miRNAs are still elusive. After identifying clustered miRNA genes under different maximum inter-miRNA distances (MIDs), this study intended to reveal evolution conservation patterns among these clustered miRNA genes in metazoan species using a computation algorithm. As examples, a total of 15-35% of known and predicted miRNA genes in nine selected species constitute clusters under the MIDs ranging from 1kb to 50kb. Intriguingly, 33 out of 37 metazoan miRNA clusters in 56 metazoan genomes are co-conserved with their up/down-stream adjacent protein-coding genes. Meanwhile, a co-expression pattern of miR-1 and miR-133a in the mir-133-1 cluster has been experimentally demonstrated. Therefore, the MetaMirClust database provides a useful bioinformatic resource for biologists to facilitate the advanced interrogations on the composition of miRNA clusters and their evolution patterns.


Subject(s)
Data Mining/methods , MicroRNAs/analysis , Multigene Family , Software , Algorithms , Animals , Base Sequence , Computational Biology/methods , Conserved Sequence , Databases, Genetic , Evolution, Molecular , Genes, rRNA , Hep G2 Cells , Humans , MicroRNAs/genetics , Reverse Transcriptase Polymerase Chain Reaction , Ribosomes/genetics , Sequence Homology, Nucleic Acid , Transcriptome
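Grouping miRNA genes under a maximum inter-miRNA distance (MID), as described in the abstract above, can be illustrated with a simple single-pass grouping over sorted coordinates. This is only a sketch under stated assumptions: distance is measured between start coordinates on the same chromosome, strand is ignored, and the gene names and positions are hypothetical rather than taken from MetaMirClust.

```python
from collections import defaultdict


def cluster_by_mid(genes, mid):
    """Group genes whose consecutive start positions lie within `mid` bp.

    genes: iterable of (name, chromosome, start) tuples (hypothetical layout).
    Returns a list of (chromosome, [names]) for groups of two or more genes.
    """
    by_chrom = defaultdict(list)
    for name, chrom, start in genes:
        by_chrom[chrom].append((start, name))

    clusters = []
    for chrom, items in by_chrom.items():
        items.sort()                          # order genes along the chromosome
        current = [items[0]]
        for prev, cur in zip(items, items[1:]):
            if cur[0] - prev[0] <= mid:       # neighbour is close enough: extend
                current.append(cur)
            else:                             # gap too large: close the group
                if len(current) > 1:
                    clusters.append((chrom, [n for _, n in current]))
                current = [cur]
        if len(current) > 1:
            clusters.append((chrom, [n for _, n in current]))
    return clusters


# Hypothetical coordinates for illustration only.
toy_genes = [
    ("mir-a", "chr1", 1_000),
    ("mir-b", "chr1", 3_000),
    ("mir-c", "chr1", 60_000),
    ("mir-d", "chr2", 500),
]
print(cluster_by_mid(toy_genes, mid=10_000))
```

Rerunning the same function with different `mid` values (for example 1 kb up to 50 kb) shows how the fraction of clustered genes depends on the chosen threshold.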
7.
Mil Med ; 188(Suppl 6): 590-597, 2023 11 08.
Article in English | MEDLINE | ID: mdl-37948284

ABSTRACT

INTRODUCTION: Foot and ankle fractures are the most common military health problem. Automated diagnosis can save time and personnel. It is crucial for an automated system not only to distinguish fractures from normal healthy cases, but also to be robust against the presence of other orthopedic pathologies. Artificial intelligence (AI) deep learning has been shown to be promising. Previously, we developed HAMIL-Net to automatically detect upper extremity orthopedic injuries. In this research, we investigated the performance of HAMIL-Net for detecting foot and ankle fractures in the presence of other abnormalities. MATERIALS AND METHODS: HAMIL-Net is a novel deep neural network consisting of a hierarchical attention layer followed by a multiple-instance learning layer. The design allows it to deal with imaging studies with multiple views. We used 148K musculoskeletal imaging studies for 51K Veterans at VA San Diego from the past 20 years to create datasets for this research. We annotated each study with a semi-automated pipeline that leverages radiology reports written by board-certified radiologists and extracts findings with a natural language processing tool, and we manually validated the annotations. RESULTS: HAMIL-Net can be trained with study-level, multiple-view examples, and detects foot and ankle fractures with a 0.87 area under the receiver operating characteristic curve, but the performance dropped when tested on cases including other abnormalities. By integrating a fracture-specialized model with one that detects a broad range of abnormalities, HAMIL-Net's accuracy in detecting any abnormality improved from 0.53 to 0.77 and its F-score from 0.46 to 0.86. We also report HAMIL-Net's performance under different study types, including for young (age 18-35) patients. CONCLUSIONS: Automated fracture detection is promising, but for it to be deployed in clinical use, the presence of other abnormalities must be considered to deliver its full benefit. Our results with HAMIL-Net showed that considering other abnormalities improved fracture detection and allowed for incidental findings of other musculoskeletal abnormalities pertinent to or superimposed on fractures.


Subject(s)
Ankle Fractures , Artificial Intelligence , Humans , Adolescent , Young Adult , Adult , Neural Networks, Computer , Retrospective Studies
8.
PLoS Comput Biol ; 7(10): e1002212, 2011 Oct.
Article in English | MEDLINE | ID: mdl-21998575

ABSTRACT

Morphological dynamics of mitochondria is associated with key cellular processes related to aging and neuronal degenerative diseases, but the lack of standard quantification of mitochondrial morphology impedes systematic investigation. This paper presents an automated system for the quantification and classification of mitochondrial morphology. We discovered six morphological subtypes of mitochondria for objective quantification of mitochondrial morphology. These six subtypes are small globules, swollen globules, straight tubules, twisted tubules, branched tubules and loops. The subtyping was derived by applying consensus clustering to a huge collection of more than 200 thousand mitochondrial images extracted from 1422 micrographs of Chinese hamster ovary (CHO) cells treated with different drugs, and was validated by evidence of functional similarity reported in the literature. Quantitative statistics of subtype compositions in cells is useful for correlating drug response and mitochondrial dynamics. Combining the quantitative results with our biochemical studies about the effects of squamocin on CHO cells reveals new roles of Caspases in the regulatory mechanisms of mitochondrial dynamics. This system is not only of value to the mitochondrial field, but also applicable to the investigation of other subcellular organelle morphology.


Subject(s)
Caspases/metabolism , Mitochondria/enzymology , Mitochondria/ultrastructure , Animals , CHO Cells , Caspase Inhibitors , Computational Biology , Cricetinae , Cricetulus , Cysteine Proteinase Inhibitors/pharmacology , Dimethyl Sulfoxide/pharmacology , Furans/pharmacology , Lactones/pharmacology , Mitochondria/classification , Mitochondria/drug effects , Models, Biological , Oligopeptides/pharmacology , Pattern Recognition, Automated/statistics & numerical data
9.
Genomics ; 98(6): 453-9, 2011 Dec.
Article in English | MEDLINE | ID: mdl-21930198

ABSTRACT

Rabbit (Oryctolagus cuniculus) is the only lagomorph animal of which the genome has been sequenced. Establishing a rabbit miRNA resource will benefit subsequent functional genomic studies in mammals. We have generated small RNA sequence reads with SOLiD and Solexa platforms to identify rabbit miRNAs, where we identified 464 pre-miRNAs and 886 mature miRNAs. The brain and heart miRNA libraries were used for further in-depth analysis of isomiR distributions. There are several intriguing findings. First, several rabbit pre-miRNAs form highly conserved clusters. Second, there is a preference in selecting one strand as mature miRNA, resulting in an arm selection preference. Third, we analyzed the isomiR expression and validated the expression of isomiR types in different rabbit tissues. Moreover, we further performed additional small RNA libraries and defined miRNAs differentially expressed between brain and heart. We conclude also that isomiR distribution profiles could vary between brain and heart tissues.


Subject(s)
MicroRNAs/genetics , MicroRNAs/metabolism , Rabbits/genetics , Amino Acid Sequence , Animals , Gene Expression Profiling , Gene Library , Molecular Sequence Data , Multigene Family , Sequence Analysis, RNA
10.
Radiol Artif Intell ; 4(4): e210258, 2022 Jul.
Article in English | MEDLINE | ID: mdl-35923376

ABSTRACT

Purpose: To investigate if tailoring a transformer-based language model to radiology is beneficial for radiology natural language processing (NLP) applications. Materials and Methods: This retrospective study presents a family of bidirectional encoder representations from transformers (BERT)-based language models adapted for radiology, named RadBERT. Transformers were pretrained with either 2.16 or 4.42 million radiology reports from U.S. Department of Veterans Affairs health care systems nationwide on top of four different initializations (BERT-base, Clinical-BERT, robustly optimized BERT pretraining approach [RoBERTa], and BioMed-RoBERTa) to create six variants of RadBERT. Each variant was fine-tuned for three representative NLP tasks in radiology: (a) abnormal sentence classification: models classified sentences in radiology reports as reporting abnormal or normal findings; (b) report coding: models assigned a diagnostic code to a given radiology report for five coding systems; and (c) report summarization: given the findings section of a radiology report, models selected key sentences that summarized the findings. Model performance was compared by bootstrap resampling with five intensively studied transformer language models as baselines: BERT-base, BioBERT, Clinical-BERT, BlueBERT, and BioMed-RoBERTa. Results: For abnormal sentence classification, all models performed well (accuracies above 97.5 and F1 scores above 95.0). RadBERT variants achieved significantly higher scores than corresponding baselines when given only 10% or less of 12 458 annotated training sentences. For report coding, all variants outperformed baselines significantly for all five coding systems. The variant RadBERT-BioMed-RoBERTa performed the best among all models for report summarization, achieving a Recall-Oriented Understudy for Gisting Evaluation-1 score of 16.18 compared with 15.27 by the corresponding baseline (BioMed-RoBERTa, P < .004). 
Conclusion: Transformer-based language models tailored to radiology had improved performance of radiology NLP tasks compared with baseline transformer language models. Keywords: Translation, Unsupervised Learning, Transfer Learning, Neural Networks, Informatics. Supplemental material is available for this article. © RSNA, 2022. See also commentary by Wiggins and Tejani in this issue.

11.
BMC Bioinformatics ; 12 Suppl 8: S6, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22152021

ABSTRACT

BACKGROUND: Previously, gene normalization (GN) systems have mostly focused on disambiguation using contextual information. An effective gene mention tagger was deemed unnecessary because the subsequent steps would filter out false positives, so high recall was considered sufficient. However, unlike similar tasks in the past BioCreative challenges, the BioCreative III GN task is particularly challenging because it is not species-specific. When full-length articles must be processed, an ineffective gene mention tagger may produce a huge number of ambiguous false positives that overwhelm subsequent filtering steps while still missing many true positives. RESULTS: We present our GN system, which participated in the BioCreative III GN task. Our system applies a typical 2-stage approach to GN but features a soft tagging gene mention tagger that generates a set of overlapping gene mention variants with nearly perfect recall. The overlapping gene mention variants increase the chance of a precise match in the dictionary and alleviate the need for disambiguation. Our GN system achieved a precision of 0.9 (F-score 0.63) on the BioCreative III GN test corpus with the silver annotation of 507 articles. Its TAP-k scores are competitive with the best results among all participants. CONCLUSIONS: We show that despite the lack of clever disambiguation in our gene normalization system, effective soft tagging of gene mention variants can indeed contribute to performance in cross-species and full-text gene normalization.


Subject(s)
Data Mining , Genes , Species Specificity , Data Mining/methods , Natural Language Processing , Periodicals as Topic , Software , Terminology as Topic
13.
BMC Bioinformatics ; 12 Suppl 1: S9, 2011 Feb 15.
Article in English | MEDLINE | ID: mdl-21342592

ABSTRACT

BACKGROUND: Un-MAppable Reads Solution (UMARS) is a user-friendly web service focusing on retrieving valuable information from sequence reads that cannot be mapped back to reference genomes. Recently, next-generation sequencing (NGS) technology has emerged as a powerful tool for generating high-throughput sequencing data and has been applied to many kinds of biological research. In a typical analysis, adaptor-trimmed NGS reads were first mapped back to reference sequences, including genomes or transcripts. However, a fraction of NGS reads failed to be mapped back to the reference sequences. Such un-mappable reads are usually imputed to sequencing errors and discarded without further consideration. METHODS: We are investigating possible biological relevance and possible sources of un-mappable reads. Therefore, we developed UMARS to scan for virus genomic fragments or exon-exon junctions of novel alternative splicing isoforms from un-mappable reads. For mapping un-mappable reads, we first collected viral genomes and sequences of exon-exon junctions. Then, we constructed UMARS pipeline as an automatic alignment interface. RESULTS: By demonstrating the results of two UMARS alignment cases, we show the applicability of UMARS. We first showed that the expected EBV genomic fragments can be detected by UMARS. Second, we also detected exon-exon junctions from un-mappable reads. Further experimental validation also ensured the authenticity of the UMARS pipeline. The UMARS service is freely available to the academic community and can be accessed via http://musk.ibms.sinica.edu.tw/UMARS/. CONCLUSIONS: In this study, we have shown that some un-mappable reads are not caused by sequencing errors. They can originate from viral infection or transcript splicing. Our UMARS pipeline provides another way to examine and recycle the un-mappable reads that are commonly discarded as garbage.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Software , Chromosome Mapping , DNA, Complementary/genetics , Exons , Genome, Viral , RNA Splicing , Sequence Alignment , User-Computer Interface
14.
BMC Bioinformatics ; 12 Suppl 8: S2, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151901

ABSTRACT

BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. 
By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.


Subject(s)
Algorithms , Data Mining/methods , Genes , Animals , Data Mining/standards , Humans , National Library of Medicine (U.S.) , Periodicals as Topic , United States
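The relative improvements quoted in the abstract above (12.4%, 21.8%, and 26.6%) follow directly from the reported TAP-k scores. The short check below uses only the numbers given in the abstract; the function and variable names are illustrative.

```python
# TAP-k scores on the gold standard, as reported in the abstract.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(new, old):
    """Percentage gain of `new` over `old`."""
    return (new - old) / old * 100


for k in sorted(best_team):
    gain = relative_improvement(composite[k], best_team[k])
    print(f"k={k}: {gain:.1f}%")
```

The improvements are relative to the best single team at each k, not absolute differences in TAP-k.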
15.
Bioinformatics ; 26(12): i29-37, 2010 Jun 15.
Article in English | MEDLINE | ID: mdl-20529919

ABSTRACT

MOTIVATION: High-throughput image-based assay technologies can rapidly produce a large number of cell images for drug screening, but data analysis is still a major bottleneck that limits their utility. Quantifying a wide variety of morphological differences observed in cell images under different drug influences is still a challenging task because the result can be highly sensitive to sampling and noise. RESULTS: We propose a graph-based approach to cell image analysis. We define graph transition energy to quantify morphological differences between image sets. A spectral graph theoretic regularization is applied to transform the feature space based on training examples of extremely different images to calibrate the quantification. Calibration is essential for a practical quantification method because we need to measure the confidence of the quantification. We applied our method to quantify the degree of partial fragmentation of mitochondria in collections of fluorescent cell images. We show that with transformation, the quantification can be more accurate and sensitive than that without transformation. We also show that our method outperforms competing methods, including neighbourhood component analysis and the multi-variate drug profiling method by Loo et al. We illustrate its utility with a study of Annonaceous acetogenins, a family of compounds with drug potential. Our result reveals that squamocin induces more fragmented mitochondria than muricin A. AVAILABILITY: Mitochondrial cell images, their corresponding feature sets (SSLF and WSLF) and the source code of our proposed method are available at http://aiia.iis.sinica.edu.tw/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Cellular Structures/ultrastructure , Computational Biology/methods , Acetogenins/metabolism , Calibration , Image Interpretation, Computer-Assisted/methods , Mitochondria/ultrastructure
16.
Genomics ; 96(1): 1-9, 2010 Jul.
Article in English | MEDLINE | ID: mdl-20347954

ABSTRACT

MicroRNAs (miRNAs) are endogenous non-protein-coding RNAs of approximately 22 nucleotides. Thousands of miRNA genes have been identified (computationally and/or experimentally) in a variety of organisms, which suggests that miRNA genes have been widely shared and distributed among species. Here, we first used unique miRNA sequence patterns to scan the genome sequences of 56 bilaterian animal species to locate candidate miRNAs. The regions surrounding these candidate miRNAs were then extracted for folding and for calculating the features of their secondary structure. Using a support vector machine (SVM) as a classifier combined with these features, we identified an additional 13,091 orthologous or paralogous candidate pre-miRNAs, as well as their corresponding candidate mature miRNAs. Stem-loop RT-PCR and deep sequencing methods were used to experimentally validate the prediction results in human, medaka and rabbit. Our prediction pipeline allows the rapid and effective discovery of homologous miRNAs in a large number of genomes.


Subject(s)
Genome , MicroRNAs/classification , MicroRNAs/genetics , Sequence Analysis, RNA , Software Design , Algorithms , Animals , Cell Line, Tumor , Computational Biology/statistics & numerical data , DNA, Complementary , Databases, Genetic , Female , Genomics , Humans , Inverted Repeat Sequences , Male , Models, Statistical , Molecular Sequence Data , Nucleic Acid Conformation , Oryzias , RNA, Messenger/genetics , Rabbits , Sequence Alignment , Species Specificity
17.
Database (Oxford) ; 2021, 2021 04 29.
Article in English | MEDLINE | ID: mdl-33914028

ABSTRACT

High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information's Sequence Read Archive (SRA) are missing metadata across several categories. In an effort to improve the metadata coverage of these samples, we leveraged almost 44 million attribute-value pairs from SRA BioSample to train a scalable, recurrent neural network that predicts missing metadata via named entity recognition (NER). The network was first trained to classify short text phrases according to 11 metadata categories and achieved an overall accuracy and area under the receiver operating characteristic curve of 85.2% and 0.977, respectively. We then applied our classifier to predict 11 metadata categories from the longer TITLE attribute of samples, evaluating performance on a set of samples withheld from model training. Prediction accuracies were high when extracting sample Genus/Species (94.85%), Condition/Disease (95.65%) and Strain (82.03%) from TITLEs, with lower accuracies and lack of predictions for other categories highlighting multiple issues with the current metadata annotations in BioSample. These results indicate the utility of recurrent neural networks for NER-based metadata prediction and the potential for models such as the one presented here to increase metadata coverage in BioSample while minimizing the need for manual curation. Database URL: https://github.com/cartercompbio/PredictMEE.


Subject(s)
Deep Learning , Metadata , High-Throughput Nucleotide Sequencing , Reproducibility of Results , Software
18.
Nat Med ; 27(10): 1735-1743, 2021 10.
Article in English | MEDLINE | ID: mdl-34526699

ABSTRACT

Federated learning (FL) is a method used for training artificial intelligence models with data from multiple sources while maintaining data anonymity, thus removing many barriers to data sharing. Here we used data from 20 institutes across the globe to train a FL model, called EXAM (electronic medical record (EMR) chest X-ray AI model), that predicts the future oxygen requirements of symptomatic patients with COVID-19 using inputs of vital signs, laboratory data and chest X-rays. EXAM achieved an average area under the curve (AUC) >0.92 for predicting outcomes at 24 and 72 h from the time of initial presentation to the emergency room, and it provided 16% improvement in average AUC measured across all participating sites and an average increase in generalizability of 38% when compared with models trained at a single site using that site's data. For prediction of mechanical ventilation treatment or death at 24 h at the largest independent test site, EXAM achieved a sensitivity of 0.950 and specificity of 0.882. In this study, FL facilitated rapid data science collaboration without data exchange and generated a model that generalized across heterogeneous, unharmonized datasets for prediction of clinical outcomes in patients with COVID-19, setting the stage for the broader use of FL in healthcare.
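
The abstract does not say which aggregation rule EXAM used, so the following is only a generic sketch of sample-weighted federated averaging, one common way a coordinating server can combine models trained at separate sites without seeing their data. The function name `federated_average`, the `Weights` type, and the toy numbers are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: each model is a dict of named parameter vectors.
Weights = Dict[str, List[float]]


def federated_average(site_updates: List[Tuple[Weights, int]]) -> Weights:
    """Combine per-site weights, weighting each site by its local sample count.

    Only the weights and sample counts leave each site; raw records do not.
    """
    if not site_updates:
        raise ValueError("need at least one site update")
    total = sum(n for _, n in site_updates)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    # Start from zeros with the same shape as the first site's parameters.
    first_weights, _ = site_updates[0]
    averaged: Weights = {
        name: [0.0] * len(values) for name, values in first_weights.items()
    }

    for weights, n in site_updates:
        share = n / total  # this site's contribution to the global model
        for name, values in weights.items():
            for i, value in enumerate(values):
                averaged[name][i] += share * value
    return averaged


if __name__ == "__main__":
    # Two toy sites with 100 and 300 local samples (made-up numbers).
    site_a = ({"w": [1.0, 2.0]}, 100)
    site_b = ({"w": [3.0, 4.0]}, 300)
    print(federated_average([site_a, site_b]))
```

In a full training loop the server would send the averaged weights back to every site, each site would train further on its own data, and the cycle would repeat for several rounds.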


Subject(s)
COVID-19/physiopathology , Machine Learning , Health Care Outcome Assessment , COVID-19/therapy , COVID-19/virology , Electronic Health Records , Humans , Prognosis , SARS-CoV-2/isolation & purification
19.
Res Sq ; 2021 Jan 08.
Article in English | MEDLINE | ID: mdl-33442676

ABSTRACT

'Federated Learning' (FL) is a method to train Artificial Intelligence (AI) models with data from multiple sources while maintaining the anonymity of the data, thus removing many barriers to data sharing. During the SARS-CoV-2 pandemic, 20 institutes collaborated on a healthcare FL study to predict the future oxygen requirements of infected patients using inputs of vital signs, laboratory data, and chest x-rays, constituting the "EXAM" (EMR CXR AI Model) model. EXAM achieved an average Area Under the Curve (AUC) of over 0.92, an average improvement of 16%, and a 38% increase in generalisability over local models. The FL paradigm was successfully applied to facilitate a rapid data science collaboration without data exchange, resulting in a model that generalised across heterogeneous, unharmonized datasets. This provided the broader healthcare community with a validated model to respond to COVID-19 challenges and set the stage for broader use of FL in healthcare.

20.
BMC Bioinformatics ; 11 Suppl 1: S21, 2010 Jan 18.
Article in English | MEDLINE | ID: mdl-20122193

ABSTRACT

BACKGROUND: Recombinant protein production is a useful biotechnology to produce a large quantity of highly soluble proteins. Currently, the most widely used production system is to fuse a target protein into different vectors in Escherichia coli (E. coli). However, the production efficacy of different vectors varies for different target proteins. Trial-and-error is still the common practice to find out the efficacy of a vector for a given target protein. Previous studies are limited in that they assumed that proteins would be over-expressed and focused only on the solubility of expressed proteins. In fact, many pairings of vectors and proteins result in no expression. RESULTS: In this study, we applied machine learning to train prediction models to predict whether a vector-protein pairing will express or not express in E. coli. For expressed cases, the models further predict whether the expressed proteins would be soluble. We collected a set of real cases from the clients of our recombinant protein production core facility, where six different vectors were designed and studied. This set of cases is used in both training and evaluation of our models. We evaluate three different models based on support vector machines (SVM) and their ensembles. Unlike many previous works, these models consider the sequence of the target protein as well as the sequence of the whole fusion vector as the features. We show that a model that classifies a case into one of the three classes (no expression, inclusion body and soluble) outperforms a model that considers the nested structure of the three classes, while a model that can take advantage of the hierarchical structure of the three classes performs slightly worse than, but comparably to, the best model. Meanwhile, compared to previous works, we show that our best method still achieves the highest prediction accuracy. Lastly, we briefly present two methods to use the trained model in the design of recombinant protein production systems to improve the chance of highly soluble protein production. CONCLUSION: In this paper, we show that a machine learning approach to the prediction of the efficacy of a vector for a target protein in a recombinant protein production system is promising and may complement traditional knowledge-driven study of the efficacy. We will release our program to share with other labs in the public domain when this paper is published.


Subject(s)
Artificial Intelligence , Genetic Vectors/genetics , Recombinant Proteins/genetics , Protein Databases , Escherichia coli/genetics , Escherichia coli/metabolism , Genetic Vectors/metabolism , Recombinant Proteins/metabolism , Solubility