Results 1 - 20 of 44
1.
PLoS Comput Biol ; 17(5): e1008967, 2021 05.
Article in English | MEDLINE | ID: mdl-34043624

ABSTRACT

Antibodies are widely used reagents to test for expression of proteins and other antigens. However, they might not always reliably produce results when they do not specifically bind to the target proteins that their providers designed them for, leading to unreliable research results. While many proposals have been developed to deal with the problem of antibody specificity, it is still challenging to cover the millions of antibodies that are available to researchers. In this study, we investigate the feasibility of automatically generating alerts to users of problematic antibodies by extracting statements about antibody specificity reported in the literature. The extracted alerts can be used to construct an "Antibody Watch" knowledge base containing supporting statements of problematic antibodies. We developed a deep neural network system and tested its performance with a corpus of more than two thousand articles that reported uses of antibodies. We divided the problem into two tasks. Given an input article, the first task is to identify snippets about antibody specificity and classify if the snippets report that any antibody exhibits non-specificity, and thus is problematic. The second task is to link each of these snippets to one or more antibodies mentioned in the snippet. The experimental evaluation shows that our system can accurately perform the classification task with 0.925 weighted F1-score, linking with 0.962 accuracy, and 0.914 weighted F1 when combined to complete the joint task. We leveraged Research Resource Identifiers (RRID) to precisely identify antibodies linked to the extracted specificity snippets. The result shows that it is feasible to construct a reliable knowledge base about problematic antibodies by text mining.


Subjects
Antibody Specificity , Data Mining , Animals , Humans , Mice , Neural Networks, Computer
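
The abstract above reports weighted F1-scores. As a minimal sketch of how such a score is usually computed (per-class F1 averaged with weights proportional to each class's support), the function below may help; the name `weighted_f1` and the toy labels are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its frequency in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        # True positives, false positives and false negatives for this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Toy labels (hypothetical): "a" and "b" stand in for two snippet classes.
example = weighted_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
```

Weighting by support keeps a rare class from dominating the average, which matters when one class of snippets is much less frequent than the others.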
2.
J Biomed Inform ; 82: 63-69, 2018 06.
Article in English | MEDLINE | ID: mdl-29679685

ABSTRACT

BACKGROUND: Big clinical note datasets found in electronic health records (EHR) present substantial opportunities to train accurate statistical models that identify patterns in patient diagnosis and outcomes. However, near-to-exact duplication in note texts is a common issue in many clinical note datasets. We aimed to use a scalable algorithm to de-duplicate notes and further characterize the sources of duplication. METHODS: We use an approximation algorithm to minimize pairwise comparisons consisting of three phases: (1) Minhashing with Locality Sensitive Hashing; (2) a clustering method using tree-structured disjoint sets; and (3) classification of near-duplicates (exact copies, common machine output notes, or similar notes) via pairwise comparison of notes in each cluster. We use the Jaccard Similarity (JS) to measure similarity between two documents. We analyzed two big clinical note datasets: our institutional dataset and MIMIC-III. RESULTS: There were 1,528,940 notes analyzed from our institution. The de-duplication algorithm completed in 36.3 h. When the JS threshold was set at 0.7, the total number of clusters was 82,371 (total notes = 304,418). Among all JS thresholds, no clusters contained pairs of notes that were incorrectly clustered. When the JS threshold was set at 0.9 or 1.0, the de-duplication algorithm captured 100% of all random pairs with their JS at least as high as the set thresholds from the validation set. Similar performance was noted when analyzing the MIMIC-III dataset. CONCLUSIONS: We showed that among the EHR from our institution and from the publicly-available MIMIC-III dataset, there were a significant number of near-to-exact duplicated notes.


Subjects
Data Collection , Electronic Health Records , Medical Informatics/methods , Algorithms , Cluster Analysis , Computers , Databases, Factual , Datasets as Topic , Humans , Machine Learning , Natural Language Processing , Obesity, Morbid/diagnosis , Obesity, Morbid/epidemiology , Pattern Recognition, Automated
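
The abstract above combines Jaccard Similarity with disjoint-set clustering. The sketch below illustrates only those two ideas on a tiny scale. It compares every pair of notes directly, whereas the paper uses Minhashing with Locality Sensitive Hashing specifically to avoid that cost. The word-shingle size, the function names, and the sample notes are all assumptions made for illustration.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a note (k=3 is an arbitrary choice)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard Similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class DisjointSet:
    """Array-based disjoint-set forest with path halving."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx


def cluster_notes(notes, threshold=0.7):
    """Group note indices whose pairwise Jaccard Similarity meets the threshold."""
    sets = [shingles(n) for n in notes]
    ds = DisjointSet(len(notes))
    for i, j in combinations(range(len(notes)), 2):  # brute force, O(n^2)
        if jaccard(sets[i], sets[j]) >= threshold:
            ds.union(i, j)
    clusters = {}
    for i in range(len(notes)):
        clusters.setdefault(ds.find(i), []).append(i)
    # Only groups with at least two notes count as duplicate clusters.
    return [members for members in clusters.values() if len(members) > 1]


# Hypothetical notes: the first two are exact copies, the third is unrelated.
notes = [
    "patient is stable and resting comfortably today",
    "patient is stable and resting comfortably today",
    "chest x ray shows no acute findings",
]
duplicate_clusters = cluster_notes(notes, threshold=0.7)
```

At the scale described in the abstract (about 1.5 million notes), the all-pairs loop would be impractical, which is the motivation for hashing notes into candidate buckets before comparing them.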
3.
BMC Bioinformatics ; 17 Suppl 1: 1, 2016 Jan 11.
Article in English | MEDLINE | ID: mdl-26817711

ABSTRACT

BACKGROUND: Numerous publicly available biomedical databases derive their data by curating the literature. The curated data can be useful as training examples for information extraction, but they usually lack the exact mentions and their locations in the text required for supervised machine learning. This paper describes a general approach to information extraction using curated data as training examples. The idea is to formulate the problem as cost-sensitive learning from noisy labels, where the cost is estimated by a committee of weak classifiers that consider both curated data and the text. RESULTS: We test the idea on two information extraction tasks for Genome-Wide Association Studies (GWAS). The first task is to extract the target phenotypes (diseases or traits) of a study and the second is to extract the ethnic backgrounds of study subjects for different stages (initial or replication). Experimental results show that our approach can achieve a Precision-at-2 (P@2) of 87% for disease/trait extraction and an F1-score of 0.83 for stage-ethnicity extraction, both outperforming their cost-insensitive baseline counterparts. CONCLUSIONS: The results show that curated biomedical databases can potentially be reused as training examples to train information extractors without expert annotation or refinement, opening an unprecedented opportunity for using "big data" in biomedical text mining.


Subjects
Abstracting and Indexing/methods , Data Curation , Data Mining/methods , Databases, Factual , Disease/genetics , Genetic Predisposition to Disease , Genome-Wide Association Study , Humans , Risk Assessment
4.
Mol Cell Proteomics ; 12(5): 1335-49, 2013 May.
Article in English | MEDLINE | ID: mdl-23397142

ABSTRACT

Deciphering the network of signaling pathways in cancer via protein-protein interactions (PPIs) at the cellular level is a promising approach but remains incomplete. We used an in situ proximity ligation assay to identify and quantify 67 endogenous PPIs among 21 interlinked pathways in two hepatocellular carcinoma (HCC) cells, Huh7 (minimally migratory cells) and Mahlavu (highly migratory cells). We then applied a differential network biology analysis and determined that the novel interaction, CRKL-FLT1, has a high centrality ranking, and the expression of this interaction is strongly correlated with the migratory ability of HCC and other cancer cell lines. Knockdown of CRKL and FLT1 in HCC cells leads to a decrease in cell migration via ERK signaling and the epithelial-mesenchymal transition process. Our immunohistochemical analysis shows high expression levels of the CRKL and CRKL-FLT1 pair that strongly correlate with reduced disease-free and overall survival in HCC patient samples, and a multivariate analysis further established CRKL and the CRKL-FLT1 as novel prognosis markers. This study demonstrated that functional exploration of a disease network with interlinked pathways via PPIs can be used to discover novel biomarkers.


Subjects
Adaptor Proteins, Signal Transducing/metabolism , Biomarkers, Tumor/metabolism , Carcinoma, Hepatocellular/metabolism , Liver Neoplasms/metabolism , Nuclear Proteins/metabolism , Protein Interaction Maps , Adult , Aged , Aged, 80 and over , Carcinoma, Hepatocellular/diagnosis , Carcinoma, Hepatocellular/mortality , Disease-Free Survival , HEK293 Cells , Hep G2 Cells , Humans , Kaplan-Meier Estimate , Liver Neoplasms/diagnosis , Liver Neoplasms/mortality , Middle Aged , Prognosis , Proportional Hazards Models , Retrospective Studies , Signal Transduction , Tissue Array Analysis , Vascular Endothelial Growth Factor Receptor-1/metabolism , Young Adult
5.
Bioinformatics ; 28(12): i106-14, 2012 Jun 15.
Article in English | MEDLINE | ID: mdl-22689749

ABSTRACT

MOTIVATION: The recent development of high-throughput drug profiling (high content screening or HCS) provides a large amount of quantitative multidimensional data. Despite its potential, it poses several challenges for academic and industry analysts alike. This is especially true for ranking the effectiveness of several drugs directly from many thousands of images. This paper introduces, for the first time, a new framework for automatically ordering the performance of drugs, called fractional adjusted bi-partitional score (FABS). This general strategy takes advantage of graph-based formulations and solutions and avoids many shortfalls of the methods traditionally used in practice. We experimented with the FABS framework by implementing it with a specific algorithm, a variant of normalized cut called normalized cut prime (FABS-NC′), producing a ranking of drugs. This algorithm is known to run in polynomial time and therefore can scale well in high-throughput applications. RESULTS: We compare the performance of FABS-NC′ to other methods that could be used for drug ranking. We devise two variants of the FABS algorithm: FABS-SVM, which uses a support vector machine (SVM) as a black box, and FABS-Spectral, which uses the eigenvector technique (spectral) as a black box. We also compare the performance of FABS-NC′ to three other methods that have been previously considered: center ranking (Center), PCA ranking (PCA), and the graph transition energy method (GTEM). The conclusion is encouraging: FABS-NC′ consistently outperforms all five alternatives. FABS-SVM has the second best performance among these six methods, but is far behind FABS-NC′: In some cases FABS-NC′ produces over half correctly predicted ranking experiment trials than FABS-SVM. AVAILABILITY: The system and data for the evaluation reported here will be made available upon request to the authors after this manuscript is accepted for publication.


Subjects
Drug Discovery/methods , Pharmaceutical Preparations/analysis , Support Vector Machine , Animals , CHO Cells , Cricetinae
6.
Genomics ; 100(3): 141-8, 2012 Sep.
Article in English | MEDLINE | ID: mdl-22735742

ABSTRACT

Recent genome-wide surveys on ncRNA have revealed that a substantial fraction of miRNA genes is likely to form clusters. However, the evolutionary and biological function implications of clustered miRNAs are still elusive. After identifying clustered miRNA genes under different maximum inter-miRNA distances (MIDs), this study intended to reveal evolution conservation patterns among these clustered miRNA genes in metazoan species using a computation algorithm. As examples, a total of 15-35% of known and predicted miRNA genes in nine selected species constitute clusters under the MIDs ranging from 1kb to 50kb. Intriguingly, 33 out of 37 metazoan miRNA clusters in 56 metazoan genomes are co-conserved with their up/down-stream adjacent protein-coding genes. Meanwhile, a co-expression pattern of miR-1 and miR-133a in the mir-133-1 cluster has been experimentally demonstrated. Therefore, the MetaMirClust database provides a useful bioinformatic resource for biologists to facilitate the advanced interrogations on the composition of miRNA clusters and their evolution patterns.


Subjects
Data Mining/methods , MicroRNAs/analysis , Multigene Family , Software , Algorithms , Animals , Base Sequence , Computational Biology/methods , Conserved Sequence , Databases, Genetic , Evolution, Molecular , Genes, rRNA , Hep G2 Cells , Humans , MicroRNAs/genetics , Reverse Transcriptase Polymerase Chain Reaction , Ribosomes/genetics , Sequence Homology, Nucleic Acid , Transcriptome
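
The abstract above groups miRNA genes into clusters under a maximum inter-miRNA distance (MID). One plausible reading, sketched below, is to sort genes along each chromosome and start a new cluster whenever the gap to the previous gene exceeds the MID. The tuple layout, the single-coordinate distance, and the gene names are assumptions for illustration; the abstract does not specify how MetaMirClust measures distance or handles strands.

```python
from itertools import groupby


def cluster_by_mid(genes, mid):
    """Chain neighbouring genes on the same chromosome into clusters.

    genes: list of (name, chromosome, position) tuples (hypothetical format).
    mid:   maximum allowed gap, in bases, between consecutive genes.
    Returns a list of clusters (lists of names) with at least two members.
    """
    clusters = []
    ordered = sorted(genes, key=lambda g: (g[1], g[2]))
    for _, group in groupby(ordered, key=lambda g: g[1]):
        current = []
        last_pos = None
        for name, _, pos in group:
            # A gap larger than the MID closes the current cluster.
            if last_pos is not None and pos - last_pos > mid:
                if len(current) > 1:
                    clusters.append(current)
                current = []
            current.append(name)
            last_pos = pos
        if len(current) > 1:
            clusters.append(current)
    return clusters


# Hypothetical genes and coordinates, used only to exercise the function.
genes = [
    ("miR-a", "chr1", 1_000),
    ("miR-b", "chr1", 3_000),
    ("miR-c", "chr1", 60_000),
    ("miR-d", "chr2", 2_000),
]
clusters_10kb = cluster_by_mid(genes, mid=10_000)
```

Raising the MID merges more genes into each cluster, which is consistent with the abstract reporting a range of clustered fractions (15-35%) across MIDs from 1kb to 50kb.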
7.
Mil Med ; 188(Suppl 6): 590-597, 2023 11 08.
Article in English | MEDLINE | ID: mdl-37948284

ABSTRACT

INTRODUCTION: Foot and ankle fractures are the most common military health problem. Automated diagnosis can save time and personnel. It is crucial that a system not only distinguish fractures from normal healthy cases but also remain robust in the presence of other orthopedic pathologies. Artificial intelligence (AI) deep learning has been shown to be promising. Previously, we developed HAMIL-Net to automatically detect upper extremity orthopedic injuries. In this research, we investigated the performance of HAMIL-Net for detecting foot and ankle fractures in the presence of other abnormalities. MATERIALS AND METHODS: HAMIL-Net is a novel deep neural network consisting of a hierarchical attention layer followed by a multiple-instance learning layer. The design allows it to handle imaging studies with multiple views. We used 148K musculoskeletal imaging studies for 51K Veterans at VA San Diego from the past 20 years to create datasets for this research. We annotated each study with a semi-automated pipeline that leverages radiology reports written by board-certified radiologists and extracts findings with a natural language processing tool, and we manually validated the annotations. RESULTS: HAMIL-Net can be trained with study-level, multiple-view examples and detects foot and ankle fractures with an area under the receiver operating characteristic curve of 0.87, but its performance dropped when tested on cases that included other abnormalities. By integrating a fracture-specialized model with one that detects a broad range of abnormalities, HAMIL-Net's accuracy in detecting any abnormality improved from 0.53 to 0.77 and its F-score from 0.46 to 0.86. We also report HAMIL-Net's performance under different study types, including for young (age 18-35) patients. CONCLUSIONS: Automated fracture detection is promising, but for it to deliver its full benefit in clinical use, the presence of other abnormalities must be considered. Our results with HAMIL-Net showed that considering other abnormalities improved fracture detection and allowed for incidental findings of other musculoskeletal abnormalities pertinent to or superimposed on fractures.


Subjects
Ankle Fractures , Artificial Intelligence , Humans , Adolescent , Young Adult , Adult , Neural Networks, Computer , Retrospective Studies
8.
PLoS Comput Biol ; 7(10): e1002212, 2011 Oct.
Article in English | MEDLINE | ID: mdl-21998575

ABSTRACT

Morphological dynamics of mitochondria is associated with key cellular processes related to aging and neuronal degenerative diseases, but the lack of standard quantification of mitochondrial morphology impedes systematic investigation. This paper presents an automated system for the quantification and classification of mitochondrial morphology. We discovered six morphological subtypes of mitochondria for objective quantification of mitochondrial morphology. These six subtypes are small globules, swollen globules, straight tubules, twisted tubules, branched tubules and loops. The subtyping was derived by applying consensus clustering to a huge collection of more than 200 thousand mitochondrial images extracted from 1422 micrographs of Chinese hamster ovary (CHO) cells treated with different drugs, and was validated by evidence of functional similarity reported in the literature. Quantitative statistics of subtype compositions in cells is useful for correlating drug response and mitochondrial dynamics. Combining the quantitative results with our biochemical studies about the effects of squamocin on CHO cells reveals new roles of Caspases in the regulatory mechanisms of mitochondrial dynamics. This system is not only of value to the mitochondrial field, but also applicable to the investigation of other subcellular organelle morphology.


Subjects
Caspases/metabolism , Mitochondria/enzymology , Mitochondria/ultrastructure , Animals , CHO Cells , Caspase Inhibitors , Computational Biology , Cricetinae , Cricetulus , Cysteine Proteinase Inhibitors/pharmacology , Dimethyl Sulfoxide/pharmacology , Furans/pharmacology , Lactones/pharmacology , Mitochondria/classification , Mitochondria/drug effects , Models, Biological , Oligopeptides/pharmacology , Pattern Recognition, Automated/statistics & numerical data
9.
Genomics ; 98(6): 453-9, 2011 Dec.
Article in English | MEDLINE | ID: mdl-21930198

ABSTRACT

Rabbit (Oryctolagus cuniculus) is the only lagomorph animal of which the genome has been sequenced. Establishing a rabbit miRNA resource will benefit subsequent functional genomic studies in mammals. We have generated small RNA sequence reads with SOLiD and Solexa platforms to identify rabbit miRNAs, where we identified 464 pre-miRNAs and 886 mature miRNAs. The brain and heart miRNA libraries were used for further in-depth analysis of isomiR distributions. There are several intriguing findings. First, several rabbit pre-miRNAs form highly conserved clusters. Second, there is a preference in selecting one strand as mature miRNA, resulting in an arm selection preference. Third, we analyzed the isomiR expression and validated the expression of isomiR types in different rabbit tissues. Moreover, we further performed additional small RNA libraries and defined miRNAs differentially expressed between brain and heart. We conclude also that isomiR distribution profiles could vary between brain and heart tissues.


Subjects
MicroRNAs/genetics , MicroRNAs/metabolism , Rabbits/genetics , Amino Acid Sequence , Animals , Gene Expression Profiling , Gene Library , Molecular Sequence Data , Multigene Family , Sequence Analysis, RNA
10.
Radiol Artif Intell ; 4(4): e210258, 2022 Jul.
Article in English | MEDLINE | ID: mdl-35923376

ABSTRACT

Purpose: To investigate if tailoring a transformer-based language model to radiology is beneficial for radiology natural language processing (NLP) applications. Materials and Methods: This retrospective study presents a family of bidirectional encoder representations from transformers (BERT)-based language models adapted for radiology, named RadBERT. Transformers were pretrained with either 2.16 or 4.42 million radiology reports from U.S. Department of Veterans Affairs health care systems nationwide on top of four different initializations (BERT-base, Clinical-BERT, robustly optimized BERT pretraining approach [RoBERTa], and BioMed-RoBERTa) to create six variants of RadBERT. Each variant was fine-tuned for three representative NLP tasks in radiology: (a) abnormal sentence classification: models classified sentences in radiology reports as reporting abnormal or normal findings; (b) report coding: models assigned a diagnostic code to a given radiology report for five coding systems; and (c) report summarization: given the findings section of a radiology report, models selected key sentences that summarized the findings. Model performance was compared by bootstrap resampling with five intensively studied transformer language models as baselines: BERT-base, BioBERT, Clinical-BERT, BlueBERT, and BioMed-RoBERTa. Results: For abnormal sentence classification, all models performed well (accuracies above 97.5 and F1 scores above 95.0). RadBERT variants achieved significantly higher scores than corresponding baselines when given only 10% or less of 12 458 annotated training sentences. For report coding, all variants outperformed baselines significantly for all five coding systems. The variant RadBERT-BioMed-RoBERTa performed the best among all models for report summarization, achieving a Recall-Oriented Understudy for Gisting Evaluation-1 score of 16.18 compared with 15.27 by the corresponding baseline (BioMed-RoBERTa, P < .004). 
Conclusion: Transformer-based language models tailored to radiology had improved performance of radiology NLP tasks compared with baseline transformer language models. Keywords: Translation, Unsupervised Learning, Transfer Learning, Neural Networks, Informatics. Supplemental material is available for this article. © RSNA, 2022. See also commentary by Wiggins and Tejani in this issue.

11.
BMC Bioinformatics ; 12 Suppl 8: S6, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22152021

ABSTRACT

BACKGROUND: Previous gene normalization (GN) systems have mostly focused on disambiguation using contextual information. An effective gene mention tagger was deemed unnecessary because the subsequent steps would filter out false positives, so high recall was considered sufficient. However, unlike similar tasks in past BioCreative challenges, the BioCreative III GN task is particularly challenging because it is not species-specific. When required to process full-length articles, an ineffective gene mention tagger may produce a huge number of ambiguous false positives that overwhelm subsequent filtering steps while still missing many true positives. RESULTS: We present our GN system, which participated in the BioCreative III GN task. Our system applies a typical 2-stage approach to GN but features a soft tagging gene mention tagger that generates a set of overlapping gene mention variants with nearly perfect recall. The overlapping gene mention variants increase the chance of a precise match in the dictionary and alleviate the need for disambiguation. Our GN system achieved a precision of 0.9 (F-score 0.63) on the BioCreative III GN test corpus with the silver annotation of 507 articles. Its TAP-k scores are competitive with the best results among all participants. CONCLUSIONS: We show that despite the lack of clever disambiguation in our gene normalization system, effective soft tagging of gene mention variants can indeed contribute to performance in cross-species and full-text gene normalization.


Subjects
Data Mining , Genes , Species Specificity , Data Mining/methods , Natural Language Processing , Periodicals as Topic , Software , Terminology as Topic
13.
BMC Bioinformatics ; 12 Suppl 1: S9, 2011 Feb 15.
Article in English | MEDLINE | ID: mdl-21342592

ABSTRACT

BACKGROUND: Un-MAppable Reads Solution (UMARS) is a user-friendly web service focusing on retrieving valuable information from sequence reads that cannot be mapped back to reference genomes. Recently, next-generation sequencing (NGS) technology has emerged as a powerful tool for generating high-throughput sequencing data and has been applied to many kinds of biological research. In a typical analysis, adaptor-trimmed NGS reads were first mapped back to reference sequences, including genomes or transcripts. However, a fraction of NGS reads failed to be mapped back to the reference sequences. Such un-mappable reads are usually imputed to sequencing errors and discarded without further consideration. METHODS: We are investigating possible biological relevance and possible sources of un-mappable reads. Therefore, we developed UMARS to scan for virus genomic fragments or exon-exon junctions of novel alternative splicing isoforms from un-mappable reads. For mapping un-mappable reads, we first collected viral genomes and sequences of exon-exon junctions. Then, we constructed UMARS pipeline as an automatic alignment interface. RESULTS: By demonstrating the results of two UMARS alignment cases, we show the applicability of UMARS. We first showed that the expected EBV genomic fragments can be detected by UMARS. Second, we also detected exon-exon junctions from un-mappable reads. Further experimental validation also ensured the authenticity of the UMARS pipeline. The UMARS service is freely available to the academic community and can be accessed via http://musk.ibms.sinica.edu.tw/UMARS/. CONCLUSIONS: In this study, we have shown that some un-mappable reads are not caused by sequencing errors. They can originate from viral infection or transcript splicing. Our UMARS pipeline provides another way to examine and recycle the un-mappable reads that are commonly discarded as garbage.


Subjects
High-Throughput Nucleotide Sequencing/methods , Software , Chromosome Mapping , DNA, Complementary/genetics , Exons , Genome, Viral , RNA Splicing , Sequence Alignment , User-Computer Interface
14.
BMC Bioinformatics ; 12 Suppl 8: S2, 2011 Oct 03.
Article in English | MEDLINE | ID: mdl-22151901

ABSTRACT

BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. 
By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.


Subjects
Algorithms , Data Mining/methods , Genes , Animals , Data Mining/standards , Humans , National Library of Medicine (U.S.) , Periodicals as Topic , United States
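
The improvements of 12.4%, 21.8%, and 26.6% quoted in the abstract above can be reproduced as relative gains of the composite system over the best team on the gold standard. The short check below uses only the TAP-k values given in the abstract; the variable and function names are illustrative.

```python
# TAP-k scores on the gold standard, keyed by k, as quoted in the abstract.
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(new, old):
    """Percentage gain of `new` over `old`."""
    return (new - old) / old * 100.0


improvements = {
    k: round(relative_improvement(composite[k], best_team[k]), 1)
    for k in best_team
}
print(improvements)
```

The figures are relative rather than absolute: at k=20, for example, the absolute gain is 0.0942 TAP-k points, which is 26.6% of the best team's 0.3535.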
15.
Bioinformatics ; 26(12): i29-37, 2010 Jun 15.
Article in English | MEDLINE | ID: mdl-20529919

ABSTRACT

MOTIVATION: High-throughput image-based assay technologies can rapidly produce a large number of cell images for drug screening, but data analysis is still a major bottleneck that limits their utility. Quantifying a wide variety of morphological differences observed in cell images under different drug influences is still a challenging task because the result can be highly sensitive to sampling and noise. RESULTS: We propose a graph-based approach to cell image analysis. We define graph transition energy to quantify morphological differences between image sets. A spectral graph theoretic regularization is applied to transform the feature space based on training examples of extremely different images to calibrate the quantification. Calibration is essential for a practical quantification method because we need to measure the confidence of the quantification. We applied our method to quantify the degree of partial fragmentation of mitochondria in collections of fluorescent cell images. We show that with transformation, the quantification can be more accurate and sensitive than that without transformation. We also show that our method outperforms competing methods, including neighbourhood component analysis and the multi-variate drug profiling method by Loo et al. We illustrate its utility with a study of Annonaceous acetogenins, a family of compounds with drug potential. Our result reveals that squamocin induces more fragmented mitochondria than muricin A. AVAILABILITY: Mitochondrial cell images, their corresponding feature sets (SSLF and WSLF) and the source code of our proposed method are available at http://aiia.iis.sinica.edu.tw/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Cellular Structures/ultrastructure , Computational Biology/methods , Acetogenins/metabolism , Calibration , Image Interpretation, Computer-Assisted/methods , Mitochondria/ultrastructure
16.
Genomics ; 96(1): 1-9, 2010 Jul.
Article in English | MEDLINE | ID: mdl-20347954

ABSTRACT

MicroRNAs (miRNAs) are endogenous non-protein-coding RNAs of approximately 22 nucleotides. Thousands of miRNA genes have been identified (computationally and/or experimentally) in a variety of organisms, which suggests that miRNA genes have been widely shared and distributed among species. Here, we used unique miRNA sequence patterns to scan the genome sequences of 56 bilaterian animal species for locating candidate miRNAs first. The regions centered surrounding these candidate miRNAs were then extracted for folding and calculating the features of their secondary structure. Using a support vector machine (SVM) as a classifier combined with these features, we identified an additional 13,091 orthologous or paralogous candidate pre-miRNAs, as well as their corresponding candidate mature miRNAs. Stem-loop RT-PCR and deep sequencing methods were used to experimentally validate the prediction results in human, medaka and rabbit. Our prediction pipeline allows the rapid and effective discovery of homologous miRNAs in a large number of genomes.


Subjects
Genome , MicroRNAs/classification , MicroRNAs/genetics , Sequence Analysis, RNA , Software Design , Algorithms , Animals , Cell Line, Tumor , Computational Biology/statistics & numerical data , DNA, Complementary , Databases, Genetic , Female , Genomics , Humans , Inverted Repeat Sequences , Male , Models, Statistical , Molecular Sequence Data , Nucleic Acid Conformation , Oryzias , RNA, Messenger/genetics , Rabbits , Sequence Alignment , Species Specificity
17.
Database (Oxford) ; 2021, 2021 04 29.
Article in English | MEDLINE | ID: mdl-33914028

ABSTRACT

High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information's Sequence Read Archive (SRA) are missing metadata across several categories. In an effort to improve the metadata coverage of these samples, we leveraged almost 44 million attribute-value pairs from SRA BioSample to train a scalable, recurrent neural network that predicts missing metadata via named entity recognition (NER). The network was first trained to classify short text phrases according to 11 metadata categories and achieved an overall accuracy and area under the receiver operating characteristic curve of 85.2% and 0.977, respectively. We then applied our classifier to predict 11 metadata categories from the longer TITLE attribute of samples, evaluating performance on a set of samples withheld from model training. Prediction accuracies were high when extracting sample Genus/Species (94.85%), Condition/Disease (95.65%) and Strain (82.03%) from TITLEs, with lower accuracies and lack of predictions for other categories highlighting multiple issues with the current metadata annotations in BioSample. These results indicate the utility of recurrent neural networks for NER-based metadata prediction and the potential for models such as the one presented here to increase metadata coverage in BioSample while minimizing the need for manual curation. Database URL: https://github.com/cartercompbio/PredictMEE.


Subjects
Deep Learning , Metadata , High-Throughput Nucleotide Sequencing , Reproducibility of Results , Software
18.
Res Sq ; 2021 Jan 08.
Article in English | MEDLINE | ID: mdl-33442676

ABSTRACT

'Federated Learning' (FL) is a method to train Artificial Intelligence (AI) models with data from multiple sources while maintaining anonymity of the data, thus removing many barriers to data sharing. During the SARS-CoV-2 pandemic, 20 institutes collaborated on a healthcare FL study to predict future oxygen requirements of infected patients using inputs of vital signs, laboratory data, and chest x-rays, constituting the "EXAM" (EMR CXR AI Model) model. EXAM achieved an average Area Under the Curve (AUC) of over 0.92, an average improvement of 16%, and a 38% increase in generalisability over local models. The FL paradigm was successfully applied to facilitate a rapid data science collaboration without data exchange, resulting in a model that generalised across heterogeneous, unharmonized datasets. This provided the broader healthcare community with a validated model to respond to COVID-19 challenges and set the stage for broader use of FL in healthcare.

19.
Nat Med ; 27(10): 1735-1743, 2021 10.
Article in English | MEDLINE | ID: mdl-34526699

ABSTRACT

Federated learning (FL) is a method used for training artificial intelligence models with data from multiple sources while maintaining data anonymity, thus removing many barriers to data sharing. Here we used data from 20 institutes across the globe to train a FL model, called EXAM (electronic medical record (EMR) chest X-ray AI model), that predicts the future oxygen requirements of symptomatic patients with COVID-19 using inputs of vital signs, laboratory data and chest X-rays. EXAM achieved an average area under the curve (AUC) >0.92 for predicting outcomes at 24 and 72 h from the time of initial presentation to the emergency room, and it provided 16% improvement in average AUC measured across all participating sites and an average increase in generalizability of 38% when compared with models trained at a single site using that site's data. For prediction of mechanical ventilation treatment or death at 24 h at the largest independent test site, EXAM achieved a sensitivity of 0.950 and specificity of 0.882. In this study, FL facilitated rapid data science collaboration without data exchange and generated a model that generalized across heterogeneous, unharmonized datasets for prediction of clinical outcomes in patients with COVID-19, setting the stage for the broader use of FL in healthcare.
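
The abstract does not state how the per-site models were combined. One widely used aggregation rule in federated learning is federated averaging, in which a central server averages the parameters trained at each site, weighted by that site's sample count, so that only parameters (never patient records) leave a site. The sketch below illustrates that rule only; it is an assumption for illustration and not necessarily the procedure used for EXAM. The function name `federated_average` and the dictionary layout of the weights are hypothetical.

```python
from typing import Dict, List

# Hypothetical layout: parameter name -> flat list of parameter values.
Weights = Dict[str, List[float]]


def federated_average(site_weights: List[Weights], site_sizes: List[int]) -> Weights:
    """Combine per-site parameters into one model, weighting by sample count.

    This is plain federated averaging, assumed here for illustration.
    """
    if not site_weights or len(site_weights) != len(site_sizes):
        raise ValueError("need one sample count per site and at least one site")
    total = sum(site_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")

    averaged: Weights = {}
    for name, first_values in site_weights[0].items():
        averaged[name] = [
            # Weighted mean of the i-th value of this parameter across sites.
            sum(w[name][i] * n for w, n in zip(site_weights, site_sizes)) / total
            for i in range(len(first_values))
        ]
    return averaged


# Toy example: two sites sharing a single two-element parameter vector.
site_a = {"w": [1.0, 2.0]}  # trained on 1 sample
site_b = {"w": [3.0, 4.0]}  # trained on 3 samples
global_model = federated_average([site_a, site_b], [1, 3])
```

In a full training loop, each site would start the next round from `global_model`, train locally, and send its updated parameters back for another averaging step.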


Subjects
COVID-19/physiopathology , Machine Learning , Outcome Assessment, Health Care , COVID-19/therapy , COVID-19/virology , Electronic Health Records , Humans , Prognosis , SARS-CoV-2/isolation & purification
20.
BMC Bioinformatics ; 11 Suppl 1: S21, 2010 Jan 18.
Article in English | MEDLINE | ID: mdl-20122193

ABSTRACT

BACKGROUND: Recombinant protein production is a useful biotechnology for producing large quantities of highly soluble proteins. Currently, the most widely used production system is to fuse a target protein into different vectors in Escherichia coli (E. coli). However, the production efficacy of different vectors varies for different target proteins. Trial and error is still the common practice for finding out the efficacy of a vector for a given target protein. Previous studies are limited in that they assumed that proteins would be over-expressed and focused only on the solubility of expressed proteins. In fact, many pairings of vectors and proteins result in no expression. RESULTS: In this study, we applied machine learning to train prediction models that predict whether a vector-protein pairing will or will not express in E. coli. For expressed cases, the models further predict whether the expressed proteins will be soluble. We collected a set of real cases from the clients of our recombinant protein production core facility, where six different vectors were designed and studied. This set of cases is used in both training and evaluation of our models. We evaluate three different models based on support vector machines (SVM) and their ensembles. Unlike many previous works, these models consider the sequence of the target protein as well as the sequence of the whole fusion vector as the features. We show that a model that classifies a case into one of the three classes (no expression, inclusion body and soluble) outperforms a model that considers the nested structure of the three classes, while a model that can take advantage of the hierarchical structure of the three classes performs slightly worse than, but comparably to, the best model. Meanwhile, compared to previous works, our best method still achieves the highest prediction accuracy. Lastly, we briefly present two methods for using the trained model in the design of recombinant protein production systems to improve the chance of highly soluble protein production. CONCLUSION: In this paper, we show that a machine learning approach to predicting the efficacy of a vector for a target protein in a recombinant protein production system is promising and may complement traditional knowledge-driven study of the efficacy. We will release our program to share with other labs in the public domain when this paper is published.
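
The difference between a flat three-class model and one that follows the nested structure of the classes can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two stage classifiers are stand-in threshold rules rather than trained SVMs, and the names `hierarchical_predict`, `toy_expresses` and `toy_is_soluble`, as well as the two-number feature vector, are hypothetical.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def hierarchical_predict(
    expresses: Callable[[Features], bool],
    is_soluble: Callable[[Features], bool],
    x: Features,
) -> str:
    """Two-stage decision: expression first, then solubility if expressed."""
    if not expresses(x):
        return "no expression"
    return "soluble" if is_soluble(x) else "inclusion body"


# Stand-in stage classifiers (assumptions; the paper trains SVMs on
# features derived from the protein and fusion-vector sequences).
def toy_expresses(x: Features) -> bool:
    return x[0] > 0.5


def toy_is_soluble(x: Features) -> bool:
    return x[1] > 0.5


labels = [
    hierarchical_predict(toy_expresses, toy_is_soluble, x)
    for x in ([0.2, 0.9], [0.8, 0.1], [0.8, 0.9])
]
```

A flat model would instead map the features directly to one of the three labels with a single multi-class classifier, without the intermediate expression decision.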


Subjects
Artificial Intelligence , Genetic Vectors/genetics , Recombinant Proteins/genetics , Databases, Protein , Escherichia coli/genetics , Escherichia coli/metabolism , Genetic Vectors/metabolism , Recombinant Proteins/metabolism , Solubility