ABSTRACT
We present an ensemble transfer learning method to predict suicide from Veterans Affairs (VA) electronic medical records (EMR). A diverse set of base models was trained to predict a binary outcome constructed from reported suicide, suicide attempt, and overdose diagnoses with varying choices of study design and prediction methodology. Each model used twenty cross-sectional and 190 longitudinal variables observed in eight time intervals covering 7.5 years prior to the time of prediction. Ensembles of seven base models were created and fine-tuned with ten variables expected to change with study design and outcome definition in order to predict suicide and combined outcome in a prospective cohort. The ensemble models achieved c-statistics of 0.73 on 2-year suicide risk and 0.83 on the combined outcome when predicting on a prospective cohort of [Formula: see text] 4.2 M veterans. The ensembles rely on nonlinear base models trained using a matched retrospective nested case-control (Rcc) study cohort and show good calibration across a diversity of subgroups, including risk strata, age, sex, race, and level of healthcare utilization. In addition, a linear Rcc base model provided a rich set of biological predictors, including indicators of suicide, substance use disorder, mental health diagnoses and treatments, hypoxia and vascular damage, and demographics.
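The c-statistics quoted above can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case. The following is a minimal sketch of that definition only, not the authors' evaluation code; the function name and the toy scores are hypothetical.

```python
def c_statistic(scores, labels):
    """Concordance (c-statistic, equal to ROC AUC) for binary labels.

    scores: predicted risks; labels: 1 for cases, 0 for non-cases.
    Ties between a case and a non-case count as half-concordant.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    concordant = sum(
        1.0 if c > n else 0.5 if c == n else 0.0
        for c in cases
        for n in controls
    )
    return concordant / (len(cases) * len(controls))


# Hypothetical toy data: two cases, two non-cases.
toy_scores = [0.9, 0.8, 0.3, 0.2]
toy_labels = [1, 0, 1, 0]
toy_c = c_statistic(toy_scores, toy_labels)
```

The pairwise loop is quadratic in cohort size, so a cohort of millions would need a rank-based formulation instead; the sketch is meant only to make the metric concrete.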
Subjects
Renal Cell Carcinoma , Kidney Neoplasms , Veterans , Humans , Veterans/psychology , Retrospective Studies , Cross-Sectional Studies , Prospective Studies , Attempted Suicide , Machine Learning
ABSTRACT
[This corrects the article DOI: 10.1371/journal.pone.0225858.].
ABSTRACT
BACKGROUND: Histopathology images of tumor biopsies present unique challenges for applying machine learning to the diagnosis and treatment of cancer. The pathology slides are high resolution, often exceeding 1 GB, have non-uniform dimensions, and often contain multiple tissue slices of varying sizes surrounded by large empty regions. The locations of abnormal or cancerous cells, which may constitute a small portion of any given tissue sample, are not annotated. Cancer image datasets are also extremely imbalanced, with most slides being associated with relatively common cancers. Since deep representations trained on natural photographs are unlikely to be optimal for classifying pathology slide images, which have different spectral ranges and spatial structure, we here describe an approach for learning features and inferring representations of cancer pathology slides based on sparse coding. RESULTS: We show that conventional transfer learning using a state-of-the-art deep learning architecture pre-trained on ImageNet (ResNet) and fine-tuned for a binary tumor/no-tumor classification task achieved between 85% and 86% accuracy. However, when all layers up to the last convolutional layer in ResNet are replaced with a single feature map inferred via sparse coding using a dictionary optimized for sparse reconstruction of unlabeled pathology slides, classification performance improves to over 93%, corresponding to a 54% error reduction. CONCLUSIONS: We conclude that a feature dictionary optimized for biomedical imagery may in general support better classification performance than does conventional transfer learning using a dictionary pre-trained on natural images.
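To make the idea of inferring a sparse representation from a fixed dictionary concrete, here is a minimal sketch using ISTA (iterative soft-thresholding) on a flat vector. The abstract does not say which solver or dictionary layout was used, so ISTA, the random dictionary, the penalty `lam`, and all names below are assumptions chosen for illustration only.

```python
import numpy as np


def soft_threshold(x, t):
    """Element-wise shrinkage operator used by ISTA."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)


def sparse_code_ista(x, D, lam=0.1, n_iter=200):
    """Approximately minimize 0.5*||x - D a||^2 + lam*||a||_1 over a."""
    # Step size 1/L, where L (squared spectral norm of D) bounds the
    # Lipschitz constant of the gradient of the quadratic term.
    L = np.linalg.norm(D, 2) ** 2
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = soft_threshold(a - grad / L, lam / L)
    return a


# Hypothetical stand-in data: a random unit-norm dictionary and a signal
# built from three of its atoms (not real pathology features).
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))
D /= np.linalg.norm(D, axis=0)
a_true = np.zeros(128)
a_true[[3, 40, 99]] = [1.5, -2.0, 1.0]
x = D @ a_true

a_hat = sparse_code_ista(x, D, lam=0.05, n_iter=500)
```

In a pipeline like the one described, the inferred coefficients (here `a_hat`) would play the role of the feature map passed to the classifier, in place of features from pre-trained convolutional layers.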
Subjects
Deep Learning/trends , Neoplasms/pathology , Computer Neural Networks , Humans
ABSTRACT
This paper explores the application of text mining to the problem of detecting protein functional sites in the biomedical literature, and specifically considers the task of identifying catalytic sites in that literature. We provide strong evidence for the need for text mining techniques that address residue-level protein function annotation through an analysis of two corpora in terms of their coverage of curated data sources. We also explore the viability of building a text-based classifier for identifying protein functional sites, identifying the low coverage of curated data sources and the potential ambiguity of information about protein functional sites as challenges that must be addressed. Nevertheless, we produce a simple classifier that achieves a reasonable ~69% F-score on our full-text silver corpus on the first attempt to address this classification task. The work has application in computational prediction of the functional significance of protein sites as well as in curation workflows for databases that capture this information.
Subjects
Proteins/chemistry , Amino Acids/chemistry , Artificial Intelligence , Binding Sites , Catalytic Domain , Computational Biology , Data Mining/statistics & numerical data , Protein Databases/statistics & numerical data , Ligands , Natural Language Processing , Proteins/classification , Proteins/metabolism
ABSTRACT
BACKGROUND: We propose a method for automatic extraction of protein-specific residue mentions from the biomedical literature. The method searches text for mentions of amino acids at specific sequence positions and attempts to correctly associate each mention with a protein also named in the text. The methods presented in this work will enable improved protein functional site extraction from articles, ultimately supporting protein function prediction. Our method made use of linguistic patterns for identifying the amino acid residue mentions in text. Further, we applied an automated graph-based method to learn syntactic patterns corresponding to protein-residue pairs mentioned in the text. We finally present an approach to automated construction of relevant training and test data using the distant supervision model. RESULTS: The performance of the method was assessed by extracting protein-residue relations from a new automatically generated test set of sentences containing high-confidence examples found using distant supervision. It achieved an F-measure of 0.84 on an automatically created silver corpus and 0.79 on a manually annotated gold data set for this task, outperforming previous methods. CONCLUSIONS: The primary contributions of this work are to (1) demonstrate the effectiveness of distant supervision for automatic creation of training data for protein-residue relation extraction, substantially reducing the effort and time involved in manual annotation of a data set and (2) show that the graph-based relation extraction approach we used generalizes well to the problem of protein-residue association extraction. This work paves the way towards effective extraction of protein functional residues from the literature.
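As an illustration of what a pattern for amino acid residue mentions might look like, the sketch below matches three-letter residue codes followed by a sequence position (for example "His-57" or "Ser195"). It is a deliberately narrow, hypothetical pattern, not the authors' pattern set, and it does not attempt the protein association step.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Three-letter code, an optional hyphen or space, then the position.
RESIDUE_PATTERN = re.compile(
    r"\b(" + "|".join(THREE_TO_ONE) + r")[- ]?(\d+)\b"
)


def find_residue_mentions(text):
    """Return mentions normalized to one-letter form, e.g. 'H57'."""
    return [
        THREE_TO_ONE[m.group(1)] + m.group(2)
        for m in RESIDUE_PATTERN.finditer(text)
    ]


example = "Mutation of His-57 and Ser195 abolished activity of chymotrypsin."
mentions = find_residue_mentions(example)
```

A realistic extractor would also need full residue names ("histidine 57"), one-letter forms ("H57"), and mutation notation ("H57A"), all of which this sketch omits.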
ABSTRACT
BACKGROUND: Classification is difficult for shotgun metagenomics data from environments such as soils, where the diversity of sequences is high and where reference sequences from close relatives may not exist. Approaches based on sequence-similarity scores must deal with the confounding effects that inheritance and functional pressures exert on the relation between scores and phylogenetic distance, while approaches based on sequence alignment and tree-building are typically limited to a small fraction of gene families. We describe an approach based on finding one or more exact matches between a read and a precomputed set of peptide 10-mers. RESULTS: At even the largest phylogenetic distances, thousands of 10-mer peptide exact matches can be found between pairs of bacterial genomes. Genes that share one or more peptide 10-mers typically have high reciprocal BLAST scores. Among a set of 403 representative bacterial genomes, some 20 million 10-mer peptides were found to be shared. We assign each of these peptides as a signature of a particular node in a phylogenetic reference tree based on the RNA polymerase genes. We classify the phylogeny of a genomic fragment (e.g., read) at the most specific node on the reference tree that is consistent with the phylogeny of observed signature peptides it contains. Using both synthetic data from four newly-sequenced soil-bacterium genomes and ten real soil metagenomics data sets, we demonstrate a sensitivity and specificity comparable to that of the MEGAN metagenomics analysis package using BLASTX against the NR database. Phylogenetic and functional similarity metrics applied to real metagenomics data indicate a signal-to-noise ratio of approximately 400 for distinguishing among environments. Our method assigns ~6.6 Gbp/hr on a single CPU, compared with 25 kbp/hr for methods based on BLASTX against the NR database.
CONCLUSIONS: Classification by exact matching against a precomputed list of signature peptides provides comparable results to existing techniques for reads longer than about 300 bp and does not degrade severely with shorter reads. Orders of magnitude faster than existing methods, the approach is suitable now for inclusion in analysis pipelines and appears to be extensible in several different directions.
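The classification rule described above can be sketched as a dictionary lookup followed by a walk on the reference tree. The sketch makes several assumptions that are not in the abstract: the input is already a translated peptide (the six-frame translation of DNA reads is omitted), the tree and signature table are tiny made-up examples, and "consistent" is interpreted as follows: if all hit nodes lie on one lineage, the deepest is returned; otherwise their lowest common ancestor is returned.

```python
def lineage(node, parent):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def classify(peptide, signatures, parent, k=10):
    """Assign a peptide to a tree node from exact k-mer signature hits."""
    hits = {
        signatures[peptide[i:i + k]]
        for i in range(len(peptide) - k + 1)
        if peptide[i:i + k] in signatures
    }
    if not hits:
        return None  # no signature peptide found; leave unclassified
    depth = lambda n: len(lineage(n, parent))
    deepest = max(hits, key=depth)
    if hits <= set(lineage(deepest, parent)):
        return deepest  # all hits agree along a single lineage
    # Conflicting hits: fall back to their lowest common ancestor.
    common = set.intersection(*(set(lineage(h, parent)) for h in hits))
    return max(common, key=depth)


# Hypothetical reference tree (child -> parent) and signature table.
PARENT = {
    "root": None,
    "Proteobacteria": "root",
    "Firmicutes": "root",
    "Gammaproteobacteria": "Proteobacteria",
    "Betaproteobacteria": "Proteobacteria",
}
SIGNATURES = {
    "MKTAYIAKQR": "Proteobacteria",
    "LSDEVGRTWP": "Gammaproteobacteria",
    "QQNHFCWYVA": "Betaproteobacteria",
}

agreeing = classify("MKTAYIAKQRGGLSDEVGRTWP", SIGNATURES, PARENT)
conflicting = classify("LSDEVGRTWPQQNHFCWYVA", SIGNATURES, PARENT)
```

Because each window costs one hash lookup, the work grows linearly with read length, which is in keeping with the throughput advantage over alignment-based search reported in the abstract.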
Subjects
Bacterial Proteins/genetics , DNA-Directed RNA Polymerases/genetics , Bacterial Genome , Metagenomics/methods , Oligopeptides/genetics , Phylogeny , DNA Sequence Analysis , Soil Microbiology , Bacterial Proteins/classification , Base Sequence , DNA-Directed RNA Polymerases/classification , Genetic Databases , Gene Expression Profiling , Oligopeptides/classification , Sequence Alignment , Nucleic Acid Sequence Homology , Species Specificity , Transcriptome
ABSTRACT
We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions.
Subjects
Computational Biology/methods , Data Mining/methods , Animals , Binding Sites , Catalytic Domain , X-Ray Crystallography/methods , Protein Databases , Humans , Molecular Models , Statistical Models , Molecular Conformation , Protein Tertiary Structure , Protein Sequence Analysis/methods , Software
ABSTRACT
Recent studies have noted extensive inconsistencies in gene start sites among orthologous genes in related microbial genomes. Here we provide the first documented evidence that imposing gene start consistency improves the accuracy of gene start-site prediction. We applied an algorithm using a genome majority vote (GMV) scheme to increase the consistency of gene starts among orthologs. We used a set of validated Escherichia coli genes as a standard to quantify accuracy. Results showed that the GMV algorithm can correct hundreds of gene prediction errors in sets of five or ten genomes while introducing few errors. Using a conservative calculation, we project that GMV would resolve many inconsistencies and errors in publicly available microbial gene maps. Our simple and logical solution provides a notable advance toward accurate gene maps.
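A genome majority vote over one ortholog set might be sketched as follows. The abstract does not spell out the voting rule, so the details here are assumptions: start sites are taken to be already mapped to a shared alignment coordinate, a strict majority is required, and the function only reports which genomes disagree (a real implementation would also need to confirm that a valid start codon exists at the consensus position).

```python
from collections import Counter


def majority_vote_start(starts):
    """Pick the consensus start for one ortholog set.

    starts: mapping of genome name -> start position in a shared
    alignment coordinate. Returns (consensus, genomes_to_revise), or
    (None, []) when no position has a strict majority.
    """
    counts = Counter(starts.values())
    position, votes = counts.most_common(1)[0]
    if votes * 2 <= len(starts):
        return None, []
    to_revise = sorted(g for g, s in starts.items() if s != position)
    return position, to_revise


# Hypothetical ortholog set across five genomes.
ortholog_starts = {"g1": 12, "g2": 12, "g3": 12, "g4": 30, "g5": 12}
consensus, revise = majority_vote_start(ortholog_starts)
```

Requiring a strict majority means sets split evenly between two candidate starts are left untouched, which is one conservative way to keep the number of newly introduced errors low.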
Subjects
Computational Biology/methods , Bacterial Genes , Bacterial Genome , Genetic Models , Statistical Models , Algorithms , Base Sequence , Chromosome Mapping , Computer Simulation , Escherichia coli , Molecular Sequence Data , Sequence Alignment , Transcription Initiation Site
ABSTRACT
BACKGROUND: Evolutionary divergence in the position of the translational start site among orthologous genes can have significant functional impacts. Divergence can alter the translation rate, degradation rate, subcellular location, and function of the encoded proteins. RESULTS: Existing GenBank gene maps for Burkholderia genomes suggest that extensive divergence has occurred: 53% of ortholog sets based on GenBank gene maps had inconsistent gene start sites. However, most of these inconsistencies appear to be gene-calling errors. Evolutionary divergence was the most plausible explanation for only 17% of the ortholog sets. Correcting probable errors in the GenBank gene maps decreased the percentage of ortholog sets with inconsistent starts by 68%, increased the percentage of ortholog sets with extractable upstream intergenic regions by 32%, increased the sequence similarity of intergenic regions and predicted proteins, and increased the number of proteins with identifiable signal peptides. CONCLUSIONS: Our findings highlight an emerging problem in comparative genomics: single-digit percent errors in gene predictions can lead to double-digit percentages of inconsistent ortholog sets. The work demonstrates a simple approach to evaluate and improve the quality of gene maps.
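The closing point about single-digit error rates producing double-digit inconsistency can be illustrated with simple arithmetic. The sketch below assumes, purely for illustration, that start-site errors occur independently with the same probability in every gene and that any one error makes its ortholog set inconsistent; neither assumption comes from the abstract.

```python
def inconsistent_set_fraction(per_gene_error_rate, genes_per_set):
    """Probability that at least one gene in an ortholog set is miscalled,
    under the independence assumption stated above."""
    return 1.0 - (1.0 - per_gene_error_rate) ** genes_per_set


# Hypothetical numbers: a 5% per-gene error rate, ten genomes per set.
fraction = inconsistent_set_fraction(0.05, 10)  # roughly 0.40
```

Under these assumptions a 5% per-gene error rate already affects about 40% of ten-member ortholog sets, which shows how errors compound as more genomes are compared.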
Subjects
Burkholderia/genetics , Molecular Evolution , Bacterial Genome , Genomics/methods , Transcription Initiation Site , Chromosome Mapping , Bacterial DNA/genetics , Intergenic DNA/genetics , Protein Sorting Signals , Sequence Alignment , Software
ABSTRACT
BACKGROUND: We present a fast version of the dynamics perturbation analysis (DPA) algorithm to predict functional sites in protein structures. The original DPA algorithm finds regions in proteins where interactions cause a large change in the protein conformational distribution, as measured using the relative entropy Dx. Such regions are associated with functional sites. RESULTS: The Fast DPA algorithm, which accelerates DPA calculations, is motivated by an empirical observation that Dx in a normal-modes model is highly correlated with an entropic term that only depends on the eigenvalues of the normal modes. The eigenvalues are accurately estimated using first-order perturbation theory, resulting in an N-fold reduction in the overall computational requirements of the algorithm, where N is the number of residues in the protein. The performance of the original and Fast DPA algorithms was compared using protein structures from a standard small-molecule docking test set. For nominal implementations of each algorithm, top-ranked Fast DPA predictions overlapped the true binding site 94% of the time, compared to 87% of the time for original DPA. In addition, per-protein recall statistics (fraction of binding-site residues that are among predicted residues) were slightly better for Fast DPA. On the other hand, per-protein precision statistics (fraction of predicted residues that are among binding-site residues) were slightly better using original DPA. Overall, the performance of Fast DPA in predicting ligand-binding-site residues was comparable to that of the original DPA algorithm. CONCLUSION: Compared to the original DPA algorithm, the decreased run time with comparable performance makes Fast DPA well-suited for implementation on a web server and for high-throughput analysis.
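The first-order perturbation estimate mentioned above replaces a fresh diagonalization with a cheap correction: for a symmetric matrix H with eigenpairs (lambda_i, v_i), the eigenvalues of H + dH are approximately lambda_i + v_i^T dH v_i. The sketch below demonstrates only this generic formula; the matrices are hypothetical stand-ins for a normal-mode Hessian and a probe-induced perturbation, and the entropic term itself is not reproduced.

```python
import numpy as np


def first_order_eigenvalues(H, dH):
    """Estimate eigenvalues of H + dH from one diagonalization of H.

    Assumes H and dH are symmetric and the eigenvalues of H are
    non-degenerate, so that lambda_i' ~ lambda_i + v_i^T dH v_i.
    """
    evals, evecs = np.linalg.eigh(H)
    # shifts[i] = v_i^T dH v_i, with v_i stored as column i of evecs.
    shifts = np.einsum("ji,jk,ki->i", evecs, dH, evecs)
    return evals + shifts


# Hypothetical well-separated spectrum and a small symmetric perturbation.
rng = np.random.default_rng(1)
H = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])
A = rng.standard_normal((5, 5))
dH = 1e-3 * (A + A.T) / 2.0

estimated = first_order_eigenvalues(H, dH)
exact = np.linalg.eigvalsh(H + dH)
max_error = float(np.max(np.abs(estimated - exact)))
```

H is diagonalized once and each candidate perturbation then costs only matrix-vector products, which is the kind of saving the reported N-fold reduction relies on.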
Subjects
Algorithms , Proteins/chemistry , Binding Sites , Molecular Models
ABSTRACT
A procedure for the identification of ligands bound in crystal structures of macromolecules is described. Two characteristics of the density corresponding to a ligand are used in the identification procedure. One is the correlation of the ligand density with each of a set of test ligands after optimization of the fit of that ligand to the density. The other is the correlation of a fingerprint of the density with the fingerprint of model density for each possible ligand. The fingerprints consist of an ordered list of correlations of each of the test ligands with the density. The two characteristics are scored using a Z-score approach in which the correlations are normalized to the mean and standard deviation of correlations found for a variety of mismatched ligand-density pairs, so that the Z scores are related to the probability of observing a particular value of the correlation by chance. The procedure was tested with a set of 200 of the most commonly found ligands in the Protein Data Bank, collectively representing 57% of all ligands in the Protein Data Bank. Using a combination of these two characteristics of ligand density, ranked lists of ligand identifications were made for representative (F_o - F_c)exp(iφ_c) difference density from entries in the Protein Data Bank. In 48% of the 200 cases, the correct ligand was at the top of the ranked list of ligands. This approach may be useful in identification of unknown ligands in new macromolecular structures as well as in the identification of which ligands in a mixture have bound to a macromolecule.
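The Z-score normalization described above can be sketched in a few lines. The abstract says the two characteristics are combined but not how, so summing the two Z scores is an assumption made here for illustration. The ligand names and correlation values are invented.

```python
import numpy as np


def z_score(correlation, mismatched_correlations):
    """Normalize a correlation against mismatched ligand-density pairs."""
    ref = np.asarray(mismatched_correlations, dtype=float)
    return float((correlation - ref.mean()) / ref.std(ddof=1))


def rank_ligands(fit_cc, fingerprint_cc, null_fit, null_fingerprint):
    """Rank candidate ligands by the sum of their two Z scores.

    Summation is an assumed combination rule, not the published one.
    """
    combined = {
        name: z_score(fit_cc[name], null_fit)
        + z_score(fingerprint_cc[name], null_fingerprint)
        for name in fit_cc
    }
    return sorted(combined, key=combined.get, reverse=True)


# Hypothetical correlations for three candidate ligands.
fit_cc = {"ATP": 0.8, "HEM": 0.5, "GOL": 0.3}
fingerprint_cc = {"ATP": 0.7, "HEM": 0.6, "GOL": 0.2}
null_values = [0.2, 0.4, 0.6]  # mismatched-pair correlations (invented)

ranking = rank_ligands(fit_cc, fingerprint_cc, null_values, null_values)
```

Expressing both characteristics on a common Z scale is what allows them to be combined, since raw density correlations and fingerprint correlations need not share a range.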
Subjects
Computational Biology/methods , Macromolecular Substances , Algorithms , Animals , Bacteriochlorophylls/chemistry , Cluster Analysis , X-Ray Crystallography , Protein Databases , Electrons , Humans , Ligands , Molecular Conformation , Probability , Protein Binding , Protein Conformation , Proteins/chemistry
ABSTRACT
A procedure is presented for fitting ligands to electron-density maps by first fitting a core fragment of the ligand to density and then extending the remainder of the ligand into density. The approach was tested by fitting 9327 ligands over a wide range of resolutions (most are in the range 0.8-4.8 Å) from the Protein Data Bank (PDB) into (F_o - F_c)exp(iφ_c) difference density calculated using entries from the PDB without these ligands. The procedure was able to place 58% of these 9327 ligands within 2 Å (r.m.s.d.) of the coordinates of the atoms in the original PDB entry for that ligand. The success of the fitting procedure was relatively insensitive to the size of the ligand in the range 10-100 non-H atoms and was only moderately sensitive to resolution, with the percentage of ligands placed near the coordinates of the original PDB entry for fits in the range 58-73% over all resolution ranges tested.
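The 2 Å success criterion can be made concrete with a direct r.m.s.d. calculation between fitted and deposited coordinates. This sketch assumes the atoms are listed in the same order in both arrays and that no superposition is applied, since both sets sit in the same crystal frame. It ignores symmetry-equivalent atom labelings, which a real comparison may need to handle.

```python
import numpy as np


def rmsd(fitted, reference):
    """Root-mean-square deviation between matched atom coordinates (Å)."""
    fitted = np.asarray(fitted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    if fitted.shape != reference.shape:
        raise ValueError("coordinate arrays must have the same shape")
    squared = np.sum((fitted - reference) ** 2, axis=1)
    return float(np.sqrt(np.mean(squared)))


def placed_correctly(fitted, reference, cutoff=2.0):
    """True when the fit lies within the cutoff r.m.s.d. of the reference."""
    return rmsd(fitted, reference) <= cutoff


# Hypothetical two-atom ligand displaced by 1 Å along z.
reference_xyz = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
fitted_xyz = [[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
deviation = rmsd(fitted_xyz, reference_xyz)
```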
Subjects
Algorithms , Protein Databases , Molecular Models , Proteins/chemistry , Binding Sites , Ligands , Protein Conformation
ABSTRACT
Human chromosome 16 features one of the highest levels of segmentally duplicated sequence among the human autosomes. We report here the 78,884,754 base pairs of finished chromosome 16 sequence, representing over 99.9% of its euchromatin. Manual annotation revealed 880 protein-coding genes confirmed by 1,670 aligned transcripts, 19 transfer RNA genes, 341 pseudogenes and three RNA pseudogenes. These genes include metallothionein, cadherin and iroquois gene families, as well as the disease genes for polycystic kidney disease and acute myelomonocytic leukaemia. Several large-scale structural polymorphisms spanning hundreds of kilobase pairs were identified and result in gene content differences among humans. Whereas the segmental duplications of chromosome 16 are enriched in the relatively gene-poor pericentromere of the p arm, some are involved in recent gene duplication and conversion events that are likely to have had an impact on the evolution of primates and human disease susceptibility.