1.
Sci Rep ; 14(1): 1793, 2024 01 20.
Article in English | MEDLINE | ID: mdl-38245528

ABSTRACT

We present an ensemble transfer learning method to predict suicide from Veterans Affairs (VA) electronic medical records (EMR). A diverse set of base models was trained to predict a binary outcome constructed from reported suicide, suicide attempt, and overdose diagnoses with varying choices of study design and prediction methodology. Each model used twenty cross-sectional and 190 longitudinal variables observed in eight time intervals covering 7.5 years prior to the time of prediction. Ensembles of seven base models were created and fine-tuned with ten variables expected to change with study design and outcome definition in order to predict suicide and combined outcome in a prospective cohort. The ensemble models achieved c-statistics of 0.73 on 2-year suicide risk and 0.83 on the combined outcome when predicting on a prospective cohort of [Formula: see text] 4.2 M veterans. The ensembles rely on nonlinear base models trained using a matched retrospective nested case-control (Rcc) study cohort and show good calibration across a diversity of subgroups, including risk strata, age, sex, race, and level of healthcare utilization. In addition, a linear Rcc base model provided a rich set of biological predictors, including indicators of suicide, substance use disorder, mental health diagnoses and treatments, hypoxia and vascular damage, and demographics.
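
The c-statistic quoted above is the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case, with ties counted as one half. The following is a minimal sketch of that definition only; the function name and the toy scores are illustrative assumptions and are not taken from the study.

```python
from typing import Sequence


def c_statistic(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Concordance (c-statistic) from predicted risks of cases and non-cases.

    Counts every case/non-case pair; a pair is concordant when the case has
    the higher score, and a tie contributes one half.
    """
    pairs = len(case_scores) * len(control_scores)
    if pairs == 0:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                concordant += 1.0
            elif case == control:
                concordant += 0.5
    return concordant / pairs
```

This pairwise form is quadratic in cohort size, so for a cohort of millions a rank-based computation would be the practical choice; the quantity being estimated is the same.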


Subject(s)
Carcinoma, Renal Cell , Kidney Neoplasms , Veterans , Humans , Veterans/psychology , Retrospective Studies , Cross-Sectional Studies , Prospective Studies , Suicide, Attempted , Machine Learning
2.
J Interprof Care ; 38(3): 476-485, 2024.
Article in English | MEDLINE | ID: mdl-38124506

ABSTRACT

Empirical evidence indicates that collaborative interprofessional practice leads to positive health outcomes. Further, there is an abundance of evidence examining student and/or faculty perceptions of learning or satisfaction about the interprofessional education (IPE) learning experience. However, there is a dearth of research linking IPE interventions to patient outcomes. The objective of this scoping review was to describe and summarize the evidence linking IPE interventions to the delivery of effective patient care. A three-step search strategy was utilized for this review with articles that met the following criteria: publications dated 2015-2020 using qualitative, quantitative or mixed methods; the inclusion of healthcare professionals, students, or practitioners who had experienced IPE or training that included at least two collaborators within coursework or other professional education; and at least one of ten Centers for Medicare & Medicaid Services quality measures (length of stay, medication errors, medical errors, patient satisfaction scores, medication adherence, patient and caregiver education, hospice usage, mortality, infection rates, and readmission rates). Overall, n=94 articles were identified, providing overwhelming evidence supporting a positive relationship between IPE interventions and several key quality health measures including length of stay, medical errors, patient satisfaction, patient or caregiver education, and mortality. Findings from this scoping review suggest a critical need for the development, implementation, and evaluation of IPE interventions to improve patient outcomes.


Subject(s)
Interprofessional Education , Interprofessional Relations , Aged , United States , Humans , Medicare , Patient Care , Patient Care Team
3.
J Am Med Inform Assoc ; 31(1): 220-230, 2023 12 22.
Article in English | MEDLINE | ID: mdl-37769328

ABSTRACT

OBJECTIVE: To apply deep neural networks (DNNs) to longitudinal EHR data in order to predict suicide attempt risk among veterans. Local explainability techniques were used to provide explanations for each prediction with the goal of ultimately improving outreach and intervention efforts. MATERIALS AND METHODS: The DNNs fused demographic information with diagnostic, prescription, and procedure codes. Models were trained and tested on EHR data of approximately 500 000 US veterans: all veterans with recorded suicide attempts from April 1, 2005, through January 1, 2016, each paired with 5 veterans of the same age who did not attempt suicide. Shapley Additive Explanation (SHAP) values were calculated to provide explanations of DNN predictions. RESULTS: The DNNs outperformed logistic and linear regression models in predicting suicide attempts. After adjusting for the sampling technique, the convolutional neural network (CNN) model achieved a positive predictive value (PPV) of 0.54 for suicide attempts within 12 months by veterans in the top 0.1% risk tier. Explainability methods identified meaningful subgroups of high-risk veterans as well as key determinants of suicide attempt risk at both the group and individual level. DISCUSSION AND CONCLUSION: The deep learning methods employed in the present study have the potential to significantly enhance existing suicide risk models for veterans. These methods can also provide important clues to explore the relative value of long-term and short-term intervention strategies. Furthermore, the explainability methods utilized here could also be used to communicate to clinicians the key features which increase specific veterans' risk for attempting suicide.


Subject(s)
Suicide, Attempted , Veterans , Humans , Neural Networks, Computer , Motivation
4.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34524425

ABSTRACT

To enable personalized cancer treatment, machine learning models have been developed to predict drug response as a function of tumor and drug features. However, most algorithm development efforts have relied on cross-validation within a single study to assess model accuracy. While an essential first step, cross-validation within a biological data set typically provides an overly optimistic estimate of the prediction performance on independent test sets. To provide a more rigorous assessment of model generalizability between different studies, we use machine learning to analyze five publicly available cell line-based data sets: National Cancer Institute 60, Cancer Therapeutics Response Portal (CTRP), Genomics of Drug Sensitivity in Cancer, Cancer Cell Line Encyclopedia and Genentech Cell Line Screening Initiative (gCSI). Based on observed experimental variability across studies, we explore estimates of prediction upper bounds. We report performance results of a variety of machine learning models, with a multitasking deep neural network achieving the best cross-study generalizability. By multiple measures, models trained on CTRP yield the most accurate predictions on the remaining testing data, and gCSI is the most predictable among the cell line data sets included in this study. With these experiments and further simulations on partial data, two lessons emerge: (1) differences in viability assays can limit model generalizability across studies and (2) drug diversity, more than tumor diversity, is crucial for raising model generalizability in preclinical screening.


Subject(s)
Neoplasms , Algorithms , Cell Line , Humans , Machine Learning , Neoplasms/drug therapy , Neoplasms/genetics , Neural Networks, Computer
6.
PLoS One ; 14(12): e0225858, 2019.
Article in English | MEDLINE | ID: mdl-31825977

ABSTRACT

Around the world, scavenging birds such as vultures and condors have been experiencing drastic population declines. Scavenging birds have a distinct digestive process to deal with higher amounts of bacteria in their primary diet of carcasses in varying levels of decay. These observations motivate us to present an analysis of captive and healthy California condor (Gymnogyps californianus) microbiomes to characterize a population raised together under similar conditions. Shotgun metagenomic DNA sequences were analyzed from fecal and cloacal samples of captive birds. Classification of shotgun DNA sequence data with peptide signatures using the Sequedex package provided both phylogenetic and functional profiles, as well as individually annotated reads for targeted confirmatory analysis. We observed bacterial species previously associated with birds and gut microbiomes, including both virulent and opportunistic pathogens such as Clostridium perfringens, Propionibacterium acnes, Shigella flexneri, and Fusobacterium mortiferum, common flora such as Lactobacillus johnsonii, Lactobacillus ruminis, and Bacteroides vulgatus, and mucosal microbes such as Delftia acidovorans, Stenotrophomonas maltophilia, and Corynebacterium falsenii. Classification using shotgun metagenomic reads from phylogenetic marker genes was consistent with, and more specific than, analysis based on 16S rDNA data. Classification of samples based on either phylogenetic or functional profiles of genomic fragments differentiated three types of samples: fecal, mature cloacal and immature cloacal, with immature birds having approximately 40% higher diversity of microbes.


Subject(s)
Bacteria , Birds/microbiology , Metagenome , Microbiota/physiology , Animals , Bacteria/classification , Bacteria/genetics , Bacteria/growth & development
7.
BMC Bioinformatics ; 19(Suppl 18): 489, 2018 Dec 21.
Article in English | MEDLINE | ID: mdl-30577746

ABSTRACT

BACKGROUND: Histopathology images of tumor biopsies present unique challenges for applying machine learning to the diagnosis and treatment of cancer. The pathology slides are high resolution, often exceeding 1GB, have non-uniform dimensions, and often contain multiple tissue slices of varying sizes surrounded by large empty regions. The locations of abnormal or cancerous cells, which may constitute a small portion of any given tissue sample, are not annotated. Cancer image datasets are also extremely imbalanced, with most slides being associated with relatively common cancers. Since deep representations trained on natural photographs are unlikely to be optimal for classifying pathology slide images, which have different spectral ranges and spatial structure, we here describe an approach for learning features and inferring representations of cancer pathology slides based on sparse coding. RESULTS: We show that conventional transfer learning using a state-of-the-art deep learning architecture pre-trained on ImageNet (RESNET) and fine tuned for a binary tumor/no-tumor classification task achieved between 85% and 86% accuracy. However, when all layers up to the last convolutional layer in RESNET are replaced with a single feature map inferred via a sparse coding using a dictionary optimized for sparse reconstruction of unlabeled pathology slides, classification performance improves to over 93%, corresponding to a 54% error reduction. CONCLUSIONS: We conclude that a feature dictionary optimized for biomedical imagery may in general support better classification performance than does conventional transfer learning using a dictionary pre-trained on natural images.


Subject(s)
Deep Learning/trends , Neoplasms/pathology , Neural Networks, Computer , Humans
8.
BMC Bioinformatics ; 19(Suppl 18): 486, 2018 Dec 21.
Article in English | MEDLINE | ID: mdl-30577754

ABSTRACT

BACKGROUND: The National Cancer Institute drug pair screening effort against 60 well-characterized human tumor cell lines (NCI-60) presents an unprecedented resource for modeling combinational drug activity. RESULTS: We present a computational model for predicting cell line response to a subset of drug pairs in the NCI-ALMANAC database. Based on residual neural networks for encoding features as well as predicting tumor growth, our model explains 94% of the response variance. While our best result is achieved with a combination of molecular feature types (gene expression, microRNA and proteome), we show that most of the predictive power comes from drug descriptors. To further demonstrate value in detecting anticancer therapy, we rank the drug pairs for each cell line based on model predicted combination effect and recover 80% of the top pairs with enhanced activity. CONCLUSIONS: We present promising results in applying deep learning to predicting combinational drug response. Our feature analysis indicates screening data involving more cell lines are needed for the models to make better use of molecular features.


Subject(s)
Deep Learning/trends , Drug Evaluation, Preclinical/methods , Cell Line, Tumor , Humans , National Cancer Institute (U.S.) , Neural Networks, Computer , United States
9.
Pac Symp Biocomput ; : 433-44, 2013.
Article in English | MEDLINE | ID: mdl-23424147

ABSTRACT

This paper explores the application of text mining to the problem of detecting protein functional sites in the biomedical literature, and specifically considers the task of identifying catalytic sites in that literature. We provide strong evidence for the need for text mining techniques that address residue-level protein function annotation through an analysis of two corpora in terms of their coverage of curated data sources. We also explore the viability of building a text-based classifier for identifying protein functional sites, identifying the low coverage of curated data sources and the potential ambiguity of information about protein functional sites as challenges that must be addressed. Nevertheless we produce a simple classifier that achieves a reasonable ∼69% F-score on our full text silver corpus on the first attempt to address this classification task. The work has application in computational prediction of the functional significance of protein sites as well as in curation workflows for databases that capture this information.


Subject(s)
Proteins/chemistry , Amino Acids/chemistry , Artificial Intelligence , Binding Sites , Catalytic Domain , Computational Biology , Data Mining/statistics & numerical data , Databases, Protein/statistics & numerical data , Ligands , Natural Language Processing , Proteins/classification , Proteins/metabolism
10.
J Biomed Semantics ; 3 Suppl 3: S2, 2012 Oct 05.
Article in English | MEDLINE | ID: mdl-23046792

ABSTRACT

BACKGROUND: We propose a method for automatic extraction of protein-specific residue mentions from the biomedical literature. The method searches text for mentions of amino acids at specific sequence positions and attempts to correctly associate each mention with a protein also named in the text. The methods presented in this work will enable improved protein functional site extraction from articles, ultimately supporting protein function prediction. Our method made use of linguistic patterns for identifying the amino acid residue mentions in text. Further, we applied an automated graph-based method to learn syntactic patterns corresponding to protein-residue pairs mentioned in the text. We finally present an approach to automated construction of relevant training and test data using the distant supervision model. RESULTS: The performance of the method was assessed by extracting protein-residue relations from a new automatically generated test set of sentences containing high confidence examples found using distant supervision. It achieved an F-measure of 0.84 on an automatically created silver corpus and 0.79 on a manually annotated gold data set for this task, outperforming previous methods. CONCLUSIONS: The primary contributions of this work are to (1) demonstrate the effectiveness of distant supervision for automatic creation of training data for protein-residue relation extraction, substantially reducing the effort and time involved in manual annotation of a data set and (2) show that the graph-based relation extraction approach we used generalizes well to the problem of protein-residue association extraction. This work paves the way towards effective extraction of protein functional residues from the literature.

11.
BMC Res Notes ; 5: 460, 2012 Aug 28.
Article in English | MEDLINE | ID: mdl-22925230

ABSTRACT

BACKGROUND: Classification is difficult for shotgun metagenomics data from environments such as soils, where the diversity of sequences is high and where reference sequences from close relatives may not exist. Approaches based on sequence-similarity scores must deal with the confounding effects that inheritance and functional pressures exert on the relation between scores and phylogenetic distance, while approaches based on sequence alignment and tree-building are typically limited to a small fraction of gene families. We describe an approach based on finding one or more exact matches between a read and a precomputed set of peptide 10-mers. RESULTS: At even the largest phylogenetic distances, thousands of 10-mer peptide exact matches can be found between pairs of bacterial genomes. Genes that share one or more peptide 10-mers typically have high reciprocal BLAST scores. Among a set of 403 representative bacterial genomes, some 20 million 10-mer peptides were found to be shared. We assign each of these peptides as a signature of a particular node in a phylogenetic reference tree based on the RNA polymerase genes. We classify the phylogeny of a genomic fragment (e.g., read) at the most specific node on the reference tree that is consistent with the phylogeny of observed signature peptides it contains. Using both synthetic data from four newly-sequenced soil-bacterium genomes and ten real soil metagenomics data sets, we demonstrate a sensitivity and specificity comparable to that of the MEGAN metagenomics analysis package using BLASTX against the NR database. Phylogenetic and functional similarity metrics applied to real metagenomics data indicate a signal-to-noise ratio of approximately 400 for distinguishing among environments. Our method assigns ~6.6 Gbp/hr on a single CPU, compared with 25 kbp/hr for methods based on BLASTX against the NR database.
CONCLUSIONS: Classification by exact matching against a precomputed list of signature peptides provides comparable results to existing techniques for reads longer than about 300 bp and does not degrade severely with shorter reads. Orders of magnitude faster than existing methods, the approach is suitable now for inclusion in analysis pipelines and appears to be extensible in several different directions.
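
The exact-match classification described above can be sketched as a dictionary lookup followed by a walk on the reference tree. This is a toy illustration, not the authors' implementation: the tree, the signature table, the node names, and the assumption that the read has already been translated into a single peptide frame are all hypothetical. The sketch also assumes one reading of "most specific consistent node": if all matched nodes lie on one root-to-leaf path the deepest is returned, otherwise the lowest common ancestor is.

```python
from typing import Dict, List, Optional

K = 10  # signature peptide length given in the abstract

# Hypothetical reference tree (child -> parent) and signature table.
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Proteobacteria": "root",
    "Gammaproteobacteria": "Proteobacteria",
    "Firmicutes": "root",
}
SIGNATURES: Dict[str, str] = {
    "ACDEFGHIKL": "Proteobacteria",
    "MNPQRSTVWY": "Gammaproteobacteria",
    "LKIHGFEDCA": "Firmicutes",
}


def path_to_root(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the list [node, its parent, ..., root]."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = parent[current]
    return path


def classify_peptide(peptide: str,
                     signatures: Dict[str, str],
                     parent: Dict[str, Optional[str]]) -> Optional[str]:
    """Assign a translated read to a tree node by exact 10-mer matches."""
    hits = {signatures[peptide[i:i + K]]
            for i in range(len(peptide) - K + 1)
            if peptide[i:i + K] in signatures}
    if not hits:
        return None  # no signature peptide found in the read
    paths = {node: path_to_root(node, parent) for node in hits}
    deepest = max(hits, key=lambda node: len(paths[node]))
    if all(node in paths[deepest] for node in hits):
        return deepest  # all hits lie on one lineage: keep the most specific
    # Conflicting lineages: fall back to the lowest common ancestor.
    common = set.intersection(*(set(p) for p in paths.values()))
    return max(common, key=lambda node: len(path_to_root(node, parent)))
```

Because each lookup is a hash-table probe rather than an alignment, the cost per read grows only with read length, which is in keeping with the speed advantage the abstract reports.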


Subject(s)
Bacterial Proteins/genetics , DNA-Directed RNA Polymerases/genetics , Genome, Bacterial , Metagenomics/methods , Oligopeptides/genetics , Phylogeny , Sequence Analysis, DNA , Soil Microbiology , Bacterial Proteins/classification , Base Sequence , DNA-Directed RNA Polymerases/classification , Databases, Genetic , Gene Expression Profiling , Oligopeptides/classification , Sequence Alignment , Sequence Homology, Nucleic Acid , Species Specificity , Transcriptome
12.
PLoS One ; 7(2): e32171, 2012.
Article in English | MEDLINE | ID: mdl-22393388

ABSTRACT

We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions.


Subject(s)
Computational Biology/methods , Data Mining/methods , Animals , Binding Sites , Catalytic Domain , Crystallography, X-Ray/methods , Databases, Protein , Humans , Models, Molecular , Models, Statistical , Molecular Conformation , Protein Structure, Tertiary , Proteins/chemistry , Sequence Analysis, Protein/methods , Software
13.
PLoS Comput Biol ; 7(11): e1002284, 2011 Nov.
Article in English | MEDLINE | ID: mdl-22131910

ABSTRACT

Recent studies have noted extensive inconsistencies in gene start sites among orthologous genes in related microbial genomes. Here we provide the first documented evidence that imposing gene start consistency improves the accuracy of gene start-site prediction. We applied an algorithm using a genome majority vote (GMV) scheme to increase the consistency of gene starts among orthologs. We used a set of validated Escherichia coli genes as a standard to quantify accuracy. Results showed that the GMV algorithm can correct hundreds of gene prediction errors in sets of five or ten genomes while introducing few errors. Using a conservative calculation, we project that GMV would resolve many inconsistencies and errors in publicly available microbial gene maps. Our simple and logical solution provides a notable advance toward accurate gene maps.
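
The abstract does not spell out the GMV procedure, so the following is only a sketch of the majority-vote idea under stated assumptions: the start sites of an ortholog set are assumed to be expressed in a shared alignment coordinate, a strict majority is required, and the function names and gene identifiers are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def majority_start(starts: Dict[str, int]) -> Optional[int]:
    """Return the start position shared by a strict majority of orthologs.

    `starts` maps a gene identifier to its start position in a shared
    alignment coordinate (an assumption of this sketch).
    """
    if not starts:
        return None
    position, votes = Counter(starts.values()).most_common(1)[0]
    return position if 2 * votes > len(starts) else None


def propose_corrections(starts: Dict[str, int]) -> Dict[str, int]:
    """Map each out-voted gene to the majority start it would be moved to."""
    consensus = majority_start(starts)
    if consensus is None:
        return {}  # no strict majority: leave the ortholog set untouched
    return {gene: consensus for gene, start in starts.items() if start != consensus}
```

Requiring a strict majority is a conservative design choice in this sketch: when the orthologs split evenly, no gene is changed, which limits the risk of introducing new errors.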


Subject(s)
Computational Biology/methods , Genes, Bacterial , Genome, Bacterial , Models, Genetic , Models, Statistical , Algorithms , Base Sequence , Chromosome Mapping , Computer Simulation , Escherichia coli , Molecular Sequence Data , Sequence Alignment , Transcription Initiation Site
14.
BMC Genomics ; 12: 125, 2011 Feb 22.
Article in English | MEDLINE | ID: mdl-21342528

ABSTRACT

BACKGROUND: Evolutionary divergence in the position of the translational start site among orthologous genes can have significant functional impacts. Divergence can alter the translation rate, degradation rate, subcellular location, and function of the encoded proteins. RESULTS: Existing Genbank gene maps for Burkholderia genomes suggest that extensive divergence has occurred--53% of ortholog sets based on Genbank gene maps had inconsistent gene start sites. However, most of these inconsistencies appear to be gene-calling errors. Evolutionary divergence was the most plausible explanation for only 17% of the ortholog sets. Correcting probable errors in the Genbank gene maps decreased the percentage of ortholog sets with inconsistent starts by 68%, increased the percentage of ortholog sets with extractable upstream intergenic regions by 32%, increased the sequence similarity of intergenic regions and predicted proteins, and increased the number of proteins with identifiable signal peptides. CONCLUSIONS: Our findings highlight an emerging problem in comparative genomics: single-digit percent errors in gene predictions can lead to double-digit percentages of inconsistent ortholog sets. The work demonstrates a simple approach to evaluate and improve the quality of gene maps.


Subject(s)
Burkholderia/genetics , Evolution, Molecular , Genome, Bacterial , Genomics/methods , Transcription Initiation Site , Chromosome Mapping , DNA, Bacterial/genetics , DNA, Intergenic/genetics , Protein Sorting Signals , Sequence Alignment , Software
15.
BMC Struct Biol ; 8: 5, 2008 Jan 30.
Article in English | MEDLINE | ID: mdl-18234095

ABSTRACT

BACKGROUND: We present a fast version of the dynamics perturbation analysis (DPA) algorithm to predict functional sites in protein structures. The original DPA algorithm finds regions in proteins where interactions cause a large change in the protein conformational distribution, as measured using the relative entropy Dx. Such regions are associated with functional sites. RESULTS: The Fast DPA algorithm, which accelerates DPA calculations, is motivated by an empirical observation that Dx in a normal-modes model is highly correlated with an entropic term that only depends on the eigenvalues of the normal modes. The eigenvalues are accurately estimated using first-order perturbation theory, resulting in an N-fold reduction in the overall computational requirements of the algorithm, where N is the number of residues in the protein. The performance of the original and Fast DPA algorithms was compared using protein structures from a standard small-molecule docking test set. For nominal implementations of each algorithm, top-ranked Fast DPA predictions overlapped the true binding site 94% of the time, compared to 87% of the time for original DPA. In addition, per-protein recall statistics (fraction of binding-site residues that are among predicted residues) were slightly better for Fast DPA. On the other hand, per-protein precision statistics (fraction of predicted residues that are among binding-site residues) were slightly better using original DPA. Overall, the performance of Fast DPA in predicting ligand-binding-site residues was comparable to that of the original DPA algorithm. CONCLUSION: Compared to the original DPA algorithm, the decreased run time with comparable performance makes Fast DPA well-suited for implementation on a web server and for high-throughput analysis.
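
The per-protein recall and precision statistics defined in the abstract can be computed from two sets of residue numbers. The sketch below follows those two definitions directly; the function name, the residue numbers in the usage note, and the convention of returning 0.0 for an empty set are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def precision_recall(predicted: Iterable[int],
                     binding_site: Iterable[int]) -> Tuple[float, float]:
    """Per-protein precision and recall over residue numbers.

    precision = fraction of predicted residues that are binding-site residues
    recall    = fraction of binding-site residues that are among the predictions
    """
    predicted_set, site_set = set(predicted), set(binding_site)
    overlap = len(predicted_set & site_set)
    precision = overlap / len(predicted_set) if predicted_set else 0.0
    recall = overlap / len(site_set) if site_set else 0.0
    return precision, recall
```

For example, predicting residues 10 to 13 for a protein whose binding site is residues 12 to 17 gives an overlap of two residues, hence a precision of 2/4 and a recall of 2/6.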


Subject(s)
Algorithms , Proteins/chemistry , Binding Sites , Models, Molecular
16.
Acta Crystallogr D Biol Crystallogr ; 63(Pt 1): 101-7, 2007 Jan.
Article in English | MEDLINE | ID: mdl-17164532

ABSTRACT

A procedure for the identification of ligands bound in crystal structures of macromolecules is described. Two characteristics of the density corresponding to a ligand are used in the identification procedure. One is the correlation of the ligand density with each of a set of test ligands after optimization of the fit of that ligand to the density. The other is the correlation of a fingerprint of the density with the fingerprint of model density for each possible ligand. The fingerprints consist of an ordered list of correlations of each of the test ligands with the density. The two characteristics are scored using a Z-score approach in which the correlations are normalized to the mean and standard deviation of correlations found for a variety of mismatched ligand-density pairs, so that the Z scores are related to the probability of observing a particular value of the correlation by chance. The procedure was tested with a set of 200 of the most commonly found ligands in the Protein Data Bank, collectively representing 57% of all ligands in the Protein Data Bank. Using a combination of these two characteristics of ligand density, ranked lists of ligand identifications were made for representative (Fo - Fc)exp(i phi(c)) difference density from entries in the Protein Data Bank. In 48% of the 200 cases, the correct ligand was at the top of the ranked list of ligands. This approach may be useful in identification of unknown ligands in new macromolecular structures as well as in the identification of which ligands in a mixture have bound to a macromolecule.
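
The Z-score normalization described above amounts to expressing an observed correlation in units of the spread seen for mismatched ligand-density pairs. A minimal sketch follows; the function name and the toy numbers are hypothetical, and the use of the sample standard deviation is an assumption, since the abstract does not say which estimator was used.

```python
from statistics import mean, stdev
from typing import Sequence


def z_score(correlation: float, mismatched: Sequence[float]) -> float:
    """Normalize a correlation against a background of mismatched pairs.

    `mismatched` holds correlations measured for ligand-density pairs that
    are known not to match; it needs at least two values with nonzero spread.
    """
    background_mean = mean(mismatched)
    background_sd = stdev(mismatched)  # sample standard deviation (assumption)
    if background_sd == 0:
        raise ValueError("mismatched correlations have zero spread")
    return (correlation - background_mean) / background_sd
```

With mismatched correlations of 0.2, 0.3 and 0.4, an observed correlation of 0.6 sits about three standard deviations above the background mean, which is the sense in which a high Z score indicates a match unlikely to arise by chance.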


Subject(s)
Computational Biology/methods , Macromolecular Substances , Algorithms , Animals , Bacteriochlorophylls/chemistry , Cluster Analysis , Crystallography, X-Ray , Databases, Protein , Electrons , Humans , Ligands , Molecular Conformation , Probability , Protein Binding , Protein Conformation , Proteins/chemistry
17.
Acta Crystallogr D Biol Crystallogr ; 62(Pt 8): 915-22, 2006 Aug.
Article in English | MEDLINE | ID: mdl-16855309

ABSTRACT

A procedure for fitting of ligands to electron-density maps by first fitting a core fragment of the ligand to density and then extending the remainder of the ligand into density is presented. The approach was tested by fitting 9327 ligands over a wide range of resolutions (most are in the range 0.8-4.8 Å) from the Protein Data Bank (PDB) into (Fo - Fc)exp(i phi(c)) difference density calculated using entries from the PDB without these ligands. The procedure was able to place 58% of these 9327 ligands within 2 Å (r.m.s.d.) of the coordinates of the atoms in the original PDB entry for that ligand. The success of the fitting procedure was relatively insensitive to the size of the ligand in the range 10-100 non-H atoms and was only moderately sensitive to resolution, with the percentage of ligands placed near the coordinates of the original PDB entry for fits in the range 58-73% over all resolution ranges tested.


Subject(s)
Algorithms , Databases, Protein , Models, Molecular , Proteins/chemistry , Binding Sites , Ligands , Protein Conformation
18.
Protein Sci ; 15(6): 1544-9, 2006 Jun.
Article in English | MEDLINE | ID: mdl-16672243

ABSTRACT

Automated function prediction (AFP) methods increasingly use knowledge discovery algorithms to map sequence, structure, literature, and/or pathway information about proteins whose functions are unknown into functional ontologies, typically (a portion of) the Gene Ontology (GO). While there are a growing number of methods within this paradigm, the general problem of assessing the accuracy of such prediction algorithms has not been seriously addressed. We present first an application for function prediction from protein sequences using the POSet Ontology Categorizer (POSOC) to produce new annotations by analyzing collections of GO nodes derived from annotations of protein BLAST neighborhoods. We then also present hierarchical precision and hierarchical recall as new evaluation metrics for assessing the accuracy of any predictions in hierarchical ontologies, and discuss results on a test set of protein sequences. We show that our method provides substantially improved hierarchical precision (measure of predictions made that are correct) when applied to the nearest BLAST neighbors of target proteins, as compared with simply imputing that neighborhood's annotations to the target. Moreover, when our method is applied to a broader BLAST neighborhood, hierarchical precision is enhanced even further. In all cases, such increased hierarchical precision performance is purchased at a modest expense of hierarchical recall (measure of all annotations that get predicted at all).
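
Hierarchical precision and recall can be illustrated with one common ancestor-closure formulation: both the predicted and the true annotation sets are extended with all of their ancestors before the overlap is measured. This formulation is an assumption of the sketch and may differ in detail from the definitions in the paper; the toy ontology and node labels are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def ancestor_closure(nodes: Iterable[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Return the given nodes together with all of their ancestors.

    `parents` maps a node to its direct parents; a node may have several,
    since the Gene Ontology is a directed acyclic graph rather than a tree.
    """
    closed: Set[str] = set()
    stack = list(nodes)
    while stack:
        node = stack.pop()
        if node not in closed:
            closed.add(node)
            stack.extend(parents.get(node, []))
    return closed


def hierarchical_pr(predicted: Iterable[str],
                    true: Iterable[str],
                    parents: Dict[str, List[str]]) -> Tuple[float, float]:
    """Hierarchical precision and recall over ancestor-closed node sets."""
    predicted_closed = ancestor_closure(predicted, parents)
    true_closed = ancestor_closure(true, parents)
    overlap = len(predicted_closed & true_closed)
    precision = overlap / len(predicted_closed) if predicted_closed else 0.0
    recall = overlap / len(true_closed) if true_closed else 0.0
    return precision, recall


# Hypothetical toy ontology: B and C are children of A; A and D are children of root.
TOY_PARENTS: Dict[str, List[str]] = {
    "root": [], "A": ["root"], "B": ["A"], "C": ["A"], "D": ["root"],
}
```

In the toy ontology, predicting the general node A when the true annotation is its child B yields full precision but only two thirds recall, which mirrors the trade-off between the two measures noted in the abstract.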


Subject(s)
Computational Biology/methods , Proteins/chemistry , Proteins/metabolism , Software
19.
BMC Bioinformatics ; 6 Suppl 1: S20, 2005.
Article in English | MEDLINE | ID: mdl-15960833

ABSTRACT

BACKGROUND: We participated in the BioCreAtIvE Task 2, which addressed the annotation of proteins into the Gene Ontology (GO) based on the text of a given document and the selection of evidence text from the document justifying that annotation. We approached the task utilizing several combinations of two distinct methods: an unsupervised algorithm for expanding words associated with GO nodes, and an annotation methodology which treats annotation as categorization of terms from a protein's document neighborhood into the GO. RESULTS: The evaluation results indicate that the method for expanding words associated with GO nodes is quite powerful; we were able to successfully select appropriate evidence text for a given annotation in 38% of Task 2.1 queries by building on this method. The term categorization methodology achieved a precision of 16% for annotation within the correct extended family in Task 2.2, though we show through subsequent analysis that this can be improved with a different parameter setting. Our architecture proved not to be very successful on the evidence text component of the task, in the configuration used to generate the submitted results. CONCLUSION: The initial results show promise for both of the methods we explored, and we are planning to integrate the methods more closely to achieve better results overall.


Subject(s)
Databases, Genetic/classification , Genes , Pattern Recognition, Automated/methods , Proteins/classification , Writing
20.
Nature ; 432(7020): 988-94, 2004 Dec 23.
Article in English | MEDLINE | ID: mdl-15616553

ABSTRACT

Human chromosome 16 features one of the highest levels of segmentally duplicated sequence among the human autosomes. We report here the 78,884,754 base pairs of finished chromosome 16 sequence, representing over 99.9% of its euchromatin. Manual annotation revealed 880 protein-coding genes confirmed by 1,670 aligned transcripts, 19 transfer RNA genes, 341 pseudogenes and three RNA pseudogenes. These genes include metallothionein, cadherin and iroquois gene families, as well as the disease genes for polycystic kidney disease and acute myelomonocytic leukaemia. Several large-scale structural polymorphisms spanning hundreds of kilobase pairs were identified and result in gene content differences among humans. Whereas the segmental duplications of chromosome 16 are enriched in the relatively gene-poor pericentromere of the p arm, some are involved in recent gene duplication and conversion events that are likely to have had an impact on the evolution of primates and human disease susceptibility.


Subject(s)
Chromosomes, Human, Pair 16/genetics , Gene Duplication , Physical Chromosome Mapping , Animals , Genes/genetics , Genomics , Heterochromatin/genetics , Humans , Molecular Sequence Data , Polymorphism, Genetic/genetics , Sequence Analysis, DNA , Synteny/genetics