Results 1 - 14 of 14
1.
PeerJ ; 11: e16026, 2023.
Article in English | MEDLINE | ID: mdl-37727687

ABSTRACT

The discovery of low-coverage (i.e. uncovered) regions containing clinically significant variants, especially when they are related to the patient's clinical phenotype, is critical for whole-exome sequencing (WES) based clinical diagnosis. Therefore, it is essential to develop tools to identify the existence of clinically important variants in low-coverage regions. Here, we introduce a desktop application, namely DEVOUR (DEleterious Variants On Uncovered Regions), that analyzes read alignments for WES experiments, identifies genomic regions with no or low-coverage (read depth < 5) and then annotates known variants in the low-coverage regions using clinical variant annotation databases. As a proof of concept, DEVOUR was used to analyze a total of 28 samples from a publicly available Hirschsprung disease-related WES project (NCBI Bioproject: https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJEB19327), revealing the potential existence of 98 disease-associated variants in low-coverage regions. DEVOUR is available from https://github.com/projectDevour/DEVOUR under the MIT license.
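The low-coverage step described in this abstract can be illustrated with a minimal sketch. It assumes per-base read depths are already available as a list of integers for one contig; the function names and the 1-based inclusive coordinates are illustrative assumptions, not DEVOUR's actual code. Only the threshold (read depth < 5) comes from the abstract.

```python
from typing import Iterable, List, Tuple


def low_coverage_regions(
    depths: Iterable[int], min_depth: int = 5, start: int = 1
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) intervals where depth < min_depth.

    `depths` holds one read-depth value per consecutive base; `start` is the
    coordinate of the first base (1-based by default, an assumption).
    """
    regions: List[Tuple[int, int]] = []
    run_start = None
    pos = start - 1  # stays here if `depths` is empty
    for pos, depth in enumerate(depths, start=start):
        if depth < min_depth:
            if run_start is None:
                run_start = pos  # a low-coverage run opens here
        elif run_start is not None:
            regions.append((run_start, pos - 1))  # run closed at previous base
            run_start = None
    if run_start is not None:
        regions.append((run_start, pos))  # run extends to the last base
    return regions


def variants_in_regions(
    variant_positions: Iterable[int], regions: List[Tuple[int, int]]
) -> List[int]:
    """Keep the known variant positions that fall inside any region."""
    return [p for p in variant_positions if any(s <= p <= e for s, e in regions)]


# Hypothetical depths for a 7-base stretch, and hypothetical variant positions.
regions = low_coverage_regions([10, 3, 0, 4, 8, 9, 2])
missed = variants_in_regions([1, 3, 7, 8], regions)
```

A production tool would read depths from alignment files and look variants up with an interval index instead of this linear scan.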


Subject(s)
Existentialism , Hirschsprung Disease , Humans , Exome Sequencing , Databases, Factual , Genomics , Hirschsprung Disease/diagnosis
2.
Bioinformatics ; 31(6): 926-32, 2015 Mar 15.
Article in English | MEDLINE | ID: mdl-25398609

ABSTRACT

MOTIVATION: UniRef databases provide full-scale clustering of UniProtKB sequences and are utilized for a broad range of applications, particularly similarity-based functional annotation. Non-redundancy and intra-cluster homogeneity in UniRef were recently improved by adding a sequence length overlap threshold. Our hypothesis is that these improvements would enhance the speed and sensitivity of similarity searches and improve the consistency of annotation within clusters. RESULTS: Intra-cluster molecular function consistency was examined by analysis of Gene Ontology terms. Results show that UniRef clusters bring together proteins of identical molecular function in more than 97% of the clusters, implying that clusters are useful for annotation and can also be used to detect annotation inconsistencies. To examine coverage in similarity results, BLASTP searches against UniRef50 followed by expansion of the hit lists with cluster members demonstrated advantages compared with searches against UniProtKB sequences; the searches are concise (∼7 times shorter hit list before expansion), faster (∼6 times) and more sensitive in detection of remote similarities (>96% recall at e-value <0.0001). Our results support the use of UniRef clusters as a comprehensive and scalable alternative to native sequence databases for similarity searches and reinforce their reliability for use in functional annotation.
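The "expansion of the hit lists with cluster members" mentioned in this abstract can be sketched as a simple lookup. The cluster identifiers, accessions, and the `expand_hits` function below are hypothetical placeholders; how the authors implemented the expansion is not stated in the abstract.

```python
from typing import Dict, List


def expand_hits(hits: List[str], cluster_members: Dict[str, List[str]]) -> List[str]:
    """Replace each representative hit by its cluster members.

    Order follows the original hit list; duplicates are dropped. A hit with
    no known cluster is kept as-is.
    """
    expanded: List[str] = []
    seen = set()
    for representative in hits:
        for member in cluster_members.get(representative, [representative]):
            if member not in seen:
                seen.add(member)
                expanded.append(member)
    return expanded


# Hypothetical cluster table and a short hit list from a search against it.
members = {"clusterA": ["P1", "P2"], "clusterB": ["P2", "P3"]}
full_list = expand_hits(["clusterA", "clusterB", "clusterC"], members)
```

Searching the smaller representative set first and expanding afterwards is what keeps the initial hit list short while still recovering all member sequences.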


Subject(s)
Computational Biology , Databases, Protein , Dioxygenases/metabolism , Membrane Proteins/metabolism , Proteins/metabolism , Sequence Analysis, Protein , Software , AlkB Homolog 5, RNA Demethylase , Cluster Analysis , Dioxygenases/chemistry , Dioxygenases/genetics , Gene Ontology , Humans , Information Storage and Retrieval , Membrane Proteins/chemistry , Membrane Proteins/genetics , Molecular Sequence Annotation , Proteins/chemistry , Proteins/genetics
3.
BMC Immunol ; 15: 61, 2014 Dec 09.
Article in English | MEDLINE | ID: mdl-25486901

ABSTRACT

BACKGROUND: Near universal administration of vaccines mandates intense pharmacovigilance for vaccine safety and a stringently low tolerance for adverse events. Reports of autoimmune diseases (AID) following vaccination have been challenging to evaluate given the high rates of vaccination, background incidence of autoimmunity, and low incidence and variable times for onset of AID after vaccinations. In order to identify biologically plausible pathways to adverse autoimmune events of vaccine-related AID, we used a systems biology approach to create a matrix of innate and adaptive immune mechanisms active in specific diseases, responses to vaccine antigens, adjuvants, preservatives and stabilizers, for the most common vaccine-associated AID found in the Vaccine Adverse Event Reporting System. RESULTS: This report focuses on Guillain-Barre Syndrome (GBS), Rheumatoid Arthritis (RA), Systemic Lupus Erythematosus (SLE), and Idiopathic (or immune) Thrombocytopenic Purpura (ITP). Multiple curated databases and automated text mining of PubMed literature identified 667 genes associated with RA, 448 with SLE, 49 with ITP and 73 with GBS. While all data sources provided valuable and unique gene associations, text mining using natural language processing (NLP) algorithms provided the most information but required curation to remove incorrect associations. Six genes were associated with all four AIDs. Thirty-three pathways were shared by the four AIDs. Classification of genes into twelve immune system related categories identified more "Th17 T-cell subtype" genes in RA than the other AIDs, and more "Chemokine plus Receptors" genes associated with RA than SLE. Gene networks were visualized and clustered into interconnected modules with specific gene clusters for each AID, including one in RA with ten C-X-C motif chemokines. The intersection of genes associated with GBS, GBS peptide auto-antigens, influenza A infection, and influenza vaccination created a subnetwork of genes that inferred a possible role for the MAPK signaling pathway in influenza vaccine related GBS. CONCLUSIONS: Results showing unique and common gene sets, pathways, immune system categories and functional clusters of genes in four autoimmune diseases suggest it is possible to develop molecular classifications of autoimmune and inflammatory events. Combining this information with cellular and other disease responses should greatly aid in the assessment of potential immune-mediated adverse events following vaccination.


Subject(s)
Autoimmune Diseases , Computer Simulation , Infection Control , Infections/immunology , Models, Immunological , Vaccination , Vaccines , Adaptive Immunity , Autoimmune Diseases/genetics , Autoimmune Diseases/immunology , Autoimmune Diseases/pathology , Humans , Infections/genetics , Infections/pathology , Vaccines/adverse effects , Vaccines/immunology
4.
Bioinformatics ; 29(21): 2808-9, 2013 Nov 01.
Article in English | MEDLINE | ID: mdl-23958731

ABSTRACT

SUMMARY: We have developed a new web application for peptide matching using an Apache Lucene-based search engine. The Peptide Match service is designed to quickly retrieve all occurrences of a given query peptide from UniProt Knowledgebase (UniProtKB) with isoforms. The matched proteins are shown in summary tables with rich annotations, including matched sequence region(s) and links to corresponding proteins in a number of proteomic/peptide spectral databases. The results are grouped by taxonomy and can be browsed by organism, taxonomic group or taxonomy tree. The service supports queries where isobaric leucine and isoleucine are treated as equivalent, and an option for searching UniRef100 representative sequences, as well as dynamic queries to major proteomic databases. In addition to the web interface, we also provide RESTful web services. The underlying data are updated every 4 weeks in accordance with the UniProt releases. AVAILABILITY: http://proteininformationresource.org/peptide.shtml. CONTACT: chenc@udel.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
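The leucine/isoleucine option described in this abstract can be illustrated with a naive string scan. This is not the Lucene-based index the service uses; `find_peptide` and the example sequence are hypothetical, and the sketch shows only the matching rule: I and L are collapsed to one symbol before comparison because the two residues have the same mass.

```python
from typing import List


def find_peptide(sequence: str, peptide: str, il_equivalent: bool = True) -> List[int]:
    """Return 1-based start positions of every (possibly overlapping) match."""
    if not peptide:
        return []
    if il_equivalent:
        # Collapse isoleucine onto leucine in both strings before matching.
        sequence = sequence.replace("I", "L")
        peptide = peptide.replace("I", "L")
    positions: List[int] = []
    index = sequence.find(peptide)
    while index != -1:
        positions.append(index + 1)
        index = sequence.find(peptide, index + 1)  # step by one to allow overlaps
    return positions


# Hypothetical protein fragment: "LIK" at position 3 matches "LLK" only when
# I and L are treated as equivalent.
with_il = find_peptide("MALIKALLK", "LLK")
exact_only = find_peptide("MALIKALLK", "LLK", il_equivalent=False)
```

A linear scan like this is fine for a single sequence; searching a whole knowledgebase needs an index, which is the role the search engine plays in the service.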


Subject(s)
Databases, Protein , Peptides/chemistry , Search Engine , Internet , Knowledge Bases , Proteomics , Sequence Analysis, Protein
5.
J Digit Imaging ; 26(4): 630-41, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23589184

ABSTRACT

A widening array of novel imaging biomarkers is being developed using ever more powerful clinical and preclinical imaging modalities. These biomarkers have demonstrated effectiveness in quantifying biological processes as they occur in vivo and in the early prediction of therapeutic outcomes. However, quantitative imaging biomarker data and knowledge are not standardized, representing a critical barrier to accumulating medical knowledge based on quantitative imaging data. We use an ontology to represent, integrate, and harmonize heterogeneous knowledge across the domain of imaging biomarkers. This advances the goal of developing applications to (1) improve precision and recall of storage and retrieval of quantitative imaging-related data using standardized terminology; (2) streamline the discovery and development of novel imaging biomarkers by normalizing knowledge across heterogeneous resources; (3) effectively annotate imaging experiments thus aiding comprehension, re-use, and reproducibility; and (4) provide validation frameworks through rigorous specification as a basis for testable hypotheses and compliance tests. We have developed the Quantitative Imaging Biomarker Ontology (QIBO), which currently consists of 488 terms spanning the following upper classes: experimental subject, biological intervention, imaging agent, imaging instrument, image post-processing algorithm, biological target, indicated biology, and biomarker application. We have demonstrated that QIBO can be used to annotate imaging experiments with standardized terms in the ontology and to generate hypotheses for novel imaging biomarker-disease associations. Our results established the utility of QIBO in enabling integrated analysis of quantitative imaging data.


Subject(s)
Biomarkers , Biomedical Research , Diagnostic Imaging , Medical Informatics/methods , Biological Ontologies , Databases, Factual , Humans , Medical Informatics/standards , Reproducibility of Results
6.
J Digit Imaging ; 26(4): 614-29, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23546775

ABSTRACT

Quantitative imaging biomarkers are of particular interest in drug development for their potential to accelerate the drug development pipeline. The lack of consensus methods and carefully characterized performance hampers the widespread availability of these quantitative measures. A framework to support collaborative work on quantitative imaging biomarkers would entail advanced statistical techniques, the development of controlled vocabularies, and a service-oriented architecture for processing large image archives. Until now, this framework has not been developed. With the availability of tools for automatic ontology-based annotation of datasets, coupled with image archives, and a means for batch selection and processing of image and clinical data, imaging will go through a similar increase in capability analogous to what advanced genetic profiling techniques have brought to molecular biology. We report on our current progress on developing an informatics infrastructure to store, query, and retrieve imaging biomarker data across a wide range of resources in a semantically meaningful way that facilitates the collaborative development and validation of potential imaging biomarkers by many stakeholders. Specifically, we describe the semantic components of our system, QI-Bench, that are used to specify and support experimental activities for statistical validation in quantitative imaging.


Subject(s)
Biomarkers/analysis , Diagnostic Imaging/methods , Diagnostic Imaging/statistics & numerical data , Medical Informatics/methods , Medical Informatics/statistics & numerical data , Algorithms , Data Interpretation, Statistical , Databases, Factual/statistics & numerical data , Humans , Imaging, Three-Dimensional , Reproducibility of Results
7.
J Am Med Inform Assoc ; 19(6): 1095-102, 2012.
Article in English | MEDLINE | ID: mdl-22744959

ABSTRACT

OBJECTIVE: Meaningful exchange of information is a fundamental challenge in collaborative biomedical research. To help address this, the authors developed the Life Sciences Domain Analysis Model (LS DAM), an information model that provides a framework for communication among domain experts and technical teams developing information systems to support biomedical research. The LS DAM is harmonized with the Biomedical Research Integrated Domain Group (BRIDG) model of protocol-driven clinical research. Together, these models can facilitate data exchange for translational research. MATERIALS AND METHODS: The content of the LS DAM was driven by analysis of life sciences and translational research scenarios and the concepts in the model are derived from existing information models, reference models and data exchange formats. The model is represented in the Unified Modeling Language and uses ISO 21090 data types. RESULTS: The LS DAM v2.2.1 is comprised of 130 classes and covers several core areas including Experiment, Molecular Biology, Molecular Databases and Specimen. Nearly half of these classes originate from the BRIDG model, emphasizing the semantic harmonization between these models. Validation of the LS DAM against independently derived information models, research scenarios and reference databases supports its general applicability to represent life sciences research. DISCUSSION: The LS DAM provides unambiguous definitions for concepts required to describe life sciences research. The processes established to achieve consensus among domain experts will be applied in future iterations and may be broadly applicable to other standardization efforts. CONCLUSIONS: The LS DAM provides common semantics for life sciences research. Through harmonization with BRIDG, it promotes interoperability in translational science.


Subject(s)
Biological Science Disciplines , Information Dissemination , Information Systems , Systems Integration , Translational Research, Biomedical , Humans , Information Storage and Retrieval , Reference Standards , Semantics , Unified Medical Language System
8.
J Am Med Inform Assoc ; 19(e1): e125-8, 2012 Jun.
Article in English | MEDLINE | ID: mdl-22323393

ABSTRACT

Quality control and harmonization of data is a vital and challenging undertaking for any successful data coordination center and a responsibility shared between the multiple sites that produce, integrate, and utilize the data. Here we describe a coordinated effort between scientists and data managers in the Cancer Family Registries to implement a data governance infrastructure consisting of both organizational and technical solutions. The technical solution uses a rule-based validation system that facilitates error detection and correction for data centers submitting data to a central informatics database. Validation rules comprise both standard checks on allowable values and a crosscheck of related database elements for logical and scientific consistency. Evaluation over a 2-year timeframe showed a significant decrease in the number of errors in the database and a concurrent increase in data consistency and accuracy.
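The two kinds of validation rules named in this abstract (checks on allowable values and crosschecks between related elements) can be sketched as small functions that return an error message or nothing. The field names, allowed codes, and rule set below are hypothetical and are not taken from the Cancer Family Registries system.

```python
from typing import Callable, Dict, List, Optional, Set

Record = Dict[str, object]
Rule = Callable[[Record], Optional[str]]


def allowed_values(field: str, allowed: Set[object]) -> Rule:
    """Build a rule that checks one field against a set of permitted values."""
    def rule(record: Record) -> Optional[str]:
        value = record.get(field)
        if value not in allowed:
            return f"{field}: value {value!r} is not allowed"
        return None
    return rule


def diagnosis_not_before_birth(record: Record) -> Optional[str]:
    """Crosscheck two related fields for logical consistency."""
    birth = record.get("birth_year")
    diagnosis = record.get("diagnosis_year")
    if isinstance(birth, int) and isinstance(diagnosis, int) and diagnosis < birth:
        return "diagnosis_year precedes birth_year"
    return None


def validate(record: Record, rules: List[Rule]) -> List[str]:
    """Run every rule and collect the error messages, in rule order."""
    errors: List[str] = []
    for rule in rules:
        message = rule(record)
        if message is not None:
            errors.append(message)
    return errors


# Hypothetical rule set applied to one clean and one inconsistent record.
RULES: List[Rule] = [allowed_values("sex", {"F", "M", "U"}), diagnosis_not_before_birth]
clean = validate({"sex": "F", "birth_year": 1950, "diagnosis_year": 2001}, RULES)
flagged = validate({"sex": "X", "birth_year": 1980, "diagnosis_year": 1975}, RULES)
```

Keeping each rule as an independent function makes it straightforward to report all problems in a submission at once, so a submitting site can correct them in one pass.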


Subject(s)
Breast Neoplasms , Colonic Neoplasms , Databases, Factual/standards , Registries/standards , Breast Neoplasms/epidemiology , Colonic Neoplasms/epidemiology , Databases, Factual/statistics & numerical data , Humans , Quality Control , Research Design , United States
9.
Bioinformatics ; 27(8): 1190-1, 2011 Apr 15.
Article in English | MEDLINE | ID: mdl-21478197

ABSTRACT

MOTIVATION: Identifier (ID) mapping establishes links between various biological databases and is an essential first step for molecular data integration and functional annotation. ID mapping allows diverse molecular data on genes and proteins to be combined and mapped to functional pathways and ontologies. We have developed comprehensive protein-centric ID mapping services providing mappings for 90 IDs derived from databases on genes, proteins, pathways, diseases, structures, protein families, protein interaction, literature, ontologies, etc. The services are widely used and have been regularly updated since 2006. AVAILABILITY: www.uniprot.org/mapping and proteininformationresource.org/pirwww/search/idmapping.shtml. CONTACT: huang@dbi.udel.edu.


Subject(s)
Databases, Protein , Proteins/chemistry , Proteins/genetics , Software , Internet
10.
BMC Bioinformatics ; 10: 136, 2009 May 08.
Article in English | MEDLINE | ID: mdl-19426475

ABSTRACT

BACKGROUND: The UniProt consortium was formed in 2002 by groups from the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI) and the Protein Information Resource (PIR) at Georgetown University, and soon afterwards the website http://www.uniprot.org was set up as a central entry point to UniProt resources. Requests to this address were redirected to one of the three organisations' websites. While these sites shared a set of static pages with general information about UniProt, their pages for searching and viewing data were different. To provide users with a consistent view and to cut the cost of maintaining three separate sites, the consortium decided to develop a common website for UniProt. Following several years of intense development and a year of public beta testing, the http://www.uniprot.org domain was switched to the newly developed site described in this paper in July 2008. DESCRIPTION: The UniProt consortium is the main provider of protein sequence and annotation data for much of the life sciences community. The http://www.uniprot.org website is the primary access point to this data and to documentation and basic tools for the data. These tools include full text and field-based text search, similarity search, multiple sequence alignment, batch retrieval and database identifier mapping. This paper discusses the design and implementation of the new website, which was released in July 2008, and shows how it improves data access for users with different levels of experience, as well as to machines for programmatic access. http://www.uniprot.org/ is open for both academic and commercial use. The site was built with open source tools and libraries. Feedback is very welcome and should be sent to help@uniprot.org. CONCLUSION: The new UniProt website makes accessing and understanding UniProt easier than ever. The two main lessons learned are that getting the basics right for such a data provider website has huge benefits, but is not trivial and easy to underestimate, and that there is no substitute for using empirical data throughout the development process to decide on what is and what is not working for your users.


Subject(s)
Databases, Protein , Sequence Analysis, Protein , Information Storage and Retrieval/methods , Internet , Proteins/chemistry , User-Computer Interface
11.
Bioinformatics ; 23(10): 1282-8, 2007 May 15.
Article in English | MEDLINE | ID: mdl-17379688

ABSTRACT

MOTIVATION: Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences. RESULTS: The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of approximately 10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis. AVAILABILITY: UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
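The UniRef100 idea of combining identical sequences and subfragments into a single entry can be illustrated with a toy example. The quadratic substring check, the function names, and the sequences below are assumptions made for illustration; the abstract does not describe the production algorithm, and this sketch is not it.

```python
from typing import Dict, List


def collapse_identical_and_fragments(seqs: Dict[str, str]) -> Dict[str, List[str]]:
    """Group each sequence under the first longer-or-equal sequence containing it.

    Returns {representative_id: [member_ids]}. Sequences are visited longest
    first (ties broken by id) so a fragment is always seen after its parent.
    """
    clusters: Dict[str, List[str]] = {}
    representatives: Dict[str, str] = {}
    ordered = sorted(seqs.items(), key=lambda item: (-len(item[1]), item[0]))
    for seq_id, seq in ordered:
        for rep_id, rep_seq in representatives.items():
            if seq in rep_seq:  # identical sequence or contiguous subfragment
                clusters[rep_id].append(seq_id)
                break
        else:
            representatives[seq_id] = seq
            clusters[seq_id] = [seq_id]
    return clusters


def size_reduction(n_source: int, n_entries: int) -> float:
    """Fraction of entries removed relative to the source set."""
    return 1.0 - n_entries / n_source


# Hypothetical sequences: B duplicates A, C is a fragment of A, D is unrelated.
toy = {"A": "MKTAYIAK", "B": "MKTAYIAK", "C": "TAYI", "D": "GGGG"}
toy_clusters = collapse_identical_and_fragments(toy)
toy_reduction = size_reduction(len(toy), len(toy_clusters))
```

The reduction figures quoted in the abstract (approximately 10, 40 and 70%) are the same kind of ratio computed over the real source set; the 50% obtained here reflects only the toy input.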


Subject(s)
Computational Biology , Databases, Protein , Proteins/chemistry , Amino Acid Sequence , Animals , Humans , Information Storage and Retrieval
12.
Nucleic Acids Res ; 32(Database issue): D112-4, 2004 Jan 01.
Article in English | MEDLINE | ID: mdl-14681371

ABSTRACT

The Protein Information Resource (PIR) is an integrated public resource of protein informatics. To facilitate the sensible propagation and standardization of protein annotation and the systematic detection of annotation errors, PIR has extended its superfamily concept and developed the SuperFamily (PIRSF) classification system. Based on the evolutionary relationships of whole proteins, this classification system allows annotation of both specific biological and generic biochemical functions. The system adopts a network structure for protein classification from superfamily to subfamily levels. Protein family members are homologous (sharing common ancestry) and homeomorphic (sharing full-length sequence similarity with common domain architecture). The PIRSF database consists of two data sets, preliminary clusters and curated families. The curated families include family name, protein membership, parent-child relationship, domain architecture, and optional description and bibliography. PIRSF is accessible from the website at http://pir.georgetown.edu/pirsf/ for report retrieval and sequence classification. The report presents family annotation, membership statistics, cross-references to other databases, graphical display of domain architecture, and links to multiple sequence alignments and phylogenetic trees for curated families. PIRSF can be utilized to analyze phylogenetic profiles, to reveal functional convergence and divergence, and to identify interesting relationships between homeomorphic families, domains and structural classes.


Subject(s)
Computational Biology , Databases, Protein , Proteins/chemistry , Proteins/classification , Amino Acid Motifs , Animals , Evolution, Molecular , Humans , Information Storage and Retrieval , Internet , Protein Structure, Tertiary
13.
Nucleic Acids Res ; 31(1): 345-7, 2003 Jan 01.
Article in English | MEDLINE | ID: mdl-12520019

ABSTRACT

The Protein Information Resource (PIR) is an integrated public resource of protein informatics that supports genomic and proteomic research and scientific discovery. PIR maintains the Protein Sequence Database (PSD), an annotated protein database containing over 283 000 sequences covering the entire taxonomic range. Family classification is used for sensitive identification, consistent annotation, and detection of annotation errors. The superfamily curation defines signature domain architecture and categorizes memberships to improve automated classification. To increase the amount of experimental annotation, the PIR has developed a bibliography system for literature searching, mapping, and user submission, and has conducted retrospective attribution of citations for experimental features. PIR also maintains NREF, a non-redundant reference database, and iProClass, an integrated database of protein family, function, and structure information. PIR-NREF provides a timely and comprehensive collection of protein sequences, currently consisting of more than 1 000 000 entries from PIR-PSD, SWISS-PROT, TrEMBL, RefSeq, GenPept, and PDB. The PIR web site (http://pir.georgetown.edu) connects data analysis tools to underlying databases for information retrieval and knowledge discovery, with functionalities for interactive queries, combinations of sequence and text searches, and sorting and visual exploration of search results. The FTP site provides free download for PSD and NREF biweekly releases and auxiliary databases and files.


Subject(s)
Databases, Protein , Proteins/chemistry , Proteins/classification , Amino Acid Sequence , Animals , Databases, Bibliographic , Internet , Proteins/genetics
14.
Nucleic Acids Res ; 30(1): 35-7, 2002 Jan 01.
Article in English | MEDLINE | ID: mdl-11752247

ABSTRACT

The Protein Information Resource (PIR) serves as an integrated public resource of functional annotation of protein data to support genomic/proteomic research and scientific discovery. The PIR, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the PIR-International Protein Sequence Database (PSD), the major annotated protein sequence database in the public domain, containing about 250 000 proteins. To improve protein annotation and the coverage of experimentally validated data, a bibliography submission system is developed for scientists to submit, categorize and retrieve literature information. Comprehensive protein information is available from iProClass, which includes family classification at the superfamily, domain and motif levels, structural and functional features of proteins, as well as cross-references to over 40 biological databases. To provide timely and comprehensive protein data with source attribution, we have introduced a non-redundant reference protein database, PIR-NREF. The database consists of about 800 000 proteins collected from PIR-PSD, SWISS-PROT, TrEMBL, GenPept, RefSeq and PDB, with composite protein names and literature data. To promote database interoperability, we provide XML data distribution and open database schema, and adopt common ontologies. The PIR web site (http://pir.georgetown.edu/) features data mining and sequence analysis tools for information retrieval and functional identification of proteins based on both sequence and annotation information. The PIR databases and other files are also available by FTP (ftp://nbrfa.georgetown.edu/pir_databases).


Subject(s)
Databases, Protein , Amino Acid Sequence , Animals , Humans , Information Storage and Retrieval , International Agencies , Internet , Proteins/classification , Proteins/genetics , Systems Integration