BIR Pipeline for Preparation of Phylogenomic Data.

Kumar, Surendra; Krabberød, Anders K; Neumann, Ralf S; Michalickova, Katerina; Zhao, Sen; Zhang, Xiaoli; Shalchian-Tabrizi, Kamran.

Evol Bioinform Online ; 11: 79-83, 2015.

Article in English | MEDLINE | ID: mdl-25987827

ABSTRACT

SUMMARY: We present a pipeline named BIR (Blast, Identify and Realign) developed for phylogenomic analyses. BIR is intended for the identification of gene sequences applicable for phylogenomic inference. The pipeline allows users to apply their own manually curated sequence alignments (seeds) to search for homologous genes in sequence databases and available genomes. BIR automatically adds the identified sequences from these databases to the seed alignments and reconstructs a phylogenetic tree from each. The BIR pipeline is an efficient tool for the identification of orthologous gene copies because it expands user-defined sequence alignments and conducts massively parallel phylogenetic reconstruction. The application is also particularly useful for large-scale sequencing projects that require management of a large number of single-gene alignments for gene comparison, functional annotation, and evolutionary analyses. AVAILABILITY: The BIR user manual is available at http://www.bioportal.no/ and can be accessed through Lifeportal at https://lifeportal.uio.no. Access is free but requires a user account registration using the link "Register for BIR access" from the Lifeportal homepage.

A survey of protein interaction data and multigenic inherited disorders.

Mora, Antonio; Michalickova, Katerina; Donaldson, Ian M.

BMC Bioinformatics ; 14: 47, 2013 Feb 11.

Article in English | MEDLINE | ID: mdl-23398688

ABSTRACT

BACKGROUND: Multigenic diseases are often associated with protein complexes or interactions involved in the same pathway. We wanted to estimate to what extent this is true given a consolidated protein interaction data set. The study stresses data integration and data representation issues. RESULTS: We constructed 497 multigenic disease groups from OMIM and tested for overlaps with interaction and pathway data. A total of 159 disease groups had significant overlaps with protein interaction data consolidated by iRefIndex. A further 68 disease overlaps were found only in the KEGG pathway database. No single database contained all significant overlaps thus stressing the importance of data integration. We also found that disease groups overlapped with all three interaction data types: n-ary, spoke-represented complexes and binary data - thus stressing the importance of considering each of these data types separately. CONCLUSIONS: Almost half of our multigenic disease groups could potentially be explained by protein complexes and pathways. However, the fact that no database or data type was able to cover all disease groups suggests that no single database has systematically covered all disease groups for potential related complex and pathway data. This survey provides a basis for further curation efforts to confirm and search for overlaps between diseases and interaction data. The accompanying R script can be used to reproduce the work and track progress in this area as databases change. Disease group overlaps can be further explored using the iRefscape plugin for Cytoscape.

Subject(s)

Genetic Diseases, Inborn/genetics , Multiprotein Complexes/genetics , Algorithms , Databases, Genetic , Databases, Protein , Humans , Hyperglycinemia, Nonketotic/genetics , Liddle Syndrome/genetics , Nephritis, Hereditary/genetics , Protein Interaction Mapping

iRefScape. A Cytoscape plug-in for visualization and data mining of protein interaction data from iRefIndex.

Razick, Sabry; Mora, Antonio; Michalickova, Katerina; Boddie, Paul; Donaldson, Ian M.

BMC Bioinformatics ; 12: 388, 2011 Oct 05.

Article in English | MEDLINE | ID: mdl-21975162

ABSTRACT

BACKGROUND: The iRefIndex consolidates protein interaction data from ten databases in a rigorous manner using sequence-based hash keys. Working with consolidated interaction data comes with distinct challenges: data are redundant, overlapping, highly interconnected and may be collected and represented using different curation practices. These phenomena were quantified in our previous studies. RESULTS: The iRefScape plug-in for the Cytoscape graphical viewer addresses these challenges. We show how these factors impact on data-mining tasks and how our solutions resolve them in a simple and efficient manner. A uniform accession space is used to limit redundancy and support search expansion and searching on multiple accession types. Multiple node and edge features support data filtering and mining. Node colours and features supply information about search result provenance. Overlapping evidence is presented using a multi-graph and a bi-partite representation is used to distinguish binary and n-ary source data. Searching for interactions between sets of proteins is supported and specifically includes searches on disease-related genes found in OMIM. Finally, a synchronized adjacency-matrix view facilitates visualization of relationships between sets of user defined groups. CONCLUSIONS: The iRefScape plug-in will be of interest to advanced users of interaction data. The plug-in provides access to a consolidated data set in a uniform accession space while remaining faithful to the underlying source data. Tools are provided to facilitate a range of tasks from a simple search to knowledge discovery. The plug-in uses a number of strategies that will be of interest to other plug-in developers.

Subject(s)

Data Mining , Databases, Protein , Proteins/metabolism , Database Management Systems , Databases, Genetic , Protein Interaction Mapping , Software
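
The iRefScape abstract above mentions consolidating records "using sequence-based hash keys". As a rough illustration only, the idea can be sketched as follows. The function name `sequence_key`, the choice of SHA-1 with base64 encoding, the appended taxonomy identifier, and the toy accessions are all assumptions made for this sketch and are not taken from the abstract.

```python
import base64
import hashlib


def sequence_key(sequence: str, taxon_id: int) -> str:
    """Hypothetical key: digest of the normalized sequence plus a taxon id."""
    # Normalize so that case and whitespace differences do not change the key.
    normalized = "".join(sequence.split()).upper()
    digest = hashlib.sha1(normalized.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=") + str(taxon_id)


# Toy records (accession, sequence, taxon id); the accessions are invented.
records = [
    ("dbA:P1", "mkt ayiak", 9606),
    ("dbB:X9", "MKTAYIAK", 9606),
    ("dbC:Q7", "MKTAYIAR", 9606),
]

# Records that describe the same sequence in the same organism share one key.
groups = {}
for accession, seq, taxon in records:
    groups.setdefault(sequence_key(seq, taxon), []).append(accession)
```

Grouping by such a key places the first two toy records together even though they come from different source databases, which is the kind of redundancy a uniform accession space is meant to limit.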

The contribution of DNA base damage to human cancer is modulated by the base excision repair interaction network.

Arczewska, Katarzyna D; Michalickova, Katerina; Donaldson, Ian M; Nilsen, Hilde.

Crit Rev Oncog ; 14(4): 217-73, 2008.

Article in English | MEDLINE | ID: mdl-19645683

ABSTRACT

Base excision repair (BER) is a major mode of repair of DNA base damage. BER is required for maintenance of genetic stability, which is important in the prevention of cancer. However, direct genetic associations between BER deficiency and human cancer have been difficult to firmly establish, and the first-generation mouse models deficient in individual DNA-glycosylases, which are the enzymes that give lesion specificity to the BER pathway, generally do not develop spontaneous tumors. This review summarizes our current understanding of the contribution of DNA base damage to human cancer, with a particular focus on DNA-glycosylases and two of the main enzymes that prevent misincorporation of damaged deoxynucleotide triphosphates into DNA: the dUTPase and MTH1. The available evidence suggests that the most important factors determining individual susceptibility to cancer are not mutations in individual DNA repair enzymes but rather the regulation of expression and modulation of function by protein modification and interaction partners. With this in mind, we present a comprehensive list of protein-protein interactions involving DNA-glycosylases or either of the two enzymes that limit incorporation of damaged nucleotides into DNA. Interacting partners with a known role in human cancer are specifically highlighted.

Subject(s)

DNA Damage/physiology , DNA Repair/physiology , Neoplasms/genetics , Animals , Base Sequence , DNA Damage/genetics , DNA Repair/genetics , DNA, Neoplasm/genetics , DNA, Neoplasm/metabolism , Gene Regulatory Networks/physiology , Humans , Mice , Models, Biological , Neoplasms/metabolism , Protein Binding/physiology

PreBIND and Textomy--mining the biomedical literature for protein-protein interactions using a support vector machine.

Donaldson, Ian; Martin, Joel; de Bruijn, Berry; Wolting, Cheryl; Lay, Vicki; Tuekam, Brigitte; Zhang, Shudong; Baskin, Berivan; Bader, Gary D; Michalickova, Katerina; Pawson, Tony; Hogue, Christopher W V.

BMC Bioinformatics ; 4: 11, 2003 Mar 27.

Article in English | MEDLINE | ID: mdl-12689350

ABSTRACT

BACKGROUND: The majority of experimentally verified molecular interaction and biological pathway data are present in the unstructured text of biomedical journal articles where they are inaccessible to computational methods. The Biomolecular interaction network database (BIND) seeks to capture these data in a machine-readable format. We hypothesized that the formidable task-size of backfilling the database could be reduced by using Support Vector Machine technology to first locate interaction information in the literature. We present an information extraction system that was designed to locate protein-protein interaction data in the literature and present these data to curators and the public for review and entry into BIND. RESULTS: Cross-validation estimated the support vector machine's test-set precision, accuracy and recall for classifying abstracts describing interaction information at 92%, 90% and 92%, respectively. We estimated that the system would be able to recall up to 60% of all non-high throughput interactions present in another yeast-protein interaction database. Finally, this system was applied to a real-world curation problem and its use was found to reduce the task duration by 70%, thus saving 176 days. CONCLUSIONS: Machine learning methods are useful as tools to direct interaction and pathway database back-filling; however, this potential can only be realized if these techniques are coupled with human review and entry into a factual database such as BIND. The PreBIND system described here is available to the public at http://bind.ca. Current capabilities allow searching for human, mouse and yeast protein-interaction information.

Subject(s)

Artificial Intelligence , Information Storage and Retrieval/trends , Protein Interaction Mapping/methods , Algorithms , Computational Biology/methods , Computational Biology/statistics & numerical data , Databases, Factual/trends , Databases, Protein/trends , Genome, Fungal , Protein Interaction Mapping/classification , Protein Interaction Mapping/statistics & numerical data , PubMed/classification , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae Proteins/chemistry

Species-specific protein sequence and fold optimizations.

Dumontier, Michel; Michalickova, Katerina; Hogue, Christopher W V.

BMC Bioinformatics ; 3: 39, 2002 Dec 17.

Article in English | MEDLINE | ID: mdl-12487631

ABSTRACT

BACKGROUND: An organism's ability to adapt to its particular environmental niche is of fundamental importance to its survival and proliferation. In the largest study of its kind, we sought to identify and exploit the amino-acid signatures that make species-specific protein adaptation possible across 100 complete genomes. RESULTS: Environmental niche was determined to be a significant factor in variability from correspondence analysis using the amino acid composition of over 360,000 predicted open reading frames (ORFs) from 17 archaea, 76 bacteria and 7 eukaryote complete genomes. Additionally, we found clusters of phylogenetically unrelated archaea and bacteria that share similar environments by amino acid composition clustering. Composition analyses of conservative, domain-based homology modeling suggested an enrichment of small hydrophobic residues Ala, Gly, Val and charged residues Asp, Glu, His and Arg across all genomes. However, larger aromatic residues Phe, Trp and Tyr are reduced in folds, and these results were not affected by low complexity biases. We derived two simple log-odds scoring functions from ORFs (CG) and folds (CF) for each of the complete genomes. CF achieved an average cross-validation success rate of 85 +/- 8% whereas the CG detected 73 +/- 9% species-specific sequences when competing against all other non-redundant CG. Continuously updated results are available at http://genome.mshri.on.ca. CONCLUSION: Our analysis of amino acid compositions from the complete genomes provides stronger evidence for species-specific and environmental residue preferences in genomic sequences as well as in folds. Scoring functions derived from this work will be useful in future protein engineering experiments and possibly in identifying horizontal transfer events.

Subject(s)

Computational Biology/methods , Protein Folding , Proteins/chemistry , Adaptation, Physiological/genetics , Animals , Archaeal Proteins/chemistry , Archaeal Proteins/genetics , Bacterial Proteins/chemistry , Bacterial Proteins/genetics , Caenorhabditis elegans Proteins/chemistry , Caenorhabditis elegans Proteins/genetics , Fungal Proteins/chemistry , Fungal Proteins/genetics , Genome , Genome, Archaeal , Genome, Bacterial , Genome, Fungal , Genome, Human , Humans , Predictive Value of Tests , Protein Structure, Secondary/genetics , Proteins/genetics , Proteome/chemistry , Proteome/genetics , Proteomics/methods , Species Specificity
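
The abstract above describes "simple log-odds scoring functions" derived from amino acid composition. A minimal sketch of one such composition-based log-odds score is given below. It is not the authors' CG or CF function: the pseudocount, the pooled background, the function names and the toy sequences are all assumptions chosen for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences, pseudocount=1.0):
    """Smoothed amino acid frequencies over a set of protein sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts[a] for a in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def log_odds_table(genome_freqs, background_freqs):
    """Per-residue log-odds of a genome's composition against a background."""
    return {a: math.log(genome_freqs[a] / background_freqs[a]) for a in AMINO_ACIDS}


def score(sequence, table):
    """Sum the per-residue log-odds; positive values favour the genome."""
    return sum(table[c] for c in sequence.upper() if c in table)


# Toy "genomes" (invented): one rich in small residues, one rich in aromatics.
genome_a = ["AAAAGGVV", "AAGGAV"]
genome_b = ["FWYFWY", "KKFWY"]
background = composition(genome_a + genome_b)
table_a = log_odds_table(composition(genome_a), background)
```

With these toy inputs, `score("AAGV", table_a)` is positive and `score("FWY", table_a)` is negative. Assigning a sequence to whichever genome's table scores it highest is one way to read "competing against all other" scoring functions, although the abstract does not spell out that procedure.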

SeqHound: biological sequence and structure database as a platform for bioinformatics research.

Michalickova, Katerina; Bader, Gary D; Dumontier, Michel; Lieu, Hao; Betel, Doron; Isserlin, Ruth; Hogue, Christopher W V.

BMC Bioinformatics ; 3: 32, 2002 Oct 25.

Article in English | MEDLINE | ID: mdl-12401134

ABSTRACT

BACKGROUND: SeqHound has been developed as an integrated biological sequence, taxonomy, annotation and 3-D structure database system. It provides a high-performance server platform for bioinformatics research in a locally-hosted environment. RESULTS: SeqHound is based on the National Center for Biotechnology Information data model and programming tools. It offers daily updated contents of all Entrez sequence databases in addition to 3-D structural data and information about sequence redundancies, sequence neighbours, taxonomy, complete genomes, functional annotation including Gene Ontology terms and literature links to PubMed. SeqHound is accessible via a web server through a Perl, C or C++ remote API or an optimized local API. It provides functionality necessary to retrieve specialized subsets of sequences, structures and structural domains. Sequences may be retrieved in FASTA, GenBank, ASN.1 and XML formats. Structures are available in ASN.1, XML and PDB formats. Emphasis has been placed on complete genomes, taxonomy, domain and functional annotation as well as 3-D structural functionality in the API, while fielded text indexing functionality remains under development. SeqHound also offers a streamlined WWW interface for simple web-user queries. CONCLUSIONS: The system has proven useful in several published bioinformatics projects such as the BIND database and offers a cost-effective infrastructure for research. SeqHound will continue to develop and be provided as a service of the Blueprint Initiative at the Samuel Lunenfeld Research Institute. The source code and examples are available under the terms of the GNU public license at the Sourceforge site http://sourceforge.net/projects/slritools/ in the SLRI Toolkit.

Subject(s)

Computational Biology/methods , Databases, Genetic , Software , Amino Acid Sequence , Base Sequence , Databases, Genetic/classification , Information Storage and Retrieval/methods , Internet , Models, Genetic , Models, Molecular , Molecular Sequence Data , Structure-Activity Relationship

Mutation profiling of mismatch repair-deficient colorectal cancers using an in silico genome scan to identify coding microsatellites.

Park, Jane; Betel, Doron; Gryfe, Robert; Michalickova, Katerina; Di Nicola, Nando; Gallinger, Steven; Hogue, Christopher W V; Redston, Mark.

Cancer Res ; 62(5): 1284-8, 2002 Mar 01.

Article in English | MEDLINE | ID: mdl-11888892

ABSTRACT

Human colorectal, endometrial, and gastric cancers with defective DNA mismatch repair (MMR) have microsatellite instability, a unique molecular alteration characterized by widespread frameshift mutations of repetitive DNA sequences. We developed "Kangaroo," a bioinformatics program for searches in nucleotide and protein sequence databases, and performed an in silico genome scan for DNA coding microsatellites that may have novel mutations in MMR-deficient cancers. Examination of 29 previously untested coding polyadenines revealed widespread mutations in MMR-deficient colorectal cancers, with the highest frequencies in ERCC5, CASP8AP2, p72, RAD50, CDC25, RECQL1, CBF2, RACK7, GRK4, and DNAPK (range, 10-33%). This algorithm allows comprehensive mutation profiling of MMR-deficient cancers, an important step in understanding the pathogenesis of these neoplasms.

Subject(s)

Base Pair Mismatch , Colorectal Neoplasms/genetics , Microsatellite Repeats , Mutation , Algorithms , Computational Biology , DNA Repair , Humans
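
The in silico scan for coding polyadenines described in the abstract above can be pictured with a short sketch. This is not the "Kangaroo" program: the function name, the minimum run length of 8 and the toy sequence are assumptions made purely for illustration.

```python
import re


def find_mononucleotide_runs(coding_sequence, base="A", min_length=8):
    """Return (start, length) for each run of `base` of at least `min_length`."""
    # Builds a pattern such as "A{8,}"; the threshold of 8 is an assumed default.
    pattern = re.compile(f"{re.escape(base)}{{{min_length},}}")
    return [
        (m.start(), m.end() - m.start())
        for m in pattern.finditer(coding_sequence.upper())
    ]


# Toy coding sequence (invented): one A9 tract and one shorter A4 tract.
toy_cds = "ATGGCC" + "A" * 9 + "GGTAAAACC"
runs = find_mononucleotide_runs(toy_cds)
```

Here `runs` holds a single entry for the nine-base tract at offset 6, while the four-base tract is ignored. A real scan would also need to restrict matches to annotated coding regions, which this sketch leaves out.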

Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry.

Ho, Yuen; Gruhler, Albrecht; Heilbut, Adrian; Bader, Gary D; Moore, Lynda; Adams, Sally-Lin; Millar, Anna; Taylor, Paul; Bennett, Keiryn; Boutilier, Kelly; Yang, Lingyun; Wolting, Cheryl; Donaldson, Ian; Schandorff, Søren; Shewnarane, Juanita; Vo, Mai; Taggart, Joanne; Goudreault, Marilyn; Muskat, Brenda; Alfarano, Cris; Dewar, Danielle; Lin, Zhen; Michalickova, Katerina; Willems, Andrew R; Sassi, Holly; Nielsen, Peter A; Rasmussen, Karina J; Andersen, Jens R; Johansen, Lene E; Hansen, Lykke H; Jespersen, Hans; Podtelejnikov, Alexandre; Nielsen, Eva; Crawford, Janne; Poulsen, Vibeke; Sørensen, Birgitte D; Matthiesen, Jesper; Hendrickson, Ronald C; Gleeson, Frank; Pawson, Tony; Moran, Michael F; Durocher, Daniel; Mann, Matthias; Hogue, Christopher W V; Figeys, Daniel; Tyers, Mike.

Nature ; 415(6868): 180-3, 2002 Jan 10.

Article in English | MEDLINE | ID: mdl-11805837

ABSTRACT

The recent abundance of genome sequence data has brought an urgent need for systematic proteomics to decipher the encoded protein networks that dictate cellular function. To date, generation of large-scale protein-protein interaction maps has relied on the yeast two-hybrid system, which detects binary interactions through activation of reporter gene expression. With the advent of ultrasensitive mass spectrometric protein identification methods, it is feasible to identify directly protein complexes on a proteome-wide scale. Here we report, using the budding yeast Saccharomyces cerevisiae as a test case, an example of this approach, which we term high-throughput mass spectrometric protein complex identification (HMS-PCI). Beginning with 10% of predicted yeast proteins as baits, we detected 3,617 associated proteins covering 25% of the yeast proteome. Numerous protein complexes were identified, including many new interactions in various signalling pathways and in the DNA damage response. Comparison of the HMS-PCI data set with interactions reported in the literature revealed an average threefold higher success rate in detection of known complexes compared with large-scale two-hybrid studies. Given the high degree of connectivity observed in this study, even partial HMS-PCI coverage of complex proteomes, including that of humans, should allow comprehensive identification of cellular networks.

Subject(s)

Cell Cycle Proteins , Saccharomyces cerevisiae Proteins/isolation & purification , Saccharomyces cerevisiae/chemistry , Amino Acid Sequence , Cloning, Molecular , DNA Damage , DNA Repair , DNA, Fungal , Humans , Macromolecular Substances , Mass Spectrometry , Molecular Sequence Data , Phosphoric Monoester Hydrolases/metabolism , Protein Binding , Protein Kinases/chemistry , Protein Kinases/metabolism , Protein Serine-Threonine Kinases , Proteome , Saccharomyces cerevisiae Proteins/chemistry , Sequence Alignment , Signal Transduction
