1.
BMC Biol ; 19(1): 73, 2021 04 13.
Article in English | MEDLINE | ID: mdl-33849527

ABSTRACT

BACKGROUND: Dinoflagellates in the family Symbiodiniaceae are important photosynthetic symbionts in cnidarians (such as corals) and other coral reef organisms. Breakdown of the coral-dinoflagellate symbiosis due to environmental stress (i.e. coral bleaching) can lead to coral death and the potential collapse of reef ecosystems. However, evolution of Symbiodiniaceae genomes, and its implications for the coral, is little understood. Genome sequences of Symbiodiniaceae remain scarce due in part to their large genome sizes (1-5 Gbp) and idiosyncratic genome features. RESULTS: Here, we present de novo genome assemblies of seven members of the genus Symbiodinium, of which two are free-living, one is an opportunistic symbiont, and the remainder are mutualistic symbionts. Integrating other available data, we compare 15 dinoflagellate genomes revealing high sequence and structural divergence. Divergence among some Symbiodinium isolates is comparable to that among distinct genera of Symbiodiniaceae. We also recovered hundreds of gene families specific to each lineage, many of which encode unknown functions. An in-depth comparison between the genomes of the symbiotic Symbiodinium tridacnidorum (isolated from a coral) and the free-living Symbiodinium natans reveals a greater prevalence of transposable elements, genetic duplication, structural rearrangements, and pseudogenisation in the symbiotic species. CONCLUSIONS: Our results underscore the potential impact of lifestyle on lineage-specific gene-function innovation, genome divergence, and the diversification of Symbiodinium and Symbiodiniaceae. The divergent features we report, and their putative causes, may also apply to other microbial eukaryotes that have undergone symbiotic phases in their evolutionary history.


Subject(s)
Anthozoa , Dinoflagellida , Animals , Anthozoa/genetics , Coral Reefs , Dinoflagellida/genetics , Ecosystem , Genetic Variation , Genome/genetics
2.
Brief Bioinform ; 20(2): 426-435, 2019 03 22.
Article in English | MEDLINE | ID: mdl-28673025

ABSTRACT

We are amidst an ongoing flood of sequence data arising from the application of high-throughput technologies, and a concomitant fundamental revision in our understanding of how genomes evolve individually and within the biosphere. Workflows for phylogenomic inference must accommodate data that are not only much larger than before, but often more error prone and perhaps misassembled, or not assembled in the first place. Moreover, genomes of microbes, viruses and plasmids evolve not only by tree-like descent with modification but also by incorporating stretches of exogenous DNA. Thus, next-generation phylogenomics must address computational scalability while rethinking the nature of orthogroups, the alignment of multiple sequences and the inference and comparison of trees. New phylogenomic workflows have begun to take shape based on so-called alignment-free (AF) approaches. Here, we review the conceptual foundations of AF phylogenetics for the hierarchical (vertical) and reticulate (lateral) components of genome evolution, focusing on methods based on k-mers. We reflect on what seems to be successful, and on where further development is needed.


Subject(s)
Evolution, Molecular , Genome , Phylogeny , Algorithms , Animals , Humans , Microbiota/genetics , Models, Genetic , Sequence Alignment , Sequence Analysis, DNA , Viruses/genetics
3.
PLoS Pathog ; 15(1): e1007513, 2019 01.
Article in English | MEDLINE | ID: mdl-30673782

ABSTRACT

Mesenteric infection by the parasitic blood fluke Schistosoma bovis is a common veterinary problem in Africa and the Middle East and occasionally in the Mediterranean Region. The species also has the ability to form interspecific hybrids with the human parasite S. haematobium, with natural hybridisation observed in West Africa, presenting possible zoonotic transmission. Additionally, this exchange of alleles between species may dramatically influence disease dynamics and parasite evolution. We have generated a 374 Mb assembly of the S. bovis genome using Illumina and PacBio-based technologies. Despite infecting different hosts and organs, the genome sequences of S. bovis and S. haematobium appeared strikingly similar with 97% sequence identity. The two species share 98% of protein-coding genes, with an average sequence identity of 97.3% at the amino acid level. Genome comparison identified large continuous parts of the genome (up to several hundred kb) showing almost 100% sequence identity between S. bovis and S. haematobium. This is unlikely to be a result of genome conservation and provides further evidence of natural interspecific hybridisation between S. bovis and S. haematobium. Our results suggest that foreign DNA obtained by interspecific hybridisation was maintained in the population through multiple meiosis cycles and that hybrids were sexually reproductive, producing viable offspring. The S. bovis genome assembly forms a highly valuable resource for studying schistosome evolution and exploring genetic regions that are associated with species-specific phenotypic traits.


Subject(s)
Hybridization, Genetic/genetics , Schistosoma/genetics , Africa , Africa, Western , Animals , Base Sequence/genetics , Cattle , Chromosome Mapping/methods , DNA/genetics , Genome/genetics , Genome, Mitochondrial/genetics , Hybridization, Genetic/physiology , Middle East , Phylogeny , Proteome/genetics , Species Specificity , Trematoda/genetics , Whole Genome Sequencing/methods
4.
BMC Biol ; 18(1): 56, 2020 05 24.
Article in English | MEDLINE | ID: mdl-32448240

ABSTRACT

BACKGROUND: Dinoflagellates are taxonomically diverse and ecologically important phytoplankton that are ubiquitously present in marine and freshwater environments. Mostly photosynthetic, dinoflagellates provide the basis of aquatic primary production; most taxa are free-living, while some can form symbiotic and parasitic associations with other organisms. However, knowledge of the molecular mechanisms that underpin the adaptation of these organisms to diverse ecological niches is limited by the scarce availability of genomic data, partly due to their large genome sizes estimated up to 250 Gbp. Currently available dinoflagellate genome data are restricted to Symbiodiniaceae (particularly symbionts of reef-building corals) and parasitic lineages, from taxa that have smaller genome size ranges, while genomic information from more diverse free-living species is still lacking. RESULTS: Here, we present two draft diploid genome assemblies of the free-living dinoflagellate Polarella glacialis, isolated from the Arctic and Antarctica. We found that about 68% of the genomes are composed of repetitive sequence, with long terminal repeats likely contributing to intra-species structural divergence and distinct genome sizes (3.0 and 2.7 Gbp). For each genome, guided using full-length transcriptome data, we predicted > 50,000 high-quality protein-coding genes, of which ~40% are in unidirectional gene clusters and ~25% comprise single exons. Multi-genome comparison unveiled genes specific to P. glacialis and a common, putatively bacterial origin of ice-binding domains in cold-adapted dinoflagellates. CONCLUSIONS: Our results elucidate how selection acts within the context of a complex genome structure to facilitate local adaptation. 
Because most dinoflagellate genes are constitutively expressed, Polarella glacialis has enhanced transcriptional responses via unidirectional, tandem duplication of single-exon genes that encode functions critical to survival in cold, low-light polar environments. These genomes provide a foundational reference for future research on dinoflagellate evolution.


Subject(s)
Dinoflagellida/genetics , Exons , Genome, Protozoan , Tandem Repeat Sequences , Transcriptome , Adaptation, Biological , Genes, Protozoan
5.
Brief Bioinform ; 16(3): 461-74, 2015 May.
Article in English | MEDLINE | ID: mdl-24950687

ABSTRACT

Breast cancer was traditionally perceived as a single disease; however, recent advances in gene expression and genomic profiling have revealed that breast cancer is in fact a collection of diseases exhibiting distinct anatomical features, responses to treatment and survival outcomes. Consequently, a number of schemes have been proposed for subtyping of breast cancer to bring out the biological and clinically relevant characteristics of the subtypes. Although some of these schemes capture underlying molecular differences, others predict variations in response to treatment and survival patterns. However, despite this diversity in the approaches, it is clear that molecular mechanisms drive clinical outcomes, and therefore an effective scheme should integrate molecular as well as clinical parameters to enable deeper understanding of cancer mechanisms and allow better decision making in the clinic. Here, using a large cohort of ∼550 breast tumours from The Cancer Genome Atlas, we systematically evaluate a number of expression-based schemes including at least eight molecular pathways implicated in breast cancer and three prognostic signatures, across a variety of classification scenarios covering molecular characteristics, biomarker status, tumour stages and survival patterns. We observe that a careful combination of these schemes yields better classification results compared with using them individually, thus confirming that molecular mechanisms and clinical outcomes are related and that an effective scheme should therefore integrate both these parameters to enable a deeper understanding of the cancer.


Subject(s)
Biomarkers, Tumor/metabolism , Breast Neoplasms/diagnosis , Breast Neoplasms/metabolism , Gene Expression Profiling/methods , Molecular Diagnostic Techniques/methods , Neoplasm Proteins/metabolism , Breast Neoplasms/classification , Female , Humans , Prognosis , Protein Interaction Mapping/methods , Reproducibility of Results , Risk Assessment/methods , Sensitivity and Specificity
6.
Environ Microbiol ; 18(5): 1338-51, 2016 05.
Article in English | MEDLINE | ID: mdl-26032777

ABSTRACT

Diazotrophic bacteria potentially supply substantial amounts of biologically fixed nitrogen to crops, but their occurrence may be suppressed by high nitrogen fertilizer application. Here, we explored the impact of high nitrogen fertilizer rates on the presence of diazotrophs in field-grown sugarcane with industry-standard or reduced nitrogen fertilizer application. Despite large differences in soil microbial communities between test sites, a core sugarcane root microbiome was identified. The sugarcane root-enriched core taxa overlap with those of Arabidopsis thaliana raising the possibility that certain bacterial families have had long association with plants. Reduced nitrogen fertilizer application had remarkably little effect on the core root microbiome and did not increase the relative abundance of root-associated diazotrophs or nif gene counts. Correspondingly, low nitrogen fertilizer crops had lower biomass and nitrogen content, reflecting a lack of major input of biologically fixed nitrogen, indicating that manipulating nitrogen fertilizer rates does not improve sugarcane yields by enriching diazotrophic populations under the test conditions. Standard nitrogen fertilizer crops had improved biomass and nitrogen content, and corresponding soils had higher abundances of nitrification and denitrification genes. These findings highlight that achieving a balance in maximizing crop yields and minimizing nutrient pollution associated with nitrogen fertilizer application requires understanding of how microbial communities respond to fertilizer use.


Subject(s)
Fertilizers , Microbiota , Nitrogen , Plant Roots/microbiology , Saccharum/microbiology , Bacteria/isolation & purification , Bacteria/metabolism , Biomass , Crops, Agricultural , Nitrogen Fixation , Soil , Soil Microbiology
7.
Brief Bioinform ; 15(2): 195-211, 2014 Mar.
Article in English | MEDLINE | ID: mdl-23698722

ABSTRACT

Inference of gene regulatory networks from expression data is a challenging task. Many methods have been developed for this purpose, but a comprehensive evaluation that covers unsupervised, semi-supervised and supervised methods, and provides guidelines for their practical application, is lacking. We performed an extensive evaluation of inference methods on simulated and experimental expression data. The results reveal low prediction accuracies for unsupervised techniques, with the notable exception of the Z-SCORE method on knockout data. In all other cases, the supervised approach achieved the highest accuracies and, even in a semi-supervised setting with small numbers of only positive samples, outperformed the unsupervised techniques.


Subject(s)
Computational Biology/methods , Gene Regulatory Networks , Algorithms , Artificial Intelligence , Computer Simulation , Databases, Genetic/statistics & numerical data , Escherichia coli/genetics , Gene Expression Profiling/statistics & numerical data , Genes, Bacterial , Genes, Fungal , Saccharomyces cerevisiae/genetics , Software , Support Vector Machine , Systems Biology
8.
Brief Bioinform ; 15(6): 973-83, 2014 Nov.
Article in English | MEDLINE | ID: mdl-23946492

ABSTRACT

Large quantities of information describing the mechanisms of biological pathways continue to be collected in publicly available databases. At the same time, experiments have increased in scale, and biologists increasingly use pathways defined in online databases to interpret the results of experiments and generate hypotheses. Emerging computational techniques that exploit the rich biological information captured in reaction systems require formal standardized descriptions of pathways to extract these reaction networks and avoid the alternative: time-consuming and largely manual literature-based network reconstruction. Here, we systematically evaluate the effects of commonly used knowledge representations on the seemingly simple task of extracting a reaction network describing signal transduction from a pathway database. We show that this process is in fact surprisingly difficult, and the pathway representations adopted by various knowledge bases have dramatic consequences for reaction network extraction, connectivity, capture of pathway crosstalk and in the modelling of cell-cell interactions. Researchers constructing computational models built from automatically extracted reaction networks must therefore consider the issues we outline in this review to maximize the value of existing pathway knowledge.


Subject(s)
Databases, Factual/statistics & numerical data , Models, Biological , Signal Transduction , Cell Communication , Computational Biology , Databases, Factual/standards , Humans , Knowledge Bases , MAP Kinase Signaling System , Systems Biology
9.
Nucleic Acids Res ; 42(10): 6106-27, 2014 Jun.
Article in English | MEDLINE | ID: mdl-24792170

ABSTRACT

DNA-damage response machinery is crucial to maintain the genomic integrity of cells, by enabling effective repair of even highly lethal lesions such as DNA double-strand breaks (DSBs). Defects in specific genes acquired through mutations, copy-number alterations or epigenetic changes can alter the balance of these pathways, triggering cancerous potential in cells. Selective killing of cancer cells by sensitizing them to further DNA damage, especially by induction of DSBs, therefore requires careful modulation of DSB-repair pathways. Here, we review the latest knowledge on the two DSB-repair pathways, homologous recombination and non-homologous end joining in human, describing in detail the functions of their components and the key mechanisms contributing to the repair. Such an in-depth characterization of these pathways enables a more mechanistic understanding of how cells respond to therapies, and suggests molecules and processes that can be explored as potential therapeutic targets. One such avenue that has shown immense promise is via the exploitation of synthetic lethal relationships, for which the BRCA1-PARP1 relationship is particularly notable. Here, we describe how this relationship functions and the manner in which cancer cells acquire therapy resistance by restoring their DSB repair potential.


Subject(s)
Breast Neoplasms/therapy , DNA Breaks, Double-Stranded , DNA End-Joining Repair , Recombinational DNA Repair , Breast Neoplasms/genetics , Breast Neoplasms/metabolism , Carcinogenesis/genetics , Carcinogenesis/metabolism , Female , Humans
10.
Bioinformatics ; 30(9): 1273-9, 2014 May 01.
Article in English | MEDLINE | ID: mdl-24407221

ABSTRACT

MOTIVATION: Cancer is a heterogeneous progressive disease caused by perturbations of the underlying gene regulatory network that can be described by dynamic models. These dynamics are commonly modeled as Boolean networks or as ordinary differential equations. Their inference from data is computationally challenging, and at least partial knowledge of the regulatory network and its kinetic parameters is usually required to construct predictive models. RESULTS: Here, we construct Hopfield networks from static gene-expression data and demonstrate that cancer subtypes can be characterized by different attractors of the Hopfield network. We evaluate the clustering performance of the network and find that it is comparable with traditional methods but offers additional advantages including a dynamic model of the energy landscape and a unification of clustering, feature selection and network inference. We visualize the Hopfield attractor landscape and propose a pruning method to generate sparse networks for feature selection and improved understanding of feature relationships.
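The attractor idea in this abstract can be illustrated with a minimal, generic Hopfield network. This is only a sketch under stated assumptions: it presumes expression profiles have already been binarised to +1/-1 vectors and uses the textbook Hebbian rule with asynchronous sign updates. The abstract does not say how the authors build or prune their network, and the function names `train_hopfield`, `energy` and `recall` are hypothetical.

```python
def train_hopfield(patterns):
    """Hebbian weights W[i][j] = sum over patterns of x_i * x_j / n, zero diagonal."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j] / n
    return w


def energy(w, state):
    """Hopfield energy E = -1/2 * sum_ij W[i][j] * s_i * s_j."""
    n = len(state)
    return -0.5 * sum(
        w[i][j] * state[i] * state[j] for i in range(n) for j in range(n)
    )


def recall(w, state, max_sweeps=100):
    """Apply asynchronous sign updates until no unit changes (an attractor)."""
    s = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            field = sum(w[i][j] * s[j] for j in range(len(s)))
            new = 1 if field >= 0 else -1
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:
            break
    return s


# Two toy binarised "profiles" (hypothetical data, not from the paper).
p1 = [1, 1, 1, -1, -1, -1]
p2 = [1, -1, 1, -1, 1, -1]
weights = train_hopfield([p1, p2])
noisy = [-1, 1, 1, -1, -1, -1]  # p1 with its first unit flipped
settled = recall(weights, noisy)
```

In this toy setting a sample would be assigned to whichever stored pattern its state settles into, which is one way to read "subtypes characterised by different attractors"; the energy function gives the landscape that the updates descend.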


Subject(s)
Gene Expression Profiling/methods , Gene Expression Regulation, Neoplastic , Gene Regulatory Networks , Neoplasms/genetics , Algorithms , Cluster Analysis , Humans , Kinetics , Software
11.
Bioinformatics ; 29(12): 1553-61, 2013 Jun 15.
Article in English | MEDLINE | ID: mdl-23613489

ABSTRACT

MOTIVATION: Deciphering the modus operandi of dysregulated cellular mechanisms in cancer is critical to implicate novel cancer genes and develop effective anti-cancer therapies. Fundamental to this is meticulous tracking of the behavior of core modules, including complexes and pathways across specific conditions in cancer. RESULTS: Here, we performed a straightforward yet systematic identification and comparison of modules across pancreatic normal and cancer tissue conditions by integrating PPI, gene-expression and mutation data. Our analysis revealed interesting change-patterns in gene composition and expression correlation particularly affecting modules responsible for genome stability. Although in most cases these changes indicated impairment of essential functions (e.g., of DNA damage repair), in several other cases we noticed strengthening of modules possibly abetting cancer. Some of these compensatory modules showed switches in transcription regulation and recruitment of tumor inducers (e.g., SOX2 through overexpression). In-depth analysis revealed novel genes in pancreatic cancer, which showed susceptibility to copy-number alterations (e.g., for USP15 in 17 of 67 cases), supported by literature evidence for their involvement in other tumors (e.g., USP15 in glioblastoma). Two of the identified genes, YWHAE and DISC1, further supported the nexus between neural genes and pancreatic carcinogenesis. Extension of this assessment to BRCA1 and BRCA2 breast tumors showed specific differences even across the two sub-types and revealed novel genes involved therein (e.g., TRIM5 and NCOA6). AVAILABILITY: Our software CONTOURv1 is available at: http://bioinformatics.org.au/tools-data/.


Subject(s)
Gene Expression Regulation, Neoplastic , Genes, Neoplasm , BRCA2 Protein/genetics , Breast Neoplasms/genetics , Female , Gene Expression , Genes, BRCA1 , Genes, BRCA2 , Humans , Mutation , Neoplasms/genetics , Pancreatic Neoplasms/genetics , Pancreatic Neoplasms/metabolism , Protein Interaction Mapping , Saccharomyces cerevisiae Proteins/metabolism
12.
RNA Biol ; 11(3): 176-85, 2014.
Article in English | MEDLINE | ID: mdl-24572375

ABSTRACT

From 1971 to 1985, Carl Woese and colleagues generated oligonucleotide catalogs of 16S/18S rRNAs from more than 400 organisms. Using these incomplete and imperfect data, Carl and his colleagues developed unprecedented insights into the structure, function, and evolution of the large RNA components of the translational apparatus. They recognized a third domain of life, revealed the phylogenetic backbone of bacteria (and its limitations), delineated taxa, and explored the tempo and mode of microbial evolution. For these discoveries to have stood the test of time, oligonucleotide catalogs must carry significant phylogenetic signal; they thus bear re-examination in view of the current interest in alignment-free phylogenetics based on k-mers. Here we consider the aims, successes, and limitations of this early phase of molecular phylogenetics. We computationally generate oligonucleotide sets (e-catalogs) from 16S/18S rRNA sequences, calculate pairwise distances between them based on D2 statistics, compute distance trees, and compare their performance against alignment-based and k-mer trees. Although the catalogs themselves were superseded by full-length sequences, this stage in the development of computational molecular biology remains instructive for us today.
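As a rough illustration of a k-mer-based D2 comparison, the sketch below counts overlapping k-mers in two sequences and sums the products of their shared counts, which is the basic D2 statistic. The conversion to a distance (one minus D2 normalised by the geometric mean of the two self-scores) is an assumption made here for illustration; the abstract does not state which D2 variant or distance transform the authors used, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(counts_a, counts_b):
    """Basic D2 statistic: sum over shared k-mers of the product of counts."""
    return sum(n * counts_b[w] for w, n in counts_a.items() if w in counts_b)


def d2_distance(seq_a, seq_b, k):
    """Illustrative distance (an assumption): 1 - D2 / sqrt(D2(a,a) * D2(b,b))."""
    a = kmer_counts(seq_a, k)
    b = kmer_counts(seq_b, k)
    norm = sqrt(d2(a, a) * d2(b, b))
    return 1.0 - d2(a, b) / norm if norm else 1.0


# Toy sequences, not real rRNA data.
same = d2_distance("ACGTACGT", "ACGTACGT", 2)
disjoint = d2_distance("ACGTACGT", "TTTT", 2)
```

A matrix of such pairwise distances could then be passed to any distance-based tree builder (for example neighbour joining), which corresponds to the "compute distance trees" step.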


Subject(s)
Computational Biology/methods , Oligonucleotides , Phylogeny , RNA, Ribosomal/genetics , Archaea/classification , Archaea/genetics , Bacteria/classification , Bacteria/genetics , Databases, Genetic , Evolution, Molecular
13.
BMC Bioinformatics ; 14: 120, 2013 Apr 08.
Article in English | MEDLINE | ID: mdl-23566217

ABSTRACT

BACKGROUND: Clustering sequences into groups of putative homologs (families) is a critical first step in many areas of comparative biology and bioinformatics. The performance of clustering approaches in delineating biologically meaningful families depends strongly on characteristics of the data, including content bias and degree of divergence. New, highly scalable methods have recently been introduced to cluster the very large datasets being generated by next-generation sequencing technologies. However, there has been little systematic investigation of how characteristics of the data impact the performance of these approaches. RESULTS: Using clusters from a manually curated dataset as reference, we examined the performance of a widely used graph-based Markov clustering algorithm (MCL) and a greedy heuristic approach (UCLUST) in delineating protein families coded by three sets of bacterial genomes of different G+C content. Both MCL and UCLUST generated clusters that are comparable to the reference sets at specific parameter settings, although UCLUST tends to under-cluster compositionally biased sequences (G+C content 33% and 66%). Using simulated data, we sought to assess the individual effects of sequence divergence, rate heterogeneity, and underlying G+C content. Performance decreased with increasing sequence divergence, decreasing among-site rate variation, and increasing G+C bias. Two MCL-based methods recovered the simulated families more accurately than did UCLUST. MCL using local alignment distances is more robust across the investigated range of sequence features than are greedy heuristics using distances based on global alignment. CONCLUSIONS: Our results demonstrate that sequence divergence, rate heterogeneity and content bias can individually and in combination affect the accuracy with which MCL and UCLUST can recover homologous protein families. 
For application to data that are more divergent, and exhibit higher among-site rate variation and/or content bias, MCL may often be the better choice, especially if computational resources are not limiting.


Subject(s)
Proteins/classification , Sequence Homology, Amino Acid , Algorithms , Bacterial Proteins/chemistry , Bacterial Proteins/classification , Bacterial Proteins/genetics , Base Composition , Cluster Analysis , DNA, Bacterial/chemistry , Evolution, Molecular , Genome, Bacterial , Markov Chains , Sequence Analysis, Protein/methods
14.
BMC Bioinformatics ; 14 Suppl 16: S14, 2013.
Article in English | MEDLINE | ID: mdl-24564496

ABSTRACT

BACKGROUND: Cell survival and development are orchestrated by complex interlocking programs of gene activation and repression. Understanding how this gene regulatory network (GRN) functions in normal states, and is altered in cancer subtypes, offers fundamental insight into oncogenesis and disease progression, and holds great promise for guiding clinical decisions. Inferring a GRN from empirical microarray gene expression data is a challenging task in cancer systems biology. In recent years, module-based approaches for GRN inference have been proposed to address this challenge. Despite the demonstrated success of module-based approaches in uncovering biologically meaningful regulatory interactions, their application remains limited to a single condition, without supporting the comparison of multiple disease subtypes/conditions. Also, their use remains unnecessarily restricted to computational biologists, as accurate inference of modules and their regulators requires integration of diverse tools and heterogeneous data sources, which in turn requires scripting skills, data infrastructure and powerful computational facilities. New analytical frameworks are required to make the module-based GRN inference approach more generally useful to the research community. RESULTS: We present the RMaNI (Regulatory Module Network Inference) framework, which supports cancer subtype-specific or condition-specific GRN inference and differential network analysis. It combines transcriptomic and genomic data sources, and integrates heterogeneous knowledge resources and a set of complementary bioinformatic methods for automated inference of modules and their condition-specific regulators, and it facilitates downstream network analyses and data visualization. To demonstrate its utility, we applied RMaNI to a hepatocellular microarray dataset containing normal and three disease conditions. 
We demonstrate how RMaNI can be employed to understand the genetic architecture underlying three disease conditions. RMaNI is freely available at http://inspect.braembl.org.au/bi/inspect/rmani. CONCLUSION: RMaNI makes available a workflow with a comprehensive set of tools that would otherwise be challenging for non-expert users to install and apply. The framework presented in this paper is flexible and can be easily extended to analyse any dataset with multiple disease conditions.


Subject(s)
Carcinoma, Hepatocellular/genetics , Computational Biology/methods , Gene Regulatory Networks , Liver Neoplasms/genetics , Cluster Analysis , Gene Expression , Humans , Internet , Systems Biology/methods
15.
J Mol Evol ; 77(1-2): 1-2, 2013 Aug.
Article in English | MEDLINE | ID: mdl-23877343

ABSTRACT

A recent editorial in Journal of Molecular Evolution highlights opportunities and challenges facing molecular evolution in the era of next-generation sequencing. Abundant sequence data should allow more-complex models to be fit at higher confidence, making phylogenetic inference more reliable and improving our understanding of evolution at the molecular level. However, concern that approaches based on multiple sequence alignment may be computationally infeasible for large datasets is driving the development of so-called alignment-free methods for sequence comparison and phylogenetic inference. The recent editorial characterized these approaches as model-free, not based on the concept of homology, and lacking in biological intuition. We argue here that alignment-free methods have not abandoned models or homology, and can be biologically intuitive.


Subject(s)
Evolution, Molecular , Models, Genetic , Phylogeny , Animals , Humans
16.
Nat Methods ; 7(3 Suppl): S26-41, 2010 Mar.
Article in English | MEDLINE | ID: mdl-20195255

ABSTRACT

Advances in imaging techniques and high-throughput technologies are providing scientists with unprecedented possibilities to visualize internal structures of cells, organs and organisms and to collect systematic image data characterizing genes and proteins on a large scale. To make the best use of these increasingly complex and large image data resources, the scientific community must be provided with methods to query, analyze and crosslink these resources to give an intuitive visual representation of the data. This review gives an overview of existing methods and tools for this purpose and highlights some of their limitations and challenges.


Subject(s)
Image Processing, Computer-Assisted , Magnetic Resonance Imaging , Microscopy/methods
17.
Bioinformatics ; 28(6): 851-7, 2012 Mar 15.
Article in English | MEDLINE | ID: mdl-22219205

ABSTRACT

MOTIVATION: Phylogenetic profiling methods can achieve good accuracy in predicting protein-protein interactions, especially in prokaryotes. Recent studies have shown that the choice of reference taxa (RT) is critical for accurate prediction, but with more than 2500 fully sequenced taxa publicly available, identifying the most-informative RT is becoming increasingly difficult. Previous studies on the selection of RT have provided guidelines for manual taxon selection, and for eliminating closely related taxa. However, no general strategy for automatic selection of RT is currently available. RESULTS: We present three novel methods for automating the selection of RT, using machine learning based on known protein-protein interaction networks. One of these methods in particular, Tree-Based Search, yields greatly improved prediction accuracies. We further show that different methods for constituting phylogenetic profiles often require very different RT sets to support high prediction accuracy.


Subject(s)
Archaea/genetics , Artificial Intelligence , Bacteria/genetics , Eukaryota/genetics , Phylogeny , Protein Interaction Maps , Proteins/genetics , Archaea/classification , Archaea/metabolism , Bacteria/classification , Bacteria/metabolism , Eukaryota/classification , Eukaryota/metabolism , Proteins/chemistry , Proteins/metabolism
18.
Bioinformatics ; 28(1): 69-75, 2012 Jan 01.
Article in English | MEDLINE | ID: mdl-22057159

ABSTRACT

MOTIVATION: Protein-protein interactions (PPIs) are pivotal for many biological processes and similarity in Gene Ontology (GO) annotation has been found to be one of the strongest indicators for PPI. Most GO-driven algorithms for PPI inference combine machine learning and semantic similarity techniques. We introduce the concept of inducers as a method to integrate both approaches more effectively, leading to superior prediction accuracies. RESULTS: An inducer (ULCA) in combination with a Random Forest classifier compares favorably to several sequence-based methods, semantic similarity measures and multi-kernel approaches. On a newly created set of high-quality interaction data, the proposed method achieves high cross-species prediction accuracies (Area under the ROC curve ≤ 0.88), rendering it a valuable companion to sequence-based methods. AVAILABILITY: Software and datasets are available at http://bioinformatics.org.au/go2ppi/ CONTACT: m.ragan@uq.edu.au.


Subject(s)
Algorithms , Molecular Sequence Annotation , Proteins/genetics , Software , Vocabulary, Controlled , Databases, Protein , Humans , Protein Interaction Maps , ROC Curve , Yeasts/genetics , Yeasts/metabolism
19.
BMC Evol Biol ; 12: 140, 2012 Aug 07.
Article in English | MEDLINE | ID: mdl-22871040

ABSTRACT

BACKGROUND: Proteins of the mammalian PYHIN (IFI200/HIN-200) family are involved in defence against infection through recognition of foreign DNA. The family member absent in melanoma 2 (AIM2) binds cytosolic DNA via its HIN domain and initiates inflammasome formation via its pyrin domain. AIM2 lies within a cluster of related genes, many of which are uncharacterised in mouse. To better understand the evolution, orthology and function of these genes, we have documented the range of PYHIN genes present in representative mammalian species, and undertaken phylogenetic and expression analyses. RESULTS: No PYHIN genes are evident in non-mammals or monotremes, with a single member found in each of three marsupial genomes. Placental mammals show variable family expansions, from one gene in cow to four in human and 14 in mouse. A single HIN domain appears to have evolved in the common ancestor of marsupials and placental mammals, and duplicated to give rise to three distinct forms (HIN-A, -B and -C) in the placental mammal ancestor. Phylogenetic analyses showed that AIM2 HIN-C and pyrin domains clearly diverge from the rest of the family, and it is the only PYHIN protein with orthology across many species. Interestingly, although AIM2 is important in defence against some bacteria and viruses in mice, AIM2 is a pseudogene in cow, sheep, llama, dolphin, dog and elephant. The other 13 mouse genes have arisen by duplication and rearrangement within the lineage, which has allowed some diversification in expression patterns. CONCLUSIONS: The role of AIM2 in forming the inflammasome is relatively well understood, but molecular interactions of other PYHIN proteins involved in defence against foreign DNA remain to be defined. The non-AIM2 PYHIN protein sequences are very distinct from AIM2, suggesting they vary in effector mechanism in response to foreign DNA, and may bind different DNA structures. 
The PYHIN family has highly varied gene composition between mammalian species due to lineage-specific duplication and loss, which probably indicates different adaptations for fighting infectious disease. Non-genomic DNA can indicate infection, or a mutagenic threat. We hypothesise that defence of the genome against endogenous retroelements has been an additional evolutionary driver for PYHIN proteins.


Subject(s)
Evolution, Molecular , Mammals/genetics , Nuclear Proteins/genetics , Animals , Bayes Theorem , DNA-Binding Proteins , Humans , Inflammasomes/metabolism , Mice , Mice, Inbred C57BL , Nuclear Proteins/chemistry , Nuclear Proteins/immunology , Phylogeny , Rats , Transcriptome
20.
RNA ; 16(9): 1760-8, 2010 Sep.
Article in English | MEDLINE | ID: mdl-20651029

ABSTRACT

The heterogeneous nuclear ribonucleoproteins (hnRNPs) A/B are a family of RNA-binding proteins that participate in various aspects of nucleic acid metabolism, including mRNA trafficking, telomere maintenance, and splicing. They are both regulators and targets of alternative splicing, and the patterns of alternative splicing of their transcripts have diverged between paralogs and between orthologs in different species. Surprisingly, the extent of this splicing variation and its implications for post-transcriptional regulation have remained largely unexplored. Here, we conducted a detailed analysis of hnRNP A/B sequences and expression patterns across six vertebrates. Alternative exons emerged via the introduction of new splice sites, changes in the strengths of existing splice sites, and the accumulation of auxiliary splicing regulatory motifs. Observed isoform expression patterns could be attributed to the frequency and strength of cis-elements. We found a trend toward increased splicing variation in mammals and identified novel alternatively spliced isoforms in human and chicken. Pulldown and translational assays demonstrated that the inclusion of alternative exons altered the affinity of hnRNP A/B proteins for their cognate nucleic acids and modified protein expression levels. As the hnRNPs A/B regulate several key steps in mRNA processing, the involvement of diverse hnRNP isoforms in multiple cellular contexts and species implies concomitant differences in the transcriptional output of these systems. We conclude that the emergence of alternative splicing in the hnRNPs A/B has contributed to the diversification of their roles in the regulation of alternative splicing and has thus added an unexpected layer of regulatory complexity to transcription in vertebrates.


Subject(s)
Alternative Splicing , Heterogeneous-Nuclear Ribonucleoprotein Group A-B/metabolism , Animals , Evolution, Molecular , HeLa Cells , Heterogeneous-Nuclear Ribonucleoprotein Group A-B/genetics , Humans , Mice , RNA Splice Sites , RNA, Messenger/metabolism , Rats , Regulatory Sequences, Ribonucleic Acid