ABSTRACT
Achieving high accuracy in orthology inference is essential for many comparative, evolutionary and functional genomic analyses, yet the true evolutionary history of genes is generally unknown and orthologs are used for very different applications across phyla, requiring different precision-recall trade-offs. As a result, it is difficult to assess the performance of orthology inference methods. Here, we present a community effort to establish standards and an automated web-based service to facilitate orthology benchmarking. Using this service, we characterize 15 well-established inference methods and resources on a battery of 20 different benchmarks. Standardized benchmarking provides a way for users to identify the most effective methods for the problem at hand, sets a minimum requirement for new tools and resources, and guides the development of more accurate orthology inference methods.
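The precision-recall trade-off mentioned above can be made concrete with a minimal sketch. It assumes a benchmark that scores predicted ortholog pairs against a trusted reference set; the gene identifiers and the `precision_recall` helper are hypothetical and are not part of the benchmarking service itself.

```python
def precision_recall(predicted, reference):
    """Score predicted ortholog pairs against a reference set.

    Pairs are unordered, so (a, b) and (b, a) count as the same pair.
    """
    pred = {frozenset(pair) for pair in predicted}
    ref = {frozenset(pair) for pair in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical gene identifiers, for illustration only.
reference = [("hsa1", "mmu1"), ("hsa2", "mmu2"), ("hsa3", "mmu3"), ("hsa4", "mmu4")]
predicted = [("mmu1", "hsa1"), ("hsa2", "mmu2"), ("hsa5", "mmu9")]

precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A method that reports fewer, safer pairs tends to raise precision at the cost of recall, which is why different applications may prefer different methods.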
Subjects
Computational Biology/standards; Genomics/standards; Phylogeny; Proteomics/standards; Archaea/classification; Archaea/genetics; Bacteria/classification; Bacteria/genetics; Computational Biology/methods; Databases, Genetic; Eukaryota/classification; Eukaryota/genetics; Gene Ontology; Genomics/methods; Models, Genetic; Proteomics/methods; Sequence Analysis, Protein; Sequence Homology; Species Specificity
ABSTRACT
Roundup is an online database of gene orthologs for over 1800 genomes, including 226 Eukaryota, 1447 Bacteria, 113 Archaea and 21 Viruses. Orthologs are inferred using the Reciprocal Smallest Distance algorithm. Users may query Roundup for single-linkage clusters of orthologous genes based on any group of genomes. Annotated query results may be viewed in a variety of ways including as clusters of orthologs and as phylogenetic profiles. Genomic results may be downloaded in formats suitable for functional as well as phylogenetic analysis, including the recent OrthoXML standard. In addition, gene IDs can be retrieved using FASTA sequence search. All source code and orthologs are freely available. AVAILABILITY: http://roundup.hms.harvard.edu.
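Single-linkage clustering of orthologs can be sketched as finding connected components over pairwise ortholog links. The example below is an illustration under that assumption, using a small union-find structure; the function name and gene labels are hypothetical and do not come from the Roundup source code.

```python
from collections import defaultdict


def single_linkage_clusters(ortholog_pairs):
    """Group genes so that any chain of pairwise ortholog links
    places its genes in the same cluster (connected components)."""
    parent = {}

    def find(gene):
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for a, b in ortholog_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for gene in list(parent):
        clusters[find(gene)].add(gene)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise orthologs across three genomes.
pairs = [("human_A", "mouse_A"), ("mouse_A", "fly_A"), ("human_B", "mouse_B")]
clusters = single_linkage_clusters(pairs)
print(clusters)
```

Note that `human_A` and `fly_A` end up in one cluster even though they were never reported as a direct pair; that transitive merging is what distinguishes single linkage from stricter clustering criteria.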
Subjects
Algorithms; Genomics/methods; Phylogeny; Animals; Archaea/genetics; Bacteria/genetics; Cluster Analysis; Evolution, Molecular; Genome; Humans; Viruses/genetics
ABSTRACT
SUMMARY: We developed a package TripletSearch to compute relationships within triplets of genes based on Roundup, an orthologous gene database containing >1500 genomes. These relationships, derived from the coevolution of genes, provide valuable information in the detection of biological network organization from the local to the system level, in the inference of protein functions and in the identification of functional orthologs. To run the computation, users need to provide the GI IDs of the genes of interest. AVAILABILITY: http://wall.hms.harvard.edu/sites/default/files/tripletSearch.tar.gz CONTACT: dpwall@hms.harvard.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Genes, Bacterial; Software; Databases, Genetic; Evolution, Molecular; Genome, Bacterial; Genomics/methods; Models, Biological; Phylogeny; Proteins/genetics; Proteins/physiology
ABSTRACT
BACKGROUND: A "phylogenetic profile" refers to the presence or absence of a gene across a set of organisms, and it has proven valuable for understanding gene functional relationships and network organization. Despite this success, few studies have attempted to search beyond just pairwise relationships among genes. Here we search for logic relationships involving three genes, and explore their potential application in gene network analyses. RESULTS: Taking advantage of a phylogenetic matrix constructed from the large ortholog database Roundup, we invented a method to create balanced profiles for individual triplets of genes that guarantee equal weight on the different phylogenetic scenarios of coevolution between genes. When we applied this idea to LAPP, the method to search for logic triplets of genes, the balanced profiles resulted in significant performance improvement and the discovery of hundreds of thousands more putative triplets than unadjusted profiles. We found that logic triplets detected biological network organization and identified key proteins and their functions, ranging from neighbouring proteins in local pathways, to well separated proteins in the whole pathway, and to the interactions among different pathways at the system level. Finally, our case study suggested that the directionality in a logic relationship and the profile of a triplet could disclose the connectivity between the triplet and surrounding networks. CONCLUSION: Balanced profiles are superior to the raw profiles employed by traditional methods of phylogenetic profiling in searching for high order gene sets. Gene triplets can provide valuable information in detection of biological network organization and identification of key genes at different levels of cellular interaction.
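To illustrate what a logic relationship among three phylogenetic profiles looks like, here is a simplified sketch. It assumes profiles are binary presence/absence vectors over the same ordered set of genomes, and it scores each candidate logic function by plain agreement fraction. That scoring is a stand-in chosen for brevity, not the scoring used by LAPP, and it does not implement the balanced profiles described above; the profiles are made up.

```python
from operator import and_, or_, xor

# Candidate two-input logic functions to test for gene C given genes A and B.
LOGIC_FUNCTIONS = {"AND": and_, "OR": or_, "XOR": xor}


def logic_agreement(profile_a, profile_b, profile_c):
    """Return, for each logic function f, the fraction of genomes
    in which f(A, B) equals the observed presence of C."""
    scores = {}
    for name, func in LOGIC_FUNCTIONS.items():
        matches = sum(
            func(a, b) == c for a, b, c in zip(profile_a, profile_b, profile_c)
        )
        scores[name] = matches / len(profile_c)
    return scores


# Hypothetical profiles over six genomes (1 = gene present, 0 = absent).
gene_a = (1, 1, 0, 0, 1, 0)
gene_b = (1, 0, 1, 0, 1, 1)
gene_c = (1, 0, 0, 0, 1, 0)  # present only where both A and B are present

scores = logic_agreement(gene_a, gene_b, gene_c)
print(scores)
```

Here the AND relationship explains gene C in every genome, while OR and XOR do not, which is the kind of three-gene pattern a pairwise comparison would miss.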
Subjects
Evolution, Molecular; Gene Regulatory Networks; Phylogeny; Databases, Genetic; Humans; Logic; Proteins/genetics; Proteins/metabolism
ABSTRACT
MicroRNA (miRNA)-based post-transcriptional regulatory mechanisms have been broadly implicated in the assembly and modulation of synaptic connections required to shape neural circuits; however, relatively few specific miRNAs have been identified that control synapse formation. Using a conditional transgenic toolkit for competitive inhibition of miRNA function in Drosophila, we performed an unbiased screen for novel regulators of synapse morphogenesis at the larval neuromuscular junction (NMJ). From a set of ten new validated regulators of NMJ growth, we discovered that miR-34 mutants display synaptic phenotypes and cell type-specific functions suggesting distinct downstream mechanisms in the presynaptic and postsynaptic compartments. A search for conserved downstream targets for miR-34 identified the junctional receptor CNTNAP4/Neurexin-IV (Nrx-IV) and the membrane cytoskeletal effector Adducin/Hu-li tai shao (Hts) as proteins whose synaptic expression is restricted by miR-34. Manipulation of miR-34, Nrx-IV or Hts-M function in motor neurons or muscle supports a model where presynaptic miR-34 inhibits Nrx-IV to influence active zone formation, whereas postsynaptic miR-34 inhibits Hts to regulate the initiation of bouton formation from presynaptic terminals.
Subjects
Calmodulin-Binding Proteins/genetics; Cell Adhesion Molecules, Neuronal/genetics; Drosophila Proteins/genetics; Gene Expression Regulation, Developmental; MicroRNAs/metabolism; Presynaptic Terminals/physiology; Animals; Animals, Genetically Modified; Calmodulin-Binding Proteins/metabolism; Cell Adhesion Molecules, Neuronal/metabolism; Drosophila Proteins/metabolism; Drosophila melanogaster/physiology; Larva/growth & development; Morphogenesis/genetics; Mutation; Neuromuscular Junction/cytology; Neuromuscular Junction/growth & development
ABSTRACT
SUMMARY: We have created a tool for ortholog and phylogenetic profile retrieval called Roundup. Roundup is backed by a massive repository of orthologs and associated evolutionary distances that was built using the reciprocal smallest distance algorithm, an approach that has been shown to improve upon alternative approaches to ortholog detection, such as reciprocal BLAST. Presently, the Roundup repository contains all possible pair-wise comparisons for over 250 genomes, including 32 Eukaryotes, more than doubling the coverage of any similar resource. The orthologs are accessible through an intuitive web interface that allows searches by genome or gene identifier, presenting results as phylogenetic profiles together with gene and molecular function annotations. Results may be downloaded as phylogenetic matrices for subsequent analysis, including the construction of whole-genome phylogenies based on gene-content data. AVAILABILITY: http://rodeo.med.harvard.edu/tools/roundup.
Subjects
Evolution, Molecular; Genome; Algorithms; Computational Biology; Data Interpretation, Statistical; Databases, Genetic; Genome, Bacterial; Internet; Pattern Recognition, Automated; Phylogeny; Software; Species Specificity
ABSTRACT
All protein-coding genes have a phylogenetic history that, when understood, can lead to deep insights into the diversification or conservation of function, the evolution of developmental complexity, and the molecular basis of disease. One important part of reconstructing the relationships among genes in different organisms is an accurate method to find orthologs as well as an accurate measure of evolutionary diversification. The present chapter details such a method, called the reciprocal smallest distance algorithm (RSD). This approach improves upon the common procedure of taking reciprocal best Basic Local Alignment Search Tool hits (RBH) in the identification of orthologs by using global sequence alignment and maximum likelihood estimation of evolutionary distances to detect orthologs between two genomes. RSD finds many putative orthologs missed by RBH because it is less likely to be misled by the presence of close paralogs in genomes. The package offers a tremendous amount of flexibility in investigating parameter settings, allowing the user to search for increasingly distant orthologs between highly divergent species, among other advantages. The flexibility of this tool makes it a unique and powerful addition to other available approaches for ortholog detection.
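The reciprocity test at the heart of RSD can be sketched as follows. This is a simplification under stated assumptions: it takes a precomputed table of evolutionary distances between genes of two genomes, whereas the method described above obtains those distances from global alignment and maximum likelihood estimation. The function name, gene labels and distance values are hypothetical.

```python
def reciprocal_smallest_distance(distances):
    """Return gene pairs (a, b) where b is the closest gene to a in genome B
    and a is the closest gene to b in genome A.

    `distances` maps (gene_in_A, gene_in_B) -> evolutionary distance.
    """
    best_for_a = {}
    best_for_b = {}
    for (a, b), dist in distances.items():
        if a not in best_for_a or dist < distances[(a, best_for_a[a])]:
            best_for_a[a] = b
        if b not in best_for_b or dist < distances[(best_for_b[b], b)]:
            best_for_b[b] = a
    return sorted((a, b) for a, b in best_for_a.items() if best_for_b[b] == a)


# Hypothetical distances (substitutions per site) between two small genomes.
distances = {
    ("a1", "b1"): 0.2,
    ("a1", "b2"): 0.9,
    ("a2", "b1"): 0.5,
    ("a2", "b2"): 0.4,
    ("a3", "b2"): 0.3,
}

orthologs = reciprocal_smallest_distance(distances)
print(orthologs)
```

Gene `a2` is left unpaired: its closest match is `b2`, but `b2` is closer to `a3`, so the reciprocity condition fails.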
Subjects
Algorithms; Amino Acid Sequence; Evolution, Molecular; Molecular Sequence Data; Phylogeny; Sequence Alignment
ABSTRACT
Although the impact of microRNAs (miRNAs) in development and disease is well established, understanding the function of individual miRNAs remains challenging. Development of competitive inhibitor molecules such as miRNA sponges has allowed the community to address individual miRNA function in vivo. However, the application of these loss-of-function strategies has been limited. Here we offer a comprehensive library of 141 conditional miRNA sponges targeting well-conserved miRNAs in Drosophila. Ubiquitous miRNA sponge delivery and consequent systemic miRNA inhibition uncovers a relatively small number of miRNA families underlying viability and gross morphogenesis, with false discovery rates in the 4-8% range. In contrast, tissue-specific silencing of muscle-enriched miRNAs reveals a surprisingly large number of novel miRNA contributions to the maintenance of adult indirect flight muscle structure and function. A strong correlation between miRNA abundance and physiological relevance is not observed, underscoring the importance of unbiased screens when assessing the contributions of miRNAs to complex biological processes.
Subjects
Drosophila/genetics; MicroRNAs/antagonists & inhibitors; Animals; Animals, Genetically Modified; Drosophila/metabolism; Female; Gene Library; Male; MicroRNAs/metabolism; Muscles/metabolism
ABSTRACT
OBJECTIVE: To extract disorder-associated genes from the scientific literature in PubMed with greater sensitivity for literature-based support than existing methods. METHODS: We developed a PubMed query to retrieve disorder-related, original research articles. Then we applied a rule-based text-mining algorithm with keyword matching to extract target disorders, genes with significant results, and the type of study described by the article. RESULTS: We compared our resulting candidate disorder genes and supporting references with existing databases. We demonstrated that our candidate gene set covers nearly all genes in manually curated databases, and that the references supporting the disorder-gene link are more extensive and accurate than other general purpose gene-to-disorder association databases. CONCLUSIONS: We implemented a novel publication search tool to find target articles, specifically focused on links between disorders and genotypes. Through comparison against gold-standard manually updated gene-disorder databases and comparison with automated databases of similar functionality we show that our tool can search through the entirety of PubMed to extract the main gene findings for human diseases rapidly and accurately.
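A rule-based extractor with keyword matching might look like the sketch below. Everything in it is an assumption made for illustration: the gene symbols, disorder keywords, significance cues and sentences are placeholders, and the rules are far simpler than whatever rule set the described tool applies.

```python
import re

# Hypothetical vocabularies; a real system would load curated lists.
GENE_SYMBOLS = {"GENEA", "GENEB"}
DISORDER_KEYWORDS = {"disorder x": "Disorder X", "disorder y": "Disorder Y"}
SIGNIFICANCE_CUES = ("significant", "associated with")


def extract_links(sentence):
    """Return (disorder, gene) pairs from a sentence that mentions a
    significance cue, a known disorder keyword and a known gene symbol."""
    text = sentence.lower()
    if not any(cue in text for cue in SIGNIFICANCE_CUES):
        return []
    genes = [
        gene
        for gene in sorted(GENE_SYMBOLS)
        if re.search(rf"\b{re.escape(gene)}\b", sentence)
    ]
    disorders = [name for keyword, name in DISORDER_KEYWORDS.items() if keyword in text]
    return [(disorder, gene) for disorder in disorders for gene in genes]


hit = extract_links("Variants in GENEA were significantly associated with disorder x.")
miss = extract_links("GENEB expression was measured in disorder x patients.")
print(hit, miss)
```

The second sentence is skipped because it reports no significant finding, mirroring the abstract's focus on genes with significant results.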
Subjects
Data Mining/methods; Databases, Genetic; Genes; Genetic Diseases, Inborn/genetics; Genetic Predisposition to Disease/genetics; PubMed; Algorithms; Humans; Medical Subject Headings
ABSTRACT
Autism is on the rise, with 1 in 88 children receiving a diagnosis in the United States, yet the process for diagnosis remains cumbersome and time-consuming. Research has shown that home videos of children can help increase the accuracy of diagnosis. However, the use of videos in the diagnostic process is uncommon. In the present study, we assessed the feasibility of applying a gold-standard diagnostic instrument to brief and unstructured home videos and tested whether video analysis can enable more rapid detection of the core features of autism outside of clinical environments. We collected 100 public videos from YouTube of children ages 1-15 with either a self-reported diagnosis of an ASD (N = 45) or not (N = 55). Four non-clinical raters independently scored all videos using one of the most widely adopted tools for behavioral diagnosis of autism, the Autism Diagnostic Observation Schedule-Generic (ADOS). The classification accuracy was 96.8%, with 94.1% sensitivity and 100% specificity, the inter-rater correlation for the behavioral domains on the ADOS was 0.88, and the diagnoses matched a trained clinician in all but 3 of 22 randomly selected video cases. Despite the diversity of videos and non-clinical raters, our results indicate that it is possible to achieve high classification accuracy, sensitivity, and specificity as well as clinically acceptable inter-rater reliability with non-clinical personnel. Our results also demonstrate the potential for video-based detection of autism in short, unstructured home videos and further suggest that at least a percentage of the effort associated with detection and monitoring of autism may be mobilized and moved outside of traditional clinical environments.
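For readers less familiar with the reported metrics, the sketch below shows how accuracy, sensitivity and specificity follow from a confusion matrix. The counts are hypothetical and chosen only to make the arithmetic easy to check; they are not the study's data.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute accuracy, sensitivity and specificity from confusion-matrix counts.

    tp/fn: diagnosed cases classified correctly/incorrectly.
    tn/fp: non-diagnosed cases classified correctly/incorrectly.
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of true cases detected
        "specificity": tn / (tn + fp),  # share of non-cases correctly excluded
    }


# Hypothetical counts for 45 diagnosed and 55 non-diagnosed children.
metrics = classification_metrics(tp=40, fn=5, tn=50, fp=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```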
Subjects
Autistic Disorder/diagnosis; Social Media; Video Recording; Adolescent; Child; Child, Preschool; Early Diagnosis; Humans; Infant; United States
ABSTRACT
Cloud computing services have emerged as a cost-effective alternative for cluster systems as the number of genomes and required computation power to analyze them increased in recent years. Here we introduce the Microsoft Azure platform with detailed execution steps and a cost comparison with Amazon Web Services.
ABSTRACT
The Autism Diagnostic Interview-Revised (ADI-R) is one of the most commonly used instruments for assisting in the behavioral diagnosis of autism. The exam consists of 93 questions that must be answered by a care provider within a focused session that often spans 2.5 hours. We used machine learning techniques to study the complete sets of answers to the ADI-R available at the Autism Genetic Research Exchange (AGRE) for 891 individuals diagnosed with autism and 75 individuals who did not meet the criteria for an autism diagnosis. Our analysis showed that 7 of the 93 items contained in the ADI-R were sufficient to classify autism with 99.9% statistical accuracy. We further tested the accuracy of this 7-question classifier against complete sets of answers from two independent sources, a collection of 1654 individuals with autism from the Simons Foundation and a collection of 322 individuals with autism from the Boston Autism Consortium. In both cases, our classifier performed with nearly 100% statistical accuracy, properly categorizing all but one of the individuals from these two resources who previously had been diagnosed with autism through the standard ADI-R. Our ability to measure specificity was limited by the small numbers of non-spectrum cases in the research data used; however, both real and simulated data demonstrated a range in specificity from 99% to 93.8%. With incidence rates rising, the capacity to diagnose autism quickly and effectively requires careful design of behavioral assessment methods. Ours is an initial attempt to retrospectively analyze large data repositories to derive an accurate but significantly abbreviated approach that may be used for rapid detection and clinical prioritization of individuals likely to have an autism spectrum disorder. Such a tool could assist in streamlining the clinical diagnostic process overall, leading to faster screening and earlier treatment of individuals with autism.
Subjects
Artificial Intelligence; Autistic Disorder/diagnosis; Behavior; Adolescent; Adult; Case-Control Studies; Child; Child, Preschool; Decision Trees; Humans; Infant; Infant, Newborn; Middle Aged; Retrospective Studies; Surveys and Questionnaires; Time Factors; Young Adult
ABSTRACT
BACKGROUND: The genetic etiology of autism is heterogeneous. Multiple disorders share genotypic and phenotypic traits with autism. Network-based cross-disorder analysis can aid in the understanding and characterization of the molecular pathology of autism, but there are few tools that enable us to conduct cross-disorder analysis and to visualize the results. DESCRIPTION: We have designed Autworks as a web portal to bring together gene interaction and gene-disease association data on autism to enable network construction, visualization, and network comparisons with numerous other related neurological conditions and disorders. Users may examine the structure of gene interactions within a set of disorder-associated genes, compare networks of disorder/disease genes with those of other disorders/diseases, and upload their own sets for comparative analysis. CONCLUSIONS: Autworks is a web application that provides an easy-to-use resource for researchers of varied backgrounds to analyze the autism gene network structure within and between disorders. AVAILABILITY: http://autworks.hms.harvard.edu/
Subjects
Autistic Disorder/genetics; Databases, Genetic; Gene Regulatory Networks/genetics; Genetic Association Studies; Humans; Internet
ABSTRACT
BACKGROUND: Comparative genomics resources, such as ortholog detection tools and repositories, are rapidly increasing in scale and complexity. Cloud computing is an emerging technological paradigm that enables researchers to dynamically build a dedicated virtual cluster and may represent a valuable alternative for large computational tools in bioinformatics. In the present manuscript, we optimize the computation of a large-scale comparative genomics resource, Roundup, using cloud computing, describe the proper operating principles required to achieve computational efficiency on the cloud, and detail important procedures for improving cost-effectiveness to ensure maximal computation at minimal costs. METHODS: Utilizing the comparative genomics tool, Roundup, as a case study, we computed orthologs among 902 fully sequenced genomes on Amazon's Elastic Compute Cloud. For managing the ortholog processes, we designed a strategy to deploy the web service, Elastic MapReduce, and maximize the use of the cloud while simultaneously minimizing costs. Specifically, we created a model to estimate cloud runtime based on the size and complexity of the genomes being compared that determines in advance the optimal order of the jobs to be submitted. RESULTS: We computed orthologous relationships for 245,323 genome-to-genome comparisons on Amazon's computing cloud, a computation that required just over 200 hours and cost $8,000 USD, at least 40% less than expected under a strategy in which genome comparisons were submitted to the cloud randomly with respect to runtime. Our cost savings projections were based on a model that not only demonstrates the optimal strategy for deploying RSD to the cloud, but also finds the optimal cluster size to minimize waste and maximize usage.
Our cost-reduction model is readily adaptable for other comparative genomics tools and potentially of significant benefit to labs seeking to take advantage of the cloud as an alternative to local computing infrastructure.
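The idea of ordering jobs by estimated runtime before submission can be illustrated with a generic longest-first scheduling heuristic. This is a sketch under assumptions and not the authors' cost model: the runtime estimates are supplied as plain numbers, the job names are hypothetical, and the greedy rule used here (give the next-longest job to the least-loaded node) is a standard heuristic that is not guaranteed to be optimal.

```python
import heapq


def schedule_longest_first(jobs, n_nodes):
    """Assign jobs to nodes, longest estimated runtime first.

    `jobs` maps job name -> estimated runtime in hours.
    Returns (makespan, assignment), where makespan is the load of the
    busiest node and assignment maps node index -> list of job names.
    """
    heap = [(0.0, node) for node in range(n_nodes)]  # (current load, node)
    heapq.heapify(heap)
    assignment = {node: [] for node in range(n_nodes)}
    for name, hours in sorted(jobs.items(), key=lambda item: item[1], reverse=True):
        load, node = heapq.heappop(heap)  # least-loaded node so far
        assignment[node].append(name)
        heapq.heappush(heap, (load + hours, node))
    makespan = max(load for load, _ in heap)
    return makespan, assignment


# Hypothetical runtime estimates (hours) for genome-to-genome comparisons.
jobs = {"A_vs_B": 8, "A_vs_C": 7, "B_vs_C": 6, "A_vs_D": 5, "B_vs_D": 4, "C_vs_D": 2}
makespan, assignment = schedule_longest_first(jobs, n_nodes=2)
print(makespan, assignment)
```

Submitting long jobs first leaves the short ones to fill gaps at the end, which tends to reduce the time nodes sit idle while still being billed.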
ABSTRACT
BACKGROUND: Disease-specific genetic information has been increasing at rapid rates as a consequence of recent improvements and massive cost reductions in sequencing technologies. Numerous systems designed to capture and organize this mounting sea of genetic data have emerged, but these resources differ dramatically in their disease coverage and genetic depth. With few exceptions, researchers must manually search a variety of sites to assemble a complete set of genetic evidence for a particular disease of interest, a process that is both time-consuming and error-prone. METHODS: We designed a real-time aggregation tool that provides both comprehensive coverage and reliable gene-to-disease rankings for any disease. Our tool, called Genotator, automatically integrates data from 11 externally accessible clinical genetics resources and uses these data in a straightforward formula to rank genes in order of disease relevance. We tested the accuracy of coverage of Genotator in three separate diseases for which there exist specialty curated databases, Autism Spectrum Disorder, Parkinson's Disease, and Alzheimer Disease. Genotator is freely available at http://genotator.hms.harvard.edu. RESULTS: Genotator demonstrated that most of the 11 selected databases contain unique information about the genetic composition of disease, with 2514 genes found in only one of the 11 databases. These findings confirm that the integration of these databases provides a more complete picture than would be possible from any one database alone. Genotator successfully identified at least 75% of the top ranked genes for all three of our use cases, including a 90% concordance with the top 40 ranked candidates for Alzheimer Disease. CONCLUSIONS: As a meta-query engine, Genotator provides high coverage of both historical genetic research as well as recent advances in the genetic understanding of specific diseases. 
As such, Genotator provides a real-time aggregation of ranked data that remains current with the pace of research in the disease fields. Genotator's algorithm appropriately transforms query terms to match the input requirements of each targeted database and accurately resolves named synonyms to ensure full coverage of the genetic results with official nomenclature. Genotator generates an Excel-style output that is consistent across disease queries and readily importable to other applications.
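The abstract does not state Genotator's ranking formula, so the sketch below substitutes a deliberately simple one as an assumption: each gene is scored by the number of source databases that list it for the disease. The database names, gene labels and `rank_genes` helper are hypothetical.

```python
from collections import Counter


def rank_genes(database_hits):
    """Rank genes by how many databases report them for a disease.

    `database_hits` maps database name -> list of gene symbols.
    Ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter(
        gene for genes in database_hits.values() for gene in set(genes)
    )
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical query results from three sources.
hits = {
    "db_one": ["GENE_A", "GENE_B"],
    "db_two": ["GENE_A", "GENE_C"],
    "db_three": ["GENE_A", "GENE_B", "GENE_D"],
}

ranking = rank_genes(hits)
single_source = [gene for gene, count in ranking if count == 1]
print(ranking)
print(single_source)
```

The `single_source` list corresponds to the kind of observation reported in the results: genes that appear in only one database would be missed by anyone consulting a different single source.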