ABSTRACT
BACKGROUND: The availability of newly sequenced vertebrate genomes, along with more efficient and accurate alignment algorithms, has enabled the expansion of the field of comparative genomics. Large-scale genome rearrangement events modify the order of genes and non-coding conserved regions on chromosomes. While certain large genomic regions have remained intact over much of vertebrate evolution, others appear to be hotspots for genomic breakpoints. The cause of the non-uniformity of breakpoints that occurred during vertebrate evolution is poorly understood. RESULTS: We describe a machine learning method to distinguish genomic regions where breakpoints would be expected to have deleterious effects (called breakpoint-refractory regions) from those where they are expected to be neutral (called breakpoint-susceptible regions). Our predictor is trained using breakpoints that took place along the human lineage since amniote divergence. Based on our predictions, refractory and susceptible regions have very distinctive features. Refractory regions are significantly enriched for conserved non-coding elements as well as for genes involved in development, whereas susceptible regions are enriched for housekeeping genes, which are likely to have simpler transcriptional regulation. CONCLUSION: We postulate that long-range transcriptional regulation strongly influences chromosome break fixation. In many regions, the fitness cost of altering the spatial association between long-range regulatory regions and their target genes may be so high that rearrangements are not tolerated. Consequently, only a limited, identifiable fraction of the genome is susceptible to genome rearrangements.
Subject(s)
Molecular Evolution, Gene Expression Regulation, Genomic Instability, Genomics/methods, Animals, Artificial Intelligence, Chickens/genetics, Chromosome Breakage, Chromosome Mapping/methods, Comparative Genomic Hybridization, Humans, Opossums/genetics, Synteny
ABSTRACT
The BASC system provides tools for the integrated mining and browsing of genetic, genomic and phenotypic data. This public resource hosts information on Brassica species in support of the Multinational Brassica Genome Sequencing Project, and is based upon five distinct modules: ESTDB, Microarray, MarkerQTL, CMap and EnsEMBL. ESTDB hosts expressed gene sequences and related annotation derived from comparison with GenBank, UniRef and the genome sequence of Arabidopsis. The Microarray module hosts gene expression information related to genes annotated within ESTDB. MarkerQTL is the most complex module and integrates information on genetic markers, maps, individuals, genotypes and traits. The two further modules are an Arabidopsis EnsEMBL genome viewer and the CMap comparative genetic map viewer, for the visualization and integration of genetic and genomic data. The database is accessible at http://bioinformatics.pbcbasc.latrobe.edu.au.
Subject(s)
Brassica/genetics, Genetic Databases, Arabidopsis/genetics, Chromosome Mapping, Computational Biology, Expressed Sequence Tags/chemistry, Gene Expression Profiling, Genetic Markers, Plant Genome, Genomics, Internet, Phenotype, Quantitative Trait Loci, Software, Systems Integration, User-Computer Interface
ABSTRACT
Transmission of malaria is dependent on the successful completion of the Plasmodium lifecycle in the Anopheles vector. Major obstacles are encountered in the midgut tissue, where most parasites are killed by the mosquito's immune system. In the present study, DNA microarray analyses have been used to compare Anopheles gambiae responses to invasion of the midgut epithelium by the ookinete stage of the human pathogen Plasmodium falciparum and the rodent experimental model pathogen P. berghei. Invasion by P. berghei had a more profound impact on the mosquito transcriptome, including a variety of functional gene classes, while P. falciparum elicited a broader immune response at the gene transcript level. Ingestion of human malaria-infected blood lacking invasive ookinetes also induced a variety of immune genes, including several anti-Plasmodium factors. Twelve selected genes were assessed for effect on infection with both parasite species and bacteria using RNAi gene silencing assays, and seven of these genes were found to influence mosquito resistance to both parasite species. An MD2-like receptor, AgMDL1, and an immunolectin, FBN39, showed specificity in regulating only resistance to P. falciparum, while the antimicrobial peptide gambicin and a novel putative short secreted peptide, IRSP5, were more specific for defense against the rodent parasite P. berghei. While all the genes that affected Plasmodium development also influenced mosquito resistance to bacterial infection, four of the antimicrobial genes had no effect on Plasmodium development. Our study shows that the impact of P. falciparum and P. berghei infection on A. gambiae biology at the gene transcript level is quite diverse, and the defense against the two Plasmodium species is mediated by antimicrobial factors with both universal and Plasmodium-species specific activities. 
Furthermore, our data indicate that the mosquito is capable of sensing infected blood constituents in the absence of invading ookinetes, thereby inducing anti-Plasmodium immune responses.
Subject(s)
Anopheles/genetics, Anopheles/immunology, Antiprotozoal Antibodies/biosynthesis, Plasmodium berghei/immunology, Plasmodium falciparum/immunology, Animals, Bacterial Antigens/biosynthesis, Disease Susceptibility, Humans, Intestinal Mucosa/parasitology, Malaria/blood, Falciparum Malaria/blood, Oligonucleotide Array Sequence Analysis, Rodents, Genetic Transcription, Zygote/physiology
ABSTRACT
SNPServer is a real-time flexible tool for the discovery of SNPs (single nucleotide polymorphisms) within DNA sequence data. The program uses BLAST to identify related sequences and CAP3 to cluster and align them. The alignments are passed to the SNP discovery software autoSNP, a program that detects SNPs and insertion/deletion polymorphisms (indels). Alternatively, lists of related sequences or pre-assembled sequences may be entered for SNP discovery. SNPServer and autoSNP use redundancy to differentiate between candidate SNPs and sequence errors. For each candidate SNP, two measures of confidence are calculated: the redundancy of the polymorphism at a SNP locus and the co-segregation of the candidate SNP with other SNPs in the alignment. SNPServer is available at http://hornbill.cspp.latrobe.edu.au/snpdiscovery.html.
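The redundancy idea can be illustrated with a minimal Python sketch. It is not the autoSNP implementation: the function name `candidate_snps`, the `min_redundancy` parameter and the input format (equal-length, already aligned reads) are assumptions made for illustration, and indels and the co-segregation score are deliberately left out.

```python
from collections import Counter


def candidate_snps(aligned_reads, min_redundancy=2):
    """Return (column, {base: count}) for alignment columns in which at
    least two different bases are each seen in >= min_redundancy reads.

    Hypothetical helper: a base seen in only one read is treated as a
    likely sequencing error rather than a polymorphism.
    """
    if not aligned_reads:
        return []
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("all aligned reads must have the same length")

    snps = []
    for col in range(length):
        counts = Counter(read[col].upper() for read in aligned_reads)
        # Keep only unambiguous bases with enough supporting reads;
        # gaps and ambiguity codes are ignored in this sketch.
        supported = {
            base: n
            for base, n in counts.items()
            if base in "ACGT" and n >= min_redundancy
        }
        if len(supported) >= 2:
            snps.append((col, supported))
    return snps


reads = ["ACGTA", "ACGTA", "ACTTA", "ACTTA", "ACGTC"]
snps = candidate_snps(reads)
```

In this toy alignment, column 2 is reported because both G and T are supported by at least two reads, whereas the single C in the last column is discarded as a probable error.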
Subject(s)
Single Nucleotide Polymorphism, DNA Sequence Analysis/methods, Software, Internet, Sequence Alignment, Time Factors, User-Computer Interface
ABSTRACT
As more of the human genome draft sequence is finished, and genomes from other organisms begin to be sequenced, the demand for accurate and reliable genome annotation will increase significantly. To facilitate this industrial-scale genome annotation, automated bioinformatics solutions are increasingly required. As a result, automatic genome annotation systems have become more important in gene discovery in recent years. The design of such large-scale bioinformatics systems is an evolving and dynamic field, based on central cores of bioinformatics software tools and relational databases. Not only must these systems efficiently manage and integrate large volumes of genomic data, but they must also deliver accurate gene predictions and effectively distribute annotation data to the biosciences community.
Subject(s)
Computational Biology/methods, Human Genome, Molecular Sequence Annotation/methods, Genetic Databases, Humans
ABSTRACT
As a result of an international collaborative effort, the first draft of the Anopheles gambiae genome sequence and its preliminary annotation were published in October 2002. Since then, the assembly, annotation and means of accession of the An. gambiae genome have been under continuous development. This article reviews progress and considers limitations in the current sequence assembly and gene annotation, as well as approaches to address these problems and outstanding issues that users of the data must bear in mind.
Subject(s)
Anopheles/genetics, Genome, Animals, Chromosome Mapping/veterinary, Bacterial Artificial Chromosomes, Computational Biology, Genetic Variation, In Situ Hybridization/veterinary
ABSTRACT
In chordates, long-range cis-regulatory regions are involved in the control of transcription initiation (either as repressors or enhancers). Their main characteristics are that (i) they can be located as far as 1 Mb away from the transcription start site of the target gene, (ii) they can regulate more than one gene, and (iii) they are usually orientation-independent. Therefore, proper characterization of functional interactions between long-range cis-regulatory regions and their target genes remains problematic. We present a novel method to predict such interactions based on the analysis of rearrangements between the human and 16 other vertebrate genomes. Our method is based on the assumption that genome rearrangements that would disrupt the functional interaction between a cis-regulatory region and its target gene are likely to be deleterious. Therefore, conservation of synteny through evolution would be an indication of a functional interaction. We use our algorithm to predict the association between a set of 123,905 human candidate regulatory regions and their target gene(s). This genome-wide map of interactions has many potential applications, including the selection of candidate regions prior to in vivo experimental characterization, a better characterization of regulatory regions involved in position effect diseases, and an improved understanding of the mechanisms and importance of long-range regulation.
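The underlying assumption, that a regulatory region and its target gene tend to stay together across rearrangements, can be sketched in Python as a simple conservation score. This is only an illustration under stated assumptions, not the authors' algorithm: the function `synteny_scores`, the synteny-block identifiers and the toy species data are all hypothetical, and the score is a plain fraction rather than any statistical model.

```python
def synteny_scores(region, candidate_genes, blocks_by_species):
    """Score each candidate gene by the fraction of species in which it
    lies in the same synteny block as the regulatory region.

    blocks_by_species maps species -> {element name: synteny block id}.
    Species where either element is missing are treated as uninformative.
    """
    scores = {}
    for gene in candidate_genes:
        together = 0
        informative = 0
        for blocks in blocks_by_species.values():
            if region not in blocks or gene not in blocks:
                continue  # element not aligned in this species
            informative += 1
            if blocks[region] == blocks[gene]:
                together += 1  # no rearrangement separated the pair
        scores[gene] = together / informative if informative else None
    return scores


# Toy data: block ids are arbitrary labels within each species.
blocks = {
    "mouse": {"R1": 1, "geneA": 1, "geneB": 2},
    "chicken": {"R1": 5, "geneA": 5, "geneB": 5},
    "fugu": {"R1": 3, "geneA": 3},
}
scores = synteny_scores("R1", ["geneA", "geneB"], blocks)
```

Here `geneA` stays with the region `R1` in every informative species, while `geneB` is separated from it in one of two, so `geneA` would be the better-supported candidate target under this simplified criterion.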
Subject(s)
Chromosome Mapping/methods, Transcriptional Regulatory Elements, Synteny, Animals, Chordates/genetics, Computer Simulation, Conserved Sequence, Genetic Epistasis, Gene Expression Regulation, Gene Rearrangement, Genome, Histones/genetics, Humans, Likelihood Functions, Genetic Models, Mutation, Phylogeny
ABSTRACT
The developing vertebrate nervous system contains a remarkable array of neural cells organized into complex, evolutionarily conserved structures. The labeling of living cells in these structures is key for the understanding of brain development and function, yet the generation of stable lines expressing reporter genes in specific spatio-temporal patterns remains a limiting step. In this study we present a fast and reliable pipeline to efficiently generate a set of stable lines expressing a reporter gene in multiple neuronal structures in the developing nervous system in medaka. The pipeline combines both the accurate computational genome-wide prediction of neuronal specific cis-regulatory modules (CRMs) and a newly developed experimental setup to rapidly obtain transgenic lines in a cost-effective and highly reproducible manner. 95% of the CRMs tested in our experimental setup show enhancer activity in various and numerous neuronal structures belonging to all major brain subdivisions. This pipeline represents a significant step towards the dissection of embryonic neuronal development in vertebrates.
Subject(s)
Computational Biology/methods, Genetic Enhancer Elements/genetics, Reporter Genes, Neurons/metabolism, Oryzias/genetics, Animals, Genetically Modified Animals, Developmental Gene Expression Regulation, Genome/genetics
ABSTRACT
As more genomes are sequenced, there is an increasing need for automated first-pass annotation, which allows timely access to important genomic information. The Ensembl gene-building system enables fast automated annotation of eukaryotic genomes. It annotates genes based on evidence derived from known protein, cDNA, and EST sequences. The gene-building system rests on top of the core Ensembl (MySQL) database schema and Perl Application Programming Interface (API), and the data generated are accessible through the Ensembl genome browser (http://www.ensembl.org). To date, the Ensembl predicted gene sets are available for the A. gambiae, C. briggsae, zebrafish, mouse, rat, and human genomes and have been heavily relied upon in the publication of the human, mouse, rat, and A. gambiae genome sequence analyses. Here we describe in detail the gene-building system and the algorithms involved. All code and data are freely available from http://www.ensembl.org.
Subject(s)
Automation, Computational Biology/methods, Genes/physiology, Animals, Anopheles/genetics, Caenorhabditis/genetics, DNA/genetics, Helminth DNA/genetics, Expressed Sequence Tags, Gene Dosage, Helminth Genes/physiology, Insect Genes/physiology, Genome, Human Genome, Helminth Proteins/genetics, Humans, Insect Proteins/genetics, Mice, Predictive Value of Tests, Proteins/genetics, Pseudogenes/genetics, Rats, Sequence Alignment/methods, Amino Acid Sequence Homology, Software, Tandem Repeat Sequences/genetics, Untranslated Regions/genetics
ABSTRACT
The Ensembl pipeline is an extension to the Ensembl system which allows automated annotation of genomic sequence. The software comprises two parts. First, there is a set of Perl modules ("Runnables" and "RunnableDBs") which are 'wrappers' for a variety of commonly used analysis tools. These retrieve sequence data from a relational database, run the analysis, and write the results back to the database. They inherit from a common interface, which simplifies the writing of new wrapper modules. On top of this sits a job submission system (the "RuleManager") which allows efficient and reliable submission of large numbers of jobs to a compute farm. Here we describe the fundamental software components of the pipeline, and we also highlight some features of the Sanger installation which were necessary to enable the pipeline to scale to whole-genome analysis.
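The wrapper pattern described here (retrieve the sequence, run the analysis, write the results back, all behind a common interface) can be sketched as follows. The real modules are written in Perl; this Python analogue is only illustrative, and the class layout, method names, the in-memory dictionary standing in for the relational database, and the `GCContent` analysis are all assumptions rather than the pipeline's actual code.

```python
from abc import ABC, abstractmethod


class RunnableDB(ABC):
    """Hypothetical common interface: subclasses only implement run()."""

    def __init__(self, db, input_id):
        self.db = db              # stand-in for a relational database
        self.input_id = input_id  # identifies the sequence to analyse
        self.sequence = None
        self.output = None

    def fetch_input(self):
        # Retrieve the sequence for this job from the database.
        self.sequence = self.db["sequences"][self.input_id]

    @abstractmethod
    def run(self):
        """Run the wrapped analysis and store the result in self.output."""

    def write_output(self):
        # Write the result back, keyed by input and analysis name.
        results = self.db["results"].setdefault(self.input_id, {})
        results[type(self).__name__] = self.output

    def execute(self):
        self.fetch_input()
        self.run()
        self.write_output()


class GCContent(RunnableDB):
    """Toy analysis standing in for a wrapped external tool."""

    def run(self):
        seq = self.sequence.upper()
        gc = seq.count("G") + seq.count("C")
        self.output = gc / len(seq) if seq else 0.0


db = {"sequences": {"contig1": "GGCCAATT"}, "results": {}}
GCContent(db, "contig1").execute()
```

Because fetching and writing live in the base class, a new wrapper in this sketch needs only a `run` method, which mirrors the abstract's point that a shared interface simplifies writing new modules. A job submission layer would then call `execute` for each analysis and input identifier.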
Subject(s)
Computational Biology/methods, Base Sequence/genetics, DNA/genetics, Genetic Databases/standards, Programming Languages, Proteins/classification, Software, Software Design
ABSTRACT
Ensembl (http://www.ensembl.org/) is a bioinformatics project to organize biological information around the sequences of large genomes. It is a comprehensive source of stable automatic annotation of individual genomes, and of the synteny and orthology relationships between them. It is also a framework for integration of any biological data that can be mapped onto features derived from the genomic sequence. Ensembl is available as an interactive Web site, a set of flat files, and as a complete, portable open source software system for handling genomes. All data are provided without restriction, and code is freely available. Ensembl's aims are to continue to "widen" this biological integration to include other model organisms relevant to understanding human biology as they become available; to "deepen" this integration to provide an ever more seamless linkage between equivalent components in different species; and to provide further classification of functional elements in the genome that have been previously elusive.