ABSTRACT
Here we analyse genetic variation, population structure and diversity among 3,010 diverse Asian cultivated rice (Oryza sativa L.) genomes from the 3,000 Rice Genomes Project. Our results are consistent with the five major groups previously recognized, but also suggest several unreported subpopulations that correlate with geographic location. We identified 29 million single nucleotide polymorphisms, 2.4 million small indels and over 90,000 structural variations that contribute to within- and between-population variation. Using pan-genome analyses, we identified more than 10,000 novel full-length protein-coding genes and a high number of presence-absence variations. The complex patterns of introgression observed in domestication genes are consistent with multiple independent rice domestication events. The public availability of data from the 3,000 Rice Genomes Project provides a resource for rice genomics research and breeding.
Subject(s)
Crops, Agricultural/classification , Crops, Agricultural/genetics , Genetic Variation , Genome, Plant/genetics , Oryza/classification , Oryza/genetics , Asia , Evolution, Molecular , Genes, Plant/genetics , Genetics, Population , Genomics , Haplotypes , INDEL Mutation/genetics , Phylogeny , Plant Breeding , Polymorphism, Single Nucleotide/genetics
ABSTRACT
Mutations in the GJB2 gene are known to be a major cause of autosomal recessive deafness 1A (OMIM 220290). The most common pathogenic variants of the GJB2 gene have a high ethno-geographic specificity in their distribution, being attributed to a founder effect related to the Neolithic migration routes of Homo sapiens. The c.-23 + 1G > A splice site variant is frequently found among deaf patients of both Caucasian and Asian origins. It is currently unknown whether the spread of this mutation across Eurasia is a result of the founder effect or if it could have multiple local centers of origin. To determine the origin of c.-23 + 1G > A, we reconstructed haplotypes by genotyping SNPs on an Illumina OmniExpress 730 K platform of 23 deaf individuals homozygous for this variant from different populations of Eurasia. The analyses revealed the presence of common regions of homozygosity in different individual genomes in the sample. These data support the hypothesis of the common founder effect in the distribution of the c.-23 + 1G > A variant of the GJB2 gene. Based on the published data on the c.-23 + 1G > A prevalence among 16,177 deaf people and the calculation of the TMRCA of the modified f2-haplotypes carrying this variant, we reconstructed the potential migration routes of the carriers of this mutation around the world. This analysis indicates that the c.-23 + 1G > A variant in the GJB2 gene may have originated approximately 6000 years ago in the territory of the Caucasus or the Middle East then spread throughout Europe, South and Central Asia and other regions of the world.
Subject(s)
Deafness , Founder Effect , Connexin 26/genetics , Connexins/genetics , Deafness/epidemiology , Deafness/genetics , Hearing Loss, Sensorineural , Humans , Mutation
ABSTRACT
Multiple sequence alignments (MSAs) are a prerequisite for a wide variety of evolutionary analyses. Published assessments and benchmark data sets for protein and, to a lesser extent, global nucleotide MSAs are available, but less effort has been made to establish benchmarks in the more general problem of whole-genome alignment (WGA). Using the same model as the successful Assemblathon competitions, we organized a competitive evaluation in which teams submitted their alignments and then assessments were performed collectively after all the submissions were received. Three data sets were used: two were simulated and based on primate and mammalian phylogenies, and one comprised 20 real fly genomes. In total, 35 submissions were assessed, submitted by 10 teams using 12 different alignment pipelines. We found agreement between independent simulation-based and statistical assessments, indicating that there are substantial accuracy differences between contemporary alignment tools. We saw considerable differences in the alignment quality of differently annotated regions and found that few tools aligned the duplications analyzed. We found that many tools worked well at shorter evolutionary distances, but fewer performed competitively at longer distances. We provide all data sets, submissions, and assessment programs for further study and provide, as a resource for future benchmarking, a convenient repository of code and data for reproducing the simulation assessments.
Subject(s)
Genome , Genomics/methods , Sequence Alignment/methods , Software , Animals , Computational Biology/methods , Computer Simulation , Datasets as Topic , Genome-Wide Association Study , Humans , Mammals/genetics , Phylogeny , Reproducibility of Results
ABSTRACT
The U.S. Department of Energy (DOE) Joint Genome Institute (JGI), a national user facility, serves the diverse scientific community by providing integrated high-throughput sequencing and computational analysis to enable system-based scientific approaches in support of DOE missions related to clean energy generation and environmental characterization. The JGI Genome Portal (http://genome.jgi.doe.gov) provides unified access to all JGI genomic databases and analytical tools. The JGI maintains extensive data management systems and specialized analytical capabilities to manage and interpret complex genomic data. A user can search, download and explore multiple data sets available for all DOE JGI sequencing projects including their status, assemblies and annotations of sequenced genomes. Here we describe major updates of the Genome Portal in the past 2 years with a specific emphasis on efficient handling of the rapidly growing amount of diverse genomic data accumulated in JGI.
Subject(s)
Databases, Nucleic Acid , Genomics , Genome , High-Throughput Nucleotide Sequencing , Internet , Sequence Analysis, DNA , Systems Integration
ABSTRACT
BACKGROUND: To examine the contributions of sequence and function conservation in the evolution of enhancers, we systematically identified enhancers whose sequences are not conserved among distant groups of vertebrate species, but have homologous function and are likely to be derived from a common ancestral sequence. Our approach combined comparative genomics and epigenomics to identify potential enhancer sequences in the genomes of three groups of distantly related vertebrate species. RESULTS: We searched for sequences that were conserved within groups of closely related species but not between groups of more distant species, and were associated with an epigenetic mark of enhancer activity. To facilitate inferring orthology between non-conserved sequences, we limited our search to introns whose orthology could be unambiguously established by mapping the bracketing exons. We show that a subset of these non-conserved but syntenic sequences from the mouse and zebrafish genomes have homologous functions in a zebrafish transgenic enhancer assay. The conserved expression patterns driven by these enhancers are probably associated with short transcription factor-binding motifs present in the divergent sequences. CONCLUSIONS: We have identified numerous potential enhancers with divergent sequences but a conserved function. These results indicate that selection on function, rather than sequence, may be a common mode of enhancer evolution; evidence for selection at the sequence level is not a necessary criterion to define a gene regulatory element.
Subject(s)
Conserved Sequence , Enhancer Elements, Genetic , Genetic Variation , Vertebrates/genetics , Animals , Animals, Genetically Modified , Binding Sites , Computational Biology/methods , Evolution, Molecular , Gene Expression Profiling , Genome-Wide Association Study , Nucleotide Motifs , Position-Specific Scoring Matrices , Protein Binding , Reproducibility of Results , Transcription Factors/metabolism
ABSTRACT
Sorghum, an African grass related to sugar cane and maize, is grown for food, feed, fibre and fuel. We present an initial analysis of the approximately 730-megabase Sorghum bicolor (L.) Moench genome, placing approximately 98% of genes in their chromosomal context using whole-genome shotgun sequence validated by genetic, physical and syntenic information. Genetic recombination is largely confined to about one-third of the sorghum genome with gene order and density similar to those of rice. Retrotransposon accumulation in recombinationally recalcitrant heterochromatin explains the approximately 75% larger genome size of sorghum compared with rice. Although gene and repetitive DNA distributions have been preserved since palaeopolyploidization approximately 70 million years ago, most duplicated gene sets lost one member before the sorghum-rice divergence. Concerted evolution makes one duplicated chromosomal segment appear to be only a few million years old. About 24% of genes are grass-specific and 7% are sorghum-specific. Recent gene and microRNA duplications may contribute to sorghum's drought tolerance.
Subject(s)
Evolution, Molecular , Genome, Plant/genetics , Poaceae/genetics , Sorghum/genetics , Arabidopsis/genetics , Chromosomes, Plant/genetics , Gene Duplication , Genes, Plant , Oryza/genetics , Populus/genetics , Recombination, Genetic/genetics , Sequence Alignment , Sequence Analysis, DNA , Sequence Deletion/genetics , Zea mays/genetics
ABSTRACT
The VISTA portal for comparative genomics is designed to give biomedical scientists a unified set of tools to lead them from the raw DNA sequences through the alignment and annotation to the visualization of the results. The VISTA portal also hosts the alignments of a number of genomes computed by our group, allowing users to study the regions of their interest without having to manually download the individual sequences. Here we describe various algorithmic and functional improvements implemented in the VISTA portal over the last 2 years. The VISTA Portal is accessible at http://genome.lbl.gov/vista.
Subject(s)
Computational Biology/methods , Genomics , Internet , Sequence Alignment , Software , Animals , Base Sequence , Chickens , Dogs , Genome, Human , Humans , Molecular Sequence Data , Polymorphism, Single Nucleotide , Sequence Analysis, DNA , Sequence Homology, Nucleic Acid
ABSTRACT
Correlation of motif occurrences with gene expression intensity is an effective strategy for elucidating transcriptional cis-regulatory logic. Here we demonstrate that this approach can also identify cis-regulatory elements for alternative pre-mRNA splicing. Using data from a human exon microarray, we identified 56 cassette exons that exhibited higher transcript-normalized expression in muscle than in other normal adult tissues. Intron sequences flanking these exons were then analyzed to identify candidate regulatory motifs for muscle-specific alternative splicing. Correlation of motif parameters with gene-normalized exon expression levels was examined using linear regression and linear splines on RNA words and degenerate weight matrices, respectively. Our unbiased analysis uncovered multiple candidate regulatory motifs for muscle-specific splicing, many of which are phylogenetically conserved among vertebrate genomes. The most prominent downstream motifs were binding sites for Fox1- and CELF-related splicing factors, and a branchpoint-like element acuaac; pyrimidine-rich elements resembling PTB-binding sites were most significant in upstream introns. Intriguingly, our systematic study indicates a paucity of novel muscle-specific elements that are dominant in short proximal intronic regions. We propose that Fox and CELF proteins play major roles in enforcing the muscle-specific alternative splicing program, facilitating expression of unique isoforms of cytoskeletal proteins critical to muscle cell function.
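The regression idea in this abstract can be illustrated with a minimal sketch: count occurrences of an RNA word in each flanking intron, then fit a least-squares line of exon expression against those counts. The sequences and expression values below are hypothetical toy data, the helper names `count_word` and `fit_line` are illustrative, and the study's actual pipeline (linear splines, degenerate weight matrices, significance testing) is not reproduced here. Only the word `acuaac` is taken from the abstract.

```python
def count_word(seq: str, word: str) -> int:
    """Count occurrences of word in seq, allowing overlapping matches."""
    seq, word = seq.lower(), word.lower()
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("motif counts are constant; slope is undefined")
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical toy data: downstream intron sequences and the
# gene-normalized expression of the corresponding exons.
introns = ["gggacuaacggg", "acuaaccacuaac", "gggggggggggg"]
expression = [2.0, 3.5, 0.5]

counts = [count_word(s, "acuaac") for s in introns]
slope, intercept = fit_line(counts, expression)
print(counts, slope, intercept)
```

A positive slope in such a fit would suggest that exons flanked by more copies of the word tend to show higher expression; on real data the fit would need a significance test and correction for the number of words examined.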
Subject(s)
Alternative Splicing , Computational Biology/methods , Introns , Regulatory Sequences, Ribonucleic Acid , Sequence Analysis, RNA/methods , Animals , Base Sequence , Binding Sites , Conserved Sequence , Cytoskeletal Proteins/genetics , Cytoskeletal Proteins/metabolism , Exons , Gene Expression Profiling , Humans , Muscle, Skeletal/metabolism , Myocardium/metabolism , RNA Precursors/chemistry , RNA, Messenger/chemistry , RNA, Messenger/metabolism , RNA-Binding Proteins/metabolism , Transcription, Genetic
ABSTRACT
Comparison of DNA sequences from different species is a fundamental method for identifying functional elements in genomes. Here, we describe the VISTA family of tools created to assist biologists in carrying out this task. Our first VISTA server at http://www-gsd.lbl.gov/vista/ was launched in the summer of 2000 and was designed to align long genomic sequences and visualize these alignments with associated functional annotations. Currently the VISTA site includes multiple comparative genomics tools and provides users with rich capabilities to browse pre-computed whole-genome alignments of large vertebrate genomes and other groups of organisms with VISTA Browser, to submit their own sequences of interest to several VISTA servers for various types of comparative analysis and to obtain detailed comparative analysis results for a set of cardiovascular genes. We illustrate capabilities of the VISTA site by the analysis of a 180 kb interval on human chromosome 5 that encodes the kinesin family member 3A (KIF3A) protein.
Subject(s)
Computational Biology , Genomics , Software , Binding Sites , Chromosomes, Human, Pair 5 , Genome, Human , Humans , Internet , Kinesins/genetics , Phylogeny , Regulatory Sequences, Nucleic Acid , Sequence Alignment , Sequence Analysis, DNA , Transcription Factors/metabolism
ABSTRACT
An essential step in the discovery of molecular mechanisms contributing to disease phenotypes and efficient experimental planning is the development of weighted hypotheses that estimate the functional effects of sequence variants discovered by high-throughput genomics. With the increasing specialization of the bioinformatics resources, creating analytical workflows that seamlessly integrate data and bioinformatics tools developed by multiple groups becomes inevitable. Here we present a case study of a use of the distributed analytical environment integrating four complementary specialized resources, namely the Lynx platform, VISTA RViewer, the Developmental Brain Disorders Database (DBDB), and the RaptorX server, for the identification of high-confidence candidate genes contributing to pathogenesis of spina bifida. The analysis resulted in prediction and validation of deleterious mutations in the SLC19A placental transporter in mothers of the affected children that causes narrowing of the outlet channel and therefore leads to the reduced folate permeation rate. The described approach also enabled correct identification of several genes, previously shown to contribute to pathogenesis of spina bifida, and suggestion of additional genes for experimental validations. The study demonstrates that the seamless integration of bioinformatics resources enables fast and efficient prioritization and characterization of genomic factors and molecular networks contributing to the phenotypes of interest.
Subject(s)
Mutation , Reduced Folate Carrier Protein/genetics , Spinal Dysraphism/genetics , Child , Female , Folic Acid/metabolism , Genomics/methods , Humans , Models, Molecular , Pregnancy , Protein Conformation , Reduced Folate Carrier Protein/chemistry , Reduced Folate Carrier Protein/metabolism , Software , Spinal Dysraphism/metabolism
ABSTRACT
It was proposed that if some mRNA characteristics resulted in a low efficiency of termination signal, an additional closely located stop codon (tandem stop codons) could be used to prevent the harmful readthrough. However, the role of tandem terminators in higher eukaryotes was not verified and remains hypothetical. In this work the sequence features of Arabidopsis thaliana and Oryza sativa mRNAs were analyzed. It was found that plant mRNAs with UGA terminator were characterized by a higher frequency of nonsense codons in the first triplet position of 3'-UTR that could result from a weak natural selection for "reserve" stop signal. Interestingly, the presence of tandem stop codons positively correlated with a specific amino acid composition in the C-terminal position of the encoded proteins. In particular, C-terminal glycine positively correlated with significantly higher frequencies of reserve terminators at the beginning positions of 3'-UTR in UGA-containing mRNAs. This finding coincides with some earlier observations concerning the role of glycine and its codons in inefficient termination of translation and recoding (e.g., 2A oligopeptide).
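The feature examined in this abstract, a "reserve" stop codon in the first triplet of the 3'-UTR, can be checked with a short sketch. It assumes the coding sequence is supplied with its terminator included and the 3'-UTR is supplied separately; the function name `tandem_stop` and the example sequences are hypothetical, and the study's statistical comparison across mRNAs is not shown.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def tandem_stop(cds: str, utr3: str):
    """Return (main_stop, has_reserve_stop, c_terminal_is_glycine).

    cds  -- coding sequence including its stop codon (DNA or RNA letters)
    utr3 -- 3'-UTR sequence starting right after the stop codon
    """
    cds = cds.upper().replace("T", "U")
    utr3 = utr3.upper().replace("T", "U")
    main_stop = cds[-3:]
    if len(cds) % 3 != 0 or main_stop not in STOP_CODONS:
        raise ValueError("CDS must be in frame and end with a stop codon")
    # The "reserve" terminator is an in-frame stop in the first 3'-UTR triplet.
    has_reserve_stop = utr3[:3] in STOP_CODONS
    # Glycine is encoded by GGN; check the last sense codon before the stop.
    c_terminal_is_glycine = cds[-6:-3].startswith("GG")
    return main_stop, has_reserve_stop, c_terminal_is_glycine


# Hypothetical example: Met-Gly, UGA terminator, UAA as the first 3'-UTR triplet.
print(tandem_stop("AUGGGCUGA", "UAAGCC"))
```

Tallying `has_reserve_stop` separately for each terminator and each C-terminal residue over a transcript set would give the kind of frequency table the abstract describes.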
Subject(s)
Arabidopsis/genetics , Codon, Terminator , Oryza/genetics , RNA, Messenger/genetics , RNA, Plant , 3' Untranslated Regions , Glycine/genetics
ABSTRACT
The western clawed frog Xenopus tropicalis is an important model for vertebrate development that combines experimental advantages of the African clawed frog Xenopus laevis with more tractable genetics. Here we present a draft genome sequence assembly of X. tropicalis. This genome encodes more than 20,000 protein-coding genes, including orthologs of at least 1700 human disease genes. Over 1 million expressed sequence tags validated the annotation. More than one-third of the genome consists of transposable elements, with unusually prevalent DNA transposons. Like that of other tetrapods, the genome of X. tropicalis contains gene deserts enriched for conserved noncoding elements. The genome exhibits substantial shared synteny with human and chicken over major parts of large chromosomes, broken by lineage-specific chromosome fusions and fissions, mainly in the mammalian lineage.
Subject(s)
Genome , Sequence Analysis, DNA , Xenopus/genetics , Animals , Chickens/genetics , Chromosome Mapping , Chromosomes/genetics , Computational Biology , Conserved Sequence , DNA Transposable Elements , DNA, Complementary , Embryo, Nonmammalian/metabolism , Evolution, Molecular , Expressed Sequence Tags , Gene Duplication , Genes , Humans , Phylogeny , Synteny , Vertebrates/genetics , Xenopus/embryology , Xenopus Proteins/genetics
ABSTRACT
Multiple sequence alignments have become one of the most commonly used resources in genomics research. Most algorithms for multiple alignment of whole genomes rely either on a reference genome, against which all of the other sequences are laid out, or require a one-to-one mapping between the nucleotides of the genomes, preventing the alignment of recently duplicated regions. Both approaches have drawbacks for whole-genome comparisons. In this paper we present a novel symmetric alignment algorithm. The resulting alignments not only represent all of the genomes equally well, but also include all relevant duplications that occurred since the divergence from the last common ancestor. Our algorithm, implemented as a part of the VISTA Genome Pipeline (VGP), was used to align seven vertebrate and six Drosophila genomes. The resulting whole-genome alignments demonstrate a higher sensitivity and specificity than the pairwise alignments previously available through the VGP and have higher exon alignment accuracy than comparable public whole-genome alignments. Of the multiple alignment methods tested, ours performed the best at aligning genes from multigene families-perhaps the most challenging test for whole-genome alignments. Our whole-genome multiple alignments are available through the VISTA Browser at http://genome.lbl.gov/vista/index.shtml.
Subject(s)
Computational Biology , Genome , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Sequence Analysis, DNA/statistics & numerical data , Algorithms , Databases, Nucleic Acid , Exons/genetics , Multigene Family , Software
ABSTRACT
BACKGROUND: Independently evolving lineages mostly accumulate different changes, which leads to their gradual divergence. However, parallel accumulation of identical changes is also common, especially in traits with only a small number of possible states. RESULTS: We characterize parallelism in evolution of coding sequences in three four-species sets of genomes of mammals, Drosophila, and yeasts. Each such set contains two independent evolutionary paths, which we call paths I and II. An amino acid replacement which occurred along path I also occurs along path II with the probability 50-80% of that expected under selective neutrality. Thus, the per site rate of parallel evolution of proteins is several times higher than their average rate of evolution, but still lower than the rate of evolution of neutral sequences. This deficit may be caused by changes in the fitness landscape, leading to a replacement being possible along path I but not along path II. However, constant, weak selection assumed by the nearly neutral model of evolution appears to be a more likely explanation. Then, the average coefficient of selection associated with an amino acid replacement, in the units of the effective population size, must exceed approximately 0.4, and the fraction of effectively neutral replacements must be below approximately 30%. At a majority of evolvable amino acid sites, only a relatively small number of different amino acids is permitted. CONCLUSION: High, but below-neutral, rates of parallel amino acid replacements suggest that a majority of amino acid replacements that occur in evolution are subject to weak, but non-trivial, selection, as predicted by Ohta's nearly-neutral theory.
Subject(s)
Amino Acids/genetics , Evolution, Molecular , Proteins/chemistry , Alleles , Amino Acid Sequence , Amino Acid Substitution , Amino Acids/chemistry , Animals , Dogs , Drosophila/genetics , Humans , Mice , Phylogeny , Proteins/genetics , Rats , Saccharomyces cerevisiae/genetics , Selection, Genetic , Sequence Alignment
ABSTRACT
The availability of the assembled mouse genome makes possible, for the first time, an alignment and comparison of two large vertebrate genomes. We investigated different strategies of alignment for the subsequent analysis of conservation of genomes that are effective for assemblies of different quality. These strategies were applied to the comparison of the working draft of the human genome with the Mouse Genome Sequencing Consortium assembly, as well as other intermediate mouse assemblies. Our methods are fast and the resulting alignments exhibit a high degree of sensitivity, covering more than 90% of known coding exons in the human genome. We obtained such coverage while preserving specificity. With a view towards the end user, we developed a suite of tools and Web sites for automatically aligning and subsequently browsing and working with whole-genome comparisons. We describe the use of these tools to identify conserved non-coding regions between the human and mouse genomes, some of which have not been identified by other methods.
Subject(s)
Genome, Human , Genome , Research Design , Sequence Alignment/instrumentation , Sequence Alignment/methods , Algorithms , Animals , Chromosomes/genetics , Chromosomes, Human/genetics , Computer Communication Networks/instrumentation , Databases, Genetic , Humans , Internet/instrumentation , Mice , Software
ABSTRACT
MOTIVATION: To compare entire genomes from different species, biologists increasingly need alignment methods that are efficient enough to handle long sequences, and accurate enough to correctly align the conserved biological features between distant species. The two main classes of pairwise alignments are global alignment, where one string is transformed into the other, and local alignment, where all locations of similarity between the two strings are returned. Global alignments are less prone to demonstrating false homology as each letter of one sequence is constrained to being aligned to only one letter of the other. Local alignments, on the other hand, can cope with rearrangements between non-syntenic, orthologous sequences by identifying similar regions in sequences; this, however, comes at the expense of a higher false positive rate due to the inability of local aligners to take into account overall conservation maps. RESULTS: In this paper we introduce the notion of glocal alignment, a combination of global and local methods, where one creates a map that transforms one sequence into the other while allowing for rearrangement events. We present Shuffle-LAGAN, a glocal alignment algorithm that is based on the CHAOS local alignment algorithm and the LAGAN global aligner, and is able to align long genomic sequences. To test Shuffle-LAGAN we split the mouse genome into BAC-sized pieces, and aligned these pieces to the human genome. We demonstrate that Shuffle-LAGAN compares favorably in terms of sensitivity and specificity with standard local and global aligners. From the alignments we conclude that about 9% of human/mouse homology may be attributed to small rearrangements, 63% of which are duplications.
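The contrast between global and local alignment drawn in this abstract can be sketched with the textbook dynamic programs (Needleman-Wunsch for global, Smith-Waterman for local). This is not Shuffle-LAGAN, CHAOS, or LAGAN; the scoring values (match 1, mismatch -1, linear gap -1) and the function name `alignment_score` are assumptions chosen for illustration.

```python
def alignment_score(a: str, b: str, local: bool = False,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best alignment score of a and b under a simple linear gap penalty.

    local=False: global alignment, every letter of both strings is used.
    local=True:  local alignment, the best-scoring pair of substrings.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps; local alignment does not.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Hypothetical sequences sharing only the block "CGT", at different offsets.
x, y = "AAAACGT", "CGTCCCC"
print(alignment_score(x, y), alignment_score(x, y, local=True))
```

With these toy sequences the local score captures the shared block while the global score is dragged down by the unrelated flanks, which is the trade-off the abstract describes. A glocal method additionally chains such local hits into a map that tolerates rearrangements, which this sketch does not attempt.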
Subject(s)
Chromosome Mapping/methods , DNA/analysis , DNA/chemistry , Gene Expression Profiling/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Algorithms , Animals , Base Sequence , DNA/genetics , Genome, Human , Humans , Mice , Molecular Sequence Data , Sequence Homology, Nucleic Acid
ABSTRACT
We have built a whole-genome multiple alignment of the three currently available mammalian genomes using a fully automated pipeline that combines the local/global approach of the Berkeley Genome Pipeline and the LAGAN program. The strategy is based on progressive alignment and consists of two main steps: (1) alignment of the mouse and rat genomes, and (2) alignment of human to either the mouse-rat alignments from step 1, or the remaining unaligned mouse and rat sequences. The resulting alignments demonstrate high sensitivity, with 87% of all human gene-coding areas aligned in both mouse and rat. The specificity is also high: <7% of the rat contigs are aligned to multiple places in human, and 97% of all alignments with human sequence >100 kb agree with a three-way synteny map built independently, using predicted exons in the three genomes. At the nucleotide level <1% of the rat nucleotides are mapped to multiple places in the human sequence in the alignment, and 96.5% of human nucleotides within all alignments agree with the synteny map. The alignments are publicly available online, with visualization through the novel Multi-VISTA browser that we also present.