1.
BMC Genet ; 21(1): 25, 2020 Mar 06.
Article in English | MEDLINE | ID: mdl-32138667

ABSTRACT

BACKGROUND: POLG, located on nuclear chromosome 15, encodes the DNA polymerase γ (Pol γ). Pol γ is responsible for the replication and repair of mitochondrial DNA (mtDNA). Pol γ is the only DNA polymerase found in mitochondria for most animal cells. Mutations in POLG are the most common single-gene cause of diseases of mitochondria and have been mapped over the coding region of the POLG ORF. RESULTS: Using PhyloCSF to survey alternative reading frames, we found a conserved coding signature in an alternative frame in exons 2 and 3 of POLG, herein referred to as ORF-Y, that arose de novo in placental mammals. Using the synplot2 program, synonymous site conservation was found among mammals in the region of the POLG ORF that is overlapped by ORF-Y. Ribosome profiling data revealed that ORF-Y is translated and that initiation likely occurs at a CUG codon. Inspection of an alignment of mammalian sequences containing ORF-Y revealed that the CUG codon has a strong initiation context and that a well-conserved predicted RNA stem-loop begins 14 nucleotides downstream. Such features are associated with enhanced initiation at near-cognate non-AUG codons. Reanalysis of the Kim et al. (2014) draft human proteome dataset yielded two unique peptides that map unambiguously to ORF-Y. An additional conserved uORF, herein referred to as ORF-Z, was also found in exon 2 of POLG. Lastly, we surveyed ClinVar variants that are synonymous with respect to the POLG ORF and found that most of these variants cause amino acid changes in ORF-Y or ORF-Z. CONCLUSIONS: We provide evidence for a novel coding sequence, ORF-Y, that overlaps the POLG ORF. Ribosome profiling and mass spectrometry data show that ORF-Y is expressed. PhyloCSF and synplot2 analyses show that ORF-Y is subject to strong purifying selection. An abundance of disease-correlated mutations that map to exons 2 and 3 of POLG but also affect ORF-Y provides potential clinical significance to this finding.
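
The point that a variant can be synonymous in one reading frame yet change an amino acid in an overlapping frame can be illustrated with a minimal Python sketch. The 7-nucleotide sequences below are invented toy data, not POLG or ORF-Y sequence, and the helper name `translate` is hypothetical; only the standard genetic code is taken as given.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame=0):
    """Translate every complete codon of seq, starting at offset `frame`."""
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


# Toy reference and variant (assumption: illustrative only, T>C at position 3).
ref = "GCTGAAT"
alt = "GCCGAAT"

main_ref, main_alt = translate(ref, 0), translate(alt, 0)  # annotated frame
over_ref, over_alt = translate(ref, 1), translate(alt, 1)  # overlapping +1 frame

print("frame 0:", main_ref, "->", main_alt)  # synonymous: AE -> AE
print("frame 1:", over_ref, "->", over_alt)  # missense:   LN -> PN
```

In this toy case the substitution leaves the frame-0 peptide unchanged (GCT and GCC both encode alanine) while turning CTG (leucine) into CCG (proline) in the +1 frame, which is the kind of effect the abstract reports for variants overlapping ORF-Y or ORF-Z.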

2.
NPJ Genom Med ; 4: 31, 2019.
Article in English | MEDLINE | ID: mdl-31814998

ABSTRACT

The developmental and epileptic encephalopathies (DEE) are a group of rare, severe neurodevelopmental disorders, where even the most thorough sequencing studies leave 60-65% of patients without a molecular diagnosis. Here, we explore the incompleteness of transcript models used for exome and genome analysis as one potential explanation for a lack of current diagnoses. Therefore, we have updated the GENCODE gene annotation for 191 epilepsy-associated genes, using human brain-derived transcriptomic libraries and other data to build 3,550 putative transcript models. Our annotations increase the transcriptional 'footprint' of these genes by over 674 kb. Using SCN1A as a case study, due to its close phenotype/genotype correlation with Dravet syndrome, we screened 122 people with Dravet syndrome or a similar phenotype with a panel of exon sequences representing eight established genes and identified two de novo SCN1A variants that now - through improved gene annotation - are ascribed to residing among our exons. These two (from 122 screened people, 1.6%) molecular diagnoses carry significant clinical implications. Furthermore, we identified a previously classified SCN1A intronic Dravet syndrome-associated variant that now lies within a deeply conserved exon. Our findings illustrate the potential gains of thorough gene annotation in improving diagnostic yields for genetic disorders.

3.
Genome Res ; 29(12): 2073-2087, 2019 12.
Article in English | MEDLINE | ID: mdl-31537640

ABSTRACT

The most widely appreciated role of DNA is to encode protein, yet the exact portion of the human genome that is translated remains to be ascertained. We previously developed PhyloCSF, a widely used tool to identify evolutionary signatures of protein-coding regions using multispecies genome alignments. Here, we present the first whole-genome PhyloCSF prediction tracks for human, mouse, chicken, fly, worm, and mosquito. We develop a workflow that uses machine learning to predict novel conserved protein-coding regions and efficiently guide their manual curation. We analyze more than 1000 high-scoring human PhyloCSF regions and confidently add 144 conserved protein-coding genes to the GENCODE gene set, as well as additional coding regions within 236 previously annotated protein-coding genes, and 169 pseudogenes, most of them disabled after primates diverged. The majority of these represent new discoveries, including 70 previously undetected protein-coding genes. The novel coding genes are additionally supported by single-nucleotide variant evidence indicative of continued purifying selection in the human lineage, coding-exon splicing evidence from new GENCODE transcripts using next-generation transcriptomic data sets, and mass spectrometry evidence of translation for several new genes. Our discoveries required simultaneous comparative annotation of other vertebrate genomes, which we show is essential to remove spurious ORFs and to distinguish coding from pseudogene regions. Our new coding regions help elucidate disease-associated regions by revealing that 118 GWAS variants previously thought to be noncoding are in fact protein altering. Altogether, our PhyloCSF data sets and algorithms will help researchers seeking to interpret these genomes, while our new annotations present exciting loci for further experimental characterization.

4.
Sci Rep ; 9(1): 10757, 2019 Jul 24.
Article in English | MEDLINE | ID: mdl-31341188

ABSTRACT

Major urinary proteins (MUP) are the major component of the urinary protein fraction in house mice (Mus spp.) and rats (Rattus spp.). The structure, polymorphism and functions of these lipocalins have been well described in the western European house mouse (Mus musculus domesticus), clarifying their role in semiochemical communication. The complexity of these roles in the mouse raises the question of similar functions in other rodents, including the Norway rat, Rattus norvegicus. Norway rats express MUPs in urine but information about specific MUP isoform sequences and functions is limited. In this study, we present a detailed molecular characterization of the MUP proteoforms expressed in the urine of two laboratory strains, Wistar Han and Brown Norway, and wild-caught animals, using a combination of manual gene annotation, intact protein mass spectrometry and bottom-up mass spectrometry-based proteomic approaches. Cluster analysis shows the existence of only 10 predicted mup genes. Further, detailed sequencing of the urinary MUP isoforms reveals a less complex pattern of primary sequence polymorphism in the rat than the mouse. However, unlike the mouse, rat MUPs exhibit added complexity in the form of post-translational modifications, including the phosphorylation of Ser4 in some isoforms, and exoproteolytic trimming of specific isoforms. Our results raise the possibility that urinary MUPs may have different roles in rat chemical communication than those they play in the house mouse. Shotgun proteomics data are available via ProteomeXchange with identifier PXD013986.

5.
Nucleic Acids Res ; 47(D1): D766-D773, 2019 Jan 08.
Article in English | MEDLINE | ID: mdl-30357393

ABSTRACT

The accurate identification and description of the genes in the human and mouse genomes is a fundamental requirement for high quality analysis of data informing both genome biology and clinical genomics. Over the last 15 years, the GENCODE consortium has been producing reference quality gene annotations to provide this foundational resource. The GENCODE consortium includes both experimental and computational biology groups who work together to improve and extend the GENCODE gene annotation. Specifically, we generate primary data, create bioinformatics tools and provide analysis to support the work of expert manual gene annotators and automated gene annotation pipelines. In addition, manual and computational annotation workflows use any and all publicly available data and analysis, along with the research literature to identify and characterise gene loci to the highest standard. GENCODE gene annotations are accessible via the Ensembl and UCSC Genome Browsers, the Ensembl FTP site, Ensembl Biomart, Ensembl Perl and REST APIs as well as https://www.gencodegenes.org.

6.
Nucleic Acids Res ; 47(D1): D745-D751, 2019 Jan 08.
Article in English | MEDLINE | ID: mdl-30407521

ABSTRACT

The Ensembl project (https://www.ensembl.org) makes key genomic data sets available to the entire scientific community without restrictions. Ensembl seeks to be a fundamental resource driving scientific progress by creating, maintaining and updating reference genome annotation and comparative genomics resources. This year we describe our new and expanded gene, variant and comparative annotation capabilities, which led to a 50% increase in the number of vertebrate genomes we support. We have also doubled the number of available human variants and added regulatory regions for many mouse cell types and developmental stages. Our data sets and tools are available via the Ensembl website as well as through a RESTful web service, a Perl application programming interface and data files for download.
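
As a rough illustration of the RESTful access route mentioned above, the following Python sketch builds and queries a gene lookup by symbol. It assumes the public server at https://rest.ensembl.org and its `lookup/symbol` endpoint, neither of which is named in the abstract; the function names `lookup_url` and `lookup_gene` are hypothetical, and the fields of the returned JSON should be checked against the current Ensembl REST documentation.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the public Ensembl REST server; not stated in the abstract.
ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(symbol, species="homo_sapiens"):
    """Build the URL for a gene lookup by symbol, requesting a JSON response."""
    return (
        f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"
        "?content-type=application/json"
    )


def lookup_gene(symbol, species="homo_sapiens", timeout=30):
    """Fetch the gene record as a dict. Needs network access."""
    with urlopen(lookup_url(symbol, species), timeout=timeout) as response:
        return json.load(response)


# Building the URL has no side effects; the request itself is left to the caller.
print(lookup_url("SCN1A"))
```

Calling `lookup_gene("SCN1A")` would perform the request; it is kept separate from URL construction so the sketch can be inspected offline.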

7.
Epilepsia ; 59(8): 1557-1566, 2018 08.
Article in English | MEDLINE | ID: mdl-30009487

ABSTRACT

OBJECTIVE: With the exception of specific metabolic disorders, predictors of response to ketogenic dietary therapies (KDTs) are unknown. We aimed to determine whether common variation across the genome influences the response to KDT for epilepsy. METHODS: We genotyped individuals who were negative for glucose transporter type 1 deficiency syndrome or other metabolic disorders, who received KDT for epilepsy. Genotyping was performed with the Infinium HumanOmniExpressExome Beadchip. Hospital records were used to obtain demographic and clinical data. KDT response (≥50% seizure reduction) at 3-month follow-up was used to dissect out nonresponders and responders. We then performed a genome-wide association study (GWAS) in nonresponders vs responders, using a linear mixed model and correcting for population stratification. Variants with minor allele frequency <0.05 and those that did not pass quality control filtering were excluded. RESULTS: After quality control filtering, the GWAS of 112 nonresponders vs 123 responders revealed an association locus at 6p25.1, 61 kb upstream of CDYL (rs12204701, P = 3.83 × 10⁻⁸, odds ratio [A] = 13.5, 95% confidence interval [CI] 4.07-44.8). Although analysis of regional linkage disequilibrium around rs12204701 did not strengthen the likelihood of CDYL being the candidate gene, additional bioinformatic analyses suggest it is the most likely candidate. SIGNIFICANCE: CDYL deficiency has been shown to disrupt neuronal migration and to influence susceptibility to epilepsy in mice. Further exploration with a larger replication cohort is warranted to clarify whether CDYL is the causal gene underlying the association signal.
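
The reported odds ratio and confidence interval can be sanity-checked with a short Python sketch. It assumes a Wald-type interval that is symmetric on the log-odds scale, which the abstract does not state, so this is a plausibility check rather than a reconstruction of the authors' model; the function name `log_scale_summary` is hypothetical.

```python
import math


def log_scale_summary(lower, upper, z=1.96):
    """Recover the point estimate and log-scale standard error from a 95% CI.

    Assumes the interval is exp(log(OR) +/- z * SE), i.e. symmetric in log-odds.
    """
    log_lower, log_upper = math.log(lower), math.log(upper)
    point = math.exp((log_lower + log_upper) / 2)  # geometric mean of the bounds
    se = (log_upper - log_lower) / (2 * z)
    return point, se


# CI bounds as reported in the abstract for rs12204701.
point, se = log_scale_summary(4.07, 44.8)
print(f"implied OR = {point:.2f}, implied SE(log OR) = {se:.3f}")
```

Under that assumption the geometric mean of 4.07 and 44.8 comes out close to the reported odds ratio of 13.5, so the three figures are mutually consistent. The wide interval reflects the small cohort (112 nonresponders vs 123 responders).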


Subjects
Ketogenic Diet/methods, Drug Resistant Epilepsy/diet therapy, Drug Resistant Epilepsy/genetics, Pharmacognosy, Child, Preschool Child, Cohort Studies, Female, Genetic Predisposition to Disease, Genome-Wide Association Study, Genotype, Glucose Transporter Type 1/genetics, Glucose Transporter Type 1/metabolism, Humans, International Cooperation, Logistic Models, Male, Single Nucleotide Polymorphism/genetics, Proteins/genetics, Proteins/metabolism
8.
Nucleic Acids Res ; 46(D1): D221-D228, 2018 01 04.
Article in English | MEDLINE | ID: mdl-29126148

ABSTRACT

The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community.
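
The idea of keeping only coding regions that two independent annotations describe identically can be sketched in Python as a set intersection over normalized coordinates. This is a simplified illustration, not the CCDS pipeline: the record layout, the helper names `cds_key` and `identical_cds`, and the coordinates are all hypothetical, and the quality-assurance checks and identifier assignment described in the abstract are omitted.

```python
def cds_key(chrom, strand, exons):
    """Normalize one CDS to a hashable key: chromosome, strand, sorted exon intervals."""
    return (chrom, strand, tuple(sorted(exons)))


def identical_cds(set_a, set_b):
    """Return the CDS structures annotated identically in both annotation sets."""
    keys_a = {cds_key(*record) for record in set_a}
    keys_b = {cds_key(*record) for record in set_b}
    return sorted(keys_a & keys_b)


# Invented coordinates for two annotation sources (assumption: toy data only).
ncbi = [("chr1", "+", [(100, 200), (300, 400)]), ("chr2", "-", [(50, 80)])]
ensembl = [("chr1", "+", [(300, 400), (100, 200)]), ("chr2", "-", [(50, 90)])]

shared = identical_cds(ncbi, ensembl)
print(shared)  # only the chr1 CDS matches exactly; the chr2 models differ by 10 bp
```

Sorting the exon intervals before comparison makes the match independent of the order in which each source lists them, while any difference in a boundary, as in the chr2 example, excludes the region.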


Subjects
Consensus Sequence, Genetic Databases, Open Reading Frames, Animals, Data Curation/methods, Data Curation/standards, Genetic Databases/standards, Guidelines as Topic, Humans, Mice, Molecular Sequence Annotation, National Library of Medicine (U.S.), United States, User-Computer Interface
9.
Nat Rev Genet ; 17(12): 758-772, 2016 12.
Article in English | MEDLINE | ID: mdl-27773922

ABSTRACT

A genome sequence is worthless if it cannot be deciphered; therefore, efforts to describe - or 'annotate' - genes began as soon as DNA sequences became available. Whereas early work focused on individual protein-coding genes, the modern genomic ocean is a complex maelstrom of alternative splicing, non-coding transcription and pseudogenes. Scientists - from clinicians to evolutionary biologists - need to navigate these waters, and this has led to the design of high-throughput, computationally driven annotation projects. The catalogues that are being produced are key resources for genome exploration, especially as they become integrated with expression, epigenomic and variation data sets. Their creation, however, remains challenging.


Subjects
Eukaryota/genetics, Genomics/methods, Molecular Sequence Annotation/methods, DNA Sequence Analysis/methods, Animals, Humans
10.
J Proteome Res ; 15(12): 4686-4695, 2016 12 02.
Article in English | MEDLINE | ID: mdl-27786492

ABSTRACT

Proteogenomics leverages information derived from proteomic data to improve genome annotations. Of particular interest are "novel" peptides that provide direct evidence of protein expression for genomic regions not previously annotated as protein-coding. We present a modular, automated data analysis pipeline aimed at detecting such "novel" peptides in proteomic data sets. This pipeline implements criteria developed by proteomics and genome annotation experts for high-stringency peptide identification and filtering. Our pipeline is based on the OpenMS computational framework; it incorporates multiple database search engines for peptide identification and applies a machine-learning approach (Percolator) to post-process search results. We describe several new and improved software tools that we developed to facilitate proteogenomic analyses that enhance the wealth of tools provided by OpenMS. We demonstrate the application of our pipeline to a human testis tissue data set previously acquired for the Chromosome-Centric Human Proteome Project, which led to the addition of five new gene annotations on the human reference genome.


Subjects
Data Mining/methods, Molecular Sequence Annotation, Proteogenomics/methods, Human Genome, Humans, Machine Learning, Male, Proteomics/methods, Search Engine, Software, Testis
11.
Nat Commun ; 7: 12339, 2016 08 17.
Article in English | MEDLINE | ID: mdl-27531712

ABSTRACT

Long non-coding RNAs (lncRNAs) constitute a large, yet mostly uncharacterized fraction of the mammalian transcriptome. Such characterization requires a comprehensive, high-quality annotation of their gene structure and boundaries, which is currently lacking. Here we describe RACE-Seq, an experimental workflow designed to address this based on RACE (rapid amplification of cDNA ends) and long-read RNA sequencing. We apply RACE-Seq to 398 human lncRNA genes in seven tissues, leading to the discovery of 2,556 on-target, novel transcripts. About 60% of the targeted loci are extended in either 5' or 3', often reaching genomic hallmarks of gene boundaries. Analysis of the novel transcripts suggests that lncRNAs are as long, have as many exons and undergo as much alternative splicing as protein-coding genes, contrary to current assumptions. Overall, we show that RACE-Seq is an effective tool to annotate an organism's deep transcriptome, and compares favourably to other targeted sequencing techniques.


Subjects
High-Throughput Nucleotide Sequencing/methods, Polymerase Chain Reaction/methods, Long Noncoding RNA/genetics, RNA Sequence Analysis/methods, Exons/genetics, Genetic Loci, Humans, Molecular Sequence Annotation, Organ Specificity/genetics, Proof of Concept Study, Protein Isoforms/genetics, Protein Isoforms/metabolism, RNA Splice Sites/genetics, Long Noncoding RNA/metabolism, Messenger RNA/genetics, Messenger RNA/metabolism, Transcriptome/genetics
12.
Mamm Genome ; 26(9-10): 366-78, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26187010

ABSTRACT

Annotation on the reference genome of the C57BL6/J mouse has been an ongoing project ever since the draft genome was first published. Initially, the principal focus was on the identification of all protein-coding genes, although today the importance of describing long non-coding RNAs, small RNAs, and pseudogenes is recognized. Here, we describe the progress of the GENCODE mouse annotation project, which combines manual annotation from the HAVANA group with Ensembl computational annotation, alongside experimental and in silico validation pipelines from other members of the consortium. We discuss the more recent incorporation of next-generation sequencing datasets into this workflow, including the usage of mass-spectrometry data to potentially identify novel protein-coding genes. Finally, we will outline how the C57BL6/J genebuild can be used to gain insights into the variant sites that distinguish different mouse strains and species.


Subjects
Amino Acid Sequence/genetics, Genome, Molecular Sequence Annotation, Pseudogenes/genetics, Animals, Computational Biology/methods, Genomics, High-Throughput Nucleotide Sequencing, Mice, Inbred C57BL Mice, Sequence Alignment
13.
BMC Genomics ; 16 Suppl 8: S2, 2015.
Article in English | MEDLINE | ID: mdl-26110515

ABSTRACT

BACKGROUND: A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation and a wide range of tools are available to this end. McCarthy et al. recently demonstrated the large differences in prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based. RESULTS: We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs and novel exons, and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome. CONCLUSIONS: The reference transcripts selected for variant functional annotation do have a large effect on the outcome. The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.


Subjects
Computational Biology, Human Genome, Molecular Sequence Annotation, Protein Isoforms/metabolism, Software, Alternative Splicing, Genetic Databases, Humans, Protein Isoforms/genetics, Transcriptome
14.
Nucleic Acids Res ; 42(Database issue): D865-72, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24217909

ABSTRACT

The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.


Subjects
Genetic Databases, Proteins/genetics, Animals, Exons, Genomics, Humans, Internet, Mice, Molecular Sequence Annotation, Sequence Analysis
15.
Genome Res ; 23(12): 1961-73, 2013 Dec.
Article in English | MEDLINE | ID: mdl-24172201

ABSTRACT

The last decade has seen tremendous effort committed to the annotation of the human genome sequence, most notably perhaps in the form of the ENCODE project. One of the major findings of ENCODE, and other genome analysis projects, is that the human transcriptome is far larger and more complex than previously thought. This complexity manifests, for example, as alternative splicing within protein-coding genes, as well as in the discovery of thousands of long noncoding RNAs. It is also possible that significant numbers of human transcripts have not yet been described by annotation projects, while existing transcript models are frequently incomplete. The question as to what proportion of this complexity is truly functional remains open, however, and this ambiguity presents a serious challenge to genome scientists. In this article, we will discuss the current state of human transcriptome annotation, drawing on our experience gained in generating the GENCODE gene annotation set. We highlight the gaps in our knowledge of transcript functionality that remain, and consider the potential computational and experimental strategies that can be used to help close them. We propose that an understanding of the true overlap between transcriptional complexity and functionality will not be gained in the short term. However, significant steps toward obtaining this knowledge can now be taken by using an integrated strategy, combining all of the experimental resources at our disposal.


Subjects
Genomics/methods, Molecular Sequence Annotation, Proteins/genetics, Transcriptome, Alternative Splicing, Animals, Genetic Databases, Molecular Evolution, Human Genome, Humans, Proteomics, Long Noncoding RNA, Sequence Alignment
16.
Database (Oxford) ; 2012: bas014, 2012.
Article in English | MEDLINE | ID: mdl-22434846

ABSTRACT

While alternative splicing (AS) can potentially expand the functional repertoire of vertebrate genomes, relatively few AS transcripts have been experimentally characterized. We describe our detailed manual annotation of vertebrate genomes, which is generating a publicly available geneset rich in AS. In order to achieve this we have adopted a highly sensitive approach to annotating gene models supported by correctly mapped, canonically spliced transcriptional evidence combined with a highly cautious approach to adding unsupported extensions to models and making decisions on their functional potential. We use information about the predicted functional potential and structural properties of every AS transcript annotated at a protein-coding or non-coding locus to place them into one of eleven subclasses. We describe the incorporation of new sequencing and proteomics technologies into our annotation pipelines, which are used to identify and validate AS. Combining all data sources has led to the production of a rich geneset containing an average of 6.3 AS transcripts for every human multi-exon protein-coding gene. The datasets produced have proved very useful in providing context to studies investigating the functional potential of genes and the effect variation may have on gene structure and function. DATABASE URL: http://www.ensembl.org/index.html, http://vega.sanger.ac.uk/index.html.


Subjects
Genetic Databases, Genome, Molecular Sequence Annotation/methods, Vertebrates/genetics, Alternative Splicing, Animals, Humans, Mice, Genetic Models, Molecular Sequence Annotation/standards
17.
PLoS One ; 7(1): e28213, 2012.
Article in English | MEDLINE | ID: mdl-22238572

ABSTRACT

The classic organization of a gene structure has followed the Jacob and Monod bacterial gene model proposed more than 50 years ago. Since then, empirical determinations of the complexity of the transcriptomes found from yeast to human have blurred the definition and physical boundaries of genes. Using multiple analysis approaches we have characterized individual gene boundaries mapping on human chromosomes 21 and 22. Analyses of the locations of the 5' and 3' transcriptional termini of 492 protein coding genes revealed that for 85% of these genes the boundaries extend beyond the current annotated termini, most often connecting with exons of transcripts from other well annotated genes. The biological and evolutionary importance of these chimeric transcripts is underscored by (1) the non-random interconnections of genes involved, (2) the greater phylogenetic depth of the genes involved in many chimeric interactions, (3) the coordination of the expression of connected genes and (4) the close in vivo and three dimensional proximity of the genomic regions being transcribed and contributing to parts of the chimeric RNAs. The non-random nature of the connection of the genes involved suggests that chimeric transcripts should not be studied in isolation, but together, as an RNA network.


Subjects
Cells/metabolism, Gene Regulatory Networks/physiology, RNA/physiology, Transcriptome/physiology, Algorithms, Chimerin Proteins/chemistry, Chimerin Proteins/genetics, Human Chromosome Pair 1/genetics, Female, Gene Expression Profiling, Gene Regulatory Networks/genetics, Humans, Male, Microarray Analysis/methods, Biological Models, Nucleic Acid Amplification Techniques/methods, RNA/genetics, RNA Isoforms/chemistry, RNA Isoforms/genetics, RNA Isoforms/metabolism, Genetic Transcription/genetics, Validation Studies as Topic
18.
Mol Biol Evol ; 28(10): 2949-59, 2011 Oct.
Article in English | MEDLINE | ID: mdl-21551269

ABSTRACT

Alternative splicing (AS) has the potential to greatly expand the functional repertoire of mammalian transcriptomes. However, few variant transcripts have been characterized functionally, making it difficult to assess the contribution of AS to the generation of phenotypic complexity and to study the evolution of splicing patterns. We have compared the AS of 309 protein-coding genes in the human ENCODE pilot regions against their mouse orthologs in unprecedented detail, utilizing traditional transcriptomic and RNAseq data. The conservation status of every transcript has been investigated, and each functionally categorized as coding (separated into coding sequence [CDS] or nonsense-mediated decay [NMD] linked) or noncoding. In total, 36.7% of human and 19.3% of mouse coding transcripts are species specific, and we observe a 3.6 times excess of human NMD transcripts compared with mouse; in contrast to previous studies, the majority of species-specific AS is unlinked to transposable elements. We observe one conserved CDS variant and one conserved NMD variant per 2.3 and 11.4 genes, respectively. Subsequently, we identify and characterize equivalent AS patterns for 22.9% of these CDS or NMD-linked events in nonmammalian vertebrate genomes, and our data indicate that functional NMD-linked AS is more widespread and ancient than previously thought. Furthermore, although we observe an association between conserved AS and elevated sequence conservation, as previously reported, we emphasize that 30% of conserved AS exons display sequence conservation below the average score for constitutive exons. In conclusion, we demonstrate the value of detailed comparative annotation in generating a comprehensive set of AS transcripts, increasing our understanding of AS evolution in vertebrates. Our data supports a model whereby the acquisition of functional AS has occurred throughout vertebrate evolution and is considered alongside amino acid change as a key mechanism in gene evolution.


Subjects
Alternative Splicing, Molecular Evolution, Genome/genetics, Animals, Conserved Sequence, Genetic Databases, Humans, Mice, Reproducibility of Results, Transcriptome
19.
Genome Biol ; 9(5): R91, 2008.
Article in English | MEDLINE | ID: mdl-18507838

ABSTRACT

BACKGROUND: The major urinary proteins (MUPs) of Mus musculus domesticus are deposited in urine in large quantities, where they bind and release pheromones and also provide an individual 'recognition signal' via their phenotypic polymorphism. Whilst important information about MUP functionality has been gained in recent years, the gene cluster is poorly studied in terms of structure, genic polymorphism and evolution. RESULTS: We combine targeted sequencing, manual genome annotation and phylogenetic analysis to compare the Mup clusters of C57BL/6J and 129 strains of mice. We describe organizational heterogeneity within both clusters: a central array of cassettes containing Mup genes highly similar at the protein level, flanked by regions containing Mup genes displaying significantly elevated divergence. Observed genomic rearrangements in all regions have likely been mediated by endogenous retroviral elements. Mup loci with coding sequences that differ between the strains are identified--including a gene/pseudogene pair--suggesting that these inbred lineages exhibit variation that exists in wild populations. We have characterized the distinct MUP profiles in the urine of both strains by mass spectrometry. The total MUP phenotype data is reconciled with our genomic sequence data, matching all proteins identified in urine to annotated genes. CONCLUSION: Our observations indicate that the MUP phenotypic polymorphism observed in wild populations results from a combination of Mup gene turnover coupled with currently unidentified mechanisms regulating gene expression patterns. We propose that the structural heterogeneity described within the cluster reflects functional divergence within the Mup gene family.


Subjects
Mice/genetics, Proteins/genetics, Animals, Molecular Evolution, Female, Male, Mass Spectrometry, Inbred C57BL Mice, Inbred Strains of Mice, Molecular Weight, Proteins/chemistry, Species Specificity
20.
Genome Res ; 13(9): 2059-68, 2003 Sep.
Article in English | MEDLINE | ID: mdl-12915487

ABSTRACT

The existence of latent centromeres has been proposed as a possible explanation for the ectopic emergence of neocentromeres in humans. This hypothesis predicts an association between the position of neocentromeres and the position of ancient centromeres inactivated during karyotypic evolution. Human chromosomal region 15q24-26 is one of several hotspots where multiple cases of neocentromere emergence have been reported, and it harbors a high density of chromosome-specific duplicons, rearrangements of which have been implicated as a susceptibility factor for panic and phobic disorders with joint laxity. We investigated the evolutionary history of this region in primates and found that it contains the site of an ancestral centromere which became inactivated about 25 million years ago, after great apes/Old World monkeys diverged. This inactivation has followed a noncentromeric chromosomal fission of an ancestral chromosome which gave rise to phylogenetic chromosomes XIV and XV in human and great apes. Detailed mapping of the ancient centromere and two neocentromeres in 15q24-26 has established that the neocentromere domains map approximately 8 Mb proximal and 1.5 Mb distal of the ancestral centromeric region, but that all three map within 500 kb of duplicons, copies of which flank the centromere in Old World Monkey species. This suggests that the association between neocentromere and ancestral centromere position on this chromosome may be due to the persistence of recombinogenic duplications accrued within the ancient pericentromere, rather than the retention of "centromere-competent" sequences per se. The high frequency of neocentromere emergence in the 15q24-26 region and the high density of clinically important duplicons are, therefore, understandable in the light of the evolutionary history of this region.


Subjects
Centromere/genetics, Human Chromosome Pair 15/genetics, Molecular Evolution, Gene Duplication, Animals, Cercopithecidae/genetics, Chromosome Inversion, Human Chromosome Pair 14/genetics, Gene Rearrangement/genetics, Genetic Markers/genetics, Human Genome, Humans, Physical Chromosome Mapping, Genetic Recombination/genetics