ABSTRACT
Regulation of transcript structure generates transcript diversity and plays an important role in human disease (refs. 1-7). The advent of long-read sequencing technologies offers the opportunity to study the role of genetic variation in transcript structure (refs. 8-16). In this Article, we present a large human long-read RNA-seq dataset generated on the Oxford Nanopore Technologies platform from 88 samples of Genotype-Tissue Expression (GTEx) tissues and cell lines, complementing the GTEx resource. We identified just over 70,000 novel transcripts for annotated genes, and validated the protein expression of 10% of novel transcripts. We developed a new computational package, LORALS, to analyse the genetic effects of rare and common variants on the transcriptome by allele-specific analysis of long reads. We characterized allele-specific expression and transcript structure events, providing new insights into the specific transcript alterations caused by common and rare genetic variants and highlighting the resolution gained from long-read data. We were able to perturb the transcript structure upon knockdown of PTBP1, an RNA binding protein that mediates splicing, thereby finding genetic regulatory effects that are modified by the cellular environment. Finally, we used this dataset to enhance variant interpretation and study rare variants leading to aberrant splicing patterns.
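As an illustration of what "allele-specific analysis" can mean at a single heterozygous site, the sketch below tests read counts against an even 50:50 allelic ratio with an exact binomial test. This is a generic textbook approach, not the method implemented in LORALS, which the abstract does not describe; the function name and the read counts are hypothetical.

```python
from math import comb


def allelic_imbalance_p(ref_reads: int, alt_reads: int) -> float:
    """Two-sided exact binomial p-value against a 50:50 allelic ratio."""
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("no reads cover this site")
    k = min(ref_reads, alt_reads)
    # Probability of seeing k or fewer reads from the less-covered allele.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # The null distribution is symmetric at p = 0.5, so double the tail.
    return min(1.0, 2 * tail)


# Hypothetical read counts at one heterozygous site (not from the paper).
print(allelic_imbalance_p(10, 0))  # strongly imbalanced
print(allelic_imbalance_p(5, 5))   # perfectly balanced
```

In practice such a test would be run per site or per transcript and followed by multiple-testing correction, which this sketch leaves out.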
Subjects
Alleles , Gene Expression Profiling , Organ Specificity , RNA-Seq , Transcriptome , Alternative Splicing/genetics , Cell Line , Datasets as Topic , Genotype , Heterogeneous-Nuclear Ribonucleoproteins/deficiency , Heterogeneous-Nuclear Ribonucleoproteins/genetics , Humans , Organ Specificity/genetics , Polypyrimidine Tract-Binding Protein/deficiency , Polypyrimidine Tract-Binding Protein/genetics , Reproducibility of Results , Transcriptome/genetics
ABSTRACT
BACKGROUND: The circum-basmati group of cultivated Asian rice (Oryza sativa) contains many iconic varieties and is widespread in the Indian subcontinent. Despite its economic and cultural importance, a high-quality reference genome is currently lacking, and the group's evolutionary history is not fully resolved. To address these gaps, we use long-read nanopore sequencing and assemble the genomes of two circum-basmati rice varieties. RESULTS: We generate two high-quality, chromosome-level reference genomes that represent the 12 chromosomes of Oryza. The assemblies show a contig N50 of 6.32 Mb and 10.53 Mb for Basmati 334 and Dom Sufid, respectively. Using our highly contiguous assemblies, we characterize structural variations segregating across circum-basmati genomes. We discover repeat expansions not observed in japonica (the rice group most closely related to circum-basmati), as well as presence and absence variants of over 20 Mb, one of which is a circum-basmati-specific deletion of a gene regulating awn length. We further detect strong evidence of admixture between the circum-basmati and circum-aus groups. This gene flow has its greatest effect on chromosome 10, causing both structural variation and single-nucleotide polymorphism to deviate from genome-wide history. Lastly, population genomic analysis of 78 circum-basmati varieties shows three major geographically structured genetic groups: Bhutan/Nepal, India/Bangladesh/Myanmar, and Iran/Pakistan. CONCLUSION: The availability of high-quality reference genomes allows functional and evolutionary genomic analyses that provide genome-wide evidence for gene flow between circum-aus and circum-basmati, describe the nature of circum-basmati structural variation, and reveal the presence/absence variation in this important and iconic rice variety group.
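The contig N50 quoted above is the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the total assembly length. A minimal sketch of that standard definition follows; the toy contig lengths are invented for illustration and have nothing to do with the Basmati 334 or Dom Sufid assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example: total length 30, so the halfway point is 15.
# 10 alone is not enough; 10 + 8 = 18 passes it, so the N50 is 8.
print(n50([10, 8, 6, 4, 2]))
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is commonly reported as a contiguity measure.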
Subjects
Nanopore Sequencing/methods , Oryza/genetics , Whole Genome Sequencing/methods , Plant Chromosomes/genetics , Contig Mapping/methods , Molecular Evolution , Plant Genome , Oryza/classification , Phylogeny
ABSTRACT
The MinION sequencer has made in situ sequencing feasible in remote locations. Following our initial demonstration of its high performance off planet with Earth-prepared samples, we developed and tested an end-to-end, sample-to-sequencer process that could be conducted entirely aboard the International Space Station (ISS). Initial experiments demonstrated the process with a microbial mock community standard. The DNA was successfully amplified, primers were degraded, and libraries prepared and sequenced. The median percent identities for both datasets were 84%, as assessed from alignment of the mock community. The ability to correctly identify the organisms in the mock community standard was comparable for the sequencing data obtained in flight and on the ground. To validate the process on microbes collected from and cultured aboard the ISS, bacterial cells were selected from a NASA Environmental Health Systems Surface Sample Kit contact slide. The locations of bacterial colonies chosen for identification were labeled, and a small number of cells were directly added as input into the sequencing workflow. Prepared DNA was sequenced, and the data were downlinked to Earth. Return of the contact slide to the ground allowed for standard laboratory processing for bacterial identification. The identifications obtained aboard the ISS, Staphylococcus hominis and Staphylococcus capitis, matched those determined on the ground down to the species level. This marks the first ever identification of microbes entirely off Earth, and this validated process could be used for in-flight microbial identification, diagnosis of infectious disease in a crewmember, and as a research platform for investigators around the world.
Subjects
Nanopore Sequencing/methods , 16S Ribosomal RNA/genetics , Specimen Handling/methods , Bacteria/genetics , Bacterial DNA/genetics , Ribosomal DNA/genetics , Exobiology/methods , Extraterrestrial Environment , Bacterial Genome/genetics , Microbiota/genetics , Nanopores , DNA Sequence Analysis/methods , Spacecraft/instrumentation
ABSTRACT
Segmented filamentous bacteria (SFB) are host-specific intestinal symbionts that comprise a distinct clade within the Clostridiaceae, designated Candidatus Arthromitus. SFB display a unique life cycle within the host, involving differentiation into multiple cell types. The latter include filaments that attach intimately to intestinal epithelial cells, and from which "holdfasts" and spores develop. SFB induce a multifaceted immune response, leading to host protection from intestinal pathogens. Cultivation resistance has hindered characterization of these enigmatic bacteria. In the present study, we isolated five SFB filaments from a mouse using a microfluidic device equipped with laser tweezers, generated genome sequences from each, and compared these sequences with each other, as well as to recently published SFB genome sequences. Based on the resulting analyses, SFB appear to be dependent on the host for a variety of essential nutrients. SFB have a relatively high abundance of predicted proteins devoted to cell cycle control and to envelope biogenesis, and have a group of SFB-specific autolysins and a dynamin-like protein. Among the five filament genomes, an average of 8.6% of predicted proteins were novel, including a family of secreted SFB-specific proteins. Four ADP-ribosyltransferase (ADPRT) sequence types, and a myosin-cross-reactive antigen (MCRA) protein were discovered; we hypothesize that they are involved in modulation of host responses. The presence of polymorphisms among mouse SFB genomes suggests the evolution of distinct SFB lineages. Overall, our results reveal several aspects of SFB adaptation to the mammalian intestinal tract.
Subjects
Bacterial Proteins/genetics , Bacterial Genome , Gram-Positive Endospore-Forming Bacteria/physiology , Intestines/microbiology , Single-Cell Analysis/methods , ADP Ribose Transferases/genetics , ADP Ribose Transferases/metabolism , Physiological Adaptation , Amino Acid Sequence , Animals , Bacterial Proteins/metabolism , Cell Differentiation/genetics , Ribosomal DNA , Epithelial Cells/microbiology , Gram-Positive Endospore-Forming Bacteria/genetics , Mice , Microfluidic Analytical Techniques , Molecular Sequence Data , Phylogeny , Genetic Polymorphism , DNA Sequence Analysis
ABSTRACT
SUMMARY: SmashCommunity is a stand-alone metagenomic annotation and analysis pipeline suitable for data from Sanger and 454 sequencing technologies. It supports state-of-the-art software for essential metagenomic tasks such as assembly and gene prediction. It provides tools to estimate the quantitative phylogenetic and functional compositions of metagenomes, to compare compositions of multiple metagenomes and to produce intuitive visual representations of such analyses. AVAILABILITY: SmashCommunity source code and documentation are available at http://www.bork.embl.de/software/smash. CONTACT: bork@embl.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Metagenomics/methods , Software , Genes , Molecular Sequence Annotation , Phylogeny , DNA Sequence Analysis
ABSTRACT
SUMMARY: Recent advances in single-cell manipulation technology, whole genome amplification and high-throughput sequencing have now made it possible to sequence the genome of an individual cell. The bioinformatic analysis of these genomes, however, is far more complicated than the analysis of those generated using traditional, culture-based methods. In order to simplify this analysis, we have developed SmashCell (Simple Metagenomics Analysis SHell, for sequences from single Cells). It is designed to automate the main steps in microbial genome analysis (assembly, gene prediction, functional annotation) in a way that allows parameter and algorithm exploration at each step in the process. It also manages the data created by these analyses and provides visualization methods for rapid analysis of the results. AVAILABILITY: The SmashCell source code and a comprehensive manual are available at http://asiago.stanford.edu/SmashCell. CONTACT: eoghanh@stanford.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Genomics/methods , Software , Algorithms , Chromosome Mapping/methods , Genome , Nucleic Acid Amplification Techniques , DNA Sequence Analysis/methods , Single-Cell Analysis
ABSTRACT
Cis-acting short sequence motifs play important roles in alternative splicing. It is now possible to identify such sequence motifs as conserved sequence patterns in genome sequence alignments. Here, we report a systematic search for motifs in the neighboring introns of alternatively spliced exons by using comparative analysis of mammalian genome alignments. We identified 11 conserved sequence motifs that might be involved in the regulation of alternative splicing. These motifs are not only significantly overrepresented near alternatively spliced exons, but they also co-occur with each other, thus forming a network of cis-elements that is likely to be the basis for context-dependent regulation. Based on this finding, we applied motif co-occurrence to predict alternatively skipped exons. We verified exon skipping in 29 cases out of 118 predictions (25%) by EST and mRNA sequences in the databases. For the predictions not verified by the database sequences, we confirmed exon skipping in 10 additional cases by using both RT-PCR experiments and publicly available RNA-Seq data. These results indicate that even more alternative splicing events will be found with the progress of large-scale and high-throughput analyses for various tissue samples and developmental stages.
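To make the idea of motif co-occurrence concrete, the sketch below counts how often pairs of motifs appear together in the same intron sequence, and then reproduces the validation percentages from the counts given above. The motif strings and intron sequences are hypothetical placeholders, not the 11 motifs reported in the study, and plain substring matching stands in for whatever scoring the authors actually used.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(introns, motifs):
    """Count, for each motif pair, the introns that contain both motifs."""
    pair_counts = Counter()
    for seq in introns:
        present = sorted(m for m in motifs if m in seq)
        pair_counts.update(combinations(present, 2))
    return pair_counts


# Hypothetical motifs and flanking-intron sequences (illustration only).
motifs = ["TGCATG", "CTCTCT"]
introns = [
    "AATGCATGTTCTCTCTAA",  # both motifs
    "GGTGCATGGG",          # first motif only
    "CCCTCTCTCC",          # second motif only
    "ATGCATGACTCTCTG",     # both motifs
]
print(cooccurrence(introns, motifs))

# Validation rates from the counts reported in the abstract.
predictions = 118
verified_by_databases = 29
verified_additionally = 10
print(round(100 * verified_by_databases / predictions))
print(round(100 * (verified_by_databases + verified_additionally) / predictions))
```

With the toy input, the single motif pair is counted twice, and the two rates come out at 25% (database evidence alone) and 33% (after adding the 10 RT-PCR and RNA-Seq confirmations).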
Subjects
Alternative Splicing , Introns , Ribonucleic Acid Regulatory Sequences , Animals , Base Sequence , Conserved Sequence , Exons , Genomics , Humans , Molecular Sequence Data , Sequence Alignment
ABSTRACT
Sircah is a flexible tool for the detection, analysis and visualization of alternative transcripts. It takes as input gene models or spliced alignments and creates a database of alternative transcription events: alternative transcription initiation and polyadenylation, alternative 3' and 5' splice-site usage, skipped exons and retained introns. The results can be visualized in a variety of ways, allowing the creation of publication-quality images. AVAILABILITY: Sircah is available for download under a Creative Commons license, along with additional documentation and a tutorial, from http://www.bork.embl.de/Sircah.
Subjects
Algorithms , Computer Graphics , RNA Splice Sites/genetics , DNA Sequence Analysis/methods , Software , Transcription Factors/genetics , User-Computer Interface , Base Sequence , Molecular Sequence Data
ABSTRACT
BACKGROUND: Environments and their organic content are generally not static and isolated, but in a constant state of exchange and interaction with each other. Through physical or biological processes, organisms, especially microbes, may be transferred between environments whose characteristics may be quite different. The transferred microbes may not survive in their new environment, but their DNA will be deposited. In this study, we compare two environmental sequencing projects to find molecular evidence of transfer of microbes over vast geographical distances. METHODOLOGY: By studying synonymous nucleotide composition, oligomer frequency and orthology between predicted genes in metagenomic data from two environments, terrestrial and aquatic, and by correlating with phylogenetic mappings, we find that both environments are likely to contain trace amounts of microbes that have been far removed from their original habitat. We also suggest a bias in direction from soil to sea, which is consistent with the cycles of planetary wind and water. CONCLUSIONS: Our findings support the Baas-Becking hypothesis formulated in 1934, which states that due to dispersion and population sizes, microbes are likely to be found in widely disparate environments. Furthermore, the availability of genetic material from distant environments is a possible source of novel gene functions for lateral gene transfer.
Subjects
Environment , Bacterial Genes , Ecology , Ecosystem , Horizontal Gene Transfer , Phylogeny , Water Microbiology
ABSTRACT
BACKGROUND: Across the fully sequenced microbial genomes there are thousands of examples of overlapping genes. Many of these overlaps are only a few nucleotides long and are thought to function by permitting the coordinated regulation of gene expression. However, there should also be selective pressure against long overlaps, as the existence of overlapping reading frames increases the risk of deleterious mutations. Here we examine the longest overlaps and assess whether they are the product of special functional constraints or of erroneous annotation. RESULTS: We analysed the genes that overlap by 60 bp or more among 338 fully sequenced prokaryotic genomes. The likely functional significance of an overlap was determined by comparing each of the genes to its respective orthologs. If a gene showed a significantly different length from its orthologs, it was considered unlikely to be functional and therefore the result of an error either in sequencing or in gene prediction. Focusing on 715 co-directional overlaps longer than 60 bp, we classified the erroneous ones into five categories: i) 5'-end extension of the downstream gene due to either a mispredicted start codon or a frameshift at the 5'-end of the gene (409 overlaps), ii) fragmentation of a gene caused by a frameshift (163), iii) 3'-end extension of the upstream gene due to either a frameshift at the 3'-end of a gene or a point mutation at the stop codon (68), iv) redundant gene predictions (4), v) 5'- and 3'-end extension, which is a combination of i) and iii) (71). We also studied 75 divergent overlaps that could be classified as misannotations of group i). Nevertheless, we found some convergent long overlaps (54) that might be true overlaps, although a substantial share of convergent overlaps could be classified as group iii) (124).
CONCLUSION: Among the 968 overlaps larger than 60 bp that we analysed, we did not find a single real one among the co-directional and divergent orientations and concluded that there had been an excessive number of misannotations. Only the convergent orientation seems to permit some long overlaps, although convergent overlaps are also hampered by misannotations. We propose a simple rule to flag these erroneous gene length predictions to facilitate automatic annotation.
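The screening step described above (find gene pairs that overlap by at least 60 bp and sort them by relative orientation) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' pipeline: the `Gene` record, the 1-based inclusive coordinates, and the restriction to neighbouring genes are all choices made for this example. The last lines check that the category counts reported in the abstract add up to the stated totals.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # inclusive
    strand: str  # "+" or "-"


def overlap_length(a: Gene, b: Gene) -> int:
    """Number of shared positions between two genes (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def orientation(a: Gene, b: Gene) -> str:
    """Classify a gene pair by the strands of the left and right gene."""
    first, second = sorted((a, b), key=lambda g: g.start)
    if first.strand == second.strand:
        return "co-directional"
    # "+" then "-" means the 3' ends face each other (convergent).
    return "convergent" if first.strand == "+" else "divergent"


def long_overlaps(genes, min_overlap=60):
    """Report neighbouring gene pairs that overlap by min_overlap or more."""
    ordered = sorted(genes, key=lambda g: g.start)
    hits = []
    for a, b in zip(ordered, ordered[1:]):
        n = overlap_length(a, b)
        if n >= min_overlap:
            hits.append((a, b, n, orientation(a, b)))
    return hits


# Hypothetical gene coordinates on one replicon.
genes = [Gene(1, 100, "+"), Gene(41, 200, "+"), Gene(190, 300, "-")]
for hit in long_overlaps(genes):
    print(hit)

# Consistency check on the counts reported in the abstract.
co_directional = 409 + 163 + 68 + 4 + 71       # categories i) to v)
total = co_directional + 75 + 54 + 124         # divergent + convergent
print(co_directional, total)
```

Only the first pair in the toy input is reported (a 60 bp co-directional overlap); the second pair overlaps by 11 bp and falls below the threshold. The tallies come to 715 co-directional overlaps and 968 overlaps overall, matching the figures in the text.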
Subjects
Gene Homology/genetics , Genome , Prokaryotic Cells/metabolism , Amino Acid Sequence , Base Sequence , Initiator Codon , Terminator Codon , Factual Databases , Molecular Evolution , Frameshift Mutation , Molecular Sequence Data , Open Reading Frames , Amino Acid Sequence Homology
ABSTRACT
Continuing improvements in DNA sequencing technologies are providing us with vast amounts of genomic data from an ever-widening range of organisms. The resulting challenge for bioinformatics is to interpret this deluge of data and place it back into its biological context. Biological networks provide a conceptual framework with which we can describe part of this context, namely the different interactions that occur between the molecular components of a cell. Here, we review the computational methods available to predict biological networks from genomic sequence data and discuss how they relate to high-throughput experimental methods.
Subjects
Genomics , Computational Biology , Phylogeny
ABSTRACT
We suggest an annotation strategy for genes encoded by retroviruses and transposable elements (RETRA genes) based on a set of marker protein domains. Usually RETRA genes are masked in vertebrate genomes prior to the application of automated gene prediction pipelines under the assumption that they provide no selective advantage to the host. Yet, we show that about 1000 genes in the four vertebrate gene sets analyzed contain at least one RETRA gene marker domain. Using the conservation of genomic neighborhood (synteny), we were able to discriminate between RETRA genes with putative functionality in the vertebrates and those that probably function only in the context of mobile elements. We identified 35 such genes in human, along with their corresponding mouse and rat orthologs, which included almost all known human genes with similarity to mobile elements. The results also imply that the vast majority of the remaining RETRA genes in current gene sets are unlikely to encode vertebrate functions. To automatically annotate RETRA genes in other vertebrate genomes, we provide as a tool a set of marker protein domains and a manually refined list of domesticated or ancestral RETRA genes for rescuing genes with vertebrate functions.
Subjects
DNA Transposable Elements , Endogenous Retroviruses/genetics , Proteins/genetics , Retroelements , Animals , Computational Biology , Genetic Code , Genetic Markers , Genomics , Humans , Mice , Protein Tertiary Structure , Rats , Synteny , Takifugu/genetics
ABSTRACT
Orthologous genes that maintain a single-copy status in a broad range of species may indicate a selection against gene duplication. If this is the case, then duplicates of such genes that do survive may have escaped the dosage control by rapid and sizable changes in their function. To test this hypothesis and to develop a strategy for the identification of novel gene functions, we have analyzed 22 primate-specific intrachromosomal duplications of genes with a single-copy ortholog in all other completely sequenced metazoans. When comparing this set to genes not exposed to the single-copy status constraint, we observed a higher tendency of the former to modify their gene structure, often through complex genomic rearrangements. The analysis of the most dramatic of these duplications, affecting approximately 10% of human Chromosome 2, enabled a detailed reconstruction of the events leading to the appearance of a novel gene family. The eight members of this family originated from the highly conserved nucleoporin RanBP2 by several genetic rearrangements such as segmental duplications, inversions, translocations, exon loss, and domain accretion. We have experimentally verified that at least one of the newly formed proteins has a cellular localization different from RanBP2's, and we show that positive selection did act on specific domains during evolution.