1.
Nature ; 517(7536): 608-11, 2015 Jan 29.
Article in English | MEDLINE | ID: mdl-25383537

ABSTRACT

The human genome is arguably the most complete mammalian reference assembly, yet more than 160 euchromatic gaps remain and aspects of its structural variation remain poorly understood ten years after its completion. To identify missing sequence and genetic variation, here we sequence and analyse a haploid human genome (CHM1) using single-molecule, real-time DNA sequencing. We close or extend 55% of the remaining interstitial gaps in the human GRCh37 reference genome, 78% of which carried long runs of degenerate short tandem repeats, often several kilobases in length, embedded within (G+C)-rich genomic regions. We resolve the complete sequence of 26,079 euchromatic structural variants at the base-pair level, including inversions, complex insertions and long tracts of tandem repeats. Most have not been previously reported, with the greatest increases in sensitivity occurring for events less than 5 kilobases in size. Compared to the human reference, we find a significant insertional bias (3:1) in regions corresponding to complex insertions and long short tandem repeats. Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read sequencing technology.


Subject(s)
Genetic Variation/genetics; Genome, Human/genetics; Genomics; Sequence Analysis, DNA/methods; Chromosome Inversion/genetics; Chromosomes, Human, Pair 10/genetics; Cloning, Molecular; GC Rich Sequence/genetics; Haploidy; Humans; Mutagenesis, Insertional/genetics; Reference Standards; Tandem Repeat Sequences/genetics
2.
Genome Res ; 25(11): 1771-80, 2015 Nov.
Article in English | MEDLINE | ID: mdl-26294686

ABSTRACT

Alternative splicing is regulated by RNA binding proteins (RBPs) that recognize pre-mRNA sequence elements and activate or repress adjacent exons. Here, we used RNA interference and RNA-seq to identify splicing events regulated by 56 Drosophila proteins, some previously unknown to regulate splicing. Nearly all proteins affected alternative first exons, suggesting that RBPs play important roles in first exon choice. Half of the splicing events were regulated by multiple proteins, demonstrating extensive combinatorial regulation. We observed that SR and hnRNP proteins tend to act coordinately with each other, not antagonistically. We also identified a cross-regulatory network where splicing regulators affected the splicing of pre-mRNAs encoding other splicing regulators. This large-scale study substantially enhances our understanding of recent models of splicing regulation and provides a resource of thousands of exons that are regulated by 56 diverse RBPs.


Subject(s)
Alternative Splicing; Drosophila Proteins/genetics; Drosophila/genetics; RNA-Binding Proteins/genetics; TATA-Binding Protein Associated Factors/genetics; Animals; Drosophila Proteins/metabolism; Exons; Heterogeneous-Nuclear Ribonucleoproteins/genetics; Heterogeneous-Nuclear Ribonucleoproteins/metabolism; RNA Interference; RNA Precursors/genetics; RNA Precursors/metabolism; RNA-Binding Proteins/metabolism; Sequence Analysis, RNA; TATA-Binding Protein Associated Factors/metabolism
3.
Nature ; 471(7339): 473-9, 2011 Mar 24.
Article in English | MEDLINE | ID: mdl-21179090

ABSTRACT

Drosophila melanogaster is one of the most well studied genetic model organisms; nonetheless, its genome still contains unannotated coding and non-coding genes, transcripts, exons and RNA editing sites. Full discovery and annotation are pre-requisites for understanding how the regulation of transcription, splicing and RNA editing directs the development of this complex organism. Here we used RNA-Seq, tiling microarrays and cDNA sequencing to explore the transcriptome in 30 distinct developmental stages. We identified 111,195 new elements, including thousands of genes, coding and non-coding transcripts, exons, splicing and editing events, and inferred protein isoforms that previously eluded discovery using established experimental, prediction and conservation-based approaches. These data substantially expand the number of known transcribed elements in the Drosophila genome and provide a high-resolution view of transcriptome dynamics throughout development.


Subject(s)
Drosophila melanogaster/growth & development; Drosophila melanogaster/genetics; Gene Expression Profiling; Gene Expression Regulation, Developmental/genetics; Transcription, Genetic/genetics; Alternative Splicing/genetics; Animals; Base Sequence; Drosophila Proteins/genetics; Drosophila melanogaster/embryology; Exons/genetics; Female; Genes, Insect/genetics; Genome, Insect/genetics; Male; MicroRNAs/genetics; Oligonucleotide Array Sequence Analysis; Protein Isoforms/genetics; RNA Editing/genetics; RNA, Messenger/analysis; RNA, Messenger/genetics; RNA, Small Untranslated/analysis; RNA, Small Untranslated/genetics; Sequence Analysis; Sex Characteristics
4.
Genome Res ; 21(2): 182-92, 2011 Feb.
Article in English | MEDLINE | ID: mdl-21177961

ABSTRACT

Core promoters are critical regions for gene regulation in higher eukaryotes. However, the boundaries of promoter regions, the relative rates of initiation at the transcription start sites (TSSs) distributed within them, and the functional significance of promoter architecture remain poorly understood. We produced a high-resolution map of promoters active in the Drosophila melanogaster embryo by integrating data from three independent and complementary methods: 21 million cap analysis of gene expression (CAGE) tags, 1.2 million RNA ligase mediated rapid amplification of cDNA ends (RLM-RACE) reads, and 50,000 cap-trapped expressed sequence tags (ESTs). We defined 12,454 promoters of 8037 genes. Our analysis indicates that, due to non-promoter-associated RNA background signal, previous studies have likely overestimated the number of promoter-associated CAGE clusters by fivefold. We show that TSS distributions form a complex continuum of shapes, and that promoters active in the embryo and adult have highly similar shapes in 95% of cases. This suggests that these distributions are generally determined by static elements such as local DNA sequence and are not modulated by dynamic signals such as histone modifications. Transcription factor binding motifs are differentially enriched as a function of promoter shape, and peaked promoter shape is correlated with both temporal and spatial regulation of gene expression. Our results contribute to the emerging view that core promoters are functionally diverse and control patterning of gene expression in Drosophila and mammals.


Subject(s)
Computational Biology; Drosophila melanogaster/genetics; Genome, Insect/genetics; Promoter Regions, Genetic; 3' Untranslated Regions/genetics; Animals; Chromosome Mapping; Drosophila melanogaster/embryology; Expressed Sequence Tags; Gene Expression Profiling; Gene Expression Regulation/genetics; Genome-Wide Association Study; Transcription Initiation Site
5.
Genome Res ; 21(2): 301-14, 2011 Feb.
Article in English | MEDLINE | ID: mdl-21177962

ABSTRACT

Drosophila melanogaster cell lines are important resources for cell biologists. Here, we catalog the expression of exons, genes, and unannotated transcriptional signals for 25 lines. Unannotated transcription is substantial (typically 19% of euchromatic signal). Conservatively, we identify 1405 novel transcribed regions; 684 of these appear to be new exons of neighboring, often distant, genes. Sixty-four percent of genes are expressed detectably in at least one line, but only 21% are detected in all lines. Each cell line expresses, on average, 5885 genes, including a common set of 3109. Expression levels vary over several orders of magnitude. Major signaling pathways are well represented: most differentiation pathways are "off" and survival/growth pathways "on." Roughly 50% of the genes expressed by each line are not part of the common set, and these show considerable individuality. Thirty-one percent are expressed at a higher level in at least one cell line than in any single developmental stage, suggesting that each line is enriched for genes characteristic of small sets of cells. Most remarkable is that imaginal disc-derived lines can generally be assigned, on the basis of expression, to small territories within developing discs. These mappings reveal unexpected stability of even fine-grained spatial determination. No two cell lines show identical transcription factor expression. We conclude that each line has retained features of an individual founder cell superimposed on a common "cell line" gene expression pattern.


Subject(s)
Drosophila melanogaster/genetics; Genetic Variation; Transcription, Genetic; Animals; Cell Line; Cluster Analysis; Exons; Female; Gene Expression Profiling; Male; Molecular Sequence Data; Signal Transduction/genetics; Transcription Factors/genetics
6.
Genome Res ; 20(7): 890-8, 2010 Jul.
Article in English | MEDLINE | ID: mdl-20501695

ABSTRACT

Promoters are important regulatory elements that contain the necessary sequence features for cells to initiate transcription. To functionally characterize a large set of human promoters, we measured the transcriptional activities of 4575 putative promoters across eight cell lines using transient transfection reporter assays. In parallel, we measured gene expression in the same cell lines and observed a significant correlation between promoter activity and endogenous gene expression (r = 0.43). As transient transfection assays directly measure the promoting effect of a defined fragment of DNA sequence, decoupled from epigenetic, chromatin, or long-range regulatory effects, we sought to predict whether a promoter was active using sequence features alone. CG dinucleotide content was highly predictive of ubiquitous promoter activity, necessitating the separation of promoters into two groups: high CG promoters, mostly ubiquitously active, and low CG promoters, mostly cell line-specific. Computational models trained on the binding potential of transcriptional factor (TF) binding motifs could predict promoter activities in both high and low CG groups: average area under the receiver operating characteristic curve (AUC) of the models was 91% and exceeded the AUC of CG content by an average of 23%. Known relationships, for example, between HNF4A and hepatocytes, were recapitulated in the corresponding cell lines, in this case the liver-derived cell line HepG2. Half of the associations between tissue-specific TFs and cell line-specific promoters were new. Our study underscores the importance of collecting functional information from complementary assays and conditions to understand biology in a systematic framework.


Subject(s)
Base Sequence/physiology; Organ Specificity/genetics; Promoter Regions, Genetic/genetics; Promoter Regions, Genetic/physiology; Base Composition/physiology; Binding Sites/genetics; Cell Line; Computational Biology/methods; Epigenesis, Genetic/physiology; Gene Expression/genetics; Gene Expression/physiology; Hep G2 Cells; Hepatocyte Nuclear Factor 4/genetics; Humans; Protein Binding; Transcription, Genetic; Transfection
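
The abstract of entry 6 reports model quality as the area under the receiver operating characteristic curve (AUC). As a reading aid, here is a minimal sketch of how an AUC can be computed from binary activity labels and model scores, using the pairwise-ranking definition. The function name `roc_auc` and the toy inputs are illustrative assumptions; this is not the authors' model or evaluation code.

```python
def roc_auc(labels, scores):
    """Estimate AUC as the probability that a randomly chosen active
    promoter (label 1) scores higher than a randomly chosen inactive
    one (label 0). Tied scores count as half a win.

    Hypothetical helper for illustration only.
    """
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data: 1 = promoter active, 0 = inactive; scores are made up.
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.3, 0.1]
    print(roc_auc(labels, scores))
```

The pairwise form is quadratic in the number of examples, which is fine for a few thousand promoters; a sort-based rank formula would scale better for larger sets.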
7.
Sci Data ; 7(1): 399, 2020 Nov 17.
Article in English | MEDLINE | ID: mdl-33203859

ABSTRACT

The PacBio® HiFi sequencing method yields highly accurate long-read sequencing datasets with read lengths averaging 10-25 kb and accuracies greater than 99.5%. These accurate long reads can be used to improve results for complex applications such as single nucleotide and structural variant detection, genome assembly, assembly of difficult polyploid or highly repetitive genomes, and assembly of metagenomes. Currently, there is a need for sample data sets to both evaluate the benefits of these long accurate reads as well as for development of bioinformatic tools including genome assemblers, variant callers, and haplotyping algorithms. We present deep coverage HiFi datasets for five complex samples including the two inbred model genomes Mus musculus and Zea mays, as well as two complex genomes, octoploid Fragaria × ananassa and the diploid anuran Rana muscosa. Additionally, we release sequence data from a mock metagenome community. The datasets reported here can be used without restriction to develop new algorithms and explore complex genome structure and evolution. Data were generated on the PacBio Sequel II System.


Subject(s)
High-Throughput Nucleotide Sequencing; Mice/genetics; Zea mays/genetics; Animals; Fragaria/genetics; Genome, Plant; Metagenome; Ranidae/genetics; Sequence Analysis, DNA
8.
Nat Biotechnol ; 33(6): 623-30, 2015 Jun.
Article in English | MEDLINE | ID: mdl-26006009

ABSTRACT

Long-read, single-molecule real-time (SMRT) sequencing is routinely used to finish microbial genomes, but available assembly methods have not scaled well to larger genomes. We introduce the MinHash Alignment Process (MHAP) for overlapping noisy, long reads using probabilistic, locality-sensitive hashing. Integrating MHAP with the Celera Assembler enabled reference-grade de novo assemblies of Saccharomyces cerevisiae, Arabidopsis thaliana, Drosophila melanogaster and a human hydatidiform mole cell line (CHM1) from SMRT sequencing. The resulting assemblies are highly continuous, include fully resolved chromosome arms and close persistent gaps in these reference genomes. Our assembly of D. melanogaster revealed previously unknown heterochromatic and telomeric transition sequences, and we assembled low-complexity sequences from CHM1 that fill gaps in the human GRCh38 reference. Using MHAP and the Celera Assembler, single-molecule sequencing can produce de novo near-complete eukaryotic assemblies that are 99.99% accurate when compared with available reference genomes.


Subject(s)
Genome, Fungal; Genome, Human; Genome, Insect; Genome, Plant; Sequence Analysis, DNA; Animals; Arabidopsis/genetics; Base Sequence; Chromosomes/genetics; Drosophila melanogaster/genetics; Heterochromatin; High-Throughput Nucleotide Sequencing/methods; Humans; Saccharomyces cerevisiae/genetics; Sequence Alignment
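
The abstract of entry 8 describes MHAP as overlapping noisy long reads with probabilistic, locality-sensitive hashing. The general MinHash idea behind that description can be sketched as follows: each read is reduced to a short list of minimum k-mer hash values, and the fraction of matching positions between two lists estimates the Jaccard similarity of the reads' k-mer sets. This is a minimal sketch under stated assumptions, not the MHAP implementation; the function names, the k-mer size, the sketch length, and the use of seeded BLAKE2b hashes are all illustrative choices.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer, seed):
    # Seeded 64-bit hash; stands in for a family of independent hash functions.
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k=8, num_hashes=128):
    """Build a MinHash sketch: for each seed, keep the smallest k-mer hash."""
    ks = kmers(seq, k)
    if not ks:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(km, seed) for km in ks) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of sketch positions that agree, an estimate of Jaccard similarity."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must have the same length")
    matches = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return matches / len(sketch_a)


if __name__ == "__main__":
    # Toy reads (made up): read_b overlaps the second half of read_a.
    read_a = "ACGTTGCATGCCGATAGCTAGGCTAACGTTAGCCGATCGATTACG"
    read_b = "GGCTAACGTTAGCCGATCGATTACGTTTGACCATGCAAGT"
    print(estimate_jaccard(minhash_sketch(read_a), minhash_sketch(read_b)))
```

Comparing fixed-size sketches instead of full k-mer sets keeps the cost per read pair constant, which is the property that makes this kind of approach attractive for all-against-all overlap detection in large genomes.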
9.
Sci Data ; 1: 140045, 2014.
Article in English | MEDLINE | ID: mdl-25977796

ABSTRACT

Single molecule, real-time (SMRT) sequencing from Pacific Biosciences is increasingly used in many areas of biological research including de novo genome assembly, structural-variant identification, haplotype phasing, mRNA isoform discovery, and base-modification analyses. High-quality, public datasets of SMRT sequences can spur development of analytic tools that can accommodate unique characteristics of SMRT data (long read lengths, lack of GC or amplification bias, and a random error profile leading to high consensus accuracy). In this paper, we describe eight high-coverage SMRT sequence datasets from five organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster) that have been publicly released to the general scientific community (NCBI Sequence Read Archive ID SRP040522). Data were generated using two sequencing chemistries (P4C2 and P5C3) on the PacBio RS II instrument. The datasets reported here can be used without restriction by the research community to generate whole-genome assemblies, test new algorithms, investigate genome structure and evolution, and identify base modifications in some of the most widely-studied model systems in biological research.


Subject(s)
Arabidopsis/genetics; Drosophila melanogaster/genetics; Escherichia coli/genetics; Genome, Bacterial; Genome, Fungal; Genome, Insect; Genome, Plant; Neurospora crassa/genetics; Saccharomyces cerevisiae/genetics; Sequence Analysis, DNA; Animals; Models, Animal
11.
Science ; 330(6012): 1787-97, 2010 Dec 24.
Article in English | MEDLINE | ID: mdl-21177974

ABSTRACT

To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- and tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation.


Subject(s)
Chromatin; Drosophila melanogaster/genetics; Gene Regulatory Networks; Genome, Insect; Molecular Sequence Annotation; Animals; Binding Sites; Chromatin/genetics; Chromatin/metabolism; Computational Biology/methods; Drosophila Proteins/genetics; Drosophila Proteins/metabolism; Drosophila melanogaster/growth & development; Drosophila melanogaster/metabolism; Epigenesis, Genetic; Gene Expression Regulation; Genes, Insect; Genomics/methods; Histones/metabolism; Nucleosomes/genetics; Nucleosomes/metabolism; Promoter Regions, Genetic; RNA, Small Untranslated/genetics; RNA, Small Untranslated/metabolism; Transcription Factors/metabolism; Transcription, Genetic