Results 1 - 20 of 22
1.
PLoS Comput Biol ; 17(6): e1009078, 2021 06.
Article in English | MEDLINE | ID: mdl-34153026

ABSTRACT

It is computationally challenging to detect variation by aligning single-molecule sequencing (SMS) reads, or contigs from SMS assemblies. One approach to efficiently align SMS reads is sparse dynamic programming (SDP), where optimal chains of exact matches are found between the sequence and the genome. While straightforward implementations of SDP penalize gaps with a cost that is a linear function of gap length, biological variation is more accurately represented when gap cost is a concave function of gap length. We have developed a method, lra, that uses SDP with a concave-cost gap penalty, and used lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. This alignment approach increases sensitivity and specificity for SV discovery, particularly for variants above 1 kb and when discovering variation from ONT reads, with runtimes that are comparable (1.05-3.76×) to current methods. When applied to calling variation from de novo assembly contigs, there is a 3.2% increase in Truvari F1 score compared to minimap2+htsbox. lra is available in Bioconda (https://anaconda.org/bioconda/lra) and GitHub (https://github.com/ChaissonLab/LRA).
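
As an illustration of why the shape of the gap penalty matters, the following minimal sketch chains exact-match anchors with a naive quadratic dynamic program. It is not the lra implementation and not sparse dynamic programming; the function names, the `log1p` penalty, and the `scale` and `per_base` parameters are assumptions chosen only for the example.

```python
import math


def linear_gap(gap, per_base=1.0):
    # Linear penalty: cost grows in proportion to gap length.
    return per_base * gap


def concave_gap(gap, scale=4.0):
    # One possible concave penalty (an assumption): cost grows like log(1 + gap).
    return scale * math.log1p(gap)


def best_chain_score(anchors, gap_cost):
    """Score of the best chain of exact matches.

    anchors: iterable of (read_start, genome_start, length) tuples.
    A chain's score is the sum of anchor lengths minus the gap cost
    between consecutive anchors. Naive O(n^2) dynamic program.
    """
    anchors = sorted(anchors)
    best = []
    for i, (r, g, n) in enumerate(anchors):
        score = n  # chain consisting of this anchor alone
        for j in range(i):
            pr, pg, pn = anchors[j]
            # The predecessor must end before this anchor on both sequences.
            if pr + pn <= r and pg + pn <= g:
                # Gap length = difference between the two inter-anchor distances.
                gap = abs((g - (pg + pn)) - (r - (pr + pn)))
                score = max(score, best[j] + n - gap_cost(gap))
        best.append(score)
    return max(best, default=0)


# Two 50-base matches separated by a 1,000-base deletion in the read.
anchors = [(0, 0, 50), (50, 1050, 50)]
print(best_chain_score(anchors, linear_gap))   # chain is broken at the gap
print(best_chain_score(anchors, concave_gap))  # chain spans the gap
```

With the linear penalty, joining the two anchors costs more than it gains, so the best chain is a single anchor. With the concave penalty the long gap is cheap enough for the chain to span it, which is the behavior that makes large structural variants visible within one alignment.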


Subjects
Contig Mapping/statistics & numerical data ; Sequence Alignment/statistics & numerical data ; Software ; Cluster Analysis ; Computational Biology ; Computer Simulation ; Databases, Nucleic Acid/statistics & numerical data ; Genetic Variation ; Genome, Human ; High-Throughput Nucleotide Sequencing ; Humans ; Programming, Linear ; Sequence Analysis, DNA
2.
PLoS One ; 14(9): e0216885, 2019.
Article in English | MEDLINE | ID: mdl-31498807

ABSTRACT

Unknown sequences, or gaps, are present in many published genomes across public databases. Gap filling is an important finishing step in de novo genome assembly, especially in large genomes. The gap filling problem is nontrivial and while there are many computational tools partially solving the problem, several have shortcomings as to the reliability and correctness of the output, i.e. the gap filled draft genome. SSPACE-LongRead is a scaffolding tool that utilizes long reads from multiple third-generation sequencing platforms in finding links between contigs and combining them. The long reads potentially contain sequence information to fill the gaps created in the scaffolding, but SSPACE-LongRead currently lacks this functionality. We present an automated pipeline called gapFinisher to process SSPACE-LongRead output to fill gaps after the scaffolding. gapFinisher is based on the controlled use of a previously published gap filling tool FGAP and works on all standard Linux/UNIX command lines. We compare the performance of gapFinisher against two other published gap filling tools PBJelly and GMcloser. We conclude that gapFinisher can fill gaps in draft genomes quickly and reliably. In addition, the serial design of gapFinisher makes it scale well from prokaryote genomes to larger genomes with no increase in the computational footprint.


Subjects
Algorithms ; Contig Mapping/statistics & numerical data ; Genome ; Genomics/methods ; Sequence Analysis, DNA/statistics & numerical data ; Software ; Animals ; Bacteria/genetics ; Benchmarking ; Databases, Genetic ; Genomics/statistics & numerical data ; High-Throughput Nucleotide Sequencing ; Seals, Earless/genetics
3.
PLoS One ; 13(1): e0190938, 2018.
Article in English | MEDLINE | ID: mdl-29351302

ABSTRACT

When human samples are sequenced, many assembled contigs are "unknown", as conventional alignments find no similarity to known sequences. Hidden Markov models (HMM) exploit the positions of specific nucleotides in protein-encoding codons in various microbes. The algorithm HMMER3 implements HMM using a reference set of sequences encoding viral proteins, "vFam". We used HMMER3 analysis of "unknown" human sample-derived sequences and identified 510 contigs distantly related to viruses (Anelloviridae (n = 1), Baculoviridae (n = 34), Circoviridae (n = 35), Caulimoviridae (n = 3), Closteroviridae (n = 5), Geminiviridae (n = 21), Herpesviridae (n = 10), Iridoviridae (n = 12), Marseillevirus (n = 26), Mimiviridae (n = 80), Phycodnaviridae (n = 165), Poxviridae (n = 23), Retroviridae (n = 6) and 89 contigs related to described viruses not yet assigned to any taxonomic family). In summary, we find that analysis using the HMMER3 algorithm and the "vFam" database greatly extended the detection of viruses in biospecimens from humans.


Subjects
Microbiota ; Viruses/genetics ; Viruses/isolation & purification ; Algorithms ; Computational Biology ; Contig Mapping/statistics & numerical data ; Databases, Protein/statistics & numerical data ; Humans ; Markov Chains ; Metagenomics/statistics & numerical data ; Phylogeny ; Viral Proteins/genetics ; Viruses/classification
4.
Proc Natl Acad Sci U S A ; 114(47): 12512-12517, 2017 11 21.
Article in English | MEDLINE | ID: mdl-29078313

ABSTRACT

Accurate detection of variants and long-range haplotypes in genomes of single human cells remains very challenging. Common approaches require extensive in vitro amplification of genomes of individual cells using DNA polymerases and high-throughput short-read DNA sequencing. These approaches have two notable drawbacks. First, polymerase replication errors could generate tens of thousands of false-positive calls per genome. Second, relatively short sequence reads contain little to no haplotype information. Here we report a method, which is dubbed SISSOR (single-stranded sequencing using microfluidic reactors), for accurate single-cell genome sequencing and haplotyping. A microfluidic processor is used to separate the Watson and Crick strands of the double-stranded chromosomal DNA in a single cell and to randomly partition megabase-size DNA strands into multiple nanoliter compartments for amplification and construction of barcoded libraries for sequencing. The separation and partitioning of large single-stranded DNA fragments of the homologous chromosome pairs allows for the independent sequencing of each of the complementary and homologous strands. This enables the assembly of long haplotypes and reduction of sequence errors by using the redundant sequence information and haplotype-based error removal. We demonstrated the ability to sequence single-cell genomes with error rates as low as 10^-8 and average 500-kb-long DNA fragments that can be assembled into haplotype contigs with N50 greater than 7 Mb. The performance could be further improved with more uniform amplification and more accurate sequence alignment. The ability to obtain accurate genome sequences and haplotype information from single cells will enable applications of genome sequencing for diverse clinical needs.


Subjects
Contig Mapping/methods ; Genome, Human ; Haplotypes ; Microfluidic Analytical Techniques/methods ; Single-Cell Analysis/methods ; Whole Genome Sequencing/methods ; Alleles ; Cell Line ; Contig Mapping/statistics & numerical data ; Fibroblasts/cytology ; Fibroblasts/metabolism ; HLA Antigens/genetics ; HLA Antigens/metabolism ; Humans ; Microfluidic Analytical Techniques/instrumentation ; Mutation ; Polymorphism, Single Nucleotide ; Single-Cell Analysis/instrumentation ; Whole Genome Sequencing/instrumentation
5.
J Comput Biol ; 22(5): 367-76, 2015 May.
Article in English | MEDLINE | ID: mdl-25535824

ABSTRACT

Metatranscriptomic analysis provides information on how a microbial community reacts to environmental changes. Using next-generation sequencing (NGS) technology, biologists can study the microbe community by sampling short reads from a mixture of mRNAs (metatranscriptomic data). As most microbial genome sequences are unknown, it would seem that de novo assembly of the mRNAs is needed. However, NGS reads are short and mRNAs share many similar regions and differ tremendously in abundance levels, making de novo assembly challenging. The existing assembler, IDBA-MT, is designed specifically for the assembly of metatranscriptomic data but performs well only on highly expressed mRNAs. This article introduces IDBA-MTP, which adopts a novel approach to metatranscriptomic assembly that makes use of the fact that there is a database of millions of known protein sequences associated with mRNAs. How to effectively use the protein information is nontrivial given the size of the database and given that different mRNAs might lead to proteins with similar functions (because different amino acids might have similar characteristics). IDBA-MTP employs a similarity measure between mRNAs and protein sequences, dynamic programming techniques, and seed-and-extend heuristics to tackle the problem effectively and efficiently. Experimental results show that IDBA-MTP outperforms existing assemblers by reconstructing 14% more mRNAs.


Subjects
Bacterial Proteins/chemistry ; Contig Mapping/statistics & numerical data ; Microbial Consortia/genetics ; RNA, Messenger/chemistry ; Software ; Transcriptome ; Algorithms ; Bacterial Proteins/genetics ; Contig Mapping/methods ; Data Mining ; High-Throughput Nucleotide Sequencing ; Metagenomics/methods ; Metagenomics/statistics & numerical data ; Proteome/chemistry ; Proteome/genetics ; RNA, Bacterial/chemistry ; RNA, Bacterial/genetics ; RNA, Messenger/genetics ; Sequence Analysis, DNA
6.
Nat Commun ; 5: 5695, 2014 Dec 17.
Article in English | MEDLINE | ID: mdl-25517223

ABSTRACT

Closing gaps in draft genome assemblies can be costly and time-consuming, and published genomes are therefore often left 'unfinished.' Here we show that genome-wide chromosome conformation capture (3C) data can be used to overcome these limitations, and present a computational approach rooted in polymer physics that determines the most likely genome structure using chromosomal contact data. This algorithm, named GRAAL, generates high-quality assemblies of genomes in which repeated and duplicated regions are accurately represented and offers a direct probabilistic interpretation of the computed structures. We first validated GRAAL on the reference genome of Saccharomyces cerevisiae, as well as other yeast isolates, where GRAAL recovered both known and unknown complex chromosomal structural variations. We then applied GRAAL to the finishing of the assembly of Trichoderma reesei and obtained a number of contigs congruent with the known karyotype of this species. Finally, we showed that GRAAL can accurately reconstruct human chromosomes from either fragments generated in silico or contigs obtained from de novo assembly. In all these applications, GRAAL compared favourably to recently published programmes implementing related approaches.


Subjects
Algorithms ; Chromosomes, Fungal ; Chromosomes, Human ; Contig Mapping/statistics & numerical data ; Genome ; Models, Statistical ; Contig Mapping/methods ; High-Throughput Nucleotide Sequencing ; Humans ; Karyotype ; Saccharomyces cerevisiae/genetics ; Sequence Analysis, DNA ; Trichoderma/genetics
7.
Comput Biol Chem ; 53 Pt A: 97-107, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25262360

ABSTRACT

Selecting the values of parameters used by de novo genomic assembly programs, or choosing an optimal de novo assembly from several runs obtained with different parameters or programs, are tasks that can require complex decision-making. A key parameter that must be supplied to typical next generation sequencing (NGS) assemblers is the k-mer length, i.e., the word size that determines which de Bruijn graph the program should map out and use. The topic of assembly selection criteria was recently revisited in the Assemblathon 2 study (Bradnam et al., 2013). Although no clear message was delivered with regard to optimal k-mer lengths, it was shown with examples that it is sometimes important to decide if one is most interested in optimizing the sequences of protein-coding genes (the gene space) or in optimizing the whole genome sequence including the intergenic DNA, as what is best for one criterion may not be best for the other. In the present study, our aim was to better understand how the assembly of unicellular fungi (which are typically intermediate in size and complexity between prokaryotes and metazoan eukaryotes) can change as one varies the k-mer values over a wide range. We used two different de novo assembly programs (SOAPdenovo2 and ABySS), and simple assembly metrics that also focused on success in assembling the gene space and repetitive elements. A recent increase in Illumina read length to around 150 bp allowed us to attempt de novo assemblies with a larger range of k-mers, up to 127 bp. We applied these methods to Illumina paired-end sequencing read sets of fungal strains of Paracoccidioides brasiliensis and other species. 
By visualizing the results in simple plots, we were able to track the effect of changing k-mer size and assembly program, and to demonstrate how such plots can readily reveal discontinuities or other unexpected characteristics that assembly programs can present in practice, especially when they are used in a traditional molecular microbiology laboratory with a 'genomics corner'. Here we propose and apply a component of a first pass validation methodology for benchmarking and understanding fungal genome de novo assembly processes.
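
To make the role of the k-mer length concrete, here is a minimal sketch of how a de Bruijn graph can be built from reads for a given k. It is a toy illustration, not the data structure used by SOAPdenovo2 or ABySS; the function name `de_bruijn_edges` and the example read are hypothetical.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it.

    Every k-mer in a read contributes one edge: prefix -> suffix.
    Reads shorter than k contribute nothing.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


reads = ["ACGTAC"]
for k in (3, 5, 7):
    # Larger k gives fewer, more specific nodes; k above the read length gives none.
    print(k, de_bruijn_edges(reads, k))
```

Because a read of length L yields only L - k + 1 k-mers, the usable range of k is bounded by the read length, which is consistent with the abstract's remark that longer Illumina reads allowed larger k-mers to be tried.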


Subjects
Algorithms ; Contig Mapping/statistics & numerical data ; Genome, Fungal ; Paracoccidioides/genetics ; Sequence Analysis, DNA/statistics & numerical data ; Benchmarking ; DNA, Intergenic ; High-Throughput Nucleotide Sequencing ; Open Reading Frames ; Repetitive Sequences, Nucleic Acid
8.
BMC Res Notes ; 7: 371, 2014 Jun 18.
Article in English | MEDLINE | ID: mdl-24938749

ABSTRACT

BACKGROUND: The fast reduction of prices of DNA sequencing allowed rapid accumulation of genome data. However, the process of obtaining complete genome sequences is still very time consuming and labor demanding. In addition, data produced from various sequencing technologies or alternative assemblies remain underexplored to improve assembly of incomplete genome sequences. FINDINGS: We have developed FGAP, a tool for closing gaps of draft genome sequences that takes advantage of different datasets. FGAP uses BLAST to align multiple contigs against a draft genome assembly aiming to find sequences that overlap gaps. The algorithm selects the best sequence to fill and eliminate the gap. CONCLUSIONS: FGAP reduced the number of gaps by 78% in an E. coli draft genome assembly using two different sequencing technologies, Illumina and 454. Using PacBio long reads, 98% of gaps were solved. In human chromosome 14 assemblies, FGAP reduced the number of gaps by 35%. All the inserted sequences were validated with a reference genome using QUAST. The source code and a web tool are available at http://www.bioinfo.ufpr.br/fgap/.


Subjects
Contig Mapping/methods ; Escherichia coli/genetics ; Genome, Bacterial ; Genome, Human ; Software ; Algorithms ; Base Sequence ; Chromosomes, Human, Pair 14 ; Contig Mapping/statistics & numerical data ; High-Throughput Nucleotide Sequencing ; Humans ; Molecular Sequence Data
9.
Genome Biol ; 14(9): R101, 2013.
Article in English | MEDLINE | ID: mdl-24034426

ABSTRACT

BACKGROUND: The short reads output by first- and second-generation DNA sequencing instruments cannot completely reconstruct microbial chromosomes. Therefore, most genomes have been left unfinished due to the significant resources required to manually close gaps in draft assemblies. Third-generation, single-molecule sequencing addresses this problem by greatly increasing sequencing read length, which simplifies the assembly problem. RESULTS: To measure the benefit of single-molecule sequencing on microbial genome assembly, we sequenced and assembled the genomes of six bacteria and analyzed the repeat complexity of 2,267 complete bacteria and archaea. Our results indicate that the majority of known bacterial and archaeal genomes can be assembled without gaps, at finished-grade quality, using a single PacBio RS sequencing library. These single-library assemblies are also more accurate than typical short-read assemblies and hybrid assemblies of short and long reads. CONCLUSIONS: Automated assembly of long, single-molecule sequencing data reduces the cost of microbial finishing to $1,000 for most genomes, and future advances in this technology are expected to drive the cost lower. This is expected to increase the number of completed genomes, improve the quality of microbial genome databases, and enable high-fidelity, population-scale studies of pan-genomes and chromosomal organization.


Subjects
Contig Mapping/methods ; Genome, Archaeal ; Genome, Bacterial ; Sequence Analysis, DNA/methods ; Software ; Algorithms ; Base Sequence ; Contig Mapping/statistics & numerical data ; Escherichia coli/genetics ; Francisella tularensis/genetics ; Genome Size ; Genomic Library ; Mannheimia haemolytica/genetics ; Molecular Sequence Data ; Salmonella enterica/genetics ; Sequence Analysis, DNA/economics ; Sequence Analysis, DNA/statistics & numerical data
10.
Genome Biol ; 14(9): R100, 2013.
Article in English | MEDLINE | ID: mdl-24028704

ABSTRACT

BACKGROUND: Haplotypes are important for assessing genealogy and disease susceptibility of individual genomes, but are difficult to obtain with routine sequencing approaches. Experimental haplotype reconstruction based on assembling fragments of individual chromosomes is promising, but with variable yields due to incompletely understood parameter choices. RESULTS: We parameterize the clone-based haplotyping problem in order to provide theoretical and empirical assessments of the impact of different parameters on haplotype assembly. We confirm the intuition that long clones help link together heterozygous variants and thus improve haplotype length. Furthermore, given the length of the clones, we address how to choose the other parameters, including number of pools, clone coverage and sequencing coverage, so as to maximize haplotype length. We model the problem theoretically and show empirically the benefits of using larger clones with moderate number of pools and sequencing coverage. In particular, using 140 kb BAC clones, we construct haplotypes for a personal genome and assemble haplotypes with N50 values greater than 2.6 Mb. These assembled haplotypes are longer and at least as accurate as haplotypes of existing clone-based strategies, whether in vivo or in vitro. CONCLUSIONS: Our results provide practical guidelines for the development and design of clone-based methods to achieve long range, high-resolution and accurate haplotypes.


Subjects
Algorithms ; Contig Mapping/methods ; Genome, Human ; HLA Antigens/genetics ; Haplotypes ; Molecular Typing/methods ; Chromosomes, Artificial, Bacterial ; Cloning, Molecular ; Contig Mapping/statistics & numerical data ; Humans ; Molecular Typing/statistics & numerical data ; Polymorphism, Single Nucleotide ; Sequence Analysis, DNA
11.
BMC Res Notes ; 6: 334, 2013 Aug 22.
Article in English | MEDLINE | ID: mdl-23965294

ABSTRACT

BACKGROUND: The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments "read" by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still relies on the availability of independently determined standards, such as manually curated genome sequences, or independently produced mapping data. These "gold standards" can be expensive to produce and may only cover a small fraction of the genome, which limits their applicability to newly generated genome sequences. Here we introduce a de novo probabilistic measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads from the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics. RESULTS: We demonstrate that our de novo score can be computed quickly and accurately in a practical setting even for large datasets, by estimating the score from a relatively small sample of the reads. To demonstrate the benefits of our score, we measure the quality of the assemblies generated in the GAGE and Assemblathon 1 assembly "bake-offs" with our metric. Even without knowledge of the true reference sequence, our de novo metric closely matches the reference-based evaluation metrics used in the studies and outperforms other de novo metrics traditionally used to measure assembly quality (such as N50). Finally, we highlight the application of our score to optimize assembly parameters used in genome assemblers, which enables better assemblies to be produced, even without prior knowledge of the genome being assembled. 
CONCLUSION: Likelihood-based measures, such as ours proposed here, will become the new standard for de novo assembly evaluation.
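
For readers unfamiliar with N50, the traditional metric that the abstract above compares against, a minimal sketch of its usual definition follows: the length of the contig at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total assembly length. The function name `n50` is hypothetical, and this is not the likelihood score proposed in the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


print(n50([2, 3, 4, 5, 6]))
print(n50([1, 1, 1, 1, 10]))
```

The computation uses only contig lengths, so joining contigs incorrectly can raise N50. That is one way to see why a reference-free score that also accounts for the reads, such as a likelihood, can be more informative.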


Subjects
Contig Mapping/statistics & numerical data ; Genome, Bacterial ; Rhodobacter sphaeroides/genetics ; Software ; Staphylococcus aureus/genetics ; Staphylococcus epidermidis/genetics ; Algorithms ; Genomics/methods ; Likelihood Functions ; Sequence Analysis, DNA
12.
BMC Genomics ; 5: 84, 2004 Nov 03.
Article in English | MEDLINE | ID: mdl-15527499

ABSTRACT

BACKGROUND: The ongoing efforts to sequence the honey bee genome require additional initiatives to define its transcriptome. Towards this end, we employed the Open Reading Frame ESTs (ORESTES) strategy to generate profiles for the life cycle of Apis mellifera workers. RESULTS: Of the 5,021 ORESTES, 35.2% matched with previously deposited Apis ESTs. The analysis of the remaining sequences defined a set of putative orthologs whose majority had their best-match hits with Anopheles and Drosophila genes. CAP3 assembly of the Apis ORESTES with the already existing 15,500 Apis ESTs generated 3,408 contigs. BLASTX comparison of these contigs with protein sets of organisms representing distinct phylogenetic clades revealed a total of 1,629 contigs that Apis mellifera shares with different taxa. Most (41%) represent genes that are in common to all taxa, another 21% are shared between metazoans (Bilateria), and 16% are shared only within the Insecta clade. A set of 23 putative genes presented a best match with human genes, many of which encode factors related to cell signaling/signal transduction. 1,779 contigs (52%) did not match any known sequence. Applying a correction factor deduced from a parallel analysis performed with Drosophila melanogaster ORESTES, we estimate that approximately half of these no-match EST contigs (22%) should represent Apis-specific genes. CONCLUSIONS: The versatile and cost-efficient ORESTES approach produced minilibraries for honey bee life cycle stages. Such information on central gene regions contributes to genome annotation and also lends itself to cross-transcriptome comparisons to reveal evolutionary trends in insect genomes.


Subjects
Bees/genetics ; Expressed Sequence Tags ; Open Reading Frames/genetics ; Transcription, Genetic/genetics ; Animals ; Anopheles/genetics ; Caenorhabditis elegans ; Classification ; Cluster Analysis ; Contig Mapping/statistics & numerical data ; Drosophila melanogaster/genetics ; Genes, Helminth/genetics ; Genes, Insect/genetics ; Genome ; Genome, Fungal ; Genome, Human ; Genome, Protozoan ; Humans
13.
BMC Genomics ; 5: 89, 2004 Nov 16.
Article in English | MEDLINE | ID: mdl-15546486

ABSTRACT

BACKGROUND: The cellular response of plants to water-deficits has both economic and evolutionary importance directly affecting plant productivity in agriculture and plant survival in the natural environment. Genes induced by water-deficit stress have been successfully enumerated in plants that are relatively sensitive to cellular dehydration; however, we have little knowledge as to the adaptive role of these genes in establishing tolerance to water loss at the cellular level. Our approach to address this problem has been to investigate the genetic responses of plants that are capable of tolerating extremes of dehydration, in particular the desiccation-tolerant bryophyte, Tortula ruralis. To establish a sound basis for characterizing the Tortula genome with regard to desiccation tolerance, we analyzed 10,368 expressed sequence tags (ESTs) from rehydrated rapid-dried Tortula gametophytes, a stage previously determined to exhibit the maximum stress induced change in gene expression. RESULTS: The 10,368 ESTs formed 5,563 EST clusters (contig groups representing individual genes) of which 3,321 (59.7%) exhibited similarity to genes present in the public databases and 2,242 were categorized as unknowns based on protein homology scores. The 3,321 clusters were classified by function using the Gene Ontology (GO) hierarchy and the KEGG database. The results indicate that the transcriptome contains a diverse population of transcripts that reflects, as expected, a period of metabolic upheaval in the gametophyte cells. Much of the emphasis within the transcriptome is centered on the protein synthetic machinery, ion and metabolite transport, and membrane biosynthesis and repair. Rehydrating gametophytes also have an abundance of transcripts that code for enzymes involved in oxidative stress metabolism and phosphorylating activities. 
The functional classifications reflect a remarkable consistency with what we have previously established with regards to the metabolic activities that are important in the recovery of the gametophytes from desiccation. A comparison of the GO distribution of Tortula clusters with an identical analysis of 9,981 clusters from the desiccation sensitive bryophyte species Physcomitrella patens, revealed, and accentuated, the differences between stressed and unstressed transcriptomes. Cross species sequence comparisons indicated that on the whole the Tortula clusters were more closely related to those from Physcomitrella than Arabidopsis (complete genome BLASTx comparison) although because of the differences in the databases there were more high scoring matches to the Arabidopsis sequences. The most abundant transcripts contained within the Tortula ESTs encode Late Embryogenesis Abundant (LEA) proteins that are normally associated with drying plant tissues. This suggests that LEAs may also play a role in recovery from desiccation when water is reintroduced into a dried tissue. CONCLUSION: The establishment of a rehydration EST collection for Tortula ruralis, an important plant model for plant stress responses and vegetative desiccation tolerance, is an important step in understanding the genome level response to cellular dehydration. The type of transcript analysis performed here has laid the foundation for more detailed functional and genome level analyses of the genes involved in desiccation tolerance in plants.


Subjects
Bryophyta/genetics ; Bryophyta/metabolism ; DNA, Plant/classification ; Desiccation ; Genes, Plant/genetics ; Transcription, Genetic/genetics ; Water/metabolism ; Arabidopsis/genetics ; Cluster Analysis ; Conserved Sequence/genetics ; Contig Mapping/statistics & numerical data ; Databases, Genetic ; Expressed Sequence Tags ; Open Reading Frames/genetics
14.
Genome Res ; 14(4): 493-506, 2004 Apr.
Article in English | MEDLINE | ID: mdl-15059990

ABSTRACT

We assessed the content, structure, and distribution of segmental duplications (≥90% sequence identity, ≥5 kb length) within the published version of the Rattus norvegicus genome assembly (v.3.1). The overall fraction of duplicated sequence within the rat assembly (2.92%) is greater than that of the mouse (1%-1.2%) but significantly less than that of human (~5%). Duplications were nonuniformly distributed, occurring predominantly as tandem and tightly clustered intrachromosomal duplications. Regions containing extensive interchromosomal duplications were observed, particularly within subtelomeric and pericentromeric regions. We identified 41 discrete genomic regions greater than 1 Mb in size, termed "duplication blocks." These appear to have been the target of extensive duplication over millions of years of evolution. Gene content within duplicated regions (~1%) was lower than expected based on the genome representation. Interestingly, sequence contigs lacking chromosome assignment ("the unplaced chromosome") showed a marked enrichment for segmental duplication (45% of 75.2 Mb), indicating that segmental duplications have been problematic for sequence and assembly of the rat genome. Further targeted efforts are required to resolve the organization and complexity of these regions.


Subjects
Gene Duplication ; Rats, Inbred BN/genetics ; Animals ; Base Composition/genetics ; Chromosomes/genetics ; Computational Biology/methods ; Computational Biology/statistics & numerical data ; Contig Mapping/methods ; Contig Mapping/statistics & numerical data ; Gene Conversion/genetics ; Genes/genetics ; Genome ; Rats
15.
Genome Res ; 14(4): 679-84, 2004 Apr.
Article in English | MEDLINE | ID: mdl-15060010

ABSTRACT

CLONEPICKER is a software pipeline that integrates sequence data with BAC clone fingerprints to dynamically select a minimal overlapping clone set covering the whole genome. In the Rat Genome Sequencing Project (RGSP), a hybrid strategy of "clone by clone" and "whole genome shotgun" approaches was used to maximize the merits of both approaches. As with the "clone by clone" method, one key challenge for this strategy was to select a low-redundancy clone set that covered the whole genome while the sequencing was in progress. The CLONEPICKER pipeline met this challenge using restriction enzyme fingerprint data, BAC end sequence data, and sequences generated from individual BAC clones as well as WGS reads. In the RGSP, an average of 7.5 clones was identified from each side of a seed clone, and the minimal overlapping clones were reliably selected. Combined with the assembled BAC fingerprint map, a set of BAC clones that covered >97% of the genome was identified and used in the RGSP.


Subjects
Chromosomes, Artificial, Bacterial/genetics ; Contig Mapping/methods ; Genome ; Sequence Analysis, DNA/methods ; Animals ; Computational Biology/methods ; Computational Biology/statistics & numerical data ; Contig Mapping/statistics & numerical data ; DNA Fingerprinting/methods ; DNA Fingerprinting/statistics & numerical data ; Rats ; Sequence Analysis, DNA/statistics & numerical data ; Software/statistics & numerical data
16.
Genome Res ; 14(1): 99-108, 2004 Jan.
Article in English | MEDLINE | ID: mdl-14672978

ABSTRACT

Comprehensive identification of DNA cis-regulatory elements is crucial for a predictive understanding of transcriptional network dynamics. Strong evidence suggests that these DNA sequence motifs are highly conserved between related species, reflecting strong selection on the network of regulatory interactions that underlie common cellular behavior. Here, we exploit a systems-level aspect of this conservation (the network-level topology of these interactions) to map transcription factor (TF) binding sites on a genomic scale. Using network-level conservation as a constraint, our algorithm finds 71% of known TF binding sites in the yeast Saccharomyces cerevisiae, using only 12% of the sequence of a phylogenetic neighbor. Most of the novel predicted motifs show strong features of known TF binding sites, such as functional category and/or expression profile coherence of their corresponding genes. Network-level conservation should provide a powerful constraint for the systematic mapping of TF binding sites in the larger genomes of higher eukaryotes.


Subjects
Conserved Sequence/genetics , Fungal Genome , Saccharomyces cerevisiae/genetics , Transcription Factors/genetics , Transcription Factors/metabolism , Algorithms , Base Composition/genetics , Binding Sites/genetics , Binding Sites/physiology , Contig Mapping/methods , Contig Mapping/statistics & numerical data , Fungal DNA/genetics , Humans , Genetic Models , Statistical Models , Predictive Value of Tests
17.
Plant Cell ; 14(7): 1441-56, 2002 Jul.
Article in English | MEDLINE | ID: mdl-12119366

ABSTRACT

Analysis of a collection of 120,892 single-pass ESTs, derived from 26 different tomato cDNA libraries and reduced to a set of 27,274 unique consensus sequences (unigenes), revealed that 70% of the unigenes have identifiable homologs in the Arabidopsis genome. Genes corresponding to metabolism have remained most conserved between these two genomes, whereas genes encoding transcription factors are among the fastest evolving. The majority of the 10 largest conserved multigene families share similar copy numbers in tomato and Arabidopsis, suggesting that the multiplicity of these families may have occurred before the divergence of these two species. An exception to this multigene conservation was observed for the E8-like protein family, which is associated with fruit ripening and has higher copy number in tomato than in Arabidopsis. Finally, six BAC clones from different parts of the tomato genome were isolated, genetically mapped, sequenced, and annotated. The combined analysis of the EST database and these six sequenced BACs leads to the prediction that the tomato genome encodes approximately 35,000 genes, which are sequestered largely in euchromatic regions corresponding to less than one-quarter of the total DNA in the tomato nucleus.


Subjects
Expressed Sequence Tags , Plant Genome , Solanum lycopersicum/genetics , Arabidopsis/genetics , Chromosome Mapping , Bacterial Chromosomes/genetics , Molecular Cloning , Consensus Sequence/genetics , Conserved Sequence/genetics , Contig Mapping/methods , Contig Mapping/statistics & numerical data , Bacterial DNA/genetics , Plant DNA/genetics , Molecular Evolution , Genomic Library , Medicago/genetics , Molecular Sequence Data , Multigene Family , DNA Sequence Analysis/methods
18.
Bioinformatics ; 18(3): 484-5, 2002 Mar.
Article in English | MEDLINE | ID: mdl-11934749

ABSTRACT

SUMMARY: One of the more common uses of the program FingerPrint Contigs (FPC) is to assemble random restriction digest 'fingerprints' of overlapping genomic clones into contigs. To improve the rate of assembling contigs from large fingerprint databases, we have adapted FPC so that it can be run in parallel on multiple processors and servers. The current version of 'parallelized FPC' has been used in our laboratory to assemble mammalian BAC fingerprint databases, each containing more than 300,000 BAC fingerprints. AVAILABILITY: This parallelized version of FPC is available under the GNU GPL licence, and can be downloaded from ftp://ftp.bcgsc.bc.ca/pub/fpcd.


Subjects
Algorithms , Molecular Cloning , Computing Methodologies , Contig Mapping/methods , Genetic Databases , Animals , Base Sequence , Contig Mapping/statistics & numerical data , DNA Fingerprinting/methods , Humans , Internet , Mice , Molecular Sequence Data , Restriction Mapping , Sensitivity and Specificity , Sequence Tagged Sites , Software , Time Factors
19.
Genome Biol ; 3(12): RESEARCH0074, 2002.
Article in English | MEDLINE | ID: mdl-12537563

ABSTRACT

BACKGROUND: Cardiovascular diseases are the primary cause of death worldwide; the identification of genes specifically expressed in the heart is thus of major biomedical interest. We carried out a comprehensive analysis of gene-expression profiles using expressed sequence tags (ESTs) to identify genes overexpressed in the human adult heart. The initial set of genes expressed in the heart was constructed by clustering and assembling ESTs from heart cDNA libraries. Expression profiles were then generated for each gene by counting their cognate ESTs in all libraries. Differential expression was assessed by applying a previously published statistical procedure to these profiles. RESULTS: We identified 35 cardiac-specific genes overexpressed in the heart, some of which displayed significant coexpression. Some genes had no previously recognized cardiac function. Of the 35 genes, 32 were mapped back onto the human genome sequence. According to Online Mendelian Inheritance in Man (OMIM), five genes were previously known as heart-disease genes and one gene was located in the locus of a bleeding disorder. Analysis of the promoter regions of this collection of genes provides the first list of putative regulatory elements associated with differential cardiac expression. CONCLUSION: This study shows that ESTs are still a powerful tool to identify differentially expressed genes. We present a list of genes specifically expressed in the human heart, one of which is a candidate for a bleeding disorder. In addition, we provide the first set of putative regulatory elements, the combination of which appears correlated with heart-specific gene expression.
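
The profile-building step described above (counting each gene's cognate ESTs in every library) can be sketched as follows. This is a minimal illustration, assuming each EST has already been assigned to a gene and a source library; the function name and the sample gene and library labels are hypothetical, and the statistical test for differential expression is not reproduced because the abstract does not specify it.

```python
from collections import Counter, defaultdict


def expression_profiles(est_assignments):
    """Build per-gene EST count profiles.

    est_assignments: iterable of (gene, library) pairs, one pair per EST.
    Returns a mapping gene -> Counter of library -> EST count.
    """
    profiles = defaultdict(Counter)
    for gene, library in est_assignments:
        profiles[gene][library] += 1
    return profiles


# Hypothetical EST assignments
ests = [
    ("geneA", "heart"),
    ("geneA", "heart"),
    ("geneA", "liver"),
    ("geneB", "brain"),
]
profiles = expression_profiles(ests)
print(dict(profiles["geneA"]))  # counts of geneA ESTs per library
```

A gene whose counts are concentrated in heart libraries relative to all others would then be a candidate for the differential-expression test.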


Subjects
Expressed Sequence Tags , Gene Library , Myocardium/metabolism , Adult , Chromosome Mapping/methods , Cluster Analysis , Contig Mapping/methods , Contig Mapping/statistics & numerical data , Genetic Databases , Gene Expression Profiling/methods , Gene Expression Profiling/statistics & numerical data , Gene Expression Regulation/genetics , Gene Expression Regulation/physiology , Heart Diseases/genetics , Humans , Vascular Smooth Muscle/chemistry , Vascular Smooth Muscle/metabolism , Myocardium/chemistry , Organ Specificity/genetics , Genetic Promoter Regions/genetics
20.
Genes Chromosomes Cancer ; 32(2): 144-54, 2001 Oct.
Article in English | MEDLINE | ID: mdl-11550282

ABSTRACT

The Philadelphia translocation, t(9;22)(q34;q11), is the microscopically visible product of recombination between two genes, ABL1 on chromosome 9 and BCR on chromosome 22, and gives rise to a functional hybrid BCR-ABL1 gene with demonstrated leukemogenic properties. Breakpoints in BCR occur mostly within one of two regions: a 5 kb major breakpoint cluster region (M-Bcr) and a larger 35 kb minor breakpoint cluster region (m-Bcr) towards the 3' end of the first BCR intron. By contrast, breakpoints in ABL1 are reported to occur more widely across a >200 kb region which spans the large first and second introns. The mechanisms that determine preferential breakage sites in BCR, and which cause recombination between BCR and ABL1, are presently unknown. In some cases, Alu repeats have been identified at or near sequenced breakpoint sites in both genes, providing indications, albeit controversial, that they may be relevant. For the present study, we carried out a detailed analysis of genomic BCR and ABL1 sequences to identify, classify, and locate interspersed repeat sequences and to relate their distribution to precisely mapped BCR-ABL1 recombination sites. Our findings confirm that Alu elements are the most abundant class of repeat in both genes, but that they occupy fewer sites than previously estimated and that they are distributed nonrandomly. r-Scan statistics were applied to provide a measure of repeat distribution and to evaluate extremes in repeat spacing. A significant lack of Alu elements was observed across the major and minor breakpoint cluster regions of BCR and across a 25 kb region showing a high frequency of breakage in ABL1. These findings counter the suggestion that occurrence of Alu at BCR-ABL1 recombination sites is likely by chance because of the high density of Alu in these two genes. Instead, as yet unidentified DNA conformation or nucleotide characteristics peculiar to the preferentially recombining regions, including those Alu elements present within them, more likely influence their fragility.
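
The r-scan idea mentioned in the abstract can be sketched briefly. For ordered marker positions, an r-scan is the distance spanned by r consecutive spacings, that is, the gap between a marker and the r-th marker after it. Unusually small r-scans point to clustering, and unusually large ones to sparse regions. The sketch below only computes the scan lengths. The positions are hypothetical, and the significance thresholds and the value of r used in the study are not given in the abstract, so they are not reproduced here.

```python
def r_scans(positions, r):
    """Return the r-scan lengths for a set of marker positions.

    positions: iterable of numeric coordinates (e.g. repeat start sites).
    r: number of consecutive spacings summed in each scan.
    """
    xs = sorted(positions)
    if r < 1 or r >= len(xs):
        raise ValueError("r must satisfy 1 <= r < number of positions")
    # Sum of r consecutive spacings equals xs[i + r] - xs[i].
    return [xs[i + r] - xs[i] for i in range(len(xs) - r)]


# Hypothetical repeat positions along a gene (in bp)
alu_positions = [100, 400, 450, 500, 3000, 3100]
scans = r_scans(alu_positions, 2)
print(scans)                   # [350, 100, 2550, 2600]
print(min(scans), max(scans))  # 100 2600
```

In this toy example the largest scan flags the long stretch without repeats between positions 500 and 3000, the kind of extreme in repeat spacing that the abstract refers to.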


Subjects
Alu Elements/genetics , Chromosome Breakage/genetics , abl Genes/genetics , Interspersed Repetitive Sequences/genetics , Oncogene Proteins/genetics , Protein-Tyrosine Kinases , Proto-Oncogene Proteins , Contig Mapping/methods , Contig Mapping/statistics & numerical data , Factual Databases , Humans , Proto-Oncogene Proteins c-bcr , Statistical Distributions