1.
PLoS Genet ; 15(2): e1007858, 2019 Feb.
Article in English | MEDLINE | ID: mdl-30735495

ABSTRACT

Complex chromosomal rearrangements (CCRs) are rearrangements involving more than two chromosomes or more than two breakpoints. Whole genome sequencing (WGS) allows outstandingly high-resolution, nucleotide-level characterization of such rearrangements in unique sequences, but problems remain for mapping breakpoints in repetitive regions of the genome, which are known to be prone to rearrangements. Hence, multiple complementary WGS experiments are sometimes needed to solve the structures of CCRs. We have studied three individuals with CCRs: Case 1 and Case 2 presented with de novo karyotypically balanced, complex interchromosomal rearrangements (46,XX,t(2;8;15)(q35;q24.1;q22) and 46,XY,t(1;10;5)(q32;p12;q31)), and Case 3 presented with a de novo, extremely complex intrachromosomal rearrangement on chromosome 1. Molecular cytogenetic investigation revealed cryptic deletions in the breakpoints of chromosome 2 and 8 in Case 1, and on chromosome 10 in Case 2, explaining their clinical symptoms. In Case 3, 26 breakpoints were identified using WGS, disrupting five known disease genes. All rearrangements were subsequently analyzed using optical maps, linked-read WGS, and short-read WGS. In conclusion, we present a case series of three unique de novo CCRs in which, by combining the results from the different technologies, we fully solved the structure of each rearrangement. This demonstrates the power of combining short-read WGS with long-molecule sequencing or optical mapping for these unique de novo CCRs in a clinical setting.


Subjects
Chromosomes/genetics , Gene Rearrangement/genetics , Genomic Structural Variation/genetics , Chromosome Mapping/methods , Female , Humans , Male , Whole Genome Sequencing/methods
2.
PLoS Genet ; 14(11): e1007780, 2018 Nov.
Article in English | MEDLINE | ID: mdl-30419018

ABSTRACT

Clustered copy number variants (CNVs) as detected by chromosomal microarray analysis (CMA) are often reported as germline chromothripsis. However, such cases might need further investigations by massive parallel whole genome sequencing (WGS) in order to accurately define the underlying complex rearrangement, predict the occurrence mechanisms and identify additional complexities. Here, we utilized WGS to delineate the rearrangement structure of 21 clustered CNV carriers first investigated by CMA and identified a total of 83 breakpoint junctions (BPJs). The rearrangements were further sub-classified depending on the patterns observed: I) Cases with only deletions (n = 8) often had additional structural rearrangements, such as insertions and inversions typical to chromothripsis; II) cases with only duplications (n = 7) or III) combinations of deletions and duplications (n = 6) demonstrated mostly interspersed duplications and BPJs enriched with microhomology. In two cases the rearrangement mutational signatures indicated both a breakage-fusion-bridge cycle process and haltered formation of a ring chromosome. Finally, we observed two cases with Alu- and LINE-mediated rearrangements as well as two unrelated individuals with seemingly identical clustered CNVs on 2p25.3, possibly a rare European founder rearrangement. In conclusion, through detailed characterization of the derivative chromosomes we show that multiple mechanisms are likely involved in the formation of clustered CNVs and add further evidence for chromoanagenesis mechanisms in both "simple" and highly complex chromosomal rearrangements. Finally, WGS characterization adds positional information, important for a correct clinical interpretation and deciphering mechanisms involved in the formation of these rearrangements.


Subjects
DNA Copy Number Variations , DNA Replication/genetics , Alu Elements , Chromosome Breakpoints , Chromothripsis , Gene Rearrangement , Human Genome , Humans , Long Interspersed Nucleotide Elements , Oligonucleotide Array Sequence Analysis , Whole Genome Sequencing
3.
Nature ; 497(7451): 579-84, 2013 May 30.
Article in English | MEDLINE | ID: mdl-23698360

ABSTRACT

Conifers have dominated forests for more than 200 million years and are of huge ecological and economic importance. Here we present the draft assembly of the 20-gigabase genome of Norway spruce (Picea abies), the first available for any gymnosperm. The number of well-supported genes (28,354) is similar to the >100 times smaller genome of Arabidopsis thaliana, and there is no evidence of a recent whole-genome duplication in the gymnosperm lineage. Instead, the large genome size seems to result from the slow and steady accumulation of a diverse set of long-terminal repeat transposable elements, possibly owing to the lack of an efficient elimination mechanism. Comparative sequencing of Pinus sylvestris, Abies sibirica, Juniperus communis, Taxus baccata and Gnetum gnemon reveals that the transposable element diversity is shared among extant conifers. Expression of 24-nucleotide small RNAs, previously implicated in transposable element silencing, is tissue-specific and much lower than in other plants. We further identify numerous long (>10,000 base pairs) introns, gene-like fragments, uncharacterized long non-coding RNAs and short RNAs. This opens up new genomic avenues for conifer forestry and breeding.


Subjects
Molecular Evolution , Plant Genome/genetics , Picea/genetics , Conserved Sequence/genetics , DNA Transposable Elements/genetics , Gene Silencing , Plant Genes/genetics , Genomics , Internet , Introns/genetics , Phenotype , Untranslated RNA/genetics , DNA Sequence Analysis , Terminal Repeat Sequences/genetics , Genetic Transcription/genetics
4.
Hum Mutat ; 38(2): 180-192, 2017 Feb.
Article in English | MEDLINE | ID: mdl-27862604

ABSTRACT

Most balanced translocations are thought to result mechanistically from nonhomologous end joining or, in rare cases of recurrent events, by nonallelic homologous recombination. Here, we use low-coverage mate pair whole-genome sequencing to fine map rearrangement breakpoint junctions in both phenotypically normal and affected translocation carriers. In total, 46 junctions from 22 carriers of balanced translocations were characterized. Genes were disrupted in 48% of the breakpoints; recessive genes in four normal carriers and known dominant intellectual disability genes in three affected carriers. Finally, seven candidate disease genes were disrupted in five carriers with neurocognitive disabilities (SVOPL, SUSD1, TOX, NCALD, SLC4A10) and one XX-male carrier with Tourette syndrome (LYPD6, GPC5). Breakpoint junction analyses revealed microhomology and small templated insertions in a substantive fraction of the analyzed translocations (17.4%; n = 4); an observation that was substantiated by reanalysis of 37 previously published translocation junctions. Microhomology associated with templated insertions is a characteristic seen in the breakpoint junctions of rearrangements mediated by error-prone replication-based repair mechanisms. Our data implicate that a mechanism involving template switching might contribute to the formation of at least 15% of the interchromosomal translocation events.


Subjects
Chromosome Mapping , Genetic Translocation , Whole Genome Sequencing , Base Sequence , Chromosome Breakage , Comparative Genomic Hybridization , DNA Copy Number Variations , Female , Genetic Association Studies , Genomics/methods , Genotype , Homologous Recombination , Humans , Fluorescence In Situ Hybridization , Karyotype , Male , Phenotype
5.
Arch Toxicol ; 91(5): 2067-2078, 2017 May.
Article in English | MEDLINE | ID: mdl-27838757

ABSTRACT

Arsenic, a carcinogen with immunotoxic effects, is a common contaminant of drinking water and certain food worldwide. We hypothesized that chronic arsenic exposure alters gene expression, potentially by altering DNA methylation of genes encoding central components of the immune system. We therefore analyzed the transcriptomes (by RNA sequencing) and methylomes (by target-enrichment next-generation sequencing) of primary CD4-positive T cells from matched groups of four women each in the Argentinean Andes, with fivefold differences in urinary arsenic concentrations (median concentrations of urinary arsenic in the lower- and high-arsenic groups: 65 and 276 µg/l, respectively). Arsenic exposure was associated with genome-wide alterations of gene expression; principal component analysis indicated that the exposure explained 53% of the variance in gene expression among the top variable genes and 19% of 28,351 genes were differentially expressed (false discovery rate <0.05) between the exposure groups. Key genes regulating the immune system, such as tumor necrosis factor alpha and interferon gamma, as well as genes related to the NF-kappa-beta complex, were significantly downregulated in the high-arsenic group. Arsenic exposure was associated with genome-wide DNA methylation; the high-arsenic group had 3% points higher genome-wide full methylation (>80% methylation) than the lower-arsenic group. Differentially methylated regions that were hyper-methylated in the high-arsenic group showed enrichment for immune-related gene ontologies that constitute the basic functions of CD4-positive T cells, such as isotype switching and lymphocyte activation and differentiation. In conclusion, chronic arsenic exposure from drinking water was related to changes in the transcriptome and methylome of CD4-positive T cells, both genome wide and in specific genes, supporting the hypothesis that arsenic causes immunotoxicity by interfering with gene expression and regulation.


Subjects
Arsenic/toxicity , CD4-Positive T-Lymphocytes/drug effects , DNA Methylation/drug effects , Environmental Exposure/adverse effects , Gene Expression Regulation/drug effects , Adult , Argentina , CD4-Positive T-Lymphocytes/physiology , CpG Islands , Female , Gene Expression Profiling , High-Throughput Nucleotide Sequencing , Humans , Middle Aged , Genetic Promoter Regions
6.
BMC Bioinformatics ; 17 Suppl 4: 69, 2016 Mar 02.
Article in English | MEDLINE | ID: mdl-26961371

ABSTRACT

BACKGROUND: Bisulfite treatment of DNA followed by sequencing (BS-seq) has become a standard technique in epigenetic studies, providing researchers with tools for generating single-base resolution maps of whole methylomes. Aligning bisulfite-treated reads, however, is a computationally difficult task: bisulfite treatment decreases the (lexical) complexity of low-methylated genomic regions, and C-to-T mismatches may reflect cytosine unmethylation rather than SNPs or sequencing errors. Further challenges arise both during and after the alignment phase: data structures used by the aligner should be fast and should fit into main memory, and the methylation-caller output should be somehow compressed, due to its significant size. METHODS: As far as data structures employed to align bisulfite-treated reads are concerned, solutions proposed in the literature can be roughly grouped into two main categories: those storing pointers at each text position (e.g. hash tables, suffix trees/arrays), and those using the information-theoretic minimum number of bits (e.g. FM indexes and compressed suffix arrays). The former are fast but memory-consuming. The latter are much slower but light. In this paper, we try to close this gap by proposing a data structure for aligning bisulfite-treated reads which is at the same time fast, light, and very accurate. We reach this objective by combining a recent theoretical result on succinct hashing with a bisulfite-aware hash function. Furthermore, the new versions of the tools implementing our ideas (the aligner ERNE-BS5 2 and the caller ERNE-METH 2) have been extended with increased downstream compatibility (EPP/Bismark cov output formats), output compression, and support for target enrichment protocols.
RESULTS: Experimental results on public and simulated WGBS libraries show that our algorithmic solution is a competitive tradeoff between hash-based and BWT-based indexes, being as fast and accurate as the former, and as memory-efficient as the latter. CONCLUSIONS: The new functionalities of our bisulfite aligner and caller make it a fast and memory efficient tool, useful to analyze big datasets with little computational resources, to easily process target enrichment data, and produce statistics such as protocol efficiency and coverage as a function of the distance from target regions.
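
Illustrative note (not part of the abstract): the idea of a bisulfite-aware hash function can be sketched as a k-mer hash that treats C and T as the same symbol, so that a read with converted cytosines still hashes to the same value as the unconverted reference k-mer. The sketch below is a toy under stated assumptions; it is not the succinct hashing scheme or the hash function used by ERNE-BS5 2. All function names are hypothetical, only the forward strand is handled, and sequences are assumed to contain only A, C, G and T.

```python
def bisulfite_collapse(seq: str) -> str:
    """Map every C to T so converted and unconverted cytosines look alike."""
    return seq.upper().replace("C", "T")


def bisulfite_aware_hash(kmer: str) -> int:
    """Encode the collapsed k-mer as a base-3 number over the alphabet A, G, T."""
    code = {"A": 0, "G": 1, "T": 2}
    value = 0
    for base in bisulfite_collapse(kmer):
        value = value * 3 + code[base]
    return value


def build_index(reference: str, k: int) -> dict[int, list[int]]:
    """Store, for each hash value, the reference positions of matching k-mers."""
    index: dict[int, list[int]] = {}
    for pos in range(len(reference) - k + 1):
        key = bisulfite_aware_hash(reference[pos:pos + k])
        index.setdefault(key, []).append(pos)
    return index


def candidate_positions(read: str, index: dict[int, list[int]], k: int) -> list[int]:
    """Look up the first k bases of a read; hits are candidates to verify later."""
    return index.get(bisulfite_aware_hash(read[:k]), [])


if __name__ == "__main__":
    reference = "ACGTTCGA"
    index = build_index(reference, k=4)
    # "TGTT" is the reference 4-mer "CGTT" after conversion of its cytosine.
    print(candidate_positions("TGTT", index, k=4))
```

Collapsing C to T makes the lookup tolerant of conversion but also less specific, so candidate hits would still need a verification step that distinguishes genuine C-to-T conversions from other mismatches.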


Subjects
DNA Methylation , DNA/chemistry , Epigenomics , DNA Sequence Analysis/methods , Software , Sulfites/chemistry , CpG Islands , Data Compression , Human Genome , Genomics/methods , High-Throughput Nucleotide Sequencing , Humans
7.
BMC Evol Biol ; 16: 59, 2016 Mar 08.
Article in English | MEDLINE | ID: mdl-26956800

ABSTRACT

BACKGROUND: Although most insect species are specialized on one or few groups of plants, there are phytophagous insects that seem to use virtually any kind of plant as food. Understanding the nature of this ability to feed on a wide repertoire of plants is crucial for the control of pest species and for the elucidation of the macroevolutionary mechanisms of speciation and diversification of insect herbivores. Here we studied Vanessa cardui, the species with the widest diet breadth among butterflies and a potential insect pest, by comparing tissue-specific transcriptomes from caterpillars that were reared on different host plants. We tested whether the similarities of gene-expression response reflect the evolutionary history of adaptation to these plants in the Vanessa and related genera, against the null hypothesis of transcriptional profiles reflecting plant phylogenetic relatedness. RESULT: Using both unsupervised and supervised methods of data analysis, we found that the tissue-specific patterns of caterpillar gene expression are better explained by the evolutionary history of adaptation of the insects to the plants than by plant phylogeny. CONCLUSION: Our findings suggest that V. cardui may use two sets of expressed genes to achieve polyphagy, one associated with the ancestral capability to consume Rosids and Asterids, and another allowing the caterpillar to incorporate a wide range of novel host-plants.


Subjects
Biological Evolution , Butterflies/genetics , Animals , Butterflies/growth & development , Butterflies/physiology , Herbivory , Larva/physiology , Magnoliopsida/genetics , Magnoliopsida/physiology , Oviposition , Phylogeny , Transcriptome
8.
Eur Respir J ; 47(3): 898-909, 2016 Mar.
Article in English | MEDLINE | ID: mdl-26585430

ABSTRACT

In pulmonary sarcoidosis, CD4(+) T-cells expressing T-cell receptor Vα2.3 accumulate in the lungs of HLA-DRB1*03(+) patients. To investigate T-cell receptor-HLA-DRB1*03 interactions underlying recognition of hitherto unknown antigens, we performed detailed analyses of T-cell receptor expression on bronchoalveolar lavage fluid CD4(+) T-cells from sarcoidosis patients. Pulmonary sarcoidosis patients (n=43) underwent bronchoscopy with bronchoalveolar lavage. T-cell receptor α and β chains of CD4(+) T-cells were analysed by flow cytometry, DNA-sequenced, and three-dimensional molecular models of T-cell receptor-HLA-DRB1*03 complexes generated. Simultaneous expression of Vα2.3 with the Vβ22 chain was identified in the lungs of all HLA-DRB1*03(+) patients. Accumulated Vα2.3/Vβ22-expressing T-cells were highly clonal, with identical or near-identical Vα2.3 chain sequences and inter-patient similarities in Vβ22 chain amino acid distribution. Molecular modelling revealed specific T-cell receptor-HLA-DRB1*03-peptide interactions, with a previously identified, sarcoidosis-associated vimentin peptide, (Vim)429-443 DSLPLVDTHSKRTLL, matching both the HLA peptide-binding cleft and distinct T-cell receptor features perfectly. We demonstrate, for the first time, the accumulation of large clonal populations of specific Vα2.3/Vβ22 T-cell receptor-expressing CD4(+) T-cells in the lungs of HLA-DRB1*03(+) sarcoidosis patients. Several distinct contact points between Vα2.3/Vβ22 receptors and HLA-DRB1*03 molecules suggest presentation of prototypic vimentin-derived peptides.


Subjects
CD4-Positive T-Lymphocytes/immunology , HLA-DRB1 Chains/metabolism , T-Cell Antigen Receptors/immunology , Pulmonary Sarcoidosis/immunology , Adult , Bronchoalveolar Lavage Fluid , Bronchoscopy , Female , Flow Cytometry , Humans , Lung/immunology , Male , Middle Aged , Molecular Models , Sweden
9.
J Med Genet ; 52(2): 111-22, 2015 Feb.
Article in English | MEDLINE | ID: mdl-25473103

ABSTRACT

BACKGROUND: Cytogenetically visible chromosomal translocations are highly informative as they can pinpoint strong effect genes even in complex genetic disorders. METHODS AND RESULTS: Here, we report a mother and daughter, both with borderline intelligence and learning problems within the dyslexia spectrum, and two apparently balanced reciprocal translocations: t(1;8)(p22;q24) and t(5;18)(p15;q11). By low coverage mate-pair whole-genome sequencing, we were able to pinpoint the genomic breakpoints to 2 kb intervals. By direct sequencing, we then located the chromosome 5p breakpoint to intron 9 of CTNND2. An additional case with a 163 kb microdeletion exclusively involving CTNND2 was identified with genome-wide array comparative genomic hybridisation. This microdeletion at 5p15.2 is also present in mosaic state in the patient's mother but absent from the healthy siblings. We then investigated the effect of CTNND2 polymorphisms on normal variability and identified a polymorphism (rs2561622) with significant effect on phonological ability and white matter volume in the left frontal lobe, close to cortical regions previously associated with phonological processing. Finally, given the potential role of CTNND2 in neuron motility, we used morpholino knockdown in zebrafish embryos to assess its effects on neuronal migration in vivo. Analysis of the zebrafish forebrain revealed a subpopulation of neurons misplaced between the diencephalon and telencephalon. CONCLUSIONS: Taken together, our human genetic and in vivo data suggest that defective migration of subpopulations of neuronal cells due to haploinsufficiency of CTNND2 contributes to the cognitive dysfunction in our patients.


Subjects
Catenins/genetics , Genetic Association Studies , Genetic Predisposition to Disease , Intellectual Disability/genetics , Reading , Adolescent , Adult , Base Sequence , Child , Chromosome Breakpoints , Cognition , Exons/genetics , Female , Genetic Loci , Green Fluorescent Proteins/metabolism , Humans , Introns/genetics , Male , Molecular Sequence Data , Mutation/genetics , Pedigree , Single Nucleotide Polymorphism/genetics , DNA Sequence Analysis , Genetic Translocation , White Matter/pathology , Young Adult , Zebrafish Proteins/genetics , Delta Catenin
10.
BMC Bioinformatics ; 15: 281, 2014 Aug 15.
Article in English | MEDLINE | ID: mdl-25128196

ABSTRACT

BACKGROUND: The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, thus making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed but many of them use the number of read pairs supporting a linking of two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features. We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a software's general performance. RESULTS: We propose a new algorithm, implemented in a tool called BESST, which can scaffold genomes of all sizes and complexities and was used to scaffold the genome of P. abies (20 Gbp). We performed a comprehensive comparison of BESST against the most popular stand-alone scaffolders on a large variety of datasets. Our results confirm that some of the popular scaffolders are not practical to run on complex datasets. Furthermore, no single stand-alone scaffolder outperforms the others on all datasets. However, BESST fares favorably to the other tested scaffolders on GAGE datasets and, moreover, outperforms the other methods when library insert size distribution is wide. CONCLUSION: We conclude from our results that information sources other than the quantity of links, as is commonly used, can provide useful information about genome structure when scaffolding.


Subjects
Algorithms , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , DNA Sequence Analysis/methods , Software , Gene Library , Humans , Reproducibility of Results
11.
BMC Genomics ; 15: 439, 2014 Jun 06.
Article in English | MEDLINE | ID: mdl-24906298

ABSTRACT

BACKGROUND: Sampling genomes with Fosmid vectors and sequencing of pooled Fosmid libraries on the Illumina platform for massive parallel sequencing is a novel and promising approach to optimizing the trade-off between sequencing costs and assembly quality. RESULTS: In order to sequence the genome of Norway spruce, which is of great size and complexity, we developed and applied a new technology based on the massive production, sequencing, and assembly of Fosmid pools (FP). The spruce chromosomes were sampled with ~40,000 bp Fosmid inserts to obtain around two-fold genome coverage, in parallel with traditional whole genome shotgun sequencing (WGS) of haploid and diploid genomes. Compared to the WGS results, the contiguity and quality of the FP assemblies were high, and they allowed us to fill WGS gaps resulting from repeats, low coverage, and allelic differences. The FP contig sets were further merged with WGS data using a novel software package GAM-NGS. CONCLUSIONS: By exploiting FP technology, the first published assembly of a conifer genome was sequenced entirely with massively parallel sequencing. Here we provide a comprehensive report on the different features of the approach and the optimization of the process. We have made public the input data (FASTQ format) for the set of pools used in this study: ftp://congenie.org/congenie/Nystedt_2013/Assembly/ProcessedData/FosmidPools/ (alternatively accessible via http://congenie.org/downloads). The software used for running the assembly process is available at http://research.scilifelab.se/andrej_alexeyenko/downloads/fpools/.


Subjects
Genetic Vectors , High-Throughput Nucleotide Sequencing/methods , Picea/genetics , Molecular Cloning , Plant Genome , High-Throughput Nucleotide Sequencing/economics , Software
12.
BMC Bioinformatics ; 14 Suppl 7: S6, 2013.
Article in English | MEDLINE | ID: mdl-23815503

ABSTRACT

BACKGROUND: In recent years more than 20 assemblers have been proposed to tackle the hard task of assembling NGS data. A common heuristic when assembling a genome is to use several assemblers and then select the best assembly according to some criteria. However, recent results clearly show that some assemblers lead to better statistics than others on specific regions but are outperformed on other regions or on different evaluation measures. To limit these problems we developed GAM-NGS (Genomic Assemblies Merger for Next Generation Sequencing), whose primary goal is to merge two or more assemblies in order to enhance contiguity and correctness of both. GAM-NGS does not rely on global alignment: regions of the two assemblies representing the same genomic locus (called blocks) are identified through reads' alignments and stored in a weighted graph. The merging phase is carried out with the help of this weighted graph that allows an optimal resolution of local problematic regions. RESULTS: GAM-NGS has been tested on six different datasets and compared to other assembly reconciliation tools. The availability of a reference sequence for three of them allowed us to show how GAM-NGS is a tool able to output an improved reliable set of sequences. GAM-NGS is also a very efficient tool able to merge assemblies using substantially less computational resources than comparable tools. In order to achieve such goals, GAM-NGS avoids global alignment between contigs, making its strategy unique among other assembly reconciliation tools. CONCLUSIONS: The difficulty to obtain correct and reliable assemblies using a single assembler is forcing the introduction of new algorithms able to enhance de novo assemblies. GAM-NGS is a tool able to merge two or more assemblies in order to improve contiguity and correctness. It can be used on all NGS-based assembly projects and it shows its full potential with multi-library Illumina-based projects. 
With more than 20 available assemblers it is hard to select the best tool. In this context we propose a tool that improves assemblies (and, as a by-product, perhaps even assemblers) by merging them and selecting the generating that is most likely to be correct.


Subjects
High-Throughput Nucleotide Sequencing , DNA Sequence Analysis/methods , Algorithms , Chromosomes/genetics , Bacterial Genome , Human Genome , Humans , Rhodobacter sphaeroides/genetics , Software , Staphylococcus aureus/genetics
13.
Bioinformatics ; 28(1): 123-4, 2012 Jan 01.
Article in English | MEDLINE | ID: mdl-22084252

ABSTRACT

SUMMARY: The advent of high-throughput sequencers (HTS) introduced the need of new tools in order to analyse the large amount of data that those machines are able to produce. The mandatory first step for a wide range of analyses is the alignment of the sequences against a reference genome. We present a major update to our rNA (randomized Numerical Aligner) tool. The main feature of rNA is the fact that it achieves an accuracy greater than the majority of other tools in a feasible amount of time. rNA executables and source codes are freely downloadable at http://iga-rna.sourceforge.net/. CONTACT: vezzi@appliedgenomics.org; delfabbro@appliedgenomics.org SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , DNA Sequence Analysis/methods , Software , Humans
14.
BMC Bioinformatics ; 13 Suppl 14: S8, 2012.
Article in English | MEDLINE | ID: mdl-23095524

ABSTRACT

BACKGROUND: Next Generation Sequencing technologies are able to provide high genome coverages at a relatively low cost. However, due to limited reads' length (from 30 bp up to 200 bp), specific bioinformatics problems have become even more difficult to solve. De novo assembly with short reads, for example, is more complicated at least for two reasons: first, the overall amount of "noisy" data to cope with increased and, second, as the reads' length decreases the number of unsolvable repeats grows. Our work's aim is to go at the root of the problem by providing a pre-processing tool capable to produce (in-silico) longer and highly accurate sequences from a collection of Next Generation Sequencing reads. RESULTS: In this paper a seed-and-extend local assembler is presented. The kernel algorithm is a loop that, starting from a read used as seed, keeps extending it using heuristics whose main goal is to produce a collection of error-free and longer sequences. In particular, GapFiller carefully detects reliable overlaps and operates clustering similar reads in order to reconstruct the missing part between the two ends of the same insert. Our tool's output has been validated on 24 experiments using both simulated and real paired reads datasets. The output sequences are declared correct when the seed-mate is found. In the experiments performed, GapFiller was able to extend high percentages of the processed seeds and find their mates, with a false positives rate that turned out to be nearly negligible. CONCLUSIONS: GapFiller, starting from a sufficiently high short reads coverage, is able to produce high coverages of accurate longer sequences (from 300 bp up to 3500 bp). The procedure to perform safe extensions, together with the mate-found check, turned out to be a powerful criterion to guarantee contigs' correctness. 
GapFiller has further potential, as it could be applied in a number of different scenarios, including the post-processing validation of insertions/deletions detection pipelines, pre-processing routines on datasets for de novo assembly pipelines, or in any hierarchical approach designed to assemble, analyse or validate pools of sequences.
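
Illustrative note (not part of the abstract): the seed-and-extend loop described above can be sketched as a greedy procedure that repeatedly appends the read with the longest exact suffix-prefix overlap to the growing sequence. This is a toy under stated assumptions and not GapFiller's algorithm: it uses exact matching only, has no read clustering, no error handling and no mate-found check, and the function names and parameters (`overlap_length`, `extend_seed`, `min_overlap`, `max_length`) are hypothetical.

```python
def overlap_length(contig: str, read: str, min_overlap: int) -> int:
    """Longest exact overlap between the contig's suffix and the read's prefix.

    The overlap is capped at len(read) - 1 so that an accepted read always
    adds at least one new base. Returns 0 when no overlap reaches min_overlap.
    """
    max_olen = min(len(contig), len(read) - 1)
    for olen in range(max_olen, min_overlap - 1, -1):
        if contig.endswith(read[:olen]):
            return olen
    return 0


def extend_seed(seed: str, reads: list[str], min_overlap: int = 3,
                max_length: int = 1000) -> str:
    """Greedily extend a seed read to the right using the remaining reads."""
    contig = seed
    unused = [read for read in reads if read != seed]
    while len(contig) < max_length:
        best_read, best_olen = None, 0
        for read in unused:
            olen = overlap_length(contig, read, min_overlap)
            if olen > best_olen:
                best_read, best_olen = read, olen
        if best_read is None:
            break  # no read overlaps well enough, so stop extending
        contig += best_read[best_olen:]
        unused.remove(best_read)
    return contig


if __name__ == "__main__":
    reads = ["ACGTTG", "TTGCAA", "GCAATC"]
    print(extend_seed("ACGTTG", reads, min_overlap=3))
```

A realistic extension step would also have to decide what to do when several reads disagree about the next bases, which is where the clustering and reliability checks mentioned in the abstract come in.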


Subjects
Algorithms , High-Throughput Nucleotide Sequencing , DNA Sequence Analysis/methods , Contig Mapping , Genome , Human Genome , Humans , Rhodobacter sphaeroides/genetics , Software , Staphylococcus aureus/genetics
15.
Clin Chim Acta ; 512: 40-48, 2021 Jan.
Article in English | MEDLINE | ID: mdl-33227269

ABSTRACT

The aim of this study was to evaluate the performance of a novel NGS-based assay to monitor mixed chimerism (MC) and compare its technical capacity to established techniques for chimerism analysis. Artificial and clinical samples with increasing amounts of patient DNA were compared using real-time PCR detection of indels and SNP, fragment analysis of short-tandem repeats (STR) and NGS analysis of indels. Real-time PCR displayed excellent sensitivity (>0.01%) but poor accuracy (>20 CV% at MC > 20%), while fragment analysis exhibited good accuracy (<5 CV% at MC > 20%) with limited sensitivity (>2.5%). In contrast, NGS chimerism demonstrated a sensitivity (>0.1%) equal to real-time PCR and an accuracy equal or better than STR analysis throughout an extensive range of mixed chimerism (0.1-100%). To evaluate performance of the separate techniques for chimerism determination, 75 retrospective patient monitoring samples (3-7 weeks post-HSCT) with low (<5%), intermediate (5-20%) or high mixed chimerism (>20%) were analyzed. The between-run precision for the NGS assay varied from 0.72% (>20% MC) to 7.38% (MC < 5%). In conclusion, NGS displayed a combination of high sensitivity with good accuracy in both artificial and clinical chimerism samples.


Subjects
Chimerism , Hematopoietic Stem Cell Transplantation , DNA , High-Throughput Nucleotide Sequencing , Humans , Microsatellite Repeats , Retrospective Studies , Transplantation Chimera
16.
Mol Ecol Resour ; 20(5): 1171-1181, 2020 Sep.
Article in English | MEDLINE | ID: mdl-30848092

ABSTRACT

The high-throughput capacities of the Illumina sequencing platforms and the possibility to label samples individually have encouraged wide use of sample multiplexing. However, this practice results in read misassignment (usually <1%) across samples sequenced on the same lane. Alarmingly high rates of read misassignment of up to 10% were reported for Illumina sequencing machines with exclusion amplification chemistry. This may make use of these platforms prohibitive, particularly in studies that rely on low-quantity and low-quality samples, such as historical and archaeological specimens. Here, we use barcodes, short sequences that are ligated to both ends of the DNA insert, to directly quantify the rate of index hopping in 100-year old museum-preserved gorilla (Gorilla beringei) samples. Correcting for multiple sources of noise, we identify on average 0.470% of reads containing a hopped index. We show that sample-specific quantity of misassigned reads depends on the number of reads that any given sample contributes to the total sequencing pool, so that samples with few sequenced reads receive the greatest proportion of misassigned reads. This particularly affects ancient DNA samples, as these frequently differ in their DNA quantity and endogenous content. Through simulations we show that even low rates of index hopping, as reported here, can lead to biases in ancient DNA studies when multiplexing samples with vastly different quantities of endogenous material.
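
Illustrative note (not part of the abstract): the dependence of misassignment on sample read count can be shown with simple arithmetic. The sketch below assumes a deliberately crude model that is not the authors' simulation: a fixed fraction of every sample's reads hops, and hopped reads are spread evenly over the other samples on the lane. The function name `misassigned_fraction` and the example read counts are hypothetical; only the 0.470% hopping rate is taken from the abstract.

```python
def misassigned_fraction(read_counts: list[int], hop_rate: float) -> list[float]:
    """Expected share of foreign reads in each sample under the toy model.

    Assumption: each sample loses hop_rate of its reads, and those reads are
    divided equally among the remaining samples.
    """
    n_samples = len(read_counts)
    if n_samples < 2:
        raise ValueError("index hopping needs at least two multiplexed samples")
    total = sum(read_counts)
    fractions = []
    for count in read_counts:
        incoming = hop_rate * (total - count) / (n_samples - 1)
        retained = (1.0 - hop_rate) * count
        observed = incoming + retained
        fractions.append(incoming / observed if observed > 0 else 0.0)
    return fractions


if __name__ == "__main__":
    # One deeply sequenced sample multiplexed with one very shallow sample.
    for share in misassigned_fraction([1_000_000, 1_000], hop_rate=0.0047):
        print(f"{share:.6f}")
```

Under this model two samples of equal depth each end up with a foreign-read share equal to the hopping rate, whereas a shallow sample pooled with a deep one can be dominated by hopped reads.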


Subjects
Ancient DNA , High-Throughput Nucleotide Sequencing , DNA Sequence Analysis , Animals , DNA , Taxonomic DNA Barcoding , Gorilla gorilla/genetics , High-Throughput Nucleotide Sequencing/methods , DNA Sequence Analysis/methods
17.
Genes (Basel) ; 9(10)2018 Oct 09.
Article in English | MEDLINE | ID: mdl-30304863

ABSTRACT

The current human reference sequence (GRCh38) is a foundation for large-scale sequencing projects. However, recent studies have suggested that GRCh38 may be incomplete and give a suboptimal representation of specific population groups. Here, we performed a de novo assembly of two Swedish genomes that revealed over 10 Mb of sequences absent from the human GRCh38 reference in each individual. Around 6 Mb of these novel sequences (NS) are shared with a Chinese personal genome. The NS are highly repetitive, have an elevated GC-content, and are primarily located in centromeric or telomeric regions. Up to 1 Mb of NS can be assigned to chromosome Y, and large segments are also missing from GRCh38 at chromosomes 14, 17, and 21. Inclusion of NS into the GRCh38 reference radically improves the alignment and variant calling from short-read whole-genome sequencing data at several genomic loci. A re-analysis of a Swedish population-scale sequencing project yields > 75,000 putative novel single nucleotide variants (SNVs) and removes > 10,000 false positive SNV calls per individual, some of which are located in protein coding regions. Our results highlight that the GRCh38 reference is not yet complete and demonstrate that personal genome assemblies from local populations can improve the analysis of short-read whole-genome sequencing data.

18.
PLoS One ; 13(3): e0193928, 2018.
Article in English | MEDLINE | ID: mdl-29529047

ABSTRACT

The detection of recurrent somatic chromosomal rearrangements is standard of care for most leukemia types. Even though karyotype analysis, a low-resolution genome-wide chromosome analysis, is still the gold standard, it often needs to be complemented with other methods to increase resolution. To evaluate the feasibility and applicability of mate pair whole genome sequencing (MP-WGS) to detect structural chromosomal rearrangements in the diagnostic setting, we sequenced ten bone marrow samples from leukemia patients with recurrent rearrangements. Samples were selected based on cytogenetic and FISH results at leukemia diagnosis to include common rearrangements of prognostic relevance. Using MP-WGS and in-house bioinformatic analysis, all sought rearrangements were successfully detected. In addition, unexpected complexity or additional, previously undetected rearrangements were unraveled in three samples. Finally, the MP-WGS analysis pinpointed the location of chromosome junctions at high resolution, and we were able to identify the exact exons involved in the resulting fusion genes in all samples and the specific junction at the nucleotide level in half of the samples. The results show that our approach combines the screening character of karyotype analysis with the specificity and resolution of cytogenetic and molecular methods. As a result of the straightforward analysis and high-resolution detection of clinically relevant rearrangements, we conclude that MP-WGS is a feasible method for routine leukemia diagnostics of structural chromosomal rearrangements.


Subjects
Chromosome Aberrations , Leukemia/genetics , Whole Genome Sequencing/methods , Bone Marrow , Computational Biology , Early Detection of Cancer , Exons , Feasibility Studies , Humans , Fluorescence In Situ Hybridization , Leukemia/pathology
19.
F1000Res ; 6: 664, 2017.
Article in English | MEDLINE | ID: mdl-28781756

ABSTRACT

Reliable detection of large structural variation (> 1000 bp) is important in both rare and common genetic disorders. Whole genome sequencing (WGS) is a technology that may be used to identify a large proportion of the genomic structural variants (SVs) in an individual in a single experiment. Even though SV callers have been extensively used in research to detect mutations, the potential usage of SV callers within routine clinical diagnostics is hindered by high computational costs, usage of non-standard output formats, and limited support for the various sequencing platforms and libraries. Another well-known but poorly addressed problem is the large number of benign variants and reference errors present in the human genome, which further complicate analysis. Here we present TIDDIT, a time-efficient variant caller that uses discordant read pairs as well as the depth of coverage and split reads to detect and classify a large spectrum of SVs. As part of the software suite, TIDDIT also includes a database functionality that enables filtering for rare variants and reduces the number of false positive calls and background noise. Benchmarked against five state-of-the-art SV callers, TIDDIT performs at an equal or superior level while using only 2 CPU hours per sample. Thanks to its speed, sensitivity, flexibility and ability to easily detect variants on a wide range of WGS library types, TIDDIT solves many of the problems that are currently hindering the utilization of WGS for SV calling in clinical settings.
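
The notion of a discordant read pair mentioned in this abstract can be sketched in a few lines. This is an illustrative simplification and not TIDDIT's actual implementation: it assumes a pair counts as discordant when its mates map to different chromosomes or lie further apart than the library's mean insert size plus a chosen number of standard deviations. The `ReadPair` type, the function name and the default threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadPair:
    """Mapping positions of the two mates of a read pair (hypothetical type)."""
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int


def is_discordant(pair, mean_insert, sd_insert, n_sd=3.0):
    """Flag a pair whose mates map to different chromosomes or too far apart."""
    if pair.chrom1 != pair.chrom2:
        # Mates on different chromosomes suggest an interchromosomal event.
        return True
    span = abs(pair.pos2 - pair.pos1)
    # Assumed cutoff: mean insert size plus n_sd standard deviations.
    return span > mean_insert + n_sd * sd_insert


pairs = [
    ReadPair("1", 100, "1", 450),    # ordinary pair
    ReadPair("1", 100, "1", 5_000),  # mates too far apart
    ReadPair("1", 100, "2", 200),    # mates on different chromosomes
]
for pair in pairs:
    print(pair, is_discordant(pair, mean_insert=350, sd_insert=50))
```

A real caller would additionally consider mate orientation, mapping quality, coverage and split reads before clustering such pairs into candidate variants.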

20.
Eur J Hum Genet ; 25(11): 1253-1260, 2017 11.
Article in English | MEDLINE | ID: mdl-28832569

ABSTRACT

Here we describe the SweGen data set, a comprehensive map of genetic variation in the Swedish population. These data represent a basic resource for clinical genetics laboratories as well as for sequencing-based association studies by providing information on genetic variant frequencies in a cohort that is well matched to national patient cohorts. To select samples for this study, we first examined the genetic structure of the Swedish population using high-density SNP-array data from a nationwide cohort of over 10 000 Swedish-born individuals included in the Swedish Twin Registry. A total of 1000 individuals, reflecting a cross-section of the population and capturing the main genetic structure, were selected for whole-genome sequencing. Analysis pipelines were developed for automated alignment, variant calling and quality control of the sequencing data. This resulted in a genome-wide collection of aggregated variant frequencies in the Swedish population that we have made available to the scientific community through the website https://swefreq.nbis.se. A total of 29.2 million single-nucleotide variants and 3.8 million indels were detected in the 1000 samples, with 9.9 million of these variants not present in current databases. Each sample contributed an average of 7199 individual-specific variants. In addition, an average of 8645 larger structural variants (SVs) were detected per individual, and we demonstrate that the population frequencies of these SVs can be used for efficient filtering analyses. Finally, our results show that the genetic diversity within Sweden is substantial compared with the diversity among continental European populations, underscoring the relevance of establishing a local reference data set.
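
Frequency-based filtering of the kind mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes variants are identified by simple string keys and that a lookup table of population frequencies is already loaded in memory. The function name, the variant keys, the frequencies and the 1% cutoff are hypothetical and do not reflect the SweGen data or its file formats.

```python
def filter_rare(variants, population_freq, max_freq=0.01):
    """Keep variants whose population frequency is at or below max_freq.

    Variants absent from the frequency table are treated as frequency 0.0,
    i.e. as not observed in the reference cohort (an assumption).
    """
    return [v for v in variants if population_freq.get(v, 0.0) <= max_freq]


# Hypothetical frequency table and per-patient variant list.
cohort_freq = {
    "chr1:10500:DEL": 0.42,    # common in the cohort, likely benign
    "chr2:88000:DUP": 0.004,   # rare
}
patient_variants = ["chr1:10500:DEL", "chr2:88000:DUP", "chr7:1200:INV"]

print(filter_rare(patient_variants, cohort_freq))
```

Removing variants that are common in a well-matched cohort shrinks the candidate list that has to be reviewed manually.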


Subjects
Human Genome , Single Nucleotide Polymorphism , Registries , Datasets as Topic , Genome-Wide Association Study , Humans , Sweden , Twins/genetics