Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 20 de 25
Filtrar
1.
Genome Res ; 33(7): 1162-1174, 2023 07.
Artigo em Inglês | MEDLINE | ID: mdl-37217253

RESUMO

Seed design is important for sequence similarity search applications such as read mapping and average nucleotide identity (ANI) estimation. Although k-mers and spaced k-mers are likely the most well-known and used seeds, sensitivity suffers at high error rates, particularly when indels are present. Recently, we developed a pseudorandom seeding construct, strobemers, which was empirically shown to have high sensitivity also at high indel rates. However, the study lacked a deeper understanding of why. In this study, we propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. Our discovered seed randomness-sensitivity relationship explains why some seeds perform better than others, and the relationship provides a framework for designing even more sensitive seeds. We also present three new strobemer seed constructs: mixedstrobes, altstrobes, and multistrobes. We use both simulated and biological data to show that our new seed constructs improve sequence-matching sensitivity to other strobemers. We show that the three new seed constructs are useful for read mapping and ANI estimation. For read mapping, we implement strobemers into minimap2 and observe 30% faster alignment time and 0.2% higher accuracy than using k-mers when mapping reads at high error rates. As for ANI estimation, we find that higher entropy seeds have a higher rank correlation between estimated and true ANI.


Assuntos
Algoritmos , Mutação INDEL , Alinhamento de Sequência , Entropia , Análise de Sequência de DNA , Software
2.
Genome Res ; 33(7): 1188-1197, 2023 07.
Artigo em Inglês | MEDLINE | ID: mdl-37399256

RESUMO

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps-fundamental bottlenecks to read mapping-for both the human and maize genomes with [Formula: see text] sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a [Formula: see text] speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a [Formula: see text] speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled from not only minimizer-space seeding but also a novel heuristic [Formula: see text] pseudochaining algorithm, which improves upon the long-standing [Formula: see text] bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Software , Humanos , Algoritmos , Análise de Sequência de DNA , Genoma Humano
3.
Bioinformatics ; 40(4)2024 Mar 29.
Artigo em Inglês | MEDLINE | ID: mdl-38579261

RESUMO

MOTIVATION: Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences leading to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080-94. https://doi.org/10.1101/gr.275648.121), has been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated the efficacy. RESULTS: In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign's accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes. AVAILABILITY AND IMPLEMENTATION: All methods and evaluation benchmarks are available in a public Github repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.

4.
Genome Res ; 31(11): 2080-2094, 2021 11.
Artigo em Inglês | MEDLINE | ID: mdl-34667119

RESUMO

k-mer-based methods are widely used in bioinformatics for various types of sequence comparisons. However, a single mutation will mutate k consecutive k-mers and make most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, for example, spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step, a pairing or grouping of k-mers is performed. Such techniques produce many redundant k-mer matches owing to the size of k Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. We use simulated data to show that strobemers provide more evenly distributed sequence matches and are less sensitive to different mutation rates than k-mers and spaced k-mers. Strobemers also produce higher match coverage across sequences. We further implement a proof-of-concept sequence-matching tool StrobeMap and use synthetic and biological Oxford Nanopore sequencing data to show the utility of using strobemers for sequence comparison in different contexts such as sequence clustering and alignment scenarios.


Assuntos
Algoritmos , Biologia Computacional , Biologia Computacional/métodos , Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNA/métodos , Software
5.
Bioinformatics ; 39(39 Suppl 1): i222-i231, 2023 06 30.
Artigo em Inglês | MEDLINE | ID: mdl-37387174

RESUMO

MOTIVATION: With advances in long-read transcriptome sequencing, we can now fully sequence transcripts, which greatly improves our ability to study transcription processes. A popular long-read transcriptome sequencing technique is Oxford Nanopore Technologies (ONT), which through its cost-effective sequencing and high throughput, has the potential to characterize the transcriptome in a cell. However, due to transcript variability and sequencing errors, long cDNA reads need substantial bioinformatic processing to produce a set of isoform predictions from the reads. Several genome and annotation-based methods exist to produce transcript predictions. However, such methods require high-quality genomes and annotations and are limited by the accuracy of long-read splice aligners. In addition, gene families with high heterogeneity may not be well represented by a reference genome and would benefit from reference-free analysis. Reference-free methods to predict transcripts from ONT, such as RATTLE, exist, but their sensitivity is not comparable to reference-based approaches. RESULTS: We present isONform, a high-sensitivity algorithm to construct isoforms from ONT cDNA sequencing data. The algorithm is based on iterative bubble popping on gene graphs built from fuzzy seeds from the reads. Using simulated, synthetic, and biological ONT cDNA data, we show that isONform has substantially higher sensitivity than RATTLE albeit with some loss in precision. On biological data, we show that isONform's predictions have substantially higher consistency with the annotation-based method StringTie2 compared with RATTLE. We believe isONform can be used both for isoform construction for organisms without well-annotated genomes and as an orthogonal method to verify predictions of reference-based methods. AVAILABILITY AND IMPLEMENTATION: https://github.com/aljpetri/isONform.


Assuntos
Nanoporos , DNA Complementar , Transcriptoma , Algoritmos , Biologia Computacional
6.
Bioinformatics ; 37(24): 4643-4651, 2021 12 11.
Artigo em Inglês | MEDLINE | ID: mdl-34302453

RESUMO

MOTIVATION: Long-read RNA sequencing technologies are establishing themselves as the primary techniques to detect novel isoforms, and many such analyses are dependent on read alignments. However, the error rate and sequencing length of the reads create new challenges for accurately aligning them, particularly around small exons. RESULTS: We present an alignment method uLTRA for long RNA sequencing reads based on a novel two-pass collinear chaining algorithm. We show that uLTRA produces higher accuracy over state-of-the-art aligners with substantially higher accuracy for small exons on simulated and synthetic data. On simulated data, uLTRA achieves an accuracy of about 60% for exons of length 10 nucleotides or smaller and close to 90% accuracy for exons of length between 11 and 20 nucleotides. On biological data where true read location is unknown, we show several examples where uLTRA aligns to known and novel isoforms containing small exons that are not detected with other aligners. While uLTRA obtains its accuracy using annotations, it can also be used as a wrapper around minimap2 to align reads outside annotated regions. AVAILABILITYAND IMPLEMENTATION: uLTRA is available at https://github.com/ksahlin/ultra. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Software , Análise de Sequência de DNA/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de RNA , Nucleotídeos
7.
BMC Genomics ; 20(1): 641, 2019 Aug 09.
Artigo em Inglês | MEDLINE | ID: mdl-31399045

RESUMO

BACKGROUND: Although the Y chromosome plays an important role in male sex determination and fertility, it is currently understudied due to its haploid and repetitive nature. Methods to isolate Y-specific contigs from a whole-genome assembly broadly fall into two categories. The first involves retrieving Y-contigs using proportion sharing with a female, but such a strategy is prone to false positives in the absence of a high-quality, complete female reference. A second strategy uses the ratio of depth of coverage from male and female reads to select Y-contigs, but such a method requires high-depth sequencing of a female and cannot utilize existing female references. RESULTS: We develop a k-mer based method called DiscoverY, which combines proportion sharing with female with depth of coverage from male reads to classify contigs as Y-chromosomal. We evaluate the performance of DiscoverY on human and gorilla genomes, across different sequencing platforms including Illumina, 10X, and PacBio. In the cases where the male and female data are of high quality, DiscoverY has a high precision and recall and outperforms existing methods. For cases when a high quality female reference is not available, we quantify the effect of using draft reference or even just raw sequencing reads from a female. CONCLUSION: DiscoverY is an effective method to isolate Y-specific contigs from a whole-genome assembly. However, regions homologous to the X chromosome remain difficult to detect.


Assuntos
Cromossomos Humanos Y/genética , Análise de Sequência de DNA/métodos , Feminino , Haploidia , Humanos , Masculino , Análise de Sequência de DNA/economia , Fatores de Tempo
8.
Nature ; 497(7451): 579-84, 2013 May 30.
Artigo em Inglês | MEDLINE | ID: mdl-23698360

RESUMO

Conifers have dominated forests for more than 200 million years and are of huge ecological and economic importance. Here we present the draft assembly of the 20-gigabase genome of Norway spruce (Picea abies), the first available for any gymnosperm. The number of well-supported genes (28,354) is similar to the >100 times smaller genome of Arabidopsis thaliana, and there is no evidence of a recent whole-genome duplication in the gymnosperm lineage. Instead, the large genome size seems to result from the slow and steady accumulation of a diverse set of long-terminal repeat transposable elements, possibly owing to the lack of an efficient elimination mechanism. Comparative sequencing of Pinus sylvestris, Abies sibirica, Juniperus communis, Taxus baccata and Gnetum gnemon reveals that the transposable element diversity is shared among extant conifers. Expression of 24-nucleotide small RNAs, previously implicated in transposable element silencing, is tissue-specific and much lower than in other plants. We further identify numerous long (>10,000 base pairs) introns, gene-like fragments, uncharacterized long non-coding RNAs and short RNAs. This opens up new genomic avenues for conifer forestry and breeding.


Assuntos
Evolução Molecular , Genoma de Planta/genética , Picea/genética , Sequência Conservada/genética , Elementos de DNA Transponíveis/genética , Inativação Gênica , Genes de Plantas/genética , Genômica , Internet , Íntrons/genética , Fenótipo , RNA não Traduzido/genética , Análise de Sequência de DNA , Sequências Repetidas Terminais/genética , Transcrição Gênica/genética
9.
Bioinformatics ; 32(13): 1925-32, 2016 07 01.
Artigo em Inglês | MEDLINE | ID: mdl-27153683

RESUMO

MOTIVATION: Scaffolding is often an essential step in a genome assembly process, in which contigs are ordered and oriented using read pairs from a combination of paired-end libraries and longer-range mate-pair libraries. Although a simple idea, scaffolding is unfortunately hard to get right in practice. One source of problems is so-called PE-contamination in mate-pair libraries, in which a non-negligible fraction of the read pairs get the wrong orientation and a much smaller insert size than what is expected. This contamination has been discussed before, in relation to integrated scaffolders, but solutions rely on the orientation being observable, e.g. by finding the junction adapter sequence in the reads. This is not always possible, making orientation and insert size of a read pair stochastic. To our knowledge, there is neither previous work on modeling PE-contamination, nor a study on the effect PE-contamination has on scaffolding quality. RESULTS: We have addressed PE-contamination in an update to our scaffolder BESST. We formulate the problem as an integer linear program which is solved using an efficient heuristic. The new method shows significant improvement over both integrated and stand-alone scaffolders in our experiments. The impact of modeling PE-contamination is quantified by comparing with the previous BESST model. We also show how other scaffolders are vulnerable to PE-contaminated libraries, resulting in an increased number of misassemblies, more conservative scaffolding and inflated assembly sizes. AVAILABILITY AND IMPLEMENTATION: The model is implemented in BESST. Source code and usage instructions are found at https://github.com/ksahlin/BESST BESST can also be downloaded using PyPI. CONTACT: ksahlin@kth.se SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Assuntos
Biblioteca Gênica , Genômica/métodos , Análise de Sequência de DNA/métodos , Algoritmos , Bactérias/genética , Humanos , Programação Linear , Software
10.
BMC Bioinformatics ; 15: 281, 2014 Aug 15.
Artigo em Inglês | MEDLINE | ID: mdl-25128196

RESUMO

BACKGROUND: The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet, obtaining contiguous assemblies from short reads is challenging, thus making scaffolding an important step in the assembly pipeline. Different algorithms have been proposed but many of them use the number of read pairs supporting a linking of two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features.We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a software's general performance. RESULTS: We propose a new algorithm, implemented in a tool called BESST, which can scaffold genomes of all sizes and complexities and was used to scaffold the genome of P. abies (20 Gbp). We performed a comprehensive comparison of BESST against the most popular stand-alone scaffolders on a large variety of datasets. Our results confirm that some of the popular scaffolders are not practical to run on complex datasets. Furthermore, no single stand-alone scaffolder outperforms the others on all datasets. However, BESST fares favorably to the other tested scaffolders on GAGE datasets and, moreover, outperforms the other methods when library insert size distribution is wide. CONCLUSION: We conclude from our results that information sources other than the quantity of links, as is commonly used, can provide useful information about genome structure when scaffolding.


Assuntos
Algoritmos , Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Software , Biblioteca Gênica , Humanos , Reprodutibilidade dos Testes
11.
Bioinformatics ; 28(17): 2215-22, 2012 Sep 01.
Artigo em Inglês | MEDLINE | ID: mdl-22923455

RESUMO

MOTIVATION: One of the important steps of genome assembly is scaffolding, in which contigs are linked using information from read-pairs. Scaffolding provides estimates about the order, relative orientation and distance between contigs. We have found that contig distance estimates are generally strongly biased and based on false assumptions. Since erroneous distance estimates can mislead in subsequent analysis, it is important to provide unbiased estimation of contig distance. RESULTS: In this article, we show that state-of-the-art programs for scaffolding are using an incorrect model of gap size estimation. We discuss why current maximum likelihood estimators are biased and describe what different cases of bias we are facing. Furthermore, we provide a model for the distribution of reads that span a gap and derive the maximum likelihood equation for the gap length. We motivate why this estimate is sound and show empirically that it outperforms gap estimators in popular scaffolding programs. Our results have consequences both for scaffolding software, structural variation detection and for library insert-size estimation as is commonly performed by read aligners. AVAILABILITY: A reference implementation is provided at https://github.com/SciLifeLab/gapest. SUPPLEMENTARY INFORMATION: Supplementary data are availible at Bioinformatics online.


Assuntos
Algoritmos , Mapeamento de Sequências Contíguas/métodos , Genômica/métodos , Modelos Genéticos , Biblioteca Gênica , Funções Verossimilhança , Probabilidade , Análise de Regressão , Análise de Sequência de DNA/métodos , Software
12.
bioRxiv ; 2023 Mar 18.
Artigo em Inglês | MEDLINE | ID: mdl-36993458

RESUMO

Y-chromosomal Ampliconic Genes (YAGs) are important for male fertility, as they encode proteins functioning in spermatogenesis. The variation in copy number and expression levels of these multicopy gene families has been recently studied in great apes, however, the diversity of splicing variants remains unexplored. Here we deciphered the sequences of polyadenylated transcripts of all nine YAG families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, VCY, and XKRY) from testis samples of six great ape species (human, chimpanzee, bonobo, gorilla, Bornean orangutan, and Sumatran orangutan). To achieve this, we enriched YAG transcripts with capture-probe hybridization and sequenced them with long (Pacific Biosciences) reads. Our analysis of this dataset resulted in several findings. First, we uncovered a high diversity of YAG transcripts across great apes. Second, we observed evolutionarily conserved alternative splicing patterns for most YAG families except for BPY2 and PRY. Our results suggest that BPY2 transcripts and predicted proteins in several great ape species (bonobo and the two orangutans) have independent evolutionary origins and are not homologous to human reference transcripts and proteins. In contrast, our results suggest that the PRY gene family, having the highest representation of transcripts without open reading frames, has been undergoing pseudogenization. Third, even though we have identified many species-specific protein-coding YAG transcripts, we have not detected any signatures of positive selection. Overall, our work illuminates the YAG isoform landscape and its evolutionary history, and provides a genomic resource for future functional studies focusing on infertility phenotypes in humans and critically endangered great apes.

13.
Genome Biol Evol ; 15(11)2023 Nov 01.
Artigo em Inglês | MEDLINE | ID: mdl-37967251

RESUMO

Y chromosomal ampliconic genes (YAGs) are important for male fertility, as they encode proteins functioning in spermatogenesis. The variation in copy number and expression levels of these multicopy gene families has been studied in great apes; however, the diversity of splicing variants remains unexplored. Here, we deciphered the sequences of polyadenylated transcripts of all nine YAG families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, VCY, and XKRY) from testis samples of six great ape species (human, chimpanzee, bonobo, gorilla, Bornean orangutan, and Sumatran orangutan). To achieve this, we enriched YAG transcripts with capture probe hybridization and sequenced them with long (Pacific Biosciences) reads. Our analysis of this data set resulted in several findings. First, we observed evolutionarily conserved alternative splicing patterns for most YAG families except for BPY2 and PRY. Second, our results suggest that BPY2 transcripts and proteins originate from separate genomic regions in bonobo versus human, which is possibly facilitated by acquiring new promoters. Third, our analysis indicates that the PRY gene family, having the highest representation of noncoding transcripts, has been undergoing pseudogenization. Fourth, we have not detected signatures of selection in the five YAG families shared among great apes, even though we identified many species-specific protein-coding transcripts. Fifth, we predicted consensus disorder regions across most gene families and species, which could be used for future investigations of male infertility. Overall, our work illuminates the YAG isoform landscape and provides a genomic resource for future functional studies focusing on infertility phenotypes in humans and critically endangered great apes.


Assuntos
Hominidae , Pan paniscus , Animais , Masculino , Humanos , Pan paniscus/genética , Hominidae/genética , Cromossomo Y/genética , Pan troglodytes/genética , Isoformas de Proteínas/genética
14.
Genome Biol ; 24(1): 133, 2023 06 01.
Artigo em Inglês | MEDLINE | ID: mdl-37264447

RESUMO

It has been over a decade since the first publication of a method dedicated entirely to mapping long-reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings ( http://bcazaux.polytech-lille.net/Minimap2/ ).


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Software , Análise de Sequência de DNA/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Algoritmos
15.
Comput Struct Biotechnol J ; 21: 3656-3664, 2023.
Artigo em Inglês | MEDLINE | ID: mdl-37533804

RESUMO

The importance of gene amplifications in evolution is more and more recognized. Yet, tools to study multi-copy gene families are still scarce, and many such families are overlooked using common sequencing methods. Haplotype reconstruction is even harder for polymorphic multi-copy gene families. Here, we show that all variants (or haplotypes) of a multi-copy gene family present in a single genome, can be obtained using Oxford Nanopore Technologies sequencing of PCR products, followed by steps of mapping, SNP calling and haplotyping. As a proof of concept, we acquired the sequences of highly similar variants of the cidA and cidB genes present in the genome of the Wolbachia wPip, a bacterium infecting Culex pipiens mosquitoes. Our method relies on a wide database of cid genes, previously acquired by cloning and Sanger sequencing. We addressed problems commonly faced when using mapping approaches for multi-copy gene families with highly similar variants. In addition, we confirmed that PCR amplification causes frequent chimeras which have to be carefully considered when working on families of recombinant genes. We tested the robustness of the method using a combination of bioinformatics (read simulations) and molecular biology approaches (sequence acquisitions through cloning and Sanger sequencing, specific PCRs and digital droplet PCR). When different haplotypes present within a single genome cannot be reconstructed from short reads sequencing, this pipeline confers a high throughput acquisition, gives reliable results as well as insights of the relative copy numbers of the different variants.

16.
Genome Biol ; 23(1): 260, 2022 12 15.
Artigo em Inglês | MEDLINE | ID: mdl-36522758

RESUMO

Read alignment is often the computational bottleneck in analyses. Recently, several advances have been made on seeding methods for fast sequence comparison. We combine two such methods, syncmers and strobemers, in a novel seeding approach for constructing dynamic-sized fuzzy seeds and implement the method in a short-read aligner, strobealign. The seeding is fast to construct and effectively reduces repetitiveness in the seeding step, as shown using a novel metric E-hits. strobealign is several times faster than traditional aligners at similar and sometimes higher accuracy while being both faster and more accurate than more recently proposed aligners for short reads of lengths 150nt and longer. Availability: https://github.com/ksahlin/strobealign.


Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Software , Alinhamento de Sequência , Análise de Sequência de DNA/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Algoritmos , Sementes
17.
IEEE/ACM Trans Comput Biol Bioinform ; 19(6): 3673-3684, 2022.
Artigo em Inglês | MEDLINE | ID: mdl-34847041

RESUMO

A multi-assembly problem asks to reconstruct multiple genomic sequences from mixed reads sequenced from all of them. Standard formulations of such problems model a solution as a path cover in a directed acyclic graph, namely a set of paths that together cover all vertices of the graph. Since multi-assembly problems admit multiple solutions in practice, we consider an approach commonly used in standard genome assembly: output only partial solutions (contigs, or safe paths), that appear in all path cover solutions. We study constrained path covers, a restriction on the path cover solution that incorporate practical constraints arising in multi-assembly problems. We give efficient algorithms finding all maximal safe paths for constrained path covers. We compute the safe paths of splicing graphs constructed from transcript annotations of different species. Our algorithms run in less than 15 seconds per species and report RNA contigs that are over 99% precise and are up to 8 times longer than unitigs. Moreover, RNA contigs cover over 70% of the transcripts and their coding sequences in most cases. With their increased length to unitigs, high precision, and fast construction time, maximal safe paths can provide a better base set of sequences for transcript assembly programs.


Assuntos
Algoritmos , Genômica , Genoma , Sequência de Bases , RNA
18.
Nat Protoc ; 17(6): 1415-1443, 2022 06.
Artigo em Inglês | MEDLINE | ID: mdl-35411044

RESUMO

In many parts of the world, human-mediated environmental change is depleting biodiversity faster than it can be characterized, while invasive species cause agricultural damage, threaten human health and disrupt native habitats. Consequently, the application of effective approaches for rapid surveillance and identification of biological specimens is increasingly important to inform conservation and biosurveillance efforts. Taxonomic assignments have been greatly advanced using sequence-based applications, such as DNA barcoding, a diagnostic technique that utilizes PCR and DNA sequence analysis of standardized genetic regions. However, in many biodiversity hotspots, endeavors are often hindered by a lack of laboratory infrastructure, funding for biodiversity research and restrictions on the transport of biological samples. A promising development is the advent of low-cost, miniaturized scientific equipment. Such tools can be assembled into functional laboratories to carry out genetic analyses in situ, at local institutions, field stations or classrooms. Here, we outline the steps required to perform amplicon sequencing applications, from DNA isolation to nanopore sequencing and downstream data analysis, all of which can be conducted outside of a conventional laboratory environment using miniaturized scientific equipment, without reliance on Internet connectivity. Depending on sample type, the protocol (from DNA extraction to full bioinformatic analyses) can be completed within 10 h, and with appropriate quality controls can be used for diagnostic identification of samples independent of core genomic facilities that are required for alternative methods.


Assuntos
Código de Barras de DNA Taxonômico , Nanoporos , Biodiversidade , DNA/genética , Código de Barras de DNA Taxonômico/métodos , Humanos , Análise de Sequência de DNA/métodos
19.
Nat Commun ; 12(1): 2, 2021 01 04.
Artigo em Inglês | MEDLINE | ID: mdl-33397972

RESUMO

Oxford Nanopore (ONT) is a leading long-read technology which has been revolutionizing transcriptome analysis through its capacity to sequence the majority of transcripts from end-to-end. This has greatly increased our ability to study the diversity of transcription mechanisms such as transcription initiation, termination, and alternative splicing. However, ONT still suffers from high error rates which have thus far limited its scope to reference-based analyses. When a reference is not available or is not a viable option due to reference-bias, error correction is a crucial step towards the reconstruction of the sequenced transcripts and downstream sequence analysis of transcripts. In this paper, we present a novel computational method to error correct ONT cDNA sequencing data, called isONcorrect. IsONcorrect is able to jointly use all isoforms from a gene during error correction, thereby allowing it to correct reads at low sequencing depths. We are able to obtain a median accuracy of 98.9-99.6%, demonstrating the feasibility of applying cost-effective cDNA full transcript length sequencing for reference-free transcriptome analysis.


Assuntos
Perfilação da Expressão Gênica , Nanoporos , Nanotecnologia/métodos , Algoritmos , Alelos , Animais , Drosophila melanogaster/genética , Éxons/genética , Regulação da Expressão Gênica , Heurística , Polimorfismo de Nucleotídeo Único/genética , Isoformas de Proteínas/metabolismo , Sítios de Splice de RNA/genética , RNA Mensageiro/genética , RNA Mensageiro/metabolismo
20.
Ecol Evol ; 11(3): 1392-1398, 2021 Feb.
Artigo em Inglês | MEDLINE | ID: mdl-33598139

RESUMO

Third-generation sequencing technologies, such as Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio), have gained popularity over the last years. These platforms can generate millions of long-read sequences. This is not only advantageous for genome sequencing projects, but also advantageous for amplicon-based high-throughput sequencing experiments, such as DNA barcoding. However, the relatively high error rates associated with these technologies still pose challenges for generating high-quality consensus sequences. Here, we present NGSpeciesID, a program which can generate highly accurate consensus sequences from long-read amplicon sequencing technologies, including ONT and PacBio. The tool includes clustering of the reads to help filter out contaminants or reads with high error rates and employs polishing strategies specific to the appropriate sequencing platform. We show that NGSpeciesID produces consensus sequences with improved usability by minimizing preprocessing and software installation and scalability by enabling rapid processing of hundreds to thousands of samples, while maintaining similar consensus accuracy as current pipelines.

SELEÇÃO DE REFERÊNCIAS
Detalhe da pesquisa