Search | VHL Regional Portal

1.

Designing efficient randstrobes for sequence similarity analyses.

Karami, Moein; Soltani Mohammadi, Aryan; Martin, Marcel; Ekim, Baris; Shen, Wei; Guo, Lidong; Xu, Mengyang; Pibiri, Giulio Ermanno; Patro, Rob; Sahlin, Kristoffer.

Bioinformatics ; 40(4)2024 Mar 29.

Article in English | MEDLINE | ID: mdl-38579261

ABSTRACT

MOTIVATION: Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences, which has led to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080-94. https://doi.org/10.1101/gr.275648.121), has been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated the efficacy. RESULTS: In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign's accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes. AVAILABILITY AND IMPLEMENTATION: All methods and evaluation benchmarks are available in a public GitHub repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.

2.

Transcript Isoform Diversity of Ampliconic Genes on the Y Chromosome of Great Apes.

Tomaszkiewicz, Marta; Sahlin, Kristoffer; Medvedev, Paul; Makova, Kateryna D.

Genome Biol Evol ; 15(11)2023 Nov 01.

Article in English | MEDLINE | ID: mdl-37967251

ABSTRACT

Y chromosomal ampliconic genes (YAGs) are important for male fertility, as they encode proteins functioning in spermatogenesis. The variation in copy number and expression levels of these multicopy gene families has been studied in great apes; however, the diversity of splicing variants remains unexplored. Here, we deciphered the sequences of polyadenylated transcripts of all nine YAG families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, VCY, and XKRY) from testis samples of six great ape species (human, chimpanzee, bonobo, gorilla, Bornean orangutan, and Sumatran orangutan). To achieve this, we enriched YAG transcripts with capture probe hybridization and sequenced them with long (Pacific Biosciences) reads. Our analysis of this data set resulted in several findings. First, we observed evolutionarily conserved alternative splicing patterns for most YAG families except for BPY2 and PRY. Second, our results suggest that BPY2 transcripts and proteins originate from separate genomic regions in bonobo versus human, which is possibly facilitated by acquiring new promoters. Third, our analysis indicates that the PRY gene family, having the highest representation of noncoding transcripts, has been undergoing pseudogenization. Fourth, we have not detected signatures of selection in the five YAG families shared among great apes, even though we identified many species-specific protein-coding transcripts. Fifth, we predicted consensus disorder regions across most gene families and species, which could be used for future investigations of male infertility. Overall, our work illuminates the YAG isoform landscape and provides a genomic resource for future functional studies focusing on infertility phenotypes in humans and critically endangered great apes.

Subject(s)

Hominidae , Pan paniscus , Animals , Male , Humans , Pan paniscus/genetics , Hominidae/genetics , Y Chromosome/genetics , Pan troglodytes/genetics , Protein Isoforms/genetics

3.

Nanopore sequencing of PCR products enables multicopy gene family reconstruction.

Namias, Alice; Sahlin, Kristoffer; Makoundou, Patrick; Bonnici, Iago; Sicard, Mathieu; Belkhir, Khalid; Weill, Mylène.

Comput Struct Biotechnol J ; 21: 3656-3664, 2023.

Article in English | MEDLINE | ID: mdl-37533804

ABSTRACT

The importance of gene amplifications in evolution is increasingly recognized. Yet, tools to study multi-copy gene families are still scarce, and many such families are overlooked using common sequencing methods. Haplotype reconstruction is even harder for polymorphic multi-copy gene families. Here, we show that all variants (or haplotypes) of a multi-copy gene family present in a single genome can be obtained using Oxford Nanopore Technologies sequencing of PCR products, followed by steps of mapping, SNP calling and haplotyping. As a proof of concept, we acquired the sequences of highly similar variants of the cidA and cidB genes present in the genome of the Wolbachia wPip, a bacterium infecting Culex pipiens mosquitoes. Our method relies on a wide database of cid genes, previously acquired by cloning and Sanger sequencing. We addressed problems commonly faced when using mapping approaches for multi-copy gene families with highly similar variants. In addition, we confirmed that PCR amplification causes frequent chimeras, which have to be carefully considered when working on families of recombinant genes. We tested the robustness of the method using a combination of bioinformatics (read simulations) and molecular biology approaches (sequence acquisitions through cloning and Sanger sequencing, specific PCRs and digital droplet PCR). When different haplotypes present within a single genome cannot be reconstructed from short-read sequencing, this pipeline offers high-throughput acquisition and reliable results, as well as insights into the relative copy numbers of the different variants.

4.

Efficient mapping of accurate long reads in minimizer space with mapquik.

Ekim, Baris; Sahlin, Kristoffer; Medvedev, Paul; Berger, Bonnie; Chikhi, Rayan.

Genome Res ; 33(7): 1188-1197, 2023 07.

Article in English | MEDLINE | ID: mdl-37399256

ABSTRACT

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps (fundamental bottlenecks to read mapping) for both the human and maize genomes with [Formula: see text] sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a [Formula: see text] speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a [Formula: see text] speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic [Formula: see text] pseudochaining algorithm, which improves upon the long-standing [Formula: see text] bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
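
The k-min-mer idea described in this abstract can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the window-based minimizer selection, the BLAKE2 hash, and the hypothetical names `minimizers`, `k_min_mers`, `unique_index`, `l`, `w` and `k` are not taken from mapquik, whose actual sampling scheme and index layout may differ.

```python
import hashlib
from collections import Counter

def h(s: str) -> int:
    """Deterministic 64-bit hash of a string (illustrative choice)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

def minimizers(seq: str, l: int, w: int):
    """Window minimizers: in every window of w consecutive l-mers,
    keep the l-mer with the smallest hash (leftmost on ties)."""
    hashes = [h(seq[i:i + l]) for i in range(len(seq) - l + 1)]
    picked = []
    for start in range(len(hashes) - w + 1):
        pos = min(range(start, start + w), key=lambda p: (hashes[p], p))
        if not picked or picked[-1] != pos:
            picked.append(pos)  # selected positions never move left
    return [(p, seq[p:p + l]) for p in picked]

def k_min_mers(mins, k: int):
    """Each k-min-mer is a tuple of k consecutive minimizers,
    stored with the position of its first minimizer."""
    return [(mins[i][0], tuple(m for _, m in mins[i:i + k]))
            for i in range(len(mins) - k + 1)]

def unique_index(ref: str, l: int, w: int, k: int):
    """Index only the k-min-mers that occur exactly once in the reference."""
    kmms = k_min_mers(minimizers(ref, l, w), k)
    counts = Counter(key for _, key in kmms)
    return {key: pos for pos, key in kmms if counts[key] == 1}
```

A read would then be processed the same way, and each of its k-min-mers found in the index gives a candidate anchor position in the reference. Dropping repeated k-min-mers trades some sensitivity in repeats for unambiguous anchors, which is the trade-off the abstract describes.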

Subject(s)

High-Throughput Nucleotide Sequencing , Software , Humans , Algorithms , Sequence Analysis, DNA , Genome, Human

5.

A survey of mapping algorithms in the long-reads era.

Sahlin, Kristoffer; Baudeau, Thomas; Cazaux, Bastien; Marchet, Camille.

Genome Biol ; 24(1): 133, 2023 06 01.

Article in English | MEDLINE | ID: mdl-37264447

ABSTRACT

It has been over a decade since the first publication of a method dedicated entirely to mapping long-reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings ( http://bcazaux.polytech-lille.net/Minimap2/ ).

Subject(s)

High-Throughput Nucleotide Sequencing , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Algorithms

6.

isONform: reference-free transcriptome reconstruction from Oxford Nanopore data.

Petri, Alexander J; Sahlin, Kristoffer.

Bioinformatics ; 39(39 Suppl 1): i222-i231, 2023 06 30.

Article in English | MEDLINE | ID: mdl-37387174

ABSTRACT

MOTIVATION: With advances in long-read transcriptome sequencing, we can now fully sequence transcripts, which greatly improves our ability to study transcription processes. A popular long-read transcriptome sequencing technique is Oxford Nanopore Technologies (ONT), which through its cost-effective sequencing and high throughput, has the potential to characterize the transcriptome in a cell. However, due to transcript variability and sequencing errors, long cDNA reads need substantial bioinformatic processing to produce a set of isoform predictions from the reads. Several genome and annotation-based methods exist to produce transcript predictions. However, such methods require high-quality genomes and annotations and are limited by the accuracy of long-read splice aligners. In addition, gene families with high heterogeneity may not be well represented by a reference genome and would benefit from reference-free analysis. Reference-free methods to predict transcripts from ONT, such as RATTLE, exist, but their sensitivity is not comparable to reference-based approaches. RESULTS: We present isONform, a high-sensitivity algorithm to construct isoforms from ONT cDNA sequencing data. The algorithm is based on iterative bubble popping on gene graphs built from fuzzy seeds from the reads. Using simulated, synthetic, and biological ONT cDNA data, we show that isONform has substantially higher sensitivity than RATTLE albeit with some loss in precision. On biological data, we show that isONform's predictions have substantially higher consistency with the annotation-based method StringTie2 compared with RATTLE. We believe isONform can be used both for isoform construction for organisms without well-annotated genomes and as an orthogonal method to verify predictions of reference-based methods. AVAILABILITY AND IMPLEMENTATION: https://github.com/aljpetri/isONform.

Subject(s)

Nanopores , DNA, Complementary , Transcriptome , Algorithms , Computational Biology

7.

Entropy predicts sensitivity of pseudorandom seeds.

Maier, Benjamin Dominik; Sahlin, Kristoffer.

Genome Res ; 33(7): 1162-1174, 2023 07.

Article in English | MEDLINE | ID: mdl-37217253

ABSTRACT

Seed design is important for sequence similarity search applications such as read mapping and average nucleotide identity (ANI) estimation. Although k-mers and spaced k-mers are likely the most well-known and used seeds, sensitivity suffers at high error rates, particularly when indels are present. Recently, we developed a pseudorandom seeding construct, strobemers, which was empirically shown to have high sensitivity also at high indel rates. However, the study lacked a deeper understanding of why. In this study, we propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. Our discovered seed randomness-sensitivity relationship explains why some seeds perform better than others, and the relationship provides a framework for designing even more sensitive seeds. We also present three new strobemer seed constructs: mixedstrobes, altstrobes, and multistrobes. We use both simulated and biological data to show that our new seed constructs improve sequence-matching sensitivity to other strobemers. We show that the three new seed constructs are useful for read mapping and ANI estimation. For read mapping, we implement strobemers into minimap2 and observe 30% faster alignment time and 0.2% higher accuracy than using k-mers when mapping reads at high error rates. As for ANI estimation, we find that higher entropy seeds have a higher rank correlation between estimated and true ANI.

Subject(s)

Algorithms , INDEL Mutation , Sequence Alignment , Entropy , Sequence Analysis, DNA , Software

8.

Transcript Isoform Diversity of Ampliconic Genes on the Y Chromosome of Great Apes.

Tomaszkiewicz, Marta; Sahlin, Kristoffer; Medvedev, Paul; Makova, Kateryna D.

bioRxiv ; 2023 Mar 18.

Article in English | MEDLINE | ID: mdl-36993458

ABSTRACT

Y-chromosomal Ampliconic Genes (YAGs) are important for male fertility, as they encode proteins functioning in spermatogenesis. The variation in copy number and expression levels of these multicopy gene families has been recently studied in great apes, however, the diversity of splicing variants remains unexplored. Here we deciphered the sequences of polyadenylated transcripts of all nine YAG families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, VCY, and XKRY) from testis samples of six great ape species (human, chimpanzee, bonobo, gorilla, Bornean orangutan, and Sumatran orangutan). To achieve this, we enriched YAG transcripts with capture-probe hybridization and sequenced them with long (Pacific Biosciences) reads. Our analysis of this dataset resulted in several findings. First, we uncovered a high diversity of YAG transcripts across great apes. Second, we observed evolutionarily conserved alternative splicing patterns for most YAG families except for BPY2 and PRY. Our results suggest that BPY2 transcripts and predicted proteins in several great ape species (bonobo and the two orangutans) have independent evolutionary origins and are not homologous to human reference transcripts and proteins. In contrast, our results suggest that the PRY gene family, having the highest representation of transcripts without open reading frames, has been undergoing pseudogenization. Third, even though we have identified many species-specific protein-coding YAG transcripts, we have not detected any signatures of positive selection. Overall, our work illuminates the YAG isoform landscape and its evolutionary history, and provides a genomic resource for future functional studies focusing on infertility phenotypes in humans and critically endangered great apes.

9.

Strobealign: flexible seed size enables ultra-fast and accurate read alignment.

Sahlin, Kristoffer.

Genome Biol ; 23(1): 260, 2022 12 15.

Article in English | MEDLINE | ID: mdl-36522758

ABSTRACT

Read alignment is often the computational bottleneck in analyses. Recently, several advances have been made on seeding methods for fast sequence comparison. We combine two such methods, syncmers and strobemers, in a novel seeding approach for constructing dynamic-sized fuzzy seeds and implement the method in a short-read aligner, strobealign. The seeding is fast to construct and effectively reduces repetitiveness in the seeding step, as shown using a novel metric E-hits. strobealign is several times faster than traditional aligners at similar and sometimes higher accuracy while being both faster and more accurate than more recently proposed aligners for short reads of lengths 150nt and longer. Availability: https://github.com/ksahlin/strobealign.

Subject(s)

High-Throughput Nucleotide Sequencing , Software , Sequence Alignment , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Algorithms , Seeds

10.

Rapid in situ identification of biological specimens via DNA amplicon sequencing using miniaturized laboratory equipment.

Pomerantz, Aaron; Sahlin, Kristoffer; Vasiljevic, Nina; Seah, Adeline; Lim, Marisa; Humble, Emily; Kennedy, Susan; Krehenwinkel, Henrik; Winter, Sven; Ogden, Rob; Prost, Stefan.

Nat Protoc ; 17(6): 1415-1443, 2022 06.

Article in English | MEDLINE | ID: mdl-35411044

ABSTRACT

In many parts of the world, human-mediated environmental change is depleting biodiversity faster than it can be characterized, while invasive species cause agricultural damage, threaten human health and disrupt native habitats. Consequently, the application of effective approaches for rapid surveillance and identification of biological specimens is increasingly important to inform conservation and biosurveillance efforts. Taxonomic assignments have been greatly advanced using sequence-based applications, such as DNA barcoding, a diagnostic technique that utilizes PCR and DNA sequence analysis of standardized genetic regions. However, in many biodiversity hotspots, endeavors are often hindered by a lack of laboratory infrastructure, funding for biodiversity research and restrictions on the transport of biological samples. A promising development is the advent of low-cost, miniaturized scientific equipment. Such tools can be assembled into functional laboratories to carry out genetic analyses in situ, at local institutions, field stations or classrooms. Here, we outline the steps required to perform amplicon sequencing applications, from DNA isolation to nanopore sequencing and downstream data analysis, all of which can be conducted outside of a conventional laboratory environment using miniaturized scientific equipment, without reliance on Internet connectivity. Depending on sample type, the protocol (from DNA extraction to full bioinformatic analyses) can be completed within 10 h, and with appropriate quality controls can be used for diagnostic identification of samples independent of core genomic facilities that are required for alternative methods.

Subject(s)

DNA Barcoding, Taxonomic , Nanopores , Biodiversity , DNA/genetics , DNA Barcoding, Taxonomic/methods , Humans , Sequence Analysis, DNA/methods

11.

Safety in Multi-Assembly via Paths Appearing in All Path Covers of a DAG.

Caceres, Manuel; Mumey, Brendan; Husic, Edin; Rizzi, Romeo; Cairo, Massimo; Sahlin, Kristoffer; Tomescu, Alexandru I.

IEEE/ACM Trans Comput Biol Bioinform ; 19(6): 3673-3684, 2022.

Article in English | MEDLINE | ID: mdl-34847041

ABSTRACT

A multi-assembly problem asks to reconstruct multiple genomic sequences from mixed reads sequenced from all of them. Standard formulations of such problems model a solution as a path cover in a directed acyclic graph, namely a set of paths that together cover all vertices of the graph. Since multi-assembly problems admit multiple solutions in practice, we consider an approach commonly used in standard genome assembly: output only partial solutions (contigs, or safe paths) that appear in all path cover solutions. We study constrained path covers, a restriction on the path cover solution that incorporates practical constraints arising in multi-assembly problems. We give efficient algorithms finding all maximal safe paths for constrained path covers. We compute the safe paths of splicing graphs constructed from transcript annotations of different species. Our algorithms run in less than 15 seconds per species and report RNA contigs that are over 99% precise and are up to 8 times longer than unitigs. Moreover, RNA contigs cover over 70% of the transcripts and their coding sequences in most cases. With their increased length relative to unitigs, high precision, and fast construction time, maximal safe paths can provide a better base set of sequences for transcript assembly programs.

Subject(s)

Algorithms , Genomics , Genome , Base Sequence , RNA

12.

Effective sequence similarity detection with strobemers.

Sahlin, Kristoffer.

Genome Res ; 31(11): 2080-2094, 2021 11.

Article in English | MEDLINE | ID: mdl-34667119

ABSTRACT

k-mer-based methods are widely used in bioinformatics for various types of sequence comparisons. However, a single mutation will mutate k consecutive k-mers and make most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, for example, spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step, a pairing or grouping of k-mers is performed. Such techniques produce many redundant k-mer matches owing to the size of k. Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. We use simulated data to show that strobemers provide more evenly distributed sequence matches and are less sensitive to different mutation rates than k-mers and spaced k-mers. Strobemers also produce higher match coverage across sequences. We further implement a proof-of-concept sequence-matching tool StrobeMap and use synthetic and biological Oxford Nanopore sequencing data to show the utility of using strobemers for sequence comparison in different contexts such as sequence clustering and alignment scenarios.
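
The linking of k-mers by a hash function can be illustrated with a minimal Python sketch of an order-2 seed. This is a sketch under stated assumptions, not the construction from the paper: the selection rule (take the downstream k-mer that minimizes the hash of the concatenated pair), the BLAKE2 hash, and the hypothetical names `randstrobes2`, `w_min` and `w_max` are chosen only for illustration.

```python
import hashlib

def h(s: str) -> int:
    """Deterministic 64-bit hash of a string (illustrative choice)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

def randstrobes2(seq: str, k: int, w_min: int, w_max: int):
    """Yield (pos1, pos2, seed) for order-2 strobemer-like seeds.

    The first strobe is the k-mer at pos1. The second strobe is picked
    among the k-mers starting in [pos1 + w_min, pos1 + w_max] as the one
    that minimizes h(first + candidate). Choose w_min >= k if the two
    strobes should not overlap.
    """
    n = len(seq)
    for i in range(n - k + 1):
        first = seq[i:i + k]
        lo = i + w_min
        hi = min(i + w_max, n - k)
        if lo > hi:
            break  # no room left for a second strobe
        j = min(range(lo, hi + 1), key=lambda p: h(first + seq[p:p + k]))
        yield i, j, first + seq[j:j + k]
```

Because the second strobe is chosen by hash value rather than by a fixed offset, a small indel between the two strobes can leave the selected pair, and therefore the seed, unchanged. That is the intuition behind the sensitivity claims in the abstract.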

Subject(s)

Algorithms , Computational Biology , Computational Biology/methods , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA/methods , Software

13.

Accurate spliced alignment of long RNA sequencing reads.

Sahlin, Kristoffer; Mäkinen, Veli.

Bioinformatics ; 37(24): 4643-4651, 2021 12 11.

Article in English | MEDLINE | ID: mdl-34302453

ABSTRACT

MOTIVATION: Long-read RNA sequencing technologies are establishing themselves as the primary techniques to detect novel isoforms, and many such analyses are dependent on read alignments. However, the error rate and sequencing length of the reads create new challenges for accurately aligning them, particularly around small exons. RESULTS: We present an alignment method uLTRA for long RNA sequencing reads based on a novel two-pass collinear chaining algorithm. We show that uLTRA produces higher accuracy over state-of-the-art aligners with substantially higher accuracy for small exons on simulated and synthetic data. On simulated data, uLTRA achieves an accuracy of about 60% for exons of length 10 nucleotides or smaller and close to 90% accuracy for exons of length between 11 and 20 nucleotides. On biological data where true read location is unknown, we show several examples where uLTRA aligns to known and novel isoforms containing small exons that are not detected with other aligners. While uLTRA obtains its accuracy using annotations, it can also be used as a wrapper around minimap2 to align reads outside annotated regions. AVAILABILITY AND IMPLEMENTATION: uLTRA is available at https://github.com/ksahlin/ultra. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

High-Throughput Nucleotide Sequencing , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, RNA , Nucleotides

14.

NGSpeciesID: DNA barcode and amplicon consensus generation from long-read sequencing data.

Sahlin, Kristoffer; Lim, Marisa C W; Prost, Stefan.

Ecol Evol ; 11(3): 1392-1398, 2021 Feb.

Article in English | MEDLINE | ID: mdl-33598139

ABSTRACT

Third-generation sequencing technologies, such as Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio), have gained popularity over the last years. These platforms can generate millions of long-read sequences. This is not only advantageous for genome sequencing projects, but also advantageous for amplicon-based high-throughput sequencing experiments, such as DNA barcoding. However, the relatively high error rates associated with these technologies still pose challenges for generating high-quality consensus sequences. Here, we present NGSpeciesID, a program which can generate highly accurate consensus sequences from long-read amplicon sequencing technologies, including ONT and PacBio. The tool includes clustering of the reads to help filter out contaminants or reads with high error rates and employs polishing strategies specific to the appropriate sequencing platform. We show that NGSpeciesID produces consensus sequences with improved usability by minimizing preprocessing and software installation and scalability by enabling rapid processing of hundreds to thousands of samples, while maintaining similar consensus accuracy as current pipelines.

15.

Author Correction: Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis.

Sahlin, Kristoffer; Medvedev, Paul.

Nat Commun ; 12(1): 992, 2021 Feb 08.

Article in English | MEDLINE | ID: mdl-33558522

16.

Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis.

Sahlin, Kristoffer; Medvedev, Paul.

Nat Commun ; 12(1): 2, 2021 01 04.

Article in English | MEDLINE | ID: mdl-33397972

ABSTRACT

Oxford Nanopore (ONT) is a leading long-read technology which has been revolutionizing transcriptome analysis through its capacity to sequence the majority of transcripts from end-to-end. This has greatly increased our ability to study the diversity of transcription mechanisms such as transcription initiation, termination, and alternative splicing. However, ONT still suffers from high error rates which have thus far limited its scope to reference-based analyses. When a reference is not available or is not a viable option due to reference-bias, error correction is a crucial step towards the reconstruction of the sequenced transcripts and downstream sequence analysis of transcripts. In this paper, we present a novel computational method to error correct ONT cDNA sequencing data, called isONcorrect. IsONcorrect is able to jointly use all isoforms from a gene during error correction, thereby allowing it to correct reads at low sequencing depths. We are able to obtain a median accuracy of 98.9-99.6%, demonstrating the feasibility of applying cost-effective cDNA full transcript length sequencing for reference-free transcriptome analysis.

Subject(s)

Gene Expression Profiling , Nanopores , Nanotechnology/methods , Algorithms , Alleles , Animals , Drosophila melanogaster/genetics , Exons/genetics , Gene Expression Regulation , Heuristics , Polymorphism, Single Nucleotide/genetics , Protein Isoforms/metabolism , RNA Splice Sites/genetics , RNA, Messenger/genetics , RNA, Messenger/metabolism

17.

De Novo Clustering of Long-Read Transcriptome Data Using a Greedy, Quality Value-Based Algorithm.

Sahlin, Kristoffer; Medvedev, Paul.

J Comput Biol ; 27(4): 472-484, 2020 04.

Article in English | MEDLINE | ID: mdl-32181688

ABSTRACT

Long-read sequencing of transcripts with Pacific Biosciences (PacBio) Iso-Seq and Oxford Nanopore Technologies has proven to be central to the study of complex isoform landscapes in many organisms. However, current de novo transcript reconstruction algorithms from long-read data are limited, leaving the potential of these technologies unfulfilled. A common bottleneck is the dearth of scalable and accurate algorithms for clustering long reads according to their gene family of origin. To address this challenge, we develop isONclust, a clustering algorithm that is greedy (to scale) and makes use of quality values (to handle variable error rates). We test isONclust on three simulated and five biological data sets, across a breadth of organisms, technologies, and read depths. Our results demonstrate that isONclust is a substantial improvement over previous approaches, both in terms of overall accuracy and/or scalability to large data sets.

Subject(s)

Computational Biology , Gene Expression Profiling/methods , Sequence Analysis, DNA/methods , Transcriptome/genetics , Algorithms , High-Throughput Nucleotide Sequencing/methods , Software

18.

DiscoverY: a classifier for identifying Y chromosome sequences in male assemblies.

Rangavittal, Samarth; Stopa, Natasha; Tomaszkiewicz, Marta; Sahlin, Kristoffer; Makova, Kateryna D; Medvedev, Paul.

BMC Genomics ; 20(1): 641, 2019 Aug 09.

Article in English | MEDLINE | ID: mdl-31399045

ABSTRACT

BACKGROUND: Although the Y chromosome plays an important role in male sex determination and fertility, it is currently understudied due to its haploid and repetitive nature. Methods to isolate Y-specific contigs from a whole-genome assembly broadly fall into two categories. The first involves retrieving Y-contigs using proportion sharing with a female, but such a strategy is prone to false positives in the absence of a high-quality, complete female reference. A second strategy uses the ratio of depth of coverage from male and female reads to select Y-contigs, but such a method requires high-depth sequencing of a female and cannot utilize existing female references. RESULTS: We develop a k-mer based method called DiscoverY, which combines proportion sharing with female with depth of coverage from male reads to classify contigs as Y-chromosomal. We evaluate the performance of DiscoverY on human and gorilla genomes, across different sequencing platforms including Illumina, 10X, and PacBio. In the cases where the male and female data are of high quality, DiscoverY has a high precision and recall and outperforms existing methods. For cases when a high quality female reference is not available, we quantify the effect of using draft reference or even just raw sequencing reads from a female. CONCLUSION: DiscoverY is an effective method to isolate Y-specific contigs from a whole-genome assembly. However, regions homologous to the X chromosome remain difficult to detect.
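
The combination of the two signals named in this abstract (proportion of k-mers shared with a female, and depth of coverage from male reads) can be sketched in Python as follows. The thresholds `max_shared` and `min_depth`, the use of the median, and the function names are hypothetical choices for illustration; they are not DiscoverY's actual classifier or parameters.

```python
from collections import Counter
from statistics import median

def kmers(seq: str, k: int):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def contig_features(contig, female_kmers, male_counts, k):
    """Return (proportion of contig k-mers seen in the female,
    median count of the contig's k-mers in male reads)."""
    ks = kmers(contig, k)
    if not ks:
        return 0.0, 0.0
    shared = sum(1 for x in ks if x in female_kmers) / len(ks)
    depth = median(male_counts.get(x, 0) for x in ks)
    return shared, depth

def looks_y_linked(contig, female_kmers, male_counts, k,
                   max_shared=0.2, min_depth=1):
    """Flag a contig as a Y candidate when few of its k-mers occur in the
    female and it is still covered by male reads (hypothetical thresholds)."""
    shared, depth = contig_features(contig, female_kmers, male_counts, k)
    return shared <= max_shared and depth >= min_depth
```

In such a sketch, `female_kmers` could be built from a female reference or from raw female reads, and `male_counts` would be a `Counter` of k-mers from male reads. Regions homologous to the X chromosome would share many k-mers with the female and so fail the first condition, consistent with the limitation noted in the conclusion.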

Subject(s)

Chromosomes, Human, Y/genetics , Sequence Analysis, DNA/methods , Female , Haploidy , Humans , Male , Sequence Analysis, DNA/economics , Time Factors

19.

Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon.

Sahlin, Kristoffer; Tomaszkiewicz, Marta; Makova, Kateryna D; Medvedev, Paul.

Nat Commun ; 9(1): 4601, 2018 11 02.

Article in English | MEDLINE | ID: mdl-30389934

ABSTRACT

A significant portion of genes in vertebrate genomes belongs to multigene families, with each family containing several gene copies whose presence/absence, as well as isoform structure, can be highly variable across individuals. Existing de novo techniques for assaying the sequences of such highly-similar gene families fall short of reconstructing end-to-end transcripts with nucleotide-level precision or assigning alternatively spliced transcripts to their respective gene copies. We present IsoCon, a high-precision method using long PacBio Iso-Seq reads to tackle this challenge. We apply IsoCon to nine Y chromosome ampliconic gene families and show that it outperforms existing methods on both experimental and simulated data. IsoCon has allowed us to detect an unprecedented number of novel isoforms and has opened the door for unraveling the structure of many multigene families and gaining a deeper understanding of genome evolution and human diseases.

Subject(s)

Algorithms , Multigene Family , RNA, Messenger/genetics , Sequence Analysis, RNA/methods , Aged , Computer Simulation , Exons/genetics , Fragile X Mental Retardation Protein/genetics , Gene Dosage , Humans , Male , Middle Aged , Protein Isoforms/genetics , Protein Isoforms/metabolism , RNA Splicing/genetics , RNA, Messenger/metabolism , Reproducibility of Results , Testis/metabolism

20.

Structural Variation Detection with Read Pair Information: An Improved Null Hypothesis Reduces Bias.

Sahlin, Kristoffer; Frånberg, Mattias; Arvestad, Lars.

J Comput Biol ; 24(6): 581-589, 2017 Jun.

Article in English | MEDLINE | ID: mdl-27681236

ABSTRACT

Reads from paired-end and mate-pair libraries are often utilized to find structural variation in genomes, and one common approach is to use their fragment length for detection. After aligning read pairs to the reference, read pair distances are analyzed for statistically significant deviations. However, previously proposed methods are based on a simplified model of observed fragment lengths that does not agree with data. We show how this model limits statistical analysis of identifying variants and propose a new model by adapting a model we have previously introduced for contig scaffolding, which agrees with data. From this model, we derive an improved null hypothesis that when applied in the variant caller CLEVER, reduces the number of false positives and corrects a bias that contributes to more deletion calls than insertion calls. We advise developers of variant callers with statistical fragment length-based methods to adapt the concepts in our proposed model and null hypothesis.

Subject(s)

Algorithms , High-Throughput Nucleotide Sequencing/methods , Metagenomics/methods , Sequence Analysis, DNA/methods , Software , Bias , Genome, Human , Genomic Structural Variation , Humans , Models, Genetic
