1.
J Comput Biol ; 31(7): 597-615, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38980804

ABSTRACT

Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a decycling set of the de Bruijn graph, which is a set of unavoidable k-mers. Any long enough sequence, by definition, must contain a k-mer from any decycling set (hence, the unavoidable property). Conversely, a decycling set also defines a sketching method by choosing the k-mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties. 
We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope.
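To illustrate the decycling property concretely, the following sketch (ours, not the authors' mdsscope code) checks whether a k-mer set is decycling by deleting its k-mers from the de Bruijn graph and testing that no cycle remains; the binary alphabet and small k are illustrative.

```python
from itertools import product

def de_bruijn_edges(k, alphabet="01"):
    """Directed edges of the de Bruijn graph whose nodes are all k-mers:
    u -> v whenever the (k-1)-suffix of u equals the (k-1)-prefix of v."""
    nodes = ["".join(p) for p in product(alphabet, repeat=k)]
    return {u: [u[1:] + c for c in alphabet] for u in nodes}

def is_decycling(kmer_set, k, alphabet="01"):
    """A set is decycling iff removing its k-mers leaves an acyclic graph,
    i.e. every long enough sequence must contain a k-mer from the set."""
    adj = de_bruijn_edges(k, alphabet)
    remaining = set(adj) - set(kmer_set)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {v: WHITE for v in remaining}

    def has_cycle(v):
        # Iterative-deepening DFS coloring: GRAY on the current path.
        color[v] = GRAY
        for w in adj[v]:
            if w not in remaining:
                continue
            if color[w] == GRAY or (color[w] == WHITE and has_cycle(w)):
                return True
        color[v] = BLACK
        return False

    return not any(color[v] == WHITE and has_cycle(v) for v in list(remaining))
```

For the binary alphabet with k = 2, the graph has self-loops at 00 and 11 and the 2-cycle 01 ↔ 10, so {00, 11, 01} is an MDS while {00, 11} is not.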


Subject(s)
Algorithms , Computational Biology , Software , Computational Biology/methods , Sequence Alignment/methods , Humans , Sequence Analysis, DNA/methods
2.
J Comput Biol ; 31(5): 381-395, 2024 05.
Article in English | MEDLINE | ID: mdl-38687333

ABSTRACT

Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how they can be used to classify the sequence. In this study, our main result is a careful mathematical analysis of hash function properties showing that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short-reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins, a modest step toward building a deep learning assembler, although it is at present too slow to be practical. In total, this article provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.
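The correspondence can be caricatured in a few lines (a toy illustration under our own naming, not the authors' implementation): a random Gaussian filter scores one-hot-encoded k-mers, and max-pooling over a window selects the top-scoring k-mer per window, i.e. a minimizer under the ordering induced by the filter response.

```python
import random

def gaussian_filter(k, alphabet="ACGT", seed=42):
    """One Gaussian weight per (position, character): the dot product of this
    filter with a one-hot-encoded k-mer is a random projection score."""
    rng = random.Random(seed)
    return {(i, c): rng.gauss(0.0, 1.0) for i in range(k) for c in alphabet}

def max_pool_select(seq, k, w, filt):
    """Max-pooling the filter response over each window of w consecutive
    k-mers selects one k-mer per window -- a minimizer scheme whose
    ordering is the negated filter score."""
    score = lambda kmer: sum(filt[(i, c)] for i, c in enumerate(kmer))
    selected = set()
    for start in range(len(seq) - (w + k - 1) + 1):
        # (score, -j, j): ties, which are vanishingly rare, break leftmost.
        window = [(score(seq[j:j + k]), -j, j) for j in range(start, start + w)]
        selected.add(max(window)[2])
    return sorted(selected)
```

By construction the scheme inherits the minimizer window guarantee: every window of w consecutive k-mers contains a selected position.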


Subject(s)
COVID-19 , Neural Networks, Computer , SARS-CoV-2 , Humans , COVID-19/virology , SARS-CoV-2/genetics , Algorithms , Telomere/genetics , Computational Biology/methods , Genomics/methods , Deep Learning , Genome, Human
3.
J Comput Biol ; 30(12): 1251-1276, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37646787

ABSTRACT

Processing large data sets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges in processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembly, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of a few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared with the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide readers with a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of a better fusion of theory and application in the future.
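The classical (w, k)-minimizer construction the survey builds on can be sketched as follows (a minimal illustration; the hash-based ordering is one common choice, not the only one):

```python
import hashlib

def kmer_rank(kmer):
    """Pseudo-random total order on k-mers (a common alternative to plain
    lexicographic order, which over-selects low-complexity runs like AAAA)."""
    return hashlib.blake2b(kmer.encode(), digest_size=8).hexdigest()

def minimizer_sketch(seq, k, w):
    """(w, k)-minimizers: the smallest-ranked k-mer of every window of
    w consecutive k-mers; returns the sorted selected positions."""
    positions = set()
    for start in range(len(seq) - (w + k - 1) + 1):
        window = range(start, start + w)
        positions.add(min(window, key=lambda j: (kmer_rank(seq[j:j + k]), j)))
    return sorted(positions)
```

Consecutive windows usually share their minimum, which is why the sketch is much smaller than the number of windows while still representing every window.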


Subject(s)
Algorithms , Genomics , Sequence Analysis, DNA/methods , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Software
4.
Entropy (Basel) ; 24(11)2022 Nov 03.
Article in English | MEDLINE | ID: mdl-36359689

ABSTRACT

The goal of the first part of this note is to obtain an explicit formula for the rotation number and the Mather β-function for the ellipse. This is done with the help of a non-standard generating function for the billiard problem; in this way the derivation is especially simple. In the second part, we discuss an application of the Mather β-function to the rigidity problem.

5.
BMC Bioinformatics ; 22(1): 534, 2021 Oct 30.
Article in English | MEDLINE | ID: mdl-34717540

ABSTRACT

BACKGROUND: Generating high-quality de novo genome assemblies is foundational to the genomics study of model and non-model organisms. In recent years, long-read sequencing has greatly benefited genome assembly and scaffolding, a process by which assembled sequences are ordered and oriented through the use of long-range information. Long reads are better able to span repetitive genomic regions compared to short reads, and thus have tremendous utility for resolving problematic regions and helping generate more complete draft assemblies. Here, we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies exclusively using long reads. RESULTS: LongStitch incorporates multiple tools developed by our group and runs in up to three stages, which includes initial assembly correction (Tigmint-long), followed by two incremental scaffolding stages (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, previously developed for linked reads, that we adapted for long reads. Here, we describe the LongStitch pipeline and introduce our new long-read scaffolder, ntLink, which utilizes lightweight minimizer mappings to join contigs. LongStitch was tested on short and long-read assemblies of Caenorhabditis elegans, Oryza sativa, and three different human individuals using corresponding nanopore long-read data, and improves the contiguity of each assembly from 1.2-fold up to 304.6-fold (as measured by NGA50 length). Furthermore, LongStitch generates more contiguous and correct assemblies compared to state-of-the-art long-read scaffolder LRScaf in most tests, and consistently improves upon human assemblies in under five hours using less than 23 GB of RAM. CONCLUSIONS: Due to its effectiveness and efficiency in improving draft assemblies using long reads, we expect LongStitch to benefit a wide variety of de novo genome assembly projects. 
The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.
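The idea behind minimizer-based joining can be caricatured as follows (a deliberately simplified sketch under hypothetical names, not the ntLink algorithm): a long read that shares minimizers with two different contigs is evidence that those contigs are adjacent.

```python
def minimizers(seq, k, w):
    """Smallest k-mer per window (lexicographic order here, for simplicity);
    returns the set of selected k-mer strings."""
    picked = set()
    for s in range(len(seq) - (w + k - 1) + 1):
        picked.add(min(range(s, s + w), key=lambda j: (seq[j:j + k], j)))
    return {seq[j:j + k] for j in picked}

def candidate_joins(contigs, long_read, k=5, w=3):
    """Contig pairs whose minimizer sets both intersect the read's
    minimizers are candidates for joining via that read."""
    read_mins = minimizers(long_read, k, w)
    hit = [name for name, seq in contigs.items()
           if minimizers(seq, k, w) & read_mins]
    return [(a, b) for i, a in enumerate(hit) for b in hit[i + 1:]]
```

A real scaffolder additionally checks the order, orientation, and spacing of the shared minimizers before emitting a join; this sketch only finds the candidates.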


Subject(s)
Genomics , High-Throughput Nucleotide Sequencing , Genome , Humans , Repetitive Sequences, Nucleic Acid , Sequence Analysis, DNA
6.
Cell Syst ; 12(10): 958-968.e6, 2021 10 20.
Article in English | MEDLINE | ID: mdl-34525345

ABSTRACT

DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. Here, we define an algorithmic approach, mdBG, that makes use of minimizer-space de Bruijn graphs to enable long-read genome assembly. mdBG achieves orders-of-magnitude improvement in both speed and memory usage over existing methods without compromising accuracy. A human genome is assembled in under 10 min using 8 cores and 10 GB RAM, and 60 Gbp of metagenome reads are assembled in 4 min using 1 GB RAM. In addition, we constructed a minimizer-space de Bruijn graph-based representation of 661,405 bacterial genomes, comprising 16 million nodes and 45 million edges, and successfully search it for anti-microbial resistance (AMR) genes in 12 min. We expect our advances to be essential to sequence analysis, given the rise of long-read sequencing in genomics, metagenomics, and pangenomics. Code for constructing mdBGs is freely available for download at https://github.com/ekimb/rust-mdbg/.
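A toy rendering of the minimizer-space idea (illustrative only; the real tool, rust-mdbg, selects minimizers by hashing rather than the fixed-character rule assumed here): each read is reduced to its ordered list of minimizers, and graph nodes are tuples of consecutive minimizers instead of base-space k-mers.

```python
def minimizer_space(read, k, marker="A"):
    """Toy selection rule: call a k-mer a minimizer iff it starts with a
    fixed character (a stand-in for 'hash below threshold'); a read becomes
    the ordered list of its minimizers."""
    return [read[i:i + k] for i in range(len(read) - k + 1)
            if read[i] == marker]

def mdbg_nodes(reads, k, order=2):
    """Nodes of a minimizer-space de Bruijn graph: tuples of `order`
    consecutive minimizers, far fewer than base-space k-mers."""
    nodes = set()
    for read in reads:
        mins = minimizer_space(read, k)
        nodes.update(tuple(mins[i:i + order])
                     for i in range(len(mins) - order + 1))
    return nodes
```

Because only a small fraction of k-mers survive the selection rule, the graph shrinks by roughly the selection density, which is the source of the speed and memory gains reported above.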


Subject(s)
Algorithms , Genomics , Humans , Metagenomics , Microcomputers , Sequence Analysis, DNA/methods
7.
J Comput Biol ; 28(11): 1052-1062, 2021 11.
Article in English | MEDLINE | ID: mdl-34448593

ABSTRACT

Current technologies allow the sequencing of microbial communities directly from the environment without prior culturing. One of the major problems when analyzing a microbial sample is to taxonomically annotate its reads to identify the species it contains. The major difficulties of taxonomic analysis are the lack of taxonomically related genomes in existing reference databases, the uneven abundance ratio of species, and sequencing errors. Microbial communities can be studied by clustering their reads, a process referred to as genome binning. In this study, we present MetaProb 2, an unsupervised genome binning method based on read assembly and probabilistic k-mer statistics. The novelties of MetaProb 2 are the use of minimizers to efficiently assemble reads into unitigs and a community detection algorithm based on graph modularity to cluster unitigs and to detect representative unitigs. The effectiveness of MetaProb 2 is demonstrated on both simulated and real datasets in comparison with state-of-the-art binning tools such as MetaProb, AbundanceBin, Bimeta, and MetaCluster. On real datasets, it is the only one capable of producing promising results while being parsimonious with computational resources.


Subject(s)
Computational Biology/methods , Metagenomics/methods , Algorithms , Data Mining , Databases, Genetic , Unsupervised Machine Learning
8.
PeerJ ; 9: e10805, 2021.
Article in English | MEDLINE | ID: mdl-33604186

ABSTRACT

Minimizers are widely used to select subsets of fixed-length substrings (k-mers) from biological sequences in applications ranging from read mapping to taxonomy prediction and indexing of large datasets. The minimizer of a string of w consecutive k-mers is the k-mer with smallest value according to an ordering of all k-mers. Syncmers are defined here as a family of alternative methods which select k-mers by inspecting the position of the smallest-valued substring of length s < k within the k-mer. For example, a closed syncmer is selected if its smallest s-mer is at the start or end of the k-mer. At least one closed syncmer must be found in every window of length (k - s) k-mers. Unlike a minimizer, a syncmer is identified by its sequence alone, and is therefore synchronized in the following sense: if a given k-mer is selected from one sequence, it will also be selected from any other sequence. Also, minimizers can be deleted by mutations in flanking sequence, which cannot happen with syncmers. Experiments on minimizers with parameters used in the minimap2 read mapper and Kraken taxonomy prediction algorithm respectively show that syncmers can simultaneously achieve both lower density and higher conservation compared to minimizers.
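The closed-syncmer rule itself is only a few lines (a direct transcription of the definition above; ties among equal smallest s-mers are broken to the leftmost occurrence, one possible convention):

```python
def smallest_smer_pos(kmer, s):
    """Position of the (leftmost) smallest length-s substring of the k-mer."""
    return min(range(len(kmer) - s + 1), key=lambda i: (kmer[i:i + s], i))

def is_closed_syncmer(kmer, s):
    """Closed syncmer: the smallest s-mer sits at the start or the end of
    the k-mer. The decision depends on the k-mer's sequence alone, which is
    what makes syncmers 'synchronized' across sequences."""
    return smallest_smer_pos(kmer, s) in (0, len(kmer) - s)
```

For k = 5, s = 2: "AACGT" is selected (smallest s-mer AA at position 0), "CGTAA" is selected (AA at position k - s = 3), while "CAAGT" is not (AA at position 1).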

9.
Methods Mol Biol ; 2243: 95-105, 2021.
Article in English | MEDLINE | ID: mdl-33606254

ABSTRACT

High-throughput sequencing machines can read millions of DNA molecules in parallel in a short time and at a relatively low cost. As a consequence, researchers have access to databases with millions of genomic samples. Searching and analyzing these large amounts of data require efficient algorithms. Universal hitting sets are sets of words that must be present in any long enough string. Using small universal hitting sets, it is possible to increase the efficiency of many high-throughput sequencing data analyses, but generating minimum-size universal hitting sets is a hard problem. In this chapter, we cover our algorithmic developments to produce compact universal hitting sets and some of their potential applications.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Algorithms , DNA/genetics , Humans , Software
10.
J Comput Biol ; 28(4): 395-409, 2021 04.
Article in English | MEDLINE | ID: mdl-33325773

ABSTRACT

Universal hitting sets (UHS) are sets of words that are unavoidable: every long enough sequence is hit by the set (i.e., it contains a word from the set). There is a tight relationship between UHS and minimizer schemes, where minimizer schemes with low density (i.e., efficient schemes) correspond to UHS of small size. Local schemes are a generalization of minimizer schemes that can be used as replacements for minimizer schemes, with the possibility of being much more efficient. We establish the link between efficient local schemes and the minimum length of a string that must be hit by a UHS. We give bounds for the remaining path length of the Mykkeltveit UHS. In addition, we create a local scheme with the lowest known density that is only a log factor away from the theoretical lower bound.
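For small k, the remaining path length of a hitting set can be computed directly (a sketch assuming the set is decycling, so the pruned de Bruijn graph is a DAG; the binary alphabet is illustrative):

```python
from functools import lru_cache
from itertools import product

def remaining_path_length(hitting_set, k, alphabet="01"):
    """Longest path, counted in edges, of the de Bruijn graph after removing
    the hitting set's k-mers; any sequence spelling a longer walk must contain
    a k-mer from the set. Assumes the set is decycling (otherwise paths are
    unbounded and the memoized recursion below would not terminate)."""
    nodes = {"".join(p) for p in product(alphabet, repeat=k)} - set(hitting_set)

    @lru_cache(maxsize=None)
    def longest_from(v):
        nxt = [v[1:] + c for c in alphabet if v[1:] + c in nodes]
        return 1 + max(map(longest_from, nxt)) if nxt else 0

    return max(map(longest_from, nodes), default=-1)
```

For the binary alphabet with k = 2, removing the MDS {00, 11, 01} leaves the single node 10 with no surviving edges, so the remaining path length is 0.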


Subject(s)
Genome, Human/genetics , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA , Software , Algorithms , Computational Biology/trends , Humans
11.
Genome Biol ; 20(1): 257, 2019 11 28.
Article in English | MEDLINE | ID: mdl-31779668

ABSTRACT

Although Kraken's k-mer-based approach provides a fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.


Subject(s)
Metagenomics/methods , Software
12.
Philos Trans A Math Phys Eng Sci ; 377(2144): 20180074, 2019 May 06.
Article in English | MEDLINE | ID: mdl-30879420

ABSTRACT

A soft solid is said to be initially stressed if it is subjected to a state of internal stress in its unloaded reference configuration. In physical terms, its stored elastic energy may not vanish in the absence of an elastic deformation and also depends on the spatial distribution of the underlying material inhomogeneities. Developing a sound mathematical framework to model initially stressed solids in nonlinear elasticity is key for many applications in engineering and biology. This work investigates the links between the existence of elastic minimizers and the constitutive restrictions for initially stressed materials subjected to finite deformations. In particular, we consider a subclass of constitutive responses in which the strain energy density is taken as a scalar-valued function of both the deformation gradient and the initial stress tensor. The main advantage of this approach is that the initial stress tensor belongs to the group of divergence-free symmetric tensors satisfying the boundary conditions in any given reference configuration. However, it is still unclear which physical restrictions must be imposed for the well-posedness of this elastic problem. Assuming that the constitutive response depends on the choice of the reference configuration only through the initial stress tensor, under given conditions we prove the local existence of a relaxed state given by an implicit tensor function of the initial stress distribution. This tensor function is generally not unique, and it can be transformed according to the symmetry group of the material at fixed initial stresses. These results allow one to extend Ball's existence theorem for elastic minimizers to the proposed constitutive choice of initially stressed materials. This article is part of the theme issue 'Rivlin's legacy in continuum mechanics and applied mathematics'.
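In symbols, the constitutive assumption described above can be written as follows (our notation, not necessarily the authors'; the traction-free boundary condition is an assumption consistent with an unloaded reference configuration):

```latex
% Strain energy depending on the deformation gradient F and the
% initial (residual) stress tau over the reference body B_0:
\Psi = \Psi(\mathbf{F}, \boldsymbol{\tau}),
\qquad
\operatorname{Div}\boldsymbol{\tau} = \mathbf{0} \ \text{in } \mathcal{B}_0,
\qquad
\boldsymbol{\tau} = \boldsymbol{\tau}^{\mathsf{T}},
\qquad
\boldsymbol{\tau}\,\mathbf{N} = \mathbf{0} \ \text{on } \partial\mathcal{B}_0.
```

The constraints on τ are exactly the "divergence-free symmetric tensors satisfying the boundary conditions" referred to in the abstract.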

13.
J Comput Biol ; 25(7): 766-779, 2018 07.
Article in English | MEDLINE | ID: mdl-29708767

ABSTRACT

Emerging single-molecule sequencing technologies from Pacific Biosciences and Oxford Nanopore have revived interest in long-read mapping algorithms. Alignment-based seed-and-extend methods demonstrate good accuracy but face limited scalability, while faster alignment-free methods typically trade decreased precision for efficiency. In this article, we combine a fast approximate read mapping algorithm based on minimizers with a novel MinHash identity estimation technique to achieve both scalability and precision. In contrast to prior methods, we develop a mathematical framework that defines the types of mapping targets we uncover, establish probabilistic estimates of p-value and sensitivity, and demonstrate tolerance for alignment error rates up to 20%. With this framework, our algorithm automatically adapts to different minimum length and identity requirements and provides both positional and identity estimates for each mapping reported. For mapping human PacBio reads to the hg38 reference, our method is 290× faster than Burrows-Wheeler Aligner-MEM, with a lower memory footprint and a recall rate of 96%. We further demonstrate the scalability of our method by mapping noisy PacBio reads (each ≥5 kbp in length) to the complete NCBI RefSeq database, containing 838 Gbp of sequence and >60,000 genomes.
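The MinHash ingredient can be sketched as follows (a generic bottom-sketch estimator in the style of Mash; the parameter values and the Jaccard-to-identity formula as used here are illustrative, not the paper's exact procedure):

```python
import hashlib
from math import log

def bottom_sketch(seq, k, size=64):
    """Bottom-s MinHash sketch: the `size` smallest distinct k-mer hashes."""
    hashes = {int.from_bytes(
                  hashlib.blake2b(seq[i:i + k].encode(),
                                  digest_size=8).digest(), "big")
              for i in range(len(seq) - k + 1)}
    return sorted(hashes)[:size]

def jaccard_estimate(sk_a, sk_b, size=64):
    """Estimate Jaccard similarity from the merged bottom sketch:
    fraction of the smallest merged hashes present in both sketches."""
    merged = sorted(set(sk_a) | set(sk_b))[:size]
    shared = len(set(merged) & set(sk_a) & set(sk_b))
    return shared / len(merged)

def identity_estimate(jaccard, k):
    """Mash-style conversion from a Jaccard estimate to an estimate of
    average nucleotide identity between the two sequences."""
    if jaccard == 0:
        return 0.0
    return 1.0 + log(2.0 * jaccard / (1.0 + jaccard)) / k
```

Identical sequences give a Jaccard estimate of 1 and hence an identity of 1; unrelated sequences share no hashes and give 0.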


Subject(s)
Genome, Human/genetics , High-Throughput Nucleotide Sequencing/methods , Software , Algorithms , Databases, Factual , Humans , Sequence Alignment/methods , Sequence Analysis, DNA
14.
Philos Trans A Math Phys Eng Sci ; 372(2028)2014 Nov 13.
Article in English | MEDLINE | ID: mdl-25288810

ABSTRACT

We consider the minimization of the repulsive-attractive power-law interaction energies that occur in many biological and physical situations. We show the existence of global minimizers in the discrete setting and obtain bounds for their supports independently of the number of Dirac deltas in a certain range of exponents. These global discrete minimizers correspond to the stable spatial profiles of flock patterns in swarming models. Global minimizers of the continuum problem are obtained by compactness. We also illustrate our results through numerical simulations.
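In the usual convention for such problems, the energies in question take the following form (our notation; the exponent names are illustrative and not taken from the abstract):

```latex
% Repulsive-attractive power-law interaction energy over probability
% measures \mu, with short-range repulsion and long-range attraction:
E[\mu] = \frac{1}{2}\iint_{\mathbb{R}^d\times\mathbb{R}^d}
         W(x-y)\,d\mu(x)\,d\mu(y),
\qquad
W(x) = \frac{|x|^{a}}{a} - \frac{|x|^{b}}{b}, \quad a > b.
% Discrete counterpart on N Dirac deltas at positions x_1,\dots,x_N
% (the flock-pattern setting discussed in the abstract):
E_N[x] = \frac{1}{2N^{2}} \sum_{i \neq j} W(x_i - x_j).
```

With a > b the potential W is decreasing near the origin (repulsive) and increasing at large distances (attractive), which is what confines minimizers to a bounded support.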
