Results 1 - 20 of 46
1.
BMC Bioinformatics ; 25(1): 263, 2024 Aug 08.
Article in English | MEDLINE | ID: mdl-39118013

ABSTRACT

BACKGROUND: Genome assembly, which involves reconstructing a target genome, relies on scaffolding methods to organize and link partially assembled fragments. The rapid evolution of long-read sequencing technologies toward more accurate long reads, coupled with the continued use of short-read technologies, has created a unique need for hybrid assembly workflows. The construction of accurate genomic scaffolds in hybrid workflows is complicated by scale, sequencing technology diversity (e.g., short vs. long reads, contigs or partial assemblies), and repetitive regions within a target genome. RESULTS: In this paper, we present a new parallel workflow for hybrid genome scaffolding that combines pre-constructed partial assemblies with newly sequenced long reads toward an improved assembly. More specifically, the workflow, called Maptcha, is aimed at generating long scaffolds of a target genome from two sets of input sequences: an already constructed partial assembly of contigs, and a set of newly sequenced long reads. Our scaffolding approach internally uses an alignment-free mapping step to build a ⟨contig, contig⟩ graph using long reads as linking information. Subsequently, this graph is used to generate scaffolds. We present and evaluate a graph-theoretic "wiring" heuristic to perform this scaffolding step. To enable efficient workload management in a parallel setting, we use a batching technique that partitions the scaffolding tasks so that the more expensive alignment-based assembly step at the end can be efficiently parallelized. This step also allows the use of any standalone assembler for generating the final scaffolds. CONCLUSIONS: Our experiments with Maptcha on a variety of input genomes, and comparison against two state-of-the-art hybrid scaffolders, demonstrate that Maptcha is able to generate longer and more accurate scaffolds substantially faster.
In almost all cases, the scaffolds produced by Maptcha are at least an order of magnitude longer (in some cases two orders) than the scaffolds produced by state-of-the-art tools. Maptcha also runs significantly faster, reducing time-to-solution from hours to minutes for most inputs. We also performed a coverage experiment by varying the sequencing coverage depth for long reads, which demonstrated the potential of Maptcha to generate significantly longer scaffolds in low-coverage settings (1x-10x).
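The ⟨contig, contig⟩ graph construction described above can be illustrated with a minimal sketch. This is not Maptcha's implementation; it assumes each long read has already been mapped (by the alignment-free step) to an ordered list of contig names, all of which are made up for illustration:

```python
from collections import defaultdict

def build_contig_graph(read_to_contigs):
    """Build a weighted <contig, contig> graph: an edge links two contigs
    that appear consecutively along the same long read, and its weight
    counts how many reads support that link."""
    graph = defaultdict(int)
    for contigs in read_to_contigs:
        for a, b in zip(contigs, contigs[1:]):
            edge = tuple(sorted((a, b)))  # undirected link
            graph[edge] += 1
    return dict(graph)

# Three long reads, each spanning an ordered chain of contigs.
reads = [["c1", "c2", "c3"], ["c2", "c3"], ["c3", "c4"]]
g = build_contig_graph(reads)
# The ("c2", "c3") link is supported by two reads, the others by one each.
```

Edges with higher support would then be preferred by a wiring heuristic when chaining contigs into scaffolds.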


Subject(s)
Genomics , Workflow , Genomics/methods , Sequence Analysis, DNA/methods , Software , Genome , High-Throughput Nucleotide Sequencing/methods , Algorithms
2.
J Comput Biol ; 31(7): 597-615, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38980804

ABSTRACT

Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a decycling set of the de Bruijn graph, which is a set of unavoidable k-mers. Any long enough sequence, by definition, must contain a k-mer from any decycling set (hence, the unavoidable property). Conversely, a decycling set also defines a sketching method by choosing the k-mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, were previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties.
We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope.
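The defining property of a decycling set (removing its k-mers leaves the de Bruijn graph with no directed cycle) can be checked by brute force for tiny alphabets. A minimal illustration, not the enumeration method of the paper:

```python
from itertools import product

def is_decycling(kmer_set, k, alphabet="01"):
    """A set of k-mers is decycling iff deleting its nodes from the
    de Bruijn graph leaves no directed cycle, so that every long enough
    string must contain a k-mer from the set."""
    nodes = [km for km in ("".join(p) for p in product(alphabet, repeat=k))
             if km not in kmer_set]
    node_set = set(nodes)
    # Edge u -> v when v = u[1:] + c for some character c.
    def succs(u):
        return [u[1:] + c for c in alphabet if u[1:] + c in node_set]
    # Detect a cycle with iterative DFS (colors: 0 new, 1 active, 2 done).
    color = {u: 0 for u in nodes}
    for start in nodes:
        if color[start]:
            continue
        stack = [(start, iter(succs(start)))]
        color[start] = 1
        while stack:
            u, it = stack[-1]
            for v in it:
                if color[v] == 1:
                    return False        # back edge: a cycle survives
                if color[v] == 0:
                    color[v] = 1
                    stack.append((v, iter(succs(v))))
                    break
            else:
                color[u] = 2
                stack.pop()
    return True

# For k=2 over {0,1}: {"00", "11"} hits both self-loops but misses the
# 2-cycle 01 -> 10 -> 01; adding "01" makes the set decycling.
two_kmers = is_decycling({"00", "11"}, 2)
three_kmers = is_decycling({"00", "11", "01"}, 2)
```

An MDS enumerator could use such a check as a correctness oracle while exploring alternative sets.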


Subject(s)
Algorithms , Computational Biology , Software , Computational Biology/methods , Sequence Alignment/methods , Humans , Sequence Analysis, DNA/methods
3.
Anat Sci Educ ; 2024 Jun 09.
Article in English | MEDLINE | ID: mdl-38853404

ABSTRACT

Dental anatomy education for dental technology students should be developed in alignment with digital dental laboratory practices. We hypothesized that a virtually assisted sketching-based dental anatomy teaching module could improve students' acquisition of skills essential for digital restoration design. The second-year dental technology curriculum included a novel virtual technology-assisted sketching-based module for dental anatomy education. Pre- and post-course assessments evaluated students' skill sets and knowledge bases. Computer-aided design (CAD) scores were analyzed after one year to assess how the skills students developed through this module impacted their subsequent CAD performance. Participants who undertook the dental sketching-based teaching module demonstrated significantly improved theoretical knowledge of dental anatomy, dental aesthetic perception, and spatial reasoning skills. A partial least squares structural equation model indicated that the positive effects of this module on subsequent CAD performance were indirectly mediated by dental aesthetic perception, spatial reasoning, and practice time. A virtually assisted sketching-based dental anatomy teaching module significantly improved students' acquisition of skills and knowledge and positively mediated dental technology students' CAD performance.

4.
bioRxiv ; 2024 May 30.
Article in English | MEDLINE | ID: mdl-38854044

ABSTRACT

Motivation: The increasing number and volume of genomic and metagenomic data necessitate scalable and robust computational models for precise analysis. Sketching techniques utilizing k-mers from a biological sample have proven to be useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several useful applications. Recent studies on FracMinHash proved unbiased estimators for the containment and Jaccard indices. However, theoretical investigations for other metrics, such as the cosine similarity, are still lacking. Theoretical contributions: In this paper, we present a theoretical framework for estimating cosine similarity from FracMinHash sketches. We establish conditions under which this estimation is sound, and recommend a minimum scale factor s for accurate results. Experimental evidence supports our theoretical findings. Practical contributions: We also present frac-kmc, a fast and efficient FracMinHash sketch generator program. frac-kmc is the fastest known FracMinHash sketch generator, delivering accurate and precise results for cosine similarity estimation on real data. We show that by computing FracMinHash sketches using frac-kmc, we can estimate pairwise cosine similarity speedily and accurately on real data. frac-kmc is freely available here: https://github.com/KoslickiLab/frac-kmc/.
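The FracMinHash selection rule itself is simple to sketch: keep a k-mer iff its hash lands in the bottom fraction s of the hash range. The hash function and parameters below are illustrative choices, not frac-kmc's implementation:

```python
import hashlib

def fracminhash_sketch(seq, k, s):
    """FracMinHash: keep a k-mer iff its hash falls in the bottom
    fraction s of the hash range (here a 64-bit range via SHA-1)."""
    H = 2 ** 64
    keep = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        h = int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        if h < s * H:
            keep.add(h)
    return keep

def containment(a, b):
    """Containment estimate of A in B from FracMinHash sketches."""
    return len(a & b) / len(a) if a else 0.0

A = fracminhash_sketch("ACGTACGTACGGTTAACC" * 20, 7, 0.25)
B = fracminhash_sketch("ACGTACGTACGGTTAACC" * 20 + "TTTTTTTT", 7, 0.25)
# Every k-mer of the first sequence occurs in the second, and the same
# threshold rule is applied, so sketch A is a subset of sketch B; if A is
# non-empty, the containment of A in B is therefore 1.0.
```

Because the filter is a fixed threshold rather than a fixed sketch size, sketches of related sequences stay directly comparable, which is what makes the unbiased containment and Jaccard estimators possible.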

5.
Algorithms Mol Biol ; 19(1): 19, 2024 May 04.
Article in English | MEDLINE | ID: mdl-38704605

ABSTRACT

BACKGROUND: Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a "similar sequence". Traditionally, "similar sequence" was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. RESULTS: In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in O(|t| + |p| + ℓ²) time and O(ℓ log ℓ) space, where |t| is the number of k-mers inside the sketch of the reference, |p| is the number of k-mers inside the read's sketch and ℓ is the number of times that k-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm's performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 of minimap2.

6.
J Comput Biol ; 31(1): 2-20, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37975802

ABSTRACT

Minimizers and syncmers are sketching methods that sample representative k-mer seeds from a long string. The minimizer scheme guarantees a well-spread k-mer sketch (high coverage) while seeking to minimize the sketch size (low density). The syncmer scheme yields sketches that are more robust to base substitutions (high conservation) on random sequences, but do not have the coverage guarantee of minimizers. These sketching metrics are generally adversarial to one another, especially in the context of sketch optimization for a specific sequence, and thus are difficult to achieve simultaneously. The parameterized syncmer scheme was recently introduced as a generalization of syncmers with more flexible sampling rules and empirically better coverage than the original syncmer variants. However, no approach exists to optimize parameterized syncmers. To address this shortcoming, we introduce a new scheme called masked minimizers that generalizes minimizers in a manner analogous to how parameterized syncmers generalize syncmers and allows us to extend existing optimization techniques developed for minimizers. This results in a practical algorithm to optimize the masked minimizer scheme with respect to both density and conservation. We evaluate the optimization algorithm on various benchmark genomes and show that our algorithm finds sketches that are overall more compact, well-spread, and robust to substitutions than those found by previous methods. Our implementation is released at https://github.com/Kingsford-Group/maskedminimizer. This new technique will enable more efficient and robust genomic analyses in the many settings where minimizers and syncmers are used.
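For reference, the classic (w, k)-minimizer scheme that masked minimizers generalize can be sketched in a few lines; the lexicographic k-mer ordering used here is an illustrative choice, since the ordering is exactly the component these optimization methods tune:

```python
def minimizers(seq, w, k):
    """Classic (w, k)-minimizer scheme: in every window of w consecutive
    k-mers, select the position of the smallest k-mer under the ordering
    (leftmost on ties). Returns the set of selected positions."""
    picked = set()
    n = len(seq) - k + 1                      # number of k-mer positions
    for start in range(n - w + 1):            # slide the window
        window = range(start, start + w)
        best = min(window, key=lambda i: seq[i:i + k])
        picked.add(best)
    return picked

def density(seq, w, k):
    """Fraction of k-mer positions selected: 1/w is the lower bound and
    roughly 2/(w+1) is the expectation under a random ordering."""
    return len(minimizers(seq, w, k)) / (len(seq) - k + 1)

pos = minimizers("CGTAACGTT", 3, 2)   # every window contributes a pick
```

By construction every window of w k-mers contains a selected position, which is the coverage guarantee that syncmers lack.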


Subject(s)
Algorithms , Genomics , Humans , Genomics/methods , Genome, Human
7.
ACM BCB ; 2023, 2023 Sep.
Article in English | MEDLINE | ID: mdl-38050580

ABSTRACT

In computational biology, k-mers and edit distance are fundamental concepts. However, little is known about the metric space of all k-mers equipped with the edit distance. In this work, we explore the structure of the k-mer space by studying its maximal independent sets (MISs). An MIS is a sparse sketch of all k-mers with nice theoretical properties, and therefore admits critical applications in clustering, indexing, hashing, and sketching large-scale sequencing data, particularly those with high error-rates. Finding an MIS is a challenging problem, as the size of a k-mer space grows geometrically with respect to k. We propose three algorithms for this problem. The first and the most intuitive one uses a greedy strategy. The second method implements two techniques to avoid redundant comparisons by taking advantage of the locality property of the k-mer space and the estimated bounds on the edit distance. The last algorithm avoids expensive calculations of the edit distance by translating the edit distance into the shortest path in a specifically designed graph. These algorithms are implemented and the calculated MISs of k-mer spaces and their statistical properties are reported and analyzed for k up to 15. Source code is freely available at https://github.com/Shao-Group/kmerspace.
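The greedy strategy of the first algorithm can be sketched as follows for tiny k; the paper's optimized variants avoid the all-pairs edit-distance comparisons made here:

```python
from itertools import product

def edit_distance(s, t):
    """Standard Levenshtein dynamic program, row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (a != b)))  # (mis)match
        prev = cur
    return prev[-1]

def greedy_mis(k, d=1, alphabet="ACGT"):
    """Greedy maximal independent set of the k-mer space where two k-mers
    are adjacent iff their edit distance is <= d: scan k-mers in
    lexicographic order and keep one iff it is far from all kept k-mers."""
    kept = []
    for p in product(alphabet, repeat=k):
        kmer = "".join(p)
        if all(edit_distance(kmer, q) > d for q in kept):
            kept.append(kmer)
    return kept

mis = greedy_mis(2, 1, "AC")   # over {A, C}: AA and CC are mutually far
```

The result is maximal by construction: any excluded k-mer is within distance d of some kept one.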

8.
J Comput Biol ; 30(12): 1251-1276, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37646787

ABSTRACT

Processing large data sets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembling, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of a few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared with the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide readers with a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future.


Subject(s)
Algorithms , Genomics , Sequence Analysis, DNA/methods , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Software
9.
Genome Biol ; 24(1): 121, 2023 May 17.
Article in English | MEDLINE | ID: mdl-37198663

ABSTRACT

We present RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. Our approach enables efficient processing of large-scale datasets by combining dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms. 113,674 complete bacterial genome sequences from RefSeq, 455 GB in FASTA format, can be clustered within less than 6 min and 1,009,738 GenBank assembled bacterial genomes, 4.0 TB in FASTA format, within only 34 min on a 128-core workstation. Our results further identify 1269 redundant genomes, with identical nucleotide content, in the RefSeq bacterial genomes database.
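Sketch-based distance estimation of the kind RabbitTClust builds on typically converts a sketch-estimated Jaccard index of k-mer sets into a Mash distance. The conversion formula is the standard one from Mash; the parameters below are illustrative:

```python
import math

def mash_distance(jaccard, k):
    """Mash distance: converts a Jaccard index j estimated from sketches
    of k-mer sets into a per-base distance estimate
    d = -(1/k) * ln(2j / (1 + j)), capped at 1 for j = 0."""
    if jaccard == 0:
        return 1.0
    return -math.log(2 * jaccard / (1 + jaccard)) / k

# Identical genomes (j = 1) are at distance 0; lower Jaccard, larger distance.
d_same = mash_distance(1.0, 21)
d_far = mash_distance(0.1, 21)
```

A clustering tool can then threshold such distances (e.g., single-linkage at a species-level cutoff) without ever computing a full alignment.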


Subject(s)
Genome , Software , Databases, Nucleic Acid , Cluster Analysis , Bacteria , Algorithms , Genome, Bacterial
10.
Stat Comput ; 33(1): 34, 2023.
Article in English | MEDLINE | ID: mdl-36691583

ABSTRACT

There is an increasing body of work exploring the integration of random projection into algorithms for numerical linear algebra. The primary motivation is to reduce the overall computational cost of processing large datasets. A suitably chosen random projection can be used to embed the original dataset in a lower-dimensional space such that key properties of the original dataset are retained. These algorithms are often referred to as sketching algorithms, as the projected dataset can be used as a compressed representation of the full dataset. We show that random matrix theory, in particular the Tracy-Widom law, is useful for describing the operating characteristics of sketching algorithms in the tall-data regime when the sample size n is much greater than the number of variables d. Asymptotic large sample results are of particular interest as this is the regime where sketching is most useful for data compression. In particular, we develop asymptotic approximations for the success rate in generating random subspace embeddings and the convergence probability of iterative sketching algorithms. We test a number of sketching algorithms on real large high-dimensional datasets and find that the asymptotic expressions give accurate predictions of the empirical performance. Supplementary Information: The online version contains supplementary material available at 10.1007/s11222-022-10148-5.
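A minimal example of the sketching setup in the tall-data regime, written in pure Python with a Gaussian sketch; the dimensions and seed are illustrative, and real implementations would use optimized linear algebra:

```python
import math
import random

def gaussian_sketch(A, m, seed=0):
    """Sketch the n x d matrix A down to m x d via S A, where S has
    i.i.d. N(0, 1/m) entries: an oblivious random projection that
    approximately preserves column geometry when m is large enough."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    scale = 1.0 / math.sqrt(m)
    S = [[rng.gauss(0, 1) * scale for _ in range(n)] for _ in range(m)]
    return [[sum(S[i][k] * A[k][j] for k in range(n)) for j in range(d)]
            for i in range(m)]

def col_norm(M, j):
    return math.sqrt(sum(row[j] ** 2 for row in M))

# Tall data: n = 400 rows, a single column (d = 1), sketched to m = 100.
A = [[1.0] for _ in range(400)]
SA = gaussian_sketch(A, 100)
ratio = col_norm(SA, 0) / col_norm(A, 0)   # concentrates near 1
```

The fluctuation of such norm ratios across random projections is exactly the kind of quantity whose extremes the Tracy-Widom law describes.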

11.
Article in English | MEDLINE | ID: mdl-38831964

ABSTRACT

Given a sequencing read, the broad goal of read mapping is to find the location(s) in the reference genome that have a "similar sequence". Traditionally, "similar sequence" was defined as having a high alignment score and read mappers were viewed as heuristic solutions to this well-defined problem. For sketch-based mappers, however, there has not been a problem formulation to capture what problem an exact sketch-based mapping algorithm should solve. Moreover, there is no sketch-based method that can find all possible mapping positions for a read above a certain score threshold. In this paper, we formulate the problem of read mapping at the level of sequence sketches. We give an exact dynamic programming algorithm that finds all hits above a given similarity threshold. It runs in O(|t| + |p| + ℓ²) time and Θ(ℓ²) space, where |t| is the number of k-mers inside the sketch of the reference, |p| is the number of k-mers inside the read's sketch and ℓ is the number of times that k-mers from the pattern sketch occur in the sketch of the text. We evaluate our algorithm's performance in mapping long reads to the T2T assembly of human chromosome Y, where ampliconic regions make it desirable to find all good mapping positions. For an equivalent level of precision as minimap2, the recall of our algorithm is 0.88, compared to only 0.76 of minimap2.

12.
Sensors (Basel) ; 22(24)2022 Dec 08.
Article in English | MEDLINE | ID: mdl-36559998

ABSTRACT

Approximating quantiles and distributions over streaming data has been studied for roughly two decades now. Recently, Karnin, Lang, and Liberty proposed the first asymptotically optimal algorithm for doing so. This manuscript complements their theoretical result by providing practical variants of their algorithm with improved constants. For a given sketch size, our techniques provably reduce the upper bound on the sketch error by a factor of two. These improvements are verified experimentally. Our modified quantile sketch also improves the latency, reducing the worst-case update time from O(1/ε) down to O(log(1/ε)).
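A stripped-down compactor-based quantile sketch in the spirit of the Karnin-Lang-Liberty construction; this toy uses a fixed capacity per level and omits the variable level capacities and the constant-factor refinements this paper is about:

```python
import random

class SimpleKLL:
    """Toy compactor sketch: each level holds at most c items; a full
    level is sorted and every other item (random offset) is promoted to
    the next level, doubling its weight."""
    def __init__(self, c=128, seed=0):
        self.c = c
        self.levels = [[]]
        self.rng = random.Random(seed)

    def update(self, x):
        self.levels[0].append(x)
        h = 0
        while len(self.levels[h]) >= self.c:
            buf = sorted(self.levels[h])
            offset = self.rng.randrange(2)      # unbiased rank error
            if h + 1 == len(self.levels):
                self.levels.append([])
            self.levels[h + 1].extend(buf[offset::2])
            self.levels[h] = []
            h += 1

    def quantile(self, q):
        """Weighted rank query over all retained items."""
        items = [(x, 1 << h)
                 for h, buf in enumerate(self.levels) for x in buf]
        items.sort()
        total = sum(w for _, w in items)
        run = 0
        for x, w in items:
            run += w
            if run >= q * total:
                return x
        return items[-1][0]

sk = SimpleKLL(c=128)
for i in range(10000):
    sk.update(i)
med = sk.quantile(0.5)   # approximate median of 0..9999
```

Each compaction at level h perturbs ranks by at most 2^h, which is the accounting behind the error bounds the abstract refers to.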


Subject(s)
Algorithms
13.
BMC Bioinformatics ; 23(1): 452, 2022 Oct 31.
Article in English | MEDLINE | ID: mdl-36316646

ABSTRACT

BACKGROUND: In modern sequencing experiments, quickly and accurately identifying the sources of the reads is a crucial need. In metagenomics, where each read comes from one of potentially many members of a community, it can be important to identify the exact species the read is from. In other settings, it is important to distinguish which reads are from the targeted sample and which are from potential contaminants. In both cases, identification of the correct source of a read enables further investigation of relevant reads, while minimizing wasted work. This task is particularly challenging for long reads, which can have a substantial error rate that obscures the origins of each read. RESULTS: Existing tools for the read classification problem are often alignment or index-based, but such methods can have large time and/or space overheads. In this work, we investigate the effectiveness of several sampling and sketching-based approaches for read classification. In these approaches, a chosen sampling or sketching algorithm is used to generate a reduced representation (a "screen") of potential source genomes for a query readset before reads are streamed in and compared against this screen. Using a query read's similarity to the elements of the screen, the methods predict the source of the read. Such an approach requires limited pre-processing, stores and works with only a subset of the input data, and is able to perform classification with a high degree of accuracy. CONCLUSIONS: The sampling and sketching approaches investigated include uniform sampling, methods based on MinHash and its weighted and order variants, a minimizer-based technique, and a novel clustering-based sketching approach. We demonstrate the effectiveness of these techniques both in identifying the source microbial genomes for reads from a metagenomic long read sequencing experiment, and in distinguishing between long reads from organisms of interest and potential contaminant reads. 
We then compare these approaches to existing alignment-, index-, and sketching-based tools for read classification, and demonstrate how such a method is a viable alternative for determining the source of query reads. Finally, we present a reference implementation of these approaches at https://github.com/arun96/sketching.
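The screen-and-classify pattern can be sketched with the simplest of the investigated options, uniform sampling; the genome names, sequences, and parameters are made up for illustration:

```python
def kmer_set(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_screen(genomes, k, rate=4):
    """Uniform-sampling screen: keep every `rate`-th k-mer of each
    reference genome as its reduced representation."""
    return {name: {seq[i:i + k] for i in range(0, len(seq) - k + 1, rate)}
            for name, seq in genomes.items()}

def classify(read, screen, k):
    """Predict the source as the screen entry sharing the most k-mers
    with the read (a containment-style score)."""
    rk = kmer_set(read, k)
    return max(screen, key=lambda name: len(rk & screen[name]))

genomes = {
    "gA": "ACGTACGTTTGACCA" * 10,
    "gB": "TTGGCCAATTGGCAT" * 10,
}
screen = build_screen(genomes, k=5, rate=2)
src = classify("ACGTACGTTTGACCAACGTACG", screen, k=5)
```

Only the screen is kept in memory, so classification scales with the sampled subset rather than the full references, which is the appeal over index-based methods.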


Subject(s)
High-Throughput Nucleotide Sequencing , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Metagenomics/methods , Metagenome , Algorithms
14.
J Comput Biol ; 29(12): 1288-1304, 2022 Dec.
Article in English | MEDLINE | ID: mdl-36095142

ABSTRACT

Minimizers are widely used to sample representative k-mers from biological sequences in many applications, such as read mapping and taxonomy prediction. In most scenarios, having the minimizer scheme select as few k-mer positions as possible (i.e., having a low density) is desirable to reduce computation and memory cost. Despite the growing interest in minimizers, learning an effective scheme with optimal density is still an open question, as it requires solving an apparently challenging discrete optimization problem on the permutation space of k-mer orderings. Most existing schemes are designed to work well in expectation over random sequences, which have limited applicability to many practical tools. On the other hand, several methods have been proposed to construct minimizer schemes for a specific target sequence. These methods, however, only approximate the original objective with likewise discrete surrogate tasks that are not able to significantly improve the density performance. This article introduces the first continuous relaxation of the density minimizing objective, DeepMinimizer, which employs a novel Deep Learning twin architecture to simultaneously ensure both validity and performance of the minimizer scheme. Our surrogate objective is fully differentiable and, therefore, amenable to efficient gradient-based optimization using GPU computing. Finally, we demonstrate that DeepMinimizer discovers minimizer schemes that significantly outperform state-of-the-art constructions on human genomic sequences.


Subject(s)
Algorithms , Genomics , Humans , Genomics/methods , Genome , Base Sequence
15.
RECIIS (Online) ; 16(3): 686-703, Jul-Sep 2022. ilus
Article in Portuguese | LILACS | ID: biblio-1398951

ABSTRACT

Visual narratives, like comics, are popular tools that can contribute to science education and communication for different audiences. Within this perspective, this work discusses the use of comics in the dissemination of knowledge about fish food safety at a scientific dissemination event. For the occasion, a story was created that emphasized the concept of Anisakis spp. and methodologies for the prevention of anisakiasis. The plot engaged participants of all ages, strongly suggesting that comics are captivating and function as a learning tool, which can contribute to the scientific literacy of the population, leading to health promotion at individual and collective levels. We highlight the importance of producing new knowledge that expands the dialogue between the areas of health, science, and technology with society. Furthermore, it is urgent to increase investments and promote advanced and continuous training for science disseminators.


Subject(s)
Humans , Scientific Communication and Diffusion , Graphic Novel , Public Policy , Science , Education, Continuing , Information Literacy , Health Promotion , Learning
16.
PeerJ ; 10: e13784, 2022.
Article in English | MEDLINE | ID: mdl-35891643

ABSTRACT

Bacteria of the genus Klebsiella are among the most important multi-drug resistant human pathogens, though they have been isolated from a variety of environments. The importance and ubiquity of these organisms call for quick and accurate methods for their classification. Average Nucleotide Identity (ANI) is becoming a standard for species delimitation based on whole genome sequence comparison. However, much faster genome comparison tools have been appearing in the literature. In this study we tested the quality of different approaches for genome-based species delineation against ANI. To this end, we compared 1,189 Klebsiella genomes using measures calculated with Mash, Dashing, and DNA compositional signatures, all of which run in a fraction of the time required to obtain ANI. Receiver Operating Characteristic (ROC) curve analyses showed equal quality in species discrimination for ANI, Mash and Dashing, with Area Under the Curve (AUC) values above 0.99, followed by DNA signatures (AUC: 0.96). Accordingly, groups obtained at optimized cutoffs largely agree with species designation, with ANI, Mash and Dashing producing 15 species-level groups. DNA signatures broke the dataset into more than 30 groups. Testing Mash to map species after adding draft genomes to the dataset also showed excellent results (AUC above 0.99), producing a total of 26 Klebsiella species-level groups. The ecological niches of Klebsiella strains were found to neither be related to species delimitation, nor to protein functional content, suggesting that a single Klebsiella species can have a wide repertoire of ecological functions.


Subject(s)
Genome, Bacterial , Klebsiella , Humans , Klebsiella/genetics , Genome, Bacterial/genetics , Bacteria , DNA
17.
J Comput Biol ; 29(2): 155-168, 2022 02.
Article in English | MEDLINE | ID: mdl-35108101

ABSTRACT

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
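The expectation derived in this model is easy to state and check by simulation: a k-mer is mutated iff at least one of its k positions mutates, so E[N_mut] = (n - k + 1)(1 - (1 - r)^k). A small sketch under the paper's independent-mutation, no-spurious-matches assumptions:

```python
import random

def expected_mutated_kmers(n, k, r):
    """A k-mer survives iff all k of its bases are unmutated, giving
    E[N_mut] = (n - k + 1) * (1 - (1 - r)**k)."""
    return (n - k + 1) * (1 - (1 - r) ** k)

def simulate_mutated_kmers(n, k, r, trials=200, seed=0):
    """Average count of k-mer windows containing >= 1 mutated base."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        mutated = [rng.random() < r for _ in range(n)]
        total += sum(any(mutated[i:i + k]) for i in range(n - k + 1))
    return total / trials

n, k, r = 1000, 21, 0.01
theory = expected_mutated_kmers(n, k, r)       # about 186.5 here
empirical = simulate_mutated_kmers(n, k, r)
```

Note the simulation makes the clumping of mutated k-mers into islands visible, which is why the variance (and hence the confidence intervals in the paper) is not simply binomial.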


Subject(s)
Mutation , Sequence Analysis, DNA/statistics & numerical data , Algorithms , Base Sequence , Computational Biology , Confidence Intervals , Genomics/statistics & numerical data , Humans , Models, Genetic , Sequence Alignment/statistics & numerical data , Software
18.
J Comput Biol ; 29(2): 140-154, 2022 02.
Article in English | MEDLINE | ID: mdl-35049334

ABSTRACT

k-mer counts are important features used by many bioinformatics pipelines. Existing k-mer counting methods focus on optimizing either time or memory usage, producing as output very large count tables explicitly representing k-mers together with their counts. Storing k-mers is not needed if the set of k-mers is known, making it possible to keep only counters and their association to k-mers. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) and Count-Min sketches. We introduce the Set-Min sketch, a sketching technique for representing associative maps inspired by Count-Min, and apply it to the problem of representing k-mer count tables. Set-Min is provably more accurate than both Count-Min and Max-Min, an improved variant of Count-Min for static datasets that we define here. We show that the Set-Min sketch provides a very low error rate, in terms of both the probability and the size of errors, at the expense of a very moderate memory increase. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, for fully assembled genomes and large k. The space-efficiency of Set-Min in this case takes advantage of the power-law distribution of k-mer counts in genomic datasets.
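For context, a minimal Count-Min sketch, the structure Set-Min is inspired by; the salted built-in hash below stands in for proper pairwise-independent hash families, and the parameters are illustrative:

```python
import random

class CountMin:
    """Count-Min sketch: d rows of w counters; each update increments
    one counter per row, and a query returns the row-wise minimum,
    which can only overestimate the true count."""
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.rows = [[0] * w for _ in range(d)]
        # Per-row salts stand in for independent hash functions.
        self.salts = [rng.getrandbits(64) for _ in range(d)]

    def _idx(self, key, i):
        return hash((self.salts[i], key)) % self.w

    def update(self, key, count=1):
        for i in range(self.d):
            self.rows[i][self._idx(key, i)] += count

    def query(self, key):
        return min(self.rows[i][self._idx(key, i)] for i in range(self.d))

cm = CountMin(w=256, d=4)
for kmer, c in [("ACGTA", 5), ("CGTAC", 3), ("GTACG", 1)]:
    cm.update(kmer, c)
# Estimates are never below the true counts (one-sided error).
```

Set-Min's contribution is to shrink both the probability and the magnitude of these overestimates when the count distribution is skewed, as k-mer counts in genomes are.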


Subject(s)
Computational Biology/methods , Genomics/statistics & numerical data , Software , Algorithms , Animals , Computer Graphics , Databases, Genetic/statistics & numerical data , Genome, Human , Humans , Models, Statistical , Molecular Sequence Annotation/statistics & numerical data
19.
Front Psychol ; 13: 1012787, 2022.
Article in English | MEDLINE | ID: mdl-36687809

ABSTRACT

Multiple external representations (e.g., diagrams, equations) and their interpretations play a central role in science and science learning as research has shown that they can substantially facilitate the learning and understanding of science concepts. Therefore, multiple and particularly visual representations are a core element of university physics. In electrodynamics, which students encounter already at the beginning of their studies, vector fields are a central representation typically used in two forms: the algebraic representation as a formula and the visual representation depicted by a vector field diagram. While the former is valuable for quantitative calculations, vector field diagrams are beneficial for showing many properties of a field at a glance. However, benefiting from the mutual complementarity of both representations requires representational competencies aiming at referring different representations to each other. Yet, previous study results revealed several student problems particularly regarding the conceptual understanding of vector calculus concepts. Against this background, we have developed research-based, multi-representational learning tasks that focus on the visual interpretation of vector field diagrams aiming at enhancing a broad, mathematical as well as conceptual, understanding of vector calculus concepts. Following current trends in education research and considering cognitive psychology, the tasks incorporate sketching activities and interactive (computer-based) simulations to enhance multi-representational learning. In this article, we assess the impact of the learning tasks in a field study by implementing them into lecture-based recitations in a first-year electrodynamics course at the University of Göttingen. For this, a within- and between-subjects design is used comparing a multi-representational intervention group and a control group working on traditional calculation-based tasks. 
To analyze the impact of multiple representations, students' performance in a vector calculus test as well as their perceived cognitive load during task processing is compared between the groups. Moreover, analyses offer guidance for further design of multi-representational learning tasks in field-related physics topics.

20.
IEEE Trans Knowl Data Eng ; 34(1): 328-339, 2022 Jan.
Article in English | MEDLINE | ID: mdl-38288326

ABSTRACT

In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size O(log n) to buckets of size O(log log n) by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash, as we build off a HyperLogLog scaffold, can be used as a drop-in replacement for MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. For an additive approximation error ε on a Jaccard index t, given a random oracle, HyperMinHash needs O(ε⁻²(log log n + log(1/ε))) space. HyperMinHash allows estimating Jaccard indices of 0.01 for set cardinalities on the order of 10¹⁹ with a relative error of around 10% using 2 MiB of memory; MinHash can only estimate Jaccard indices for cardinalities of 10¹⁰ with the same memory consumption.
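The floating-point encoding idea can be sketched as follows: store the exponent (the position of the leading one bit, as in HyperLogLog) plus a few mantissa bits, instead of the full bucket. This is a simplified illustration, not the paper's exact encoding:

```python
def compress_minhash(h, hash_bits=64, mantissa_bits=6):
    """HyperMinHash-style lossy encoding of a MinHash value: keep the
    number of leading zeros (a HyperLogLog-style exponent, O(log log n)
    bits) plus the first mantissa_bits bits after the leading one,
    rather than the full O(log n) bucket."""
    lz = 0
    while lz < hash_bits and not (h >> (hash_bits - 1 - lz)) & 1:
        lz += 1
    # Mantissa: the bits immediately after the leading one (zero-padded).
    rest = ((h << (lz + 1)) & ((1 << hash_bits) - 1)) if lz < hash_bits else 0
    mantissa = rest >> (hash_bits - mantissa_bits)
    return lz, mantissa

# Leading zeros 3, then bits 1011 after the leading one.
exp, man = compress_minhash(0b0001_1011 << 56, hash_bits=64, mantissa_bits=4)
```

Two compressed values still compare almost like the originals (exponent first, then mantissa), which is what preserves the MinHash collision probabilities up to the stated error.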
