Results 1 - 8 of 8
1.
BMC Bioinformatics ; 25(1): 154, 2024 Apr 18.
Article in English | MEDLINE | ID: mdl-38637756

ABSTRACT

BACKGROUND: High-throughput sequencing is a powerful tool that is extensively applied in biological studies. However, sequencers may produce low-quality bases, which are reported as ambiguous bases ('N's). PCR duplicates introduced during library preparation are conventionally removed in genomics studies, and several deduplication tools have been developed for this purpose. However, two identical reads may appear different because of ambiguous bases, and existing tools cannot handle 'N's correctly or efficiently.

RESULTS: Here we propose and implement TrieDedup, which uses the trie (prefix tree) data structure to compare and store sequences. TrieDedup handles ambiguous 'N' bases and efficiently deduplicates at the level of raw sequences. We also reduced its memory usage by approximately 20% by implementing restrictedDict in Python. We benchmarked the performance of the algorithm and showed that TrieDedup can deduplicate reads up to 270-fold faster than pairwise comparison, at the cost of 32-fold higher memory usage.

CONCLUSIONS: The TrieDedup algorithm may facilitate PCR deduplication, barcode or UMI assignment, and repertoire diversity analysis of large-scale high-throughput sequencing datasets, with an ultra-fast algorithm that accounts for ambiguous bases arising from sequencing errors.


Subjects
Algorithms, Software, Genomics, Gene Library, High-Throughput Nucleotide Sequencing, DNA Sequence Analysis
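The core idea described above — storing reads in a prefix tree and treating 'N' as a wildcard during lookup — can be sketched as follows. This is a minimal illustration of the approach, not the TrieDedup implementation; it assumes fixed-length reads and exact comparison up to 'N' ambiguity:

```python
class TrieNode:
    __slots__ = ("children",)
    def __init__(self):
        self.children = {}  # base character -> child TrieNode

def _matches(node, seq, i):
    # True if some stored read matches seq from position i on,
    # treating 'N' on either side as a wildcard
    if i == len(seq):
        return True
    if seq[i] == "N":
        cands = list(node.children.values())
    else:
        cands = [node.children[b] for b in (seq[i], "N") if b in node.children]
    return any(_matches(c, seq, i + 1) for c in cands)

def dedup(reads):
    """Keep one representative per group of reads identical up to 'N' ambiguity.
    Assumes all reads have the same length."""
    root, kept = TrieNode(), []
    for seq in reads:
        if _matches(root, seq, 0):
            continue  # an equivalent read is already stored
        node = root
        for b in seq:
            node = node.children.setdefault(b, TrieNode())
        kept.append(seq)
    return kept
```

The trie makes each lookup linear in read length rather than in the number of stored reads, which is the source of the speedup over pairwise comparison.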
2.
Entropy (Basel) ; 25(10)2023 Oct 12.
Article in English | MEDLINE | ID: mdl-37895565

ABSTRACT

The rapid development of information technology has made the amount of information in massive texts far exceed human intuitive cognition, and dependency parsing can effectively deal with this information overload. As domains specialize, migrating syntactic treebanks across domains and speeding up syntactic analysis models become key to efficient syntactic analysis. To realize domain migration of syntactic treebanks and improve the speed of text parsing, this paper proposes a novel approach: the Double-Array Trie and Multi-threading (DAT-MT) accelerated graph-fusion dependency parsing model. It effectively combines the specialized syntactic features of a small-scale professional-field corpus with the generalized syntactic features of a large-scale news corpus, improving the accuracy of syntactic-relation recognition. To address the high space and time complexity of the graph-fusion model, the DAT-MT method rapidly maps massive Chinese-character features to the model's prior parameters and parallelizes the computation, thereby improving parsing speed. The experimental results show that the unlabeled attachment score (UAS) and labeled attachment score (LAS) of the model improve by 13.34% and 14.82% over a model trained only on the professional-field corpus, and by 3.14% and 3.40% over a model trained only on the news corpus; both indicators are better than the deep-learning-based DDParser and LTP 4 methods. Additionally, the method achieves a speedup of about 3.7 times over a single-threaded method with a red-black tree index. Efficient and accurate syntactic analysis will benefit the real-time processing of massive texts in professional fields, such as multi-dimensional semantic correlation, professional feature extraction, and domain knowledge-graph construction.

3.
Acta Neurochir Suppl ; 134: 207-214, 2022.
Article in English | MEDLINE | ID: mdl-34862544

ABSTRACT

Natural language processing (NLP) is the task of converting unstructured human language data into structured data that a machine can understand. While its applications in healthcare are wide-ranging and growing considerably every day, this chapter focuses on one particularly relevant application for healthcare professionals: reducing the burden of clinical documentation. More specifically, the chapter discusses two studies (Gopinath et al., Fast, structured clinical documentation via contextual autocomplete. arXiv:2007.15153, 2020; Greenbaum et al., Contextual autocomplete: a novel user interface using machine learning to improve ontology usage and structured data capture for presenting problems in the emergency department, 2017) that implemented contextual autocompletion in electronic medical records, with promising results regarding time saved for clinicians. The goals of this chapter are to introduce the curious healthcare provider to the basics of natural language processing, zoom into the use case of contextual autocomplete for electronic medical records, and provide a hands-on tutorial introducing the basic NLP concepts required to build a model for predictive suggestions.


Subjects
Machine Learning, Natural Language Processing, Electronic Health Records, Humans
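The basic mechanism behind such autocomplete systems is a prefix trie that maps what the clinician has typed so far to ranked term completions. A minimal frequency-ranked sketch (the cited papers' contextual models additionally re-weight suggestions by clinical context, which is omitted here; the class and method names are illustrative):

```python
class Node:
    def __init__(self):
        self.children = {}
        self.count = 0  # > 0 iff a term ends here; value = observed frequency

class AutocompleteTrie:
    """Frequency-ranked prefix trie for term suggestions."""
    def __init__(self):
        self.root = Node()

    def add(self, term, weight=1):
        node = self.root
        for ch in term:
            node = node.children.setdefault(ch, Node())
        node.count += weight

    def suggest(self, prefix, k=3):
        node = self.root
        for ch in prefix:  # walk down to the node for this prefix
            node = node.children.get(ch)
            if node is None:
                return []
        out, stack = [], [(node, prefix)]
        while stack:       # collect every completion below the prefix node
            n, s = stack.pop()
            if n.count:
                out.append((n.count, s))
            for ch, child in n.children.items():
                stack.append((child, s + ch))
        out.sort(key=lambda t: (-t[0], t[1]))  # most frequent first
        return [s for _, s in out[:k]]
```

A contextual variant would pass a context-dependent `weight` when adding or scoring terms, boosting, say, cardiac terms when the presenting complaint is chest pain.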
4.
BMC Bioinformatics ; 17(Suppl 17): 475, 2016 Dec 23.
Article in English | MEDLINE | ID: mdl-28155635

ABSTRACT

BACKGROUND: Pooled library screens using shRNAs or CRISPR-Cas9 hold great promise for genome-wide functional studies. While pooled library screens are effective tools, erroneous barcodes can be generated during large-scale barcode production. However, no current tool can distinguish erroneous barcodes from PCR or sequencing errors in a data preprocessing step.

RESULTS: We developed Barcas, a specialized program for mapping and analyzing multiplexed barcode sequencing (barcode-seq) data. For fast and efficient mapping, Barcas uses a trie-based imperfect-matching algorithm that generates precise mapping results accounting for mismatches, shifts, and insertions and deletions (indels) in a flexible manner. Barcas provides three functions for quality control (QC) of a barcode library and distinguishes erroneous barcodes from PCR or sequencing errors. It also provides useful functions for data analysis and visualization.

CONCLUSIONS: Barcas is an all-in-one package for genome-wide pooled screens, providing mapping, data QC, library QC, statistical analysis and visualization.


Subjects
Genome, Sequence Alignment/methods, DNA Sequence Analysis/methods, Software, Algorithms, Animals, CRISPR-Cas Systems, Statistical Data Interpretation, Humans, Mice, Small Interfering RNA
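The essence of trie-based imperfect matching is a depth-first search that tolerates a bounded number of differences along each root-to-leaf path, pruning whole subtrees as soon as the error budget is exceeded. A minimal sketch handling substitutions only (Barcas also supports shifts and indels; `build_trie`/`match` are illustrative names, not the Barcas API, and equal-length barcodes are assumed):

```python
def build_trie(barcodes):
    root = {}
    for bc in barcodes:
        node = root
        for b in bc:
            node = node.setdefault(b, {})
        node["$"] = bc  # store the full barcode at its leaf
    return root

def match(read, trie, max_mismatch=1):
    """Return a known barcode within max_mismatch substitutions of read, or None."""
    stack = [(trie, 0, 0)]  # (node, position in read, mismatches so far)
    while stack:
        node, i, mm = stack.pop()
        if i == len(read):
            if "$" in node:
                return node["$"]
            continue
        for b, child in node.items():
            if b == "$":
                continue
            cost = 0 if b == read[i] else 1
            if mm + cost <= max_mismatch:  # prune branches over budget
                stack.append((child, i + 1, mm + cost))
    return None
```

Because pruning happens at the first position where the budget runs out, most of the barcode library is never visited for a given read.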
5.
BMC Genomics ; 17 Suppl 4: 465, 2016 08 18.
Article in English | MEDLINE | ID: mdl-27557423

ABSTRACT

BACKGROUND: Motif search is an important step in extracting meaningful patterns from biological data. The general motif search problem is intractable, and there is a pressing need for efficient exact and approximation algorithms to solve it. In this paper, we present several novel, exact, sequential and parallel algorithms for solving the (l,d) Edit-distance-based Motif Search (EMS) problem: given two integers l, d and n biological strings, find all strings of length l that appear in each input string with at most d errors of the types substitution, insertion and deletion.

METHODS: One popular technique for solving the problem is to explore, for each input string, the set of all possible l-mers that belong to the d-neighborhood of any substring of that input string, and output those common to all input strings. We introduce a novel and provably efficient neighborhood exploration technique. We show that it suffices to consider the candidates in the neighborhood that are at a distance of exactly d. We compactly represent these candidate motifs using wildcard characters and efficiently explore them with very few repetitions. Our sequential algorithm uses a trie-based data structure to efficiently store and sort the candidate motifs. Our parallel algorithm, in a multi-core shared-memory setting, uses arrays for storage and a novel modification of radix sort for sorting the candidate motifs.

RESULTS: Algorithms for EMS are customarily evaluated on challenging instances such as (8,1), (12,2), (16,3), (20,4), and so on. The best previously known algorithm, EMS1, is sequential and solves instances up to (16,3) in an estimated 3 days. Our sequential algorithms are more than 20 times faster on (16,3). On other hard instances, such as (9,2), (11,3) and (13,4), our algorithms are much faster still. Our parallel algorithm achieves more than 600% scaling when using 16 threads.

CONCLUSIONS: Our algorithms have pushed forward the state of the art of EMS solvers, and we believe the techniques introduced in this paper are also applicable to other motif search problems, such as Planted Motif Search (PMS) and Simple Motif Search (SMS).


Subjects
Algorithms, Amino Acid Motifs/genetics, Nucleotide Motifs/genetics, Software, Computational Biology/methods, DNA Sequence Analysis/methods, Protein Sequence Analysis/methods
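For reference, the EMS problem definition itself can be stated executably as a brute force over all candidate l-mers, each scored by semi-global edit distance against every input string. This baseline is only feasible for tiny l and is not the paper's optimized neighborhood-exploration technique; `occurs`/`ems_bruteforce` are illustrative names:

```python
from itertools import product

ALPHABET = "ACGT"

def occurs(motif, s, d):
    # semi-global DP: edit distance of motif vs. the best-matching substring of s
    prev = [0] * (len(s) + 1)  # an occurrence may start anywhere in s
    for i, mc in enumerate(motif, 1):
        cur = [i] + [0] * len(s)
        for j, sc in enumerate(s, 1):
            cur[j] = min(prev[j - 1] + (mc != sc),  # match / substitution
                         prev[j] + 1,               # delete a motif character
                         cur[j - 1] + 1)            # insert a string character
        prev = cur
    return min(prev) <= d  # an occurrence may end anywhere in s

def ems_bruteforce(strings, l, d):
    """All l-mers over ALPHABET occurring in every string with <= d edits."""
    return ["".join(c) for c in product(ALPHABET, repeat=l)
            if all(occurs("".join(c), s, d) for s in strings)]
```

The paper's contribution is precisely avoiding this 4^l enumeration by exploring only the distance-exactly-d neighborhood, compactly encoded with wildcards.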
6.
Healthc Technol Lett ; 4(4): 122-128, 2017 Aug.
Article in English | MEDLINE | ID: mdl-28868148

ABSTRACT

Low-power wearable devices for disease diagnosis can be used anytime and anywhere. They are non-invasive and pain-free, improving quality of life. However, these devices are resource-constrained in terms of memory and processing capability: the memory constraint limits the number of patterns a device can store, and the processing constraint delays its response. Designing a robust, highly accurate classification system under these constraints is a challenging task. In this Letter, to resolve this problem, a novel architecture for weightless neural networks (WNNs) is proposed. It uses variable-sized random access memories to optimise memory usage and a modified binary trie data structure to reduce test time. In addition, a bio-inspired genetic algorithm is employed to improve accuracy. The proposed architecture is evaluated on various disease datasets using software and hardware realisations. The experimental results show that the proposed architecture achieves better accuracy, memory savings and test time than standard WNNs, and better accuracy than conventional neural-network-based classifiers. The proposed architecture can thus serve as a core component of low-power wearable devices, addressing their memory, accuracy and response-time constraints.
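A weightless neural network classifies by addressing RAM nodes with tuples of input bits rather than by weighted sums, which is what makes it cheap on constrained hardware. The following is a textbook WiSARD-style sketch of that idea, not the Letter's variable-sized-RAM or binary-trie architecture; all names are illustrative:

```python
import random

class Discriminator:
    """One WiSARD-style discriminator: each RAM node remembers the bit-tuple
    addresses it observed during training."""
    def __init__(self, n_bits, tuple_size, seed=0):
        rng = random.Random(seed)
        order = list(range(n_bits))
        rng.shuffle(order)               # fixed random input-to-RAM mapping
        self.tuples = [order[i:i + tuple_size]
                       for i in range(0, n_bits, tuple_size)]
        self.rams = [set() for _ in self.tuples]

    def _addresses(self, bits):
        for idx, ram_bits in enumerate(self.tuples):
            yield idx, tuple(bits[i] for i in ram_bits)

    def train(self, bits):
        for idx, addr in self._addresses(bits):
            self.rams[idx].add(addr)     # "write a 1" at this address

    def score(self, bits):
        # number of RAM nodes that recognize their address
        return sum(addr in self.rams[idx] for idx, addr in self._addresses(bits))

def classify(bits, discriminators):
    """Label whose discriminator fires the most RAM nodes."""
    return max(discriminators, key=lambda lbl: discriminators[lbl].score(bits))
```

Training is a handful of set insertions and classification a handful of lookups — no multiplications — which is the property the Letter's memory and test-time optimisations build on.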

7.
Algorithms Mol Biol ; 11: 3, 2016.
Article in English | MEDLINE | ID: mdl-27087830

ABSTRACT

BACKGROUND: High-throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects have started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences "colored" by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers (strings of length k) and stores them in vertices.

RESULTS: In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the Bloom Filter Trie (BFT). The data structure can store and compress a set of colored k-mers and efficiently traverse the graph. The Bloom Filter Trie was used to index and query different pan-genome datasets. Compared to another state-of-the-art data structure, the BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, the BFT was about 52-66 times faster while using about 5.5-14.3 times less memory.

CONCLUSION: We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors based on a new representation of vertices that compresses and indexes shared substrings. Vertices use basic data structures for lightweight substring storage, as well as Bloom filters for efficient trie and graph traversals. Experimental results show better performance than another state-of-the-art data structure.

AVAILABILITY: https://www.github.com/GuillaumeHolley/BloomFilterTrie.
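A Bloom filter answers approximate set-membership queries with no false negatives, which is what lets BFT vertices prune trie and graph traversals cheaply. A minimal k-mer indexing sketch, illustrating only the Bloom filter ingredient and not the BFT itself (the salted-SHA-256 hashing scheme is an assumption for illustration):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions setting bits in an m-bit mask.
    No false negatives; false positives at a small, tunable rate."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for i in range(self.k):  # derive k hash functions by salting one hash
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

def kmers(seq, k):
    """All overlapping k-mers of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))
```

During traversal, a negative answer from the filter definitively rules out a branch, so the expensive exact structures are consulted only on (mostly true) positives.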

8.
J Bioinform Comput Biol ; 12(5): 1450027, 2014 Oct.
Article in English | MEDLINE | ID: mdl-25362841

ABSTRACT

The discovery of common RNA secondary structure motifs is an important problem in bioinformatics, as the presence of such motifs is usually associated with key biological functions. However, identifying structural motifs is far from easy. Unlike sequence motifs, which have conserved bases, structural motifs share common structural arrangements even when the underlying sequences differ. Over the past few years, hundreds of algorithms have been published for discovering sequence motifs, while less work has been done on structural motifs, and current structural motif discovery algorithms are limited in accuracy and scalability. In this paper, we present an incremental and scalable algorithm for discovering RNA secondary structure motifs, named IncMD. We treat structural motif discovery as a frequent pattern mining problem and tackle it with a modified Apriori algorithm. IncMD uses trie-based linked lists of prefixes (LLP) to accelerate the search and retrieval of patterns, support counting, and candidate generation. We modify the candidate generation step to adapt it to the RNA secondary structure representation. IncMD constructs frequent patterns incrementally from RNA secondary structure basic elements, using nesting and joining operations. The notion of a motif group is introduced to simulate an alignment of motifs that differ only in their number of unpaired bases. In addition, we use a cluster beam approach to select the motifs that survive to the next iteration of the search. Results indicate that IncMD can outperform some of the available structural motif discovery algorithms in terms of sensitivity (Sn), positive predictive value (PPV), and specificity (Sp). The empirical results also show that the algorithm is scalable and runs faster than all of the compared algorithms.


Subjects
Algorithms, Nucleic Acid Conformation, RNA/chemistry, Base Sequence, Computational Biology, Computer Simulation, Data Mining, Nucleic Acid Databases, Molecular Models
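The frequent-pattern-mining framing can be illustrated with the classic Apriori scheme over itemsets; IncMD replaces the simple set-union join below with nesting and joining operations over RNA secondary-structure elements, which this sketch omits:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Classic Apriori frequent-itemset mining: grow candidates level by level,
    pruning any candidate with an infrequent subset before counting support."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def support(itemset):
        return sum(itemset <= t for t in transactions)

    frequent = {}
    level = [s for s in (frozenset([i]) for i in sorted(items))
             if support(s) >= min_support]
    while level:
        for s in level:
            frequent[s] = support(s)
        # join step: combine level-k sets sharing k-1 items
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        # prune step, then count support of the survivors
        level = [c for c in candidates
                 if all(frozenset(sub) in frequent
                        for sub in combinations(c, len(c) - 1))
                 and support(c) >= min_support]
    return frequent
```

IncMD's LLP structure plays the role of the support-counting pass here, and its cluster beam further trims `level` before the next iteration.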