Results 1 - 19 of 19
1.
Bioinformatics ; 39(5), 2023 05 04.
Article in English | MEDLINE | ID: mdl-37171886

ABSTRACT

SUMMARY: Finding Maximum Exact Matches, i.e. matches between two strings that cannot be further extended to the left or right, is a classic string problem with applications in genome-to-genome comparisons. The existing tools rarely explicitly address the problem of MEM finding for a pair of very similar genomes, which may be computationally challenging. We present copMEM2, a multithreaded implementation of its predecessor. Together with a few optimizations, including a carefully built predecessor query data structure and sort procedure selection, and taking care for highly similar data, copMEM2 allows to compute all MEMs of minimum length 50 between the human and mouse genomes in 59 s, using 10.40 GB of RAM and 12 threads, being at least a few times faster than its main contenders. On a pair of human genomes, hg18 and hg19, the results are 324 s and 16.57 GB, respectively. AVAILABILITY AND IMPLEMENTATION: copMEM2 is available at https://github.com/wbieniec/copmem2.
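
To make the definition above concrete, here is a minimal brute-force sketch of MEM finding: a match is reported only if it cannot be extended to the left or to the right. It is a quadratic illustration for short strings, not the copMEM2 algorithm; the function name `find_mems` and the toy inputs are assumptions made for this example.

```python
def find_mems(ref: str, qry: str, min_len: int) -> list[tuple[int, int, int]]:
    """Return (ref_pos, qry_pos, length) for every maximal exact match
    of length >= min_len. Brute force, for illustration only."""
    mems = []
    for i in range(len(ref)):
        for j in range(len(qry)):
            if ref[i] != qry[j]:
                continue
            # If the preceding characters also agree, the match is not
            # left-maximal; it will be reported from an earlier start.
            if i > 0 and j > 0 and ref[i - 1] == qry[j - 1]:
                continue
            # Extend to the right while the characters agree.
            k = 0
            while (i + k < len(ref) and j + k < len(qry)
                   and ref[i + k] == qry[j + k]):
                k += 1
            if k >= min_len:
                mems.append((i, j, k))
    return mems


if __name__ == "__main__":
    print(find_mems("ACGTACGT", "TTACGTAA", 4))
```

Real tools avoid the quadratic scan by indexing sampled seeds of one genome and looking up seeds of the other.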


Subjects
Algorithms , Software , Animals , Humans , Mice , Human Genome , Cell Movement , Protein Transport , DNA Sequence Analysis/methods
2.
Int J Mol Sci ; 23(9), 2022 Apr 26.
Article in English | MEDLINE | ID: mdl-35563141

ABSTRACT

Acute lymphoblastic leukemia (ALL) is the most common hematological malignancy affecting pediatric patients. ALL treatment regimens with cytostatics manifest substantial toxicity and have reached the maximum of well-tolerated doses. One potential approach for improving treatment efficiency could be supplementation of the current regimen with naturally occurring phytochemicals with anti-cancer properties. Nutraceuticals such as quercetin, curcumin, resveratrol, and genistein have been studied in anti-cancer therapy, but their application is limited by their low bioavailability. However, their cooperative activity could potentially increase their efficiency at low, bioavailable doses. We studied their cooperative effect on the viability of a human ALL MOLT-4 cell line in vitro at the concentration considered to be in the bioavailable range in vivo. To analyze their potential side effect on the viability of non-tumor cells, we evaluated their toxicity on a normal human foreskin fibroblast cell line (BJ). In both cell lines, we also measured specific indicators of cell death, changes in cell membrane permeability (CMP), and mitochondrial membrane potential (MMP). Even at a low bioavailable concentration, genistein and curcumin decreased MOLT-4 viability, and their combination had a significant interactive effect. While resveratrol and quercetin did not affect MOLT-4 viability, together they enhanced the effect of the genistein/curcumin mix, significantly inhibiting MOLT-4 population growth in vitro. Moreover, the analyzed phytochemicals and their combinations did not affect the BJ cell line. In both cell lines, they induced a decrease in MMP and correlating CMP changes, but in non-tumor cells, both metabolic activity and cell membrane continuity were restored in time. Conclusions: The results indicate that the interactive activity of analyzed phytochemicals can induce an anti-cancer effect on ALL cells without a significant effect on non-tumor cells. It implies that the application of the combinations of phytochemicals as an anti-cancer treatment supplement could be worth further investigation regardless of their low bioavailability.


Subjects
Curcumin , Precursor Cell Lymphoblastic Leukemia-Lymphoma , Precursor T-Cell Lymphoblastic Leukemia-Lymphoma , Apoptosis , Cell Line , Tumor Cell Line , Curcumin/pharmacology , Curcumin/therapeutic use , Genistein/pharmacology , Genistein/therapeutic use , Humans , Phytochemicals/pharmacology , Phytochemicals/therapeutic use , Precursor Cell Lymphoblastic Leukemia-Lymphoma/drug therapy , Precursor T-Cell Lymphoblastic Leukemia-Lymphoma/drug therapy , Quercetin/pharmacology , Quercetin/therapeutic use , Resveratrol/pharmacology , Resveratrol/therapeutic use
3.
Gigascience ; 11, 2022 01 27.
Article in English | MEDLINE | ID: mdl-35084032

ABSTRACT

BACKGROUND: Genomes within the same species reveal large similarity, exploited by specialized multiple genome compressors. The existing algorithms and tools are however targeted at large, e.g., mammalian, genomes, and their performance on bacteria strains is rather moderate. RESULTS: In this work, we propose MBGC, a specialized genome compressor making use of specific redundancy of bacterial genomes. Its characteristic features are finding both direct and reverse-complemented LZ-matches, as well as a careful management of a reference buffer in a multi-threaded implementation. Our tool is not only compression efficient but also fast. On a collection of 168,311 bacterial genomes, totalling 587 GB, we achieve a compression ratio of approximately a factor of 1,265 and compression (respectively decompression) speed of ∼1,580 MB/s (respectively 780 MB/s) using 8 hardware threads, on a computer with a 14-core/28-thread CPU and a fast SSD, being almost 3 times more succinct and >6 times faster in the compression than the next best competitor.


Subjects
Data Compression , Algorithms , Bacterial Genome , DNA Sequence Analysis , Software
4.
Bioinformatics ; 36(7): 2082-2089, 2020 04 01.
Article in English | MEDLINE | ID: mdl-31893286

ABSTRACT

MOTIVATION: The amount of sequencing data from high-throughput sequencing technologies grows at a pace exceeding the one predicted by Moore's law. One of the basic requirements is to efficiently store and transmit such huge collections of data. Despite significant interest in designing FASTQ compressors, they are still imperfect in terms of compression ratio or decompression resources. RESULTS: We present Pseudogenome-based Read Compressor (PgRC), an in-memory algorithm for compressing the DNA stream, based on the idea of building an approximation of the shortest common superstring over high-quality reads. Experiments show that PgRC wins in compression ratio over its main competitors, SPRING and Minicom, by up to 15 and 20% on average, respectively, while being comparably fast in decompression. AVAILABILITY AND IMPLEMENTATION: PgRC can be downloaded from https://github.com/kowallus/PgRC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Data Compression , Software , Algorithms , Genomics , High-Throughput Nucleotide Sequencing , DNA Sequence Analysis
5.
Bioinformatics ; 35(4): 677-678, 2019 02 15.
Article in English | MEDLINE | ID: mdl-30060142

ABSTRACT

MOTIVATION: Genome-to-genome comparisons require designating anchor points, which are given by Maximum Exact Matches (MEMs) between their sequences. For large genomes this is a challenging problem and the performance of existing solutions, even in parallel regimes, is not quite satisfactory. RESULTS: We present a new algorithm, copMEM, that allows to sparsely sample both input genomes, with sampling steps being coprime. Despite being a single-threaded implementation, copMEM computes all MEMs of minimum length 100 between the human and mouse genomes in less than 2 minutes, using 7 GB of RAM memory. AVAILABILITY AND IMPLEMENTATION: https://github.com/wbieniec/copmem. SUPPLEMENTARY DATA: Supplementary data are available at Bioinformatics online.
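
The reason coprime sampling steps help can be sketched with the Chinese remainder theorem: if one genome is sampled every `k1` positions and the other every `k2` positions, with `k1` and `k2` coprime, then any pair of aligned positions reaches a point sampled in both sequences within fewer than `k1 * k2` steps. The snippet below only checks that arithmetic fact; how copMEM actually uses it (seed length, hashing, choice of steps) is not stated in the abstract, and the name `aligned_offset` is an assumption made for this example.

```python
from math import gcd


def aligned_offset(i: int, j: int, k1: int, k2: int) -> int:
    """Smallest t >= 0 such that position i + t is a multiple of k1 and
    position j + t is a multiple of k2. Requires coprime k1 and k2."""
    if gcd(k1, k2) != 1:
        raise ValueError("sampling steps must be coprime")
    # By the Chinese remainder theorem exactly one t in [0, k1 * k2)
    # satisfies both congruences.
    for t in range(k1 * k2):
        if (i + t) % k1 == 0 and (j + t) % k2 == 0:
            return t
    raise AssertionError("unreachable for coprime steps")


if __name__ == "__main__":
    # A match starting at position 3 in one sequence and 6 in the other
    # hits a doubly sampled position after 2 characters (steps 5 and 8).
    print(aligned_offset(3, 6, 5, 8))
```

Under this reading, a match that is long enough relative to `k1 * k2` always contains a seed sampled in both genomes, so sparse sampling does not lose it.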


Subjects
Genome , Software , Algorithms , Animals , Computational Biology , Humans , Mice
6.
Bioinformatics ; 35(12): 2043-2050, 2019 06 01.
Article in English | MEDLINE | ID: mdl-30407485

ABSTRACT

MOTIVATION: Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. The reduction of sequencing costs implies a need for algorithms able to process increasing amounts of generated data in reasonable time. RESULTS: We present Whisper, an accurate and high-performant mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism as well as storing temporary data on disk result in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution works in about 15% of the time needed by the well-known BWA-MEM and Bowtie2 tools at a comparable accuracy, validated in a variant calling pipeline. AVAILABILITY AND IMPLEMENTATION: Whisper is available for free from https://github.com/refresh-bio/Whisper or http://sun.aei.polsl.pl/REFRESH/Whisper/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Software , Base Sequence , Genome , High-Throughput Nucleotide Sequencing , DNA Sequence Analysis
7.
Bioinformatics ; 34(24): 4290-4292, 2018 12 15.
Article in English | MEDLINE | ID: mdl-29939210

ABSTRACT

Motivation: The many thousands of high-quality genomes available now-a-days imply a shift from single genome to pan-genomic analyses. A basic algorithmic building brick for such a scenario is online search over a collection of similar texts, a problem with surprisingly few solutions presented so far. Results: We present SOPanG, a simple tool for exact pattern matching over an elastic-degenerate string, a recently proposed simplified model for the pan-genome. Thanks to bit-parallelism, it achieves pattern matching speeds above 400 MB/s, more than an order of magnitude higher than of other software. Availability and implementation: SOPanG is available for free from: https://github.com/MrAlexSee/sopang. Supplementary information: Supplementary data are available at Bioinformatics online.
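
As a rough illustration of bit-parallelism, the sketch below implements the classic Shift-And algorithm for exact matching over an ordinary string. It is not SOPanG and does not handle elastic-degenerate strings; it only shows how a single machine word (here a Python integer) can track all pattern prefixes at once. The name `shift_and_search` is an assumption made for this example.

```python
def shift_and_search(pattern: str, text: str) -> list[int]:
    """Return the start positions of all occurrences of pattern in text
    using the Shift-And bit-parallel algorithm."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit i set iff pattern[i] == c.
    masks: dict[str, int] = {}
    for idx, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << idx)
    accept = 1 << (m - 1)  # bit set when the whole pattern has matched
    state = 0              # bit i set iff pattern[:i + 1] ends at this text position
    hits = []
    for pos, ch in enumerate(text):
        # Advance every live prefix by one character and start a new one.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - m + 1)
    return hits


if __name__ == "__main__":
    print(shift_and_search("ACA", "ACACAGACA"))
```

Each text character costs a constant number of word operations as long as the pattern fits in a machine word, which is where the speed of this family of algorithms comes from.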


Subjects
Genome , Genomics , Software , Algorithms , Genome/genetics , Genomics/methods , Information Storage and Retrieval , Internet , Software/standards
8.
Bioinformatics ; 32(7): 1115-7, 2016 04 01.
Article in English | MEDLINE | ID: mdl-26615213

ABSTRACT

MOTIVATION: Data compression is crucial in effective handling of genomic data. Among several recently published algorithms, ERGC seems to be surprisingly good, easily beating all of the competitors. RESULTS: We evaluated ERGC and the previously proposed algorithms GDC and iDoComp, which are the ones used in the original paper for comparison, on a wide data set including 12 assemblies of human genome (instead of only four of them in the original paper). ERGC wins only when one of the genomes (referential or target) contains mixed-cased letters (which is the case for only the two Korean genomes). In all other cases ERGC is on average an order of magnitude worse than GDC and iDoComp. CONTACT: sebastian.deorowicz@polsl.pl, iochoa@stanford.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Data Compression , DNA Sequence Analysis , Algorithms , Genome , Human Genome , Genomics , Humans
9.
PLoS One ; 10(7): e0133198, 2015.
Article in English | MEDLINE | ID: mdl-26182400

ABSTRACT

We propose a lightweight data structure for indexing and querying collections of NGS reads data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive to the existing algorithms in the space use, query times, or both. The main applications of our index include variant calling, error correction and analysis of reads from RNA-seq experiments.


Subjects
Algorithms , Genome , RNA Sequence Analysis/statistics & numerical data , Software , Animals , Caenorhabditis elegans/genetics , Datasets as Topic , Escherichia coli/genetics , High-Throughput Nucleotide Sequencing , Humans , RNA Sequence Analysis/methods
10.
11.
Bioinformatics ; 31(10): 1569-76, 2015 May 15.
Article in English | MEDLINE | ID: mdl-25609798

ABSTRACT

MOTIVATION: Building the histogram of occurrences of every k-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of k-mer counting. Its applications include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. The tremendous amounts of NGS data require fast algorithms for k-mer counting, preferably using moderate amounts of memory. RESULTS: We present a novel method for k-mer counting, on large datasets about twice as fast as the strongest competitors (Jellyfish 2, KMC 1), using about 12 GB (or less) of RAM. Our disk-based method bears some resemblance to MSPKmerCounter, yet replacing the original minimizers with signatures (a carefully selected subset of all minimizers) and using (k, x)-mers allows to significantly reduce the I/O and a highly parallel overall architecture allows to achieve unprecedented processing speeds. For example, KMC 2 counts the 28-mers of a human reads collection with 44-fold coverage (106 GB of compressed size) in about 20 min, on a 6-core Intel i7 PC with a solid-state disk.
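
The minimizer idea mentioned above can be sketched as follows: the minimizer of a k-mer is its lexicographically smallest substring of length m, and consecutive k-mers often share it, so they can be routed to the same bin. This shows plain minimizers only; the signatures and (k, x)-mers of KMC 2 are refinements not reproduced here, and the function names are assumptions made for this example.

```python
def minimizer(kmer: str, m: int) -> str:
    """Lexicographically smallest substring of length m in kmer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def bin_kmers_by_minimizer(seq: str, k: int, m: int) -> dict[str, list[str]]:
    """Group all k-mers of seq by their minimizer."""
    bins: dict[str, list[str]] = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        bins.setdefault(minimizer(kmer, m), []).append(kmer)
    return bins


if __name__ == "__main__":
    print(bin_kmers_by_minimizer("GATTACA", k=5, m=3))
```

Because overlapping k-mers tend to land in the same bin, each bin can be written to disk compactly and counted independently of the others.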


Subjects
Algorithms , Computational Biology/methods , Sequence Alignment/methods , DNA Sequence Analysis/methods , Software , Animals , Humans
12.
Bioinformatics ; 31(9): 1389-95, 2015 May 01.
Article in English | MEDLINE | ID: mdl-25536966

ABSTRACT

MOTIVATION: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions for this problem are disk based, where the better of these two, from Cox et al. (2012), is based on the Burrows-Wheeler transform (BWT) and achieves 0.518 bits per base for a 134.0 Gbp human genome sequencing collection with almost 45-fold coverage. RESULTS: We propose overlapping reads compression with minimizers, a compression algorithm dedicated to sequencing reads (DNA only). Our method makes use of a conceptually simple and easily parallelizable idea of minimizers, to obtain 0.317 bits per base as the compression ratio, allowing to fit the 134.0 Gbp dataset into only 5.31 GB of space. AVAILABILITY AND IMPLEMENTATION: http://sun.aei.polsl.pl/orcom under a free license. CONTACT: sebastian.deorowicz@polsl.pl SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
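
The link between bits per base and the reported archive size can be checked with simple arithmetic. The sketch below assumes 1 GB = 10^9 bytes, which is an assumption of this example rather than something the abstract states; the function name is also hypothetical.

```python
def compressed_size_gb(bases: float, bits_per_base: float) -> float:
    """Archive size in GB (10**9 bytes, an assumption) for a DNA stream
    of the given length compressed at the given rate."""
    total_bits = bases * bits_per_base
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    # 134.0 Gbp at 0.317 bits per base, as quoted in the abstract.
    print(round(compressed_size_gb(134.0e9, 0.317), 2))
    # The same dataset at 0.518 bits per base, the BWT-based figure.
    print(round(compressed_size_gb(134.0e9, 0.518), 2))
```

At 0.317 bits per base this gives about 5.31 GB, matching the figure in the abstract; at 0.518 bits per base the same data would need about 8.68 GB.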


Subjects
Data Compression , Genomics/methods , DNA Sequence Analysis/methods , Algorithms , Animals , Chickens/genetics , Human Genome , Humans
13.
PLoS One ; 9(10): e109384, 2014.
Article in English | MEDLINE | ID: mdl-25289699

ABSTRACT

The availability of thousands of individual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key operation useful in such analyses is aligning sequencing reads against a collection of genomes, which is costly with the use of existing algorithms due to their large memory requirements. We present MuGI, Multiple Genome Index, which reports all occurrences of a given pattern, in exact and approximate matching model, against a collection of thousand(s) genomes. Its unique feature is the small index size, which is customisable. It fits in a standard computer with 16-32 GB, or even 8 GB, of RAM, for the 1000GP collection of 1092 diploid human genomes. The solution is also fast. For example, the exact matching queries (of average length 150 bp) are handled in average time of 39 µs and with up to 3 mismatches in 373 µs on the test PC with the index size of 13.4 GB. For a smaller index, occupying 7.4 GB in memory, the respective times grow to 76 µs and 917 µs. Software is available at http://sun.aei.polsl.pl/mugi under a free license. Data S1 is available at PLOS One online.


Subjects
Abstracting and Indexing , Computers , Genome , Genomics , Algorithms , Computational Biology/methods , Datasets as Topic , Genomics/methods
14.
Algorithms Mol Biol ; 8(1): 25, 2013 Nov 18.
Article in English | MEDLINE | ID: mdl-24252160

ABSTRACT

Post-Sanger sequencing methods produce tons of data, and there is a general agreement that the challenge to store and process them must be addressed with data compression. In this review we first answer the question "why compression" in a quantitative manner. Then we also answer the questions "what" and "how", by sketching the fundamental compression ideas, describing the main sequencing data types and formats, and comparing the specialized compression algorithms and tools. Finally, we go back to the question "why compression" and give other, perhaps surprising answers, demonstrating the pervasiveness of data compression techniques in computational biology.

15.
J Comput Biol ; 20(9): 621-30, 2013 Sep.
Article in English | MEDLINE | ID: mdl-23919388

ABSTRACT

The problem of finding the locations in DNA sequences that match a given motif describing the binding specificities of a transcription factor (TF) has many applications in computational biology. This problem has been extensively studied when the position weight matrix (PWM) model is used to represent motifs. We investigate it under the feature motif model, a generalization of the PWM model that does not assume independence between positions in the pattern while being compatible with the original PWM. We present a new method for finding the binding sites of a transcription factor in a DNA sequence when the feature motif model is used to describe transcription factor binding specificities. The experimental results on random and real data show that the search algorithm is fast in practice.
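
For orientation, the baseline position weight matrix (PWM) search that the feature motif model generalizes can be sketched as a sliding-window sum of per-position scores. The matrix values below are invented for illustration, and the sketch does not implement the feature motif model or the paper's search algorithm; `pwm_scan` is a hypothetical name.

```python
def pwm_scan(seq: str, pwm: list[dict[str, int]], threshold: int) -> list[tuple[int, int]]:
    """Return (start, score) for every window of seq whose PWM score
    reaches the threshold. Positions are scored independently."""
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        score = sum(pwm[offset][seq[start + offset]] for offset in range(width))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Toy 3-column matrix favouring the motif "ACG" (made-up integer scores).
TOY_PWM = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": 0},
]

if __name__ == "__main__":
    print(pwm_scan("TACGACTA", TOY_PWM, threshold=4))
```

The independence of columns is visible in the plain sum; a model with dependencies between positions would add terms that look at more than one position at a time.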


Subjects
Genetic Models , Response Elements/genetics , Transcription Factors/genetics , Amino Acid Motifs , Computational Biology/methods
16.
Bioinformatics ; 29(20): 2572-8, 2013 Oct 15.
Article in English | MEDLINE | ID: mdl-23969136

ABSTRACT

MOTIVATION: Genomic repositories are rapidly growing, as witnessed by the 1000 Genomes or the UK10K projects. Hence, compression of multiple genomes of the same species has become an active research area in the past years. The well-known large redundancy in human sequences is not easy to exploit because of huge memory requirements from traditional compression algorithms. RESULTS: We show how to obtain several times higher compression ratio than of the best reported results, on two large genome collections (1092 human and 775 plant genomes). Our inputs are variant call format files restricted to their essential fields. More precisely, our novel Ziv-Lempel-style compression algorithm squeezes a single human genome to ∼400 KB. The key to high compression is to look for similarities across the whole collection, not just against one reference sequence, what is typical for existing solutions. AVAILABILITY: http://sun.aei.polsl.pl/tgc (also as Supplementary Material) under a free license. Supplementary data: Supplementary data are available at Bioinformatics online.


Subjects
Arabidopsis/genetics , Data Compression/methods , Human Genome , Plant Genome , High-Throughput Nucleotide Sequencing/methods , Algorithms , Base Sequence , Genetic Databases , Genomics/methods , Humans , DNA Sequence Analysis/methods
17.
BMC Bioinformatics ; 14: 160, 2013 May 16.
Article in English | MEDLINE | ID: mdl-23679007

ABSTRACT

BACKGROUND: The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol long substring in a given text, is important for many bioinformatics applications. They include developing de Bruijn graph genome assemblers, fast multiple sequence alignment and repeat detection. RESULTS: We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it is capable of counting the statistics for short-read human genome data, in input gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. No other algorithm for most tested settings of this problem and mammalian-size data can accomplish this task in comparable time. Our solution also belongs to memory-frugal ones; most competitive algorithms cannot efficiently work on a PC with 16 GB of memory for such massive data. CONCLUSIONS: By making use of cheap disk space and exploiting CPU and I/O parallelism we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may allow to solve at least some bioinformatics problems with massive data on a commodity personal computer.
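
The problem statement above, a histogram of every k-symbol substring, can be written down in a few lines when the data fit in memory. This is only the definition made executable; it is not the disk-based, parallel KMC procedure, and `count_kmers` is a name chosen for this example.

```python
from collections import Counter


def count_kmers(text: str, k: int) -> Counter:
    """Histogram of all k-symbol substrings of text (in-memory, naive)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


if __name__ == "__main__":
    print(count_kmers("ACGACGT", 3))
```

For sequencing-scale inputs the dictionary no longer fits in RAM, which is the motivation for distributing k-mers into disk bins and counting each bin separately.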


Subjects
Algorithms , Genomics/methods , Microcomputers , Animals , Caenorhabditis elegans/genetics , Human Genome , Humans , Sequence Alignment , DNA Sequence Analysis , Software
18.
Bioinformatics ; 27(21): 2979-86, 2011 Nov 01.
Article in English | MEDLINE | ID: mdl-21896510

ABSTRACT

MOTIVATION: Storing, transferring and maintaining genomic databases becomes a major challenge because of the rapid technology progress in DNA sequencing and correspondingly growing pace at which the sequencing data are being produced. Efficient compression, with support for extraction of arbitrary snippets of any sequence, is the key to maintaining those huge amounts of data. RESULTS: We present an LZ77-style compression scheme for relative compression of multiple genomes of the same species. While the solution bears similarity to known algorithms, it offers significantly higher compression ratios at compression speed over an order of magnitude greater. In particular, 69 differentially encoded human genomes are compressed over 400 times at fast compression, or even 1000 times at slower compression (the reference genome itself needs much more space). Adding fast random access to text snippets decreases the ratio to ~300. AVAILABILITY: GDC is available at http://sun.aei.polsl.pl/gdc. CONTACT: sebastian.deorowicz@polsl.pl. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
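
The general idea of relative LZ77-style compression can be sketched as a greedy factorization: the target genome is rewritten as a list of (position, length) copies from a reference, with literal characters where no useful match exists. This is a generic, quadratic illustration and not GDC's actual scheme; the function names and the `min_match` parameter are assumptions made for this example.

```python
def relative_lz_encode(reference: str, target: str, min_match: int = 2) -> list:
    """Greedily factor target into (ref_pos, length) copies from reference,
    falling back to single literal characters for short matches."""
    factors = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Naive longest-match search; real tools use a hash or suffix index.
        for p in range(len(reference)):
            n = 0
            while (p + n < len(reference) and i + n < len(target)
                   and reference[p + n] == target[i + n]):
                n += 1
            if n > best_len:
                best_pos, best_len = p, n
        if best_len >= min_match:
            factors.append((best_pos, best_len))
            i += best_len
        else:
            factors.append(target[i])  # literal character
            i += 1
    return factors


def relative_lz_decode(reference: str, factors: list) -> str:
    """Rebuild the target from the reference and the factor list."""
    parts = []
    for f in factors:
        if isinstance(f, str):
            parts.append(f)
        else:
            pos, length = f
            parts.append(reference[pos:pos + length])
    return "".join(parts)


if __name__ == "__main__":
    ref, tgt = "ACGTACGT", "ACGTTACGA"
    enc = relative_lz_encode(ref, tgt)
    print(enc, relative_lz_decode(ref, enc) == tgt)
```

When genomes of the same species differ in only a small fraction of positions, the factor list is far shorter than the target itself, which is the source of the high ratios reported for this family of methods.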


Subjects
Data Compression , Genomics/methods , DNA Sequence Analysis , Algorithms , Nucleic Acid Databases , Human Genome , Humans
19.
Bioinformatics ; 27(6): 860-2, 2011 Mar 15.
Article in English | MEDLINE | ID: mdl-21252073

ABSTRACT

MOTIVATION: Modern sequencing instruments are able to generate at least hundreds of millions of short reads of genomic data. Those huge volumes of data require effective means to store them, provide quick access to any record and enable fast decompression. RESULTS: We present a specialized compression algorithm for genomic data in FASTQ format which dominates its competitor, G-SQZ, as is shown on a number of datasets from the 1000 Genomes Project (www.1000genomes.org). AVAILABILITY: DSRC is freely available at http://sun.aei.polsl.pl/dsrc.


Subjects
Data Compression/methods , DNA Sequence Analysis , Algorithms , Base Sequence , Computational Biology/methods , Genomics , Internet