Results 1 - 11 of 11
1.
Nat Methods; 19(4): 441-444, 2022 Apr.
Article in English | MEDLINE | ID: mdl-35347321

ABSTRACT

The cost of maintaining exabytes of data produced by sequencing experiments every year has become a major issue in today's genomic research. In spite of the increasing popularity of third-generation sequencing, the existing algorithms for compressing long reads exhibit only a minor advantage over the general-purpose gzip. We present CoLoRd, an algorithm able to reduce the size of third-generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.


Subjects
Genomics; High-Throughput Nucleotide Sequencing; Algorithms; Genome; Sequence Analysis, DNA; Software
2.
Bioinformatics; 38(5): 1447-1449, 2022 Feb 07.
Article in English | MEDLINE | ID: mdl-34904625

ABSTRACT

SUMMARY: Phage-Host Interaction Search Tool (PHIST) predicts prokaryotic hosts of viruses based on exact matches between viral and host genomes. It improves host prediction accuracy at the species level over current alignment-based tools (on average by 3 percentage points) as well as alignment-free and CRISPR-based tools (by 14-20 percentage points). PHIST is also two orders of magnitude faster than alignment-based tools, making it suitable for metagenomics studies. AVAILABILITY AND IMPLEMENTATION: GNU-licensed C++ code wrapped in a Python API is available at: https://github.com/refresh-bio/phist. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
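
The idea of scoring hosts by exact matches can be illustrated with a toy sketch. It assumes that an exact match is counted as a shared k-mer and that the predicted host is the one sharing the most k-mers with the virus; the function names, the value of k, and the scoring rule are hypothetical and are not taken from the PHIST code base.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def predict_host(virus, hosts, k=4):
    """Return (host_name, shared_count) for the host sharing the most k-mers.

    `hosts` maps a host name to its genome string. Scoring by the number of
    shared k-mers is an illustrative assumption, not PHIST's actual score.
    """
    virus_kmers = kmers(virus, k)
    counts = {name: len(virus_kmers & kmers(genome, k))
              for name, genome in hosts.items()}
    best = max(counts, key=counts.get)
    return best, counts[best]
```

For example, `predict_host("ACGTACGT", {"hostA": "TTTTACGTACGTTTTT", "hostB": "GGGGGGGGCCCC"})` picks `hostA`, because all four distinct 4-mers of the virus occur in that genome.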


Subjects
Bacteriophages; Viruses; Bacteriophages/genetics; Metagenomics; Viruses/genetics; Metagenome; Software
3.
Bioinformatics; 35(1): 133-136, 2019 Jan 01.
Article in English | MEDLINE | ID: mdl-29986074

ABSTRACT

Summary: Kmer-db is a new tool for estimating evolutionary relationships on the basis of k-mers extracted from genomes or sequencing reads. Thanks to an efficient data structure and parallel implementation, our software estimates distances between 40 715 pathogens in <7 min (on a modern workstation), 26 times faster than Mash, its main competitor. Availability and implementation: https://github.com/refresh-bio/kmer-db and http://sun.aei.polsl.pl/REFRESH/kmer-db. Supplementary information: Supplementary data are available at Bioinformatics online.
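
A k-mer-based distance can be sketched in a few lines. The example below uses the Jaccard distance between k-mer sets as one common choice of measure; this is an assumption for illustration, and the helper names are hypothetical. It says nothing about the data structure or parallelisation that make Kmer-db fast.

```python
def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a, b, k=3):
    """Return 1 - |A & B| / |A | B| for the k-mer sets A and B of two sequences."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    union = ka | kb
    if not union:
        # Both sequences are shorter than k; treat them as indistinguishable.
        return 0.0
    return 1.0 - len(ka & kb) / len(union)
```

With `k=3`, the sequences `"ACGTAC"` and `"ACGTTT"` share 2 of 6 distinct k-mers, so their distance is 1 - 2/6.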


Subjects
Algorithms; Biological Evolution; Computational Biology; Software; Genome
4.
Bioinformatics; 35(12): 2043-2050, 2019 Jun 01.
Article in English | MEDLINE | ID: mdl-30407485

ABSTRACT

MOTIVATION: Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. The reduction of sequencing costs implies a need for algorithms able to process increasing amounts of generated data in a reasonable time. RESULTS: We present Whisper, an accurate and high-performance mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism as well as storing temporary data on disk results in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution works in about 15% of the time needed by the well-known BWA-MEM and Bowtie2 tools at a comparable accuracy, validated in a variant calling pipeline. AVAILABILITY AND IMPLEMENTATION: Whisper is available for free from https://github.com/refresh-bio/Whisper or http://sun.aei.polsl.pl/REFRESH/Whisper/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
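
The sort-then-map idea can be illustrated with a minimal sketch that handles exact matches only. It builds naive suffix arrays for a reference and its reverse complement, sorts the reads, and looks each one up by binary search. All function names are hypothetical, and the real tool additionally handles mismatches, indels, parallelism, and on-disk storage, none of which is modelled here.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def build_suffix_array(text):
    """Naive suffix array: start positions sorted by suffix (toy sizes only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, sa, pattern):
    """Return the sorted start positions of exact occurrences of pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def map_reads(reference, reads):
    """Map each distinct read to exact hits on both strands.

    "-" positions are offsets within the reverse-complemented reference.
    """
    rc = reverse_complement(reference)
    sa_fwd, sa_rev = build_suffix_array(reference), build_suffix_array(rc)
    return {read: {"+": find_exact(reference, sa_fwd, read),
                   "-": find_exact(rc, sa_rev, read)}
            for read in sorted(set(reads))}  # sorted reads are looked up in order
```

Sorting the reads first means that consecutive lookups touch neighbouring regions of the suffix array, which is presumably part of what makes the approach cache- and disk-friendly; the sketch only reproduces the ordering, not the performance benefit.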


Subjects
Algorithms; Software; Base Sequence; Genome; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA
5.
BMC Bioinformatics; 18(1): 285, 2017 May 30.
Article in English | MEDLINE | ID: mdl-28558674

ABSTRACT

BACKGROUND: Survival analysis is an important element of reasoning from data. Applied in a number of fields, it has become particularly useful in medicine to estimate the survival rate of patients on the basis of their condition, examination results, and ongoing treatment. Recent developments in next-generation sequencing open new opportunities in survival studies, as they allow a vast number of genome-, transcriptome-, and proteome-related features to be investigated. These include single nucleotide and structural variants, expressions of genes and microRNAs, DNA methylation, and many others. RESULTS: We present LR-Rules, a new algorithm for rule induction from survival data. It follows the separate-and-conquer heuristic and uses the log-rank test to establish the rule body. Extensive experiments show LR-Rules to generate models of superior accuracy and comprehensibility. The detailed analysis of rules rendered by the presented algorithm on four medical datasets concerning leukemia as well as breast, lung, and thyroid cancers reveals the ability to discover true relations between attributes and patients' survival rate. Two of the case studies incorporate features obtained with high-throughput technologies, showing the usability of the algorithm in the analysis of bioinformatics data. CONCLUSIONS: LR-Rules is a viable alternative to existing approaches to survival analysis, particularly when the interpretability of the resulting model is crucial. The presented algorithm may be especially useful when applied to genomic and proteomic data, as it may contribute to a better understanding of the background of diseases and support their treatment.
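
The log-rank test mentioned above compares the survival experience of two groups. The sketch below computes the standard two-group log-rank chi-square statistic in plain Python. How LR-Rules applies it during rule growing (for instance, comparing examples covered by a candidate rule with those not covered) is an assumption here, and the function name is hypothetical.

```python
def logrank_statistic(times1, events1, times2, events2):
    """Two-group log-rank chi-square statistic (1 degree of freedom).

    times*: observation times; events*: 1 if the event occurred, 0 if censored.
    """
    times1, times2 = list(times1), list(times2)
    events1, events2 = list(events1), list(events2)
    event_times = sorted({t for t, e in zip(times1 + times2, events1 + events2) if e})

    diff = 0.0      # sum over event times of (observed - expected) in group 1
    variance = 0.0  # sum of hypergeometric variances
    for t in event_times:
        n1 = sum(1 for x in times1 if x >= t)  # at risk in group 1
        n2 = sum(1 for x in times2 if x >= t)  # at risk in group 2
        d1 = sum(1 for x, e in zip(times1, events1) if e and x == t)
        d2 = sum(1 for x, e in zip(times2, events2) if e and x == t)
        n, d = n1 + n2, d1 + d2
        diff += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (n2 / n) * (n - d) / (n - 1)
    return diff ** 2 / variance if variance > 0 else 0.0
```

A larger statistic indicates a stronger difference between the two survival curves, so a rule-induction procedure could plausibly prefer the candidate condition that maximises it.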


Subjects
Algorithms; Breast Neoplasms/metabolism; Breast Neoplasms/mortality; Breast Neoplasms/pathology; DNA Copy Number Variations; DNA Methylation; Female; Humans; Kaplan-Meier Estimate; MicroRNAs/metabolism; Polymorphism, Single Nucleotide; Transcriptome
6.
BMC Bioinformatics; 14: 83, 2013 Mar 05.
Article in English | MEDLINE | ID: mdl-23497112

ABSTRACT

BACKGROUND: Machine learning techniques are known to be a powerful way of distinguishing microRNA hairpins from pseudo hairpins and have been applied in a number of recognised miRNA search tools. However, many current methods based on machine learning suffer from some drawbacks, including not addressing the class imbalance problem properly. This may lead to overlearning the majority class and/or incorrect assessment of classification performance. Moreover, those tools are effective for a narrow range of species, usually the model ones. This study aims at improving the performance of the miRNA classification procedure, extending its usability and reducing computational time. RESULTS: We present HuntMi, a stand-alone machine learning miRNA classification tool. We developed a novel method of dealing with the class imbalance problem called ROC-select, which is based on thresholding the score function produced by traditional classifiers. We also introduced new features to the data representation. Several classification algorithms in combination with ROC-select were tested and random forest was selected for the best balance between sensitivity and specificity. Reliable assessment of classification performance is guaranteed by using large, strongly imbalanced, and taxon-specific datasets in a 10-fold cross-validation procedure. As a result, HuntMi achieves a considerably better performance than any other miRNA classification tool and can be applied in miRNA search experiments in a wide range of species. CONCLUSIONS: Our results indicate that HuntMi represents an effective and flexible tool for identification of new microRNAs in animals, plants and viruses. The ROC-select strategy proves to be superior to other methods of dealing with the class imbalance problem and can possibly be used in other machine learning classification tasks. The HuntMi software as well as datasets used in the research are freely available at http://lemur.amu.edu.pl/share/HuntMi/.
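
Thresholding a classifier's score function to counter class imbalance can be sketched as follows. The abstract does not state which point on the ROC curve ROC-select chooses, so the criterion used here (maximising the geometric mean of sensitivity and specificity) is an assumption made purely for illustration, and `select_threshold` is a hypothetical name.

```python
import math


def select_threshold(scores, labels):
    """Return (threshold, sensitivity, specificity) with the best geometric mean.

    scores: classifier scores, higher meaning "more likely positive".
    labels: 1 for positive examples, 0 for negative examples.
    An example is predicted positive when its score is >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    best = None
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        sensitivity, specificity = tp / positives, tn / negatives
        gmean = math.sqrt(sensitivity * specificity)
        if best is None or gmean > best[0]:
            best = (gmean, threshold, sensitivity, specificity)
    return best[1:]
```

Unlike a fixed cut-off of 0.5, a threshold chosen this way weighs both classes equally regardless of how many negatives outnumber the positives.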


Subjects
Artificial Intelligence; MicroRNAs/classification; RNA Precursors/classification; Algorithms; Software
7.
Plant Cell Physiol; 54(2): e10, 2013 Feb.
Article in English | MEDLINE | ID: mdl-23299413

ABSTRACT

Splicing is one of the major contributors to observed spatiotemporal diversification of transcripts and proteins in metazoans. There are numerous factors that affect the process, but splice sites themselves along with the adjacent splicing signals are critical here. Unfortunately, little is still known about splicing in plants and, consequently, further research in some fields of plant molecular biology will encounter difficulties. Keeping this in mind, we performed a large-scale analysis of splice sites in eight plant species, using novel algorithms and tools developed by us. The analyses included identification of orthologous splice sites, polypyrimidine tracts and branch sites. Additionally, we identified putative intronic and exonic cis-regulatory motifs, U12 introns as well as splice sites in 45 microRNA genes in five plant species. We also provide experimental evidence for plant splice sites in the form of expressed sequence tag and RNA-Seq data. All the data are stored in a novel database called ERISdb and are freely available at http://lemur.amu.edu.pl/share/ERISdb/.


Subjects
Databases, Genetic; Genes, Plant; RNA Splice Sites; RNA, Plant/genetics; Software; Algorithms; Expressed Sequence Tags; Internet; Introns; MicroRNAs/genetics; Plants/genetics; RNA Splicing; Regulatory Sequences, Ribonucleic Acid; Search Engine; Sequence Analysis, RNA; Signal Transduction
8.
Curr Opin Struct Biol; 80: 102577, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37012200

ABSTRACT

Large-scale genomics requires highly scalable and accurate multiple sequence alignment methods. Results collected over the last decade suggest accuracy loss when scaling up beyond a few thousand sequences. This issue has been actively addressed with a number of innovative algorithmic solutions that combine low-level hardware optimization with novel higher-level heuristics. This review provides an extensive critical overview of these recent methods. Using established reference datasets, we conclude that although significant progress has been achieved, a unified framework able to consistently and efficiently produce high-accuracy large-scale multiple alignments is still lacking.


Subjects
Algorithms; Genomics; Genomics/methods; Amino Acid Sequence; Sequence Alignment; Software
9.
Sci Rep; 7: 41553, 2017 Jan 31.
Article in English | MEDLINE | ID: mdl-28139687

ABSTRACT

The ever-increasing size of sequence databases caused by the development of high-throughput sequencing poses one of the greatest challenges yet to multiple alignment algorithms. As we show, well-established techniques employed for increasing alignment quality, i.e., refinement and consistency, are ineffective when large protein families are investigated. We present QuickProbs 2, an algorithm for multiple sequence alignment. Based on probabilistic models, equipped with novel column-oriented refinement and selective consistency, it offers outstanding accuracy. When analysing hundreds of sequences, QuickProbs 2 is noticeably better than ClustalΩ and MAFFT, the previous leaders for processing numerous protein families. In the case of smaller sets, for which consistency-based methods are the best performing, QuickProbs 2 is also superior to the competitors. Due to the low computational requirements of selective consistency and the utilization of massively parallel architectures, the presented algorithm has similar execution times to ClustalΩ, and is orders of magnitude faster than full consistency approaches, like MSAProbs or PicXAA. All this makes QuickProbs 2 an excellent tool for aligning families ranging from a few to hundreds of proteins.


Subjects
Computational Biology/methods; Proteins/chemistry; Proteins/genetics; Sequence Analysis, Protein; Software; Algorithms; Amino Acid Sequence; Reproducibility of Results
10.
Sci Rep; 6: 33964, 2016 Sep 27.
Article in English | MEDLINE | ID: mdl-27670777

ABSTRACT

Rapid development of modern sequencing platforms has contributed to the unprecedented growth of protein family databases. The abundance of sets containing hundreds of thousands of sequences is a formidable challenge for multiple sequence alignment algorithms. The article introduces FAMSA, a new progressive algorithm designed for fast and accurate alignment of thousands of protein sequences. Its features include the utilization of the longest common subsequence measure for determining pairwise similarities, a novel method of evaluating gap costs, and a new iterative refinement scheme. Importantly, its implementation is highly optimized and parallelized to make the most of modern computer platforms. Thanks to the above, quality indicators, i.e. sum-of-pairs and total-column scores, show FAMSA to be superior to competing algorithms, such as Clustal Omega or MAFFT, for datasets exceeding a few thousand sequences. This quality does not come at the cost of time or memory requirements, which are an order of magnitude lower than those of the existing solutions. For example, a family of 415519 sequences was analyzed in less than two hours and required no more than 8 GB of RAM. FAMSA is available for free at http://sun.aei.polsl.pl/REFRESH/famsa.
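
The longest common subsequence (LCS) measure named above can be computed with the textbook dynamic-programming recurrence. The sketch below is a plain, unoptimised version meant only to show what quantity is being measured; it does not reflect the optimised and parallelised implementation described in the abstract, and how the LCS length is turned into a pairwise similarity score is not specified here.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of strings a and b.

    Uses two rolling rows, so memory is O(len(b)) and time is O(len(a) * len(b)).
    """
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip a symbol in a or b
        prev = curr
    return prev[-1]
```

For two short protein fragments such as `"MKVL"` and `"MKL"`, the LCS is `"MKL"`, so `lcs_length` returns 3.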

11.
PLoS One; 9(2): e88901, 2014.
Article in English | MEDLINE | ID: mdl-24586435

ABSTRACT

Multiple sequence alignment is a crucial task in a number of biological analyses like secondary structure prediction, domain searching, phylogeny, etc. MSAProbs is currently the most accurate alignment algorithm, but its effectiveness is obtained at the expense of computational time. In this paper we present QuickProbs, a variant of MSAProbs customised for graphics processors. We selected the two most time-consuming stages of MSAProbs to be redesigned for GPU execution: the posterior matrices calculation and the consistency transformation. Experiments on three popular benchmarks (BAliBASE, PREFAB, OXBench-X) on a quad-core PC equipped with a high-end graphics card show QuickProbs to be 5.7 to 9.7 times faster than the original CPU-parallel MSAProbs. Additional tests performed on several protein families from the Pfam database give an overall speed-up of 6.7. Compared to other algorithms like MAFFT, MUSCLE, or ClustalW, QuickProbs proved to be much more accurate at similar speed. Additionally, we introduce a tuned variant of QuickProbs which is significantly more accurate on sets of distantly related sequences than MSAProbs without exceeding its computation time. The GPU part of QuickProbs was implemented in OpenCL, thus the package is suitable for graphics processors produced by all major vendors.


Subjects
Computational Biology/methods; Sequence Alignment/methods; Sequence Analysis, Protein/methods; Algorithms; Computers; Proteins/chemistry; Proteins/genetics