Results 1 - 20 of 25
1.
Nat Commun ; 15(1): 5580, 2024 Jul 03.
Article in English | MEDLINE | ID: mdl-38961062

ABSTRACT

DNA methylation plays an important role in various biological processes, including cell differentiation, ageing, and cancer development. The most important methylation in mammals is 5-methylcytosine mostly occurring in the context of CpG dinucleotides. Sequencing methods such as whole-genome bisulfite sequencing successfully detect 5-methylcytosine DNA modifications. However, they suffer from the serious drawbacks of short read lengths and might introduce an amplification bias. Here we present Rockfish, a deep learning algorithm that significantly improves read-level 5-methylcytosine detection by using Nanopore sequencing. Rockfish is compared with other methods based on Nanopore sequencing on R9.4.1 and R10.4.1 datasets. There is an increase in the single-base accuracy and the F1 measure of up to 5 percentage points on R9.4.1 datasets, and up to 0.82 percentage points on R10.4.1 datasets. Moreover, Rockfish shows a high correlation with whole-genome bisulfite sequencing, requires lower read depth, and achieves higher confidence in biologically important regions such as CpG-rich promoters while being computationally efficient. Its superior performance in human and mouse samples highlights its versatility for studying 5-methylcytosine methylation across varied organisms and diseases. Finally, its adaptable architecture ensures compatibility with new versions of pores and chemistry as well as modification types.


Subjects
5-Methylcytosine , CpG Islands , DNA Methylation , Nanopore Sequencing , 5-Methylcytosine/metabolism , 5-Methylcytosine/chemistry , Nanopore Sequencing/methods , Animals , Mice , Humans , CpG Islands/genetics , Deep Learning , Algorithms , Sequence Analysis, DNA/methods , Whole Genome Sequencing/methods , Sulfites/chemistry
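The abstract above reports gains in single-base accuracy and the F1 measure. As a reminder of how those two numbers are derived from read-level calls, here is a minimal sketch. It is not taken from Rockfish; the function names and the toy labels are assumptions made only for illustration (1 = methylated, 0 = unmethylated).

```python
def confusion_counts(truth, calls):
    """Count TP, FP, TN, FN for binary labels (1 = methylated, 0 = unmethylated)."""
    tp = sum(1 for t, c in zip(truth, calls) if t == 1 and c == 1)
    fp = sum(1 for t, c in zip(truth, calls) if t == 0 and c == 1)
    tn = sum(1 for t, c in zip(truth, calls) if t == 0 and c == 0)
    fn = sum(1 for t, c in zip(truth, calls) if t == 1 and c == 0)
    return tp, fp, tn, fn


def accuracy_and_f1(truth, calls):
    """Return (accuracy, F1) for paired ground-truth labels and predicted calls."""
    tp, fp, tn, fn = confusion_counts(truth, calls)
    accuracy = (tp + tn) / len(truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Hypothetical toy data: eight CpG calls on individual reads.
truth = [1, 1, 1, 0, 0, 0, 1, 0]
calls = [1, 1, 0, 0, 0, 1, 1, 0]
accuracy, f1 = accuracy_and_f1(truth, calls)
```

A difference of "5 percentage points" then means, for example, an F1 of 0.90 against 0.85 on the same evaluation set.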
3.
BMC Bioinformatics ; 25(1): 15, 2024 Jan 11.
Article in English | MEDLINE | ID: mdl-38212694

ABSTRACT

BACKGROUND: Long reads have gained popularity in the analysis of metagenomics data. Therefore, we comprehensively assessed metagenomics classification tools on the species taxonomic level. We analysed kmer-based tools, mapping-based tools and two general-purpose long reads mappers. We evaluated more than 20 pipelines which use either nucleotide or protein databases and selected 13 for an extensive benchmark. We prepared seven synthetic datasets to test various scenarios, including the presence of a host, unknown species and related species. Moreover, we used available sequencing data from three well-defined mock communities, including a dataset with abundance varying from 0.0001 to 20% and six real gut microbiomes. RESULTS: General-purpose mappers Minimap2 and Ram achieved similar or better accuracy on most testing metrics than best-performing classification tools. They were up to ten times slower than the fastest kmer-based tools requiring up to four times less RAM. All tested tools were prone to report organisms not present in datasets, except CLARK-S, and they underperformed in the case of the high presence of the host's genetic material. Tools which use a protein database performed worse than those based on a nucleotide database. Longer read lengths made classification easier, but due to the difference in read length distributions among species, the usage of only the longest reads reduced the accuracy. The comparison of real gut microbiome datasets shows similar abundance profiles for the same type of tools but discordance in the number of reported organisms and abundances between types. Most assessments showed the influence of database completeness on the reports. CONCLUSION: The findings indicate that kmer-based tools are well-suited for rapid analysis of long reads data. However, when heightened accuracy is essential, mappers demonstrate slightly superior performance, albeit at a considerably slower pace. Nevertheless, a combination of diverse categories of tools and databases will likely be necessary to analyse complex samples. Discrepancies observed among tools when applied to real gut datasets, as well as a reduced performance in cases where unknown species or a significant proportion of the host genome is present in the sample, highlight the need for continuous improvement of existing tools. Additionally, regular updates and curation of databases are important to ensure their effectiveness.


Subjects
High-Throughput Nucleotide Sequencing , Metagenome , Sequence Analysis, DNA , Metagenomics , Databases, Protein , Nucleotides
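To make the term "kmer-based tools" concrete, the following toy sketch assigns a read to the reference that shares the most k-mers unique to it. It is a deliberately simplified illustration and does not reproduce any of the benchmarked tools; the reference sequences, names and the `min_hits` threshold are invented for the example.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Map each k-mer to the set of reference names that contain it."""
    index = {}
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def classify(read, index, k, min_hits=1):
    """Vote with k-mers found in exactly one reference; None means unclassified."""
    votes = Counter()
    for kmer in kmers(read, k):
        owners = index.get(kmer, ())
        if len(owners) == 1:
            votes[next(iter(owners))] += 1
    if not votes:
        return None
    name, hits = votes.most_common(1)[0]
    return name if hits >= min_hits else None


# Hypothetical miniature "database" of two species.
references = {
    "species_a": "ACGTACGTTAGCCGATAGGCTA",
    "species_b": "TTGACCATGGCAATCGGATTCA",
}
index = build_index(references, k=5)
known = classify("TAGCCGATAGG", index, k=5)    # fragment of species_a
unknown = classify("GGGGGGGGGG", index, k=5)   # shares no k-mer with the index
```

Raising `min_hits` is one simple way to trade sensitivity for fewer spurious assignments when a read comes from a species absent from the database.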
4.
5.
Commun Biol ; 5(1): 967, 2022 09 15.
Article in English | MEDLINE | ID: mdl-36109650

ABSTRACT

Singapore's National Flower, Papilionanthe (Ple.) Miss Joaquim 'Agnes' (PMJ) is highly prized as a horticultural flower from the Orchidaceae family. A combination of short-read sequencing, single-molecule long-read sequencing and chromatin contact mapping was used to assemble the PMJ genome, spanning 2.5 Gb and 19 pseudo-chromosomal scaffolds. Genomic resources and chemical profiling provided insights towards identifying, understanding and elucidating various classes of secondary metabolite compounds synthesized by the flower. For example, the presence of the anthocyanin pigments detected by chemical profiling coincides with the expression of ANTHOCYANIN SYNTHASE (ANS), an enzyme responsible for the synthesis of the former. Similarly, the presence of vandaterosides (a unique class of glycosylated organic acids with the potential to slow skin aging) discovered using chemical profiling revealed the involvement of candidate glycosyltransferase family enzymes in vandateroside biosynthesis. Interestingly, despite the unnoticeable scent of the flower, genes involved in the biosynthesis of volatile compounds and chemical profiling revealed the combination of oxygenated hydrocarbons, including traces of linalool, beta-ionone and vanillin, forming the scent profile of PMJ. In summary, by combining genomics and biochemistry, the findings expand the known biodiversity repertoire of the Orchidaceae family and provide insights into the genome and secondary metabolite processes of PMJ.


Subjects
Anthocyanins , Orchidaceae , Chromatin/metabolism , Flowers/genetics , Flowers/metabolism , Gene Expression Regulation, Plant , Glycosyltransferases/genetics , Metabolic Networks and Pathways , Orchidaceae/genetics , Singapore
6.
Nat Methods ; 19(7): 833-844, 2022 07.
Article in English | MEDLINE | ID: mdl-35697834

ABSTRACT

Inosine is a prevalent RNA modification in animals and is formed when an adenosine is deaminated by the ADAR family of enzymes. Traditionally, inosines are identified indirectly as variants from Illumina RNA-sequencing data because they are interpreted as guanosines by cellular machineries. However, this indirect method performs poorly in protein-coding regions where exons are typically short, in non-model organisms with sparsely annotated single-nucleotide polymorphisms, or in disease contexts where unknown DNA mutations are pervasive. Here, we show that Oxford Nanopore direct RNA sequencing can be used to identify inosine-containing sites in native transcriptomes with high accuracy. We trained convolutional neural network models to distinguish inosine from adenosine and guanosine, and to estimate the modification rate at each editing site. Furthermore, we demonstrated their utility on the transcriptomes of human, mouse and Xenopus. Our approach expands the toolkit for studying adenosine-to-inosine editing and can be further extended to investigate other RNA modifications.


Subjects
Nanopores , RNA , Adenosine/genetics , Animals , Inosine/genetics , Mice , RNA/genetics , RNA/metabolism , RNA Editing , Sequence Analysis, RNA
7.
Nat Comput Sci ; 1(5): 332-336, 2021 May.
Article in English | MEDLINE | ID: mdl-38217213

ABSTRACT

Whole genome sequencing technologies are unable to invariably read DNA molecules intact, a shortcoming that assemblers try to resolve by stitching the obtained fragments back together. Here, we present methods for the improvement of de novo genome assembly from erroneous long reads incorporated into a tool called Raven. Raven maintains similar performance for various genomes and has accuracy on par with other assemblers that support third-generation sequencing data. It is one of the fastest options while having the lowest memory consumption on the majority of benchmarked datasets.

8.
Nat Biotechnol ; 37(8): 937-944, 2019 08.
Article in English | MEDLINE | ID: mdl-31359005

ABSTRACT

Characterization of microbiomes has been enabled by high-throughput metagenomic sequencing. However, existing methods are not designed to combine reads from short- and long-read technologies. We present a hybrid metagenomic assembler named OPERA-MS that integrates assembly-based metagenome clustering with repeat-aware, exact scaffolding to accurately assemble complex communities. Evaluation using defined in vitro and virtual gut microbiomes revealed that OPERA-MS assembles metagenomes with greater base pair accuracy than long-read (>5×; Canu), higher contiguity than short-read (~10× NGA50; MEGAHIT, IDBA-UD, metaSPAdes) and fewer assembly errors than non-metagenomic hybrid assemblers (2×; hybridSPAdes). OPERA-MS provides strain-resolved assembly in the presence of multiple genomes of the same species, high-quality reference genomes for rare species (<1%) with ~9× long-read coverage and near-complete genomes with higher coverage. We used OPERA-MS to assemble 28 gut metagenomes of antibiotic-treated patients, and showed that the inclusion of long nanopore reads produces more contiguous assemblies (200× improvement over short-read assemblies), including more than 80 closed plasmid or phage sequences and a new 263 kbp jumbo phage. High-quality hybrid assemblies enable an exquisitely detailed view of the gut resistome in human patients.


Subjects
Bacteria/drug effects , Bacteria/genetics , Metagenomics/methods , Microbiota/drug effects , Sequence Analysis, DNA/methods , Anti-Bacterial Agents/pharmacology , Drug Resistance, Bacterial , Feces/microbiology , High-Throughput Nucleotide Sequencing/methods , Humans , Metagenome , Nanopores , Software
9.
Bioinformatics ; 34(5): 748-754, 2018 03 01.
Article in English | MEDLINE | ID: mdl-29069314

ABSTRACT

Motivation: High-throughput sequencing has transformed the study of gene expression levels through RNA-seq, a technique that is now routinely used by various fields, such as genetic research or diagnostics. The advent of third generation sequencing technologies providing significantly longer reads opens up new possibilities. However, the high error rates common to these technologies set new bioinformatics challenges for the gapped alignment of reads to their genomic origin. In this study, we have explored how currently available RNA-seq splice-aware alignment tools cope with increased read lengths and error rates. All tested tools were initially developed for short NGS reads, but some have claimed support for long Pacific Biosciences (PacBio) or even Oxford Nanopore Technologies (ONT) MinION reads. Results: The tools were tested on synthetic and real datasets from two technologies (PacBio and ONT MinION). Alignment quality and resource usage were compared across different aligners. The effect of error correction of long reads was explored, both using self-correction and correction with an external short reads dataset. A tool was developed for evaluating RNA-seq alignment results. This tool can be used to compare the alignment of simulated reads to their genomic origin, or to compare the alignment of real reads to a set of annotated transcripts. Our tests show that while some RNA-seq aligners were unable to cope with long error-prone reads, others produced overall good results. We further show that alignment accuracy can be improved using error-corrected reads. Availability and implementation: https://github.com/kkrizanovic/RNAseqEval, https://figshare.com/projects/RNAseq_benchmark/24391. Contact: mile.sikic@fer.hr. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Animals , Drosophila melanogaster/genetics , Humans , Saccharomyces cerevisiae/genetics
10.
Bioinformatics ; 33(9): 1394-1395, 2017 05 01.
Article in English | MEDLINE | ID: mdl-28453688

ABSTRACT

Summary: We present Edlib, an open-source C/C++ library for exact pairwise sequence alignment using edit distance. We compare Edlib to other libraries and show that it is the fastest while not lacking in functionality and can also easily handle very large sequences. Being easy to use, flexible, fast and low on memory usage, we expect it to be easily adopted as a building block for future bioinformatics tools. Availability and Implementation: Source code, installation instructions and test data are freely available for download at https://github.com/Martinsos/edlib, under the MIT licence. Edlib is implemented in C/C++ and supported on Linux, MS Windows, and Mac OS. Contact: mile.sikic@fer.hr. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Sequence Analysis, DNA/methods , Software , Algorithms
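For readers unfamiliar with the quantity Edlib computes, edit distance is the minimum number of single-character insertions, deletions and substitutions that turn one sequence into another. The sketch below is the textbook dynamic-programming formulation, kept to two rows of memory. It is an illustration only and is not the faster algorithm Edlib implements; the function name is an assumption.

```python
def edit_distance(query, target):
    """Levenshtein distance using two rows of the dynamic-programming table."""
    # Row 0: turning the empty prefix of query into target[:j] costs j insertions.
    previous = list(range(len(target) + 1))
    for i, q in enumerate(query, start=1):
        current = [i]  # turning query[:i] into the empty string costs i deletions
        for j, t in enumerate(target, start=1):
            cost = 0 if q == t else 1
            current.append(min(
                previous[j] + 1,         # delete q from the query
                current[j - 1] + 1,      # insert t into the query
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


distance = edit_distance("kitten", "sitting")
```

This version runs in time proportional to the product of the two sequence lengths, which is what makes optimized libraries attractive for very long sequences.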
11.
Genome Res ; 27(5): 737-746, 2017 05.
Article in English | MEDLINE | ID: mdl-28100585

ABSTRACT

The assembly of long reads from Pacific Biosciences and Oxford Nanopore Technologies typically requires resource-intensive error-correction and consensus-generation steps to obtain high-quality assemblies. We show that the error-correction step can be omitted and that high-quality consensus sequences can be generated efficiently with a SIMD-accelerated, partial-order alignment-based, stand-alone consensus module called Racon. Based on tests with PacBio and Oxford Nanopore data sets, we show that Racon coupled with miniasm enables consensus genomes with similar or better quality than state-of-the-art methods while being an order of magnitude faster.


Subjects
Algorithms , Contig Mapping/methods , Genomics/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Contig Mapping/standards , Genomics/standards , Sequence Alignment/standards , Sequence Analysis, DNA/standards
12.
Bioinformatics ; 32(17): i680-i684, 2016 09 01.
Article in English | MEDLINE | ID: mdl-27587689

ABSTRACT

MOTIVATION: Protein database search is one of the fundamental problems in bioinformatics. For decades, it has been explored and solved using different exact and heuristic approaches. However, exponential growth of data in recent years has brought significant challenges in improving already existing algorithms. BLAST has been the most successful tool for protein database search, but is also becoming a bottleneck in many applications. Due to that, many different approaches have been developed to complement or replace it. In this article, we present SWORD, an efficient protein database search implementation that runs 8-16 times faster than BLAST in the sensitive mode and up to 68 times faster in the fast and less accurate mode. It is designed to be used in nearly all database search environments, but is especially suitable for large databases. Its sensitivity exceeds that of BLAST for the majority of input datasets and provides guaranteed optimal alignments. AVAILABILITY AND IMPLEMENTATION: Sword is freely available for download from https://github.com/rvaser/sword. CONTACT: robert.vaser@fer.hr and mile.sikic@fer.hr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Databases, Protein , Search Engine , Sequence Alignment , Algorithms , Software
13.
Bioinformatics ; 32(17): 2582-9, 2016 09 01.
Article in English | MEDLINE | ID: mdl-27162186

ABSTRACT

MOTIVATION: Recent emergence of nanopore sequencing technology set a challenge for established assembly methods. In this work, we assessed how existing hybrid and non-hybrid de novo assembly methods perform on long and error-prone nanopore reads. RESULTS: We benchmarked five non-hybrid (in terms of both error correction and scaffolding) assembly pipelines as well as two hybrid assemblers which use third generation sequencing data to scaffold Illumina assemblies. Tests were performed on several publicly available MinION and Illumina datasets of Escherichia coli K-12, using several sequencing coverages of nanopore data (20×, 30×, 40× and 50×). We attempted to assess the assembly quality at each of these coverages, in order to estimate the requirements for closed bacterial genome assembly. For the purpose of the benchmark, an extensible genome assembly benchmarking framework was developed. Results show that hybrid methods are highly dependent on the quality of NGS data, but much less on the quality and coverage of nanopore data and perform relatively well on lower nanopore coverages. All non-hybrid methods correctly assemble the E. coli genome when coverage is above 40×, even the non-hybrid method tailored for Pacific Biosciences reads. While it requires higher coverage compared to a method designed particularly for nanopore reads, its running time is significantly lower. AVAILABILITY AND IMPLEMENTATION: https://github.com/kkrizanovic/NanoMark. CONTACT: mile.sikic@fer.hr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Nanopores , Sequence Analysis, DNA , Escherichia coli , Escherichia coli K12 , Genome, Bacterial , High-Throughput Nucleotide Sequencing
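The coverage levels quoted above (20×, 30×, 40× and 50×) follow the usual definition: total sequenced bases divided by genome size. The sketch below shows that arithmetic and a naive way to cut a read set down to a target coverage. It is not the NanoMark procedure; the function names and the toy read lengths are assumptions, and a real run would use the E. coli K-12 genome size (roughly 4.6 Mbp) rather than the small number used here.

```python
def coverage(read_lengths, genome_size):
    """Sequencing depth: total bases in the reads divided by the genome size."""
    return sum(read_lengths) / genome_size


def subsample_to_coverage(read_lengths, genome_size, target):
    """Take reads in the given order until the target coverage is reached."""
    chosen, total = [], 0
    for length in read_lengths:
        if total >= target * genome_size:
            break
        chosen.append(length)
        total += length
    return chosen


# Hypothetical toy data: 100 reads of 500 bases over a 1,000-base "genome".
reads = [500] * 100
full_depth = coverage(reads, genome_size=1000)
subset = subsample_to_coverage(reads, genome_size=1000, target=20)
```

In practice the reads would be shuffled first so that the subset is not biased toward the start of the sequencing run.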
14.
Nat Commun ; 7: 11307, 2016 Apr 15.
Article in English | MEDLINE | ID: mdl-27079541

ABSTRACT

Realizing the democratic promise of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. Here we present GraphMap, a mapping algorithm designed to analyse nanopore sequencing reads, which progressively refines candidate alignments to robustly handle potentially high error rates and uses a fast graph traversal to align long reads with speed and high precision (>95%). Evaluation on MinION sequencing data sets against short- and long-read mappers indicates that GraphMap increases mapping sensitivity by 10-80% and maps >95% of bases. GraphMap alignments enabled single-nucleotide variant calling on the human genome with increased sensitivity (15%) over the next best mapper, precise detection of structural variants from length 100 bp to 4 kbp, and species and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.


Subjects
Algorithms , Computational Biology/methods , Genome, Human/genetics , High-Throughput Nucleotide Sequencing/methods , Genomics/methods , Humans , Nanopores , Polymorphism, Single Nucleotide , Reproducibility of Results , Sequence Alignment/methods
15.
Nat Protoc ; 11(1): 1-9, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26633127

ABSTRACT

The SIFT (sorting intolerant from tolerant) algorithm helps bridge the gap between mutations and phenotypic variations by predicting whether an amino acid substitution is deleterious. SIFT has been used in disease, mutation and genetic studies, and a protocol for its use has been previously published with Nature Protocols. This updated protocol describes SIFT 4G (SIFT for genomes), which is a faster version of SIFT that enables practical computations on reference genomes. Users can get predictions for single-nucleotide variants from their organism of interest using the SIFT 4G annotator with SIFT 4G's precomputed databases. The scope of genomic predictions is expanded, with predictions available for more than 200 organisms. Users can also run the SIFT 4G algorithm themselves. SIFT predictions can be retrieved for 6.7 million variants in 4 min once the database has been downloaded. If precomputed predictions are not available, the SIFT 4G algorithm can compute predictions at a rate of 2.6 s per protein sequence. SIFT 4G is available from http://sift-dna.org/sift4g.


Subjects
Algorithms , Genomics/methods , Mutation, Missense/genetics , Databases, Protein , Genomics/standards , Humans , Molecular Sequence Annotation , Phenotype , Reference Standards
16.
Phys Rev Lett ; 114(24): 248701, 2015 Jun 19.
Article in English | MEDLINE | ID: mdl-26197016

ABSTRACT

Detection of patient zero can give new insights to epidemiologists about the nature of first transmissions into a population. In this Letter, we study the statistical inference problem of detecting the source of epidemics from a snapshot of spreading on an arbitrary network structure. By using exact analytic calculations and Monte Carlo estimators, we demonstrate the detectability limits for the susceptible-infected-recovered model, which primarily depend on the spreading process characteristics. Finally, we demonstrate the applicability of the approach in a case of a simulated sexually transmitted infection spreading over an empirical temporal network of sexual interactions.


Subjects
Contact Tracing/methods , Models, Statistical , Sexually Transmitted Diseases/epidemiology , Computer Simulation , Epidemiologic Methods , Humans , Monte Carlo Method , Sexually Transmitted Diseases/transmission
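The idea of scoring candidate sources by Monte Carlo simulation of a susceptible-infected-recovered (SIR) process can be illustrated on a toy network. The sketch below is a crude stand-in, not the estimator from the Letter: it uses a discrete-time SIR model, treats the snapshot simply as the set of nodes that were ever infected, and scores each candidate by how often a simulation started there reproduces the snapshot exactly. The graph, parameters and function names are all assumptions.

```python
import random


def simulate_sir(graph, source, p, q, steps, rng):
    """Discrete-time SIR; returns the set of nodes that were ever infected."""
    infected, recovered = {source}, set()
    for _ in range(steps):
        newly_infected = set()
        for node in sorted(infected):
            for neighbor in graph[node]:
                susceptible = neighbor not in infected and neighbor not in recovered
                if susceptible and rng.random() < p:  # transmission with probability p
                    newly_infected.add(neighbor)
        # Each infected node recovers with probability q at the end of the step.
        newly_recovered = {node for node in sorted(infected) if rng.random() < q}
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
    return infected | recovered


def source_scores(graph, snapshot, p, q, steps, runs=2000, seed=0):
    """Fraction of simulations from each candidate that reproduce the snapshot."""
    rng = random.Random(seed)
    scores = {}
    for candidate in sorted(snapshot):
        hits = sum(
            simulate_sir(graph, candidate, p, q, steps, rng) == snapshot
            for _ in range(runs)
        )
        scores[candidate] = hits / runs
    return scores


# Hypothetical path network 0-1-2-3-4, observed after one time step.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = source_scores(graph, snapshot={1, 2, 3}, p=0.5, q=0.5, steps=1)
best = max(scores, key=scores.get)
```

Here only node 2 can infect both of its neighbours within one step, so it is the only candidate with a non-zero score; on larger networks several candidates typically remain plausible, which is where detectability limits come in.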
17.
PLoS One ; 10(12): e0145857, 2015.
Article in English | MEDLINE | ID: mdl-26719890

ABSTRACT

In recent years we have witnessed a growth in sequencing yield, the number of samples sequenced, and as a result, the growth of publicly maintained sequence databases. The increase of data present all around has put high requirements on protein similarity search algorithms with two ever-opposite goals: how to keep the running times acceptable while maintaining a high-enough level of sensitivity. The most time-consuming step of similarity search is the local alignment between query and database sequences. This step is usually performed using exact local alignment algorithms such as Smith-Waterman. Due to its quadratic time complexity, alignments of a query to the whole database are usually too slow. Therefore, before doing the exact local alignment, the majority of protein similarity search methods apply heuristics to reduce the number of possible candidate sequences in the database. However, there is still a need for the alignment of a query sequence to a reduced database. In this paper we present the SW#db tool and a library for fast exact similarity search. Although its running times, as a standalone tool, are comparable to the running times of BLAST, it is primarily intended to be used for the exact local alignment phase in which the database of sequences has already been reduced. It uses both GPU and CPU parallelization and was 4-5 times faster than SSEARCH, 6-25 times faster than CUDASW++ and more than 20 times faster than SSW at the time of writing, using multiple queries on Swiss-Prot and UniRef90 databases.


Subjects
Computational Biology/methods , Sequence Alignment/methods , Software , Algorithms , Databases, Nucleic Acid , Web Browser
18.
Nucleic Acids Res ; 42(Database issue): D879-81, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24271393

ABSTRACT

ExoLocator (http://exolocator.eopsf.org) collects in a single place information needed for comparative analysis of protein-coding exons from vertebrate species. The main source of data (the genomic sequences, and the existing exon and homology annotation) is the ENSEMBL database of completed vertebrate genomes. To these, ExoLocator adds the search for ostensibly missing exons in orthologous protein pairs across species, using an extensive computational pipeline to narrow down the search region for the candidate exons and find a suitable template in the other species, as well as state-of-the-art implementations of pairwise alignment algorithms. The resulting complements of exons are organized in a way currently unique to ExoLocator: multiple sequence alignments, both on the nucleotide and on the peptide levels, clearly indicating the exon boundaries. The alignments can be inspected in the web-embedded viewer, downloaded or used on the spot to produce an estimate of conservation within orthologous sets, or functional divergence across paralogues.


Subjects
Databases, Protein , Exons , Proteins/genetics , Animals , Genome, Human , Humans , Internet , Vertebrates/genetics
19.
Bioinformatics ; 29(19): 2494-5, 2013 Oct 01.
Article in English | MEDLINE | ID: mdl-23864730

ABSTRACT

SUMMARY: We propose SW#, a new CUDA graphical processor unit-enabled and memory-efficient implementation of a dynamic programming algorithm for local alignment. It can be used as either a stand-alone application or a library. Although there are other graphical processor unit implementations of the Smith-Waterman algorithm, SW# is the only one publicly available that can produce sequence alignments on a genome-wide scale. For long sequences, it is at least a few hundred times faster than a CPU version of the same algorithm. AVAILABILITY: Source code and installation instructions are freely available for download at http://complex.zesoi.fer.hr/SW.html.


Subjects
Algorithms , Genome , Base Sequence , Internet , Sequence Alignment , Software
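As background for the Smith-Waterman algorithm mentioned above, here is a minimal CPU sketch of the local-alignment score computation that keeps only one previous row in memory. It is an illustration under simplifying assumptions (a linear gap penalty and arbitrary example scores), not the SW# implementation, and it returns only the best score rather than the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty, one row in memory."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at zero.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


score = smith_waterman_score("TTACGTTT", "GGACGTGG")
```

The nested loops make the quadratic running time explicit, while the two-row layout shows why the score alone needs memory proportional to just one sequence length; recovering the full alignment with limited memory is the harder part.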
20.
Cell Rep ; 2(5): 1207-19, 2012 Nov 29.
Article in English | MEDLINE | ID: mdl-23103170

ABSTRACT

Chromatin interactions play important roles in transcription regulation. To better understand the underlying evolutionary and functional constraints of these interactions, we implemented a systems approach to examine RNA polymerase-II-associated chromatin interactions in human cells. We found that 40% of the total genomic elements involved in chromatin interactions converged to a giant, scale-free-like, hierarchical network organized into chromatin communities. The communities were enriched in specific functions and were syntenic through evolution. Disease-associated SNPs from genome-wide association studies were enriched among the nodes with fewer interactions, implying their selection against deleterious interactions by limiting the total number of interactions, a model that we further reconciled using somatic and germline cancer mutation data. The hubs lacked disease-associated SNPs, constituted a nonrandomly interconnected core of key cellular functions, and exhibited lethality in mouse mutants, supporting an evolutionary selection that favored the nonrandom spatial clustering of the least-evolving key genomic domains against random genetic or transcriptional errors in the genome. Altogether, our analyses reveal a systems-level evolutionary framework that shapes functionally compartmentalized and error-tolerant transcriptional regulation of the human genome in three dimensions.


Subjects
Chromatin/metabolism , Animals , Biological Evolution , Gene Regulatory Networks , Genome , Genome, Human , Genome-Wide Association Study , Humans , K562 Cells , MCF-7 Cells , Mice , Polymorphism, Single Nucleotide , Promoter Regions, Genetic , RNA Polymerase II/metabolism , Transcription, Genetic