Results 1 - 20 of 25
1.
Nat Commun ; 15(1): 5580, 2024 Jul 03.
Article in English | MEDLINE | ID: mdl-38961062

ABSTRACT

DNA methylation plays an important role in various biological processes, including cell differentiation, ageing, and cancer development. The most important methylation in mammals is 5-methylcytosine mostly occurring in the context of CpG dinucleotides. Sequencing methods such as whole-genome bisulfite sequencing successfully detect 5-methylcytosine DNA modifications. However, they suffer from the serious drawbacks of short read lengths and might introduce an amplification bias. Here we present Rockfish, a deep learning algorithm that significantly improves read-level 5-methylcytosine detection by using Nanopore sequencing. Rockfish is compared with other methods based on Nanopore sequencing on R9.4.1 and R10.4.1 datasets. There is an increase in the single-base accuracy and the F1 measure of up to 5 percentage points on R9.4.1 datasets, and up to 0.82 percentage points on R10.4.1 datasets. Moreover, Rockfish shows a high correlation with whole-genome bisulfite sequencing, requires lower read depth, and achieves higher confidence in biologically important regions such as CpG-rich promoters while being computationally efficient. Its superior performance in human and mouse samples highlights its versatility for studying 5-methylcytosine methylation across varied organisms and diseases. Finally, its adaptable architecture ensures compatibility with new versions of pores and chemistry as well as modification types.


Subject(s)
5-Methylcytosine , CpG Islands , DNA Methylation , Nanopore Sequencing , 5-Methylcytosine/metabolism , 5-Methylcytosine/chemistry , Nanopore Sequencing/methods , Animals , Mice , Humans , CpG Islands/genetics , Deep Learning , Algorithms , Sequence Analysis, DNA/methods , Whole Genome Sequencing/methods , Sulfites/chemistry
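
The Rockfish abstract above reports gains in single-base accuracy and the F1 measure. As a reminder of how those two metrics relate to per-site calls, here is a minimal sketch; the function name and the counts are hypothetical and are not taken from the paper, and the sketch assumes all denominators are non-zero.

```python
from typing import Tuple


def accuracy_and_f1(tp: int, fp: int, tn: int, fn: int) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary methylated/unmethylated calls.

    tp/fp/tn/fn are counts of true positives, false positives,
    true negatives and false negatives. Assumes non-zero denominators.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # fraction of methylated calls that are correct
    recall = tp / (tp + fn)      # fraction of methylated sites that are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Hypothetical counts, invented purely for illustration.
acc, f1 = accuracy_and_f1(tp=90, fp=10, tn=85, fn=15)
print(f"accuracy={acc:.3f}, F1={f1:.3f}")
```

A difference of "5 percentage points" in the abstract's sense would correspond to, for example, an F1 of 0.93 versus 0.88 on the same dataset.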
3.
BMC Bioinformatics ; 25(1): 15, 2024 Jan 11.
Article in English | MEDLINE | ID: mdl-38212694

ABSTRACT

BACKGROUND: Long reads have gained popularity in the analysis of metagenomics data. Therefore, we comprehensively assessed metagenomics classification tools on the species taxonomic level. We analysed kmer-based tools, mapping-based tools and two general-purpose long reads mappers. We evaluated more than 20 pipelines which use either nucleotide or protein databases and selected 13 for an extensive benchmark. We prepared seven synthetic datasets to test various scenarios, including the presence of a host, unknown species and related species. Moreover, we used available sequencing data from three well-defined mock communities, including a dataset with abundance varying from 0.0001 to 20% and six real gut microbiomes. RESULTS: General-purpose mappers Minimap2 and Ram achieved similar or better accuracy on most testing metrics than best-performing classification tools. They were up to ten times slower than the fastest kmer-based tools requiring up to four times less RAM. All tested tools were prone to report organisms not present in datasets, except CLARK-S, and they underperformed in the case of the high presence of the host's genetic material. Tools which use a protein database performed worse than those based on a nucleotide database. Longer read lengths made classification easier, but due to the difference in read length distributions among species, the usage of only the longest reads reduced the accuracy. The comparison of real gut microbiome datasets shows similar abundance profiles for the same type of tools but discordance in the number of reported organisms and abundances between types. Most assessments showed the influence of database completeness on the reports. CONCLUSION: The findings indicate that kmer-based tools are well-suited for rapid analysis of long reads data. However, when heightened accuracy is essential, mappers demonstrate slightly superior performance, albeit at a considerably slower pace. Nevertheless, a combination of diverse categories of tools and databases will likely be necessary to analyse complex samples. Discrepancies observed among tools when applied to real gut datasets, as well as a reduced performance in cases where unknown species or a significant proportion of the host genome is present in the sample, highlight the need for continuous improvement of existing tools. Additionally, regular updates and curation of databases are important to ensure their effectiveness.


Subject(s)
High-Throughput Nucleotide Sequencing , Metagenome , Sequence Analysis, DNA , Metagenomics , Databases, Protein , Nucleotides
4.
Nat Methods ; 20(4): 491-492, 2023 04.
Article in English | MEDLINE | ID: mdl-36959321
5.
Commun Biol ; 5(1): 967, 2022 09 15.
Article in English | MEDLINE | ID: mdl-36109650

ABSTRACT

Singapore's National Flower, Papilionanthe (Ple.) Miss Joaquim 'Agnes' (PMJ) is highly prized as a horticultural flower from the Orchidaceae family. A combination of short-read sequencing, single-molecule long-read sequencing and chromatin contact mapping was used to assemble the PMJ genome, spanning 2.5 Gb and 19 pseudo-chromosomal scaffolds. Genomic resources and chemical profiling provided insights towards identifying, understanding and elucidating various classes of secondary metabolite compounds synthesized by the flower. For example, the presence of the anthocyanin pigments detected by chemical profiling coincides with the expression of ANTHOCYANIN SYNTHASE (ANS), an enzyme responsible for the synthesis of the former. Similarly, the presence of vandaterosides (a unique class of glycosylated organic acids with the potential to slow skin aging) discovered using chemical profiling revealed the involvement of candidate glycosyltransferase family enzymes in vandateroside biosynthesis. Interestingly, despite the unnoticeable scent of the flower, genes involved in the biosynthesis of volatile compounds and chemical profiling revealed the combination of oxygenated hydrocarbons, including traces of linalool, beta-ionone and vanillin, forming the scent profile of PMJ. In summary, by combining genomics and biochemistry, the findings expand the known biodiversity repertoire of the Orchidaceae family and provide insights into the genome and secondary metabolite processes of PMJ.


Subject(s)
Anthocyanins , Orchidaceae , Chromatin/metabolism , Flowers/genetics , Flowers/metabolism , Gene Expression Regulation, Plant , Glycosyltransferases/genetics , Metabolic Networks and Pathways , Orchidaceae/genetics , Singapore
6.
Nat Methods ; 19(7): 833-844, 2022 07.
Article in English | MEDLINE | ID: mdl-35697834

ABSTRACT

Inosine is a prevalent RNA modification in animals and is formed when an adenosine is deaminated by the ADAR family of enzymes. Traditionally, inosines are identified indirectly as variants from Illumina RNA-sequencing data because they are interpreted as guanosines by cellular machineries. However, this indirect method performs poorly in protein-coding regions where exons are typically short, in non-model organisms with sparsely annotated single-nucleotide polymorphisms, or in disease contexts where unknown DNA mutations are pervasive. Here, we show that Oxford Nanopore direct RNA sequencing can be used to identify inosine-containing sites in native transcriptomes with high accuracy. We trained convolutional neural network models to distinguish inosine from adenosine and guanosine, and to estimate the modification rate at each editing site. Furthermore, we demonstrated their utility on the transcriptomes of human, mouse and Xenopus. Our approach expands the toolkit for studying adenosine-to-inosine editing and can be further extended to investigate other RNA modifications.


Subject(s)
Nanopores , RNA , Adenosine/genetics , Animals , Inosine/genetics , Mice , RNA/genetics , RNA/metabolism , RNA Editing , Sequence Analysis, RNA
7.
Nat Comput Sci ; 1(5): 332-336, 2021 May.
Article in English | MEDLINE | ID: mdl-38217213

ABSTRACT

Whole genome sequencing technologies are unable to invariably read DNA molecules intact, a shortcoming that assemblers try to resolve by stitching the obtained fragments back together. Here, we present methods for the improvement of de novo genome assembly from erroneous long reads incorporated into a tool called Raven. Raven maintains similar performance for various genomes and has accuracy on par with other assemblers that support third-generation sequencing data. It is one of the fastest options while having the lowest memory consumption on the majority of benchmarked datasets.

8.
Nat Biotechnol ; 37(8): 937-944, 2019 08.
Article in English | MEDLINE | ID: mdl-31359005

ABSTRACT

Characterization of microbiomes has been enabled by high-throughput metagenomic sequencing. However, existing methods are not designed to combine reads from short- and long-read technologies. We present a hybrid metagenomic assembler named OPERA-MS that integrates assembly-based metagenome clustering with repeat-aware, exact scaffolding to accurately assemble complex communities. Evaluation using defined in vitro and virtual gut microbiomes revealed that OPERA-MS assembles metagenomes with greater base pair accuracy than long-read (>5×; Canu), higher contiguity than short-read (~10× NGA50; MEGAHIT, IDBA-UD, metaSPAdes) and fewer assembly errors than non-metagenomic hybrid assemblers (2×; hybridSPAdes). OPERA-MS provides strain-resolved assembly in the presence of multiple genomes of the same species, high-quality reference genomes for rare species (<1%) with ~9× long-read coverage and near-complete genomes with higher coverage. We used OPERA-MS to assemble 28 gut metagenomes of antibiotic-treated patients, and showed that the inclusion of long nanopore reads produces more contiguous assemblies (200× improvement over short-read assemblies), including more than 80 closed plasmid or phage sequences and a new 263 kbp jumbo phage. High-quality hybrid assemblies enable an exquisitely detailed view of the gut resistome in human patients.


Subject(s)
Bacteria/drug effects , Bacteria/genetics , Metagenomics/methods , Microbiota/drug effects , Sequence Analysis, DNA/methods , Anti-Bacterial Agents/pharmacology , Drug Resistance, Bacterial , Feces/microbiology , High-Throughput Nucleotide Sequencing/methods , Humans , Metagenome , Nanopores , Software
9.
Bioinformatics ; 34(5): 748-754, 2018 03 01.
Article in English | MEDLINE | ID: mdl-29069314

ABSTRACT

Motivation: High-throughput sequencing has transformed the study of gene expression levels through RNA-seq, a technique that is now routinely used by various fields, such as genetic research or diagnostics. The advent of third generation sequencing technologies providing significantly longer reads opens up new possibilities. However, the high error rates common to these technologies set new bioinformatics challenges for the gapped alignment of reads to their genomic origin. In this study, we have explored how currently available RNA-seq splice-aware alignment tools cope with increased read lengths and error rates. All tested tools were initially developed for short NGS reads, but some have claimed support for long Pacific Biosciences (PacBio) or even Oxford Nanopore Technologies (ONT) MinION reads. Results: The tools were tested on synthetic and real datasets from two technologies (PacBio and ONT MinION). Alignment quality and resource usage were compared across different aligners. The effect of error correction of long reads was explored, both using self-correction and correction with an external short reads dataset. A tool was developed for evaluating RNA-seq alignment results. This tool can be used to compare the alignment of simulated reads to their genomic origin, or to compare the alignment of real reads to a set of annotated transcripts. Our tests show that while some RNA-seq aligners were unable to cope with long error-prone reads, others produced overall good results. We further show that alignment accuracy can be improved using error-corrected reads. Availability and implementation: https://github.com/kkrizanovic/RNAseqEval, https://figshare.com/projects/RNAseq_benchmark/24391. Contact: mile.sikic@fer.hr. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Animals , Drosophila melanogaster/genetics , Humans , Saccharomyces cerevisiae/genetics
10.
Bioinformatics ; 33(9): 1394-1395, 2017 05 01.
Article in English | MEDLINE | ID: mdl-28453688

ABSTRACT

Summary: We present Edlib, an open-source C/C++ library for exact pairwise sequence alignment using edit distance. We compare Edlib to other libraries and show that it is the fastest while not lacking in functionality and can also easily handle very large sequences. Being easy to use, flexible, fast and low on memory usage, we expect it to be easily adopted as a building block for future bioinformatics tools. Availability and Implementation: Source code, installation instructions and test data are freely available for download at https://github.com/Martinsos/edlib, under the MIT licence. Edlib is implemented in C/C++ and supported on Linux, MS Windows, and Mac OS. Contact: mile.sikic@fer.hr. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Sequence Analysis, DNA/methods , Software , Algorithms
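
The Edlib abstract above concerns exact pairwise alignment under edit distance. To make the quantity concrete, the following is a minimal sketch of the textbook dynamic-programming computation of edit (Levenshtein) distance. It is not Edlib's implementation and does not use Edlib's API; the function name `edit_distance` is chosen here for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b, keeping only two DP rows."""
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer string so each row is short
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))
```

This version takes time proportional to the product of the two sequence lengths, which is the cost that optimized libraries aim to reduce in practice.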
11.
Genome Res ; 27(5): 737-746, 2017 05.
Article in English | MEDLINE | ID: mdl-28100585

ABSTRACT

The assembly of long reads from Pacific Biosciences and Oxford Nanopore Technologies typically requires resource-intensive error-correction and consensus-generation steps to obtain high-quality assemblies. We show that the error-correction step can be omitted and that high-quality consensus sequences can be generated efficiently with a SIMD-accelerated, partial-order alignment-based, stand-alone consensus module called Racon. Based on tests with PacBio and Oxford Nanopore data sets, we show that Racon coupled with miniasm enables consensus genomes with similar or better quality than state-of-the-art methods while being an order of magnitude faster.


Subject(s)
Algorithms , Contig Mapping/methods , Genomics/methods , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Contig Mapping/standards , Genomics/standards , Sequence Alignment/standards , Sequence Analysis, DNA/standards
12.
Bioinformatics ; 32(17): i680-i684, 2016 09 01.
Article in English | MEDLINE | ID: mdl-27587689

ABSTRACT

MOTIVATION: Protein database search is one of the fundamental problems in bioinformatics. For decades, it has been explored and solved using different exact and heuristic approaches. However, exponential growth of data in recent years has brought significant challenges in improving already existing algorithms. BLAST has been the most successful tool for protein database search, but is also becoming a bottleneck in many applications. Due to that, many different approaches have been developed to complement or replace it. In this article, we present SWORD, an efficient protein database search implementation that runs 8-16 times faster than BLAST in the sensitive mode and up to 68 times faster in the fast and less accurate mode. It is designed to be used in nearly all database search environments, but is especially suitable for large databases. Its sensitivity exceeds that of BLAST for the majority of input datasets and provides guaranteed optimal alignments. AVAILABILITY AND IMPLEMENTATION: SWORD is freely available for download from https://github.com/rvaser/sword. CONTACT: robert.vaser@fer.hr and mile.sikic@fer.hr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Databases, Protein , Search Engine , Sequence Alignment , Algorithms , Software
13.
Bioinformatics ; 32(17): 2582-9, 2016 09 01.
Article in English | MEDLINE | ID: mdl-27162186

ABSTRACT

MOTIVATION: The recent emergence of nanopore sequencing technology has set a challenge for established assembly methods. In this work, we assessed how existing hybrid and non-hybrid de novo assembly methods perform on long and error-prone nanopore reads. RESULTS: We benchmarked five non-hybrid (in terms of both error correction and scaffolding) assembly pipelines as well as two hybrid assemblers which use third generation sequencing data to scaffold Illumina assemblies. Tests were performed on several publicly available MinION and Illumina datasets of Escherichia coli K-12, using several sequencing coverages of nanopore data (20×, 30×, 40× and 50×). We attempted to assess the assembly quality at each of these coverages, in order to estimate the requirements for closed bacterial genome assembly. For the purpose of the benchmark, an extensible genome assembly benchmarking framework was developed. Results show that hybrid methods are highly dependent on the quality of NGS data, but much less on the quality and coverage of nanopore data and perform relatively well on lower nanopore coverages. All non-hybrid methods correctly assemble the E. coli genome when coverage is above 40×, even the non-hybrid method tailored for Pacific Biosciences reads. While it requires higher coverage compared to a method designed particularly for nanopore reads, its running time is significantly lower. AVAILABILITY AND IMPLEMENTATION: https://github.com/kkrizanovic/NanoMark. CONTACT: mile.sikic@fer.hr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Nanopores , Sequence Analysis, DNA , Escherichia coli , Escherichia coli K12 , Genome, Bacterial , High-Throughput Nucleotide Sequencing
14.
Nat Commun ; 7: 11307, 2016 Apr 15.
Article in English | MEDLINE | ID: mdl-27079541

ABSTRACT

Realizing the democratic promise of nanopore sequencing requires the development of new bioinformatics approaches to deal with its specific error characteristics. Here we present GraphMap, a mapping algorithm designed to analyse nanopore sequencing reads, which progressively refines candidate alignments to robustly handle potentially high-error rates and a fast graph traversal to align long reads with speed and high precision (>95%). Evaluation on MinION sequencing data sets against short- and long-read mappers indicates that GraphMap increases mapping sensitivity by 10-80% and maps >95% of bases. GraphMap alignments enabled single-nucleotide variant calling on the human genome with increased sensitivity (15%) over the next best mapper, precise detection of structural variants from length 100 bp to 4 kbp, and species and strain-specific identification of pathogens using MinION reads. GraphMap is available open source under the MIT license at https://github.com/isovic/graphmap.


Subject(s)
Algorithms , Computational Biology/methods , Genome, Human/genetics , High-Throughput Nucleotide Sequencing/methods , Genomics/methods , Humans , Nanopores , Polymorphism, Single Nucleotide , Reproducibility of Results , Sequence Alignment/methods
15.
Nat Protoc ; 11(1): 1-9, 2016 Jan.
Article in English | MEDLINE | ID: mdl-26633127

ABSTRACT

The SIFT (sorting intolerant from tolerant) algorithm helps bridge the gap between mutations and phenotypic variations by predicting whether an amino acid substitution is deleterious. SIFT has been used in disease, mutation and genetic studies, and a protocol for its use has been previously published with Nature Protocols. This updated protocol describes SIFT 4G (SIFT for genomes), which is a faster version of SIFT that enables practical computations on reference genomes. Users can get predictions for single-nucleotide variants from their organism of interest using the SIFT 4G annotator with SIFT 4G's precomputed databases. The scope of genomic predictions is expanded, with predictions available for more than 200 organisms. Users can also run the SIFT 4G algorithm themselves. SIFT predictions can be retrieved for 6.7 million variants in 4 min once the database has been downloaded. If precomputed predictions are not available, the SIFT 4G algorithm can compute predictions at a rate of 2.6 s per protein sequence. SIFT 4G is available from http://sift-dna.org/sift4g.


Subject(s)
Algorithms , Genomics/methods , Mutation, Missense/genetics , Databases, Protein , Genomics/standards , Humans , Molecular Sequence Annotation , Phenotype , Reference Standards
16.
Phys Rev Lett ; 114(24): 248701, 2015 Jun 19.
Article in English | MEDLINE | ID: mdl-26197016

ABSTRACT

Detection of patient zero can give new insights to epidemiologists about the nature of first transmissions into a population. In this Letter, we study the statistical inference problem of detecting the source of epidemics from a snapshot of spreading on an arbitrary network structure. By using exact analytic calculations and Monte Carlo estimators, we demonstrate the detectability limits for the susceptible-infected-recovered model, which primarily depend on the spreading process characteristics. Finally, we demonstrate the applicability of the approach in a case of a simulated sexually transmitted infection spreading over an empirical temporal network of sexual interactions.


Subject(s)
Contact Tracing/methods , Models, Statistical , Sexually Transmitted Diseases/epidemiology , Computer Simulation , Epidemiologic Methods , Humans , Monte Carlo Method , Sexually Transmitted Diseases/transmission
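
The patient-zero abstract above describes inferring an epidemic source from a snapshot of a susceptible-infected-recovered (SIR) process using Monte Carlo estimators. The sketch below illustrates the general idea with a naive estimator: simulate the process from each candidate source and count how often the simulation reproduces the observed snapshot exactly. The discrete-time update rule, the function names and the toy path graph are all assumptions made for illustration; the estimators in the paper may differ substantially.

```python
import random


def simulate_sir(graph, source, p, q, steps, rng):
    """Run a synchronous discrete-time SIR process (an assumed model).

    Each round, every infected node infects each susceptible neighbour
    with probability p, then recovers with probability q.
    Returns the (infected, recovered) sets after `steps` rounds.
    """
    infected, recovered = {source}, set()
    for _ in range(steps):
        newly_infected = set()
        for node in infected:
            for neighbour in graph[node]:
                susceptible = neighbour not in infected and neighbour not in recovered
                if susceptible and rng.random() < p:
                    newly_infected.add(neighbour)
        newly_recovered = {node for node in infected if rng.random() < q}
        infected = (infected | newly_infected) - newly_recovered
        recovered |= newly_recovered
    return infected, recovered


def estimate_source_likelihoods(graph, snapshot, p, q, steps, runs=1000, seed=0):
    """Naive Monte Carlo estimate of P(snapshot | source) for every node."""
    rng = random.Random(seed)
    likelihoods = {}
    for candidate in graph:
        hits = sum(
            simulate_sir(graph, candidate, p, q, steps, rng) == snapshot
            for _ in range(runs)
        )
        likelihoods[candidate] = hits / runs
    return likelihoods


# Toy path graph 0-1-2-3-4 (hypothetical). With p=1 and q=0 the process is
# deterministic, so only the true source (node 2) can reproduce the snapshot.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
snapshot = ({1, 2, 3}, set())  # (infected, recovered) observed after one round
likelihoods = estimate_source_likelihoods(path, snapshot, p=1.0, q=0.0, steps=1, runs=20)
print(max(likelihoods, key=likelihoods.get))
```

With stochastic parameters (0 < p < 1, q > 0) several candidates can receive non-zero likelihood, which is one way to picture the detectability limits the abstract refers to.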
17.
PLoS One ; 10(12): e0145857, 2015.
Article in English | MEDLINE | ID: mdl-26719890

ABSTRACT

In recent years we have witnessed a growth in sequencing yield, in the number of samples sequenced and, as a result, in the size of publicly maintained sequence databases. This increase in data has put high demands on protein similarity search algorithms, which face two ever-opposing goals: keeping running times acceptable while maintaining a high enough level of sensitivity. The most time-consuming step of similarity search is the local alignment between query and database sequences. This step is usually performed using exact local alignment algorithms such as Smith-Waterman. Due to its quadratic time complexity, alignments of a query to the whole database are usually too slow. Therefore, most protein similarity search methods apply heuristics before the exact local alignment to reduce the number of candidate sequences in the database. However, there is still a need for the alignment of a query sequence to a reduced database. In this paper we present the SW#db tool and a library for fast exact similarity search. Although its running times, as a standalone tool, are comparable to the running times of BLAST, it is primarily intended to be used for the exact local alignment phase in which the database of sequences has already been reduced. It uses both GPU and CPU parallelization and was 4-5 times faster than SSEARCH, 6-25 times faster than CUDASW++ and more than 20 times faster than SSW at the time of writing, using multiple queries on Swiss-Prot and UniRef90 databases.


Subject(s)
Computational Biology/methods , Sequence Alignment/methods , Software , Algorithms , Databases, Nucleic Acid , Web Browser
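
The SW#db abstract above attributes the cost of exact local alignment to the quadratic time complexity of Smith-Waterman. The following minimal sketch shows where that cost comes from: one cell is filled for every pair of query and target positions. It is a plain CPU illustration, not the SW#db code, and it assumes a simple match/mismatch score with a linear gap penalty, whereas protein search tools typically use substitution matrices and affine gaps.

```python
def smith_waterman_score(query: str, target: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score under an assumed linear-gap scoring scheme."""
    best = 0
    previous = [0] * (len(target) + 1)  # DP row for the previous query position
    for q in query:
        current = [0]
        for j, t in enumerate(target, start=1):
            diagonal = previous[j - 1] + (match if q == t else mismatch)
            score = max(
                0,                      # start a new local alignment here
                diagonal,               # align q with t
                previous[j] + gap,      # gap in the target
                current[j - 1] + gap,   # gap in the query
            )
            current.append(score)
            best = max(best, score)
        previous = current
    return best


print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The two nested loops make the running time proportional to len(query) × len(target), which is why database-scale search either filters candidates first or parallelizes the cell updates.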
18.
Nucleic Acids Res ; 42(Database issue): D879-81, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24271393

ABSTRACT

ExoLocator (http://exolocator.eopsf.org) collects in a single place information needed for comparative analysis of protein-coding exons from vertebrate species. The main source of data--the genomic sequences, and the existing exon and homology annotation--is the ENSEMBL database of completed vertebrate genomes. To these, ExoLocator adds the search for ostensibly missing exons in orthologous protein pairs across species, using an extensive computational pipeline to narrow down the search region for the candidate exons and find a suitable template in the other species, as well as state-of-the-art implementations of pairwise alignment algorithms. The resulting complements of exons are organized in a way currently unique to ExoLocator: multiple sequence alignments, both on the nucleotide and on the peptide levels, clearly indicating the exon boundaries. The alignments can be inspected in the web-embedded viewer, downloaded or used on the spot to produce an estimate of conservation within orthologous sets, or functional divergence across paralogues.


Subject(s)
Databases, Protein , Exons , Proteins/genetics , Animals , Genome, Human , Humans , Internet , Vertebrates/genetics
19.
Bioinformatics ; 29(19): 2494-5, 2013 Oct 01.
Article in English | MEDLINE | ID: mdl-23864730

ABSTRACT

SUMMARY: We propose SW#, a new CUDA graphical processor unit-enabled and memory-efficient implementation of a dynamic programming algorithm for local alignment. It can be used as either a stand-alone application or a library. Although there are other graphical processor unit implementations of the Smith-Waterman algorithm, SW# is the only one publicly available that can produce sequence alignments on a genome-wide scale. For long sequences, it is at least a few hundred times faster than a CPU version of the same algorithm. AVAILABILITY: Source code and installation instructions are freely available for download at http://complex.zesoi.fer.hr/SW.html.


Subject(s)
Algorithms , Genome , Base Sequence , Internet , Sequence Alignment , Software
20.
Cell Rep ; 2(5): 1207-19, 2012 Nov 29.
Article in English | MEDLINE | ID: mdl-23103170

ABSTRACT

Chromatin interactions play important roles in transcription regulation. To better understand the underlying evolutionary and functional constraints of these interactions, we implemented a systems approach to examine RNA polymerase-II-associated chromatin interactions in human cells. We found that 40% of the total genomic elements involved in chromatin interactions converged to a giant, scale-free-like, hierarchical network organized into chromatin communities. The communities were enriched in specific functions and were syntenic through evolution. Disease-associated SNPs from genome-wide association studies were enriched among the nodes with fewer interactions, implying their selection against deleterious interactions by limiting the total number of interactions, a model that we further reconciled using somatic and germline cancer mutation data. The hubs lacked disease-associated SNPs, constituted a nonrandomly interconnected core of key cellular functions, and exhibited lethality in mouse mutants, supporting an evolutionary selection that favored the nonrandom spatial clustering of the least-evolving key genomic domains against random genetic or transcriptional errors in the genome. Altogether, our analyses reveal a systems-level evolutionary framework that shapes functionally compartmentalized and error-tolerant transcriptional regulation of the human genome in three dimensions.


Subject(s)
Chromatin/metabolism , Animals , Biological Evolution , Gene Regulatory Networks , Genome , Genome, Human , Genome-Wide Association Study , Humans , K562 Cells , MCF-7 Cells , Mice , Polymorphism, Single Nucleotide , Promoter Regions, Genetic , RNA Polymerase II/metabolism , Transcription, Genetic