Results 1 - 20 of 26
1.
Genome Res ; 2024 Oct 29.
Article in English | MEDLINE | ID: mdl-39472021

ABSTRACT

The critical stage of every de novo genome assembler is identifying paths in assembly graphs that correspond to the reconstructed genomic sequences. The existing algorithmic methods struggle with this, primarily due to repetitive regions causing complex graph tangles, leading to fragmented assemblies. Here, we introduce GNNome, a framework for path identification based on geometric deep learning that enables training models on assembly graphs without relying on existing assembly strategies. By leveraging only the symmetries inherent to the problem, GNNome reconstructs assemblies from PacBio HiFi reads with contiguity and quality comparable to those of the state-of-the-art tools across several species. With every new genome assembled telomere-to-telomere, the amount of reliable training data at our disposal increases. Combining the straightforward generation of abundant simulated data for diverse genomic structures with the AI approach makes the proposed framework a plausible cornerstone for future work on reconstructing complex genomes with different ploidy and aneuploidy degrees. To facilitate such developments, we make the framework and the best-performing model publicly available, provided as a tool that can directly be used to assemble new haploid genomes.

2.
Nat Methods ; 19(7): 833-844, 2022 07.
Article in English | MEDLINE | ID: mdl-35697834

ABSTRACT

Inosine is a prevalent RNA modification in animals and is formed when an adenosine is deaminated by the ADAR family of enzymes. Traditionally, inosines are identified indirectly as variants from Illumina RNA-sequencing data because they are interpreted as guanosines by cellular machineries. However, this indirect method performs poorly in protein-coding regions where exons are typically short, in non-model organisms with sparsely annotated single-nucleotide polymorphisms, or in disease contexts where unknown DNA mutations are pervasive. Here, we show that Oxford Nanopore direct RNA sequencing can be used to identify inosine-containing sites in native transcriptomes with high accuracy. We trained convolutional neural network models to distinguish inosine from adenosine and guanosine, and to estimate the modification rate at each editing site. Furthermore, we demonstrated their utility on the transcriptomes of human, mouse and Xenopus. Our approach expands the toolkit for studying adenosine-to-inosine editing and can be further extended to investigate other RNA modifications.


Subjects
Nanopores , RNA , Adenosine/genetics , Animals , Inosine/genetics , Mice , RNA/genetics , RNA/metabolism , RNA Editing , RNA Sequence Analysis
3.
BMC Bioinformatics ; 25(1): 15, 2024 Jan 11.
Article in English | MEDLINE | ID: mdl-38212694

ABSTRACT

BACKGROUND: Long reads have gained popularity in the analysis of metagenomics data. Therefore, we comprehensively assessed metagenomics classification tools on the species taxonomic level. We analysed kmer-based tools, mapping-based tools and two general-purpose long reads mappers. We evaluated more than 20 pipelines which use either nucleotide or protein databases and selected 13 for an extensive benchmark. We prepared seven synthetic datasets to test various scenarios, including the presence of a host, unknown species and related species. Moreover, we used available sequencing data from three well-defined mock communities, including a dataset with abundance varying from 0.0001 to 20% and six real gut microbiomes. RESULTS: General-purpose mappers Minimap2 and Ram achieved similar or better accuracy on most testing metrics than the best-performing classification tools. They were up to ten times slower than the fastest kmer-based tools requiring up to four times less RAM. All tested tools were prone to report organisms not present in datasets, except CLARK-S, and they underperformed in the case of the high presence of the host's genetic material. Tools which use a protein database performed worse than those based on a nucleotide database. Longer read lengths made classification easier, but due to the difference in read length distributions among species, the usage of only the longest reads reduced the accuracy. The comparison of real gut microbiome datasets shows similar abundance profiles for the same type of tools but discordance in the number of reported organisms and abundances between types. Most assessments showed the influence of database completeness on the reports. CONCLUSION: The findings indicate that kmer-based tools are well-suited for rapid analysis of long reads data. However, when heightened accuracy is essential, mappers demonstrate slightly superior performance, albeit at a considerably slower pace. Nevertheless, a combination of diverse categories of tools and databases will likely be necessary to analyse complex samples. Discrepancies observed among tools when applied to real gut datasets, as well as a reduced performance in cases where unknown species or a significant proportion of the host genome is present in the sample, highlight the need for continuous improvement of existing tools. Additionally, regular updates and curation of databases are important to ensure their effectiveness.


Subjects
High-Throughput Nucleotide Sequencing , Metagenome , DNA Sequence Analysis , Metagenomics , Protein Databases , Nucleotides
4.
5.
Genome Res ; 27(5): 737-746, 2017 05.
Article in English | MEDLINE | ID: mdl-28100585

ABSTRACT

The assembly of long reads from Pacific Biosciences and Oxford Nanopore Technologies typically requires resource-intensive error-correction and consensus-generation steps to obtain high-quality assemblies. We show that the error-correction step can be omitted and that high-quality consensus sequences can be generated efficiently with a SIMD-accelerated, partial-order alignment-based, stand-alone consensus module called Racon. Based on tests with PacBio and Oxford Nanopore data sets, we show that Racon coupled with miniasm enables consensus genomes with similar or better quality than state-of-the-art methods while being an order of magnitude faster.


Subjects
Algorithms , Contig Mapping/methods , Genomics/methods , Sequence Alignment/methods , DNA Sequence Analysis/methods , Contig Mapping/standards , Genomics/standards , Sequence Alignment/standards , DNA Sequence Analysis/standards
6.
Bioinformatics ; 34(5): 748-754, 2018 03 01.
Article in English | MEDLINE | ID: mdl-29069314

ABSTRACT

Motivation: High-throughput sequencing has transformed the study of gene expression levels through RNA-seq, a technique that is now routinely used by various fields, such as genetic research or diagnostics. The advent of third generation sequencing technologies providing significantly longer reads opens up new possibilities. However, the high error rates common to these technologies set new bioinformatics challenges for the gapped alignment of reads to their genomic origin. In this study, we have explored how currently available RNA-seq splice-aware alignment tools cope with increased read lengths and error rates. All tested tools were initially developed for short NGS reads, but some have claimed support for long Pacific Biosciences (PacBio) or even Oxford Nanopore Technologies (ONT) MinION reads. Results: The tools were tested on synthetic and real datasets from two technologies (PacBio and ONT MinION). Alignment quality and resource usage were compared across different aligners. The effect of error correction of long reads was explored, both using self-correction and correction with an external short reads dataset. A tool was developed for evaluating RNA-seq alignment results. This tool can be used to compare the alignment of simulated reads to their genomic origin, or to compare the alignment of real reads to a set of annotated transcripts. Our tests show that while some RNA-seq aligners were unable to cope with long error-prone reads, others produced overall good results. We further show that alignment accuracy can be improved using error-corrected reads. Availability and implementation: https://github.com/kkrizanovic/RNAseqEval, https://figshare.com/projects/RNAseq_benchmark/24391. Contact: mile.sikic@fer.hr. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Gene Expression Profiling/methods , High-Throughput Nucleotide Sequencing/methods , DNA Sequence Analysis/methods , Software , Animals , Drosophila melanogaster/genetics , Humans , Saccharomyces cerevisiae/genetics
8.
Bioinformatics ; 33(9): 1394-1395, 2017 05 01.
Article in English | MEDLINE | ID: mdl-28453688

ABSTRACT

Summary: We present Edlib, an open-source C/C++ library for exact pairwise sequence alignment using edit distance. We compare Edlib to other libraries and show that it is the fastest while not lacking in functionality and can also easily handle very large sequences. Being easy to use, flexible, fast and low on memory usage, we expect it to be easily adopted as a building block for future bioinformatics tools. Availability and Implementation: Source code, installation instructions and test data are freely available for download at https://github.com/Martinsos/edlib, under the MIT licence. Edlib is implemented in C/C++ and supported on Linux, MS Windows, and Mac OS. Contact: mile.sikic@fer.hr. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
DNA Sequence Analysis/methods , Software , Algorithms
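
The edit distance that the Edlib abstract refers to can be illustrated with the textbook dynamic programme below. This is only a minimal sketch of the quantity being computed: the function name `edit_distance` is hypothetical, and the code does not reproduce Edlib's own optimised implementation or its API.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme.

    Illustrative only; not Edlib's implementation.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string costs i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]
```

The two-row layout keeps memory proportional to the shorter sequence, which matters once sequences become long; the running time remains proportional to the product of the two lengths.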
9.
Bioinformatics ; 32(17): i680-i684, 2016 09 01.
Article in English | MEDLINE | ID: mdl-27587689

ABSTRACT

MOTIVATION: Protein database search is one of the fundamental problems in bioinformatics. For decades, it has been explored and solved using different exact and heuristic approaches. However, exponential growth of data in recent years has brought significant challenges in improving already existing algorithms. BLAST has been the most successful tool for protein database search, but is also becoming a bottleneck in many applications. Due to that, many different approaches have been developed to complement or replace it. In this article, we present SWORD, an efficient protein database search implementation that runs 8-16 times faster than BLAST in the sensitive mode and up to 68 times faster in the fast and less accurate mode. It is designed to be used in nearly all database search environments, but is especially suitable for large databases. Its sensitivity exceeds that of BLAST for the majority of input datasets and provides guaranteed optimal alignments. AVAILABILITY AND IMPLEMENTATION: Sword is freely available for download from https://github.com/rvaser/sword. CONTACT: robert.vaser@fer.hr and mile.sikic@fer.hr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Protein Databases , Search Engine , Sequence Alignment , Algorithms , Software
10.
Bioinformatics ; 32(17): 2582-9, 2016 09 01.
Article in English | MEDLINE | ID: mdl-27162186

ABSTRACT

MOTIVATION: Recent emergence of nanopore sequencing technology set a challenge for established assembly methods. In this work, we assessed how existing hybrid and non-hybrid de novo assembly methods perform on long and error-prone nanopore reads. RESULTS: We benchmarked five non-hybrid (in terms of both error correction and scaffolding) assembly pipelines as well as two hybrid assemblers which use third generation sequencing data to scaffold Illumina assemblies. Tests were performed on several publicly available MinION and Illumina datasets of Escherichia coli K-12, using several sequencing coverages of nanopore data (20×, 30×, 40× and 50×). We attempted to assess the assembly quality at each of these coverages, in order to estimate the requirements for closed bacterial genome assembly. For the purpose of the benchmark, an extensible genome assembly benchmarking framework was developed. Results show that hybrid methods are highly dependent on the quality of NGS data, but much less on the quality and coverage of nanopore data and perform relatively well on lower nanopore coverages. All non-hybrid methods correctly assemble the E. coli genome when coverage is above 40×, even the non-hybrid method tailored for Pacific Biosciences reads. While it requires higher coverage compared to a method designed particularly for nanopore reads, its running time is significantly lower. AVAILABILITY AND IMPLEMENTATION: https://github.com/kkrizanovic/NanoMark. CONTACT: mile.sikic@fer.hr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Nanopores , DNA Sequence Analysis , Escherichia coli , Escherichia coli K12 , Bacterial Genome , High-Throughput Nucleotide Sequencing
11.
Nucleic Acids Res ; 42(Database issue): D879-81, 2014 Jan.
Article in English | MEDLINE | ID: mdl-24271393

ABSTRACT

ExoLocator (http://exolocator.eopsf.org) collects in a single place information needed for comparative analysis of protein-coding exons from vertebrate species. The main source of data--the genomic sequences, and the existing exon and homology annotation--is the ENSEMBL database of completed vertebrate genomes. To these, ExoLocator adds the search for ostensibly missing exons in orthologous protein pairs across species, using an extensive computational pipeline to narrow down the search region for the candidate exons and find a suitable template in the other species, as well as state-of-the-art implementations of pairwise alignment algorithms. The resulting complements of exons are organized in a way currently unique to ExoLocator: multiple sequence alignments, both on the nucleotide and on the peptide levels, clearly indicating the exon boundaries. The alignments can be inspected in the web-embedded viewer, downloaded or used on the spot to produce an estimate of conservation within orthologous sets, or functional divergence across paralogues.


Subjects
Protein Databases , Exons , Proteins/genetics , Animals , Human Genome , Humans , Internet , Vertebrates/genetics
12.
Phys Rev Lett ; 114(24): 248701, 2015 Jun 19.
Article in English | MEDLINE | ID: mdl-26197016

ABSTRACT

Detection of patient zero can give new insights to epidemiologists about the nature of first transmissions into a population. In this Letter, we study the statistical inference problem of detecting the source of epidemics from a snapshot of spreading on an arbitrary network structure. By using exact analytic calculations and Monte Carlo estimators, we demonstrate the detectability limits for the susceptible-infected-recovered model, which primarily depend on the spreading process characteristics. Finally, we demonstrate the applicability of the approach in a case of a simulated sexually transmitted infection spreading over an empirical temporal network of sexual interactions.


Subjects
Contact Tracing/methods , Statistical Models , Sexually Transmitted Infections/epidemiology , Computer Simulation , Epidemiologic Methods , Humans , Monte Carlo Method , Sexually Transmitted Infections/transmission
13.
Bioinformatics ; 29(19): 2494-5, 2013 Oct 01.
Article in English | MEDLINE | ID: mdl-23864730

ABSTRACT

SUMMARY: We propose SW#, a new CUDA graphical processor unit-enabled and memory-efficient implementation of the dynamic programming algorithm for local alignment. It can be used as either a stand-alone application or a library. Although there are other graphical processor unit implementations of the Smith-Waterman algorithm, SW# is the only one publicly available that can produce sequence alignments on a genome-wide scale. For long sequences, it is at least a few hundred times faster than a CPU version of the same algorithm. AVAILABILITY: Source code and installation instructions are freely available for download at http://complex.zesoi.fer.hr/SW.html.


Subjects
Algorithms , Genome , Base Sequence , Internet , Sequence Alignment , Software
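
The Smith-Waterman recurrence behind the local alignment mentioned in the SW# abstract can be sketched on the CPU as follows. This is a minimal illustration, not SW#'s CUDA implementation: the function name `smith_waterman_score`, the linear gap penalty and the scoring values are all assumptions chosen for the example, and only the best score is returned, not the alignment itself.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score under a linear gap penalty.

    Only two rows of the matrix are kept, so memory is O(len(b)).
    Scoring parameters are illustrative assumptions.
    """
    best = 0
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]  # column 0: an empty prefix of b scores 0
        for j, cb in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ca == cb else mismatch)
            score = max(
                0,                     # start a new local alignment here
                diagonal,              # align ca with cb
                previous[j] + gap,     # gap in b
                current[j - 1] + gap,  # gap in a
            )
            current.append(score)
            best = max(best, score)
        previous = current
    return best
```

Clamping every cell at zero is what makes the alignment local: a poorly matching prefix is dropped instead of dragging the score down.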
14.
Nucleic Acids Res ; 40(Web Server issue): W352-7, 2012 Jul.
Article in English | MEDLINE | ID: mdl-22693222

ABSTRACT

In this article, we introduce BioMe (biologically relevant metals), a web-based platform for calculation of various statistical properties of metal-binding sites. Users can obtain the following statistical properties: presence of selected ligands in metal coordination sphere, distribution of coordination numbers, percentage of metal ions coordinated by the combination of selected ligands, distribution of monodentate and bidentate metal-carboxyl bindings for ASP and GLU, percentage of particular binuclear metal centers, distribution of coordination geometry, descriptive statistics for a metal ion-donor distance and percentage of the selected metal ions coordinated by each of the selected ligands. Statistics are presented in numerical and graphical forms. The underlying database contains information about all contacts within the range of 3 Å from a metal ion found in the asymmetric crystal unit. The stored information for each metal ion includes Protein Data Bank code, structure determination method, types of metal-binding chains [protein, ribonucleic acid (RNA), deoxyribonucleic acid (DNA), water and other] and names of the bound ligands (amino acid residue, RNA nucleotide, DNA nucleotide, water and other) and the coordination number, the coordination geometry and, if applicable, another metal(s). BioMe is on a regular weekly update schedule. It is accessible at http://metals.zesoi.fer.hr.


Subjects
Metals/chemistry , Software , Binding Sites , DNA/chemistry , Statistical Data Interpretation , Internet , Ligands , Metalloproteins/chemistry , RNA/chemistry , User-Computer Interface
15.
Nat Commun ; 15(1): 5580, 2024 Jul 03.
Article in English | MEDLINE | ID: mdl-38961062

ABSTRACT

DNA methylation plays an important role in various biological processes, including cell differentiation, ageing, and cancer development. The most important methylation in mammals is 5-methylcytosine mostly occurring in the context of CpG dinucleotides. Sequencing methods such as whole-genome bisulfite sequencing successfully detect 5-methylcytosine DNA modifications. However, they suffer from the serious drawbacks of short read lengths and might introduce an amplification bias. Here we present Rockfish, a deep learning algorithm that significantly improves read-level 5-methylcytosine detection by using Nanopore sequencing. Rockfish is compared with other methods based on Nanopore sequencing on R9.4.1 and R10.4.1 datasets. There is an increase in the single-base accuracy and the F1 measure of up to 5 percentage points on R9.4.1 datasets, and up to 0.82 percentage points on R10.4.1 datasets. Moreover, Rockfish shows a high correlation with whole-genome bisulfite sequencing, requires lower read depth, and achieves higher confidence in biologically important regions such as CpG-rich promoters while being computationally efficient. Its superior performance in human and mouse samples highlights its versatility for studying 5-methylcytosine methylation across varied organisms and diseases. Finally, its adaptable architecture ensures compatibility with new versions of pores and chemistry as well as modification types.


Subjects
5-Methylcytosine , CpG Islands , DNA Methylation , Nanopore Sequencing , 5-Methylcytosine/metabolism , 5-Methylcytosine/chemistry , Nanopore Sequencing/methods , Animals , Mice , Humans , CpG Islands/genetics , Deep Learning , Algorithms , DNA Sequence Analysis/methods , Whole Genome Sequencing/methods , Sulfites/chemistry
16.
Commun Biol ; 5(1): 967, 2022 09 15.
Article in English | MEDLINE | ID: mdl-36109650

ABSTRACT

Singapore's National Flower, Papilionanthe (Ple.) Miss Joaquim 'Agnes' (PMJ) is highly prized as a horticultural flower from the Orchidaceae family. A combination of short-read sequencing, single-molecule long-read sequencing and chromatin contact mapping was used to assemble the PMJ genome, spanning 2.5 Gb and 19 pseudo-chromosomal scaffolds. Genomic resources and chemical profiling provided insights towards identifying, understanding and elucidating various classes of secondary metabolite compounds synthesized by the flower. For example, the presence of the anthocyanin pigments detected by chemical profiling coincides with the expression of ANTHOCYANIN SYNTHASE (ANS), an enzyme responsible for the synthesis of the former. Similarly, the presence of vandaterosides (a unique class of glycosylated organic acids with the potential to slow skin aging) discovered using chemical profiling revealed the involvement of candidate glycosyltransferase family enzymes in vandateroside biosynthesis. Interestingly, despite the unnoticeable scent of the flower, genes involved in the biosynthesis of volatile compounds and chemical profiling revealed the combination of oxygenated hydrocarbons, including traces of linalool, beta-ionone and vanillin, forming the scent profile of PMJ. In summary, by combining genomics and biochemistry, the findings expand the known biodiversity repertoire of the Orchidaceae family and provide insights into the genome and secondary metabolite processes of PMJ.


Subjects
Anthocyanins , Orchidaceae , Chromatin/metabolism , Flowers/genetics , Flowers/metabolism , Plant Gene Expression Regulation , Glycosyltransferases/genetics , Metabolic Networks and Pathways , Orchidaceae/genetics , Singapore
17.
Nat Comput Sci ; 1(5): 332-336, 2021 May.
Article in English | MEDLINE | ID: mdl-38217213

ABSTRACT

Whole genome sequencing technologies are unable to invariably read DNA molecules intact, a shortcoming that assemblers try to resolve by stitching the obtained fragments back together. Here, we present methods for the improvement of de novo genome assembly from erroneous long reads incorporated into a tool called Raven. Raven maintains similar performance for various genomes and has accuracy on par with other assemblers that support third-generation sequencing data. It is one of the fastest options while having the lowest memory consumption on the majority of benchmarked datasets.

18.
PLoS Comput Biol ; 5(1): e1000278, 2009 Jan.
Article in English | MEDLINE | ID: mdl-19180183

ABSTRACT

Identifying interaction sites in proteins provides important clues to the function of a protein and is becoming increasingly relevant in topics such as systems biology and drug discovery. Although there are numerous papers on the prediction of interaction sites using information derived from structure, there are only a few case reports on the prediction of interaction residues based solely on protein sequence. Here, a sliding window approach is combined with the Random Forests method to predict protein interaction sites using (i) a combination of sequence- and structure-derived parameters and (ii) sequence information alone. For sequence-based prediction we achieved a precision of 84% with a 26% recall and an F-measure of 40%. When combined with structural information, the prediction performance increases to a precision of 76% and a recall of 38% with an F-measure of 51%. We also present an attempt to rationalize the sliding window size and demonstrate that a nine-residue window is the most suitable for predictor construction. Finally, we demonstrate the applicability of our prediction methods by modeling the Ras-Raf complex using predicted interaction sites as target binding interfaces. Our results suggest that it is possible to predict protein interaction sites with quite a high accuracy using only sequence information.


Subjects
Algorithms , Artificial Intelligence , Protein Conformation , Protein Interaction Domains and Motifs , Protein Interaction Mapping/methods , Proteins/chemistry , Amino Acid Sequence , Area Under Curve , Protein Databases , Entropy , Molecular Models , Statistical Models , Proteins/metabolism , ROC Curve , Reproducibility of Results , raf Kinases/chemistry , ras Proteins/chemistry
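
The precision, recall and F-measure figures quoted in the abstract above can be cross-checked with a few lines of arithmetic. The sketch below assumes the F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly but which the reported numbers are consistent with; the function name `f_measure` is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Figures taken from the abstract, expressed as fractions.
sequence_only = f_measure(0.84, 0.26)   # sequence information alone
with_structure = f_measure(0.76, 0.38)  # sequence plus structural information
print(f"{sequence_only:.2f} {with_structure:.2f}")
```

Under that assumption the two values round to 0.40 and 0.51, matching the reported 40% and 51%, so the gain from structural information comes from higher recall at the cost of some precision.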
19.
Nat Biotechnol ; 37(8): 937-944, 2019 08.
Article in English | MEDLINE | ID: mdl-31359005

ABSTRACT

Characterization of microbiomes has been enabled by high-throughput metagenomic sequencing. However, existing methods are not designed to combine reads from short- and long-read technologies. We present a hybrid metagenomic assembler named OPERA-MS that integrates assembly-based metagenome clustering with repeat-aware, exact scaffolding to accurately assemble complex communities. Evaluation using defined in vitro and virtual gut microbiomes revealed that OPERA-MS assembles metagenomes with greater base pair accuracy than long-read (>5×; Canu), higher contiguity than short-read (~10× NGA50; MEGAHIT, IDBA-UD, metaSPAdes) and fewer assembly errors than non-metagenomic hybrid assemblers (2×; hybridSPAdes). OPERA-MS provides strain-resolved assembly in the presence of multiple genomes of the same species, high-quality reference genomes for rare species (<1%) with ~9× long-read coverage and near-complete genomes with higher coverage. We used OPERA-MS to assemble 28 gut metagenomes of antibiotic-treated patients, and showed that the inclusion of long nanopore reads produces more contiguous assemblies (200× improvement over short-read assemblies), including more than 80 closed plasmid or phage sequences and a new 263 kbp jumbo phage. High-quality hybrid assemblies enable an exquisitely detailed view of the gut resistome in human patients.


Subjects
Bacteria/drug effects , Bacteria/genetics , Metagenomics/methods , Microbiota/drug effects , DNA Sequence Analysis/methods , Anti-Bacterial Agents/pharmacology , Bacterial Drug Resistance , Feces/microbiology , High-Throughput Nucleotide Sequencing/methods , Humans , Metagenome , Nanopores , Software
20.
BMC Struct Biol ; 8: 21, 2008 Apr 09.
Article in English | MEDLINE | ID: mdl-18400099

ABSTRACT

BACKGROUND: PSAIA (Protein Structure and Interaction Analyzer) was developed to compute geometric parameters for large sets of protein structures in order to predict and investigate protein-protein interaction sites. RESULTS: In addition to most relevant established algorithms, PSAIA offers a new method PIADA (Protein Interaction Atom Distance Algorithm) for the determination of residue interaction pairs. We found that PIADA produced more satisfactory results than comparable algorithms implemented in PSAIA. Particular advantages of PSAIA include its capacity to combine different methods to detect the locations and types of interactions between residues and its ability, without any further automation steps, to handle large numbers of protein structures and complexes. Generally, the integration of a variety of methods enables PSAIA to offer easier automation of analysis and greater reliability of results. PSAIA can be used either via a graphical user interface or from the command-line. Results are generated in either tabular or XML format. CONCLUSION: In a straightforward fashion and for large sets of protein structures, PSAIA enables the calculation of protein geometric parameters and the determination of location and type for protein-protein interaction sites. XML formatted output enables easy conversion of results to various formats suitable for statistical analysis. Results from smaller data sets demonstrated the influence of geometry on protein interaction sites. Comprehensive analysis of properties of large data sets led to new information useful in the prediction of protein-protein interaction sites.


Subjects
Computational Biology , Proteins/chemistry , Proteins/metabolism , Software , Algorithms , Protein Databases , Protein Binding