Results 1 - 20 of 23
1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38695119

ABSTRACT

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.
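
The abstract defines the E-score as the cosine similarity between two residue embedding vectors. A minimal sketch of that definition follows. The function name `e_score` and the toy 3-dimensional vectors are assumptions made for illustration; real embeddings would come from a protein language model such as ProtT5 and have far more dimensions, and this is not the authors' implementation.

```python
import math


def e_score(u, v):
    """Cosine similarity between two residue embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors invented for illustration; not real residue embeddings.
emb_a = [1.0, 0.0, 2.0]
emb_b = [2.0, 0.0, 4.0]   # same direction as emb_a
emb_c = [0.0, 3.0, 0.0]   # orthogonal to emb_a

print(e_score(emb_a, emb_b))
print(e_score(emb_a, emb_c))
```

Because the score depends on the embedding of each residue in its sequence context, the same pair of amino acids can receive different scores at different positions, which is the contrast with a fixed matrix that the abstract draws.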


Subjects
Algorithms , Computational Biology , Sequence Alignment , Sequence Alignment/methods , Computational Biology/methods , Software , Sequence Analysis, Protein/methods , Amino Acid Sequence , Proteins/chemistry , Proteins/genetics , Deep Learning , Databases, Protein
2.
Bioinformatics ; 40(1)2024 01 02.
Article in English | MEDLINE | ID: mdl-38212995

ABSTRACT

MOTIVATION: Proteins accomplish cellular functions by interacting with each other, which makes the prediction of interaction sites a fundamental problem. As experimental methods are expensive and time consuming, computational prediction of the interaction sites has been studied extensively. Structure-based programs are the most accurate, while the sequence-based ones are much more widely applicable, as the sequences available outnumber the structures by two orders of magnitude. Ideally, we would like a tool that has the quality of the former and the applicability of the latter. RESULTS: We provide here the first solution that achieves these two goals. Our new sequence-based program, Seq-InSite, greatly surpasses the performance of sequence-based models, matching the quality of state-of-the-art structure-based predictors, thus effectively superseding the need for models requiring structure. The predictive power of Seq-InSite is illustrated using an analysis of evolutionary conservation for four protein sequences. AVAILABILITY AND IMPLEMENTATION: Seq-InSite is freely available as a web server at http://seq-insite.csd.uwo.ca/ and as free source code, including trained models and all datasets used for training and testing, at https://github.com/lucian-ilie/Seq-InSite.


Subjects
Proteins , Software , Proteins/chemistry , Amino Acid Sequence
3.
Methods Mol Biol ; 2690: 375-383, 2023.
Article in English | MEDLINE | ID: mdl-37450160

ABSTRACT

Several proteins work independently, but the majority work together to maintain the functions of the cell. Thus, it is crucial to know the interaction sites that facilitate protein-protein interactions. The development of effective computational methods is essential because experimental methods are expensive and time-consuming. This chapter is a guide to predicting protein interaction sites using the program "PITHIA." First, some installation guides are presented, followed by descriptions of input file formats. Afterward, PITHIA's commands and options are outlined with examples. Moreover, some notes are provided on how to extend PITHIA's installation and usage.


Subjects
Computational Biology , Proteins , Binding Sites , Protein Binding , Proteins/metabolism
4.
Foods ; 11(22)2022 Nov 16.
Article in English | MEDLINE | ID: mdl-36429263

ABSTRACT

The preservation of food supplies has been humankind's priority since ancient times, and it is arguably more relevant today than ever before. Food sustainability and safety have been heavily prioritized by consumers, producers, and government entities alike. In this regard, filamentous fungi have always been a health hazard due to their contamination of the food substrate with mycotoxins. Additionally, mycotoxins are proven resilient to technological processing. This study aims to identify the main mycotoxins that may occur in the meat and meat products "Farm to Fork" chain, along with their effect on the consumers' health, and also to identify effective methods of prevention through the use of essential oils (EO). At the same time, the antifungal and antimycotoxigenic potential of essential oils was considered in order to provide an overview of the subject. Targeting the main ways of meat products' contamination, the use of essential oils with proven in vitro or in situ efficacy against certain fungal species can be an effective alternative if all the associated challenges are addressed (e.g., application methods, suitability for certain products, toxicity).

5.
Int J Mol Sci ; 23(21)2022 Oct 24.
Article in English | MEDLINE | ID: mdl-36361606

ABSTRACT

Cellular functions are governed by proteins, and, while some proteins work independently, most work by interacting with other proteins. As a result, it is crucially important to know the interaction sites that facilitate the interactions between the proteins. Since the experimental methods are costly and time consuming, it is essential to develop effective computational methods. We present PITHIA, a sequence-based deep learning model for protein interaction site prediction that exploits the combination of multiple sequence alignments and learning attention. We demonstrate that our new model clearly outperforms the state-of-the-art models on a wide range of metrics. In order to provide meaningful comparison, we update existing test datasets with new information regarding interaction sites, as well as introduce an additional new testing dataset which resolves the shortcomings of the existing ones.


Subjects
Attention , Proteins , Sequence Alignment , Computational Biology/methods
6.
Bioinformatics ; 37(9): 1206-1210, 2021 06 09.
Article in English | MEDLINE | ID: mdl-34107042

ABSTRACT

MOTIVATION: Sequence similarity is the most frequently used procedure in biological research, as proved by the widely used BLAST program. The consecutive seed used by BLAST can be dramatically improved by considering multiple spaced seeds. Finding the best seeds is a hard problem and much effort went into developing heuristic algorithms and software for designing highly sensitive spaced seeds. RESULTS: We introduce a new algorithm and software, ALeS, that produces more sensitive seeds than the current state-of-the-art programs, as shown by extensive testing. We also accurately estimate the sensitivity of a seed, enabling its computation for arbitrary seeds. AVAILABILITY AND IMPLEMENTATION: The source code is freely available at github.com/lucian-ilie/ALeS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
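
To illustrate the difference between a consecutive seed and a spaced seed, the sketch below checks where a seed hits a gapless alignment of two equal-length sequences. It assumes the common notation in which `1` marks a position that must match and `*` a position that is ignored. The function name `seed_hits` and the two toy sequences are hypothetical, and this is not the ALeS algorithm, which designs seeds rather than merely applying them.

```python
def seed_hits(seq_a, seq_b, seed):
    """Return start offsets where `seed` hits the gapless alignment.

    `seed` is a string of '1' (position must match) and '*' (don't care).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    # Offsets inside the seed window that are required to match.
    care = [i for i, c in enumerate(seed) if c == "1"]
    hits = []
    for start in range(len(seq_a) - len(seed) + 1):
        if all(seq_a[start + i] == seq_b[start + i] for i in care):
            hits.append(start)
    return hits


# Toy sequences differing at positions 2, 5 and 8 (invented for illustration).
seq_a = "ACGTACGTAC"
seq_b = "ACTTAAGTCC"

print(seed_hits(seq_a, seq_b, "1111"))   # consecutive seed, weight 4
print(seed_hits(seq_a, seq_b, "11*11"))  # spaced seed, weight 4
```

In this constructed example the mismatches are spaced so that no four consecutive positions match, so the consecutive seed finds nothing, while the spaced seed of the same weight hits at two offsets. This is only one hand-picked case; seed sensitivity in general is a probability over a model of alignments.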


Subjects
Algorithms , Software , Research Design
7.
Bioinformatics ; 37(7): 896-904, 2021 05 17.
Article in English | MEDLINE | ID: mdl-32840562

ABSTRACT

MOTIVATION: Proteins usually perform their functions by interacting with other proteins, which is why accurately predicting protein-protein interaction (PPI) binding sites is a fundamental problem. Experimental methods are slow and expensive. Therefore, great efforts are being made towards increasing the performance of computational methods. RESULTS: We propose DEep Learning Prediction of Highly probable protein Interaction sites (DELPHI), a new sequence-based deep learning suite for PPI-binding sites prediction. DELPHI has an ensemble structure which combines a CNN and an RNN component with a fine-tuning technique. Three novel features, HSP, position information and ProtVec, are used in addition to nine existing ones. We comprehensively compare DELPHI to nine state-of-the-art programmes on five datasets, and DELPHI outperforms the competing methods in all metrics even though its training dataset shares the least similarities with the testing datasets. In the most important metrics, AUPRC and MCC, it surpasses the second best programmes by as much as 18.5% and 27.7%, respectively. We also demonstrated that the improvement is essentially due to using the ensemble model and, especially, the three new features. Using DELPHI it is shown that there is a strong correlation between protein-binding residues (PBRs) and sites with strong evolutionary conservation. In addition, DELPHI's predicted PBR sites closely match known data from Pfam. DELPHI is available as open-sourced standalone software and web server. AVAILABILITY AND IMPLEMENTATION: The DELPHI web server can be found at delphi.csd.uwo.ca/, with all datasets and results in this study. The trained models, the DELPHI standalone source code, and the feature computation pipeline are freely available at github.com/lucian-ilie/DELPHI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
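
One of the two metrics singled out above, MCC (Matthews correlation coefficient), is computed from the four confusion-matrix counts of a binary prediction. The sketch below uses the standard formula. The function name `mcc`, the convention of returning 0.0 when the denominator is zero, and the example counts are assumptions for illustration and are not taken from the DELPHI evaluation.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0.0 when a row or column is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Invented counts: perfect, uninformative and fully inverted predictions.
print(mcc(50, 0, 50, 0))
print(mcc(25, 25, 25, 25))
print(mcc(0, 50, 0, 50))
```

MCC ranges from -1 to 1 and uses all four counts, which makes it a common choice when interacting residues are much rarer than non-interacting ones.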


Subjects
Proteins , Software , Binding Sites , Computational Biology , Protein Binding , Proteins/metabolism , Research Design
8.
Methods Mol Biol ; 2074: 1-11, 2020.
Article in English | MEDLINE | ID: mdl-31583626

ABSTRACT

Understanding protein-protein interactions (PPIs) is vital to reveal the functional mechanisms in cells. Thus, predicting and identifying PPIs is one of the fundamental problems in systems biology. Various high-throughput experimental and computational methods have been developed to predict PPIs. Here, we provide a straightforward guide to using the program "SPRINT" to predict the PPIs on an interactome level in an organism. First, some installation guides and input file formats are described. Then, the commands and options to run SPRINT are discussed with examples. In addition, some notes on possible extended installation and usage of SPRINT are given.


Subjects
Protein Interaction Mapping/methods , Proteins/metabolism , Animals , Computational Biology/methods , Humans , Protein Interaction Maps , Proteins/chemistry
9.
Bioinformatics ; 34(4): 678-680, 2018 02 15.
Article in English | MEDLINE | ID: mdl-29045591

ABSTRACT

Summary: De novo genome assembly of next-generation sequencing data is a fundamental problem in bioinformatics. There are many programs that assemble small genomes, but very few can assemble whole human genomes. We present a new algorithm for parallel overlap graph construction, which is capable of assembling human genomes and improves upon the current state-of-the-art in genome assembly. Availability and implementation: SAGE2 is written in C++ and OpenMP and is freely available (under the GPL 3.0 license) at github.com/lucian-ilie/SAGE2. Contact: ilie@uwo.ca. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Genome, Human , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Animals , Humans
10.
BMC Bioinformatics ; 18(1): 564, 2017 Dec 19.
Article in English | MEDLINE | ID: mdl-29258419

ABSTRACT

BACKGROUND: The next generation sequencing (NGS) techniques have been around for over a decade. Many of their fundamental applications rely on the ability to compute good genome assemblies. As the technology evolves, the assembly algorithms and tools have to continuously adjust and improve. The currently dominant technology of Illumina produces reads that are too short to bridge many repeats, setting limits on what can be successfully assembled. The emerging SMRT (Single Molecule, Real-Time) sequencing technique from Pacific Biosciences produces uniform coverage and long reads of length up to sixty thousand base pairs, enabling significantly better genome assemblies. However, SMRT reads are much more expensive and have a much higher error rate than Illumina's - around 10-15% - mostly due to indels. New algorithms are very much needed to take advantage of the long reads while mitigating the effect of high error rate and lowering the required coverage. METHODS: An essential step in assembling SMRT data is the detection of alignments, or overlaps, between reads. High error rate and very long reads make this a much more challenging problem than for Illumina data. We present a new pairwise read aligner, or overlapper, HISEA (Hierarchical SEed Aligner) for SMRT sequencing data. HISEA uses a novel two-step k-mer search, employing consistent clustering, k-mer filtering, and read alignment extension. RESULTS: We compare HISEA against several state-of-the-art programs - BLASR, DALIGNER, GraphMap, MHAP, and Minimap - on real datasets from five organisms. We compare their sensitivity, precision, specificity, F1-score, as well as time and memory usage. We also introduce a new, more precise, evaluation method. Finally, we compare the two leading programs, MHAP and HISEA, for their genome assembly performance in the Canu pipeline. DISCUSSION: Our algorithm has the best alignment detection sensitivity among all programs for SMRT data, significantly higher than the current best. 
The currently best assembler for SMRT data is the Canu program which uses the MHAP aligner in its pipeline. We have incorporated our new HISEA aligner in the Canu pipeline and benchmarked it against the best pipeline for multiple datasets at two relevant coverage levels: 30x and 50x. Our assemblies are better than those using MHAP for both coverage levels. Moreover, Canu+HISEA assemblies for 30x coverage are comparable with Canu+MHAP assemblies for 50x coverage, while being faster and cheaper. CONCLUSIONS: The HISEA algorithm produces alignments with highest sensitivity compared with the current state-of-the-art algorithms. Integrated in the Canu pipeline, currently the best for assembling PacBio data, it produces better assemblies than Canu+MHAP.
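
The comparison above reports sensitivity, precision, specificity and F1-score for detected overlaps. The sketch below shows the standard definitions of these four quantities from confusion-matrix counts. The function name `overlap_metrics`, the choice to return 0.0 for an empty denominator, and the example counts are assumptions for illustration; this is not the new evaluation method the abstract mentions.

```python
def _ratio(num, den):
    """Safe division; assumed convention is 0.0 for an empty denominator."""
    return num / den if den else 0.0


def overlap_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics for reported overlaps.

    tp: true overlaps reported, fp: reported pairs that do not overlap,
    tn: non-overlapping pairs not reported, fn: true overlaps missed.
    """
    sensitivity = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    specificity = _ratio(tn, tn + fp)
    f1 = _ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
    }


# Invented counts for illustration only.
print(overlap_metrics(tp=80, fp=20, tn=880, fn=20))
```

Because most read pairs in a dataset do not overlap, the true-negative count is typically very large, so specificity tends to be close to 1 and sensitivity and precision are usually the more discriminating numbers.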


Subjects
Algorithms , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Databases, Genetic , Sequence Alignment , Time Factors
11.
BMC Bioinformatics ; 18(1): 485, 2017 Nov 15.
Article in English | MEDLINE | ID: mdl-29141584

ABSTRACT

BACKGROUND: Proteins perform their functions usually by interacting with other proteins. Predicting which proteins interact is a fundamental problem. Experimental methods are slow, expensive, and have a high rate of error. Many computational methods have been proposed among which sequence-based ones are very promising. However, so far no such method is able to predict effectively the entire human interactome: they require too much time or memory. RESULTS: We present SPRINT (Scoring PRotein INTeractions), a new sequence-based algorithm and tool for predicting protein-protein interactions. We comprehensively compare SPRINT with state-of-the-art programs on the seven most reliable human PPI datasets and show that it is more accurate while running orders of magnitude faster and using very little memory. CONCLUSION: SPRINT is the only sequence-based program that can effectively predict the entire human interactome: it requires between 15 and 100 min, depending on the dataset. Our goal is to transform the very challenging problem of predicting the entire human interactome into a routine task. AVAILABILITY: The source code of SPRINT is freely available from https://github.com/lucian-ilie/SPRINT/ and the datasets and predicted PPIs from www.csd.uwo.ca/faculty/ilie/SPRINT/ .


Subjects
Algorithms , Protein Interaction Mapping/methods , Sequence Analysis, Protein , Humans , Proteins/metabolism , Software
12.
BMC Res Notes ; 8: 709, 2015 Nov 24.
Article in English | MEDLINE | ID: mdl-26601933

ABSTRACT

BACKGROUND: Genome assembly is a fundamental problem with multiple applications. Current technological limitations do not allow assembling of entire genomes and many programs have been designed to produce longer and more reliable contigs. Assessing the quality of these assemblies and comparing those produced by different tools is essential in choosing the best ones. The QUAST program has become the current state-of-the-art in quality assessment of genome assemblies. The only drawback of QUAST is high time and memory usage for large genomes, e.g., over 4 days and 120 GB of RAM for a single human genome assembly. RESULTS: We introduce LASER, a new tool for assembly evaluation that improves greatly the speed and memory requirements of QUAST. For a human genome assembly, LASER is 5.6 times faster than QUAST while using only half the memory; one human genome assembly is evaluated in 17 hours instead of 4 days. The code of LASER is based on that of QUAST and therefore inherits all its features. CONCLUSIONS: Genome assembly evaluation is an essential step in assessing the quality of an assembly that is too often done improperly, in part due to significant resource consumption. With the introduction of LASER, proper evaluation can be performed efficiently.


Subjects
DNA/genetics , Genome , High-Throughput Nucleotide Sequencing
13.
Brief Bioinform ; 16(4): 588-99, 2015 Jul.
Article in English | MEDLINE | ID: mdl-25183248

ABSTRACT

Next-generation sequencing technologies revolutionized the ways in which genetic information is obtained and have opened the door for many essential applications in biomedical sciences. Hundreds of gigabytes of data are being produced, and all applications are affected by the errors in the data. Many programs have been designed to correct these errors, most of them targeting the data produced by the dominant technology of Illumina. We present a thorough comparison of these programs. Both HiSeq and MiSeq types of Illumina data are analyzed, and correcting performance is evaluated as the gain in depth and breadth of coverage, as given by correct reads and k-mers. Time and memory requirements, scalability and parallelism are considered as well. Practical guidelines are provided for the effective use of these tools. We also evaluate the efficiency of the current state-of-the-art programs for correcting Illumina data and provide research directions for further improvement.


Subjects
Data Interpretation, Statistical , Sequence Analysis, DNA/standards
14.
Bioinformatics ; 31(4): 509-14, 2015 Feb 15.
Article in English | MEDLINE | ID: mdl-25399029

ABSTRACT

MOTIVATION: Alignment of similar whole genomes is often performed using anchors given by the maximal exact matches (MEMs) between their sequences. In spite of a significant amount of research on this problem, the computation of MEMs for large genomes remains a challenging problem. The leading current algorithms employ full text indexes, the sparse suffix array giving the best results. Still, their memory requirements are high, the parallelization is not very efficient, and they cannot handle very large genomes. RESULTS: We present a new algorithm, efficient computation of MEMs (E-MEM), that does not use full text indexes. Our algorithm uses much less space and is highly amenable to parallelization. It can compute all MEMs of minimum length 100 between the whole human and mouse genomes on a 12 core machine in 10 min and 2 GB of memory; the required memory can be as low as 600 MB. It can run efficiently on genomes of any size. Extensive testing and comparison with currently best algorithms is provided. AVAILABILITY AND IMPLEMENTATION: The source code of E-MEM is freely available at: http://www.csd.uwo.ca/~ilie/E-MEM/. CONTACT: ilie@csd.uwo.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
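
To make the notion of a maximal exact match concrete, the sketch below enumerates all MEMs of at least a given length between two short strings by brute force: a match is kept only if it cannot be extended to the left or to the right. The function name `maximal_exact_matches` and the toy sequences are hypothetical. This quadratic approach is for illustration only and is not the E-MEM algorithm, which is designed to scale to whole genomes.

```python
def maximal_exact_matches(s, t, min_len):
    """Return (i, j, length) for every MEM with s[i:i+length] == t[j:j+length].

    A match is maximal if extending it by one character on either side
    would introduce a mismatch or run off the end of a sequence.
    """
    mems = []
    for i in range(len(s)):
        for j in range(len(t)):
            if s[i] != t[j]:
                continue
            # Skip starts that could be extended to the left.
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while i + k < len(s) and j + k < len(t) and s[i + k] == t[j + k]:
                k += 1
            if k >= min_len:
                mems.append((i, j, k))
    return mems


# Toy sequences invented for illustration.
print(maximal_exact_matches("ACGTACGT", "TTACGTAA", 4))
```

Each reported triple can serve as an anchor in the sense used above: a pair of positions in the two sequences known to match exactly over the stated length.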


Subjects
Algorithms , Genome , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Systems Biology/methods , Animals , Databases, Factual , Humans , Mice , Programming Languages , Species Specificity , Triticum/genetics
15.
BMC Bioinformatics ; 15: 302, 2014 Sep 15.
Article in English | MEDLINE | ID: mdl-25225118

ABSTRACT

BACKGROUND: De novo genome assembly of next-generation sequencing data is one of the most important current problems in bioinformatics, essential in many biological applications. In spite of a significant amount of work in this area, better solutions are still very much needed. RESULTS: We present a new program, SAGE, for de novo genome assembly. As opposed to most assemblers, which are de Bruijn graph based, SAGE uses the string-overlap graph. SAGE builds upon great existing work on string-overlap graphs and maximum likelihood assembly, bringing an important number of new ideas, such as the efficient computation of the transitive reduction of the string-overlap graph, the use of (generalized) edge multiplicity statistics for more accurate estimation of read copy counts, and the improved use of mate pairs and min-cost flow for supporting edge merging. The assemblies produced by SAGE for several short and medium-size genomes compared favourably with those of existing leading assemblers. CONCLUSIONS: SAGE benefits from innovations in almost every aspect of the assembly process: error correction of input reads, string-overlap graph construction, read copy counts estimation, overlap graph analysis and reduction, contig extraction, and scaffolding. We hope that these new ideas will help advance the current state-of-the-art in an essential area of research in genomics.


Subjects
Algorithms , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Computer Graphics , Genome Size
16.
Bioinformatics ; 29(19): 2490-3, 2013 Oct 01.
Article in English | MEDLINE | ID: mdl-23853064

ABSTRACT

MOTIVATION: High-throughput next-generation sequencing technologies enable increasingly fast and affordable sequencing of genomes and transcriptomes, with a broad range of applications. The quality of the sequencing data is crucial for all applications. A significant portion of the data produced contains errors, and ever more efficient error correction programs are needed. RESULTS: We propose RACER (Rapid and Accurate Correction of Errors in Reads), a new software program for correcting errors in sequencing data. RACER has better error-correcting performance than existing programs, is faster and requires less memory. To support our claims, we performed extensive comparison with the existing leading programs on a variety of real datasets. AVAILABILITY: RACER is freely available for non-commercial use at www.csd.uwo.ca/~ilie/RACER/.


Subjects
High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Animals , Genome , Software , Time Factors
17.
BMC Bioinformatics ; 14: 69, 2013 Feb 27.
Article in English | MEDLINE | ID: mdl-23444904

ABSTRACT

BACKGROUND: DNA microarrays have become ubiquitous in biological and medical research. The most difficult problem that needs to be solved is the design of DNA oligonucleotides that (i) are highly specific, that is, bind only to the intended target, (ii) cover the highest possible number of genes, that is, all genes that allow such unique regions, and (iii) are computed fast. None of the existing programs meet all these criteria. RESULTS: We introduce a new approach with our software program BOND (Basic OligoNucleotide Design). According to Kane's criteria for oligo design, BOND computes highly specific DNA oligonucleotides, for all the genes that admit unique probes, while running orders of magnitude faster than the existing programs. The same approach enables us to introduce also an evaluation procedure that correctly measures the quality of the oligonucleotides. Extensive comparison is performed to prove our claims. BOND is flexible, easy to use, requires no additional software, and is freely available for non-commercial use from http://www.csd.uwo.ca/~ilie/BOND/. CONCLUSIONS: We provide an improved solution to the important problem of oligonucleotide design, including a thorough evaluation of oligo design programs. We hope BOND will become a useful tool for researchers in biological and medical sciences by making the microarray procedures faster and more accurate.


Subjects
Oligonucleotides/chemistry , Software , Algorithms , Genes , Oligonucleotide Array Sequence Analysis/methods , Oligonucleotide Probes/chemistry , Oligonucleotide Probes/genetics , Oligonucleotides/genetics
18.
BMC Genomics ; 12: 280, 2011 Jun 01.
Article in English | MEDLINE | ID: mdl-21627845

ABSTRACT

BACKGROUND: DNA oligonucleotides are a very useful tool in biology. The best algorithms for designing good DNA oligonucleotides filter out unsuitable regions using a seeding approach. Determining the quality of the seeds is crucial for the performance of these algorithms. RESULTS: We present a sound framework for evaluating the quality of seeds for oligonucleotide design. The F-score is used to measure the accuracy of each seed. A number of natural candidates are tested: contiguous (BLAST-like), spaced, transitions-constrained, and multiple spaced seeds. Multiple spaced seeds are the best, with more seeds providing better accuracy. Single spaced and transition seeds are very close whereas, as expected, contiguous seeds come last. Increased accuracy comes at the price of reduced efficiency. An exception is that single spaced and transitions-constrained seeds are both more accurate and more efficient than contiguous ones. CONCLUSIONS: Our work confirms another application where multiple spaced seeds perform the best. It will be useful in improving the algorithms for oligonucleotide design.


Subjects
Nucleic Acid Amplification Techniques/methods , Oligonucleotides/genetics , DNA/genetics
19.
Bioinformatics ; 27(17): 2433-4, 2011 Sep 01.
Article in English | MEDLINE | ID: mdl-21690104

ABSTRACT

SUMMARY: Multiple spaced seeds represent the current state-of-the-art for similarity search in bioinformatics, with applications in various areas such as sequence alignment, read mapping, oligonucleotide design, etc. We present SpEED, a software program that computes highly sensitive multiple spaced seeds. SpEED can be several orders of magnitude faster and computes better seeds than the existing leading software programs. AVAILABILITY: The source code of SpEED is freely available at www.csd.uwo.ca/~ilie/SpEED/ CONTACT: ilie@csd.uwo.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology/methods , Software , Algorithms , Sequence Alignment
20.
Bioinformatics ; 27(7): 1011-2, 2011 Apr 01.
Article in English | MEDLINE | ID: mdl-21278192

ABSTRACT

UNLABELLED: We report on a major update (version 2) of the original SHort Read Mapping Program (SHRiMP). SHRiMP2 primarily targets mapping sensitivity, and is able to achieve high accuracy at a very reasonable speed. SHRiMP2 supports both letter space and color space (AB/SOLiD) reads, enables direct alignment of paired reads and uses parallel computation to fully utilize multi-core architectures. AVAILABILITY: SHRiMP2 executables and source code are freely available at: http://compbio.cs.toronto.edu/shrimp/.


Subjects
Chromosome Mapping , Genomics/methods , Sequence Analysis, DNA , Software , Algorithms , Polymorphism, Genetic , Sequence Alignment