Results 1 - 20 of 23
1.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38695119

ABSTRACT

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.


Subject(s)
Algorithms , Computational Biology , Sequence Alignment , Sequence Alignment/methods , Computational Biology/methods , Software , Sequence Analysis, Protein/methods , Amino Acid Sequence , Proteins/chemistry , Proteins/genetics , Deep Learning , Databases, Protein
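
The abstract above defines the E-score of two residues as the cosine similarity between their embedding vectors. A minimal sketch of that definition follows, assuming the embeddings are already available as plain lists of floats; the function name `e_score` and the toy 3-dimensional vectors are illustrative only and are not taken from the authors' code.

```python
import math

def e_score(u, v):
    """Cosine similarity between two residue embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy vectors standing in for real contextual embeddings (hypothetical values).
residue_a = [1.0, 2.0, 0.0]
residue_b = [2.0, 4.0, 0.0]  # same direction as residue_a
residue_c = [0.0, 0.0, 3.0]  # orthogonal to residue_a

print(e_score(residue_a, residue_b))
print(e_score(residue_a, residue_c))
```

Because the score depends only on the angle between the two vectors, the same amino acid can receive different scores in different sequence contexts, which is the property the abstract contrasts with fixed matrices such as BLOSUM.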
2.
Bioinformatics ; 40(1)2024 01 02.
Article in English | MEDLINE | ID: mdl-38212995

ABSTRACT

MOTIVATION: Proteins accomplish cellular functions by interacting with each other, which makes the prediction of interaction sites a fundamental problem. As experimental methods are expensive and time consuming, computational prediction of the interaction sites has been studied extensively. Structure-based programs are the most accurate, while the sequence-based ones are much more widely applicable, as the sequences available outnumber the structures by two orders of magnitude. Ideally, we would like a tool that has the quality of the former and the applicability of the latter. RESULTS: We provide here the first solution that achieves these two goals. Our new sequence-based program, Seq-InSite, greatly surpasses the performance of sequence-based models, matching the quality of state-of-the-art structure-based predictors, thus effectively superseding the need for models requiring structure. The predictive power of Seq-InSite is illustrated using an analysis of evolutionary conservation for four protein sequences. AVAILABILITY AND IMPLEMENTATION: Seq-InSite is freely available as a web server at http://seq-insite.csd.uwo.ca/ and as free source code, including trained models and all datasets used for training and testing, at https://github.com/lucian-ilie/Seq-InSite.


Subject(s)
Proteins , Software , Proteins/chemistry , Amino Acid Sequence
3.
Bioinformatics ; 37(9): 1206-1210, 2021 06 09.
Article in English | MEDLINE | ID: mdl-34107042

ABSTRACT

MOTIVATION: Sequence similarity is the most frequently used procedure in biological research, as proved by the widely used BLAST program. The consecutive seed used by BLAST can be dramatically improved by considering multiple spaced seeds. Finding the best seeds is a hard problem and much effort went into developing heuristic algorithms and software for designing highly sensitive spaced seeds. RESULTS: We introduce a new algorithm and software, ALeS, that produces more sensitive seeds than the current state-of-the-art programs, as shown by extensive testing. We also accurately estimate the sensitivity of a seed, enabling its computation for arbitrary seeds. AVAILABILITY AND IMPLEMENTATION: The source code is freely available at github.com/lucian-ilie/ALeS. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Software , Research Design
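
The difference between a consecutive seed and a spaced seed, mentioned in the abstract above, can be illustrated with a small sketch. It assumes the common notation in which a seed is a string of `1` (position must match) and `0` (position is ignored), and it only checks where a seed hits an ungapped alignment of two equal-length sequences. The function name `seed_hits` and the example sequences are hypothetical; this is not the seed-design algorithm of ALeS.

```python
def seed_hits(seed, s, t):
    """Offsets at which the seed hits the ungapped alignment of s and t."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    if not seed or set(seed) - {"0", "1"}:
        raise ValueError("seed must be a non-empty string of 0s and 1s")
    # Positions inside the seed window that are required to match.
    care = [i for i, c in enumerate(seed) if c == "1"]
    hits = []
    for offset in range(len(s) - len(seed) + 1):
        if all(s[offset + i] == t[offset + i] for i in care):
            hits.append(offset)
    return hits

s = "ACGTACGT"
t = "ACTTACGA"  # differs from s at positions 2 and 7

print(seed_hits("111", s, t))   # consecutive seed of weight 3
print(seed_hits("1101", s, t))  # spaced seed of weight 3
```

Both seeds require three matching positions, but the spaced seed also hits at offset 0 because its ignored position absorbs the mismatch at position 2.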
4.
Bioinformatics ; 37(7): 896-904, 2021 05 17.
Article in English | MEDLINE | ID: mdl-32840562

ABSTRACT

MOTIVATION: Proteins usually perform their functions by interacting with other proteins, which is why accurately predicting protein-protein interaction (PPI) binding sites is a fundamental problem. Experimental methods are slow and expensive. Therefore, great efforts are being made towards increasing the performance of computational methods. RESULTS: We propose DEep Learning Prediction of Highly probable protein Interaction sites (DELPHI), a new sequence-based deep learning suite for PPI-binding sites prediction. DELPHI has an ensemble structure which combines a CNN and an RNN component with a fine-tuning technique. Three novel features, HSP, position information and ProtVec, are used in addition to nine existing ones. We comprehensively compare DELPHI to nine state-of-the-art programmes on five datasets, and DELPHI outperforms the competing methods in all metrics even though its training dataset shares the least similarities with the testing datasets. In the most important metrics, AUPRC and MCC, it surpasses the second best programmes by as much as 18.5% and 27.7%, respectively. We also demonstrated that the improvement is essentially due to using the ensemble model and, especially, the three new features. Using DELPHI it is shown that there is a strong correlation between protein-binding residues (PBRs) and sites with strong evolutionary conservation. In addition, DELPHI's predicted PBR sites closely match known data from Pfam. DELPHI is available as open-sourced standalone software and web server. AVAILABILITY AND IMPLEMENTATION: The DELPHI web server can be found at delphi.csd.uwo.ca/, with all datasets and results in this study. The trained models, the DELPHI standalone source code, and the feature computation pipeline are freely available at github.com/lucian-ilie/DELPHI. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Proteins , Software , Binding Sites , Computational Biology , Protein Binding , Proteins/metabolism , Research Design
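
One of the two metrics the abstract above singles out, MCC (Matthews correlation coefficient), has a standard closed form in terms of the confusion-matrix counts. The sketch below implements that textbook formula; the function name `mcc`, the convention of returning 0.0 when the denominator is zero, and the example counts are assumptions made for illustration and are not drawn from the DELPHI evaluation.

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: report 0.0 when any marginal count is zero.
        return 0.0
    return numerator / denominator

# Hypothetical counts for a binding-site predictor on 20 residues.
print(mcc(tp=6, tn=9, fp=2, fn=3))
```

MCC ranges from -1 to 1 and uses all four counts, which is one reason it is often preferred for problems such as binding-site prediction where the positive class is rare.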
5.
Int J Mol Sci ; 23(21)2022 Oct 24.
Article in English | MEDLINE | ID: mdl-36361606

ABSTRACT

Cellular functions are governed by proteins, and, while some proteins work independently, most work by interacting with other proteins. As a result, it is crucially important to know the interaction sites that facilitate the interactions between the proteins. Since the experimental methods are costly and time consuming, it is essential to develop effective computational methods. We present PITHIA, a sequence-based deep learning model for protein interaction site prediction that exploits the combination of multiple sequence alignments and learning attention. We demonstrate that our new model clearly outperforms the state-of-the-art models on a wide range of metrics. In order to provide meaningful comparison, we update existing test datasets with new information regarding interaction sites, as well as introduce an additional new testing dataset which resolves the shortcomings of the existing ones.


Subject(s)
Attention , Proteins , Sequence Alignment , Computational Biology/methods
6.
Bioinformatics ; 34(4): 678-680, 2018 02 15.
Article in English | MEDLINE | ID: mdl-29045591

ABSTRACT

Summary: De novo genome assembly of next-generation sequencing data is a fundamental problem in bioinformatics. There are many programs that assemble small genomes, but very few can assemble whole human genomes. We present a new algorithm for parallel overlap graph construction, which is capable of assembling human genomes and improves upon the current state-of-the-art in genome assembly. Availability and implementation: SAGE2 is written in C++ and OpenMP and is freely available (under the GPL 3.0 license) at github.com/lucian-ilie/SAGE2. Contact: ilie@uwo.ca. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome, Human , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Algorithms , Animals , Humans
7.
BMC Bioinformatics ; 18(1): 485, 2017 Nov 15.
Article in English | MEDLINE | ID: mdl-29141584

ABSTRACT

BACKGROUND: Proteins perform their functions usually by interacting with other proteins. Predicting which proteins interact is a fundamental problem. Experimental methods are slow, expensive, and have a high rate of error. Many computational methods have been proposed among which sequence-based ones are very promising. However, so far no such method is able to predict effectively the entire human interactome: they require too much time or memory. RESULTS: We present SPRINT (Scoring PRotein INTeractions), a new sequence-based algorithm and tool for predicting protein-protein interactions. We comprehensively compare SPRINT with state-of-the-art programs on seven most reliable human PPI datasets and show that it is more accurate while running orders of magnitude faster and using very little memory. CONCLUSION: SPRINT is the only sequence-based program that can effectively predict the entire human interactome: it requires between 15 and 100 min, depending on the dataset. Our goal is to transform the very challenging problem of predicting the entire human interactome into a routine task. AVAILABILITY: The source code of SPRINT is freely available from https://github.com/lucian-ilie/SPRINT/ and the datasets and predicted PPIs from www.csd.uwo.ca/faculty/ilie/SPRINT/ .


Subject(s)
Algorithms , Protein Interaction Mapping/methods , Sequence Analysis, Protein , Humans , Proteins/metabolism , Software
8.
BMC Bioinformatics ; 18(1): 564, 2017 Dec 19.
Article in English | MEDLINE | ID: mdl-29258419

ABSTRACT

BACKGROUND: The next generation sequencing (NGS) techniques have been around for over a decade. Many of their fundamental applications rely on the ability to compute good genome assemblies. As the technology evolves, the assembly algorithms and tools have to continuously adjust and improve. The currently dominant technology of Illumina produces reads that are too short to bridge many repeats, setting limits on what can be successfully assembled. The emerging SMRT (Single Molecule, Real-Time) sequencing technique from Pacific Biosciences produces uniform coverage and long reads of length up to sixty thousand base pairs, enabling significantly better genome assemblies. However, SMRT reads are much more expensive and have a much higher error rate than Illumina's - around 10-15% - mostly due to indels. New algorithms are very much needed to take advantage of the long reads while mitigating the effect of high error rate and lowering the required coverage. METHODS: An essential step in assembling SMRT data is the detection of alignments, or overlaps, between reads. High error rate and very long reads make this a much more challenging problem than for Illumina data. We present a new pairwise read aligner, or overlapper, HISEA (Hierarchical SEed Aligner) for SMRT sequencing data. HISEA uses a novel two-step k-mer search, employing consistent clustering, k-mer filtering, and read alignment extension. RESULTS: We compare HISEA against several state-of-the-art programs - BLASR, DALIGNER, GraphMap, MHAP, and Minimap - on real datasets from five organisms. We compare their sensitivity, precision, specificity, F1-score, as well as time and memory usage. We also introduce a new, more precise, evaluation method. Finally, we compare the two leading programs, MHAP and HISEA, for their genome assembly performance in the Canu pipeline. DISCUSSION: Our algorithm has the best alignment detection sensitivity among all programs for SMRT data, significantly higher than the current best. 
The currently best assembler for SMRT data is the Canu program which uses the MHAP aligner in its pipeline. We have incorporated our new HISEA aligner in the Canu pipeline and benchmarked it against the best pipeline for multiple datasets at two relevant coverage levels: 30x and 50x. Our assemblies are better than those using MHAP for both coverage levels. Moreover, Canu+HISEA assemblies for 30x coverage are comparable with Canu+MHAP assemblies for 50x coverage, while being faster and cheaper. CONCLUSIONS: The HISEA algorithm produces alignments with highest sensitivity compared with the current state-of-the-art algorithms. Integrated in the Canu pipeline, currently the best for assembling PacBio data, it produces better assemblies than Canu+MHAP.


Subject(s)
Algorithms , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Databases, Genetic , Sequence Alignment , Time Factors
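
The abstract above describes overlap detection that starts from a k-mer search between reads. A generic sketch of that first step follows: index the k-mers of one read, look up the k-mers of another, and count shared k-mers per diagonal (position in the first read minus position in the second). It is not the two-step search, clustering, or filtering used by HISEA; the names `kmer_index`, `shared_kmers`, and `best_diagonal` and the toy reads are hypothetical.

```python
from collections import Counter, defaultdict

def kmer_index(seq, k):
    """Map each k-mer of seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def shared_kmers(a, b, k):
    """(pos_in_a, pos_in_b) for every k-mer occurrence shared by a and b."""
    index = kmer_index(a, k)
    return [(i, j)
            for j in range(len(b) - k + 1)
            for i in index.get(b[j:j + k], [])]

def best_diagonal(a, b, k):
    """Diagonal (i - j) supported by the most shared k-mers, with its count."""
    diagonals = Counter(i - j for i, j in shared_kmers(a, b, k))
    return diagonals.most_common(1)[0] if diagonals else None

read_a = "ACGTACGT"
read_b = "TACGTT"

print(shared_kmers(read_a, read_b, 4))
print(best_diagonal(read_a, read_b, 4))
```

Several k-mer matches falling on one diagonal suggest a candidate overlap at that offset, which an aligner would then verify and extend; with error rates of 10-15%, real tools must also tolerate matches drifting across neighbouring diagonals because of indels.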
9.
Brief Bioinform ; 16(4): 588-99, 2015 Jul.
Article in English | MEDLINE | ID: mdl-25183248

ABSTRACT

Next-generation sequencing technologies revolutionized the ways in which genetic information is obtained and have opened the door for many essential applications in biomedical sciences. Hundreds of gigabytes of data are being produced, and all applications are affected by the errors in the data. Many programs have been designed to correct these errors, most of them targeting the data produced by the dominant technology of Illumina. We present a thorough comparison of these programs. Both HiSeq and MiSeq types of Illumina data are analyzed, and correcting performance is evaluated as the gain in depth and breadth of coverage, as given by correct reads and k-mers. Time and memory requirements, scalability and parallelism are considered as well. Practical guidelines are provided for the effective use of these tools. We also evaluate the efficiency of the current state-of-the-art programs for correcting Illumina data and provide research directions for further improvement.


Subject(s)
Data Interpretation, Statistical , Sequence Analysis, DNA/standards
10.
Bioinformatics ; 31(4): 509-14, 2015 Feb 15.
Article in English | MEDLINE | ID: mdl-25399029

ABSTRACT

MOTIVATION: Alignment of similar whole genomes is often performed using anchors given by the maximal exact matches (MEMs) between their sequences. In spite of a significant amount of research on this problem, the computation of MEMs for large genomes remains a challenging problem. The leading current algorithms employ full text indexes, the sparse suffix array giving the best results. Still, their memory requirements are high, the parallelization is not very efficient, and they cannot handle very large genomes. RESULTS: We present a new algorithm, efficient computation of MEMs (E-MEM) that does not use full text indexes. Our algorithm uses much less space and is highly amenable to parallelization. It can compute all MEMs of minimum length 100 between the whole human and mouse genomes on a 12 core machine in 10 min and 2 GB of memory; the required memory can be as low as 600 MB. It can efficiently handle genomes of any size. Extensive testing and comparison with currently best algorithms is provided. AVAILABILITY AND IMPLEMENTATION: The source code of E-MEM is freely available at: http://www.csd.uwo.ca/~ilie/E-MEM/ CONTACT: ilie@csd.uwo.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Genome , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Software , Systems Biology/methods , Animals , Databases, Factual , Humans , Mice , Programming Languages , Species Specificity , Triticum/genetics
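
To make the notion of a maximal exact match concrete, the sketch below enumerates MEMs by brute force: a match is reported only if it cannot be extended to the left or to the right. This quadratic-time illustration of the definition is not the E-MEM algorithm, which the abstract describes only as avoiding full text indexes; the function name `naive_mems` and the example strings are hypothetical.

```python
def naive_mems(s, t, min_len):
    """All maximal exact matches (start_in_s, start_in_t, length) of length >= min_len."""
    mems = []
    for i in range(len(s)):
        for j in range(len(t)):
            if s[i] != t[j]:
                continue
            # Skip starts that can be extended to the left: not left-maximal.
            if i > 0 and j > 0 and s[i - 1] == t[j - 1]:
                continue
            # Extend to the right until a mismatch or the end of a sequence.
            k = 0
            while i + k < len(s) and j + k < len(t) and s[i + k] == t[j + k]:
                k += 1
            if k >= min_len:
                mems.append((i, j, k))
    return mems

print(naive_mems("ACGTTGCA", "TTACGTAG", 3))
```

The nested loops cost time proportional to the product of the sequence lengths, which is why whole-genome MEM computation needs the indexing or hashing strategies the abstract refers to.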
11.
BMC Bioinformatics ; 15: 302, 2014 Sep 15.
Article in English | MEDLINE | ID: mdl-25225118

ABSTRACT

BACKGROUND: De novo genome assembly of next-generation sequencing data is one of the most important current problems in bioinformatics, essential in many biological applications. In spite of significant amount of work in this area, better solutions are still very much needed. RESULTS: We present a new program, SAGE, for de novo genome assembly. As opposed to most assemblers, which are de Bruijn graph based, SAGE uses the string-overlap graph. SAGE builds upon great existing work on string-overlap graph and maximum likelihood assembly, bringing an important number of new ideas, such as the efficient computation of the transitive reduction of the string overlap graph, the use of (generalized) edge multiplicity statistics for more accurate estimation of read copy counts, and the improved use of mate pairs and min-cost flow for supporting edge merging. The assemblies produced by SAGE for several short and medium-size genomes compared favourably with those of existing leading assemblers. CONCLUSIONS: SAGE benefits from innovations in almost every aspect of the assembly process: error correction of input reads, string-overlap graph construction, read copy counts estimation, overlap graph analysis and reduction, contig extraction, and scaffolding. We hope that these new ideas will help advance the current state-of-the-art in an essential area of research in genomics.


Subject(s)
Algorithms , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Computer Graphics , Genome Size
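
The string-overlap graph and its transitive reduction, both named in the abstract above, can be sketched on a toy scale. Nodes are reads, an edge from read i to read j is labelled with the longest suffix of i that equals a prefix of j, and an edge is dropped when a two-step path connects the same pair. This simplified rule ignores overlap-length consistency, reverse complements, and sequencing errors, and it is not the efficient procedure of SAGE; all function names and the example reads are hypothetical.

```python
from itertools import permutations

def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0 if shorter than min_len."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def overlap_graph(reads, min_len):
    """Edges (i, j) -> overlap length for every ordered pair of distinct reads."""
    edges = {}
    for i, j in permutations(range(len(reads)), 2):
        k = overlap(reads[i], reads[j], min_len)
        if k:
            edges[(i, j)] = k
    return edges

def drop_transitive_edges(edges):
    """Remove a -> c whenever some b gives a -> b and b -> c (simplified rule)."""
    nodes = {n for edge in edges for n in edge}
    return {(a, c): k for (a, c), k in edges.items()
            if not any((a, b) in edges and (b, c) in edges for b in nodes)}

reads = ["ACGTAC", "GTACGG", "ACGGTT"]
graph = overlap_graph(reads, min_len=2)
print(graph)
print(drop_transitive_edges(graph))
```

In this example the short overlap between the first and third reads is implied by the path through the second read, so removing it leaves a simple chain that spells out the assembled sequence.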
12.
Bioinformatics ; 29(19): 2490-3, 2013 Oct 01.
Article in English | MEDLINE | ID: mdl-23853064

ABSTRACT

MOTIVATION: High-throughput next-generation sequencing technologies enable increasingly fast and affordable sequencing of genomes and transcriptomes, with a broad range of applications. The quality of the sequencing data is crucial for all applications. A significant portion of the data produced contains errors, and ever more efficient error correction programs are needed. RESULTS: We propose RACER (Rapid and Accurate Correction of Errors in Reads), a new software program for correcting errors in sequencing data. RACER has better error-correcting performance than existing programs, is faster and requires less memory. To support our claims, we performed extensive comparison with the existing leading programs on a variety of real datasets. AVAILABILITY: RACER is freely available for non-commercial use at www.csd.uwo.ca/~ilie/RACER/.


Subject(s)
High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Animals , Genome , Software , Time Factors
13.
BMC Bioinformatics ; 14: 69, 2013 Feb 27.
Article in English | MEDLINE | ID: mdl-23444904

ABSTRACT

BACKGROUND: DNA microarrays have become ubiquitous in biological and medical research. The most difficult problem that needs to be solved is the design of DNA oligonucleotides that (i) are highly specific, that is, bind only to the intended target, (ii) cover the highest possible number of genes, that is, all genes that allow such unique regions, and (iii) are computed fast. None of the existing programs meet all these criteria. RESULTS: We introduce a new approach with our software program BOND (Basic OligoNucleotide Design). According to Kane's criteria for oligo design, BOND computes highly specific DNA oligonucleotides, for all the genes that admit unique probes, while running orders of magnitude faster than the existing programs. The same approach enables us to introduce also an evaluation procedure that correctly measures the quality of the oligonucleotides. Extensive comparison is performed to prove our claims. BOND is flexible, easy to use, requires no additional software, and is freely available for non-commercial use from http://www.csd.uwo.ca/~ilie/BOND/. CONCLUSIONS: We provide an improved solution to the important problem of oligonucleotide design, including a thorough evaluation of oligo design programs. We hope BOND will become a useful tool for researchers in biological and medical sciences by making the microarray procedures faster and more accurate.


Subject(s)
Oligonucleotides/chemistry , Software , Algorithms , Genes , Oligonucleotide Array Sequence Analysis/methods , Oligonucleotide Probes/chemistry , Oligonucleotide Probes/genetics , Oligonucleotides/genetics
14.
Methods Mol Biol ; 2690: 375-383, 2023.
Article in English | MEDLINE | ID: mdl-37450160

ABSTRACT

Several proteins work independently, but the majority work together to maintain the functions of the cell. Thus, it is crucial to know the interaction sites that facilitate protein-protein interactions. The development of effective computational methods is essential because experimental methods are expensive and time-consuming. This chapter is a guide to predicting protein interaction sites using the program "PITHIA." First, some installation guides are presented, followed by descriptions of input file formats. Afterward, PITHIA's commands and options are outlined with examples. Moreover, some notes are provided on how to extend PITHIA's installation and usage.


Subject(s)
Computational Biology , Proteins , Binding Sites , Protein Binding , Proteins/metabolism
15.
Bioinformatics ; 27(17): 2433-4, 2011 Sep 01.
Article in English | MEDLINE | ID: mdl-21690104

ABSTRACT

SUMMARY: Multiple spaced seeds represent the current state-of-the-art for similarity search in bioinformatics, with applications in various areas such as sequence alignment, read mapping, oligonucleotide design, etc. We present SpEED, a software program that computes highly sensitive multiple spaced seeds. SpEED can be several orders of magnitude faster and computes better seeds than the existing leading software programs. AVAILABILITY: The source code of SpEED is freely available at www.csd.uwo.ca/~ilie/SpEED/ CONTACT: ilie@csd.uwo.ca SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Computational Biology/methods , Software , Algorithms , Sequence Alignment
16.
Bioinformatics ; 27(3): 295-302, 2011 Feb 01.
Article in English | MEDLINE | ID: mdl-21115437

ABSTRACT

MOTIVATION: High-throughput sequencing technologies produce very large amounts of data and sequencing errors constitute one of the major problems in analyzing such data. Current algorithms for correcting these errors are not very accurate and do not automatically adapt to the given data. RESULTS: We present HiTEC, an algorithm that provides a highly accurate, robust and fully automated method to correct reads produced by high-throughput sequencing methods. Our approach provides significantly higher accuracy than previous methods. It is time and space efficient and works very well for all read lengths, genome sizes and coverage levels. AVAILABILITY: The source code of HiTEC is freely available at www.csd.uwo.ca/~ilie/HiTEC/.


Subject(s)
Algorithms , Sequence Analysis, DNA/methods , Genome , Models, Genetic , Reproducibility of Results , Software
17.
Bioinformatics ; 27(7): 1011-2, 2011 Apr 01.
Article in English | MEDLINE | ID: mdl-21278192

ABSTRACT

We report on a major update (version 2) of the original SHort Read Mapping Program (SHRiMP). SHRiMP2 primarily targets mapping sensitivity, and is able to achieve high accuracy at a very reasonable speed. SHRiMP2 supports both letter space and color space (AB/SOLiD) reads, enables direct alignment of paired reads and uses parallel computation to fully utilize multi-core architectures. AVAILABILITY: SHRiMP2 executables and source code are freely available at: http://compbio.cs.toronto.edu/shrimp/.


Subject(s)
Chromosome Mapping , Genomics/methods , Sequence Analysis, DNA , Software , Algorithms , Polymorphism, Genetic , Sequence Alignment
18.
Foods ; 11(22)2022 Nov 16.
Article in English | MEDLINE | ID: mdl-36429263

ABSTRACT

The preservation of food supplies has been humankind's priority since ancient times, and it is arguably more relevant today than ever before. Food sustainability and safety have been heavily prioritized by consumers, producers, and government entities alike. In this regard, filamentous fungi have always been a health hazard due to their contamination of the food substrate with mycotoxins. Additionally, mycotoxins are proven resilient to technological processing. This study aims to identify the main mycotoxins that may occur in the meat and meat products "Farm to Fork" chain, along with their effect on the consumers' health, and also to identify effective methods of prevention through the use of essential oils (EO). At the same time, the antifungal and antimycotoxigenic potential of essential oils was considered in order to provide an overview of the subject. Targeting the main ways of meat products' contamination, the use of essential oils with proven in vitro or in situ efficacy against certain fungal species can be an effective alternative if all the associated challenges are addressed (e.g., application methods, suitability for certain products, toxicity).

19.
BMC Genomics ; 12: 280, 2011 Jun 01.
Article in English | MEDLINE | ID: mdl-21627845

ABSTRACT

BACKGROUND: DNA oligonucleotides are a very useful tool in biology. The best algorithms for designing good DNA oligonucleotides are filtering out unsuitable regions using a seeding approach. Determining the quality of the seeds is crucial for the performance of these algorithms. RESULTS: We present a sound framework for evaluating the quality of seeds for oligonucleotide design. The F-score is used to measure the accuracy of each seed. A number of natural candidates are tested: contiguous (BLAST-like), spaced, transitions-constrained, and multiple spaced seeds. Multiple spaced seeds are the best, with more seeds providing better accuracy. Single spaced and transition seeds are very close whereas, as expected, contiguous seeds come last. Increased accuracy comes at the price of reduced efficiency. An exception is that single spaced and transitions-constrained seeds are both more accurate and more efficient than contiguous ones. CONCLUSIONS: Our work confirms another application where multiple spaced seeds perform the best. It will be useful in improving the algorithms for oligonucleotide design.


Subject(s)
Nucleic Acid Amplification Techniques/methods , Oligonucleotides/genetics , DNA/genetics
20.
Bioinformatics ; 25(6): 822-3, 2009 Mar 15.
Article in English | MEDLINE | ID: mdl-19176560

ABSTRACT

MOTIVATION: Alignment of biological sequences is one of the most frequently performed computer tasks. The current state of the art involves the use of (multiple) spaced seeds for producing high quality alignments. A particularly important class is that of neighbor seeds which combine high sensitivity with reduced space requirements. Current algorithms for computing good neighbor seeds are very slow (exponential). RESULTS: We give a polynomial-time heuristic algorithm that computes better neighbor seeds than previous ones while being several orders of magnitude faster.


Subject(s)
Algorithms , Computational Biology/methods , Sequence Alignment/methods