1.
Toxicol Sci ; 2024 Sep 10.
Article in English | MEDLINE | ID: mdl-39254655

ABSTRACT

Peptides have emerged as promising therapeutic agents. However, their potential is hindered by hemotoxicity. Understanding the hemotoxicity of peptides is crucial for developing safe and effective peptide-based therapeutics. Here, we employed chemical space complex networks (CSNs) to unravel the hemotoxicity tapestry of peptides. CSNs are powerful tools for visualizing and analyzing the relationships between peptides based on their physicochemical properties and structural features. We constructed CSNs from the StarPepDB database, encompassing 2004 hemolytic peptides, and explored the impact of seven different (dis)similarity measures on network topology and cluster (communities) distribution. Our findings revealed that each CSN extracts orthogonal information, enhancing the motif discovery and enrichment process. We identified 12 consensus hemolytic motifs, whose amino acid composition unveiled a high abundance of lysine, leucine, and valine residues, while aspartic acid, methionine, histidine, asparagine and glutamine were depleted. Additionally, physicochemical properties were used to characterize clusters/communities of hemolytic peptides. To predict hemolytic activity directly from peptide sequences, we constructed multi-query similarity searching models (MQSSMs), which outperformed cutting-edge machine learning (ML)-based models, demonstrating robust hemotoxicity prediction capabilities. Overall, this novel in silico approach uses complex network science as its central strategy to develop robust model classifiers, to characterize the chemical space and to discover new motifs from hemolytic peptides. This will help to enhance the design/selection of peptides with potential therapeutic activity and low toxicity.

2.
Front Bioinform ; 4: 1358374, 2024.
Article in English | MEDLINE | ID: mdl-39221004

ABSTRACT

Sequence alignments are often used to analyze genomic data. However, such alignments are often only calculated and compared on small sequence intervals for analysis purposes. When comparing longer sequences, these are usually divided into shorter sequence intervals for better alignment results. This usually means that the order context of the original sequence is lost. To prevent this, it is possible to use a graph structure to represent the order of the original sequence on the alignment blocks. The visualization of these graph structures can provide insights into the structural variations of genomes in a semi-global context. In this paper, we propose a new graph drawing framework for representing gMSA data. We produce a hierarchical graph layout that supports the comparative analysis of genomes. Based on a reference, the differences and similarities of the different genome orders are visualized. In this work, we present a complete graph drawing framework for gMSA graphs together with the respective algorithms for each of the steps. Additionally, we provide a prototype and an example data set for analyzing gMSA graphs. Based on this data set, we demonstrate the functionalities of the framework using two examples.

3.
Proc Natl Acad Sci U S A ; 121(35): e2410662121, 2024 Aug 27.
Article in English | MEDLINE | ID: mdl-39163334

ABSTRACT

Proteins perform their biological functions through motion. Although high-throughput prediction of the three-dimensional static structures of proteins has proved feasible using deep-learning-based methods, predicting conformational motions remains a challenge. Purely data-driven machine learning methods have difficulty addressing such motions because available laboratory data on conformational motions are still limited. In this work, we develop a method for generating protein allosteric motions by integrating physical energy landscape information into deep-learning-based methods. We show that local energetic frustration, which quantifies the local features of the energy landscape governing protein allosteric dynamics, can be used to empower AlphaFold2 (AF2) to predict protein conformational motions. Starting from ground-state static structures, this integrative method generates alternative structures as well as pathways of protein conformational motions, using a progressive enhancement of the energetic frustration features in the input multiple sequence alignment sequences. For a model protein, adenylate kinase, we show that the generated conformational motions are consistent with available experimental and molecular dynamics simulation data. Applying the method to two further proteins, KaiB and ribose-binding protein, which undergo large-amplitude conformational changes, also successfully generates the alternative conformations. We also show how to extract overall features of the AF2 energy landscape topography, which many have considered to be a black box. Incorporating physical knowledge into deep-learning-based structure prediction algorithms provides a useful strategy to address the challenges of dynamic structure prediction of allosteric proteins.


Subject(s)
Molecular Dynamics Simulation , Protein Conformation , Proteins/chemistry , Adenylate Kinase/chemistry , Adenylate Kinase/metabolism , Allosteric Regulation , Deep Learning
4.
Int J Mol Sci ; 25(16)2024 Aug 07.
Article in English | MEDLINE | ID: mdl-39201310

ABSTRACT

Triticum aestivum is an important crop whose reference genome (International Wheat Genome Sequencing Consortium (IWGSC) RefSeq v2.1) offers a valuable resource for understanding wheat genetic structure, improving agronomic traits, and developing new cultivars. A key aspect of gene model annotation is protein-level evidence of gene expression obtained from proteomics studies, followed up by proteogenomics to physically map proteins to the genome. In this research, we have retrieved the largest recent wheat proteomics datasets publicly available and applied the Basic Local Alignment Search Tool (tBLASTn) algorithm to map the 861,759 identified unique peptides against IWGSC RefSeq v2.1. Of the 92,719 hits, 83,015 unique peptides aligned along 33,612 High Confidence (HC) genes, thus validating 31.4% of all wheat HC gene models. Furthermore, 6685 unique peptides were mapped against 3702 Low Confidence (LC) gene models, and we argue that these gene models should be considered for HC status. The remaining 2934 orphan peptides can be used for novel gene discovery, as exemplified here on chromosome 4D. We demonstrated that tBLASTn could not map peptides exhibiting mid-sequence frame shift. We supply all our proteogenomics results, Galaxy workflow and Python code, as well as Browser Extensible Data (BED) files as a resource for the wheat community via the Apollo Jbrowse, and GitHub repositories. Our workflow could be applied to other proteomics datasets to expand this resource with proteins and peptides from biotically and abiotically stressed samples. This would help tease out wheat gene expression under various environmental conditions, both spatially and temporally.


Subject(s)
Genome, Plant , Molecular Sequence Annotation , Plant Proteins , Proteogenomics , Triticum , Triticum/genetics , Triticum/metabolism , Proteogenomics/methods , Plant Proteins/genetics , Plant Proteins/metabolism , Algorithms
5.
Genome Biol ; 25(1): 230, 2024 Aug 26.
Article in English | MEDLINE | ID: mdl-39187866

ABSTRACT

Seqrutinator is an objective, flexible pipeline that removes sequences with sequencing and/or gene model errors and sequences from pseudogenes from complex, eukaryotic protein superfamilies. Testing Seqrutinator on major superfamilies BAHD, CYP, and UGT removes only 1.94% of SwissProt entries, 14% of entries from the model plant Arabidopsis thaliana, but 80% of entries from Pinus taeda's recent complete proteome. Application of Seqrutinator on crude BAHDomes, CYPomes, and UGTomes obtained from 16 plant proteomes shows convergence of the numbers of paralogues. MSAs, phylogenies, and particularly functional clustering improve drastically upon Seqrutinator application, indicating good performance.


Subject(s)
Plant Proteins , Plant Proteins/genetics , Plant Proteins/metabolism , Phylogeny , Software , Arabidopsis/genetics , Arabidopsis/metabolism , Proteome , Multigene Family , Sequence Analysis, Protein , Databases, Protein
6.
Mol Biol Evol ; 41(7)2024 Jul 03.
Article in English | MEDLINE | ID: mdl-39041199

ABSTRACT

The current trend in phylogenetic and evolutionary analyses predominantly relies on omic data. However, prior to core analyses, traditional methods typically involve intricate and time-consuming procedures, including assembly from high-throughput reads, decontamination, gene prediction, homology search, orthology assignment, multiple sequence alignment, and matrix trimming. Such processes significantly impede the efficiency of research when dealing with extensive data sets. In this study, we develop PhyloAln, a convenient reference-based tool capable of directly aligning high-throughput reads or complete sequences with existing alignments as a reference for phylogenetic and evolutionary analyses. Through testing with simulated data sets of species spanning the tree of life, PhyloAln demonstrates consistently robust performance compared with other reference-based tools across different data types, sequencing technologies, coverages, and species, with percent completeness and identity at least 50 percentage points higher in the alignments. Additionally, we validate the efficacy of PhyloAln in removing a minimum of 90% foreign and 70% cross-contamination issues, which are prevalent in sequencing data but often overlooked by other tools. Moreover, we showcase the broad applicability of PhyloAln by generating alignments (completeness mostly larger than 80%, identity larger than 90%) and reconstructing robust phylogenies using real data sets of transcriptomes of ladybird beetles, plastid genes of peppers, or ultraconserved elements of turtles. With these advantages, PhyloAln is expected to facilitate phylogenetic and evolutionary analyses in the omic era. The tool is accessible at https://github.com/huangyh45/PhyloAln.


Subject(s)
Phylogeny , Sequence Alignment , Software , Sequence Alignment/methods , High-Throughput Nucleotide Sequencing/methods , Animals , Evolution, Molecular
7.
BMC Bioinformatics ; 25(1): 247, 2024 Jul 29.
Article in English | MEDLINE | ID: mdl-39075359

ABSTRACT

BACKGROUND: Sequence alignment lies at the heart of genome sequence annotation. While the BLAST suite of alignment tools has long held an important role in alignment-based sequence database search, greater sensitivity is achieved through the use of profile hidden Markov models (pHMMs). Here, we describe an FPGA hardware accelerator, called HAVAC, that targets a key bottleneck step (SSV) in the analysis pipeline of the popular pHMM alignment tool, HMMER. RESULTS: The HAVAC kernel calculates the SSV matrix at 1739 GCUPS on a ∼$3000 Xilinx Alveo U50 FPGA accelerator card, ∼227× faster than the optimized SSV implementation in nhmmer. Accounting for PCI-e data transfer and data processing, HAVAC is 65× faster than nhmmer's SSV with one thread and 35× faster than nhmmer with four threads, and uses ∼31% of the energy of a traditional high-end Intel CPU. CONCLUSIONS: HAVAC demonstrates the potential offered by FPGA hardware accelerators to produce dramatic speed gains in sequence annotation and related bioinformatics applications. Because these computations are performed on a co-processor, the host CPU remains free to simultaneously compute other aspects of the analysis pipeline.


Subject(s)
Markov Chains , Sequence Alignment , Sequence Alignment/methods , Computational Biology/methods , Sequence Homology , Algorithms , Software
8.
Biomolecules ; 14(7)2024 Jun 26.
Article in English | MEDLINE | ID: mdl-39062473

ABSTRACT

Glutathione transferases (GSTs) form a superfamily of ubiquitous enzymes that are multigenic in numerous organisms and generally present homodimeric structures. GSTs are involved in numerous biological functions such as chemical detoxification as well as chemoperception in mammals and insects. GSTs catalyze the conjugation of their cofactor, reduced glutathione (GSH), to xenobiotic electrophilic centers. To achieve this catalytic function, each GST subunit comprises a ligand binding site and a GSH binding site, the latter being very specific and highly conserved; the hydrophobic substrate binding site enables the binding of diverse substrates. In this work, we focus on a model organism, the fruit fly Drosophila melanogaster (D. mel), which has 42 GST sequences distributed in six classes that compose its GSTome. The goal of this study is to describe the complete structural GSTome of D. mel to determine how changes in the amino acid sequence modify the structural characteristics of GST, particularly in the GSH binding sites and in the dimerization interface. First, we predicted the 3D atomic structures of each GST using the AlphaFold (AF) program and compared them with X-ray crystallography structures, when they exist. We also characterized and compared their global and local folds. Second, we used multiple sequence alignment coupled with AF-predicted structures to characterize the relationship between the conservation of amino acids in the sequence and their structural features. Finally, we applied normal mode analysis to estimate thermal B-factors of all GST structures of D. mel. In particular, we extracted flexibility profiles of GST and identified key residues and motifs that are systematically involved in the ligand binding/dimerization processes and thus play a crucial role in the catalytic function. This methodology will be extended to guide the in silico design of synthetic GST with new/optimal catalytic properties for detoxification applications.


Subject(s)
Drosophila melanogaster , Glutathione Transferase , Animals , Drosophila melanogaster/enzymology , Glutathione Transferase/chemistry , Glutathione Transferase/metabolism , Glutathione Transferase/genetics , Binding Sites , Amino Acid Sequence , Drosophila Proteins/chemistry , Drosophila Proteins/metabolism , Drosophila Proteins/genetics , Models, Molecular , Crystallography, X-Ray , Glutathione/metabolism , Glutathione/chemistry , Protein Multimerization
9.
Comput Biol Med ; 179: 108815, 2024 Sep.
Article in English | MEDLINE | ID: mdl-38986287

ABSTRACT

Predicting protein structure is both fascinating and formidable, playing a crucial role in structure-based drug discovery and unraveling diseases with elusive origins. The Critical Assessment of Protein Structure Prediction (CASP) serves as a biennial battleground where global scientists converge to untangle the intricate relationships within amino acid chains. Two primary methods, Template-Based Modeling (TBM) and Template-Free (TF) strategies, dominate protein structure prediction. The trend has shifted towards Template-Free predictions due to their broader sequence coverage with fewer templates. The predictive process can be broadly classified into contact map, binned-distance, and real-valued distance predictions, each with distinctive strengths and limitations manifested through tailored loss functions. We also introduce the end-to-end and all-atom diffusion-based techniques that have transformed protein structure prediction. Recent advancements in deep learning techniques have significantly improved prediction accuracy, although the effectiveness is contingent upon the quality of input features derived from natural bio-physicochemical attributes and Multiple Sequence Alignments (MSA). Hence, the generation of high-quality MSA data holds paramount importance in harnessing informative input features for enhanced prediction outcomes. Remarkable successes have been achieved in protein structure prediction accuracy, but they do not yet meet the purposes that structural knowledge is intended to serve, which implies a need for development in other aspects of prediction. In this regard, scientists have opened other frontiers for protein structural prediction. The utilization of subsampling in multiple sequence alignment (MSA) and protein language modeling appears to be particularly promising in enhancing the accuracy and efficiency of predictions, ultimately aiding in drug discovery efforts. The exploration of predicting protein complex structure also opens up exciting opportunities to deepen our knowledge of molecular interactions and design therapeutics that are more effective. In this article, we discuss the difficulties that scientists have worked through to improve prediction accuracy, and examine effective approaches to prediction from different aspects, including the construction of high-quality MSA, the provision of informative input features, and progress in deep learning approaches. We also briefly touch upon the transition from predicting single-chain protein structures to predicting protein complex structures. Our findings point towards promoting open research environments to support the objectives of protein structure prediction.


Subject(s)
Protein Conformation , Proteins , Proteins/chemistry , Models, Molecular , Computational Biology/methods , Humans , Sequence Analysis, Protein/methods , Deep Learning , Databases, Protein
10.
J Comput Biol ; 31(7): 616-637, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38990757

ABSTRACT

Modern genomic datasets, like those generated under the 1000 Genomes Project, contain millions of variants belonging to known haplotypes. Although these datasets are more representative than a single reference sequence and can alleviate issues like reference bias, they are significantly more computationally burdensome to work with, often involving large indexed genome-graph data structures for tasks such as read mapping. The construction, preprocessing, and mapping algorithms can require substantial computational resources depending on the size of these variant sets. Moreover, the accuracy of mapping algorithms has been shown to decrease when working with complete variant sets. Therefore, a drastically reduced set of variants that preserves important properties of the original set is desirable. This work provides a technique for finding a minimal subset of variants S such that, for given parameters α and δ, all substrings up to length α in the haplotypes are guaranteed to be still alignable to the appropriate locations with either Hamming or edit distance at most δ, using only S. Our contributions include showing the NP-hardness and inapproximability of these optimization problems and providing Integer Linear Programming (ILP) formulations. Our edit distance ILP formulation carefully decomposes the problem according to variant locations, which allows it to scale to support all of chromosome 22's variants from the 1000 Genomes Project. Our experiments also demonstrate a significant reduction in the number of variants. For example, for moderately long reads, e.g., α = 1000, over 75% of the variants can be removed while preserving read mappability with edit distance at most one.
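
The two distance measures named in this abstract can be sketched as follows. This is a minimal illustration of Hamming and edit (Levenshtein) distance only, not the paper's ILP formulation; the function names, the `alignable` helper and its `delta` parameter are assumptions introduced for this example.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def alignable(read: str, target: str, delta: int, use_edit: bool = True) -> bool:
    """Hypothetical helper: is read within distance delta of target?"""
    if use_edit:
        return edit_distance(read, target) <= delta
    return len(read) == len(target) and hamming_distance(read, target) <= delta
```

Edit distance tolerates a single-base deletion that Hamming distance cannot express: `alignable("ACGT", "AGT", 1)` holds under edit distance but not under Hamming distance, because the two strings differ in length.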


Subject(s)
Algorithms , Haplotypes , Humans , Computational Biology/methods , Genomics/methods , Genome, Human , Software , Genetic Variation , Sequence Analysis, DNA/methods
11.
Molecules ; 29(13)2024 Jun 23.
Article in English | MEDLINE | ID: mdl-38998944

ABSTRACT

Actin, which plays a crucial role in cellular structure and function, interacts with various binding proteins, notably myosin. In mammals, actin is composed of six isoforms that exhibit high levels of sequence conservation and structural similarity overall. As a result, the selection of actin isoforms was considered unimportant in structural studies of their binding with myosin. However, recent high-resolution structural research discovered subtle structural differences in the N-terminus of actin isoforms, suggesting the possibility that each actin isoform may engage in specific interactions with myosin isoforms. In this study, we aimed to explore this possibility, particularly by understanding the influence of different actin isoforms on the interaction with myosin 7A. First, we compared the reported actomyosin structures utilizing the same type of actin isoforms as the high-resolution filamentous skeletal α-actin (3.5 Å) structure elucidated using cryo-EM. Through this comparison, we confirmed that the diversity of myosin isoforms leads to differences in interaction with the actin N-terminus, and that loop 2 of the myosin actin-binding sites directly interacts with the actin N-terminus. Subsequently, with the aid of multiple sequence alignment, we observed significant variations in the length of loop 2 across different myosin isoforms. We predicted that these length differences in loop 2 would likely result in structural variations that would affect the interaction with the actin N-terminus. For myosin 7A, loop 2 was found to be very short, and protein complex predictions using skeletal α-actin confirmed an interaction between loop 2 and the actin N-terminus. The prediction indicated that the positively charged residues present in loop 2 electrostatically interact with the acidic patch residues D24 and D25 of actin subdomain 1, whereas interaction with the actin N-terminus beyond this was not observed. 
Additionally, analyses of the actomyosin-7A prediction models generated using various actin isoforms consistently yielded the same results regardless of the type of actin isoform employed. The results of this study suggest that the subtle structural differences in the N-terminus of actin isoforms are unlikely to influence the binding structure with short loop 2 myosin 7A. Our findings are expected to provide a deeper understanding for future high-resolution structural binding studies of actin and myosin.


Subject(s)
Actins , Myosins , Protein Binding , Protein Isoforms , Actins/chemistry , Actins/metabolism , Protein Isoforms/chemistry , Protein Isoforms/metabolism , Myosins/chemistry , Myosins/metabolism , Binding Sites , Animals , Models, Molecular , Amino Acid Sequence , Cryoelectron Microscopy , Humans
12.
Biomedicines ; 12(6)2024 May 28.
Article in English | MEDLINE | ID: mdl-38927403

ABSTRACT

The enzyme 4-hydroxyphenylpyruvate dioxygenase (4-HPPD) is involved in the catabolism of the amino acid tyrosine in organisms such as bacteria, plants, and animals. It catalyzes the conversion of 4-hydroxyphenylpyruvate to homogentisate in the presence of molecular oxygen and Fe(II) as a cofactor. This enzyme represents a key step in the biosynthesis of important compounds, and its activity deficiency leads to severe, rare autosomal recessive disorders, like tyrosinemia type III and hawkinsinuria, for which no cure is currently available. The 4-HPPD C-terminal tail plays a crucial role in the enzyme catalysis/gating mechanism, ensuring the integrity of the active site for catalysis through fine regulation of the C-terminal tail conformation. However, despite growing interest in the 4-HPPD catalytic mechanism and structure, the gating mechanism remains unclear. Furthermore, the absence of the whole 3D structure makes the bioinformatic approach the only possible way to define the enzyme structure/molecular mechanism. Here, wild-type 4-HPPD and its mutants were dissected in depth through a comprehensive bioinformatics/evolution study, and we showed for the first time the entire molecular mechanism and regulation of the enzyme gating process, proposing the full-length 3D structure of human 4-HPPD and two novel key residues involved in the 4-HPPD C-terminal tail conformational change.

13.
Int J Mol Sci ; 25(11)2024 Jun 06.
Article in English | MEDLINE | ID: mdl-38892439

ABSTRACT

Enzymes play a crucial role in various industrial production and pharmaceutical developments, serving as catalysts for numerous biochemical reactions. Determining the optimal catalytic temperature (Topt) of enzymes is crucial for optimizing reaction conditions, enhancing catalytic efficiency, and accelerating the industrial processes. However, due to the limited availability of experimentally determined Topt data and the insufficient accuracy of existing computational methods in predicting Topt, there is an urgent need for a computational approach to predict the Topt values of enzymes accurately. In this study, using phosphatase (EC 3.1.3.X) as an example, we constructed a machine learning model utilizing amino acid frequency and protein molecular weight information as features and employing the K-nearest neighbors regression algorithm to predict the Topt of enzymes. Usually, when conducting engineering for enzyme thermostability, researchers tend not to modify conserved amino acids. Therefore, we utilized this machine learning model to predict the Topt of phosphatase sequences after removing conserved amino acids. We found that the predictive model's mean coefficient of determination (R2) value increased from 0.599 to 0.755 compared to the model based on the complete sequences. Subsequently, experimental validation on 10 phosphatase enzymes with undetermined optimal catalytic temperatures shows that the predicted values of most phosphatase enzymes based on the sequence without conservative amino acids are closer to the experimental optimal catalytic temperature values. This study lays the foundation for the rapid selection of enzymes suitable for industrial conditions.
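
The modelling idea in this abstract (amino acid frequencies as features, K-nearest neighbors regression as the predictor) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the authors' model: the molecular weight feature is omitted, the function names are hypothetical, and any training pairs passed in are placeholders supplied by the caller rather than experimental Topt values.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_frequencies(seq: str) -> list[float]:
    """Fraction of each standard amino acid in seq (other symbols ignored)."""
    counts = Counter(seq.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence has no standard amino acids")
    return [counts[a] / total for a in AMINO_ACIDS]


def knn_predict(train: list[tuple[str, float]], query_seq: str, k: int = 3) -> float:
    """Predict a value as the mean target of the k nearest training sequences.

    Nearness is Euclidean distance between amino acid frequency vectors.
    """
    q = aa_frequencies(query_seq)
    scored = sorted((math.dist(q, aa_frequencies(s)), t) for s, t in train)
    nearest = scored[:k]
    return sum(t for _, t in nearest) / len(nearest)
```

In this sketch, the abstract's conserved-residue step would amount to deleting the conserved alignment columns from every sequence before calling `aa_frequencies`. How those columns are identified is not specified in the abstract and is left out here.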


Subject(s)
Amino Acids , Machine Learning , Temperature , Amino Acids/chemistry , Amino Acids/metabolism , Phosphoric Monoester Hydrolases/metabolism , Phosphoric Monoester Hydrolases/chemistry , Catalysis , Enzyme Stability , Algorithms , Conserved Sequence , Amino Acid Sequence
14.
Mol Biol Evol ; 41(7)2024 Jul 03.
Article in English | MEDLINE | ID: mdl-38842253

ABSTRACT

Despite having important biological implications, insertion and deletion (indel) events are often disregarded or mishandled during phylogenetic inference. In multiple sequence alignment, indels are represented as gaps and are estimated without considering the distinct evolutionary history of insertions and deletions. Consequently, indels are usually excluded from subsequent inference steps, such as ancestral sequence reconstruction and phylogenetic tree search. Here, we introduce indel-aware parsimony (indelMaP), a novel way to treat gaps under the parsimony criterion by considering insertions and deletions as separate evolutionary events and accounting for long indels. By identifying the precise location of an evolutionary event on the tree, we can separate overlapping indel events and use affine gap penalties for long indel modeling. Our indel-aware approach harnesses the phylogenetic signal from indels, incorporating them into all inference stages. Validation and comparison to state-of-the-art inference tools on simulated data show that indelMaP is most suitable for densely sampled datasets with closely to moderately related sequences, where it can reach alignment quality comparable to probabilistic methods and accurately infer ancestral sequences, including indel patterns. Due to its remarkable speed, our method is well suited for epidemiological datasets, eliminating the need for downsampling and enabling the exploitation of the additional information provided by dense taxonomic sampling. Moreover, indelMaP offers new insights into the indel patterns of biologically significant sequences and advances our understanding of genetic variability by considering gaps as crucial evolutionary signals rather than mere artefacts.
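
The affine gap penalty mentioned in this abstract can be illustrated with a short sketch. It scores the gap runs in a single aligned row and is not the indelMaP algorithm, which additionally places insertions and deletions on a tree. The function name and the default `gap_open` and `gap_extend` values are assumptions chosen for illustration, and some tools use the slightly different convention `gap_open + n * gap_extend`.

```python
import re


def affine_gap_cost(aligned_row: str, gap_open: float = 2.0,
                    gap_extend: float = 0.5) -> float:
    """Total cost of all gap runs in one aligned row under an affine penalty.

    A run of n consecutive '-' characters costs
    gap_open + (n - 1) * gap_extend.
    """
    cost = 0.0
    for run in re.findall(r"-+", aligned_row):
        cost += gap_open + (len(run) - 1) * gap_extend
    return cost
```

With these illustrative parameters, one three-column gap (`"AC---GT"`) costs 3.0, while three separate single-column gaps (`"A-C-G-T"`) cost 6.0. This is why affine penalties favor explaining adjacent gap columns as a single long indel.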


Subject(s)
INDEL Mutation , Phylogeny , Sequence Alignment , Sequence Alignment/methods , Evolution, Molecular , Models, Genetic , Humans
15.
Methods Mol Biol ; 2822: 263-290, 2024.
Article in English | MEDLINE | ID: mdl-38907924

ABSTRACT

RNA-Seq data analysis stands as a vital part of genomics research, turning vast and complex datasets into meaningful biological insights. It is a field marked by rapid evolution and ongoing innovation, necessitating a thorough understanding for anyone seeking to unlock the potential of RNA-Seq data. In this chapter, we describe the intricate landscape of RNA-Seq data analysis, elucidating a comprehensive pipeline that navigates through the entirety of this complex process. Beginning with quality control, the chapter underscores the paramount importance of ensuring the integrity of RNA-Seq data, as it lays the groundwork for subsequent analyses. Preprocessing is then addressed, where the raw sequence data undergoes necessary modifications and enhancements, setting the stage for the alignment phase. This phase involves mapping the processed sequences to a reference genome, a step pivotal for decoding the origins and functions of these sequences. Venturing into the heart of RNA-Seq analysis, the chapter then explores differential expression analysis, the process of identifying genes that exhibit varying expression levels across different conditions or sample groups. Recognizing the biological context of these differentially expressed genes is pivotal; hence, the chapter transitions into functional analysis. Here, methods and tools like Gene Ontology and pathway analyses help contextualize the roles and interactions of the identified genes within broader biological frameworks. However, the chapter does not stop at conventional analysis methods. Embracing the evolving paradigms of data science, it delves into machine learning applications for RNA-Seq data, introducing advanced techniques in dimension reduction and both unsupervised and supervised learning. These approaches allow for patterns and relationships to be discerned in the data that might be imperceptible through traditional methods.


Subject(s)
Computational Biology , RNA-Seq , Software , RNA-Seq/methods , Humans , Computational Biology/methods , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Genomics/methods , Data Analysis , Gene Ontology , High-Throughput Nucleotide Sequencing/methods
16.
Methods Mol Biol ; 2802: 395-425, 2024.
Article in English | MEDLINE | ID: mdl-38819566

ABSTRACT

The field of viral genomic studies has experienced an unprecedented increase in data volume. New strains of known viruses are constantly being added to the GenBank database and so are completely new species with little or no resemblance to our databases of sequences. In addition to this, metagenomic techniques have the potential to further increase the number and rate of sequenced genomes. Besides, it is important to consider that viruses have a set of unique features that often break down molecular biology dogmas, e.g., the flux of information from RNA to DNA in retroviruses and the use of RNA molecules as genomes. As a result, extracting meaningful information from viral genomes remains a challenge and standard methods for comparing the unknown and our databases of characterized sequences may need adaptations. Thus, several bioinformatic approaches and tools have been created to address the challenge of analyzing viral data. This chapter offers descriptions and protocols of some of the most important bioinformatic techniques for comparative analysis of viruses. The authors also provide comments and discussion on how viruses' unique features can affect standard analyses and how to overcome some of the major sources of problems. Protocols and topics emphasize online tools (which are more accessible to users) and give the real experience of what most bioinformaticians do in day-by-day work with command-line pipelines. The topics discussed include (1) clustering related genomes, (2) whole genome multiple sequence alignments for small RNA viruses, (3) protein alignment for marker genes and species affiliation, (4) variant calling and annotation, and (5) virome analyses and pathogen identification.


Subject(s)
Computational Biology , Genome, Viral , Viruses , Computational Biology/methods , Viruses/genetics , Viruses/classification , Software , Databases, Genetic
17.
Methods Mol Biol ; 2726: 235-254, 2024.
Article in English | MEDLINE | ID: mdl-38780734

ABSTRACT

Generating accurate alignments of non-coding RNA sequences is indispensable in the quest for understanding RNA function. Nevertheless, aligning RNAs remains a challenging computational task. In the twilight zone of RNA sequences with low sequence similarity, sequence homologies and compatible, favorable (a priori unknown) structures can be inferred only in dependence on each other. Thus, simultaneous alignment and folding (SA&F) remains the gold standard of comparative RNA analysis, even if this method is computationally highly demanding. This text introduces the recent release 2.0 of the software package LocARNA, focusing on its practical application. The package enables versatile, fast and accurate analysis of multiple RNAs. For this purpose, it implements SA&F algorithms in a specific, lightweight flavor that makes them routinely applicable at large scale. Its high performance is achieved by combining ensemble-based sparsification of the structure space and banding strategies. Probabilistic banding strongly improves the performance of LocARNA 2.0 even over previous releases, while simplifying its effective use. Enabling flexible application to various use cases, LocARNA provides tools to globally and locally compare, cluster, and multiply align RNAs based on optimization and probabilistic variants of SA&F, which optionally integrate prior knowledge, expressible by anchor and structure constraints.


Subject(s)
Algorithms , Computational Biology , RNA Folding , RNA , Software , RNA/genetics , RNA/chemistry , Computational Biology/methods , Nucleic Acid Conformation , Sequence Alignment/methods , Sequence Analysis, RNA/methods
18.
Methods Mol Biol ; 2726: 255-284, 2024.
Article in English | MEDLINE | ID: mdl-38780735

ABSTRACT

Effective homology search for non-coding RNAs is frequently not possible via sequence similarity alone. Current methods leverage evolutionary information such as structure conservation or covariance scores to identify homologs in organisms that are phylogenetically more distant. In this chapter, we introduce the theoretical background of evolutionary structure conservation and covariance scores, and we show hands-on how current methods in the field are applied to example datasets.
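
To make the idea of a covariance score concrete, the sketch below computes the mutual information between two columns of an RNA alignment, which is one common way of quantifying covariation. It is not the scoring scheme of any particular tool covered in the chapter; the function name `column_mutual_information` and the toy alignment are assumptions made purely for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def column_mutual_information(alignment: Sequence[str], i: int, j: int) -> float:
    """Mutual information (in bits) between alignment columns i and j.

    Hypothetical helper: every character, including gaps, is treated
    as an ordinary symbol.
    """
    pairs = [(seq[i], seq[j]) for seq in alignment]
    n = len(pairs)
    if n == 0:
        raise ValueError("alignment must contain at least one sequence")

    count_i = Counter(a for a, _ in pairs)   # symbol counts in column i
    count_j = Counter(b for _, b in pairs)   # symbol counts in column j
    count_ij = Counter(pairs)                # joint counts of symbol pairs

    mi = 0.0
    for (a, b), c in count_ij.items():
        p_ab = c / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy alignment (assumed data): columns 0 and 4 always form a
# complementary pair, while columns 1 and 2 never vary.
toy_alignment = ["GAAAC", "CAAAG", "AAAAU", "UAAAA"]
print(column_mutual_information(toy_alignment, 0, 4))
print(column_mutual_information(toy_alignment, 1, 2))
```

In this toy case the first pair of columns carries the maximum of 2 bits of mutual information, whereas the invariant columns carry none, which illustrates why covariation can reveal conserved base pairs even when the sequences themselves differ.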


Subject(s)
Computational Biology , Evolution, Molecular , Computational Biology/methods , Phylogeny , Algorithms , RNA, Untranslated/genetics , Conserved Sequence , Humans , Animals , Software , Sequence Alignment/methods
19.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38695119

ABSTRACT

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.
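
As a minimal sketch of the E-score definition given in the abstract, the cosine similarity between two residue embedding vectors can be computed as follows. The function name `e_score` and the toy three-dimensional vectors are assumptions for illustration only; real embeddings would come from a protein language model such as ProtT5 and have far more dimensions, and this is not the authors' implementation.

```python
import math
from typing import Sequence


def e_score(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (hypothetical helper)."""
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy embeddings (assumed values, not output of a real model).
residue_a = [1.0, 0.0, 0.0]
residue_b = [1.0, 1.0, 0.0]
print(e_score(residue_a, residue_b))
```

For these toy vectors the score is about 0.707. Because the embedding of a residue depends on its surrounding sequence, the same pair of amino acids can receive different scores in different contexts, unlike a fixed matrix entry.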


Subject(s)
Algorithms , Computational Biology , Sequence Alignment , Sequence Alignment/methods , Computational Biology/methods , Software , Sequence Analysis, Protein/methods , Amino Acid Sequence , Proteins/chemistry , Proteins/genetics , Deep Learning , Databases, Protein
20.
Mol Ecol Resour ; 24(5): e13962, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38646687

ABSTRACT

Preparation of DNA polymorphism datasets for analysis is an important step in evolutionary genetic and molecular ecology studies. Ever-growing dataset sizes make this step time-consuming, but few convenient software tools are available to facilitate processing of large-scale datasets including thousands of sequence alignments. Here I report "processor of sequences v4" (proSeq4), a user-friendly multiplatform software for preparation and evolutionary genetic analyses of genome- or transcriptome-scale sequence polymorphism datasets. The program has an easy-to-use graphical user interface and is designed to process and analyse many thousands of datasets. It supports over two dozen file formats, includes a flexible sequence editor and various tools for data visualization, quality control and most commonly used evolutionary genetic analyses, such as NJ-phylogeny reconstruction, DNA polymorphism analyses and coalescent simulations. Command line tools (e.g. vcf2fasta) are also provided for easier integration into bioinformatic pipelines. Apart from molecular ecology and evolution research, proSeq4 may be useful for teaching, e.g. for visual illustration of different shapes of phylogenies generated with coalescent simulations in different scenarios. ProSeq4 source code and binaries for Windows, MacOS and Ubuntu are available from https://sourceforge.net/projects/proseq/.
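
As an illustration of the kind of DNA polymorphism analysis mentioned above, the sketch below computes nucleotide diversity (the average number of pairwise differences per site) for a small alignment. It is a generic textbook calculation rather than proSeq4's own code; the function name `nucleotide_diversity` and the toy sequences are assumptions.

```python
from itertools import combinations
from typing import Sequence


def nucleotide_diversity(sequences: Sequence[str]) -> float:
    """Average number of pairwise differences per site (hypothetical helper).

    Assumes equal-length, already aligned sequences; every mismatch,
    including one involving a gap character, counts as a difference.
    """
    if len(sequences) < 2:
        raise ValueError("at least two sequences are required")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")

    total_differences = 0
    n_pairs = 0
    for a, b in combinations(sequences, 2):
        total_differences += sum(x != y for x, y in zip(a, b))
        n_pairs += 1
    return total_differences / (n_pairs * length)


# Toy alignment (assumed data): three sequences of eight sites.
toy_alignment = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
print(nucleotide_diversity(toy_alignment))
```

Here the three pairwise comparisons differ at 1, 1 and 2 sites, so the result is 4 / (3 × 8), or about 0.167 differences per site.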


Subject(s)
Computational Biology , Polymorphism, Genetic , Software , Computational Biology/methods , Sequence Analysis, DNA/methods