Search | VHL Regional Portal

1.

CAREx: context-aware read extension of paired-end sequencing data.

Kallenborn, Felix; Schmidt, Bertil.

BMC Bioinformatics ; 25(1): 186, 2024 May 10.

Article in English | MEDLINE | ID: mdl-38730374

ABSTRACT

BACKGROUND: Commonly used next generation sequencing machines typically produce large amounts of short reads of a few hundred base-pairs in length. However, many downstream applications would generally benefit from longer reads. RESULTS: We present CAREx, an algorithm for the generation of pseudo-long reads from paired-end short-read Illumina data based on the concept of repeatedly computing multiple-sequence-alignments to extend a read until its partner is found. Our performance evaluation on both simulated data and real data shows that CAREx is able to connect significantly more read pairs (up to 99% for simulated data) and to produce more error-free pseudo-long reads than previous approaches. When used prior to assembly it can achieve superior de novo assembly results. Furthermore, the GPU-accelerated version of CAREx exhibits the fastest execution times among all tested tools. CONCLUSION: CAREx is a new MSA-based algorithm and software for producing pseudo-long reads from paired-end short read data. It outperforms other state-of-the-art programs in terms of (i) percentage of connected read pairs, (ii) reduction of error rates of filled gaps, (iii) runtime, and (iv) downstream analysis using de novo assembly. CAREx is open-source software written in C++ (CPU version) and in CUDA/C++ (GPU version). It is licensed under GPLv3 and can be downloaded at https://github.com/fkallen/CAREx.

Subject(s)

Algorithms , High-Throughput Nucleotide Sequencing , Software , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Humans , Sequence Alignment/methods

2.

Scoring alignments by embedding vector similarity.

Ashrafzadeh, Sepehr; Golding, G Brian; Ilie, Silvana; Ilie, Lucian.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38695119

ABSTRACT

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar function and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but have the drawback that they are fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness, being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning in order to leverage the power of enormous amounts of unlabelled data to generate contextual embeddings, which are vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.

Subject(s)

Algorithms , Computational Biology , Sequence Alignment , Sequence Alignment/methods , Computational Biology/methods , Software , Sequence Analysis, Protein/methods , Amino Acid Sequence , Proteins/chemistry , Proteins/genetics , Deep Learning , Databases, Protein
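
As an illustration of the E-score defined in the abstract above (the cosine similarity between two residue embedding vectors), the following is a minimal Python sketch. The function name `e_score` and the three-dimensional toy vectors are assumptions made for this example; real embeddings such as those from ProtT5 have many more dimensions, and this is not the authors' implementation.

```python
import math


def e_score(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings (illustrative values, not real model output).
residue_a = [1.0, 2.0, 0.0]
residue_b = [2.0, 4.0, 0.0]  # same direction as residue_a
residue_c = [0.0, 0.0, 3.0]  # orthogonal to residue_a

score_ab = e_score(residue_a, residue_b)
score_ac = e_score(residue_a, residue_c)
```

Because cosine similarity depends only on the angle between vectors, `residue_a` and `residue_b` score as maximally similar despite their different magnitudes, while the orthogonal pair scores zero.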

3.

PanDepth, an ultrafast and efficient genomic tool for coverage calculation.

Yu, Huiyang; Shi, Chunmei; He, Weiming; Li, Feng; Ouyang, Bo.

Brief Bioinform ; 25(3)2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38701418

ABSTRACT

Coverage quantification is required in many sequencing datasets within the field of genomics research. However, most existing tools fail to provide comprehensive statistical results and exhibit limited performance gains from multithreading. Here, we present PanDepth, an ultra-fast and efficient tool for calculating coverage and depth from sequencing alignments. PanDepth outperforms other tools in computation time and memory efficiency for both BAM and CRAM-format alignment files from sequencing data, regardless of read length. It employs chromosome parallel computation and optimized data structures, resulting in ultrafast computation speeds and memory efficiency. It accepts sorted or unsorted BAM and CRAM-format alignment files as well as GTF, GFF and BED-formatted interval files or a specific window size. When provided with a reference genome sequence and the option to enable GC content calculation, PanDepth includes GC content statistics, enhancing the accuracy and reliability of copy number variation analysis. Overall, PanDepth is a powerful tool that accelerates scientific discovery in genomics research.

Subject(s)

Genomics , Software , Genomics/methods , Humans , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Base Composition , DNA Copy Number Variations , Computational Biology/methods , Algorithms , Sequence Alignment/methods
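
To make the coverage and depth quantities mentioned in the abstract above concrete, here is a small Python sketch that computes per-base depth from alignment intervals with a difference array. This is a generic textbook approach under stated assumptions: alignments are given as half-open `(start, end)` coordinates on a single region, and the function names are hypothetical. The abstract does not describe PanDepth's internal data structures, so this should not be read as its algorithm.

```python
def depth_per_base(region_length, alignments):
    """Per-base depth over [0, region_length) from half-open (start, end) intervals."""
    diff = [0] * (region_length + 1)
    for start, end in alignments:
        # Clip each alignment to the region boundaries.
        start = max(start, 0)
        end = min(end, region_length)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    depth = []
    running = 0
    for delta in diff[:region_length]:
        running += delta
        depth.append(running)
    return depth


def coverage_summary(depth):
    """Covered bases, covered fraction, and mean depth for a depth vector."""
    n = len(depth)
    covered = sum(1 for d in depth if d > 0)
    return {
        "covered_bases": covered,
        "coverage_fraction": covered / n if n else 0.0,
        "mean_depth": sum(depth) / n if n else 0.0,
    }


# Hypothetical alignments on a 10 bp region.
reads = [(0, 4), (2, 6), (8, 10)]
depth = depth_per_base(10, reads)
summary = coverage_summary(depth)
```

The difference array costs one update per alignment end plus one pass over the region, which is why it is a common starting point before adding per-chromosome parallelism.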

4.

Accelerating spliced alignment of long RNA sequencing reads using parallel maximal exact match retrieval.

Wang, Rongxing; Zhang, Yanju.

Comput Biol Med ; 175: 108542, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38714048

ABSTRACT

The genomics landscape has undergone a revolutionary transformation with the emergence of third-generation sequencing technologies. Fueled by the exponential surge in sequencing data, there is an urgent demand for accurate and rapid algorithms to effectively handle this burgeoning influx. Under such circumstances, we developed a parallelized, yet accuracy-lossless algorithm for maximal exact match (MEM) retrieval to strategically address the computational bottleneck of uLTRA, a leading spliced alignment algorithm known for its precision in handling long RNA sequencing (RNA-seq) reads. The design of the algorithm incorporates a multi-threaded strategy, enabling the concurrent processing of multiple reads. Additionally, we implemented the serialization of the index required for MEM retrieval to facilitate its reuse, resulting in accelerated startup for practical tasks. Extensive experiments demonstrate that our parallel algorithm achieves significant improvements in runtime, speedup, throughput, and memory usage. When applied to the largest human dataset, the algorithm achieves an impressive speedup of 10.78×, significantly improving throughput on a large scale. Moreover, the integration of the parallel MEM retrieval algorithm into the uLTRA pipeline introduces a dual-layered parallel capability, consistently yielding a speedup of 4.99× compared to the multi-process and single-threaded execution of uLTRA. The thorough analysis of experimental results underscores the adept utilization of parallel processing capabilities and its advantageous performance in handling large datasets. This study provides a showcase of parallelized strategies for MEM retrieval within the context of a spliced alignment algorithm, effectively facilitating the process of RNA-seq data analysis. The code is available at https://github.com/RongxingWong/AcceleratingSplicedAlignment.

Subject(s)

Algorithms , Sequence Analysis, RNA , Humans , Sequence Analysis, RNA/methods , RNA Splicing , High-Throughput Nucleotide Sequencing/methods , Sequence Alignment/methods , Software

5.

Effect of tokenization on transformers for biological sequences.

Dotan, Edo; Jaschek, Gal; Pupko, Tal; Belinkov, Yonatan.

Bioinformatics ; 40(4)2024 Mar 29.

Article in English | MEDLINE | ID: mdl-38608190

ABSTRACT

MOTIVATION: Deep-learning models are transforming biological research, including many bioinformatics and comparative genomics algorithms, such as sequence alignments, phylogenetic tree inference, and automatic classification of protein functions. Among these deep-learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different from natural languages, such as English and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families. RESULTS: We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a 3-fold decrease in the number of tokens. We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data. AVAILABILITY AND IMPLEMENTATION: Code, data, and trained tokenizers are available on https://github.com/technion-cs-nlp/BiologicalTokenizers.

Subject(s)

Algorithms , Computational Biology , Deep Learning , Natural Language Processing , Computational Biology/methods , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods
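
The abstract above contrasts the trivial one-character-per-token tokenizer with learned alternatives that shorten the input. The Python sketch below shows how one widely used learned scheme, byte-pair encoding (BPE), reduces the token count on toy DNA strings. Choosing BPE is an assumption made for illustration, since the abstract does not name the algorithms evaluated; the function names and sequences are likewise hypothetical.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each left-to-right occurrence of the adjacent pair with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(sequences, num_merges):
    """Learn up to num_merges merges, starting from single-character tokens."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best_pair = pair_counts.most_common(1)[0][0]
        merges.append(best_pair)
        corpus = [apply_merge(tokens, best_pair) for tokens in corpus]
    return merges


def tokenize(sequence, merges):
    """Tokenize a sequence by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Hypothetical toy training data.
training = ["ACGTACGT", "ACGTTT"]
merges = train_bpe(training, num_merges=2)
char_tokens = list("ACGTACGT")
bpe_tokens = tokenize("ACGTACGT", merges)
```

Concatenating `bpe_tokens` reproduces the input exactly, so the shorter token list loses no information; only the segmentation changes.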

6.

VirusPredictor: XGBoost-based software to predict virus-related sequences in human data.

Liu, Guangchen; Chen, Xun; Luan, Yihui; Li, Dawei.

Bioinformatics ; 40(4)2024 Mar 29.

Article in English | MEDLINE | ID: mdl-38597887

ABSTRACT

MOTIVATION: Discovering disease causative pathogens, particularly viruses without reference genomes, poses a technical challenge as they are often unidentifiable through sequence alignment. Machine learning prediction of patient high-throughput sequences unmappable to human and pathogen genomes may reveal sequences originating from uncharacterized viruses. Currently, there is a lack of software specifically designed for accurately predicting such viral sequences in human data. RESULTS: We developed a fast XGBoost method and software VirusPredictor leveraging an in-house viral genome database. Our two-step XGBoost models first classify each query sequence into one of three groups: infectious virus, endogenous retrovirus (ERV) or non-ERV human. The prediction accuracies increased as the sequences became longer, i.e. 0.76, 0.93, and 0.98 for 150-350 (Illumina short reads), 850-950 (Sanger sequencing data), and 2000-5000 bp sequences, respectively. Then, sequences predicted to be from infectious viruses are further classified into one of six virus taxonomic subgroups, and the accuracies increased from 0.92 to >0.98 when query sequences increased from 150-350 to >850 bp. The results suggest that Illumina short reads should be de novo assembled into contigs (e.g. ~1000 bp or longer) before prediction whenever possible. We applied VirusPredictor to multiple real genomic and metagenomic datasets and obtained high accuracies. VirusPredictor, a user-friendly open-source Python software, is useful for predicting the origins of patients' unmappable sequences. This study is the first to classify ERVs in infectious viral sequence prediction. This is also the first study combining virus sub-group predictions. AVAILABILITY AND IMPLEMENTATION: www.dllab.org/software/VirusPredictor.html.

Subject(s)

Genome, Viral , Software , Humans , Viruses/genetics , Sequence Analysis, DNA/methods , Sequence Alignment/methods , Machine Learning

7.

Improvements in viral gene annotation using large language models and soft alignments.

Harrigan, William L; Ferrell, Barbra D; Wommack, K Eric; Polson, Shawn W; Schreiber, Zachary D; Belcaid, Mahdi.

BMC Bioinformatics ; 25(1): 165, 2024 Apr 25.

Article in English | MEDLINE | ID: mdl-38664627

ABSTRACT

BACKGROUND: The annotation of protein sequences in public databases has long posed a challenge in molecular biology. This issue is particularly acute for viral proteins, which demonstrate limited homology to known proteins when using alignment, k-mer, or profile-based homology search approaches. A novel methodology employing Large Language Models (LLMs) addresses this methodological challenge by annotating protein sequences based on embeddings. RESULTS: Central to our contribution is the soft alignment algorithm, drawing from traditional protein alignment but leveraging embedding similarity at the amino acid level to bypass the need for conventional scoring matrices. This method not only surpasses pooled embedding-based models in efficiency but also in interpretability, enabling users to easily trace homologous amino acids and delve deeper into the alignments. Far from being a black box, our approach provides transparent, BLAST-like alignment visualizations, combining traditional biological research with AI advancements to elevate protein annotation through embedding-based analysis while ensuring interpretability. Tests using the Virus Orthologous Groups and ViralZone protein databases indicated that the novel soft alignment approach recognized and annotated sequences that both blastp and pooling-based methods, which are commonly used for sequence annotation, failed to detect. CONCLUSION: The embeddings approach shows the great potential of LLMs for enhancing protein sequence annotation, especially in viral genomics. These findings present a promising avenue for more efficient and accurate protein function inference in molecular biology.

Subject(s)

Algorithms , Molecular Sequence Annotation , Sequence Alignment , Molecular Sequence Annotation/methods , Sequence Alignment/methods , Viral Proteins/genetics , Viral Proteins/chemistry , Genes, Viral , Databases, Protein , Computational Biology/methods , Amino Acid Sequence

8.

Characteristic Attribute Organization System (CAOS): Identifying Classification Rules Based on Phylogenetically Organized Sequences.

Ramanan, Vivek; Sarkar, Indra Neil.

Methods Mol Biol ; 2744: 335-345, 2024.

Article in English | MEDLINE | ID: mdl-38683329

ABSTRACT

Classification is a technique that labels subjects based on the characteristics of the data. It often includes using prior learned information from preexisting data drawn from the same distribution or data type to make informed decisions for each given subject. The method presented here, the Characteristic Attribute Organization System (CAOS), uses a character-based approach to molecular sequence classification. Using a set of aligned sequences (either nucleotide or amino acid) and a maximum parsimony tree, CAOS will generate classification rules for the sequences based on tree structure and provide more interpretable results than other classification or sequence analysis protocols. The code is accessible at https://github.com/JuliaHealth/CAOS.jl/.

Subject(s)

Phylogeny , Software , Computational Biology/methods , Algorithms , Sequence Alignment/methods

9.

A phylogenetic method linking nucleotide substitution rates to rates of continuous trait evolution.

Gemmell, Patrick; Sackton, Timothy B; Edwards, Scott V; Liu, Jun S.

PLoS Comput Biol ; 20(4): e1011995, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38656999

ABSTRACT

Genomes contain conserved non-coding sequences that perform important biological functions, such as gene regulation. We present a phylogenetic method, PhyloAcc-C, that associates nucleotide substitution rates with changes in a continuous trait of interest. The method takes as input a multiple sequence alignment of conserved elements, continuous trait data observed in extant species, and a background phylogeny and substitution process. Gibbs sampling is used to assign rate categories (background, conserved, accelerated) to lineages and explore whether the assigned rate categories are associated with increases or decreases in the rate of trait evolution. We test our method using simulations and then illustrate its application using mammalian body size and lifespan data previously analyzed with respect to protein coding genes. Like other studies, we find processes such as tumor suppression, telomere maintenance, and p53 regulation to be related to changes in longevity and body size. In addition, we also find that skeletal genes, and developmental processes, such as sprouting angiogenesis, are relevant.

Subject(s)

Evolution, Molecular , Models, Genetic , Phylogeny , Animals , Longevity/genetics , Humans , Computational Biology/methods , Computer Simulation , Body Size/genetics , Nucleotides/genetics , Sequence Alignment/methods

10.

A hepatitis B virus (HBV) sequence variation graph improves alignment and sample-specific consensus sequence construction.

Duchen, Dylan; Clipman, Steven J; Vergara, Candelaria; Thio, Chloe L; Thomas, David L; Duggal, Priya; Wojcik, Genevieve L.

PLoS One ; 19(4): e0301069, 2024.

Article in English | MEDLINE | ID: mdl-38669259

ABSTRACT

Nearly 300 million individuals live with chronic hepatitis B virus (HBV) infection (CHB), for which no curative therapy is available. As viral diversity is associated with pathogenesis and immunological control of infection, improved methods to characterize this diversity could aid drug development efforts. Conventionally, viral sequencing data are mapped/aligned to a reference genome, and only the aligned sequences are retained for analysis. Thus, reference selection is critical, yet selecting the most representative reference a priori remains difficult. We investigate an alternative pangenome approach which can combine multiple reference sequences into a graph which can be used during alignment. Using simulated short-read sequencing data generated from publicly available HBV genomes and real sequencing data from an individual living with CHB, we demonstrate alignment to a phylogenetically representative 'genome graph' can improve alignment, avoid issues of reference ambiguity, and facilitate the construction of sample-specific consensus sequences more genetically similar to the individual's infection. Graph-based methods can, therefore, improve efforts to characterize the genetics of viral pathogens, including HBV, and have broader implications in host-pathogen research.

Subject(s)

Consensus Sequence , Genome, Viral , Hepatitis B virus , Hepatitis B virus/genetics , Humans , Consensus Sequence/genetics , Phylogeny , Sequence Alignment/methods , Genetic Variation , Hepatitis B, Chronic/virology , DNA, Viral/genetics , Sequence Analysis, DNA/methods

11.

Choice of Metric Divergence in Genome Sequence Comparison.

Ghosh, Soumen; Pal, Jayanta; Maji, Bansibadan; Cattani, Carlo; Bhattacharya, Dilip Kumar.

Protein J ; 43(2): 259-273, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38492188

ABSTRACT

The paper introduces a novel probability descriptor for genome sequence comparison, employing a generalized form of Jensen-Shannon divergence. This divergence metric stems from a one-parameter family, comprising fractions up to a maximum value of half. Utilizing this metric as a distance measure, a distance matrix is computed for the new probability descriptor, shaping phylogenetic trees via the neighbor-joining method. Initial exploration involves setting the parameter at half for various species. To assess the impact of parameter variation, trees are drawn at different parameter values (half, one-fourth, one-eighth). However, measurement scales decrease with parameter value increments, with higher similarity accuracy corresponding to lower scale values. Ultimately, the highest accuracy aligns with the maximum parameter value of half. Comparative analyses against previous methods, evaluating via Symmetric Distance (SD) values and rationalized perception, consistently favor the present approach's results. Notably, outcomes at the maximum parameter value exhibit the most accuracy, validating the method's efficacy against earlier approaches.

Subject(s)

Phylogeny , Genome , Algorithms , Sequence Alignment/methods , Genomics/methods
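
The abstract above builds a distance matrix from a one-parameter generalization of the Jensen-Shannon divergence. The Python sketch below shows one common weighted form, JS_a(P, Q) = a KL(P || M) + (1 - a) KL(Q || M) with M = aP + (1 - a)Q, which reduces to the standard Jensen-Shannon divergence at a = 1/2. Both this particular form and the use of plain base frequencies as the probability descriptor are assumptions made for illustration; the abstract specifies neither the paper's exact definition nor its descriptor.

```python
import math


def base_frequencies(sequence, alphabet="ACGT"):
    """Relative base frequencies (a stand-in descriptor, assumed for this sketch)."""
    counts = [sequence.count(base) for base in alphabet]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no symbols from the alphabet")
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def weighted_js_divergence(p, q, alpha=0.5):
    """Weighted Jensen-Shannon divergence (assumed form; standard JS at alpha = 0.5)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    m = [alpha * pi + (1.0 - alpha) * qi for pi, qi in zip(p, q)]
    return alpha * kl_divergence(p, m) + (1.0 - alpha) * kl_divergence(q, m)


def distance_matrix(distributions, alpha=0.5):
    """Pairwise divergence matrix, e.g. as input to neighbor joining."""
    return [
        [weighted_js_divergence(p, q, alpha) for q in distributions]
        for p in distributions
    ]


# Hypothetical toy sequences.
profiles = [base_frequencies(s) for s in ("AAAA", "CCCC", "AACC")]
matrix = distance_matrix(profiles, alpha=0.5)
```

With a = 1/2 the matrix is symmetric and bounded by 1 bit; for other values of a the weighted form is generally not symmetric, which matters if the result is passed to a tree-building method that expects a symmetric distance.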

12.

Gauging Dynamics-driven Allostery Using a New Computational Tool: A CAP Case Study.

Kornev, Alexandr P; Weng, Jui-Hung; Maillard, Rodrigo A; Taylor, Susan S.

J Mol Biol ; 436(2): 168395, 2024 01 15.

Article in English | MEDLINE | ID: mdl-38097109

ABSTRACT

In this study, we utilize Protein Residue Networks (PRNs), constructed using Local Spatial Pattern (LSP) alignment, to explore the dynamic behavior of Catabolite Activator Protein (CAP) upon the sequential binding of cAMP. We employed the Degree Centrality of these PRNs to investigate protein dynamics on a sub-nanosecond time scale, hypothesizing that it would reflect changes in CAP's entropy related to its thermal motions. We show that the binding of the first cAMP led to an increase in stability in the Cyclic-Nucleotide Binding Domain A (CNBD-A) and destabilization in CNBD-B, agreeing with previous reports explaining the negative cooperativity of cAMP binding in terms of an entropy-driven allostery. LSP-based PRNs also allow for the study of Betweenness Centrality, another graph-theoretical characteristic of PRNs, providing insights into global residue connectivity within CAP. Using this approach, we were able to correctly identify amino acids that were shown to be critical in mediating allosteric interactions in CAP. The agreement between our studies and previous experimental reports validates our method, particularly with respect to the reliability of Degree Centrality as a proxy for entropy related to protein thermal dynamics. Because LSP-based PRNs can be easily extended to include dynamics of small organic molecules, polynucleotides, or other allosteric proteins, the methods presented here mark a significant advancement in the field, positioning them as vital tools for a fast, cost-effective, and accurate analysis of entropy-driven allostery and identification of allosteric hotspots.

Subject(s)

Allosteric Regulation , Cyclic AMP Receptor Protein , Sequence Alignment , Cyclic AMP Receptor Protein/chemistry , Entropy , Molecular Dynamics Simulation , Protein Binding , Reproducibility of Results , Sequence Alignment/methods
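
The abstract above uses Degree Centrality of a Protein Residue Network as its main measure. As a minimal illustration, the Python sketch below computes normalized degree centrality on an unweighted, undirected graph whose nodes stand for residues. The edge list and residue labels are hypothetical, and treating the network as unweighted is an assumption: the abstract does not say how LSP alignment assigns edges or weights.

```python
def degree_centrality(edges):
    """Normalized degree centrality for an undirected, unweighted graph.

    Each node's degree is divided by (n - 1), the maximum possible degree
    in a simple graph with n nodes. Self-loops are ignored.
    """
    neighbors = {}
    for u, v in edges:
        if u == v:
            continue
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical residue contacts (labels are illustrative only).
contacts = [("R82", "E72"), ("R82", "S83"), ("R82", "T127")]
centrality = degree_centrality(contacts)
```

In this toy star-shaped network the central node reaches the maximum value of 1.0, while each peripheral node scores 1/3.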

13.

Accurate proteome-wide missense variant effect prediction with AlphaMissense.

Cheng, Jun; Novati, Guido; Pan, Joshua; Bycroft, Clare; Zemgulyte, Akvile; Applebaum, Taylor; Pritzel, Alexander; Wong, Lai Hong; Zielinski, Michal; Sargeant, Tobias; Schneider, Rosalia G; Senior, Andrew W; Jumper, John; Hassabis, Demis; Kohli, Pushmeet; Avsec, Ziga.

Science ; 381(6664): eadg7492, 2023 09 22.

Article in English | MEDLINE | ID: mdl-37733863

ABSTRACT

The vast majority of missense variants observed in the human genome are of unknown clinical significance. We present AlphaMissense, an adaptation of AlphaFold fine-tuned on human and primate variant population frequency databases to predict missense variant pathogenicity. By combining structural context and evolutionary conservation, our model achieves state-of-the-art results across a wide range of genetic and experimental benchmarks, all without explicitly training on such data. The average pathogenicity score of genes is also predictive for their cell essentiality, capable of identifying short essential genes that existing statistical approaches are underpowered to detect. As a resource to the community, we provide a database of predictions for all possible human single amino acid substitutions and classify 89% of missense variants as either likely benign or likely pathogenic.

Subject(s)

Amino Acid Substitution , Disease , Mutation, Missense , Proteome , Sequence Alignment , Humans , Amino Acid Substitution/genetics , Benchmarking , Conserved Sequence , Databases, Genetic , Disease/genetics , Genome, Human , Protein Conformation , Proteome/genetics , Sequence Alignment/methods , Machine Learning

14.

CovET: A covariation-evolutionary trace method that identifies protein structure-function modules.

Konecki, Daniel M; Hamrick, Spencer; Wang, Chen; Agosto, Melina A; Wensel, Theodore G; Lichtarge, Olivier.

J Biol Chem ; 299(7): 104896, 2023 07.

Article in English | MEDLINE | ID: mdl-37290531

ABSTRACT

Measuring the relative effect that any two sequence positions have on each other may improve protein design or help better interpret coding variants. Current approaches use statistics and machine learning but rarely consider phylogenetic divergences which, as shown by Evolutionary Trace studies, provide insight into the functional impact of sequence perturbations. Here, we reframe covariation analyses in the Evolutionary Trace framework to measure the relative tolerance to perturbation of each residue pair during evolution. This approach (CovET) systematically accounts for phylogenetic divergences: at each divergence event, we penalize covariation patterns that belie evolutionary coupling. We find that while CovET approximates the performance of existing methods to predict individual structural contacts, it performs significantly better at finding structural clusters of coupled residues and ligand binding sites. For example, CovET found more functionally critical residues when we examined the RNA recognition motif and WW domains. It correlates better with large-scale epistasis screen data. In the dopamine D2 receptor, top CovET residue pairs recovered accurately the allosteric activation pathway characterized for Class A G protein-coupled receptors. These data suggest that CovET ranks highest the sequence position pairs that play critical functional roles through epistatic and allosteric interactions in evolutionarily relevant structure-function motifs. CovET complements current methods and may shed light on fundamental molecular mechanisms of protein structure and function.

Subject(s)

Evolution, Molecular , Sequence Alignment , Binding Sites/genetics , Phylogeny , Receptors, G-Protein-Coupled/genetics , Sequence Alignment/methods

15.

FAS: assessing the similarity between proteins using multi-layered feature architectures.

Dosch, Julian; Bergmann, Holger; Tran, Vinh; Ebersberger, Ingo.

Bioinformatics ; 39(5)2023 05 04.

Article in English | MEDLINE | ID: mdl-37084276

ABSTRACT

MOTIVATION: Protein sequence comparison is a fundamental element in the bioinformatics toolkit. When sequences are annotated with features such as functional domains, transmembrane domains, low complexity regions or secondary structure elements, the resulting feature architectures allow better informed comparisons. However, many existing schemes for scoring architecture similarities cannot cope with features arising from multiple annotation sources. Those that do fall short in the resolution of overlapping and redundant feature annotations. RESULTS: Here, we introduce FAS, a scoring method that integrates features from multiple annotation sources in a directed acyclic architecture graph. Redundancies are resolved as part of the architecture comparison by finding the paths through the graphs that maximize the pair-wise architecture similarity. In a large-scale evaluation on more than 10 000 human-yeast ortholog pairs, architecture similarities assessed with FAS are consistently more plausible than those obtained using e-values to resolve overlaps or leaving overlaps unresolved. Three case studies demonstrate the utility of FAS on architecture comparison tasks: benchmarking of orthology assignment software, identification of functionally diverged orthologs, and diagnosing protein architecture changes stemming from faulty gene predictions. With the help of FAS, feature architecture comparisons can now be routinely integrated into these and many other applications. AVAILABILITY AND IMPLEMENTATION: FAS is available as python package: https://pypi.org/project/greedyFAS/.

Subject(s)

Amino Acid Sequence , Proteins , Sequence Alignment , Software , Humans , Computational Biology/methods , Proteins/chemistry , Sequence Alignment/methods , Saccharomyces cerevisiae Proteins/chemistry

16.

CircularSTAR3D: a stack-based RNA 3D structural alignment tool for circular matching.

Chen, Xiaoli; Zhang, Shaojie.

Nucleic Acids Res ; 51(9): e53, 2023 05 22.

Article in English | MEDLINE | ID: mdl-36987885

ABSTRACT

The functions of non-coding RNAs usually depend on their 3D structures. Therefore, comparing RNA 3D structures is critical in analyzing their functions. We noticed an interesting phenomenon that two non-coding RNAs may share similar substructures when rotating their sequence order. To the best of our knowledge, no existing RNA 3D structural alignment tools can detect this type of matching. In this article, we defined the RNA 3D structure circular matching problem and developed a software tool named CircularSTAR3D to solve this problem. CircularSTAR3D first uses the conserved stacks (consecutive base pairs with similar 3D structures) in the input RNAs to identify the circular matched internal loops and multiloops. Then it performs a local extension iteratively to obtain the whole circular matched substructures. The computational experiments conducted on a non-redundant RNA structure dataset show that circular matching is ubiquitous. Furthermore, we demonstrated the utility of CircularSTAR3D by detecting the conserved substructures missed by regular alignment tools, including structural motifs and conserved structures between riboswitches and ribozymes from different classes. We anticipate CircularSTAR3D to be a valuable supplement to the existing RNA 3D structural analysis techniques.

Subject(s)

Nucleic Acid Conformation , RNA , Sequence Alignment , Sequence Analysis, RNA , Software , Algorithms , Base Pairing , RNA/genetics , RNA/chemistry , Sequence Alignment/methods , Sequence Analysis, RNA/methods

17.

Predicting Protein-Protein Interactions via Random Ferns with Evolutionary Matrix Representation.

Li, Yang; Wang, Zheng; You, Zhu-Hong; Li, Li-Ping; Hu, Xuegang.

Comput Math Methods Med ; 2022: 7191684, 2022.

Article in English | MEDLINE | ID: mdl-35242211

ABSTRACT

Protein-protein interactions (PPIs) play a crucial role in understanding disease pathogenesis, genetic mechanisms, guiding drug design, and other biochemical processes, thus, the identification of PPIs is of great importance. With the rapid development of high-throughput sequencing technology, a large amount of PPIs sequence data has been accumulated. Researchers have designed many experimental methods to detect PPIs by using these sequence data, hence, the prediction of PPIs has become a research hotspot in proteomics. However, since traditional experimental methods are both time-consuming and costly, it is difficult to analyze and predict the massive amount of PPI data quickly and accurately. To address these issues, many computational systems employing machine learning knowledge were widely applied to PPIs prediction, thereby improving the overall recognition rate. In this paper, a novel and efficient computational technology is presented to implement a protein interaction prediction system using only protein sequence information. First, the Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST) was employed to generate a position-specific scoring matrix (PSSM) containing protein evolutionary information from the initial protein sequence. Second, we used a novel data processing feature representation scheme, MatFLDA, to extract the essential information of PSSM for protein sequences and obtained five training and five testing datasets by adopting a five-fold cross-validation method. Finally, the random fern (RFs) classifier was employed to infer the interactions among proteins, and a model called MatFLDA_RFs was developed. The proposed MatFLDA_RFs model achieved good prediction performance with 95.03% average accuracy on Yeast dataset and 85.35% average accuracy on H. pylori dataset, which effectively outperformed other existing computational methods. The experimental results indicate that the proposed method is capable of yielding better prediction results of PPIs, which provides an effective tool for the detection of new PPIs and the in-depth study of proteomics. Finally, we also developed a web server for the proposed model to predict protein-protein interactions, which is freely accessible online at http://120.77.11.78:5001/webserver/MatFLDA_RFs.

Subject(s)

Protein Interaction Mapping/methods , Protein Interaction Maps/genetics , Amino Acid Sequence , Bacterial Proteins/genetics , Computational Biology , Databases, Protein/statistics & numerical data , Discriminant Analysis , Evolution, Molecular , Helicobacter pylori/genetics , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Machine Learning , Position-Specific Scoring Matrices , Protein Interaction Mapping/statistics & numerical data , Saccharomyces cerevisiae/genetics , Saccharomyces cerevisiae Proteins/genetics , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Support Vector Machine

18.

Similarities between bacterial GAD and human GAD65: Implications in gut mediated autoimmune type 1 diabetes.

Bedi, Suhana; Richardson, Tiffany M; Jia, Baofeng; Saab, Hadeel; Brinkman, Fiona S L; Westley, Monica.

PLoS One ; 17(2): e0261103, 2022.

Article in English | MEDLINE | ID: mdl-35196314

ABSTRACT

A variety of islet autoantibodies (AAbs) can predict and possibly dictate eventual type 1 diabetes (T1D) diagnosis. Upwards of 75% of those with T1D are positive for AAbs against glutamic acid decarboxylase (GAD65 or GAD), a producer of gamma-aminobutyric acid (GABA) in human pancreatic beta cells. Interestingly, bacterial populations within the human gut also express GAD and produce GABA. Evidence suggests that dysbiosis of the microbiome may correlate with T1D pathogenesis and physiology. Therefore, autoimmune linkages between the gut microbiome and islets susceptible to autoimmune attack need to be further elucidated. Utilizing in silico analyses, we show that 25 GAD sequences from human gut bacterial sources show sequence and motif similarities to human beta cell GAD65. Our motif analyses determined that most gut GAD sequences contain the pyridoxal-dependent decarboxylase (PDD) domain of human GAD65, which is important for its enzymatic activity. Additionally, we showed overlap with known human GAD65 T cell receptor epitopes, which may implicate the immune destruction of beta cells. Thus, we propose a physiological hypothesis in which changes in the gut microbiome in those with T1D result in a release of bacterial GAD, causing miseducation of the host immune system. Due to the notable similarities we found between human and bacterial GAD, these deputized immune cells may then target human beta cells, leading to the development of T1D.

Subject(s)

Autoantibodies/immunology , Bacteria/enzymology , Diabetes Mellitus, Type 1/immunology , Diabetes Mellitus, Type 1/microbiology , Gastrointestinal Microbiome/immunology , Glutamate Decarboxylase/genetics , Glutamate Decarboxylase/immunology , Animals , Antigen-Presenting Cells/immunology , Computer Simulation , Diabetes Mellitus, Type 1/enzymology , Epitopes, T-Lymphocyte/immunology , Genes, Bacterial , Humans , Islets of Langerhans/enzymology , Islets of Langerhans/immunology , Mice , Pan troglodytes/microbiology , Phylogeny , Protein Domains , Sequence Alignment/methods , gamma-Aminobutyric Acid/metabolism

19.

An Improved Strategy for Task Scheduling in the Parallel Computational Alignment of Multiple Sequences.

Ishaq, Muhammad; Khan, Asfandyar; Su'ud, Mazliham Mohd; Alam, Muhammad Mansoor; Bangash, Javed Iqbal; Khan, Abdullah.

Comput Math Methods Med ; 2022: 8691646, 2022.

Article in English | MEDLINE | ID: mdl-35126641

ABSTRACT

Task scheduling in parallel multiple sequence alignment (MSA) through improved dynamic programming optimization speeds up alignment processing. The growing importance of matching multiple sequences also calls for the use of parallel processor systems. This work proposes a dynamic programming algorithm for improved task scheduling in parallel MSA. Specifically, the alignment of several proteins with tertiary structure is computationally more complex than simple word-based MSA. Parallel task processing is computationally more efficient for protein structure-based superposition. The basic condition for applying dynamic programming is also fulfilled, because the task scheduling problem has multiple possible solutions or options. A greedy strategy reduces the search space to speed up the algorithm. Performance in terms of better results is ensured through computationally expensive recursive and iterative greedy approaches. Optimal scheduling schemes show better performance on heterogeneous resources using CPU or GPU.

Subject(s)

Algorithms , Computational Biology/methods , Sequence Alignment/methods , Computational Biology/statistics & numerical data , Humans , Sequence Alignment/statistics & numerical data , Software

20.

Slipknot or Crystallographic Error: A Computational Analysis of the Plasmodium falciparum DHFR Structural Folds.

Tata, Rolland B; Alsulami, Ali F; Sheik Amamuddy, Olivier; Blundell, Tom L; Tastan Bishop, Özlem.

Int J Mol Sci ; 23(3)2022 Jan 28.

Article in English | MEDLINE | ID: mdl-35163439

ABSTRACT

The presence of protein structures with atypical folds in the Protein Data Bank (PDB) is rare and may result from naturally occurring knots or crystallographic errors. Proper characterisation of such folds is imperative to understanding the basis of naturally existing knots and correcting crystallographic errors. If left uncorrected, such errors can frustrate downstream experiments that depend on the structures containing them. An atypical fold has been identified in P. falciparum dihydrofolate reductase (PfDHFR) between residues 20-51 (loop 1) and residues 191-205 (loop 2). This enzyme is key to drug discovery efforts in the parasite, necessitating a thorough characterisation of these folds. Using multiple sequence alignments (MSA), a unique insert was identified in loop 1 that exacerbates the appearance of the atypical fold, giving it a slipknot-like topology. However, PfDHFR has not been deposited in the knotted proteins database, and processing its structure failed to identify any knots within its folds. The application of protein homology modelling and molecular dynamics simulations on the DHFR domain of P. falciparum and those of two other organisms (E. coli and M. tuberculosis) that were used as molecular replacement templates in solving the PfDHFR structure revealed plausible unentangled or open conformations of these loops. These results will serve as guides for crystallographic experiments to provide further insights into the atypical folds identified.

Subject(s)

Plasmodium falciparum/enzymology , Sequence Alignment/methods , Tetrahydrofolate Dehydrogenase/chemistry , Tetrahydrofolate Dehydrogenase/genetics , Crystallography, X-Ray , Databases, Protein , Models, Molecular , Molecular Dynamics Simulation , Plasmodium falciparum/genetics , Protein Conformation , Protein Domains , Protein Folding , Protozoan Proteins/chemistry , Protozoan Proteins/genetics , Sequence Analysis, Protein , Sequence Homology, Amino Acid
