ABSTRACT
This paper examines the properties of proteins and their relationships in the interactomes of selected subsets of the SARS-CoV-2 proteome: the membrane protein, the nonstructural proteins, and, finally, the full proteome. Protein disorder according to several measures, liquid-liquid phase separation probabilities, and protein node degrees in the interaction networks were singled out as the features of interest. Additionally, the viral interactomes were combined with the interactome of human lung tissue to examine whether the new connections in the resulting viral-host interactome are linked to protein disorder. Correlation analysis shows no clear relationship between the raw features of interest, whereas there is a positive correlation between a protein's disorder and the mean disorder of its neighborhood. There are also indications that highly connected viral hubs tend to be, on average, more ordered than proteins with a small number of connections. This contrasts with previous similar studies conducted on eukaryotic interactomes and possibly raises new questions for research on viral interactomes.
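As an illustration of the last correlation mentioned above, the following minimal sketch computes, for each node of a small interaction network, the mean disorder of its neighbors and correlates it with the node's own disorder. The toy network, the disorder values, and the function names are hypothetical and are not taken from the paper; Pearson correlation is assumed here, since the abstract does not say which coefficient was used.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def neighborhood_mean_disorder(graph, disorder):
    """Mean disorder of each node's neighbors (isolated nodes are skipped)."""
    return {
        node: sum(disorder[nb] for nb in neighbors) / len(neighbors)
        for node, neighbors in graph.items()
        if neighbors
    }

# Hypothetical undirected network (adjacency lists) and disorder fractions.
graph = {"A": ["B", "C"], "B": ["A"], "C": ["A", "D"], "D": ["C"]}
disorder = {"A": 0.1, "B": 0.2, "C": 0.6, "D": 0.8}

neighbor_mean = neighborhood_mean_disorder(graph, disorder)
nodes = sorted(neighbor_mean)
r = pearson([disorder[n] for n in nodes], [neighbor_mean[n] for n in nodes])
degree = {node: len(neighbors) for node, neighbors in graph.items()}
print(f"correlation(own disorder, neighborhood mean) = {r:.3f}")
```

The same `degree` dictionary could be correlated with `disorder` to probe the hub-versus-order trend, again only as a sketch of the kind of analysis described.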
ABSTRACT
BACKGROUND: In the last decade and a half it has been firmly established that a large number of proteins do not adopt a well-defined (ordered) structure under physiological conditions. Such intrinsically disordered proteins (IDPs) and intrinsically disordered (protein) regions (IDRs) are involved in essential cell processes through two basic mechanisms: the entropic chain mechanism, which is responsible for rapid fluctuations among many alternative conformations, and molecular recognition via short recognition elements that bind to other molecules. IDPs possess a high adaptive potential, and there is special interest in investigating their involvement in organism evolution. RESULTS: We analyzed 2554 Bacterial and 139 Archaeal proteomes, with a total of 8,455,194 proteins, for disorder content and its implications for the adaptation of organisms, using three disorder predictors and three measures. Along with other findings, we revealed that for all three predictors and all three measures (1) Bacteria exhibit significantly more disorder than Archaea; (2) plasmid-encoded proteins contain considerably more IDRs than proteins encoded on chromosomes (or whole genomes) in both prokaryote superkingdoms; (3) plasmid proteins are significantly more disordered than chromosomal proteins only in the group of proteins with no COG category assigned; (4) antitoxin proteins are the most disordered in comparison to other proteins (almost double) in both Bacterial and Archaeal proteomes; (5) plasmid proteins are more disordered than chromosomal proteins in Bacterial antitoxins and toxin-unclassified proteins, but have almost the same disorder content in toxin proteins. CONCLUSION: Our results suggest that while disorder content depends on genome and proteome characteristics, it is influenced more by functional engagements than by gene location (on chromosome or plasmid).
Subjects
Archaea/genetics, Archaeal Proteins/chemistry, Bacteria/genetics, Bacterial Proteins/chemistry, Intrinsically Disordered Proteins/chemistry, Plasmids/metabolism, Archaeal Chromosomes/metabolism, Bacterial Chromosomes/metabolism, Proteome/metabolism, Biological Toxins/chemistry
ABSTRACT
BACKGROUND: A significant number of proteins have been shown to be intrinsically disordered, meaning that they lack a fixed 3D structure or contain regions that do not possess a well-defined 3D structure. It has also been shown that a protein's disorder content is related to its function. We have performed an exhaustive analysis and comparison of the disorder content of proteins from prokaryotic organisms (i.e., superkingdoms Archaea and Bacteria) with respect to the functional categories they belong to, i.e., Clusters of Orthologous Groups of proteins (COGs) and groups of COGs: Cellular processes (Cp), Information storage and processing (Isp), Metabolism (Me) and Poorly characterized (Pc). We also analyzed the disorder content of proteins with respect to various genomic, metabolic and ecological characteristics of the organism they belong to. We used correlations and association rule mining in order to identify the most confident associations between specific modalities of the characteristics considered and disorder content. RESULTS: Bacteria are shown to have a somewhat higher level of protein disorder than archaea, except for proteins in the Me functional group. It is demonstrated that the Isp and Cp functional groups in particular (specifically the COGs L, repair function, and N, cell motility and secretion) possess the highest disorder content, while Me proteins, in general, possess the lowest. Disorder fractions have been confirmed to have the lowest level for the so-called order-promoting amino acids and the highest level for the so-called disorder promoters. For each pair of organism characteristics, specific modalities are identified with the maximum disorder proteins in the corresponding organisms, e.g., high genome size-high GC content organisms, facultative anaerobic-low GC content organisms, aerobic-high genome size organisms, etc. 
Maximum disorder in archaea is observed for high GC content-low genome size organisms, high GC content-facultative anaerobic or aquatic or mesophilic organisms, etc. Maximum disorder in bacteria is observed for high GC content-high genome size organisms, high genome size-aerobic organisms, etc. Some of the most reliable association rules mined establish relationships between high GC content and high protein disorder, medium GC content and both medium and low protein disorder, anaerobic organisms and medium protein disorder, Gammaproteobacteria and low protein disorder, etc. A web site, the Prokaryote Disorder Database, has been designed and implemented at http://bioinfo.matf.bg.ac.rs/disorder; it contains the complete results of the protein disorder analysis performed for 296 completely sequenced prokaryotic genomes. CONCLUSIONS: Exhaustive disorder analysis has been performed by functional classes of proteins, for a larger dataset of prokaryotic organisms than previously done. The results obtained agree well with those previously published, with some extension in the range of disorder level and a clear distinction between functional classes of proteins. A wide correlation and association analysis between protein disorder and genomic and ecological characteristics has been performed for the first time. The results obtained give insight into the multiple relationships among the characteristics and protein disorder. Such analysis provides a better understanding of the evolutionary process and may be useful for taxon determination. The main drawback of the approach is that the disorder considered has been predicted and not experimentally established.
Subjects
Archaeal Proteins/analysis, Bacterial Proteins/analysis, Computational Biology/methods, Amino Acids/analysis, Archaea/genetics, Archaea/metabolism, Archaeal Proteins/chemistry, Bacteria/genetics, Bacteria/metabolism, Bacterial Proteins/chemistry, Base Composition, Cluster Analysis, Protein Databases, Genomics/methods, Internet, Protein Conformation, Proteome/analysis
ABSTRACT
Using data from the Protein Data Bank, the correlations between the primary and secondary structures of proteins were analyzed. Correlation values were calculated between the amino acids and the eight secondary structure types, where the position of the amino acid and the position in the sequence with the particular secondary structure differ by at most 25. The diagrams describing these results indicate that correlations are significant at distances between -9 and 10. The results show that the substituents on the Cbeta or Cgamma atoms of an amino acid play a major role in its preference for a particular secondary structure at the same position in the sequence, while the polarity of the amino acid has a significant influence on alpha-helices and strands at some distance in the sequence. The diagrams corresponding to polar amino acids are noticeably asymmetric. The diagrams point out the exchangeability of residues in proteins; amino acids with similar diagrams have similar local folding requirements.
Subjects
Amino Acids/chemistry, Protein Databases, Chemical Models, Protein Secondary Structure, Proteins/chemistry, Algorithms, Computer Simulation, Statistical Data Interpretation
ABSTRACT
There are two approaches to identifying genomic and pathogenicity islands (GI/PAIs) in bacterial genomes: the compositional and the functional, based on DNA or protein level composition and on gene function, respectively. We applied n-gram analysis in addition to other compositional features, combined them by union and intersection, and defined two measures for evaluating the results: recall and precision. Using the best criteria (obtained by training on the Escherichia coli O157:H7 EDL933 genome), we predicted GIs for 14 Enterobacteriaceae family members and for 21 randomly selected bacterial genomes. These predictions were compared with results obtained from HGT DB (based on the compositional approach) and PAI DB (based on the combined approach). The results show that intersecting n-grams with other compositional features improves relative precision by up to 10% in the case of HGT DB and up to 60% in the case of PAI DB. In addition, it was demonstrated that the union of all compositional features results in maximum recall (up to 37%). Thus, the application of n-gram analysis alongside existing or newly developed methods may improve the prediction of GI/PAIs.
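The trade-off described above, where intersection favors precision and union favors recall, can be sketched with plain set operations. This is a minimal sketch using the standard definitions of precision and recall; the paper's exact definitions (including "relative precision") are not given in the abstract, and the gene indices below are hypothetical.

```python
def precision_recall(predicted, reference):
    """Standard precision and recall of a predicted set against a reference set."""
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical gene indices flagged as island members by each feature.
ngram_hits = {1, 2, 3, 4, 5, 6}
other_compositional_hits = {4, 5, 6, 7, 8}
reference_islands = {5, 6, 7, 9}  # stand-in for a curated database

union_prediction = ngram_hits | other_compositional_hits
intersection_prediction = ngram_hits & other_compositional_hits

p_union, r_union = precision_recall(union_prediction, reference_islands)
p_inter, r_inter = precision_recall(intersection_prediction, reference_islands)
print(f"union:        precision={p_union:.2f} recall={r_union:.2f}")
print(f"intersection: precision={p_inter:.2f} recall={r_inter:.2f}")
```

In this toy case the union recovers more of the reference set, while the intersection makes fewer false calls, which mirrors the qualitative pattern reported in the abstract.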
Subjects
Bacterial Genome, Escherichia coli O157/genetics
ABSTRACT
DNA repeats are of great importance for biological research, and a large number of tools for determining repeats have been developed. Herein we define a method for extracting a statistically significant subset of a determined set of repeats. Our aim was to identify the subset of repeats in the input sequences that would not be expected to occur as many times as observed in a random sequence of the same length. Results obtained in this manner are expected to reduce the quantity of processed material and could thereby represent a more important biological signal. With DNA, RNA, and protein sequences serving as input material, we also examined the possibility of statistical filtering of repeats in sequences over an arbitrary alphabet. The proposed method was tested on a large number of randomly generated sequences. The application of the method to biological sequences revealed that for some viruses, shorter repeats are more statistically significant than longer ones because of their frequent appearance, whereas for bacteria, the majority of identified repeats are statistically significant.
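To make the idea of filtering repeats against a random sequence of the same length concrete, here is a minimal sketch. It assumes a uniform, independent-letter random model and a Poisson approximation for the number of occurrences; both are assumptions made for illustration and are not claimed to be the statistical model used by the authors. All function names and the threshold `alpha` are hypothetical.

```python
from math import exp

def expected_occurrences(seq_len, word_len, alphabet_size):
    """Expected count of one fixed word in a uniform random sequence."""
    positions = max(seq_len - word_len + 1, 0)
    return positions / alphabet_size ** word_len

def poisson_tail(lam, observed):
    """P(X >= observed) for X ~ Poisson(lam)."""
    term = exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(observed):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)

def is_significant(seq_len, word_len, observed, alphabet_size=4, alpha=0.01):
    """True if seeing `observed` copies is unlikely under the random model."""
    lam = expected_occurrences(seq_len, word_len, alphabet_size)
    return poisson_tail(lam, observed) < alpha

# A 12-letter DNA word seen 3 times in 10,000 bases is surprising;
# a 3-letter word seen 150 times is roughly what chance predicts.
print(is_significant(10_000, 12, 3))
print(is_significant(10_000, 3, 150))
```

The `alphabet_size` parameter reflects the abstract's point that the filtering applies to sequences over an arbitrary alphabet (for example 4 for DNA or RNA, 20 for proteins). The Poisson approximation is weakest for short or self-overlapping words, so a production method would need a more careful model.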
Subjects
Algorithms, Computational Biology/methods, Amino Acid Repetitive Sequences, Nucleic Acid Repetitive Sequences, DNA/chemistry, Humans, Proteins/chemistry, RNA/chemistry
ABSTRACT
A dataset of 103 SARS-CoV isolates (101 from human patients and 2 from palm civets) was investigated with respect to different aspects of genome polymorphism and isolate classification. The number and the distribution of single nucleotide variations (SNVs) and of insertions and deletions, with respect to a "profile", were determined and discussed (the "profile" being a sequence containing the most represented letter per position). The distribution of substitution categories per codon position, as well as synonymous and non-synonymous substitutions in coding regions of annotated isolates, was determined, along with amino acid (a.a.) property changes. A similar analysis was performed for the spike (S) protein in all the isolates (55 of them being predicted for the first time). The Ka/Ks ratio confirmed that the S gene was subjected to Darwinian selection during virus transmission from animals to humans. Isolates from the dataset were classified according to genome polymorphism and genotypes. Genome polymorphism yields two groups, one with a small number of SNVs and another with a large number of SNVs, with up to four subgroups with respect to insertions and deletions. We identified three basic nine-locus genotypes: TTTT/TTCGG, CGCC/TTCAT, and TGCC/TTCGT, with four subgenotypes. Both proposed classifications are in accordance with the new insights into possible epidemiological spread, both in space and time.
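The "profile" defined above (the most represented letter per position) and the SNVs measured against it can be sketched as follows. The sketch assumes the isolates are already aligned to equal length with `-` marking gaps; the toy sequences and function names are hypothetical, and ties between equally frequent letters are broken arbitrarily by first occurrence.

```python
from collections import Counter

def build_profile(aligned_sequences):
    """Most represented letter at each column of an alignment."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*aligned_sequences)
    )

def snv_positions(sequence, profile):
    """0-based positions where a sequence differs from the profile by a substitution."""
    return [
        i
        for i, (base, ref) in enumerate(zip(sequence, profile))
        if base != ref and base != "-" and ref != "-"
    ]

# Hypothetical aligned fragments standing in for isolate genomes.
isolates = {"iso1": "ACGT", "iso2": "ACGA", "iso3": "TCGT"}

profile = build_profile(list(isolates.values()))
snvs = {name: snv_positions(seq, profile) for name, seq in isolates.items()}
print(profile)
print(snvs)
```

Gap positions are excluded from the SNV list here so that insertions and deletions can be counted separately, in line with the abstract's distinction between SNVs and indels.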
Subjects
Computational Biology, Genetic Variation, Genome, Genetic Polymorphism/genetics, Severe Acute Respiratory Syndrome/genetics, Severe Acute Respiratory Syndrome-Related Coronavirus/genetics, Viverridae/genetics, Amino Acid Sequence, Animals, Humans, Molecular Sequence Data, Mutation, Phylogeny, Sequence Deletion, Amino Acid Sequence Homology, Taiwan
ABSTRACT
BACKGROUND: We have compared 38 isolates of the SARS-CoV complete genome. The goal was twofold: first, to analyze and compare nucleotide sequences and to identify positions of single nucleotide polymorphisms (SNPs), insertions and deletions, and second, to group the isolates according to sequence similarity, ultimately pointing to the phylogeny of SARS-CoV isolates. The comparison is based on genome polymorphism such as insertions or deletions and the number and positions of SNPs. RESULTS: The nucleotide structure of all 38 isolates is presented. Based on insertions and deletions and on dissimilarity due to SNPs, the dataset of all the isolates has been qualitatively classified into three groups, each having its own subgroups. These are the A-group with "regular" isolates (no insertions / deletions except for 5' and 3' ends), the B-group of isolates with "long insertions", and the C-group of isolates with "many individual" insertions and deletions. The isolate with the smallest average number of SNPs, compared to other isolates, has been identified (TWH). The density distribution of SNPs, insertions and deletions for each group or subgroup, as well as cumulatively for all the isolates, is also presented, along with the gene map for TWH. Since individual SNPs may have occurred at random, positions corresponding to multiple SNPs (occurring in two or more isolates) are identified and presented. This result revises some previous results of a similar type. Amino acid changes caused by multiple SNPs are also identified (for the annotated sequences, as well as presupposed amino acid changes for non-annotated ones). Exact SNP positions for the isolates in each group or subgroup are presented. Finally, a phylogenetic tree for the SARS-CoV isolates has been produced using the CLUSTALW program, showing high compatibility with the preceding qualitative classification. 
CONCLUSIONS: The comparative study of SARS-CoV isolates provides essential information on genome polymorphism and indications of strain differences and variant evolution. It may help with the development of effective treatment.
Subjects
Computational Biology/methods, Viral Genome, Genetic Polymorphism/genetics, Severe Acute Respiratory Syndrome-Related Coronavirus/genetics, Amino Acid Sequence/genetics, Viral DNA/genetics, Insertional Mutagenesis/genetics, Phylogeny, Single Nucleotide Polymorphism/genetics, Sequence Deletion/genetics, Viral Proteins/chemistry
ABSTRACT
To associate phenotypic characteristics of an organism with molecules encoded by its genome, there is a need for well-structured genotype and phenotype data. We use a novel method for extracting data on phenotype and genotype characteristics of microorganisms from text. As a source, we use an encyclopedia of microorganisms, which holds phenotypic and genotypic data, and create a structured, flexible data resource that can be exported to a range of database formats and contains genotype and phenotype data for 2412 species and 873 genera of microbes. This data source has great potential as a resource for future biological research on genotype-phenotype associations. In this paper, we focus on describing the structure and content of the resulting database and on evaluating the method used for extracting the data. We conclude that the resulting database can be used as a reliable complementary resource for research into genotype-phenotype association.
Subjects
Data Mining/methods, Genetic Databases, Encyclopedias as Topic, Genetic Association Studies, Bacteria/genetics
ABSTRACT
The paper presents a novel, n-gram-based method for the analysis of bacterial genome segments known as genomic islands (GIs). Identification of GIs in bacterial genomes is an important task since many of them represent inserts that may contribute to bacterial evolution and pathogenesis. In order to characterize GIs and distinguish them from the rest of the genome, a binary classification of islands based on n-gram frequency distribution has been performed. It consists of testing the agreement of the islands' n-gram frequency distributions with those of the complete genome and the backbone sequence. In addition, a statistic based on the maximal order Markov model is used to identify significantly overrepresented and underrepresented n-grams in islands. The results may be used as a basis for Zipf-like analysis suggesting that some of the n-grams are overrepresented in a subset of islands and underrepresented in the backbone, or vice versa, thus complementing the binary classification. The method is applied to strain-specific regions in the Escherichia coli O157:H7 EDL933 genome (O-islands), resulting in two groups of O-islands with different n-gram characteristics. It refines a characterization based on other compositional features such as G+C content and codon usage, and may help in the identification of GIs and in the research and development of adequate drugs targeting the virulence genes in them.
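The maximal order Markov model mentioned above estimates the expected count of an n-gram from the counts of its two (n-1)-long flanks and its (n-2)-long core, E(w) = N(w[:-1]) * N(w[1:]) / N(w[1:-1]). The sketch below computes this estimate and the observed-to-expected ratio; it is an illustration of that textbook estimator only, since the abstract does not spell out the exact statistic or significance test the authors used. The toy sequence and function names are hypothetical.

```python
from collections import Counter

def ngram_counts(sequence, n):
    """Counts of all overlapping n-grams in a sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))

def markov_expected(sequence, word):
    """Expected count of `word` under the maximal order Markov model."""
    n = len(word)
    if n < 3:
        raise ValueError("the estimator needs n-grams of length 3 or more")
    flank_counts = ngram_counts(sequence, n - 1)
    core_count = ngram_counts(sequence, n - 2)[word[1:-1]]
    if core_count == 0:
        return 0.0
    return flank_counts[word[:-1]] * flank_counts[word[1:]] / core_count

def observed_to_expected(sequence, word):
    """Ratio above 1 suggests overrepresentation, below 1 underrepresentation."""
    expected = markov_expected(sequence, word)
    observed = ngram_counts(sequence, len(word))[word]
    return observed / expected if expected else float("inf")

island = "ACGACGACGT"  # hypothetical island fragment
print(markov_expected(island, "ACG"))
print(observed_to_expected(island, "ACG"))
```

Computing the same ratio on an island and on the backbone sequence would give the kind of contrast the abstract describes, where an n-gram is overrepresented in one and underrepresented in the other.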
Subjects
Computational Biology/methods, Bacterial Genome, Genomic Islands, Statistical Models, Base Composition/genetics, Base Sequence/genetics, Codon/analysis, Escherichia coli O157/genetics, Horizontal Gene Transfer, Bacterial Genome/genetics, Genomics/methods, Markov Chains, Molecular Sequence Data
ABSTRACT
The correlation between the primary and secondary structures of proteins was analysed using a large data set from the Protein Data Bank. Clear preferences of amino acids towards certain secondary structures classify amino acids into four groups: alpha-helix preferrers, strand preferrers, turn and bend preferrers, and His and Cys (the latter two amino acids show no clear preference for any secondary structure). Amino acids in the same group have similar structural characteristics at their Cbeta and Cgamma atoms that predict their preference for a particular secondary structure. All alpha-helix preferrers have neither polar heteroatoms on the Cbeta and Cgamma atoms nor a branching or aromatic group on the Cbeta atom. All strand preferrers have aromatic groups or branching groups on the Cbeta atom. All turn and bend preferrers have a polar heteroatom on the Cbeta or Cgamma atom or do not have a Cbeta atom at all. These new rules could be helpful in making predictions about non-natural amino acids.