Results 1 - 20 of 32
1.
Int J Mol Sci ; 25(8), 2024 Apr 18.
Article in English | MEDLINE | ID: mdl-38674025

ABSTRACT

In this study, we applied the iterative procedure (IP) method to search for families of highly diverged dispersed repeats in the genome of Cyanidioschyzon merolae, which contains over 16 million bases. The algorithm included the construction of position weight matrices (PWMs) for repeat families and the identification of more dispersed repeats based on the PWMs using dynamic programming. The results showed that the C. merolae genome contained 20 repeat families comprising a total of 33,938 dispersed repeats, which is significantly more than has been previously found using other methods. The repeats varied in length from 108 to 600 bp (522.54 bp on average) and occupied more than 72% of the C. merolae genome, whereas previously identified repeats, including tandem repeats, have been shown to constitute only about 28%. The high genomic content of dispersed repeats and their location in the coding regions suggest a significant role in the regulation of the functional activity of the genome.


Subjects
Repetitive Sequences, Nucleic Acid; Rhodophyta; Rhodophyta/genetics; Repetitive Sequences, Nucleic Acid/genetics; Genome; Algorithms; Genomics/methods
2.
Plants (Basel) ; 12(20), 2023 Oct 14.
Article in English | MEDLINE | ID: mdl-37896036

ABSTRACT

The exact identification of promoter sequences remains a serious problem in computational biology, as the promoter prediction algorithms under development continue to produce false-positive results. Therefore, to fully assess the validity of predicted sequences, it is necessary to perform a comprehensive test of their properties, such as the presence of transcribed DNA regions downstream of them, or chromatin accessibility for transcription factor binding. In this paper, we examined the promoter sequences of chromosome 1 of the rice Oryza sativa genome from the Database of Potential Promoter Sequences predicted using a mathematical algorithm based on the derivation and calculation of statistically significant promoter classes. We identified TATA motifs and cis-regulatory elements in the predicted promoter sequences. We also verified the presence of potential transcription start sites near the predicted promoters by analyzing CAGE-seq data. We searched for unannotated transcripts downstream of the predicted sequences by de novo assembling transcripts from RNA-seq data. We also examined chromatin accessibility in the region of the predicted promoters by analyzing ATAC-seq data. As a result of this work, we identified the predicted sequences that are most likely to be promoters for further experimental validation in an in vivo or in vitro system.

3.
Int J Mol Sci ; 24(16), 2023 Aug 08.
Article in English | MEDLINE | ID: mdl-37628742

ABSTRACT

We have developed a new method for promoter sequence classification based on a genetic algorithm and the MAHDS sequence alignment method. We have created four classes of human promoters, combining 17,310 sequences out of the 29,598 present in the EPD database. We searched the human genome for potential promoter sequences (PPSs) using dynamic programming and position weight matrices representing each of the promoter sequence classes. A total of 3,065,317 potential promoter sequences were found. Only 1,241,206 of them were located in unannotated parts of the human genome. Every other PPS found intersected with either true promoters, transposable elements, or interspersed repeats. We found a strong intersection between PPSs and Alu elements as well as transcript start sites. The number of false positive PPSs is estimated to be 3 × 10⁻⁸ per nucleotide, which is several orders of magnitude lower than for any other promoter prediction method. The developed method can be used to search for PPSs in various eukaryotic genomes.


Subjects
Genome, Human; Humans; Alu Elements/genetics; Databases, Factual; DNA Transposable Elements/genetics
4.
Int J Mol Sci ; 24(13), 2023 Jun 30.
Article in English | MEDLINE | ID: mdl-37446142

ABSTRACT

We have developed a de novo method for the identification of dispersed repeats based on the use of random position-weight matrices (PWMs) and an iterative procedure (IP). The created algorithm (IP method) allows detection of dispersed repeats for which the average number of substitutions between any two repeats per nucleotide (x) is less than or equal to 1.5. We have shown that all previously developed methods and algorithms (RED, RECON, and some others) can only find dispersed repeats for x ≤ 1.0. We applied the IP method to find dispersed repeats in the genomes of E. coli and nine other bacterial species. We identify three families of approximately 1.09 × 10⁶, 0.64 × 10⁶, and 0.58 × 10⁶ DNA bases, respectively, constituting almost 50% of the complete E. coli genome. The length of the repeats is in the range of 400 to 600 bp. Other analyzed bacterial genomes contain one to three families of dispersed repeats with a total number of 10³ to 6 × 10³ copies. The existence of such highly divergent repeats could be associated with the presence of a single-type triplet periodicity in various genes or with the packing of bacterial DNA into a nucleoid.


Subjects
Bacteria; Escherichia coli; Escherichia coli/genetics; Bacteria/genetics; DNA; DNA, Bacterial/genetics; Genome, Bacterial
5.
DNA Res ; 2023 Apr 25.
Article in English | MEDLINE | ID: mdl-37186267

ABSTRACT

In this study, we modified the multiple alignment method based on the generation of random position weight matrices (RPWM) and used it to search for tandem repeats (TRs) in the Capsicum annuum genome. The application of the modified (m)RPWM method, which considers the correlation of adjacent nucleotides, resulted in the identification of 908,072 TR regions with repeat lengths from 2 to 200 bp in the C. annuum genome, where they occupied ~29%. The most common TRs were 2 and 3 bp long, followed by those of 21, 4, and 15 bp. We performed clustering analysis of TRs with repeat lengths of 2 and 21 bp and created position-weight matrices (PWMs) for each group; these templates could be used to search for TRs of a given length in any nucleotide sequence. All detected TRs can be accessed through a publicly available database (http://victoria.biengi.ac.ru/capsicum_tr/). Comparison of mRPWM with other TR search methods such as Tandem Repeats Finder, T-REKS, and XSTREAM indicated that mRPWM could detect significantly more TRs at similar false discovery rates, indicating its superior performance. The developed mRPWM method can be successfully applied to the identification of highly divergent TRs, which is important for functional analysis of genomes and evolutionary studies.

6.
Biology (Basel) ; 11(8), 2022 Jul 26.
Article in English | MEDLINE | ID: mdl-35892972

ABSTRACT

In this study, we used a mathematical method for the multiple alignment of highly divergent sequences (MAHDS) to create a database of potential promoter sequences (PPSs) in the Capsicum annuum genome. To search for PPSs, 20 statistically significant classes of sequences located in the range from -499 to +100 nucleotides near the annotated genes were calculated. For each class, a position-weight matrix (PWM) was computed and then used to identify PPSs in the C. annuum genome. In total, 825,136 PPSs were detected, with a false positive rate of 0.13%. The PPSs obtained with the MAHDS method were tested using TSSFinder, which detects transcription start sites. The databank of the found PPSs provides their coordinates in chromosomes, the alignment of each PPS with the PWM, and the level of statistical significance as a normal distribution argument, and can be used in genetic engineering and biotechnology.
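The step of computing a PWM per class and then using it to find candidate sequences can be illustrated with a minimal sketch. It assumes a log-odds matrix against a uniform background, a pseudocount of 1, and an ungapped window scan; the function names `build_pwm` and `scan` are hypothetical, and the MAHDS method itself (which optimizes the alignment and allows indels) is not reproduced here.

```python
import math
from collections import Counter

BASES = "ACGT"

def build_pwm(aligned, pseudocount=1.0):
    """Log-odds PWM from equal-length sequences (uniform background assumed)."""
    total = len(aligned) + pseudocount * len(BASES)
    pwm = []
    for pos in range(len(aligned[0])):
        counts = Counter(seq[pos] for seq in aligned)
        column = {
            b: math.log2(((counts[b] + pseudocount) / total) / 0.25)
            for b in BASES
        }
        pwm.append(column)
    return pwm

def scan(sequence, pwm):
    """Return (best_score, start) over all ungapped windows of PWM width."""
    width = len(pwm)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Characters outside ACGT contribute a neutral score of 0.
        score = sum(col.get(base, 0.0) for col, base in zip(pwm, window))
        best = max(best, (score, start))
    return best

pwm = build_pwm(["TATAAA", "TATAAA", "TATATA"])
score, start = scan("GGCGTATAAAGCC", pwm)
```

In practice a score threshold calibrated on shuffled sequences would be needed to control the false positive rate; that calibration is left out of this sketch.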

7.
Entropy (Basel) ; 24(5), 2022 Apr 30.
Article in English | MEDLINE | ID: mdl-35626518

ABSTRACT

In this paper, we attempted to find a relation between the living conditions of bacteria and the algorithmic complexity of their genomes. We developed a probabilistic mathematical method for evaluating the irregularity of occurrence of k-words (6 bases long) in bacterial gene coding sequences. To this end, coding sequences from different bacterial genomes were analyzed, and as an index of k-word occurrence irregularity we used W, which has a distribution similar to normal. The results show that bacterial genomes can be divided into two uneven groups. The first, smaller group has W in the interval from 170 to 475, while for the second group W ranges from 475 to 875. Plant, metazoan, and virus genomes also have W in the same interval as the first bacterial group. We suggest that the coding sequences of the second bacterial group are much less susceptible to evolutionary changes than those of the first group. The use of the W index as a measure of biological stress is also discussed.
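The abstract does not define W, so the following is only a sketch of one way an irregularity index for k-word occurrence could be computed. It assumes a chi-square statistic of observed k-word counts against a uniform expectation, converted to an approximately normal argument with Fisher's approximation; the function name `kmer_irregularity` is hypothetical and the result should not be read as the paper's W.

```python
import math
from collections import Counter

def kmer_irregularity(sequence, k=6):
    """Chi-square of k-word counts vs. a uniform expectation, plus a normal-like argument.

    Assumption: uniform background; overlapping words are treated as if independent.
    """
    valid = set("ACGT")
    words = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    counts = Counter(w for w in words if set(w) <= valid)
    n_windows = sum(counts.values())
    if n_windows == 0:
        raise ValueError("sequence has no valid k-words")
    n_words = 4 ** k
    expected = n_windows / n_words
    chi2 = sum((c - expected) ** 2 / expected for c in counts.values())
    # Words that never occur each contribute (0 - expected)^2 / expected = expected.
    chi2 += (n_words - len(counts)) * expected
    df = n_words - 1
    # Fisher's approximation: sqrt(2*chi2) - sqrt(2*df - 1) is roughly standard normal.
    z = math.sqrt(2 * chi2) - math.sqrt(2 * df - 1)
    return chi2, z
```

A strongly repetitive sequence yields a much larger value than a random sequence of the same length, which is the qualitative behavior an irregularity index is meant to capture.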

8.
Int J Mol Sci ; 23(7), 2022 Mar 29.
Article in English | MEDLINE | ID: mdl-35409125

ABSTRACT

The aim of this work was to compare the multiple alignment methods MAHDS, T-Coffee, MUSCLE, Clustal Omega, Kalign, MAFFT, and PRANK in their ability to align highly divergent amino acid sequences. To accomplish this, we created test amino acid sequences with an average number of substitutions per amino acid (x) from 0.6 to 5.6, a total of 81 sets. Comparison of the performance of sequence alignments constructed by MAHDS and previously developed algorithms using the CS and Z score criteria and the benchmark alignment database (BAliBASE) indicated that, although the quality of the alignments built with MAHDS was somewhat lower than that of the other algorithms, it was compensated by greater statistical significance. MAHDS could construct statistically significant alignments of artificial sequences with x ≤ 4.8, whereas the other algorithms (T-Coffee, MUSCLE, Clustal Omega, Kalign, MAFFT, and PRANK) could not perform that at x > 2.4. The application of MAHDS to align 21 families of highly diverged proteins (identity < 20%) from Pfam and HOMSTRAD databases showed that it could calculate statistically significant alignments in cases when the other methods failed. Thus, MAHDS could be used to construct statistically significant multiple alignments of highly divergent protein sequences, which accumulated multiple mutations during evolution.


Subjects
Algorithms; Coffee; Amino Acid Sequence; Proteins/chemistry; Proteins/genetics; Sequence Alignment; Software
9.
Int J Mol Sci ; 22(13), 2021 Jul 01.
Article in English | MEDLINE | ID: mdl-34281150

ABSTRACT

We report a Method to Search for Highly Divergent Tandem Repeats (MSHDTR) in protein sequences which considers pairwise correlations between adjacent residues. MSHDTR was compared with some previously developed methods for searching for tandem repeats (TRs) in amino acid sequences, such as T-REKS and XSTREAM, which focus on the identification of TRs with significant sequence similarity, whereas MSHDTR detects repeats that significantly diverged during evolution, accumulating deletions, insertions, and substitutions. The application of MSHDTR to a search of the Swiss-Prot databank revealed over 15 thousand TR-containing amino acid sequences that were difficult to find using the other methods. Among the detected TRs, the most representative were those with consensus lengths of two and seven residues; these TRs were subjected to cluster analysis and the classes of patterns were identified. All TRs detected in this study have been combined into a databank accessible over the WWW.


Subjects
Amino Acid Sequence/genetics; Sequence Analysis, Protein/methods; Tandem Repeat Sequences/genetics; Algorithms; Amino Acids/genetics; Animals; Humans
10.
Genes (Basel) ; 12(4), 2021 Mar 25.
Article in English | MEDLINE | ID: mdl-33806152

ABSTRACT

Currently, there is a lack of bioinformatics approaches to identify highly divergent tandem repeats (TRs) in eukaryotic genomes. Here, we developed a new mathematical method to search for TRs, which uses a novel algorithm for constructing multiple alignments based on the generation of random position weight matrices (RPWMs), and applied it to detect TRs 2 to 50 nucleotides long in the rice genome. The RPWM method could find highly divergent TRs in the presence of insertions or deletions. Comparison of the RPWM algorithm with the other methods of TR identification showed that RPWM could detect TRs in which the average number of base substitutions per nucleotide (x) was between 1.5 and 3.2, whereas the T-REKS and TRF methods could not detect divergent TRs with x > 1.5. Applied to the search for TRs in the rice genome, the RPWM method revealed that TRs occupied 5% of the genome and that most of them were 2 and 3 bases long. Using RPWM, we also revealed the correlation of TRs with dispersed repeats and transposons, suggesting that some transposons originated from TRs. Thus, the novel RPWM algorithm is an effective tool to search for highly divergent TRs in genomes.


Subjects
Chromosome Mapping/methods; Chromosomes, Plant/genetics; Genome, Plant; Oryza/genetics; Tandem Repeat Sequences/genetics; Phylogeny
11.
BMC Bioinformatics ; 22(1): 42, 2021 Feb 02.
Article in English | MEDLINE | ID: mdl-33530928

ABSTRACT

BACKGROUND: Transposable elements (TEs) constitute a significant part of eukaryotic genomes. Short interspersed nuclear elements (SINEs) are non-autonomous TEs, which are widely represented in mammalian genomes and also found in plants. After insertion in a new position in the genome, TEs quickly accumulate mutations, which complicate their identification and annotation by modern bioinformatics methods. In this study, we searched for highly divergent SINE copies in the genome of rice (Oryza sativa subsp. japonica) using the Highly Divergent Repeat Search Method (HDRSM). RESULTS: The HDRSM considers correlations of neighboring symbols to construct position weight matrix (PWM) for a SINE family, which is then used to perform a search for new copies. In order to evaluate the accuracy of the method and compare it with the RepeatMasker program, we generated a set of SINE copies containing nucleotide substitutions and indels and inserted them into an artificial chromosome for analysis. The HDRSM showed better results both in terms of the number of identified inserted repeats and the accuracy of determining their boundaries. A search for the copies of 39 SINE families in the rice genome produced 14,030 hits; among them, 5704 were not detected by RepeatMasker. CONCLUSIONS: The HDRSM could find divergent SINE copies, correctly determine their boundaries, and offer a high level of statistical significance. We also found that RepeatMasker is able to find relatively short copies of the SINE families with a higher level of similarity, while HDRSM is able to find more diverged copies. To obtain a comprehensive profile of SINE distribution in the genome, combined application of the HDRSM and RepeatMasker is recommended.


Subjects
DNA Transposable Elements; Oryza; Short Interspersed Nucleotide Elements; Animals; DNA Transposable Elements/genetics; Evolution, Molecular; Humans; Oryza/genetics; Phylogeny; Position-Specific Scoring Matrices; Short Interspersed Nucleotide Elements/genetics
12.
Genes (Basel) ; 12(2), 2021 Jan 21.
Article in English | MEDLINE | ID: mdl-33494278

ABSTRACT

In this study, we developed a new mathematical method for performing multiple alignment of highly divergent sequences (MAHDS), i.e., sequences that have on average more than 2.5 substitutions per position (x). We generated sets of artificial DNA sequences with x ranging from 0 to 4.4 and applied MAHDS as well as currently used multiple sequence alignment algorithms, including ClustalW, MAFFT, T-Coffee, Kalign, and Muscle to these sets. The results indicated that most of the existing methods could produce statistically significant alignments only for the sets with x < 2.5, whereas MAHDS could operate on sequences with x = 4.4. We also used MAHDS to analyze a set of promoter sequences from the Arabidopsis thaliana genome and discovered many conserved regions upstream of the transcription initiation site (from -499 to +1 bp); a part of the downstream region (from +1 to +70 bp) also significantly contributed to the obtained alignments. The possibilities of applying the newly developed method for the identification of promoter sequences in any genome are discussed. A server for multiple alignment of nucleotide sequences has been created.
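To make the divergence parameter x concrete, here is a minimal sketch of generating an artificial sequence with an average of x substitutions per position. It assumes substitution events land at uniformly random positions, can hit the same position more than once, and always change the base; the function name `mutate` is hypothetical, and the exact simulation protocol used in the paper is not given in the abstract.

```python
import random

def mutate(sequence, x, rng=None, alphabet="ACGT"):
    """Apply round(x * len(sequence)) substitution events at random positions.

    Assumption: positions are drawn uniformly with replacement, so a position
    may be hit several times (this is why x can exceed 1).
    """
    rng = rng or random.Random()
    seq = list(sequence)
    n_events = round(x * len(seq))
    for _ in range(n_events):
        pos = rng.randrange(len(seq))
        # Each event replaces the current base with a different one.
        seq[pos] = rng.choice([b for b in alphabet if b != seq[pos]])
    return "".join(seq)

ancestor = "ACGT" * 50
diverged = mutate(ancestor, 4.4, rng=random.Random(1))
```

At x = 4.4 nearly every position has been substituted several times, so pairwise identity falls to roughly the 25% expected by chance, which is why ordinary similarity-based alignment struggles in this range.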


Subjects
Arabidopsis/genetics; Computational Biology; Genome, Plant; Genomics; Promoter Regions, Genetic; Sequence Analysis, DNA/methods; Algorithms; Computational Biology/methods; Genomics/methods
13.
J Comput Biol ; 26(11): 1253-1261, 2019 Nov.
Article in English | MEDLINE | ID: mdl-31211597

ABSTRACT

Gene fusion is known to be one of the mechanisms of new gene formation. Most bioinformatics methods for studying fused genes are based on sequence similarity search. However, if the ancestral sequences were lost during evolution or changed too much, it is impossible to detect the fusion. Previously, we developed a method of searching for triplet periodicity (TP) change points in protein-coding sequences (CDS) and showed the possible relation of this phenomenon to gene formation as a result of fusion. In this study, we improved the TP change point detection method and studied the genes of six eukaryotic genomes. At a type I error probability of 2%-3%, TP change points were found in 20%-40% of genes. Further analysis showed that about 30% of the TP change points can be explained by amino acid repeats. Another 30% may be fused genes for which an alignment was detected by the BLAST program. We believe that the rest of the results can be fused genes whose ancestral sequences have been lost. The method is more sensitive to TP changes and allowed us to find two to three times more cases of significant TP change points than our previous method.


Subjects
Computational Biology/methods; Genome/genetics; Open Reading Frames/genetics; Repetitive Sequences, Amino Acid/genetics; Animals; Eukaryota/genetics; Humans; Sequence Alignment/methods
14.
Biomed Res Int ; 2017: 7949287, 2017.
Article in English | MEDLINE | ID: mdl-28182099

ABSTRACT

Summary. We analyzed several prokaryotic and eukaryotic genomes for the presence of periodic sequences using a new mathematical method. The method uses random position weight matrices and dynamic programming. Insertions and deletions were allowed inside periodicities, which adds novelty to the results obtained. The period length, one of the key features of a periodicity, varied from 2 to 50 nt. In total, over 60,000 periodic sequences were found in 15 genomes, including some chromosomes of the H. sapiens (partial), C. elegans, D. melanogaster, and A. thaliana genomes.
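How dynamic programming lets a weight matrix match a sequence despite insertions and deletions can be shown with a minimal sketch. It assumes a Smith-Waterman-style local alignment of a sequence against the columns of a PWM with a single linear gap penalty; the function name `align_to_pwm` and the penalty value are hypothetical, and the published method (random matrices, cyclic alignment of periods) is more elaborate than this.

```python
def align_to_pwm(sequence, pwm, gap=-0.5):
    """Best local alignment score of a sequence against PWM columns, with indels.

    pwm is a list of dicts mapping a base to its score in that column.
    Assumption: one linear gap penalty for both insertions and deletions.
    """
    n_cols = len(pwm)
    prev = [0.0] * (n_cols + 1)
    best = 0.0
    for base in sequence:
        cur = [0.0] * (n_cols + 1)
        for j in range(1, n_cols + 1):
            match = prev[j - 1] + pwm[j - 1].get(base, 0.0)
            skip_base = prev[j] + gap      # extra base in the sequence (insertion)
            skip_col = cur[j - 1] + gap    # PWM column left unmatched (deletion)
            cur[j] = max(0.0, match, skip_base, skip_col)
            best = max(best, cur[j])
        prev = cur
    return best

# Toy matrix: +1 for the expected base in each column of "ACGT", -1 otherwise.
toy_pwm = [{b: (1.0 if b == want else -1.0) for b in "ACGT"} for want in "ACGT"]
exact = align_to_pwm("TTACGTTT", toy_pwm)      # pattern present without indels
with_indel = align_to_pwm("TTACGGTTT", toy_pwm)  # one extra G inside the pattern
```

With the toy matrix, the exact occurrence scores 4.0 and the copy with one inserted base scores 3.5, so the match survives the indel at the cost of a single gap penalty.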


Subjects
Genome; INDEL Mutation/genetics; Sequence Analysis, DNA; Animals; Arabidopsis/genetics; Caenorhabditis elegans/genetics; Chromosomes/genetics; Drosophila melanogaster/genetics; Humans; Models, Theoretical; Prokaryotic Cells
15.
Stat Appl Genet Mol Biol ; 15(5): 381-400, 2016 Oct 01.
Article in English | MEDLINE | ID: mdl-27337743

ABSTRACT

The aim of this study was to show that amino acid sequences have a latent periodicity with insertions and deletions of amino acids in unknown positions of the analyzed sequence. A genetic algorithm, dynamic programming, and random weight matrices were used to develop a new mathematical algorithm for latent periodicity search. A multiple alignment of periods was calculated by direct optimization of the position-weight matrix without using pairwise alignments. The developed algorithm was applied to analyze the amino acid sequences of a small number of proteins. This study showed the presence of latent periodicity with insertions and deletions in the amino acid sequences of proteins for which latent periodicity was not previously known. The origin of latent periodicity with insertions and deletions is discussed.


Subjects
Algorithms; Amino Acid Sequence; Computational Biology/methods; Models, Genetic; Models, Statistical; Mutagenesis, Insertional; Sequence Deletion
16.
Stat Appl Genet Mol Biol ; 14(2): 113-23, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25719343

ABSTRACT

Triplet periodicity (TP) is a distinctive feature of the protein coding sequences of both prokaryotic and eukaryotic genomes. In this work, we explored the TP difference inside and between 45 prokaryotic genomes. We constructed two hypotheses of TP distribution on a set of coding sequences and generated artificial datasets that correspond to the hypotheses. We found that TP is more similar inside a genome than between genomes and that TP distribution inside a real genome dataset corresponds to the hypothesis which implies that a common TP pattern exists for the majority of sequences inside a genome. Additionally, we performed gene classification based on TP matrixes. This classification showed that TP allows identification of the genome to which a given gene belongs with more than 85% accuracy.
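A TP matrix can be pictured as a table of base counts at each of the three codon positions. The sketch below shows that basic construction only; the function name `tp_matrix` is hypothetical, and the similarity measure and classification procedure used in the paper are not specified in the abstract and are not reproduced.

```python
def tp_matrix(cds):
    """Count each base at codon positions 1, 2, and 3 of a coding sequence.

    Returns a dict mapping base -> [count at pos 1, count at pos 2, count at pos 3].
    Characters outside ACGT are ignored.
    """
    counts = {base: [0, 0, 0] for base in "ACGT"}
    for index, base in enumerate(cds):
        if base in counts:
            counts[base][index % 3] += 1
    return counts

matrix = tp_matrix("ATGATGATG")
# matrix["A"] is [3, 0, 0]: A occurs only at the first codon position here.
```

Two genes can then be compared by normalizing their matrices to frequencies and measuring the distance between them, which is the general idea behind using TP to assign a gene to a genome.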


Subjects
Genome/genetics; Algorithms; Databases, Genetic; Open Reading Frames/genetics; Periodicity; Prokaryotic Cells/physiology
17.
Adv Bioinformatics ; 2015: 635437, 2015.
Article in English | MEDLINE | ID: mdl-26770195

ABSTRACT

In recent years, a great number of bacterial genomes have been sequenced. One of the most important challenges of computational genomics is now the functional annotation of nucleic acid sequences. In this study, we present a computational method and an annotation system for predicting biological functions using phylogenetic profiles. The phylogenetic profile of a gene was created by searching for similarities between the nucleotide sequence of the gene and 1204 reference genomes, followed by estimation of the statistical significance of the similarities found. The profiles of genes with known functions were used to predict possible functions and functional groups for new genes. We conducted functional annotation for genes from 104 bacterial genomes and compared the functions predicted by our system with the already known functions. For genes that had already been annotated, the known function matched the function we predicted 63% of the time, and 86% of the time the known function was found within the top five predicted functions. In addition, our system increased the share of annotated genes by 19%. The developed system may be used as an alternative or complementary system to the current annotation systems.
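The idea of predicting a function from phylogenetic profiles, and of reporting the top five candidates, can be sketched as follows. This is an illustration under stated assumptions: profiles are reduced to binary presence/absence vectors over reference genomes, similarity is the Jaccard index, and the names `jaccard` and `predict_functions` and the toy data are hypothetical. The paper's profiles are built from statistically evaluated sequence similarities, which this sketch does not model.

```python
def jaccard(a, b):
    """Jaccard similarity of two binary presence/absence profiles."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0

def predict_functions(query, annotated, top_n=5):
    """Rank known functions by the profile similarity of their genes to the query.

    annotated is a list of (profile, function) pairs for genes of known function.
    """
    ranked = sorted(annotated, key=lambda item: jaccard(query, item[0]), reverse=True)
    functions = []
    for _profile, function in ranked:
        if function not in functions:
            functions.append(function)
        if len(functions) == top_n:
            break
    return functions

# Hypothetical toy data: four reference genomes, three annotated genes.
annotated = [
    ((1, 1, 0, 0), "transport"),
    ((0, 0, 1, 1), "replication"),
    ((1, 1, 1, 0), "transport"),
]
candidates = predict_functions((1, 1, 0, 0), annotated, top_n=2)
```

Returning a ranked list rather than a single label mirrors the way the abstract reports both exact matches and matches within the top five predictions.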

18.
Comput Biol Chem ; 53 Pt A: 43-8, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25218218

ABSTRACT

To determine the periodicity of a DNA sequence, different spectral approaches are applied: discrete Fourier transform (DFT), autocorrelation (CORR), information decomposition (ID), hybrid method (HYB), the spectral envelope concept for spectral analysis (SE), normalized autocorrelation (CORR_N), and profile analysis (PA). In this work, we investigated the possibility of finding the true period length, depending on the average number of accumulated changes in DNA bases (PM), for the methods stated above. The results show that for short periods (≤4 bp), it is possible to use the hybrid method (HYB), which combines properties of autocorrelation, Fourier transform, and information decomposition (ID). For larger period lengths (>4 bp) with point mutation (PM) values of 1.0 or more per nucleotide, it is preferable to use the information decomposition method (ID), as the other spectral approaches cannot correctly determine the period length present in the analyzed sequence.
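Of the approaches listed, autocorrelation is the simplest to illustrate. The sketch below assumes the plain symbolic form: for each lag, compute the fraction of positions where the sequence matches itself shifted by that lag, and report the smallest lag with the highest fraction. The function name `autocorrelation_period` is hypothetical, and the normalization used for CORR_N in the paper is not reproduced.

```python
def autocorrelation_period(sequence, max_period=50):
    """Return (lag, match_fraction) for the lag with the highest self-match fraction.

    Ties are resolved in favor of the smallest lag, so a perfect period p
    is reported as p rather than as one of its multiples.
    """
    best_lag, best_fraction = None, -1.0
    for lag in range(1, min(max_period, len(sequence) - 1) + 1):
        pairs = len(sequence) - lag
        matches = sum(1 for i in range(pairs) if sequence[i] == sequence[i + lag])
        fraction = matches / pairs
        if fraction > best_fraction:
            best_lag, best_fraction = lag, fraction
    return best_lag, best_fraction

period, fraction = autocorrelation_period("ACGTT" * 20)
```

For a perfect repeat this recovers the period directly; as substitutions accumulate, the peak at the true lag sinks toward the background match rate, which is consistent with the abstract's finding that such spectral methods fail at high PM.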


Subjects
Caenorhabditis elegans/genetics; DNA, Helminth/genetics; Models, Statistical; Periodicity; Sequence Analysis, DNA/statistics & numerical data; Animals; Fourier Analysis; Nucleotides; Point Mutation
19.
Comput Biol Chem ; 51: 12-21, 2014 Aug.
Article in English | MEDLINE | ID: mdl-24840641

ABSTRACT

We describe a new mathematical method for finding very diverged short tandem repeats containing a single indel. The method involves comparison of two frequency matrices: a first matrix for the subsequence before the shift and a second one for the subsequence after it. The measure of comparison is based on matrix similarity. The approach developed was applied to the analysis of the genomes of Caenorhabditis elegans, Drosophila melanogaster, and Saccharomyces cerevisiae. They were investigated for the presence of tandem repeats with repeat lengths of 2 to 11 nucleotides, excluding 3, 6, and 9 nucleotides. The number of phase shift regions in these genomes was approximately 2.2 × 10⁴, 1.5 × 10⁴, and 1.7 × 10², respectively. The type I error was less than 5%. The mean length of fuzzy periodicity and phase shift regions was about 220 nucleotides. The regions of fuzzy periodicity with a single insertion or deletion occupy substantial parts of the genomes: 5%, 3%, and 0.3%, respectively. Fewer than 10% of these regions have been detected previously. That is, the number of such regions in the genomes of C. elegans, D. melanogaster, and S. cerevisiae is dramatically higher than has been revealed by any known method. We suppose that some of the regions of fuzzy periodicity found could be regions for protein binding.


Subjects
Caenorhabditis elegans/genetics; Drosophila melanogaster/genetics; Frameshift Mutation; Genome; Microsatellite Repeats; Saccharomyces cerevisiae/genetics; Animals; Base Sequence; INDEL Mutation; Molecular Sequence Data; Monte Carlo Method; Mutagenesis, Insertional
20.
Article in English | MEDLINE | ID: mdl-26356866

ABSTRACT

It is known that nucleotide sequences are not totally homogeneous and that this heterogeneity cannot be due to random fluctuations alone. Such heterogeneity poses the problem of segmenting a sequence into a set of homogeneous parts divided by points called "change points". In this work we investigated a special case of change points: paired change points (PCP). We used a well-known property of coding sequences: triplet periodicity (TP). The sequences that we are especially interested in consist of three successive parts: the first and the last parts have similar TP, while the middle part has a different TP type. We aimed to find the genes with PCP and provide an explanation for this phenomenon. We developed a mathematical method for PCP detection based on a new measure of similarity between TP matrices. We investigated 66,936 bacterial genes from 17 bacterial genomes and revealed 2,700 genes with PCP and 6,459 genes with a single change point (SCP). We developed a mathematical approach to visualize the PCP cases. We suppose that PCP could be associated with double fusion or insertion events. The results of investigating sequences with artificial insertions/fusions and the distribution of TP inside the genome support the idea that the real number of genes formed by insertion/fusion events could be 5-7 times greater than the number of genes revealed in the present work.


Subjects
Algorithms; Genes, Bacterial/genetics; Sequence Alignment/methods; Sequence Analysis, DNA/methods; Gene Fusion/genetics; Mutagenesis, Insertional/genetics