Results 1 - 20 of 54
1.
PeerJ ; 12: e17025, 2024.
Article in English | MEDLINE | ID: mdl-38464746

ABSTRACT

Insects are a highly diverse group and possess a wide variety of traits, including the presence or absence of wings and metamorphosis. These diverse traits are of great interest for studying genome evolution, and numerous comparative genomic studies have examined a wide phylogenetic range of insects. Here, we analyzed 22 insects belonging to a wide phylogenetic range (Endopterygota, Paraneoptera, Polyneoptera, Palaeoptera, and other insects) by applying a batch-learning self-organizing map (BLSOM) to the oligonucleotide compositions of their genomic fragments (100-kb or 1-Mb sequences). BLSOM is an unsupervised machine learning algorithm that can extract species-specific characteristics of the oligonucleotide compositions (genome signatures). The genome signature is of particular interest in terms of the mechanisms and biological significance underlying the species-specific differences, and it can be used as a powerful probe to explore the various roles of genome sequences other than protein coding and to unveil mysteries hidden in the genome sequence. Since BLSOM is an unsupervised clustering method, the sequences were clustered based on the oligonucleotide composition alone, without providing information about the species from which each fragment sequence was derived. Therefore, not only interspecies separation but also intraspecies separation can be achieved. Here, we have revealed specific genomic regions with oligonucleotide compositions distinct from the usual sequences of each insect genome, e.g., Mb-level structures found in the grasshopper Schistocerca americana. One aim of this study was to compare the genome characteristics of insects with those of vertebrates, especially humans, which are phylogenetically distant from insects. Humans have recently become, in effect, a "model organism" for which a large amount of information has been accumulated using a variety of cutting-edge and high-throughput technologies. Therefore, it is reasonable to use the abundant information from humans to study insect lineages. Specific regions of Mb length with distinct oligonucleotide compositions have also been previously observed in the human genome. These regions were enriched in transcription factor binding sequences (TFBSs) and hypothesized to be involved in the three-dimensional arrangement of chromosomal DNA in interphase nuclei. The present study characterized the species-specific oligonucleotide compositions (i.e., genome signatures) in insect genomes and identified specific genomic regions with distinct oligonucleotide compositions.


Subject(s)
Human Genome , Insect Genome , Animals , Humans , Phylogeny , Insect Genome/genetics , Oligonucleotides/genetics , Artificial Intelligence
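
The input to the analysis described in this record is the oligonucleotide composition of fixed-length genomic fragments. A minimal sketch of how such composition vectors might be computed follows. The function names, the toy sequence, and the choices of non-overlapping fragments and plain relative frequencies are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter
from itertools import product


def fragments(genome, size):
    """Split a sequence into consecutive, non-overlapping fragments.

    A trailing piece shorter than `size` is dropped (an assumption of this sketch).
    """
    return [genome[i:i + size] for i in range(0, len(genome) - size + 1, size)]


def kmer_composition(seq, k):
    """Relative frequency of every A/C/G/T k-mer in `seq`, in a fixed order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing other symbols (for example N) are simply ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


# Toy data: a 260-base sequence cut into 50-base fragments, dinucleotides (k=2).
genome = "ACGTACGTTTGCA" * 20
vectors = [kmer_composition(f, 2) for f in fragments(genome, 50)]
```

Each fragment becomes one vector of length 4^k, and these vectors are what a clustering method would receive. Real fragments would be 100 kb or 1 Mb rather than 50 bases.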
2.
Virol J ; 20(1): 39, 2023 03 01.
Article in English | MEDLINE | ID: mdl-36859385

ABSTRACT

BACKGROUND: Viruses use various host factors for their growth, and efficient growth requires efficient use of these factors. Our previous study revealed that the occurrence frequency of oligonucleotides in the influenza virus genome is distinctly different among derived hosts, and the frequency tends to adapt to the host cells in which they grow. We aimed to study the adaptation mechanisms of a zoonotic virus to host cells. METHODS: Herein, we compared the frequency of oligonucleotides in the genome of alpha- and betacoronavirus with those in the genomes of humans and bats, which are typical hosts of the viruses. RESULTS: By comparing the oligonucleotide frequency in coronaviruses and their host genomes, we found a statistically tested positive correlation between the frequency of coronaviruses and that of the exon regions of the host from which the virus is derived. To examine the characteristics of early-stage changes in the viral genome, which are assumed to accompany the host change from non-humans to humans, we compared the oligonucleotide frequency between severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) at the beginning of the pandemic and the prevalent variants thereafter, and found changes towards the frequency of the host exon regions. CONCLUSIONS: In alpha- and betacoronaviruses, the genome oligonucleotide frequency is thought to change in response to the cellular environment in which the virus is replicating, and actually the frequency has approached the frequency in exon regions in the host.


Subject(s)
COVID-19 , Chiroptera , Animals , SARS-CoV-2 , Exons , Viral Genome , Oligonucleotides
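
The comparison reported in this record rests on correlating oligonucleotide frequencies between a virus and the exon regions of its host. The abstract does not name the statistic, so the sketch below assumes Pearson's r. The function names and the toy frequency values are likewise made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / (sd_x * sd_y)


def composition_correlation(virus, host):
    """Correlate two {k-mer: frequency} tables over the k-mers they share."""
    kmers = sorted(set(virus) & set(host))
    return pearson([virus[m] for m in kmers], [host[m] for m in kmers])


# Toy dinucleotide frequencies (hypothetical numbers, not measured values).
virus = {"AA": 0.30, "AC": 0.20, "CG": 0.05, "TT": 0.45}
host_exons = {"AA": 0.28, "AC": 0.22, "CG": 0.08, "TT": 0.42}
r = composition_correlation(virus, host_exons)
```

A value of r near 1 would indicate that the k-mers over-represented in the virus are also over-represented in the host exons. Testing significance would be a separate step.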
3.
PLoS One ; 17(8): e0273860, 2022.
Article in English | MEDLINE | ID: mdl-36044525

ABSTRACT

Among the mutations that occur in SARS-CoV-2, efficient identification of those advantageous for viral replication and transmission is important to characterize and defeat this rampant virus. Mutations that rapidly expand in frequency in a viral population are candidates for advantageous mutations, but neutral mutations hitchhiking with advantageous mutations are also likely to be included. To distinguish these, we focus on mutations that appear to occur independently in different lineages and expand in frequency in a convergent evolutionary manner. Batch-learning SOM (BLSOM) can separate SARS-CoV-2 genome sequences according to lineage when given only the oligonucleotide composition. Focusing on remarkably expanding 20-mers, each of which is represented by only one copy in the viral genome, allows us to correlate the expanding 20-mers with mutations. Using visualization functions in BLSOM, we can efficiently identify mutations that have expanded remarkably both in the Omicron lineage, which is phylogenetically distinct from other lineages, and in other lineages. Most of these mutations involved changes in amino acids, but a few did not, such as an intergenic mutation.


Subject(s)
COVID-19 , Mutation , Oligonucleotides , SARS-CoV-2 , Artificial Intelligence , COVID-19/genetics , Viral Genome , Humans , Machine Learning , Oligonucleotides/genetics , Phylogeny , SARS-CoV-2/genetics , Coronavirus Spike Glycoprotein/genetics
4.
BMC Genomics ; 23(1): 497, 2022 Jul 08.
Article in English | MEDLINE | ID: mdl-35804296

ABSTRACT

BACKGROUND: Emerging infectious disease-causing RNA viruses, such as the SARS-CoV-2 and Ebola viruses, are thought to rely on bats as natural reservoir hosts. Since these zoonotic viruses pose a great threat to humans, it is important to characterize the bat genome from multiple perspectives. Unsupervised machine learning methods for extracting novel information from big sequence data without prior knowledge or particular models are highly desirable for obtaining unexpected insights. We previously established a batch-learning self-organizing map (BLSOM) of the oligonucleotide composition that reveals novel genome characteristics from big sequence data. RESULTS: In this study, using the oligonucleotide BLSOM, we conducted a comparative genomic study of humans and six bat species. BLSOM is an explainable-type machine learning algorithm that reveals the diagnostic oligonucleotides contributing to sequence clustering (self-organization). When unsupervised machine learning reveals unexpected and/or characteristic features, these features can be studied in more detail via the much simpler and more direct standard distribution map method. Based on this combined strategy, we identified the Mb-level enrichment of CG dinucleotide (Mb-level CpG islands) around the termini of bat long-scaffold sequences. In addition, a class of CG-containing oligonucleotides were enriched in the centromeric and pericentromeric regions of human chromosomes. Oligonucleotides longer than tetranucleotides often represent binding motifs for a wide variety of proteins (e.g., transcription factor binding sequences (TFBSs)). By analyzing the penta- and hexanucleotide composition, we observed the evident enrichment of a wide range of hexanucleotide TFBSs in centromeric and pericentromeric heterochromatin regions on all human chromosomes. 
CONCLUSION: Function of transcription factors (TFs) beyond their known regulation of gene expression (e.g., TF-mediated looping interactions between two different genomic regions) has received wide attention. The Mb-level TFBS and CpG islands are thought to be involved in the large-scale nuclear organization, such as centromere and telomere clustering. TFBSs, which are enriched in centromeric and pericentromeric heterochromatin regions, are thought to play an important role in the formation of nuclear 3D structures. Our machine learning-based analysis will help us to understand the differential features of nuclear 3D structures in the human and bat genomes.


Subject(s)
COVID-19 , Chiroptera/genetics , Human Genome/genetics , SARS-CoV-2/physiology , Animals , COVID-19/transmission , Chiroptera/virology , CpG Islands , Genomics/methods , Heterochromatin/chemistry , Heterochromatin/genetics , Humans , Molecular Conformation , Oligonucleotides/chemistry , Unsupervised Machine Learning
5.
BMC Microbiol ; 22(1): 73, 2022 03 10.
Article in English | MEDLINE | ID: mdl-35272618

ABSTRACT

BACKGROUND: Unsupervised AI (artificial intelligence) can obtain novel knowledge from big data without particular models or prior knowledge and is highly desirable for unveiling hidden features in big data. SARS-CoV-2 poses a serious threat to public health and one important issue in characterizing this fast-evolving virus is to elucidate various aspects of their genome sequence changes. We previously established unsupervised AI, a BLSOM (batch-learning SOM), which can analyze five million genomic sequences simultaneously. The present study applied the BLSOM to the oligonucleotide compositions of forty thousand SARS-CoV-2 genomes. RESULTS: While only the oligonucleotide composition was given, the obtained clusters of genomes corresponded primarily to known main clades and internal divisions in the main clades. Since the BLSOM is explainable AI, it reveals which features of the oligonucleotide composition are responsible for clade clustering. Additionally, BLSOM also provided information concerning the special genomic region possibly undergoing RNA modifications. CONCLUSIONS: The BLSOM has powerful image display capabilities and enables efficient knowledge discovery about viral evolutionary processes, and it can complement phylogenetic methods based on sequence alignment.


Subject(s)
COVID-19 , SARS-CoV-2 , Artificial Intelligence , Molecular Evolution , Humans , Phylogeny , SARS-CoV-2/genetics
6.
Genes Genet Syst ; 96(4): 165-176, 2021 Dec 16.
Article in English | MEDLINE | ID: mdl-34565757

ABSTRACT

In genetics and related fields, huge amounts of data, such as genome sequences, are accumulating, and the use of artificial intelligence (AI) suitable for big data analysis has become increasingly important. Unsupervised AI that can reveal novel knowledge from big data without prior knowledge or particular models is highly desirable for analyses of genome sequences, particularly for obtaining unexpected insights. We have developed a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions that can reveal various novel genome characteristics. Here, we explain the data mining by the BLSOM: an unsupervised AI. As a specific target, we first selected SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) because a large number of viral genome sequences have been accumulated via worldwide efforts. We analyzed more than 0.6 million sequences collected primarily in the first year of the pandemic. BLSOMs for short oligonucleotides (e.g., 4-6-mers) allowed separation into known clades, but longer oligonucleotides further increased the separation ability and revealed subgrouping within known clades. In the case of 15-mers, there is mostly one copy in the genome; thus, 15-mers that appeared after the epidemic started could be connected to mutations, and the BLSOM for 15-mers revealed the mutations that contributed to separation into known clades and their subgroups. After introducing the detailed methodological strategies, we explain BLSOMs for various topics, such as the tetranucleotide BLSOM for over 5 million 5-kb fragment sequences derived from almost all microorganisms currently available and its use in metagenome studies. We also explain BLSOMs for various eukaryotes, including fishes, frogs and Drosophila species, and found a high separation ability among closely related species. When analyzing the human genome, we found enrichments in transcription factor-binding sequences in centromeric and pericentromeric heterochromatin regions. 
The tDNAs (tRNA genes) could be separated according to their corresponding amino acid.


Subject(s)
Artificial Intelligence , Computational Biology/methods , Human Genome , Viral Genome , SARS-CoV-2/genetics , Cluster Analysis , Codon Usage , Humans , Metagenomics/methods , Mutation , Transfer RNA , Time Factors
7.
BMC Microbiol ; 21(1): 89, 2021 03 23.
Article in English | MEDLINE | ID: mdl-33757449

ABSTRACT

BACKGROUND: When a virus that has grown in a nonhuman host starts an epidemic in the human population, human cells may not provide growth conditions ideal for the virus. Therefore, the invasion of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), which is usually prevalent in the bat population, into the human population is thought to have necessitated changes in the viral genome for efficient growth in the new environment. In the present study, to understand host-dependent changes in coronavirus genomes, we focused on the mono- and oligonucleotide compositions of SARS-CoV-2 genomes and investigated how these compositions changed time-dependently in the human cellular environment. We also compared the oligonucleotide compositions of SARS-CoV-2 and other coronaviruses prevalent in humans or bats to investigate the causes of changes in the host environment. RESULTS: Time-series analyses of changes in the nucleotide compositions of SARS-CoV-2 genomes revealed a group of mono- and oligonucleotides whose compositions changed in a common direction for all clades, even though viruses belonging to different clades should evolve independently. Interestingly, the compositions of these oligonucleotides changed towards those of coronaviruses that have been prevalent in humans for a long period and away from those of bat coronaviruses. CONCLUSIONS: Clade-independent, time-dependent changes are thought to have biological significance and should relate to viral adaptation to a new host environment, providing important clues for understanding viral host adaptation mechanisms.


Subject(s)
Base Composition , Molecular Evolution , Viral Genome , SARS-CoV-2/genetics , Animals , Chiroptera/virology , Humans , Oligonucleotides
8.
Life Sci Alliance ; 4(5)2021 05.
Article in English | MEDLINE | ID: mdl-33712508

ABSTRACT

The Japanese wrinkled frog (Glandirana rugosa) is unique in having both XX-XY and ZZ-ZW types of sex chromosomes within the species. Genome sequencing and comparative genomics with other frogs should be important for understanding the mechanisms of sex chromosome turnover within one species or during a short period. In this study, we analyzed the newly sequenced genome of G. rugosa using a batch-learning self-organizing map, an unsupervised artificial intelligence method for oligonucleotide compositions. To clarify the genome characteristics of G. rugosa, we compared its short oligonucleotide compositions in all 1-Mb genomic fragments with those of six other frog species (Pyxicephalus adspersus, Rhinella marina, Spea multiplicata, Leptobrachium leishanense, Xenopus laevis, and Xenopus tropicalis). In G. rugosa, we found Mb-scale repeat sequences having a high identity with the W chromosome of the African bullfrog (P. adspersus). Our study concluded that G. rugosa has unique genome characteristics with a high CG frequency, and its genome is assumed to heterochromatinize a large portion of the genome via methylation of CG.


Subject(s)
Base Composition/genetics , Ranidae/genetics , Sex Chromosomes/genetics , Animals , Base Sequence/genetics , Female , Genomics/methods , Male , Phylogeny , Unsupervised Machine Learning
9.
Sci Rep ; 10(1): 17440, 2020 10 15.
Article in English | MEDLINE | ID: mdl-33060757

ABSTRACT

Transfer RNA genes (tDNAs) are essential genes that encode tRNAs in all species. To understand new functions of tDNAs, other than that of encoding tRNAs, we used ENCODE data to examine binding characteristics of transcription factors (TFs) for all tDNA regions (489 loci) in the human genome. We divided the tDNAs into three groups based on the number of TFs that bound to them. At the two extremes were tDNAs to which many TFs bound (Group 1) and those to which no TFs bound (Group 3). Several TFs involved in chromatin remodeling such as ATF3, EP300 and TBL1XR1 bound to almost all Group 1 tDNAs. Furthermore, almost all Group 1 tDNAs included DNase I hypersensitivity sites and may thus interact with other chromatin regions through their bound TFs, and they showed highly conserved synteny across tetrapods. In contrast, Group 3 tDNAs did not possess these characteristics. These data suggest the presence of a previously uncharacterized function of these tDNAs. We also examined binding of CTCF to tDNAs and their involvement in topologically associating domains (TADs) and lamina-associated domains (LADs), which suggest a new perspective on the evolution and function of tDNAs.


Subject(s)
Chromatin/chemistry , Transfer RNA/metabolism , Transcription Factors/metabolism , A549 Cells , Activating Transcription Factor 3/metabolism , Amino Acid Motifs , Computational Biology , Factual Databases , E1A-Associated p300 Protein/metabolism , Human Genome , HeLa Cells , Hep G2 Cells , Humans , K562 Cells , Protein Domains , Cytoplasmic and Nuclear Receptors/metabolism , Repressor Proteins/metabolism , Synteny , TFIII Transcription Factors/metabolism
10.
PLoS One ; 15(8): e0236515, 2020.
Article in English | MEDLINE | ID: mdl-32764757

ABSTRACT

With the increasing availability of high-quality genomic data, there is an opportunity to deeply explore the genealogical relationships of different gene loci between closely related species. In this study, we utilized genomes of Xenopus laevis (XLA, a tetraploid species with (L) and (S) sub-genomes) and X. tropicalis (XTR, a diploid species) to investigate whether synonymous substitution rates among orthologous or homoeologous genes displayed any heterogeneity. From over 1500 orthologous/homoeologous genes collected, we calculated the proportion of synonymous substitutions between genomes/sub-genomes (k) and found variation within and between chromosomes. Within most chromosomes, we identified higher k with distance from the centromere, likely attributable to higher substitution rates and recombination in these regions. Using maximum likelihood methods, we identified further evidence supporting rate heterogeneity, and estimated species divergence times and ancestral population sizes. Estimated species divergence times (XLA.L-XLA.S: ~25.5 mya; XLA-XTR: ~33.0 mya) were slightly younger than in a past study, which we attribute to the consideration of population size in our study. Meanwhile, we found a very large estimated population size in the ancestral populations of the two species (NA = 2.55 × 10^6). Local hybridization and population structure, which have not yet been well elucidated in frogs, may be contributing factors to these possibly large population sizes.


Subject(s)
Molecular Evolution , Genome/genetics , Silent Mutation/genetics , Xenopus laevis/genetics , Animals , Chromosomes , Genetic Heterogeneity , Genetic Hybridization , Fluorescence In Situ Hybridization , Phylogeny
11.
Gene X ; 5: 100038, 2020 Dec.
Article in English | MEDLINE | ID: mdl-32835214

ABSTRACT

We first conducted a time-series analysis of the mono- and dinucleotide compositions of over 10,000 SARS-CoV-2 genomes, as well as over 1500 Zaire ebolavirus genomes, and found clear time-series changes in the compositions on a monthly basis, which should reflect viral adaptations for efficient growth in human cells. We next developed a sequence alignment-free method that extensively searches for advantageous mutations and ranks them by the level of increase in their intrapopulation frequency. Time-series analysis of the occurrences of oligonucleotides of diverse lengths in SARS-CoV-2 genomes revealed seven distinctive mutations that rapidly expanded their intrapopulation frequency and are thought to be candidates for advantageous mutations for efficient growth in human cells.
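
The alignment-free search described in this record can be illustrated with a small sketch: for each month, compute the fraction of genomes that contain each oligonucleotide, then rank oligonucleotides by how much that fraction grew. The function names, the month keys, the toy sequences, and the first-versus-last-month comparison are assumptions for illustration. The published method may score the increase differently.

```python
from collections import Counter


def kmer_set(seq, k):
    """All distinct k-mers present in one genome sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def presence_fraction(genomes, k):
    """Fraction of genomes that contain each k-mer at least once."""
    counts = Counter()
    for genome in genomes:
        counts.update(kmer_set(genome, k))
    return {kmer: n / len(genomes) for kmer, n in counts.items()}


def rank_expanding_kmers(genomes_by_month, k):
    """Rank k-mers by the gain in intrapopulation frequency, first to last month."""
    months = sorted(genomes_by_month)
    first = presence_fraction(genomes_by_month[months[0]], k)
    last = presence_fraction(genomes_by_month[months[-1]], k)
    gain = {kmer: last[kmer] - first.get(kmer, 0.0) for kmer in last}
    return sorted(gain.items(), key=lambda item: item[1], reverse=True)


# Toy population in which a G-to-T change at the third base spreads over time.
genomes_by_month = {
    "2020-01": ["ACGTAC", "ACGTAC"],
    "2020-02": ["ACGTAC", "ACTTAC"],
    "2020-03": ["ACTTAC", "ACTTAC"],
}
ranked = rank_expanding_kmers(genomes_by_month, 3)
```

In the toy data, the three 3-mers that overlap the changed base (ACT, CTT, TTA) rise to the top of the ranking, which is how an expanding oligonucleotide points back to a mutation without any alignment.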

12.
Genes Genet Syst ; 95(1): 29-41, 2020 Apr 22.
Article in English | MEDLINE | ID: mdl-32161227

ABSTRACT

Unsupervised machine learning that can discover novel knowledge from big sequence data without prior knowledge or particular models is highly desirable for current genome study. We previously established a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions, which can reveal various novel genome characteristics from big sequence data, and found that transcription factor binding sequences (TFBSs) and CpG-containing oligonucleotides are enriched in human centromeric and pericentromeric regions, which support centromere clustering and form the condensed heterochromatin "chromocenter" in interphase nuclei. The number and size of chromocenters, as well as the type of centromeres gathered in individual chromocenters, vary depending on cell type. To study molecular mechanisms of cell type-dependent chromocenter formation, we analyzed distribution patterns of occurrence per Mb of hexa- and heptanucleotide TFBSs, which have been compiled by the SwissRegulon Portal, and of CpG-containing oligonucleotides. We found Mb-level islands enriched for TFBSs and CpG-containing oligonucleotides in centromeric and pericentromeric regions on all human chromosomes except chrY. Considering molecular mechanisms for cell type-dependent centromere clustering, the chromosome-dependent enrichment of a set of TFBSs and CpG-containing oligonucleotides is of particular interest, since the cellular content of TFs and methyl-CpG-binding proteins exhibits cell type-dependent regulation. A newly introduced BLSOM, which analyzed occurrences of a total of 3,946 octanucleotide TFBSs compiled by the SwissRegulon Portal, has self-organized (separated) the sequences that are characteristically enriched in TFBSs and shown that these sequences are derived primarily from centromeric and pericentromeric constitutive heterochromatin regions. Furthermore, the BLSOM identified and visualized characteristic TFBSs that are enriched in these regions. 
By analyzing Hi-C data for interchromosomal interactions, the present study showed that the chromatin segments supporting the interchromosomal interactions locate primarily in Mb-level TFBS and CpG islands and are thus enriched for a wide variety of TFBSs and CG-containing oligonucleotides.


Subject(s)
Artificial Intelligence , Human Chromosomes/genetics , CpG Islands/genetics , Human Genome/genetics , Binding Sites , Centromere/genetics , Heterochromatin/genetics , Humans , Oligonucleotides/genetics , Protein Binding , Transcription Factors/genetics , Transcription Factors/metabolism
13.
Genes Genet Syst ; 95(1): 11-19, 2020 Apr 22.
Article in English | MEDLINE | ID: mdl-32161228

ABSTRACT

As a result of the extensive decoding of a massive amount of genomic and metagenomic sequence data, a large number of genes whose functions cannot be predicted by sequence similarity searches are accumulating, and such genes are of little use to science or industry. Current genome and metagenome sequencing largely depends on high-throughput and low-cost methods. In the case of genome sequencing for a single species, high-density sequencing can reduce sequencing errors. For metagenome sequences, however, high-density sequencing does not necessarily increase the sequence quality because multiple and unknown genomes, including those of closely related species, are likely to exist in the sample. Therefore, a function prediction method that is robust against sequence errors is increasingly needed. Here, we present a method for predicting protein gene function that does not depend on sequence similarity searches. Using an unsupervised machine learning method called BLSOM (batch-learning self-organizing map) for short oligopeptide frequencies, we previously developed a sequence alignment-free method for clustering bacterial protein genes according to clusters of orthologous groups of proteins (COGs), without using information from COGs during machine learning. This allows function-unknown proteins to cluster with function-known proteins, based solely on similarity with respect to oligopeptide frequency, although the method required high-performance supercomputers (HPCs). Based on a wide range of knowledge obtained with HPCs, we have now developed a strategy to correlate function-unknown proteins with COG categories, using only oligopeptide frequency distances (OPDs), which can be conducted with PC-level computers. The OPD strategy is suitable for predicting the functions of proteins with low sequence similarity and is applied here to predict the functions of a large number of gene candidates discovered using metagenome sequencing.


Subject(s)
Algorithms , Eukaryota/genetics , Metagenome/genetics , Metagenomics , Molecular Sequence Annotation , Oligopeptides/genetics , Amino Acid Sequence , Chromosome Mapping , Proteins/genetics , Proteins/metabolism , Sequence Alignment
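
The abstract of this record does not define the oligopeptide frequency distance, so the following is only a hedged sketch of the general idea: represent each protein by its dipeptide frequencies and assign a function-unknown protein to the category of its nearest function-known neighbour. Euclidean distance, dipeptides, the nearest-neighbour rule, the function names, and the toy sequences are all assumptions of this example.

```python
import math
from collections import Counter


def dipeptide_freq(protein):
    """Relative frequency of each overlapping dipeptide in a protein sequence."""
    counts = Counter(protein[i:i + 2] for i in range(len(protein) - 1))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def distance(f, g):
    """Euclidean distance between two sparse frequency tables."""
    keys = set(f) | set(g)
    return math.sqrt(sum((f.get(key, 0.0) - g.get(key, 0.0)) ** 2 for key in keys))


def nearest_category(query, references):
    """Return the category whose reference protein lies closest to `query`.

    `references` maps a category label to a list of protein sequences.
    """
    query_freq = dipeptide_freq(query)
    best = min(
        (distance(query_freq, dipeptide_freq(protein)), category)
        for category, proteins in references.items()
        for protein in proteins
    )
    return best[1]


# Toy reference set; the labels stand in for COG categories and the
# sequences are invented, not real proteins.
references = {"J": ["MKKLLKKLLKK"], "C": ["MGAGAGAGAGA"]}
predicted = nearest_category("MKKLKKLLK", references)
```

Because only frequency tables are compared, no alignment is computed, which is consistent with the abstract's point that the strategy can run on PC-level computers.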
14.
Gene ; 763S: 100038, 2020 Dec.
Article in English | MEDLINE | ID: mdl-34493367

ABSTRACT

We first conducted a time-series analysis of the mono- and dinucleotide compositions of over 10,000 SARS-CoV-2 genomes, as well as over 1500 Zaire ebolavirus genomes, and found clear time-series changes in the compositions on a monthly basis, which should reflect viral adaptations for efficient growth in human cells. We next developed a sequence alignment-free method that extensively searches for advantageous mutations and ranks them by the level of increase in their intrapopulation frequency. Time-series analysis of the occurrences of oligonucleotides of diverse lengths in SARS-CoV-2 genomes revealed seven distinctive mutations that rapidly expanded their intrapopulation frequency and are thought to be candidates for advantageous mutations for efficient growth in human cells.


Subject(s)
COVID-19/genetics , Viral Genome/genetics , Viral RNA/genetics , SARS-CoV-2/genetics , COVID-19/pathology , Humans , Mutation/genetics , Oligonucleotides/genetics , SARS-CoV-2/pathogenicity , Sequence Alignment
15.
Genes Genet Syst ; 92(1): 43-54, 2017 Sep 12.
Article in English | MEDLINE | ID: mdl-28344190

ABSTRACT

Unsupervised data mining capable of extracting a wide range of knowledge from big data without prior knowledge or particular models is a timely application in the era of big sequence data accumulation in genome research. By handling oligonucleotide compositions as high-dimensional data, we have previously modified the conventional self-organizing map (SOM) for genome informatics and established BLSOM, which can analyze more than ten million sequences simultaneously. Here, we develop BLSOM specialized for tRNA genes (tDNAs) that can cluster (self-organize) more than one million microbial tDNAs according to their cognate amino acid solely depending on tetra- and pentanucleotide compositions. This unsupervised clustering can reveal combinatorial oligonucleotide motifs that are responsible for the amino acid-dependent clustering, as well as other functionally and structurally important consensus motifs, which have been evolutionarily conserved. BLSOM is also useful for identifying tDNAs as phylogenetic markers for special phylotypes. When we constructed BLSOM with 'species-unknown' tDNAs from metagenomic sequences plus 'species-known' microbial tDNAs, a large portion of metagenomic tDNAs self-organized with species-known tDNAs, yielding information on microbial communities in environmental samples. BLSOM can also enhance accuracy in the tDNA database obtained from big sequence data. This unsupervised data mining should become important for studying numerous functionally unclear RNAs obtained from a wide range of organisms.


Subject(s)
Artificial Intelligence , Genomics/methods , Transfer RNA/genetics , DNA Sequence Analysis/methods , Software , Animals , Humans
16.
Sci Rep ; 6: 36197, 2016 11 03.
Article in English | MEDLINE | ID: mdl-27808119

ABSTRACT

Ebolavirus, MERS coronavirus and influenza virus are zoonotic RNA viruses, which mutate very rapidly. Viral growth depends on many host factors, but human cells may not provide the ideal growth conditions for viruses invading from nonhuman hosts. The present time-series analyses of short and long oligonucleotide compositions in these genomes showed directional changes in their composition after invasion from a nonhuman host, which are thought to recur after future invasions. In the recent West Africa Ebola outbreak, directional time-series changes in a wide range of oligonucleotides were observed in common for three geographic areas, and the directional changes were observed also for the recent MERS coronavirus epidemics starting in the Middle East. In addition, common directional changes in human influenza A viruses were observed for three subtypes, whose epidemics started independently. Long oligonucleotides that showed an evident directional change observed in common for the three subtypes corresponded to some of influenza A siRNAs, whose activities have been experimentally proven. Predicting directional and reoccurring changes in oligonucleotide composition should become important for designing diagnostic RT-PCR primers and therapeutic oligonucleotides with long effectiveness.


Subject(s)
Viral Genome , Zoonoses/virology , Animals , Base Sequence , Ebolavirus/genetics , Humans , Middle East Respiratory Syndrome Coronavirus/genetics , Oligonucleotides/genetics , Orthomyxoviridae/genetics , Time Factors
17.
Biomed Res Int ; 2015: 506052, 2015.
Article in English | MEDLINE | ID: mdl-26495297

ABSTRACT

With the remarkable increase in genomic sequence data from various organisms, novel tools are needed for comprehensive analyses of available big sequence data. We previously developed a Batch-Learning Self-Organizing Map (BLSOM), which can cluster genomic fragment sequences according to phylotype solely dependent on oligonucleotide composition and applied to genome and metagenomic studies. BLSOM is suitable for high-performance parallel-computing and can analyze big data simultaneously, but a large-scale BLSOM needs a large computational resource. We have developed Self-Compressing BLSOM (SC-BLSOM) for reduction of computation time, which allows us to carry out comprehensive analysis of big sequence data without the use of high-performance supercomputers. The strategy of SC-BLSOM is to hierarchically construct BLSOMs according to data class, such as phylotype. The first-layer BLSOM was constructed with each of the divided input data pieces that represents the data subclass, such as phylotype division, resulting in compression of the number of data pieces. The second BLSOM was constructed with a total of weight vectors obtained in the first-layer BLSOMs. We compared SC-BLSOM with the conventional BLSOM by analyzing bacterial genome sequences. SC-BLSOM could be constructed faster than BLSOM and cluster the sequences according to phylotype with high accuracy, showing the method's suitability for efficient knowledge discovery from big sequence data.


Subject(s)
Algorithms , Chromosome Mapping/methods , Data Compression/methods , High-Throughput Nucleotide Sequencing/methods , Pattern Recognition, Automated/methods , Sequence Analysis, DNA/methods , Genome, Bacterial/genetics , Software
18.
Genes Genet Syst ; 90(1): 43-53, 2015.
Article in English | MEDLINE | ID: mdl-26119665

ABSTRACT

Unsupervised data mining capable of extracting a wide range of information from big sequence data without prior knowledge or particular models is highly desirable in an era of big data accumulation for research on genes, genomes and genetic systems. By handling oligonucleotide compositions in genomic sequences as high-dimensional data, we have previously modified the conventional SOM (self-organizing map) for genome informatics and established BLSOM for oligonucleotide composition, which can analyze more than ten million sequences simultaneously and is thus suitable for big data analyses. Oligonucleotides often represent motif sequences responsible for sequence-specific binding of proteins such as transcription factors. The distribution of such functionally important oligonucleotides is probably biased in genomic sequences, and may differ among genomic regions. In this study, when we constructed BLSOMs to analyze pentanucleotide composition in 50-kb sequences derived from the human genome, we found that the BLSOMs did not classify human sequences according to chromosome but revealed several specific zones that are enriched for a class of CG-containing pentanucleotides; these zones are composed primarily of sequences derived from pericentric regions. The biological significance of the enrichment of these pentanucleotides in pericentric regions is discussed in connection with the cell type- and stage-dependent formation of condensed heterochromatin in the chromocenter, which is formed through association of the pericentric regions of multiple chromosomes.
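As a rough illustration of the input to such an analysis, the sketch below splits a sequence into fixed-size fragments, computes the pentanucleotide composition of each, and sums the frequencies of CG-containing pentanucleotides. The function names are hypothetical. Skipping windows that contain ambiguous bases and dropping a trailing partial fragment are assumptions, since the abstract does not describe those details.

```python
from collections import Counter


def fragments(seq, size):
    """Non-overlapping fragments of exactly `size` bases (trailing rest dropped)."""
    return [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]


def kmer_frequencies(seq, k=5):
    """Relative frequency of every k-mer seen in seq (overlapping windows)."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # assumption: ignore windows with N etc.
            counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def cg_pentamer_fraction(seq):
    """Summed frequency of pentanucleotides that contain the dinucleotide CG."""
    return sum(f for kmer, f in kmer_frequencies(seq, 5).items() if "CG" in kmer)


# Hypothetical toy sequence; real fragments would be 50 kb long.
for frag in fragments("ACGTACGTACAAAAAAAAAA", 10):
    print(frag, cg_pentamer_fraction(frag))
```

Each fragment's frequency dictionary corresponds to one high-dimensional input vector (up to 1,024 dimensions for pentanucleotides). A fragment with an unusually high CG-containing fraction would be a candidate for the enriched zones the abstract describes.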


Subject(s)
Base Composition , Binding Sites , Chromosomes, Human , Genome, Human , Genomics , Nucleotide Motifs , Oligonucleotides , Transcription Factors/metabolism , Genomics/methods , Humans
19.
Biomed Res Int ; 2014: 765648, 2014.
Article in English | MEDLINE | ID: mdl-24804244

ABSTRACT

With the remarkable increase in genomic sequence data from a wide range of species, novel tools are needed for comprehensive analyses of big sequence data. The Self-Organizing Map (SOM) is an effective tool for clustering and visualizing high-dimensional data, such as oligonucleotide composition, on one map. By modifying the conventional SOM, we have previously developed the Batch-Learning SOM (BLSOM), which allows classification of sequence fragments according to species based solely on oligonucleotide composition. In the present study, we introduce the oligonucleotide BLSOM used for characterization of vertebrate genome sequences. We first analyzed pentanucleotide compositions in 100-kb sequences derived from a wide range of vertebrate genomes, and then the compositions in the human and mouse genomes, in order to investigate an efficient method for detecting differences between closely related genomes. BLSOM can recognize the species-specific key combination of oligonucleotide frequencies in each genome, which is called a "genome signature," as well as regions specifically enriched in transcription-factor-binding sequences. Because its classification and visualization power is very high, BLSOM is an efficient and powerful tool for extracting a wide range of information from massive amounts of genomic sequences (i.e., big sequence data).


Subject(s)
Chromosome Mapping , Computational Biology/methods , Genome , Genomics/methods , Algorithms , Animals , Humans , Mice , Sequence Analysis, DNA
20.
DNA Res ; 21(5): 459-67, 2014 Oct.
Article in English | MEDLINE | ID: mdl-24800745

ABSTRACT

With the remarkable increase in genomic sequence data from a wide range of species, novel tools are needed for comprehensive analyses of big sequence data. The self-organizing map (SOM) is a powerful tool for clustering high-dimensional data on one plane. For oligonucleotide compositions handled as high-dimensional data, we have previously modified the conventional SOM for genome informatics, yielding BLSOM. In the present study, we constructed BLSOMs for oligonucleotide compositions in fragment sequences (e.g. 100 kb) from a wide range of vertebrates, including coelacanth, and found that the sequences were clustered primarily according to species even though no species information was provided. As one of the nearest living relatives of tetrapod ancestors, coelacanth is believed to provide access to the phenotypic and genomic transitions leading to the emergence of tetrapods. The characteristic oligonucleotide composition found for coelacanth was connected with the lowest occurrence of the dinucleotide CG (i.e. the highest CG suppression) among fishes, which was comparable to that of tetrapods. This evident CG suppression in coelacanth should reflect molecular evolutionary processes of epigenetic systems, including DNA methylation, during vertebrate evolution. The sequence of a de novo DNA methylase (Dnmt3a) of coelacanth was found to be more closely related to that of tetrapods than to that of other fishes.
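CG suppression is commonly expressed as the ratio of the observed CG dinucleotide count to the count expected from the C and G base frequencies. The abstract does not state which statistic the authors used, so the observed/expected ratio below is an assumption, and the function name is hypothetical.

```python
def cpg_observed_expected(seq):
    """Observed/expected CG ratio: count(CG) * N / (count(C) * count(G)).

    Values well below 1 indicate CG suppression; returns 0.0 when the
    sequence has no C or no G, because the expectation is then undefined.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cg = seq.count("CG")  # CG cannot overlap itself, so count() is exact
    if c == 0 or g == 0:
        return 0.0
    return cg * n / (c * g)


# Hypothetical toy sequences.
print(cpg_observed_expected("CCGG"))      # 1.0: CG occurs as often as expected
print(cpg_observed_expected("GGCCAATT"))  # 0.0: C and G present, but never as CG
```

Computing this ratio for each genome fragment and comparing the distributions across species would give one way to rank genomes by CG suppression.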


Subject(s)
Evolution, Molecular , Genome , Vertebrates/genetics , Animals , Computational Biology , DNA Modification Methylases/genetics , Fishes/genetics , Phylogeny