Search | BVS Enfermagem Research Portal

1.

Oligonucleotide usage in coronavirus genomes mimics that in exon regions in host genomes.

Iwasaki, Yuki; Abe, Takashi; Ikemura, Toshimichi.

Virol J ; 20(1): 39, 2023 03 01.

Article in English | MEDLINE | ID: mdl-36859385

ABSTRACT

BACKGROUND: Viruses use various host factors for their growth, and efficient growth requires efficient use of these factors. Our previous study revealed that the occurrence frequency of oligonucleotides in the influenza virus genome is distinctly different among derived hosts, and the frequency tends to adapt to the host cells in which they grow. We aimed to study the adaptation mechanisms of a zoonotic virus to host cells. METHODS: Herein, we compared the frequency of oligonucleotides in the genome of alpha- and betacoronavirus with those in the genomes of humans and bats, which are typical hosts of the viruses. RESULTS: By comparing the oligonucleotide frequency in coronaviruses and their host genomes, we found a statistically tested positive correlation between the frequency of coronaviruses and that of the exon regions of the host from which the virus is derived. To examine the characteristics of early-stage changes in the viral genome, which are assumed to accompany the host change from non-humans to humans, we compared the oligonucleotide frequency between severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) at the beginning of the pandemic and the prevalent variants thereafter, and found changes towards the frequency of the host exon regions. CONCLUSIONS: In alpha- and betacoronaviruses, the genome oligonucleotide frequency is thought to change in response to the cellular environment in which the virus is replicating, and actually the frequency has approached the frequency in exon regions in the host.

Subjects

COVID-19 , Chiroptera , Animals , SARS-CoV-2 , Exons , Genome, Viral , Oligonucleotides

2.

Comparative genomic analysis of the human genome and six bat genomes using unsupervised machine learning: Mb-level CpG and TFBS islands.

Iwasaki, Yuki; Ikemura, Toshimichi; Wada, Kennosuke; Wada, Yoshiko; Abe, Takashi.

BMC Genomics ; 23(1): 497, 2022 Jul 08.

Article in English | MEDLINE | ID: mdl-35804296

ABSTRACT

BACKGROUND: Emerging infectious disease-causing RNA viruses, such as the SARS-CoV-2 and Ebola viruses, are thought to rely on bats as natural reservoir hosts. Since these zoonotic viruses pose a great threat to humans, it is important to characterize the bat genome from multiple perspectives. Unsupervised machine learning methods for extracting novel information from big sequence data without prior knowledge or particular models are highly desirable for obtaining unexpected insights. We previously established a batch-learning self-organizing map (BLSOM) of the oligonucleotide composition that reveals novel genome characteristics from big sequence data. RESULTS: In this study, using the oligonucleotide BLSOM, we conducted a comparative genomic study of humans and six bat species. BLSOM is an explainable-type machine learning algorithm that reveals the diagnostic oligonucleotides contributing to sequence clustering (self-organization). When unsupervised machine learning reveals unexpected and/or characteristic features, these features can be studied in more detail via the much simpler and more direct standard distribution map method. Based on this combined strategy, we identified the Mb-level enrichment of CG dinucleotide (Mb-level CpG islands) around the termini of bat long-scaffold sequences. In addition, a class of CG-containing oligonucleotides were enriched in the centromeric and pericentromeric regions of human chromosomes. Oligonucleotides longer than tetranucleotides often represent binding motifs for a wide variety of proteins (e.g., transcription factor binding sequences (TFBSs)). By analyzing the penta- and hexanucleotide composition, we observed the evident enrichment of a wide range of hexanucleotide TFBSs in centromeric and pericentromeric heterochromatin regions on all human chromosomes. 
CONCLUSION: Function of transcription factors (TFs) beyond their known regulation of gene expression (e.g., TF-mediated looping interactions between two different genomic regions) has received wide attention. The Mb-level TFBS and CpG islands are thought to be involved in the large-scale nuclear organization, such as centromere and telomere clustering. TFBSs, which are enriched in centromeric and pericentromeric heterochromatin regions, are thought to play an important role in the formation of nuclear 3D structures. Our machine learning-based analysis will help us to understand the differential features of nuclear 3D structures in the human and bat genomes.

Subjects

COVID-19 , Chiroptera/genetics , Genome, Human/genetics , SARS-CoV-2/physiology , Animals , COVID-19/transmission , Chiroptera/virology , CpG Islands , Genomics/methods , Heterochromatin/chemistry , Heterochromatin/genetics , Humans , Molecular Conformation , Oligonucleotides/chemistry , Unsupervised Machine Learning

3.

Unsupervised explainable AI for molecular evolutionary study of forty thousand SARS-CoV-2 genomes.

Iwasaki, Yuki; Abe, Takashi; Wada, Kennosuke; Wada, Yoshiko; Ikemura, Toshimichi.

BMC Microbiol ; 22(1): 73, 2022 03 10.

Article in English | MEDLINE | ID: mdl-35272618

ABSTRACT

BACKGROUND: Unsupervised AI (artificial intelligence) can obtain novel knowledge from big data without particular models or prior knowledge and is highly desirable for unveiling hidden features in big data. SARS-CoV-2 poses a serious threat to public health and one important issue in characterizing this fast-evolving virus is to elucidate various aspects of their genome sequence changes. We previously established unsupervised AI, a BLSOM (batch-learning SOM), which can analyze five million genomic sequences simultaneously. The present study applied the BLSOM to the oligonucleotide compositions of forty thousand SARS-CoV-2 genomes. RESULTS: While only the oligonucleotide composition was given, the obtained clusters of genomes corresponded primarily to known main clades and internal divisions in the main clades. Since the BLSOM is explainable AI, it reveals which features of the oligonucleotide composition are responsible for clade clustering. Additionally, BLSOM also provided information concerning the special genomic region possibly undergoing RNA modifications. CONCLUSIONS: The BLSOM has powerful image display capabilities and enables efficient knowledge discovery about viral evolutionary processes, and it can complement phylogenetic methods based on sequence alignment.

Subjects

COVID-19 , SARS-CoV-2 , Artificial Intelligence , Evolution, Molecular , Humans , Phylogeny , SARS-CoV-2/genetics

4.

Human cell-dependent, directional, time-dependent changes in the mono- and oligonucleotide compositions of SARS-CoV-2 genomes.

Iwasaki, Yuki; Abe, Takashi; Ikemura, Toshimichi.

BMC Microbiol ; 21(1): 89, 2021 03 23.

Article in English | MEDLINE | ID: mdl-33757449

ABSTRACT

BACKGROUND: When a virus that has grown in a nonhuman host starts an epidemic in the human population, human cells may not provide growth conditions ideal for the virus. Therefore, the invasion of severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), which is usually prevalent in the bat population, into the human population is thought to have necessitated changes in the viral genome for efficient growth in the new environment. In the present study, to understand host-dependent changes in coronavirus genomes, we focused on the mono- and oligonucleotide compositions of SARS-CoV-2 genomes and investigated how these compositions changed time-dependently in the human cellular environment. We also compared the oligonucleotide compositions of SARS-CoV-2 and other coronaviruses prevalent in humans or bats to investigate the causes of changes in the host environment. RESULTS: Time-series analyses of changes in the nucleotide compositions of SARS-CoV-2 genomes revealed a group of mono- and oligonucleotides whose compositions changed in a common direction for all clades, even though viruses belonging to different clades should evolve independently. Interestingly, the compositions of these oligonucleotides changed towards those of coronaviruses that have been prevalent in humans for a long period and away from those of bat coronaviruses. CONCLUSIONS: Clade-independent, time-dependent changes are thought to have biological significance and should relate to viral adaptation to a new host environment, providing important clues for understanding viral host adaptation mechanisms.

Subjects

Base Composition , Evolution, Molecular , Genome, Viral , SARS-CoV-2/genetics , Animals , Chiroptera/virology , Humans , Oligonucleotides

5.

Notable clustering of transcription-factor-binding motifs in human pericentric regions and its biological significance.

Iwasaki, Yuki; Wada, Kennosuke; Wada, Yoshiko; Abe, Takashi; Ikemura, Toshimichi.

Chromosome Res ; 21(5): 461-74, 2013 Aug.

Article in English | MEDLINE | ID: mdl-23896648

ABSTRACT

Since oligonucleotide composition in the genome sequence varies significantly among species, even among those possessing the same genome G + C%, the composition has been used to distinguish a wide range of genomes and has been called a "genome signature". Oligonucleotides often represent motif sequences responsible for sequence-specific protein binding (e.g., transcription-factor binding). Occurrences of such motif oligonucleotides in the genome should be biased compared to those observed in random sequences and may differ among genomes and genomic portions. Self-Organizing Map (SOM) is a powerful tool for clustering high-dimensional data such as oligonucleotide composition on one plane. We previously modified the conventional SOM for genome informatics into a batch-learning SOM, or "BLSOM". When we constructed BLSOMs to analyze pentanucleotide composition in 20-, 50-, and 100-kb sequences derived from the human genome, BLSOMs did not classify human sequences according to chromosome but revealed several specific zones composed primarily of sequences derived from pericentric regions. Interestingly, various transcription-factor-binding motifs were characteristically overrepresented in pericentric regions but underrepresented in most genomic sequences. When we focused on much shorter sequences (e.g., 1 kb), the clustering of transcription-factor-binding motifs was evident in pericentric, subtelomeric and sex chromosome pseudoautosomal regions. The biological significance of the clustering in these regions was discussed in connection with cell-type and -stage-dependent chromocenter formation and nuclear organization.

Subjects

Binding Sites , Computational Biology/methods , Genome, Human , Genomics/methods , Nucleotide Motifs , Transcription Factors/metabolism , Base Sequence , Chromosome Mapping , Cluster Analysis , Consensus Sequence , Databases, Genetic , Humans

6.

Unsupervised AI reveals insect species-specific genome signatures.

Sawada, Yui; Minei, Ryuhei; Tabata, Hiromasa; Ikemura, Toshimichi; Wada, Kennosuke; Wada, Yoshiko; Nagata, Hiroshi; Iwasaki, Yuki.

PeerJ ; 12: e17025, 2024.

Article in English | MEDLINE | ID: mdl-38464746

ABSTRACT

Insects are a highly diverse phylogeny and possess a wide variety of traits, including the presence or absence of wings and metamorphosis. These diverse traits are of great interest for studying genome evolution, and numerous comparative genomic studies have examined a wide phylogenetic range of insects. Here, we analyzed 22 insects belonging to a wide phylogenetic range (Endopterygota, Paraneoptera, Polyneoptera, Palaeoptera, and other insects) by using a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions in their genomic fragments (100-kb or 1-Mb sequences), which is an unsupervised machine learning algorithm that can extract species-specific characteristics of the oligonucleotide compositions (genome signatures). The genome signature is of particular interest in terms of the mechanisms and biological significance that have caused the species-specific difference, and can be used as a powerful search needle to explore the various roles of genome sequences other than protein coding, and can be used to unveil mysteries hidden in the genome sequence. Since BLSOM is an unsupervised clustering method, the clustering of sequences was performed based on the oligonucleotide composition alone, without providing information about the species from which each fragment sequence was derived. Therefore, not only the interspecies separation, but also the intraspecies separation can be achieved. Here, we have revealed the specific genomic regions with oligonucleotide compositions distinct from the usual sequences of each insect genome, e.g., Mb-level structures found for a grasshopper Schistocerca americana. One aim of this study was to compare the genome characteristics of insects with those of vertebrates, especially humans, which are phylogenetically distant from insects. Recently, humans seem to be the "model organism" for which a large amount of information has been accumulated using a variety of cutting-edge and high-throughput technologies. 
Therefore, it is reasonable to use the abundant information from humans to study insect lineages. The specific regions of Mb length with distinct oligonucleotide compositions have also been previously observed in the human genome. These regions were enriched by transcription factor binding motifs (TFBSs) and hypothesized to be involved in the three-dimensional arrangement of chromosomal DNA in interphase nuclei. The present study characterized the species-specific oligonucleotide compositions (i.e., genome signatures) in insect genomes and identified specific genomic regions with distinct oligonucleotide compositions.

Subjects

Genome, Human , Genome, Insect , Animals , Humans , Phylogeny , Genome, Insect/genetics , Oligonucleotides/genetics , Artificial Intelligence

7.

Systematization of the protein sequence diversity in enzymes related to secondary metabolic pathways in plants, in the context of big data biology inspired by the KNApSAcK motorcycle database.

Ikeda, Shun; Abe, Takashi; Nakamura, Yukiko; Kibinge, Nelson; Hirai Morita, Aki; Nakatani, Atsushi; Ono, Naoaki; Ikemura, Toshimichi; Nakamura, Kensuke; Altaf-Ul-Amin, Md; Kanaya, Shigehiko.

Plant Cell Physiol ; 54(5): 711-27, 2013 May.

Article in English | MEDLINE | ID: mdl-23509110

ABSTRACT

Biology is increasingly becoming a data-intensive science with the recent progress of the omics fields, e.g. genomics, transcriptomics, proteomics and metabolomics. The species-metabolite relationship database, KNApSAcK Core, has been widely utilized and cited in metabolomics research, and chronological analysis of that research work has helped to reveal recent trends in metabolomics research. To meet the needs of these trends, the KNApSAcK database has been extended by incorporating a secondary metabolic pathway database called Motorcycle DB. We examined the enzyme sequence diversity related to secondary metabolism by means of batch-learning self-organizing maps (BL-SOMs). Initially, we constructed a map by using a big data matrix consisting of the frequencies of all possible dipeptides in the protein sequence segments of plants and bacteria. The enzyme sequence diversity of the secondary metabolic pathways was examined by identifying clusters of segments associated with certain enzyme groups in the resulting map. The extent of diversity of 15 secondary metabolic enzyme groups is discussed. Data-intensive approaches such as BL-SOM applied to big data matrices are needed for systematizing protein sequences. Handling big data has become an inevitable part of biology.

Subjects

Databases as Topic , Plant Proteins/chemistry , Plants/metabolism , Secondary Metabolism , Alkaloids/metabolism , Alkyl and Aryl Transferases/metabolism , Amino Acid Sequence , Cytochrome P-450 Enzyme System/metabolism , Flavonoids/metabolism , Metabolomics , Peptides/chemistry , Plants/enzymology

8.

Rice Annotation Project Database (RAP-DB): an integrative and interactive database for rice genomics.

Sakai, Hiroaki; Lee, Sung Shin; Tanaka, Tsuyoshi; Numa, Hisataka; Kim, Jungsok; Kawahara, Yoshihiro; Wakimoto, Hironobu; Yang, Ching-chia; Iwamoto, Masao; Abe, Takashi; Yamada, Yuko; Muto, Akira; Inokuchi, Hachiro; Ikemura, Toshimichi; Matsumoto, Takashi; Sasaki, Takuji; Itoh, Takeshi.

Plant Cell Physiol ; 54(2): e6, 2013 Feb.

Article in English | MEDLINE | ID: mdl-23299411

ABSTRACT

The Rice Annotation Project Database (RAP-DB, http://rapdb.dna.affrc.go.jp/) has been providing a comprehensive set of gene annotations for the genome sequence of rice, Oryza sativa (japonica group) cv. Nipponbare. Since the first release in 2005, RAP-DB has been updated several times along with the genome assembly updates. Here, we present our newest RAP-DB based on the latest genome assembly, Os-Nipponbare-Reference-IRGSP-1.0 (IRGSP-1.0), which was released in 2011. We detected 37,869 loci by mapping transcript and protein sequences of 150 monocot species. To provide plant researchers with highly reliable and up to date rice gene annotations, we have been incorporating literature-based manually curated data, and 1,626 loci currently incorporate literature-based annotation data, including commonly used gene names or gene symbols. Transcriptional activities are shown at the nucleotide level by mapping RNA-Seq reads derived from 27 samples. We also mapped the Illumina reads of a Japanese leading japonica cultivar, Koshihikari, and a Chinese indica cultivar, Guangluai-4, to the genome and show alignments together with the single nucleotide polymorphisms (SNPs) and gene functional annotations through a newly developed browser, Short-Read Assembly Browser (S-RAB). We have developed two satellite databases, Plant Gene Family Database (PGFD) and Integrative Database of Cereal Gene Phylogeny (IDCGP), which display gene family and homologous gene relationships among diverse plant species. RAP-DB and the satellite databases offer simple and user-friendly web interfaces, enabling plant and genome researchers to access the data easily and facilitating a broad range of plant research topics.

Subjects

Databases, Genetic , Molecular Sequence Annotation , Oryza/genetics , Base Sequence , Gene Expression Profiling , Genes, Plant , Genetic Loci , Genomics/methods , Microsatellite Repeats , Molecular Sequence Data , Oryza/classification , Phylogeny , Polymorphism, Single Nucleotide , Search Engine , Sequence Homology

9.

Novel bioinformatics strategies for prediction of directional sequence changes in influenza virus genomes and for surveillance of potentially hazardous strains.

Iwasaki, Yuki; Abe, Takashi; Wada, Yoshiko; Wada, Kennosuke; Ikemura, Toshimichi.

BMC Infect Dis ; 13: 386, 2013 Aug 21.

Article in English | MEDLINE | ID: mdl-23964903

ABSTRACT

BACKGROUND: With the remarkable increase of microbial and viral sequence data obtained from high-throughput DNA sequencers, novel tools are needed for comprehensive analysis of the big sequence data. We have developed the "Batch-Learning Self-Organizing Map (BLSOM)", which can characterize very many, even millions of, genomic sequences on one plane. Influenza virus is a zoonotic virus and shows clear host tropism. Important issues for bioinformatics studies of influenza viruses are prediction of genomic sequence changes in the near future and surveillance of potentially hazardous strains. METHODS: To characterize sequence changes in influenza virus genomes after invasion into humans from other animal hosts, we applied BLSOMs to analyses of mono-, di-, tri-, and tetranucleotide compositions in all genome sequences of influenza A and B viruses and found clear host-dependent clustering (self-organization) of the sequences. RESULTS: Viruses isolated from humans and birds differed from each other in mononucleotide composition. In addition, host-dependent oligonucleotide compositions that could not be explained by the host-dependent mononucleotide composition were revealed by oligonucleotide BLSOMs. Retrospective time-dependent directional changes of mono- and oligonucleotide compositions, which were visualized for human strains on BLSOMs, could provide predictive information about sequence changes in newly invaded viruses from other animal hosts (e.g. the swine-derived pandemic H1N1/09). CONCLUSIONS: Based on the host-dependent oligonucleotide composition, we proposed a strategy for prediction of directional changes of virus sequences and for surveillance of potentially hazardous strains when introduced into human populations from non-human sources. Millions of genomic sequences from infectious microbes and viruses have become available because of their medical and social importance, and BLSOM can characterize the big data and support efficient knowledge discovery.

Subjects

Genome, Viral , Genomics/methods , Influenza A virus/genetics , Influenza B virus/genetics , Influenza, Human/virology , Databases, Genetic , Humans , Influenza A Virus, H1N1 Subtype/genetics , Models, Genetic , RNA, Viral , Retrospective Studies , Sequence Analysis, RNA , Viral Tropism

10.

tRNADB-CE 2011: tRNA gene database curated manually by experts.

Abe, Takashi; Ikemura, Toshimichi; Sugahara, Junichi; Kanai, Akio; Ohara, Yasuo; Uehara, Hiroshi; Kinouchi, Makoto; Kanaya, Shigehiko; Yamada, Yuko; Muto, Akira; Inokuchi, Hachiro.

Nucleic Acids Res ; 39(Database issue): D210-3, 2011 Jan.

Article in English | MEDLINE | ID: mdl-21071414

ABSTRACT

We updated the tRNADB-CE by analyzing 939 complete and 1301 draft genomes of prokaryotes and eukaryotes, 171 complete virus genomes, 121 complete chloroplast genomes and approximately 230 million sequences obtained by metagenome analyses of 210 environmental samples. In total, 287,102 tRNA genes, twice the number compiled previously, are now compiled; their sequence information, cloverleaf structures and results of sequence similarity and oligonucleotide-pattern searches can be browsed. In order to pool collective knowledge with help from any experts in the tRNA research field, we included a column to which comments can be added on each tRNA gene. By compiling tRNAs of known prokaryotes with identical sequences, we found high phylogenetic preservation of tRNA sequences, especially at a phylum level. Furthermore, a large number of tRNAs obtained by metagenome analyses of environmental samples had sequences identical to those found in known prokaryotes. The identical sequence group, therefore, can be used as phylogenetic markers to clarify the microbial community structure of an ecosystem. The updated tRNADB-CE provides functions with which users can themselves obtain phylotype-specific markers (e.g. genus-specific markers) and clarify microbial community structures of ecosystems in detail. tRNADB-CE can be accessed freely at http://trna.nagahama-i-bio.ac.jp.

Subjects

Databases, Nucleic Acid , RNA, Transfer/genetics , Genes , Genomics , Metagenomics , Phylogeny , RNA, Transfer/chemistry , RNA, Transfer/classification , Sequence Analysis, DNA

11.

AI-based search for convergently expanding, advantageous mutations in SARS-CoV-2 by focusing on oligonucleotide frequencies.

Ikemura, Toshimichi; Iwasaki, Yuki; Wada, Kennosuke; Wada, Yoshiko; Abe, Takashi.

PLoS One ; 17(8): e0273860, 2022.

Article in English | MEDLINE | ID: mdl-36044525

ABSTRACT

Among mutations that occur in SARS-CoV-2, efficient identification of mutations advantageous for viral replication and transmission is important to characterize and defeat this rampant virus. Mutations that rapidly expand in frequency in a viral population are candidates for advantageous mutations, but neutral mutations hitchhiking with advantageous mutations are also likely to be included. To distinguish these, we focus on mutations that appear to occur independently in different lineages and expand in frequency in a convergent evolutionary manner. Batch-learning SOM (BLSOM) can separate SARS-CoV-2 genome sequences by lineage when given only the oligonucleotide composition. Focusing on remarkably expanding 20-mers, each of which is represented by only one copy in the viral genome, allows us to correlate the expanding 20-mers to mutations. Using visualization functions in BLSOM, we can efficiently identify mutations that have expanded remarkably both in the Omicron lineage, which is phylogenetically distinct from other lineages, and in other lineages. Most of these mutations involved changes in amino acids, but there were a few that did not, such as an intergenic mutation.

Subjects

COVID-19 , Mutation , Oligonucleotides , SARS-CoV-2 , Artificial Intelligence , COVID-19/genetics , Genome, Viral , Humans , Machine Learning , Oligonucleotides/genetics , Phylogeny , SARS-CoV-2/genetics , Spike Glycoprotein, Coronavirus/genetics

12.

tRNADB-CE: tRNA gene database curated manually by experts.

Abe, Takashi; Ikemura, Toshimichi; Ohara, Yasuo; Uehara, Hiroshi; Kinouchi, Makoto; Kanaya, Shigehiko; Yamada, Yuko; Muto, Akira; Inokuchi, Hachiro.

Nucleic Acids Res ; 37(Database issue): D163-8, 2009 Jan.

Article in English | MEDLINE | ID: mdl-18842632

ABSTRACT

We constructed a new large-scale database of tRNA genes by analyzing 534 complete genomes of prokaryotes and 394 draft genomes in the WGS (Whole Genome Shotgun) division of DDBJ/EMBL/GenBank and approximately 6.2 million DNA fragment sequences obtained from metagenomic analyses. This exhaustive search for tRNA genes was performed by running three computer programs to enhance completeness and accuracy of the prediction. Discordances of assignment among the three programs were found for approximately 4% of the total tRNA gene candidates obtained from the prokaryote genomes analyzed. The discordant cases were manually checked by experts in the tRNA experimental field. In total, 144,061 tRNA genes were registered in the database 'tRNADB-CE', more than four times the number previously reported by the database built from analyses of complete genomes with the tRNAscan-SE program. The tRNADB-CE allows for browsing sequence information, cloverleaf structures and results of similarity searches among all tRNA genes. For each of the complete genomes, the number of tRNA genes for individual anticodons, the codon usage frequency in all protein genes and the positioning of individual tRNA genes in each genome can be browsed. tRNADB-CE can be accessed freely at http://trna.nagahama-i-bio.ac.jp.

Subjects

Databases, Nucleic Acid , Genes, Archaeal , Genes, Bacterial , RNA, Transfer/genetics , Genomics

13.

AI for the collective analysis of a massive number of genome sequences: various examples from the small genome of pandemic SARS-CoV-2 to the human genome.

Ikemura, Toshimichi; Iwasaki, Yuki; Wada, Kennosuke; Wada, Yoshiko; Abe, Takashi.

Genes Genet Syst ; 96(4): 165-176, 2021 Dec 16.

Article in English | MEDLINE | ID: mdl-34565757

ABSTRACT

In genetics and related fields, huge amounts of data, such as genome sequences, are accumulating, and the use of artificial intelligence (AI) suitable for big data analysis has become increasingly important. Unsupervised AI that can reveal novel knowledge from big data without prior knowledge or particular models is highly desirable for analyses of genome sequences, particularly for obtaining unexpected insights. We have developed a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions that can reveal various novel genome characteristics. Here, we explain the data mining by the BLSOM: an unsupervised AI. As a specific target, we first selected SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) because a large number of viral genome sequences have been accumulated via worldwide efforts. We analyzed more than 0.6 million sequences collected primarily in the first year of the pandemic. BLSOMs for short oligonucleotides (e.g., 4-6-mers) allowed separation into known clades, but longer oligonucleotides further increased the separation ability and revealed subgrouping within known clades. In the case of 15-mers, there is mostly one copy in the genome; thus, 15-mers that appeared after the epidemic started could be connected to mutations, and the BLSOM for 15-mers revealed the mutations that contributed to separation into known clades and their subgroups. After introducing the detailed methodological strategies, we explain BLSOMs for various topics, such as the tetranucleotide BLSOM for over 5 million 5-kb fragment sequences derived from almost all microorganisms currently available and its use in metagenome studies. We also explain BLSOMs for various eukaryotes, including fishes, frogs and Drosophila species, and found a high separation ability among closely related species. When analyzing the human genome, we found enrichments in transcription factor-binding sequences in centromeric and pericentromeric heterochromatin regions. 
The tDNAs (tRNA genes) could be separated according to their corresponding amino acid.

Subjects

Artificial Intelligence , Computational Biology/methods , Genome, Human , Genome, Viral , SARS-CoV-2/genetics , Cluster Analysis , Codon Usage , Humans , Metagenomics/methods , Mutation , RNA, Transfer , Time Factors

14.

Comparative genomics of Glandirana rugosa using unsupervised AI reveals a high CG frequency.

Katsura, Yukako; Ikemura, Toshimichi; Kajitani, Rei; Toyoda, Atsushi; Itoh, Takehiko; Ogata, Mitsuaki; Miura, Ikuo; Wada, Kennosuke; Wada, Yoshiko; Satta, Yoko.

Life Sci Alliance ; 4(5), 2021 05.

Article in English | MEDLINE | ID: mdl-33712508

ABSTRACT

The Japanese wrinkled frog (Glandirana rugosa) is unique in having both XX-XY and ZZ-ZW types of sex chromosomes within the species. Genome sequencing and comparative genomics with other frogs should be important for understanding the mechanisms of sex chromosome turnover within one species or during a short period. In this study, we analyzed the newly sequenced genome of G. rugosa using a batch-learning self-organizing map, which is an unsupervised artificial intelligence for oligonucleotide compositions. To clarify genome characteristics of G. rugosa, we compared its short oligonucleotide compositions in all 1-Mb genomic fragments with those of six other frog species (Pyxicephalus adspersus, Rhinella marina, Spea multiplicata, Leptobrachium leishanense, Xenopus laevis, and Xenopus tropicalis). In G. rugosa, we found Mb-level repeat sequences having a high identity with the W chromosome of the African bullfrog (P. adspersus). Our study concluded that G. rugosa has unique genome characteristics with a high CG frequency, and its genome is assumed to heterochromatinize a large portion of the genome via methylation of CG.

Subjects

Base Composition/genetics , Ranidae/genetics , Sex Chromosomes/genetics , Animals , Base Sequence/genetics , Female , Genomics/methods , Male , Phylogeny , Unsupervised Machine Learning

15.

Implication of a new function of human tDNAs in chromatin organization.

Iwasaki, Yuki; Ikemura, Toshimichi; Kurokawa, Ken; Okada, Norihiro.

Sci Rep ; 10(1): 17440, 2020 10 15.

Article in English | MEDLINE | ID: mdl-33060757

ABSTRACT

Transfer RNA genes (tDNAs) are essential genes that encode tRNAs in all species. To understand new functions of tDNAs, other than that of encoding tRNAs, we used ENCODE data to examine binding characteristics of transcription factors (TFs) for all tDNA regions (489 loci) in the human genome. We divided the tDNAs into three groups based on the number of TFs that bound to them. At the two extremes were tDNAs to which many TFs bound (Group 1) and those to which no TFs bound (Group 3). Several TFs involved in chromatin remodeling such as ATF3, EP300 and TBL1XR1 bound to almost all Group 1 tDNAs. Furthermore, almost all Group 1 tDNAs included DNase I hypersensitivity sites and may thus interact with other chromatin regions through their bound TFs, and they showed highly conserved synteny across tetrapods. In contrast, Group 3 tDNAs did not possess these characteristics. These data suggest the presence of a previously uncharacterized function of these tDNAs. We also examined binding of CTCF to tDNAs and their involvement in topologically associating domains (TADs) and lamina-associated domains (LADs), which suggest a new perspective on the evolution and function of tDNAs.

Subjects

Chromatin/chemistry , Transfer RNA/metabolism , Transcription Factors/metabolism , A549 Cells , Activating Transcription Factor 3/metabolism , Amino Acid Motifs , Computational Biology , Factual Databases , E1A-Associated p300 Protein/metabolism , Human Genome , HeLa Cells , Hep G2 Cells , Humans , K562 Cells , Protein Domains , Cytoplasmic and Nuclear Receptors/metabolism , Repressor Proteins/metabolism , Synteny , TFIII Transcription Factors/metabolism

16.

Time-series analyses of directional sequence changes in SARS-CoV-2 genomes and an efficient search method for candidates for advantageous mutations for growth in human cells.

Wada, Kennosuke; Wada, Yoshiko; Ikemura, Toshimichi.

Gene ; 763S: 100038, 2020 Dec.

Article in English | MEDLINE | ID: mdl-34493367

ABSTRACT

We first conducted time-series analysis of mono- and dinucleotide composition for over 10,000 SARS-CoV-2 genomes, as well as over 1500 Zaire ebolavirus genomes, and found clear time-series changes in the compositions on a monthly basis, which should reflect viral adaptations for efficient growth in human cells. We next developed a sequence alignment-free method that extensively searches for advantageous mutations and ranks them by the level of increase in their intrapopulation frequency. Time-series analysis of occurrences of oligonucleotides of diverse lengths for SARS-CoV-2 genomes revealed seven distinctive mutations that rapidly expanded their intrapopulation frequency and are thought to be candidates for advantageous mutations for efficient growth in human cells.
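
The alignment-free idea summarized above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published method: the abstract does not specify the ranking criterion, so the sketch ranks oligonucleotides by the difference in the fraction of genomes containing them between two months. The function names and the `(month, sequence)` input format are hypothetical.

```python
from collections import defaultdict


def monthly_kmer_prevalence(genomes, k):
    """genomes: iterable of (month, sequence) pairs, month like '2020-03'.

    Returns {kmer: {month: fraction of that month's genomes containing it}}.
    """
    totals = defaultdict(int)
    present = defaultdict(lambda: defaultdict(int))
    for month, seq in genomes:
        totals[month] += 1
        # A set counts each k-mer at most once per genome.
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            present[kmer][month] += 1
    return {
        kmer: {m: by_month.get(m, 0) / totals[m] for m in totals}
        for kmer, by_month in present.items()
    }


def rank_by_increase(prevalence, first, last):
    """Sort k-mers by the rise in prevalence between two months (assumed criterion)."""
    gains = {kmer: s[last] - s[first] for kmer, s in prevalence.items()}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
```

No alignment is needed because a mutation creates new oligonucleotides that are absent from earlier genomes, so a rising prevalence of a k-mer points to the site that changed.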

Subjects

COVID-19/genetics , Viral Genome/genetics , Viral RNA/genetics , SARS-CoV-2/genetics , COVID-19/pathology , Humans , Mutation/genetics , Oligonucleotides/genetics , SARS-CoV-2/pathogenicity , Sequence Alignment

17.

Mb-level CpG and TFBS islands visualized by AI and their roles in the nuclear organization of the human genome.

Wada, Kennosuke; Wada, Yoshiko; Ikemura, Toshimichi.

Genes Genet Syst ; 95(1): 29-41, 2020 Apr 22.

Article in English | MEDLINE | ID: mdl-32161227

ABSTRACT

Unsupervised machine learning that can discover novel knowledge from big sequence data without prior knowledge or particular models is highly desirable for current genome study. We previously established a batch-learning self-organizing map (BLSOM) for oligonucleotide compositions, which can reveal various novel genome characteristics from big sequence data, and found that transcription factor binding sequences (TFBSs) and CpG-containing oligonucleotides are enriched in human centromeric and pericentromeric regions, which support centromere clustering and form the condensed heterochromatin "chromocenter" in interphase nuclei. The number and size of chromocenters, as well as the type of centromeres gathered in individual chromocenters, vary depending on cell type. To study molecular mechanisms of cell type-dependent chromocenter formation, we analyzed distribution patterns of occurrence per Mb of hexa- and heptanucleotide TFBSs, which have been compiled by the SwissRegulon Portal, and of CpG-containing oligonucleotides. We found Mb-level islands enriched for TFBSs and CpG-containing oligonucleotides in centromeric and pericentromeric regions on all human chromosomes except chrY. Considering molecular mechanisms for cell type-dependent centromere clustering, the chromosome-dependent enrichment of a set of TFBSs and CpG-containing oligonucleotides is of particular interest, since the cellular content of TFs and methyl-CpG-binding proteins exhibits cell type-dependent regulation. A newly introduced BLSOM, which analyzed occurrences of a total of 3,946 octanucleotide TFBSs compiled by the SwissRegulon Portal, has self-organized (separated) the sequences that are characteristically enriched in TFBSs and shown that these sequences are derived primarily from centromeric and pericentromeric constitutive heterochromatin regions. Furthermore, the BLSOM identified and visualized characteristic TFBSs that are enriched in these regions. By analyzing Hi-C data for interchromosomal interactions, the present study showed that the chromatin segments supporting the interchromosomal interactions are located primarily in Mb-level TFBS and CpG islands and are thus enriched for a wide variety of TFBSs and CpG-containing oligonucleotides.

Subjects

Artificial Intelligence , Human Chromosomes/genetics , CpG Islands/genetics , Human Genome/genetics , Binding Sites , Centromere/genetics , Heterochromatin/genetics , Humans , Oligonucleotides/genetics , Protein Binding , Transcription Factors/genetics , Transcription Factors/metabolism

18.

Time-series analyses of directional sequence changes in SARS-CoV-2 genomes and an efficient search method for candidates for advantageous mutations for growth in human cells.

Wada, Kennosuke; Wada, Yoshiko; Ikemura, Toshimichi.

Gene X ; 5: 100038, 2020 Dec.

Article in English | MEDLINE | ID: mdl-32835214

ABSTRACT

We first conducted time-series analysis of mono- and dinucleotide composition for over 10,000 SARS-CoV-2 genomes, as well as over 1500 Zaire ebolavirus genomes, and found clear time-series changes in the compositions on a monthly basis, which should reflect viral adaptations for efficient growth in human cells. We next developed a sequence alignment-free method that extensively searches for advantageous mutations and ranks them by the level of increase in their intrapopulation frequency. Time-series analysis of occurrences of oligonucleotides of diverse lengths for SARS-CoV-2 genomes revealed seven distinctive mutations that rapidly expanded their intrapopulation frequency and are thought to be candidates for advantageous mutations for efficient growth in human cells.

19.

A strategy for predicting gene functions from genome and metagenome sequences on the basis of oligopeptide frequency distance.

Abe, Takashi; Ikarashi, Ryo; Mizoguchi, Masaya; Otake, Masashi; Ikemura, Toshimichi.

Genes Genet Syst ; 95(1): 11-19, 2020 Apr 22.

Article in English | MEDLINE | ID: mdl-32161228

ABSTRACT

As a result of the extensive decoding of a massive amount of genomic and metagenomic sequence data, a large number of genes whose functions cannot be predicted by sequence similarity searches are accumulating, and such genes are of little use to science or industry. Current genome and metagenome sequencing largely depends on high-throughput and low-cost methods. In the case of genome sequencing for a single species, high-density sequencing can reduce sequencing errors. For metagenome sequences, however, high-density sequencing does not necessarily increase the sequence quality because multiple and unknown genomes, including those of closely related species, are likely to exist in the sample. Therefore, a function prediction method that is robust against sequence errors is increasingly needed. Here, we present a method for predicting protein gene function that does not depend on sequence similarity searches. Using an unsupervised machine learning method called BLSOM (batch-learning self-organizing map) for short oligopeptide frequencies, we previously developed a sequence alignment-free method for clustering bacterial protein genes according to clusters of orthologous groups of proteins (COGs), without using information from COGs during machine learning. This allows function-unknown proteins to cluster with function-known proteins, based solely on similarity with respect to oligopeptide frequency, although the method required high-performance supercomputers (HPCs). Based on a wide range of knowledge obtained with HPCs, we have now developed a strategy to correlate function-unknown proteins with COG categories, using only oligopeptide frequency distances (OPDs), which can be conducted with PC-level computers. The OPD strategy is suitable for predicting the functions of proteins with low sequence similarity and is applied here to predict the functions of a large number of gene candidates discovered using metagenome sequencing.
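
The notion of comparing proteins by oligopeptide frequency rather than by alignment can be sketched as follows. The abstract does not state how the oligopeptide frequency distance is defined, so this example assumes a Euclidean distance between dipeptide frequency vectors and a nearest-reference assignment; the function names and the category labels in the usage note are hypothetical placeholders.

```python
import math
from collections import Counter


def peptide_freqs(protein: str, k: int = 2) -> dict[str, float]:
    """Relative frequency of each k-residue oligopeptide in a protein."""
    counts = Counter(protein[i:i + k] for i in range(len(protein) - k + 1))
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()} if total else {}


def distance(f1: dict[str, float], f2: dict[str, float]) -> float:
    """Euclidean distance between two sparse frequency vectors (assumed metric)."""
    keys = f1.keys() | f2.keys()
    return math.sqrt(sum((f1.get(x, 0.0) - f2.get(x, 0.0)) ** 2 for x in keys))


def nearest_category(query: str, references, k: int = 2) -> str:
    """references: list of (category_label, protein_sequence) pairs."""
    q = peptide_freqs(query, k)
    best = min(references, key=lambda r: distance(q, peptide_freqs(r[1], k)))
    return best[0]
```

For example, with toy references `[("cat_A", "MKKLLKKLL"), ("cat_B", "MGGSGGSGG")]`, the query `"MKKLKKLL"` is assigned to `cat_A` because its dipeptide profile overlaps only with the first reference. Such a comparison is cheap enough to run on an ordinary PC, which is the practical point the abstract makes.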

Subjects

Algorithms , Eukaryota/genetics , Metagenome/genetics , Metagenomics , Molecular Sequence Annotation , Oligopeptides/genetics , Amino Acid Sequence , Chromosome Mapping , Proteins/genetics , Proteins/metabolism , Sequence Alignment

20.

Heterogeneity of synonymous substitution rates in the Xenopus frog genome.

Lau, Quintin; Igawa, Takeshi; Ogino, Hajime; Katsura, Yukako; Ikemura, Toshimichi; Satta, Yoko.

PLoS One ; 15(8): e0236515, 2020.

Article in English | MEDLINE | ID: mdl-32764757

ABSTRACT

With the increasing availability of high quality genomic data, there is an opportunity to deeply explore the genealogical relationships of different gene loci between closely related species. In this study, we utilized genomes of Xenopus laevis (XLA, a tetraploid species with (L) and (S) sub-genomes) and X. tropicalis (XTR, a diploid species) to investigate whether synonymous substitution rates among orthologous or homoeologous genes displayed any heterogeneity. From over 1500 orthologous/homoeologous genes collected, we calculated the proportion of synonymous substitutions between genomes/sub-genomes (k) and found variation within and between chromosomes. Within most chromosomes, we identified higher k with distance from the centromere, likely attributed to higher substitution rates and recombination in these regions. Using maximum likelihood methods, we identified further evidence supporting rate heterogeneity, and estimated species divergence times and ancestral population sizes. Estimated species divergence times (XLA.L-XLA.S: ~25.5 mya; XLA-XTR: ~33.0 mya) were slightly younger compared to a past study, attributed to consideration of population size in our study. Meanwhile, we found a very large estimated population size in the ancestral populations of the two species (N_A = 2.55 × 10^6). Local hybridization and population structure, which have not yet been well elucidated in frogs, may be a contributing factor to these possible large population sizes.

Subjects

Molecular Evolution , Genome/genetics , Silent Mutation/genetics , Xenopus laevis/genetics , Animals , Chromosomes , Genetic Heterogeneity , Genetic Hybridization , Fluorescence In Situ Hybridization , Phylogeny
