RESUMO
BACKGROUND: Comprehensively understanding the dynamics of biological systems is among the biggest current challenges in biology and medicine. To acquire this understanding, researchers have measured the time-series expression profiles of cell lines of various organisms. Biological technologies have also drastically improved, providing a huge amount of information with support from bioinformatics and systems biology. However, the transitions between the activation and inactivation of gene regulations, at the temporal resolution of single time points, are difficult to extract from time-course gene expression profiles. RESULTS: Our proposed method reports the activation period of each gene regulation from gene expression profiles and a gene regulatory network. The correctness and effectiveness of the method were validated by analyzing the diauxic shift from glucose to lactose in Escherichia coli. The method completely detected the three periods of the shift; 1) consumption of glucose as nutrient source, 2) the period of seeking another nutrient source and 3) consumption of lactose as nutrient source. We then applied the method to mouse adipocyte differentiation data. Cell differentiation into adipocytes is known to involve two waves of the gene regulation cascade, and sub-waves are predicted. From the gene expression profiles of the cell differentiation process from ES to adipose cells (62 time points), our method acquired four periods; three periods covering the two known waves of the cascade, and a final period of gene regulations when the differentiation to adipocytes was completed. CONCLUSIONS: Our proposed method identifies the transitions of gene regulations from time-series gene expression profiles. Dynamic analyses are essential for deep understanding of biological systems and for identifying the causes of the onset of diseases such as diabetes and osteoporosis. The proposed method can greatly contribute to the progress of biology and medicine.
Assuntos
Adipócitos/citologia , Diferenciação Celular/genética , Escherichia coli/genética , Regulação da Expressão Gênica , Adipócitos/metabolismo , Algoritmos , Animais , Automação , Regulação Bacteriana da Expressão Gênica , Redes Reguladoras de Genes , Genes Bacterianos , Camundongos , Fatores de TempoRESUMO
Recent studies have shown that environmental DNA is found almost everywhere. Flower petal surfaces are an attractive tissue to use for investigation of the dispersal of environmental DNA in nature as they are isolated from the external environment until the bud opens and only then can the petal surface accumulate environmental DNA. Here, we performed a crowdsourced experiment, the "Ohanami Project", to obtain environmental DNA samples from petal surfaces of Cerasus × yedoensis 'Somei-yoshino' across the Japanese archipelago during spring 2015. C. × yedoensis is the most popular garden cherry species in Japan and clones of this cultivar bloom simultaneously every spring. Data collection spanned almost every prefecture and totaled 577 DNA samples from 149 collaborators. Preliminary amplicon-sequencing analysis showed the rapid attachment of environmental DNA onto the petal surfaces. Notably, we found DNA of other common plant species in samples obtained from a wide distribution; this DNA likely originated from the pollen of the Japanese cedar. Our analysis supports our belief that petal surfaces after blossoming are a promising target to reveal the dynamics of environmental DNA in nature. The success of our experiment also shows that crowdsourced environmental DNA analyses have considerable value in ecological studies.
Assuntos
DNA de Plantas/genética , DNA/genética , Meio Ambiente , Flores/genética , Prunus/genética , Cloroplastos/genética , Cianobactérias/genética , Flores/microbiologia , Japão , Proteobactérias/genética , Prunus/microbiologia , Alinhamento de Sequência , Análise de Sequência de DNARESUMO
BACKGROUND: Bayesian networks (BNs) have been widely used to estimate gene regulatory networks. Many BN methods have been developed to estimate networks from microarray data. However, two serious problems reduce the effectiveness of current BN methods. The first problem is that BN-based methods require huge computational time to estimate large-scale networks. The second is that the estimated network cannot have cyclic structures, even if the actual network has such structures. RESULTS: In this paper, we present a novel BN-based deterministic method with reduced computational time that allows cyclic structures. Our approach generates all the combinational triplets of genes, estimates networks of the triplets by BN, and unites the networks into a single network containing all genes. This method decreases the search space of predicting gene regulatory networks without degrading the solution accuracy compared with the greedy hill climbing (GHC) method. The order of computational time is the cube of number of genes. In addition, the network estimated by our method can include cyclic structures. CONCLUSIONS: We verified the effectiveness of the proposed method for all known gene regulatory networks and their expression profiles. The results demonstrate that this approach can predict regulatory networks with reduced computational time without degrading the solution accuracy compared with the GHC method.
Assuntos
Teorema de Bayes , Biologia Computacional/métodos , Redes Reguladoras de Genes/genéticaRESUMO
BACKGROUND: With the advent of next-generation sequencers, the growing demands to map short DNA sequences to a genome have promoted the development of fast algorithms and tools. The tools commonly used today are based on either a hash table or the suffix array/Burrow-Wheeler transform. These algorithms are the best suited to finding the genome position of exactly matching short reads. However, they have limited capacity to handle the mismatches. To find n-mismatches, they requires O(2n) times the computation time of exact matches. Therefore, acceleration techniques are required. RESULTS: We propose a hash-based method for genome mapping that reduces the number of hash references for finding mismatches without increasing the size of the hash table. The method regards DNA subsequences as words on Galois extension field GF(2²) and each word is encoded to a code word of a perfect Hamming code. The perfect Hamming code defines equivalence classes of DNA subsequences. Each equivalence class includes subsequence whose corresponding words on GF(2²) are encoded to a corresponding code word. The code word is used as a hash key to store these subsequences in a hash table. Specifically, it reduces by about 70% the number of hash keys necessary for searching the genome positions of all 2-mismatches of 21-base-long DNA subsequence. CONCLUSIONS: The paper shows perfect hamming code can reduce the number of hash references for hash-based genome mapping. As the computation time to calculate code words is far shorter than a hash reference, our method is effective to reduce the computation time to map short DNA sequences to genome. The amount of data that DNA sequencers generate continues to increase and more accurate genome mappings are required. Thus our method will be a key technology to develop faster genome mapping software.
Assuntos
Algoritmos , Mapeamento Cromossômico/métodos , Sequência de Bases , Ferramenta de Busca , Análise de Sequência de DNARESUMO
BACKGROUND: Peritoneal relapse is the most common pattern of tumor progression in advanced gastric cancer. Clinicopathological findings are sometimes inadequate for predicting peritoneal relapse. The aim of this study was to identify patients at high risk of peritoneal relapse in a prospective study based on molecular prediction. METHODS: RNA samples from 141 primary gastric cancer tissues after curative surgery were profiled using oligonucleotide microarrays covering 30,000 human probes. Firstly, we constructed a molecular prediction system and validated its robustness and prognostic validity by 500 times multiple validation by repeated random sampling in a retrospective set of 56 (38 relapse-free and 18 peritoneal-relapse) patients. Secondly, we applied this prediction to 85 patients of the prospective set to assess predictive accuracy and prognostic validity. RESULTS: In the retrospective phase, repeated random validation yielded approximately 68% predictive accuracy and a 22-gene expression profile associated with peritoneal relapse was identified. The prediction system identified patients with poor prognosis. In the prospective phase, the molecular prediction yielded 76.9% overall accuracy. Kaplan-Meier analysis of peritoneal-relapse-free survival showed a significant difference between the "good signature group" and "poor signature group" (log-rank p = 0.0017). Multivariate analysis by Cox regression hazards model identified the molecular prediction as the only independent prognostic factor for peritoneal relapse. CONCLUSIONS: Gene expression profile inherent to primary gastric cancer tissues can be useful in prospective prediction of peritoneal relapse after curative surgery, potentially allowing individualized postoperative management to improve the prognosis of patients with advanced gastric cancer.
Assuntos
Biomarcadores Tumorais/genética , Perfilação da Expressão Gênica , Recidiva Local de Neoplasia/genética , Neoplasias Peritoneais/genética , Neoplasias Gástricas/genética , Idoso , Biomarcadores Tumorais/metabolismo , Feminino , Seguimentos , Humanos , Metástase Linfática , Masculino , Recidiva Local de Neoplasia/patologia , Recidiva Local de Neoplasia/cirurgia , Análise de Sequência com Séries de Oligonucleotídeos , Neoplasias Peritoneais/patologia , Neoplasias Peritoneais/cirurgia , Estudos Prospectivos , Estudos Retrospectivos , Neoplasias Gástricas/patologia , Neoplasias Gástricas/cirurgia , Taxa de Sobrevida , Resultado do Tratamento , Estudos de Validação como AssuntoRESUMO
Efficient execution of data-intensive workflows has been playing an important role in bioinformatics as the amount of data has been rapidly increasing. The execution of such workflows must take into account the volume and pattern of communication. When orchestrating data-centric workflows, a centralized workflow engine can become a bottleneck to performance. To cope with the bottleneck, a hybrid approach with choreography for data management of workflows is proposed. However, when a workflow includes many repetitive operations, the approach might not gain good performance because of the overheads of its additional mechanism. This paper presents and evaluates an improvement of the hybrid approach for managing a large amount of data. The performance of the proposed method is demonstrated by measuring execution times of example workflows.
Assuntos
Biologia Computacional , Sistemas de Gerenciamento de Base de DadosRESUMO
With the advancement of genome research, it is becoming clear that genes are not distributed on the genome in random order. Clusters of genes distributed at localized genome positions have been reported in several eukaryotes. Various correlations have been observed between the expressions of genes in adjacent or nearby positions along the chromosomes depending on tissue type and developmental stage. Moreover, in several cases, their transcripts, which control epigenetic transcription via processes such as transcriptional interference and genomic imprinting, occur in clusters. It is reasonable that genomic regions that have similar mechanisms show similar expression patterns and that the characteristics of expression in the same genomic regions differ depending on tissue type and developmental stage. In this study, we analyzed gene expression patterns using the cap analysis gene expression (CAGE) method for exploring systematic views of the mouse transcriptome. Counting the number of mapped CAGE tags for fixed-length regions allowed us to determine genomic expression levels. These expression levels were normalized, quantified, and converted into four types of descriptors, allowing the expression patterns along the genome to be represented by character strings. We analyzed them using dynamic programming in the same manner as for sequence analysis. We have developed a novel algorithm that provides a novel view of the genome from the perspective of genomic positional expression. In a similarity search of expression patterns across chromosomes and tissues, we found regions that had clusters of genes that showed expression patterns similar to each other depending on tissue type. Our results suggest the possibility that the regions that have sense-antisense transcription show similar expression patterns between forward and reverse strands.
Assuntos
Mapeamento Cromossômico/métodos , Genoma , Camundongos/genética , Transcrição Gênica , Algoritmos , Animais , Composição de Bases , Regulação da Expressão Gênica , Genoma Humano , Humanos , Macrófagos/fisiologia , MicroRNAs/genética , Modelos Genéticos , RNA não Traduzido/genéticaRESUMO
The international FANTOM consortium aims to produce a comprehensive picture of the mammalian transcriptome, based upon an extensive cDNA collection and functional annotation of full-length enriched cDNAs. The previous dataset, FANTOM2, comprised 60,770 full-length enriched cDNAs. Functional annotation revealed that this cDNA dataset contained only about half of the estimated number of mouse protein-coding genes, indicating that a number of cDNAs still remained to be collected and identified. To pursue the complete gene catalog that covers all predicted mouse genes, cloning and sequencing of full-length enriched cDNAs has been continued since FANTOM2. In FANTOM3, 42,031 newly isolated cDNAs were subjected to functional annotation, and the annotation of 4,347 FANTOM2 cDNAs was updated. To accomplish accurate functional annotation, we improved our automated annotation pipeline by introducing new coding sequence prediction programs and developed a Web-based annotation interface for simplifying the annotation procedures to reduce manual annotation errors. Automated coding sequence and function prediction was followed with manual curation and review by expert curators. A total of 102,801 full-length enriched mouse cDNAs were annotated. Out of 102,801 transcripts, 56,722 were functionally annotated as protein coding (including partial or truncated transcripts), providing to our knowledge the greatest current coverage of the mouse proteome by full-length cDNAs. The total number of distinct non-protein-coding transcripts increased to 34,030. The FANTOM3 annotation system, consisting of automated computational prediction, manual curation, and final expert curation, facilitated the comprehensive characterization of the mouse transcriptome, and could be applied to the transcriptomes of other species.
Assuntos
DNA Complementar/genética , Bases de Dados Genéticas , Camundongos/genética , Transcrição Gênica , Animais , Automação , DNA Complementar/química , GenomaRESUMO
Recent epigenetics research has demonstrated that chromatin conformation plays an important role in various aspects of gene regulation. Chromosome Conformation Capture (3C) technology makes it possible to analyze the spatial organization of chromatin in a cell. Several algorithms for three-dimensional reconstruction of chromatin structure from 3C experimental data have been proposed. Compared to other algorithms, ShRec3D, one of the most advanced algorithms, can reconstruct a chromatin model in the shortest time for high-resolution whole-genome experimental data. However, ShRec3D employs a graph shortest path algorithm, which introduces errors in the resulting model. We propose an improved algorithm that optimizes shortest path distances using a genetic algorithm approach. The proposed algorithm and ShRec3D were compared using in silico 3C experimental data. Compared to ShRec3D, the proposed algorithm demonstrated significant improvement relative to the similarity between the algorithm's output and the original model with a reasonable increase to calculation time.
Assuntos
Algoritmos , Cromatina , Biologia Computacional/métodos , Imageamento Tridimensional/métodos , Cromatina/química , Cromatina/genética , Modelos GenéticosRESUMO
Background: Epigenetic factors associated with the development of autoimmune diseases are unclear. Monozygotic twin pairs discordant for positive antithyroglobulin autoantibodies (TgAb) are useful to examine the epigenetic factors because of their identical genetic background. This study aimed to clarify the discordant epigenetic differences affecting the development of TgAb. Methods: Subjects were selected from 257 Japanese monozygotic twins, recruited from the registry established by the Center for Twin Research at Osaka University. TgAb positive concordant (PC) pairs were 5.7% (four pairs) and 9.6% (18 pairs) of male and female pairs, respectively. TgAb discordant (DC) pairs were 11.4% (eight pairs) and 8.0% (15 pairs) of male and female pairs, respectively. TgAb negative concordant (NC) pairs were 78.6% (55 pairs) of male pairs and 74.3% (139 pairs) of female pairs. To perform stricter grouping, the cut-off value for positive TgAb was set to 50.0 IU/mL (TgAb negative: <28.0 IU/mL; TgAb positive: ≥50.0 IU/mL; TgAb borderline: ≥28.0 IU/mL and <50.0 IU/mL). Nineteen discordant (6 male and 13 female pairs) and 185 concordant pairs (48 male and 137 female pairs) for TgAb positivity were finally examined. DNA methylation levels of genomic DNA were evaluated using the Infinium HumanMethylation450 BeadChip Kit. Gene polymorphisms were also genotyped using the Omni5-4 BeadChip Kit to clarify genetic background specific for discordant twins. Results: No CpG sites were found with significant within-pair differences of methylation levels in TgAb DC pairs after correction for multiple comparisons. However, 155 polymorphisms specific for TgAb DC pairs were significantly different in genotype frequencies from those of concordant pairs, and none of them were located on the HLA region of chromosome 6. In TgAb DC pairs with some specific genotypes of these polymorphisms, four CpG sites were observed exhibiting significant within-pair differences in each DC pair, even after correction for multiple comparisons. Conclusions: This study found that the genetic background specific for TgAb DC twins who are susceptible to epigenetic changes are different from that specific for TgAb PC twins, and it clarified the genotype-based epigenetic differences in TgAb DC monozygotic twins.
RESUMO
We present a new method to describe tissue-specific function that leverages the advantage of the Cap Analysis of Gene Expression (CAGE) data. The CAGE expression data represent the number of mRNAs of each gene in a sample. The feature enables us to compare or add the expression amount of genes in the sample. As usual methods compared the gene expression values among tissues for each gene respectively and ruled out to compare them among genes, they have not exploited the feature to reveal tissue specificity. To utilize the feature, we used Gene Ontology terms (GO-terms) as unit to sum up the expression values and described specificities of tissues by them. We regard GO-terms as events that occur in the tissue according to probabilities that are defined by means of the CAGE. Our method is applied to mouse CAGE data on 22 tissues. Among them, we show the results of molecular functions and cellular components on liver. We also show the most expressed genes in liver to compare with our method. The results agree well with well-known specific functions such as amino acid metabolisms of liver. Moreover, the difference of inter-cellular junction among liver, lung, heart, muscle and prostate gland are apparently observed. The results of our method provide researchers a clue to the further research of the tissue roles and the deeper functions of the tissue-specific genes. All the results and supplementary materials are available via our web site.
Assuntos
Biologia Computacional/métodos , Perfilação da Expressão Gênica/métodos , Aminoácidos/metabolismo , Animais , Membrana Celular/metabolismo , Computadores , Bases de Dados Genéticas , Fígado/metabolismo , Camundongos , Modelos Biológicos , Modelos Estatísticos , Software , Distribuição TecidualRESUMO
Gene expressions differ depending on tissue types and developmental stages. Analyzing how each gene is expressed is thus important. One way of analyzing gene expression patterns is to identify tissue-specific functions. This is useful for understanding how vital activities are performed. DNA microarray has been widely used to observe gene expressions exhaustively. However, comparing the expression value of a gene to that of other genes is impossible, as the gene expression value of a condition is measured as a proportion of that for the same gene under a control condition. We therefore could not determine whether one gene is more expressed than other genes. Cap analysis gene expression (CAGE) allows high-throughput analysis of gene expressions by counting the number of cDNAs of expressed genes. CAGE enables comparison of the expression value of the gene to that of other genes in the same tissue. In this study, we propose a method for exploring tissue-specific functions using data from CAGE. To identify tissue-specificity, one of the simplest ways is to assume that the function of the most expressed gene is regarded as the most tissue-specific. However, the most expressed gene in a tissue might highly express in all tissues, as seen with housekeeping genes. Functions of such genes cannot be tissue-specific. To remove these from consideration, we propose measuring tissue specificity of functions based on information content of gene ontology terms. We applied our method to data from 16 human tissues and 22 mouse tissues. The results from liver and prostate gland indicated that well-known functions of these tissues, such as functions related to signaling and muscle in prostate gland and immune function in liver, displayed high rank.
Assuntos
Perfilação da Expressão Gênica/métodos , Animais , DNA Complementar/genética , Regulação da Expressão Gênica , Humanos , Fígado/metabolismo , Masculino , Camundongos , Especificidade de Órgãos , Próstata/metabolismoRESUMO
Comprehensively understanding the dynamics of biological systems is one of the greatest challenges in biology. Vastly improved biological technologies have provided vast amounts of information that must be understood by bioinformatics and systems biology researchers. Gene regulations have been frequently modeled by ordinary differential equations or graphical models based on time-course gene expression profiles. The state-of-the-art computational approaches for analyzing gene regulations assume that their models are same throughout time-course experiments. However, these approaches cannot easily analyze transient changes at a time point, such as diauxic shift. We propose a score that analyzes the gene regulations at each time point. The score is based on the information gains of information criterion values. The method detects the shifts in gene regulatory networks (GRNs) during time-course experiments with single-time-point resolution. The effectiveness of the method is evaluated on the diauxic shift from glucose to lactose in Escherichia coli. Gene regulation shifts were detected at two time points: the first corresponding to the time at which the growth of E. coli ceased and the second corresponding to the end of the experiment, when the nutrient sources (glucose and lactose) had become exhausted. According to these results, the proposed score and method can appropriately detect the time of gene regulation shifts. The method based on the proposed score provides a new tool for analyzing dynamic biological systems. Because the score value indicates the strength of gene regulation at each time point in a gene expression profile, it can potentially infer hidden GRNs from time-course experiments.
Assuntos
Biologia Computacional/métodos , Redes Reguladoras de Genes , Algoritmos , Animais , Gráficos por Computador , Escherichia coli/genética , Escherichia coli/crescimento & desenvolvimento , Escherichia coli/metabolismo , Perfilação da Expressão Gênica , Genoma Bacteriano , Glucose/metabolismo , Humanos , Lactose/metabolismo , Cadeias de Markov , Camundongos , Modelos Genéticos , Biologia de Sistemas , Fatores de TempoRESUMO
Recently, gene expression data under various conditions have largely been obtained by the utilization of the DNA microarrays and oligonucleotide arrays. There have been emerging demands to analyze the function of genes from the gene expression profiles. For clustering genes from their expression profiles, hierarchical clustering has been widely used. The clustering method represents the relationships of genes as a tree structure by connecting genes using their similarity scores based on the Pearson correlation coefficient. But the clustering method is sensitive to experimental noise. To cope with the problem, we propose another type of clustering method (the p-quasi complete linkage clustering). We apply this method to the gene expression data of yeast cell-cycles and human lung cancer. The effectiveness of our method is demonstrated by comparing clustering results with other methods.
Assuntos
Algoritmos , Inteligência Artificial , Proteínas de Ciclo Celular/genética , Perfilação da Expressão Gênica/métodos , Neoplasias Pulmonares/genética , Proteínas de Neoplasias/fisiologia , Análise de Sequência com Séries de Oligonucleotídeos/métodos , Proteínas de Saccharomyces cerevisiae/genética , Análise por Conglomerados , Humanos , Neoplasias Pulmonares/metabolismoRESUMO
A vast amount of metagenomic data has been obtained by extracting multiple genomes simultaneously from microbial communities, including genomes from uncultivable microbes. By analyzing these metagenomic data, novel microbes are discovered and new microbial functions are elucidated. The first step in analyzing these data is sequenced-read classification into reference genomes from which each read can be derived. The Naïve Bayes Classifier is a method for this classification. To identify the derivation of the reads, this method calculates a score based on the occurrence of a DNA sequence motif in each reference genome. However, large differences in the sizes of the reference genomes can bias the scoring of the reads. This bias might cause erroneous classification and decrease the classification accuracy. To address this issue, we have updated the Naïve Bayes Classifier method using multiple sets of occurrence profiles for each reference genome by normalizing the genome sizes, dividing each genome sequence into a set of subsequences of similar length and generating profiles for each subsequence. This multiple profile strategy improves the accuracy of the results generated by the Naïve Bayes Classifier method for simulated and Sargasso Sea datasets.
RESUMO
BACKGROUND: Identifying the differences between gene regulatory networks under varying biological conditions or external stimuli is an important challenge in systems biology. Several methods have been developed to reverse-engineer a cellular system, called a gene regulatory network, from gene expression profiles in order to understand transcriptomic behavior under various conditions of interest. Conventional methods infer the gene regulatory network independently from each of the multiple gene expression profiles under varying conditions to find the important regulatory relations for understanding cellular behavior. However, the inferred networks with conventional methods include a large number of misleading relations, and the accuracy of the inference is low. This is because conventional methods do not consider other related conditions, and the results of conventional methods include considerable noise due to the limited number of observation points in each expression profile of interest. RESULTS: We propose a more accurate method for estimating key gene regulatory networks for understanding cellular behavior under various conditions. Our method utilizes multiple gene expression profiles that compose a tree structure under varying conditions. The root represents the original cellular state, and the leaves represent the changed cellular states under various conditions. By using this tree-structured gene expression profiles, our method more powerfully estimates the networks that are key to understanding the cellular behavior of interest under varying conditions. CONCLUSION: We confirmed that the proposed method in cell differentiation was more rigorous than the conventional method. The results show that our assumptions as to which relations are unimportant for understanding the differences of cellular states in cell differentiation are appropriate, and that our method can infer more accurately the core networks of the cell types.
Assuntos
Redes Reguladoras de Genes , Modelos Genéticos , Transcriptoma , Adipócitos/citologia , Adipócitos/fisiologia , Algoritmos , Animais , Diferenciação Celular/genética , Perfilação da Expressão Gênica/métodos , Células-Tronco Mesenquimais/citologia , Células-Tronco Mesenquimais/fisiologia , Camundongos , Osteoblastos/citologia , Osteoblastos/fisiologiaRESUMO
The S-system model is one of the nonlinear differential equation models of gene regulatory networks, and it can describe various dynamics of the relationships among genes. If we successfully infer rigorous S-system model parameters that describe a target gene regulatory network, we can simulate gene expressions mathematically. However, the problem of finding an optimal S-system model parameter is too complex to be solved analytically. Thus, some heuristic search methods that offer approximate solutions are needed for reducing the computational time. In previous studies, several heuristic search methods such as Genetic Algorithms (GAs) have been applied to the parameter search of the S-system model. However, they have not achieved enough estimation accuracy. One of the conceivable reasons is that the mechanisms to escape local optima. We applied an Immune Algorithm (IA) to search for the S-system parameters. IA is also a heuristic search method, which is inspired by the biological mechanism of acquired immunity. Compared to GA, IA is able to search large solution space, thereby avoiding local optima, and have multiple candidates of the solutions. These features work well for searching the S-system model. Actually, our algorithm showed higher performance than GA for both simulation and real data analyses.
Assuntos
Algoritmos , Redes Reguladoras de Genes/imunologia , Biologia Computacional/métodos , Perfilação da Expressão Gênica/métodosRESUMO
MOTIVATION: Clustering of protein sequences is widely used for the functional characterization of proteins. However, it is still not easy to cluster distantly-related proteins, which have only regional similarity among their sequences. It is therefore necessary to develop an algorithm for clustering such distantly-related proteins. RESULTS: We have developed a time and space efficient clustering algorithm. It uses a graph representation where its vertices and edges denote proteins and their sequence similarities above a certain cutoff score, respectively. It repeatedly partitions the graph by removing edges that have small weights, which correspond to low sequence similarities. To find the appropriate partitions, we introduce a score combining the normalized cut and a locally minimal cut capacities. Our method is applied to the entire 40,703 human proteins in SWISS-PROT and TrEMBL. The resulting clusters shows a 76% recall (20,529 proteins) of the 26,917 classified by InterPro. It also finds relationships not found by other clustering methods. AVAILABILITY: The complete result of our algorithm for all the human proteins in SWISS-PROT and TrEMBL, and other supplementary information are available at http://motif.ics.es.osaka-u.ac.jp/Ncut-KL/