Results 1 - 20 of 45
1.
Bull Math Biol ; 85(3): 21, 2023 02 13.
Article in English | MEDLINE | ID: mdl-36780044

ABSTRACT

The study of native motifs of RNA secondary structures helps us better understand the formation and eventually the functions of these molecules. Commonly known structural motifs include helices, hairpin loops, bulges, interior loops, exterior loops and multiloops. However, enumerative results and generating algorithms taking into account the joint distribution of these motifs are sparse. In this paper, we present progress on deriving such distributions employing a tree-bijection of RNA secondary structures obtained by Schmitt and Waterman and a novel rake decomposition of plane trees. The key feature of the latter is that the derived components encode motifs of the RNA secondary structures without pseudoknots associated with the plane trees very well. As an application, we present an algorithm (RakeSamp) generating uniformly random secondary structures without pseudoknots that satisfy fine motif specifications on the length and degree of various types of loops as well as helices.


Subject(s)
Mathematical Concepts, RNA, RNA/chemistry, Nucleic Acid Conformation, Biological Models, Algorithms
2.
IEEE Trans Inf Theory ; 67(6): 3287-3294, 2021 Jun.
Article in English | MEDLINE | ID: mdl-34257466

ABSTRACT

Levenshtein edit distance has played a central role, both past and present, in sequence alignment in particular and biological database similarity search in general. We start our review with a history of dynamic programming algorithms for computing Levenshtein distance and sequence alignments. We then describe how those algorithms led to the heuristics employed in BLAST, the most widely used software in bioinformatics, a program that searches DNA and protein databases for evolutionarily relevant similarities. More recently, the advent of modern genomic sequencing and the volume of data it generates have resulted in a return to the problem of local alignment. We conclude with how the mathematical formulation of Levenshtein distance as a metric made possible additional optimizations to similarity search in biological contexts. These modern optimizations are built around the low metric entropy and fractional dimensionality of biological databases, enabling orders-of-magnitude acceleration of biological similarity search.
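
As an illustration of the dynamic programming approach this abstract refers to, the following is a minimal Python sketch of Levenshtein distance with unit costs. The function name and the two-row layout are choices made here for illustration; they are not taken from the reviewed article.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    # prev[j] is the distance between the first i-1 characters of a
    # and the first j characters of b; only two rows are kept in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]
```

The running time is proportional to the product of the two sequence lengths, which is why heuristic tools are preferred for database-scale search.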

3.
Bioinformatics ; 35(22): 4596-4606, 2019 11 01.
Article in English | MEDLINE | ID: mdl-30993316

ABSTRACT

MOTIVATION: Detecting sequences containing repetitive regions is a basic bioinformatics task with many applications. Several methods have been developed for various types of repeat detection tasks. An efficient generic method for detecting most types of repetitive sequences is still desirable. Inspired by the excellent properties and successful applications of the D2 family of statistics in comparative analyses of genomic sequences, we developed a new statistic D2R that can efficiently discriminate sequences with or without repetitive regions. RESULTS: Using the statistic, we developed an algorithm of linear time and space complexity for detecting most types of repetitive sequences in multiple scenarios, including finding candidate clustered regularly interspaced short palindromic repeats regions from bacterial genomic or metagenomics sequences. Simulation and real data experiments show that the method works well on both assembled sequences and unassembled short reads. AVAILABILITY AND IMPLEMENTATION: The codes are available at https://github.com/XuegongLab/D2R_codes under GPL 3.0 license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Clustered Regularly Interspaced Short Palindromic Repeats, Genomics, Algorithms, Bacterial Genome, Metagenomics
4.
Bioinformatics ; 34(4): 617-624, 2018 02 15.
Article in English | MEDLINE | ID: mdl-29040382

ABSTRACT

Motivation: Capturing association patterns in gene expression levels under different conditions or time points is important for inferring gene regulatory interactions. In practice, temporal changes in gene expression may result in complex association patterns that require more sophisticated detection methods than simple correlation measures. For instance, the effect of regulation may lead to time-lagged associations and interactions local to a subset of samples. Furthermore, expression profiles of interest may not be aligned or directly comparable (e.g. gene expression profiles from two species). Results: We propose a count statistic for measuring association between pairs of gene expression profiles consisting of ordered samples (e.g. time-course), where correlation may only exist locally in subsequences separated by a position shift. The statistic is simple and fast to compute, and we illustrate its use in two applications. In a cross-species comparison of developmental gene expression levels, we show our method not only measures association of gene expressions between the two species, but also provides alignment between different developmental stages. In the second application, we applied our statistic to expression profiles from two distinct phenotypic conditions, where the samples in each profile are ordered by the associated phenotypic values. The detected associations can be useful in building correspondence between gene association networks under different phenotypes. On the theoretical side, we provide asymptotic distributions of the statistic for different regions of the parameter space and test its power on simulated data. Availability and implementation: The code used to perform the analysis is available as part of the Supplementary Material. Contact: msw@usc.edu or hhuang@stat.berkeley.edu. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Gene Expression Profiling/methods, Gene Expression Regulation, Gene Regulatory Networks, Software, Algorithms, Computational Biology/methods, Phenotype, RNA Sequence Analysis/methods
5.
Nucleic Acids Res ; 45(W1): W554-W559, 2017 07 03.
Article in English | MEDLINE | ID: mdl-28472388

ABSTRACT

Alignment-free genome and metagenome comparisons are increasingly important with the development of next generation sequencing (NGS) technologies. Recently developed state-of-the-art k-mer based alignment-free dissimilarity measures including CVTree, $d_2^*$ and $d_2^S$ are more computationally expensive than measures based solely on the k-mer frequencies. Here, we report standalone software, aCcelerated Alignment-FrEe sequence analysis (CAFE), for efficient calculation of 28 alignment-free dissimilarity measures. CAFE accepts both assembled genome sequences and unassembled NGS shotgun reads as input, and wraps the output in a standard PHYLIP format. In downstream analyses, CAFE can also be used to visualize the pairwise dissimilarity measures as dendrograms, heatmaps, principal coordinate analysis plots and network displays. CAFE serves as a general k-mer based alignment-free analysis platform for studying the relationships among genomes and metagenomes, and is freely available at https://github.com/younglululu/CAFE.


Subject(s)
Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Software, Animals, Microbial Genome, Internet, Metagenomics, Primates/genetics, Sequence Alignment, Vertebrates/genetics
6.
Proc Natl Acad Sci U S A ; 111(46): 16371-6, 2014 Nov 18.
Article in English | MEDLINE | ID: mdl-25288767

ABSTRACT

With the advent of high-throughput technologies making large-scale gene expression data readily available, developing appropriate computational tools to process these data and distill insights into systems biology has been an important part of the "big data" challenge. Gene coexpression is one of the earliest techniques developed that is still widely in use for functional annotation, pathway analysis, and, most importantly, the reconstruction of gene regulatory networks, based on gene expression data. However, most coexpression measures do not specifically account for local features in expression profiles. For example, it is very likely that the patterns of gene association may change or only exist in a subset of the samples, especially when the samples are pooled from a range of experiments. We propose two new gene coexpression statistics based on counting local patterns of gene expression ranks to take into account the potentially diverse nature of gene interactions. In particular, one of our statistics is designed for time-course data with local dependence structures, such as time series coupled over a subregion of the time domain. We provide asymptotic analysis of their distributions and power, and evaluate their performance against a wide range of existing coexpression measures on simulated and real data. Our new statistics are fast to compute, robust against outliers, and show comparable and often better general performance.
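
The idea of counting local patterns of expression ranks can be sketched as follows. This is only a loose illustration under stated assumptions: the window length `k`, the helper names, and the rule of counting windows whose rank orderings agree exactly are choices made here, and the statistics defined in the article may differ in detail.

```python
def rank_pattern(window):
    """Return the rank ordering of a window as a tuple (ties broken by position)."""
    order = sorted(range(len(window)), key=lambda i: window[i])
    ranks = [0] * len(window)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return tuple(ranks)


def count_matching_patterns(x, y, k):
    """Count length-k windows in which two profiles share the same rank pattern."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of samples")
    return sum(
        rank_pattern(x[i:i + k]) == rank_pattern(y[i:i + k])
        for i in range(len(x) - k + 1)
    )
```

Because only the ordering within each window matters, a single extreme value changes at most the `k` windows that contain it, which is one intuition for robustness against outliers.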


Subject(s)
Computational Biology/statistics & numerical data, Gene Expression Profiling/statistics & numerical data, Gene Regulatory Networks, Algorithms, Arabidopsis/genetics, Arabidopsis/metabolism, Arabidopsis Proteins/biosynthesis, Arabidopsis Proteins/genetics, Cell Cycle Proteins/biosynthesis, Cell Cycle Proteins/genetics, Computational Biology/methods, Computer Simulation, Fungal Gene Expression Regulation, Plant Gene Expression Regulation, Fungal Genes, Plant Genes, Genetic Models, Monte Carlo Method, Saccharomyces cerevisiae/cytology, Saccharomyces cerevisiae/genetics, Saccharomyces cerevisiae Proteins/biosynthesis, Saccharomyces cerevisiae Proteins/genetics, Time Factors
7.
Brief Bioinform ; 15(3): 343-53, 2014 May.
Article in English | MEDLINE | ID: mdl-24064230

ABSTRACT

With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data.
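
A word-frequency comparison of the kind reviewed here can be sketched in a few lines of Python. The Euclidean distance between k-mer frequency vectors is used purely as a simple stand-in; it is an assumption made for illustration, not a measure singled out by the review, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_frequencies(seq, k):
    """Relative frequency of every overlapping word of length k in seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def euclidean_dissimilarity(seq_a, seq_b, k):
    """Euclidean distance between the k-mer frequency vectors of two sequences."""
    fa, fb = kmer_frequencies(seq_a, k), kmer_frequencies(seq_b, k)
    words = set(fa) | set(fb)
    return sqrt(sum((fa.get(w, 0.0) - fb.get(w, 0.0)) ** 2 for w in words))
```

No alignment is computed at any point, so the same function applies to assembled sequences and to pooled short reads alike.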


Subject(s)
Computational Biology/methods, Sequence Analysis/methods, Algorithms, Computational Biology/trends, Genomics/methods, Genomics/statistics & numerical data, High-Throughput Nucleotide Sequencing, Markov Chains, Statistical Models, Sequence Alignment, Sequence Analysis/statistics & numerical data
8.
Proc Natl Acad Sci U S A ; 107(24): 10848-53, 2010 Jun 15.
Article in English | MEDLINE | ID: mdl-20534489

ABSTRACT

Variation in genome structure is an important source of human genetic polymorphism: It affects a large proportion of the genome and has a variety of phenotypic consequences relevant to health and disease. In spite of this, human genome structure variation is incompletely characterized due to a lack of approaches for discovering a broad range of structural variants in a global, comprehensive fashion. We addressed this gap with Optical Mapping, a high-throughput, high-resolution single-molecule system for studying genome structure. We used Optical Mapping to create genome-wide restriction maps of a complete hydatidiform mole and three lymphoblast-derived cell lines, and we validated the approach by demonstrating a strong concordance with existing methods. We also describe thousands of new variants with sizes ranging from kb to Mb.


Subject(s)
Human Genome, Optical Restriction Mapping/methods, Algorithms, Cell Line, Tumor Cell Line, Female, Genetic Variation, Genome-Wide Association Study, Humans, Hydatidiform Mole/genetics, Lymphocytes/metabolism, Optical Restriction Mapping/statistics & numerical data, Pregnancy, Uterine Neoplasms/genetics
9.
PLoS Comput Biol ; 7(6): e1001106, 2011 Jun.
Article in English | MEDLINE | ID: mdl-21698123

ABSTRACT

The rapid accumulation of biological networks poses new challenges and calls for powerful integrative analysis tools. Most existing methods capable of simultaneously analyzing a large number of networks were primarily designed for unweighted networks, and cannot easily be extended to weighted networks. However, it is known that transforming weighted into unweighted networks by dichotomizing the edges of weighted networks with a threshold generally leads to information loss. We have developed a novel, tensor-based computational framework for mining recurrent heavy subgraphs in a large set of massive weighted networks. Specifically, we formulate the recurrent heavy subgraph identification problem as a heavy 3D subtensor discovery problem with sparse constraints. We describe an effective approach to solving this problem by designing a multi-stage, convex relaxation protocol, and a non-uniform edge sampling technique. We applied our method to 130 co-expression networks, and identified 11,394 recurrent heavy subgraphs, grouped into 2,810 families. We demonstrated that the identified subgraphs represent meaningful biological modules by validating against a large set of compiled biological knowledge bases. We also showed that the likelihood for a heavy subgraph to be meaningful increases significantly with its recurrence in multiple networks, highlighting the importance of the integrative approach to biological network analysis. Moreover, our approach based on weighted graphs detects many patterns that would be overlooked using unweighted graphs. In addition, we identified a large number of modules that occur predominately under specific phenotypes. This analysis resulted in a genome-wide mapping of gene network modules onto the phenome. Finally, by comparing module activities across many datasets, we discovered high-order dynamic cooperativeness in protein complex networks and transcriptional regulatory networks.


Subject(s)
Computational Biology/methods, Genetic Databases, Gene Regulatory Networks, Genetic Models, Computer-Assisted Signal Processing, Algorithms, Protein Databases, Gene Expression, Gene Expression Regulation, Humans, Oligonucleotide Array Sequence Analysis, Phenotype, Proteins
10.
PLoS Genet ; 5(11): e1000711, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19936062

ABSTRACT

About 85% of the maize genome consists of highly repetitive sequences that are interspersed by low-copy, gene-coding sequences. The maize community has dealt with this genomic complexity by the construction of an integrated genetic and physical map (iMap), but this resource alone was not sufficient for ensuring the quality of the current sequence build. For this purpose, we constructed a genome-wide, high-resolution optical map of the maize inbred line B73 genome containing >91,000 restriction sites (averaging 1 site per ~23 kb) accrued from mapping genomic DNA molecules. Our optical map comprises 66 contigs, averaging 31.88 Mb in size and spanning 91.5% (2,103.93 Mb of ~2,300 Mb) of the maize genome. A new algorithm was created that considered both optical map and unfinished BAC sequence data for placing 60/66 (2,032.42 Mb) optical map contigs onto the maize iMap. The alignment of optical maps against numerous data sources yielded comprehensive results that proved revealing and productive. For example, gaps were uncovered and characterized within the iMap, the FPC (fingerprinted contigs) map, and the chromosome-wide pseudomolecules. Such alignments also suggested amended placements of FPC contigs on the maize genetic map and proactively guided the assembly of chromosome-wide pseudomolecules, especially within complex genomic regions. Lastly, we think that the full integration of B73 optical maps with the maize iMap would greatly facilitate maize sequence finishing efforts that would make it a valuable reference for comparative studies among cereals, or other maize inbred lines and cultivars.


Subject(s)
Plant Genome/genetics, Zea mays/genetics, Algorithms, Base Sequence, Bacterial Artificial Chromosomes/genetics, Contig Mapping, Molecular Sequence Data, Optical Phenomena, Physical Chromosome Mapping, Sequence Alignment
11.
J Theor Biol ; 284(1): 106-16, 2011 Sep 07.
Article in English | MEDLINE | ID: mdl-21723298

ABSTRACT

Alignment-free sequence comparison is widely used for comparing gene regulatory regions and for identifying horizontally transferred genes. Recent studies on the power of a widely used alignment-free comparison statistic $D_2$ and its variants $D_2^*$ and $D_2^S$ showed that their power approaches a limit smaller than 1 as the sequence length tends to infinity under a pattern transfer model. We develop new alignment-free statistics based on $D_2$, $D_2^*$ and $D_2^S$ by comparing local sequence pairs and then summing over all the local sequence pairs of a certain length. We show that the new statistics are much more powerful than the corresponding original statistics and that their power tends to 1 as the sequence length tends to infinity under the pattern transfer model.
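
For orientation, the basic D2 statistic is commonly defined as the inner product of the k-mer count vectors of two sequences. The sketch below shows only that baseline; the centred variants and the local-pair statistics proposed in the article are not reproduced here, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """D2: sum over all words w of (count of w in seq_a) * (count of w in seq_b)."""
    counts_a, counts_b = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    # A Counter returns 0 for absent words, so unshared words contribute nothing.
    return sum(n * counts_b[word] for word, n in counts_a.items())
```

Larger values indicate more shared words, so D2 is a similarity score rather than a distance.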


Subject(s)
Nucleic Acid Regulatory Sequences/genetics, DNA Sequence Analysis/methods, Algorithms, Animals, Statistical Data Interpretation, Drosophila/genetics, Molecular Evolution, HIV-1/genetics, Statistical Models, Sequence Alignment, Nucleic Acid Sequence Homology
12.
J Comput Biol ; 28(3): 248-256, 2021 03.
Article in English | MEDLINE | ID: mdl-33275493

ABSTRACT

COVID-19 is an infectious disease caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The viral genome is considered to be relatively stable and the mutations that have been observed and reported thus far are mainly focused on the coding region. This article provides evidence that macrolevel pandemic dynamics, such as social distancing, modulate the genomic evolution of SARS-CoV-2. This view complements the prevalent paradigm that microlevel observables control macrolevel parameters such as death rates and infection patterns. First, we observe differences in mutational signals for geospatially separated populations such as the prevalence of A23404G in CA versus NY and WA. We show that the feedback between macrolevel dynamics and the viral population can be captured employing a transfer entropy framework. Second, we observe complex interactions within mutational clades. Namely, when C14408T first appeared in the viral population, the frequency of A23404G spiked in the subsequent week. Third, we identify a noncoding mutation, G29540A, within the segment between the coding gene of the N protein and the ORF10 gene, which is largely confined to NY (>95%). These observations indicate that macrolevel sociobehavioral measures have an impact on the viral genomics and may be useful for the dashboard-like tracking of its evolution. Finally, despite the fact that SARS-CoV-2 is a genetically robust organism, our findings suggest that we are dealing with a high degree of adaptability. Owing to its ample spread, mutations of unusual form are observed and a high complexity of mutational interaction is exhibited.


Subject(s)
COVID-19/virology, Molecular Evolution, Viral Genome, SARS-CoV-2/genetics, COVID-19/epidemiology, COVID-19/transmission, Computational Biology, Gene Frequency, Health Behavior, Health Policy, Humans, Genetic Models, Mutation, Pandemics, Phylogeny, Physical Distancing, SARS-CoV-2/pathogenicity, SARS-CoV-2/physiology, Coronavirus Spike Glycoprotein/genetics
13.
BMC Bioinformatics ; 11 Suppl 1: S62, 2010 Jan 18.
Article in English | MEDLINE | ID: mdl-20122238

ABSTRACT

BACKGROUND: Complex human diseases are often caused by multiple mutations, each of which contributes only a minor effect to the disease phenotype. To study the basis for these complex phenotypes, we developed a network-based approach to identify coexpression modules specifically activated in particular phenotypes. We integrated these modules, protein-protein interaction data, Gene Ontology annotations, and our database of gene-phenotype associations derived from literature to predict novel human gene-phenotype associations. Our systematic predictions provide us with the opportunity to perform a global analysis of human gene pleiotropy and its underlying regulatory mechanisms. RESULTS: We applied this method to 338 microarray datasets, covering 178 phenotype classes, and identified 193,145 phenotype-specific coexpression modules. We trained random forest classifiers for each phenotype and predicted a total of 6,558 gene-phenotype associations. We showed that 40.9% genes are pleiotropic, highlighting that pleiotropy is more prevalent than previously expected. We collected 77 ChIP-chip datasets studying 69 transcription factors binding over 16,000 targets under various phenotypic conditions. Utilizing this unique data source, we confirmed that dynamic transcriptional regulation is an important force driving the formation of phenotype specific gene modules. CONCLUSION: We created a genome-wide gene to phenotype mapping that has many potential implications, including providing potential new drug targets and uncovering the basis for human disease phenotypes. Our analysis of these phenotype-specific coexpression modules reveals a high prevalence of gene pleiotropy, and suggests that phenotype-specific transcription factor binding may contribute to phenotypic diversity. All resources from our study are made freely available on our online Phenotype Prediction Database.


Subject(s)
Computational Biology/methods, Genome, Phenotype, Genetic Databases, Gene Expression Profiling
14.
Bioinformatics ; 25(18): 2430-1, 2009 Sep 15.
Article in English | MEDLINE | ID: mdl-19561337

ABSTRACT

SUMMARY: Haplotype assembly is becoming a very important tool in genome sequencing of human and other organisms. Although haplotypes were previously inferred from genome assemblies, there has never been a comparative haplotype browser that depicts a global picture of whole-genome alignments among haplotypes of different organisms. We introduce a whole-genome HAPLotype brOWSER (HAPLOWSER), providing evolutionary perspectives from multiple aligned haplotypes and functional annotations. Haplowser enables the comparison of haplotypes from metagenomes, and associates conserved regions or the bases at the conserved regions with functional annotations and custom tracks. The associations are quantified for further analysis and presented as pie charts. Functional annotations and custom tracks that are projected onto haplotypes are saved as multiple files in FASTA format. Haplowser provides a user-friendly interface, and can display alignments of haplotypes with functional annotations at any resolution. AVAILABILITY: Haplowser, written in Java, supports multiple platforms including Windows and Linux. Haplowser is publicly available at http://embio.yonsei.ac.kr/haplowser .


Subject(s)
Computational Biology/methods, Genome, Haplotypes, Metagenome, Software, Genetic Databases, Genomics, Internet
15.
J Comput Sci Technol ; 25(1): 3-9, 2010 Jan 01.
Article in English | MEDLINE | ID: mdl-22121326

ABSTRACT

New generation sequencing systems are changing how molecular biology is practiced. The widely promoted $1000 genome will be a reality with attendant changes for healthcare, including personalized medicine. More broadly the genomes of many new organisms with large samplings from populations will be commonplace. What is less appreciated is the explosive demands on computation, both for CPU cycles and storage as well as the need for new computational methods. In this article we will survey some of these developments and demands.

16.
Nucleic Acids Res ; 35(Database issue): D756-9, 2007 Jan.
Article in English | MEDLINE | ID: mdl-17090592

ABSTRACT

The recent development of microarray technology provided unprecedented opportunities to understand the genetic basis of aging. So far, many microarray studies have addressed aging-related expression patterns in multiple organisms and under different conditions. The number of relevant studies continues to increase rapidly. However, efficient exploitation of these vast data is frustrated by the lack of an integrated data mining platform or other unifying bioinformatic resource to enable convenient cross-laboratory searches of array signals. To facilitate the integrative analysis of microarray data on aging, we developed a web database and analysis platform 'Gene Aging Nexus' (GAN) that is freely accessible to the research community to query/analyze/visualize cross-platform and cross-species microarray data on aging. By providing the possibility of integrative microarray analysis, GAN should be useful in building the systems-biology understanding of aging. GAN is accessible at http://gan.usc.edu.


Subject(s)
Aging/genetics, Genetic Databases, Gene Expression Profiling, Oligonucleotide Array Sequence Analysis, Aging/metabolism, Animals, Humans, Internet, Mice, Rats, Software, User-Computer Interface
17.
Genome Biol ; 20(1): 144, 2019 07 25.
Article in English | MEDLINE | ID: mdl-31345254

ABSTRACT

BACKGROUND: Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. RESULTS: Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. CONCLUSION: The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.


Subject(s)
Sequence Analysis, Benchmarking, Horizontal Gene Transfer, Internet, Phylogeny, Nucleic Acid Regulatory Sequences, Sequence Alignment, Protein Sequence Analysis, Software
18.
Bioinformatics ; 23(13): i222-9, 2007 Jul 01.
Article in English | MEDLINE | ID: mdl-17646300

ABSTRACT

MOTIVATION: The rapid accumulation of microarray datasets provides unique opportunities to perform systematic functional characterization of the human genome. We designed a graph-based approach to integrate cross-platform microarray data, and extract recurrent expression patterns. A series of microarray datasets can be modeled as a series of co-expression networks, in which we search for frequently occurring network patterns. The integrative approach provides three major advantages over the commonly used microarray analysis methods: (1) enhance signal to noise separation (2) identify functionally related genes without co-expression and (3) provide a way to predict gene functions in a context-specific way. RESULTS: We integrate 65 human microarray datasets, comprising 1105 experiments and over 11 million expression measurements. We develop a data mining procedure based on frequent itemset mining and biclustering to systematically discover network patterns that recur in at least five datasets. This resulted in 143,401 potential functional modules. Subsequently, we design a network topology statistic based on graph random walk that effectively captures characteristics of a gene's local functional environment. Function annotations based on this statistic are then subject to the assessment using the random forest method, combining six other attributes of the network modules. We assign 1126 functions to 895 genes, 779 known and 116 unknown, with a validation accuracy of 70%. Among our assignments, 20% genes are assigned with multiple functions based on different network environments. AVAILABILITY: http://zhoulab.usc.edu/ContextAnnotation.


Subject(s)
Algorithms, Chromosome Mapping/methods, Gene Expression Profiling/methods, Gene Expression/genetics, Human Genome/genetics, Proteome/genetics, Signal Transduction/genetics, Humans
19.
Bioinformatics ; 23(13): i577-86, 2007 Jul 01.
Article in English | MEDLINE | ID: mdl-17646346

ABSTRACT

MOTIVATION: A major challenge in studying gene regulation is to systematically reconstruct transcription regulatory modules, which are defined as sets of genes that are regulated by a common set of transcription factors. A commonly used approach for transcription module reconstruction is to derive coexpression clusters from a microarray dataset. However, such results often contain false positives because genes from many transcription modules may be simultaneously perturbed under a given type of condition. In this study, we propose and validate that genes which form a coexpression cluster in multiple microarray datasets across diverse conditions are more likely to form a transcription module. However, identifying genes coexpressed in a subset of many microarray datasets is not a trivial computational problem. RESULTS: We propose a graph-based data-mining approach to efficiently and systematically identify frequent coexpression clusters. Given m microarray datasets, we model each microarray dataset as a coexpression graph, and search for vertex sets which are frequently densely connected across [θm] datasets (0 ≤ θ ≤ 1). For this novel graph-mining problem, we designed two techniques to narrow down the search space: (1) partition the input graphs into (overlapping) groups sharing common properties; (2) summarize the vertex neighbor information from the partitioned datasets onto the 'Neighbor Association Summary Graphs' for effective mining. We applied our method to 105 human microarray datasets, and identified a large number of potential transcription modules, activated under different subsets of conditions. Validation by ChIP-chip data demonstrated that the likelihood of a coexpression cluster being a transcription module increases significantly with its recurrence. Our method opens a new way to exploit the vast amount of existing microarray data accumulation for gene regulation study. Furthermore, the algorithm is applicable to other biological networks for approximate network module mining. AVAILABILITY: http://zhoulab.usc.edu/NeMo/.
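
The recurrence criterion in this abstract can be illustrated with a small check of whether a candidate vertex set is densely connected in enough graphs. This is a sketch under assumptions, not the authors' mining algorithm: the bracketed threshold is read here as a ceiling, density is taken as the fraction of vertex pairs joined by an edge, and `min_density` and all function names are hypothetical.

```python
from itertools import combinations
from math import ceil


def edge(u, v):
    """Undirected edge represented as an unordered pair."""
    return frozenset((u, v))


def density(vertices, edges):
    """Fraction of vertex pairs within the set that are joined by an edge."""
    pairs = [edge(u, v) for u, v in combinations(sorted(vertices), 2)]
    if not pairs:
        return 0.0
    return sum(p in edges for p in pairs) / len(pairs)


def is_frequent_dense(vertices, graphs, theta, min_density):
    """True if the set is dense in at least ceil(theta * m) of the m graphs."""
    needed = ceil(theta * len(graphs))  # assumption: threshold rounded up
    hits = sum(density(vertices, g) >= min_density for g in graphs)
    return hits >= needed
```

Testing one candidate set is cheap; the difficulty the abstract describes lies in searching the space of candidate sets, which this sketch does not attempt.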


Subject(s)
Chromosome Mapping/methods, Human Genome/genetics, Transcriptional Regulatory Elements/genetics, DNA Sequence Analysis/methods, Transcription Factors/genetics, Genetic Transcription/genetics, Algorithms, Base Sequence, Binding Sites, Computer Graphics, Humans, Molecular Sequence Data, Protein Binding
20.
J Comput Biol ; 14(3): 255-66, 2007 Apr.
Article in English | MEDLINE | ID: mdl-17563310

ABSTRACT

Optical mapping is an integrated system for the analysis of single DNA molecules. It constructs restriction maps (referred to as "optical maps") from individual DNA molecules presented on surfaces after they are imaged by fluorescence microscopy. Because restriction digestion and fluorochrome staining are performed after molecules are mounted, the resulting restriction fragments retain their order. Maps of fragment sizes and order are constructed by image processing techniques employing integrated fluorescence intensity measurements. Such analysis, in place of molecular length measurements, obviates the need for uniformly elongated molecules, but requires samples containing small fluorescent reference molecules for accurate sizing. Although the method is robust in practice, eliminating internal reference molecules would reduce errors and extend single molecule analysis to other platforms. In this paper, we introduce a new approach that does not use reference molecules for direct estimation of restriction fragment sizes, by exploiting the quantiles associated with their expected distribution. We show that this approach is comparable to the current reference-based method, as evaluated by map alignment techniques in terms of the rate of placement of optical maps onto published sequence.


Subject(s)
Algorithms, Computational Biology, DNA, Restriction Mapping/methods, Humans, Fluorescence Microscopy