ABSTRACT
The best ideotypes are under mounting pressure due to increased aridity. Understanding the conserved molecular mechanisms that evolve in wild plants adapted to harsh environments is crucial in developing new strategies for agriculture. Yet our knowledge of such mechanisms in wild species is scant. We performed metabolic pathway reconstruction using transcriptome information from 32 Atacama and phylogenetically related species that do not live in Atacama (sister species). We analyzed reaction enrichment to understand the commonalities and differences of Atacama plants. To gain insights into the mechanisms that ensure survival, we compared expressed gene isoform numbers and gene expression patterns between the annotated biochemical reactions from 32 Atacama and sister species. We found biochemical convergences characterized by reactions enriched in at least 50% of the Atacama species, pointing to potential advantages against drought and nitrogen starvation, for instance. These findings suggest that the adaptation in the Atacama Desert may result in part from shared genetic legacies governing the expression of key metabolic pathways to face harsh conditions. Enriched reactions corresponded to ubiquitous compounds common to extreme and agronomic species and were congruent with our previous metabolomic analyses. Convergent adaptive traits offer promising candidates for improving abiotic stress resilience in crop species.
Subjects
Desert Climate; Phylogeny; Transcriptome; Chile; Adaptation, Physiological; Metabolic Networks and Pathways
ABSTRACT
This study evaluates both a variety of existing base causal inference methods and a variety of ensemble methods. We show that: (i) base network inference methods vary in their performance across different datasets, so a method that works poorly on one dataset may work well on another; (ii) a non-homogeneous ensemble method in the form of a Naive Bayes classifier leads overall to as good or better results than using the best single base method or any other ensemble method; (iii) for the best results, the ensemble method should integrate all methods that satisfy a statistical test of normality on training data. The resulting ensemble model EnsInfer easily integrates all kinds of RNA-seq data as well as new and existing inference methods. The paper categorizes and reviews state-of-the-art underlying methods, describes the EnsInfer ensemble approach in detail, and presents experimental results. The source code and data used will be made available to the community upon publication.
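The ensemble idea in point (ii) can be illustrated with a minimal sketch. It assumes that every candidate regulatory edge has one confidence score per base inference method and that some edges have known labels for training. The Gaussian form of the Naive Bayes classifier, the function names, and the toy scores are all assumptions made for illustration; this is not the EnsInfer code.

```python
import math
from statistics import mean, pvariance

def fit_gaussian_nb(rows, labels):
    """Fit per-class, per-feature Gaussians plus class priors."""
    model = {}
    for cls in set(labels):
        cls_rows = [r for r, y in zip(rows, labels) if y == cls]
        columns = list(zip(*cls_rows))
        model[cls] = {
            "prior": len(cls_rows) / len(rows),
            "means": [mean(c) for c in columns],
            # A small floor keeps every variance strictly positive.
            "vars": [pvariance(c) + 1e-6 for c in columns],
        }
    return model

def log_gaussian(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict_edge_probability(model, row):
    """Posterior probability that an edge is real (class 1)."""
    log_post = {
        cls: math.log(p["prior"])
        + sum(log_gaussian(x, m, v) for x, m, v in zip(row, p["means"], p["vars"]))
        for cls, p in model.items()
    }
    top = max(log_post.values())  # subtract the maximum for numerical stability
    weights = {cls: math.exp(lp - top) for cls, lp in log_post.items()}
    return weights[1] / sum(weights.values())

# Toy training data: each row holds the scores that two hypothetical base
# methods assigned to one candidate edge; label 1 marks a known true edge.
train_scores = [[0.9, 0.8], [0.8, 0.7], [0.85, 0.9],
                [0.2, 0.3], [0.1, 0.2], [0.3, 0.1]]
train_labels = [1, 1, 1, 0, 0, 0]

model = fit_gaussian_nb(train_scores, train_labels)
high = predict_edge_probability(model, [0.88, 0.82])
low = predict_edge_probability(model, [0.15, 0.2])
```

Treating each base method's score as one feature lets new inference methods be added as extra columns without changing the classifier.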
Subjects
Algorithms; Software; Bayes Theorem; RNA-Seq
ABSTRACT
BACKGROUND: Systems biology increasingly relies on deep sequencing with combinatorial index tags to associate biological sequences with their sample, cell, or molecule of origin. Accurate data interpretation depends on the ability to classify sequences based on correct decoding of these combinatorial barcodes. The probability of correct decoding is influenced by both sequence quality and the number and arrangement of barcodes. The rising complexity of experimental designs calls for a probability model that accounts for both sequencing errors and random noise, generalizes to multiple combinatorial tags, and can handle any barcoding scheme. The needs for reproducibility and community benchmark standards demand a peer-reviewed tool that preserves decoding quality scores and provides tunable control over classification confidence that balances precision and recall. Moreover, continuous improvements in sequencing throughput require a fast, parallelized and scalable implementation. RESULTS AND DISCUSSION: We developed a flexible, robustly engineered software that performs probabilistic decoding and supports arbitrarily complex barcoding designs. Pheniqs computes the full posterior decoding error probability of observed barcodes by consulting basecalling quality scores and prior distributions, and reports sequences and confidence scores in Sequence Alignment/Map (SAM) fields. The product of posteriors for multiple independent barcodes provides an overall confidence score for each read. Pheniqs achieves greater accuracy than minimum edit distance or simple maximum likelihood estimation, and it scales linearly with core count to enable the classification of > 11 billion reads in 1 h 15 m using < 50 megabytes of memory. Pheniqs has been in production use for seven years in our genomics core facility. 
CONCLUSION: We introduce a computationally efficient software that implements both probabilistic and minimum distance decoders and show that decoding barcodes using posterior probabilities is more accurate than available methods. Pheniqs allows fine-tuning of decoding sensitivity using intuitive confidence thresholds and is extensible with alternative decoders and new error models. Any arbitrary arrangement of barcodes is easily configured, enabling computation of combinatorial confidence scores for any barcoding strategy. An optimized multithreaded implementation assures that Pheniqs is faster and scales better with complex barcode sets than existing tools. Support for POSIX streams and multiple sequencing formats enables easy integration with automated analysis pipelines.
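The probabilistic decoding described above can be sketched as follows. This is a simplified illustration, not the Pheniqs implementation: it assumes that base-calling errors are independent, that an erroneous call is equally likely to be any of the three other bases, and that random noise produces each base with probability 1/4. The barcode set, the priors, and the function names are hypothetical.

```python
def phred_to_error(q):
    """Convert a Phred quality score to a base-calling error probability."""
    return 10 ** (-q / 10)

def likelihood(observed, qualities, barcode):
    """P(observed sequence | true barcode) under independent base errors."""
    p = 1.0
    for base, q, expected in zip(observed, qualities, barcode):
        e = phred_to_error(q)
        p *= (1 - e) if base == expected else e / 3
    return p

def decode(observed, qualities, priors, noise_prior):
    """Return the most probable barcode and its posterior probability."""
    joint = {b: prior * likelihood(observed, qualities, b)
             for b, prior in priors.items()}
    # Assumed noise model: every base is drawn uniformly from A, C, G, T.
    noise = noise_prior * 0.25 ** len(observed)
    evidence = sum(joint.values()) + noise
    best = max(joint, key=joint.get)
    return best, joint[best] / evidence

# Hypothetical two-barcode design with 2% of reads expected to be noise.
priors = {"ACGT": 0.49, "TGCA": 0.49}

# One mismatch at a low-quality position still decodes confidently.
best, posterior = decode("ACGA", [30, 30, 30, 10], priors, noise_prior=0.02)

# A second, independent barcode on the same read; the product of the two
# posteriors gives an overall confidence for the read.
best2, posterior2 = decode("TGCA", [30, 30, 30, 30], priors, noise_prior=0.02)
combined = posterior * posterior2
```

A read would then be classified only if `combined` exceeds a chosen confidence threshold, which is how a precision/recall trade-off could be tuned under these assumptions.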
Subjects
Electronic Data Processing; High-Throughput Nucleotide Sequencing; Bayes Theorem; DNA Barcoding, Taxonomic; Reproducibility of Results; Sequence Analysis, DNA; Software
ABSTRACT
This study exploits time, the relatively unexplored fourth dimension of gene regulatory networks (GRNs), to learn the temporal transcriptional logic underlying dynamic nitrogen (N) signaling in plants. Our "just-in-time" analysis of time-series transcriptome data uncovered a temporal cascade of cis elements underlying dynamic N signaling. To infer transcription factor (TF)-target edges in a GRN, we applied a time-based machine learning method to 2,174 dynamic N-responsive genes. We experimentally determined a network precision cutoff, using TF-regulated genome-wide targets of three TF hubs (CRF4, SNZ, and CDF1), used to "prune" the network to 155 TFs and 608 targets. This network precision was reconfirmed using genome-wide TF-target regulation data for four additional TFs (TGA1, HHO5/6, and PHL1) not used in network pruning. These higher-confidence edges in the GRN were further filtered by independent TF-target binding data, used to calculate a TF "N-specificity" index. This refined GRN identifies the temporal relationship of known/validated regulators of N signaling (NLP7/8, TGA1/4, NAC4, HRS1, and LBD37/38/39) and 146 additional regulators. Six TFs (CRF4, SNZ, CDF1, HHO5/6, and PHL1) validated herein regulate a significant number of genes in the dynamic N response, targeting 54% of N-uptake/assimilation pathway genes. Phenotypically, inducible overexpression of CRF4 in planta regulates genes resulting in altered biomass, root development, and ¹⁵NO₃⁻ uptake, specifically under low-N conditions. This dynamic N-signaling GRN now provides the temporal "transcriptional logic" for 155 candidate TFs to improve nitrogen use efficiency with potential agricultural applications. Broadly, these time-based approaches can uncover the temporal transcriptional logic for any biological response system in biology, agriculture, or medicine.
Subjects
Arabidopsis/genetics; Arabidopsis/metabolism; Gene Expression Regulation, Plant/genetics; Gene Regulatory Networks/genetics; Nitrogen/metabolism; Transcription, Genetic/genetics; Arabidopsis Proteins/genetics; Gene Expression Profiling/methods; Logic; Protein Binding/genetics; Signal Transduction/genetics; Transcription Factors/genetics
ABSTRACT
BACKGROUND: Several large public repositories of microarray datasets and RNA-seq data are available. Two prominent examples are ArrayExpress and NCBI GEO. Unfortunately, there is no easy way to import and manipulate data from such resources, because the data are stored in large files that require high bandwidth to download and special-purpose data manipulation tools to extract the subsets relevant to a specific analysis. RESULTS: TACITuS is a web-based system that supports rapid query access to high-throughput microarray and NGS repositories. The system is equipped with modules capable of managing large files, storing them in a cloud environment and extracting subsets of data in an easy and efficient way. The system also supports importing data into Galaxy for further analysis. CONCLUSIONS: TACITuS automates most of the pre-processing needed to analyze high-throughput microarray and NGS data from large publicly available repositories. The system implements several modules to manage large files in an easy and efficient way. Furthermore, it integrates with the Galaxy environment, allowing users to analyze data through a user-friendly interface.
Subjects
Big Data; Data Collection; Software; Transcriptome/genetics; Cell Line, Tumor; Databases, Genetic; Humans; User-Computer Interface
ABSTRACT
BACKGROUND: Networks whose nodes have labels can seem complex. Fortunately, many have substructures that occur often ("motifs"). A societal example of a motif might be a household. Replacing such motifs by named supernodes reduces the complexity of the network and can bring out insightful features. Doing so repeatedly may give hints about higher level structures of the network. We call this recursive process Recursive Supernode Extraction. RESULTS: This paper describes algorithms and a tool to discover disjoint (i.e. non-overlapping) motifs in a network, replacing those motifs by new nodes, and then recursing. We show applications in food-web and protein-protein interaction (PPI) networks where our methods reduce the complexity of the network and yield insights. CONCLUSIONS: SuperNoder is a web-based and standalone tool which enables the simplification of big graphs based on the reduction of high frequency motifs. It applies various strategies for identifying disjoint motifs with the goal of enhancing the understandability of networks.
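The recursive process can be sketched in a few lines. This is a deliberately simplified illustration rather than the SuperNoder algorithms: it ignores node labels, uses the triangle as the only motif, and selects disjoint occurrences greedily. All function names and the toy network are hypothetical.

```python
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def find_disjoint_triangles(adj):
    """Greedily pick triangles that share no nodes with each other."""
    used, found = set(), []
    for a, b, c in combinations(sorted(adj), 3):
        if {a, b, c} & used:
            continue
        if b in adj[a] and c in adj[a] and c in adj[b]:
            found.append((a, b, c))
            used.update((a, b, c))
    return found

def collapse(adj, groups, prefix):
    """Replace each group of nodes with a single named supernode."""
    owner = {n: n for n in adj}
    for i, group in enumerate(groups):
        for n in group:
            owner[n] = f"{prefix}{i}"
    new_adj = {owner[n]: set() for n in adj}
    for n, neighbours in adj.items():
        for m in neighbours:
            if owner[n] != owner[m]:  # drop edges internal to a supernode
                new_adj[owner[n]].add(owner[m])
    return new_adj

def recursive_supernode_extraction(adj, max_rounds=5):
    """Collapse disjoint motifs repeatedly until none remain."""
    for level in range(max_rounds):
        triangles = find_disjoint_triangles(adj)
        if not triangles:
            break
        adj = collapse(adj, triangles, prefix=f"S{level}_")
    return adj

# Toy network: two triangles joined by the single edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
reduced = recursive_supernode_extraction(build_adjacency(edges))
```

In this toy case the six-node network reduces to two supernodes joined by one edge, which is the kind of simplification the abstract describes.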
Subjects
Algorithms; Computational Biology/methods; Metabolic Networks and Pathways; Protein Interaction Maps; Software; Humans
ABSTRACT
RNAi is a powerful tool for the regulation of gene expression. It is widely and successfully employed in functional studies and is now emerging as a promising therapeutic approach. Several RNAi-based clinical trials suggest encouraging results in the treatment of a variety of diseases, including cancer. Here we present miR-Synth, a computational resource for the design of synthetic microRNAs able to target multiple genes in multiple sites. The proposed strategy constitutes a valid alternative to the use of siRNA, allowing fewer molecules to be used for the inhibition of multiple targets. This may represent a great advantage in designing therapies for diseases caused by crucial cellular pathways altered by multiple dysregulated genes. The system has been successfully validated on two of the most prominent genes associated with lung cancer, c-MET and Epidermal Growth Factor Receptor (EGFR). (See http://microrna.osumc.edu/mir-synth).
Subjects
Gene Knockdown Techniques; MicroRNAs/genetics; Software; 3' Untranslated Regions; Base Sequence; ErbB Receptors/biosynthesis; ErbB Receptors/genetics; Gene Expression; Genes, Reporter; HEK293 Cells; HeLa Cells; Humans; Luciferases, Renilla/biosynthesis; Luciferases, Renilla/genetics; Proto-Oncogene Proteins c-met/biosynthesis; Proto-Oncogene Proteins c-met/genetics; RNA Interference
ABSTRACT
Negative examples - genes that are known not to carry out a given protein function - are rarely recorded in genome and proteome annotation databases, such as the Gene Ontology database. Negative examples are required, however, for several of the most powerful machine learning methods for integrative protein function prediction. Most protein function prediction efforts have relied on a variety of heuristics for the choice of negative examples. Determining the accuracy of methods for negative example prediction is itself a non-trivial task, given that the Open World Assumption as applied to gene annotations rules out many traditional validation metrics. We present a rigorous comparison of these heuristics, utilizing a temporal holdout, and a novel evaluation strategy for negative examples. We add to this comparison several algorithms adapted from Positive-Unlabeled learning scenarios in text-classification, which are the current state of the art methods for generating negative examples in low-density annotation contexts. Lastly, we present two novel algorithms of our own construction, one based on empirical conditional probability, and the other using topic modeling applied to genes and annotations. We demonstrate that our algorithms achieve significantly fewer incorrect negative example predictions than the current state of the art, using multiple benchmarks covering multiple organisms. Our methods may be applied to generate negative examples for any type of method that deals with protein function, and to this end we provide a database of negative examples in several well-studied organisms, for general use (The NoGO database, available at: bonneaulab.bio.nyu.edu/nogo.html).
Subjects
Algorithms; Databases, Genetic; Gene Ontology; Proteins/genetics; Proteins/physiology; Animals; Arabidopsis Proteins/genetics; Arabidopsis Proteins/physiology; Artificial Intelligence; Computational Biology; Genome; Humans; Mice; Molecular Sequence Annotation; Proteome; Saccharomyces cerevisiae Proteins/genetics; Saccharomyces cerevisiae Proteins/physiology
ABSTRACT
The incompleteness of proteome structure and function annotation is a critical problem for biologists and, in particular, severely limits interpretation of high-throughput and next-generation experiments. We have developed a proteome annotation pipeline based on structure prediction, where function and structure annotations are generated using an integration of sequence comparison, fold recognition, and grid-computing-enabled de novo structure prediction. We predict protein domain boundaries and three-dimensional (3D) structures for protein domains from 94 genomes (including human, Arabidopsis, rice, mouse, fly, yeast, Escherichia coli, and worm). De novo structure predictions were distributed on a grid of more than 1.5 million CPUs worldwide (World Community Grid). We generated significant numbers of new confident fold annotations (9% of domains that are otherwise unannotated in these genomes). We demonstrate that predicted structures can be combined with annotations from the Gene Ontology database to predict new and more specific molecular functions.
Subjects
Protein Folding; Proteome/chemistry; Animals; Chorismate Mutase/chemistry; Deinococcus/metabolism; Deinococcus/radiation effects; Drosophila Proteins/chemistry; Genome; Glucosyltransferases/chemistry; Humans; Mice; Molecular Sequence Annotation; Nuclear Proteins/chemistry; Nuclear Proteins/classification; Plasmodium vivax/metabolism; Protein Conformation; Protein Structure, Tertiary; Protozoan Proteins/chemistry; Quality Control; Reproducibility of Results; Transglutaminases/chemistry; User-Computer Interface
ABSTRACT
MOTIVATION: Computational biologists have demonstrated the utility of using machine learning methods to predict protein function from an integration of multiple genome-wide data types. Yet, even the best performing function prediction algorithms rely on heuristics for important components of the algorithm, such as choosing negative examples (proteins without a given function) or determining key parameters. The improper choice of negative examples, in particular, can hamper the accuracy of protein function prediction. RESULTS: We present a novel approach for choosing negative examples, using a parameterizable Bayesian prior computed from all observed annotation data, which also generates priors used during function prediction. We incorporate this new method into the GeneMANIA function prediction algorithm and demonstrate improved accuracy of our algorithm over current top-performing function prediction methods on the yeast and mouse proteomes across all metrics tested. AVAILABILITY: Code and Data are available at: http://bonneaulab.bio.nyu.edu/funcprop.html
Subjects
Algorithms; Proteins/physiology; Animals; Artificial Intelligence; Bayes Theorem; Gene Regulatory Networks; Genome; Mice; Molecular Sequence Annotation; Protein Interaction Mapping; Proteins/genetics; Proteins/metabolism; Proteome/metabolism; Yeasts/genetics; Yeasts/metabolism
ABSTRACT
As sessile organisms, root plasticity enables plants to forage for and acquire nutrients in a fluctuating underground environment. Here, we use genetic and genomic approaches in a "split-root" framework--in which physically isolated root systems of the same plant are challenged with different nitrogen (N) environments--to investigate how systemic signaling affects genome-wide reprogramming and root development. The integration of transcriptome and root phenotypes enables us to identify distinct mechanisms underlying "N economy" (i.e., N supply and demand) of plants as a system. Under nitrate-limited conditions, plant roots adopt an "active-foraging strategy", characterized by lateral root outgrowth and a shared pattern of transcriptome reprogramming, in response to either local or distal nitrate deprivation. By contrast, in nitrate-replete conditions, plant roots adopt a "dormant strategy", characterized by a repression of lateral root outgrowth and a shared pattern of transcriptome reprogramming, in response to either local or distal nitrate supply. Sentinel genes responding to systemic N signaling identified by genome-wide comparisons of heterogeneous vs. homogeneous split-root N treatments were used to probe systemic N responses in Arabidopsis mutants impaired in nitrate reduction and hormone synthesis and also in decapitated plants. This combined analysis identified genetically distinct systemic signaling underlying plant N economy: (i) N supply, corresponding to a long-distance systemic signaling triggered by nitrate sensing; and (ii) N demand, experimental support for the transitive closure of a previously inferred nitrate-cytokinin shoot-root relay system that reports the nitrate demand of the whole plant, promoting a compensatory root growth in nitrate-rich patches of heterogeneous soil.
Subjects
Cytokinins/metabolism; Nitrates/metabolism; Nitrogen/metabolism; Plant Roots/metabolism; Signal Transduction; Arabidopsis/genetics; Arabidopsis/metabolism; Cytokinins/biosynthesis; Genes, Plant
ABSTRACT
A network, whose nodes are genes and whose directed edges represent positive or negative influences of a regulatory gene and its targets, is often used as a representation of causality. To infer a network, researchers often develop a machine learning model and then evaluate the model based on its match with experimentally verified "gold standard" edges. The desired result of such a model is a network that may extend the gold standard edges. Since networks are a form of visual representation, one can compare their utility with architectural or machine blueprints. Blueprints are clearly useful because they provide precise guidance to builders in construction. If the primary role of gene regulatory networks is to characterize causality, then such networks should be good tools of prediction because prediction is the actionable benefit of knowing causality. But are they? In this paper, we compare prediction quality based on "gold standard" regulatory edges from previous experimental work with non-linear models inferred from time series data across four different species. We show that the same non-linear machine learning models have better predictive performance, with improvements from 5.3% to 25.3% in terms of the reduction in the root mean square error (RMSE) compared with the same models based on the gold standard edges. Having established that networks fail to characterize causality properly, we suggest that causality research should focus on four goals: (i) predictive accuracy; (ii) a parsimonious enumeration of predictive regulatory genes for each target gene g; (iii) the identification of disjoint sets of predictive regulatory genes for each target g of roughly equal accuracy; and (iv) the construction of a bipartite network (whose node types are genes and models) representation of causality. We provide algorithms for all goals.
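The comparison metric can be made concrete with a small worked example. The sketch below computes the root mean square error (RMSE) of two sets of predictions and the relative reduction between them. Expressing the improvement as a percentage of the baseline RMSE is an assumption about how the reported figures are defined, and the expression values are invented for illustration.

```python
import math

def rmse(predicted, observed):
    """Root mean square error between predictions and observations."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

def rmse_reduction_percent(baseline_rmse, new_rmse):
    """Relative RMSE reduction of the new model, as a percentage."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse

# Invented expression values for one target gene at four time points.
observed = [1.0, 2.0, 3.0, 4.0]
gold_standard_pred = [1.4, 1.6, 3.4, 3.6]  # model restricted to gold-standard regulators
time_series_pred = [1.3, 1.7, 3.3, 3.7]    # model with regulators inferred from time series

baseline = rmse(gold_standard_pred, observed)
improved = rmse(time_series_pred, observed)
reduction = rmse_reduction_percent(baseline, improved)
```

Here every gold-standard prediction is off by 0.4 and every time-series prediction by 0.3, so the RMSE falls from 0.4 to 0.3, a 25% reduction.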
ABSTRACT
Rational computational design is crucial to the pursuit of novel drugs and therapeutic agents. Meso-scale cyclic peptides, which consist of 7-40 amino acid residues, are of particular interest due to their conformational rigidity, binding specificity, degradation resistance, and potential cell permeability. Because there are few natural cyclic peptides, de novo design involving non-canonical amino acids is a potentially useful goal. Here, we develop an efficient pipeline (CyclicChamp) for cyclic peptide design. After converting the cyclic constraint into an error function, we employ a variant of simulated annealing to search for low-energy peptide backbones while maintaining peptide closure. Compared to the previous random sampling approach, which was capable of sampling conformations of cyclic peptides of up to 14 residues, our method both greatly accelerates the computation speed for sampling conformations of small macrocycles (ca. 7 residues), and addresses the high-dimensionality challenge that large macrocycle designs often encounter. As a result, CyclicChamp makes conformational sampling tractable for 15- to 24-residue cyclic peptides, thus permitting the design of macrocycles in this size range. Microsecond-length molecular dynamics simulations on the resulting 15, 20, and 24 amino acid cyclic designs identify trajectories with kinetic stability. To test their thermodynamic stability, we perform additional replica exchange molecular dynamics simulations and generate free energy surfaces. Two 15-residue designs and one 20-residue design emerge as promising candidates, along with one viable 24-residue candidate.
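The core idea, turning the ring-closure constraint into an error function and minimizing it by simulated annealing, can be shown with a toy model. The sketch below is not CyclicChamp: it assumes a two-dimensional chain of unit-length bonds described by absolute bond directions, a linear cooling schedule, and Gaussian perturbations, all chosen only for illustration.

```python
import math
import random

def closure_error(angles):
    """Distance between the chain end and its start for unit bonds in 2D."""
    x = sum(math.cos(a) for a in angles)
    y = sum(math.sin(a) for a in angles)
    return math.hypot(x, y)

def anneal(n_bonds, steps=5000, start_temp=1.0, seed=0):
    """Search for bond directions that close the chain into a ring."""
    rng = random.Random(seed)
    angles = [rng.uniform(0, 2 * math.pi) for _ in range(n_bonds)]
    current = closure_error(angles)
    initial = current
    best_angles, best = list(angles), current
    for step in range(steps):
        # Linear cooling; the small offset avoids division by zero.
        temp = start_temp * (1 - step / steps) + 1e-9
        i = rng.randrange(n_bonds)
        old = angles[i]
        angles[i] = old + rng.gauss(0, 0.3)
        proposed = closure_error(angles)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if proposed <= current or rng.random() < math.exp((current - proposed) / temp):
            current = proposed
            if current < best:
                best_angles, best = list(angles), current
        else:
            angles[i] = old  # reject the move and restore the old direction
    return initial, best, best_angles

initial_error, best_error, ring_angles = anneal(n_bonds=15)
```

A real design pipeline would add an energy term to the objective and work with backbone torsions in three dimensions; the sketch only shows how closure can be handled as a penalty during the search.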
ABSTRACT
[This corrects the article DOI: 10.3389/fgene.2024.1371607.].
ABSTRACT
BACKGROUND: Graphs can represent biological networks at the molecular, protein, or species level. An important query is to find all matches of a pattern graph to a target graph. Accomplishing this is inherently difficult (NP-complete) and the efficiency of heuristic algorithms for the problem may depend upon the input graphs. The common aim of existing algorithms is to eliminate unsuccessful mappings as early and as inexpensively as possible. RESULTS: We propose a new subgraph isomorphism algorithm which applies a search strategy to significantly reduce the search space without using any complex pruning rules or domain reduction procedures. We compare our method with the most recent and efficient subgraph isomorphism algorithms (VFlib, LAD, and our C++ implementation of FocusSearch which was originally distributed in Modula2) on synthetic, molecular, and interaction network data. We show a significant reduction in the running time of our approach compared with these other excellent methods and show that our algorithm scales well as memory demands increase. CONCLUSIONS: Subgraph isomorphism algorithms are intensively used by biochemical tools. Our analysis gives a comprehensive comparison of different software approaches to subgraph isomorphism, highlighting their weaknesses and strengths. This will help researchers make a rational choice among methods depending on their application. We also distribute an open-source package including our system and our own C++ implementation of FocusSearch together with all the datasets used (http://ferrolab.dmi.unict.it/ri.html). In future work, our findings may be extended to approximate subgraph isomorphism algorithms.
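The query itself, finding all matches of a pattern graph in a target graph, can be illustrated with a generic backtracking search. This sketch is not the authors' algorithm: it uses a plain static ordering of pattern nodes by degree and a simple degree check, and it enumerates non-induced matches in undirected, unlabeled graphs. The function name and the toy graphs are hypothetical.

```python
def find_matches(pattern, target):
    """Yield every injective node mapping that preserves all pattern edges.

    Both graphs are dicts mapping a node to the set of its neighbours.
    """
    # Static search order: most constrained (highest-degree) pattern nodes first.
    order = sorted(pattern, key=lambda n: len(pattern[n]), reverse=True)

    def extend(mapping, depth):
        if depth == len(order):
            yield dict(mapping)
            return
        node = order[depth]
        for candidate in target:
            if candidate in mapping.values():
                continue  # keep the mapping injective
            if len(target[candidate]) < len(pattern[node]):
                continue  # candidate has too few neighbours
            # Every already-mapped pattern neighbour must be a target neighbour.
            if all(mapping[nb] in target[candidate]
                   for nb in pattern[node] if nb in mapping):
                mapping[node] = candidate
                yield from extend(mapping, depth + 1)
                del mapping[node]

    yield from extend({}, 0)

# Pattern: a triangle. Target: a square 1-2-3-4 with the diagonal 1-3.
triangle = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
square = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}

matches = list(find_matches(triangle, square))
matched_node_sets = {frozenset(m.values()) for m in matches}
```

The target contains two triangles, {1, 2, 3} and {1, 3, 4}, and each is reported once per automorphism of the pattern, giving twelve mappings in total.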
Subjects
Algorithms; Proteins/metabolism; Software; Artificial Intelligence; Protein Interaction Maps; Signal Transduction
ABSTRACT
MOTIVATION: A-to-I RNA editing is an important mechanism that consists of the conversion of specific adenosines into inosines in RNA molecules. Its dysregulation has been associated with several human diseases, including cancer. Recent work has demonstrated a role for A-to-I editing in microRNA (miRNA)-mediated gene expression regulation. In fact, edited forms of mature miRNAs can target sets of genes that differ from the targets of their unedited forms. The specific deamination of mRNAs can generate novel binding sites in addition to potentially altering existing ones. RESULTS: This work presents miR-EdiTar, a database of predicted A-to-I edited miRNA binding sites. The database contains predicted miRNA binding sites that could be affected by A-to-I editing and sites that could become miRNA binding sites as a result of A-to-I editing. AVAILABILITY: miR-EdiTar is freely available online at http://microrna.osumc.edu/mireditar. CONTACT: alessandro.lagana@osumc.edu or carlo.croce@osumc.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Binding Sites/genetics; Databases, Genetic; MicroRNAs/genetics; RNA Editing; Adenosine/genetics; Gene Expression Regulation; Humans; Inosine/genetics; Internet; Nucleic Acid Conformation
ABSTRACT
Data generation is no longer the limiting factor in advancing biological research. Instead, data integration, analysis, and interpretation have become key bottlenecks and challenges that biologists conducting genomic research face daily. To enable biologists to derive testable hypotheses from the increasing amount of genomic data, we have developed the VirtualPlant software platform. VirtualPlant enables scientists to visualize, integrate, and analyze genomic data from a systems biology perspective. VirtualPlant integrates genome-wide data concerning the known and predicted relationships among genes, proteins, and molecules, as well as genome-scale experimental measurements. VirtualPlant also provides visualization techniques that render multivariate information in visual formats that facilitate the extraction of biological concepts. Importantly, VirtualPlant helps biologists who are not trained in computer science to mine lists of genes, microarray experiments, and gene networks to address questions in plant biology, such as: What are the molecular mechanisms by which internal or external perturbations affect processes controlling growth and development? We illustrate the use of VirtualPlant with three case studies, ranging from querying a gene of interest to the identification of gene networks and regulatory hubs that control seed development. Whereas the VirtualPlant software was developed to mine Arabidopsis (Arabidopsis thaliana) genomic data, its data structures, algorithms, and visualization tools are designed in a species-independent way. VirtualPlant is freely available at www.virtualplant.org.
Subjects
Database Management Systems; Genomics; Plants/genetics; Systems Biology; Computational Biology/methods; Databases, Genetic; Gene Regulatory Networks; Genes, Plant; Genome, Plant; Oligonucleotide Array Sequence Analysis; User-Computer Interface
ABSTRACT
SafePredict is a novel meta-algorithm that works with any base prediction algorithm for online data to guarantee an arbitrarily chosen correctness rate, 1-ϵ, by allowing refusals. Allowing refusals means that the meta-algorithm may refuse to emit a prediction produced by the base algorithm so that the error rate on non-refused predictions does not exceed ϵ. The SafePredict error bound does not rely on any assumptions about the data distribution or the base predictor. When the base predictor happens not to exceed the target error rate ϵ, SafePredict refuses only a finite number of times. When the error rate of the base predictor changes over time, SafePredict uses a weight-shifting heuristic that adapts to these changes without knowing when they occur, yet still maintains the correctness guarantee. Empirical results show that (i) SafePredict compares favorably with state-of-the-art confidence-based refusal mechanisms, which fail to offer robust error guarantees; and (ii) combining SafePredict with such refusal mechanisms can in many cases further reduce the number of refusals. Our software is included in the supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2019.2932415.
ABSTRACT
In the plant meristem, tissue-wide maturation gradients are coordinated with specialized cell networks to establish various developmental phases required for indeterminate growth. Here, we used single-cell transcriptomics to reconstruct the protophloem developmental trajectory from the birth of cell progenitors to terminal differentiation in the Arabidopsis thaliana root. PHLOEM EARLY DNA-BINDING-WITH-ONE-FINGER (PEAR) transcription factors mediate lineage bifurcation by activating guanosine triphosphatase signaling and prime a transcriptional differentiation program. This program is initially repressed by a meristem-wide gradient of PLETHORA transcription factors. Only the dissipation of PLETHORA gradient permits activation of the differentiation program that involves mutual inhibition of early versus late meristem regulators. Thus, for phloem development, broad maturation gradients interface with cell-type-specific transcriptional regulators to stage cellular differentiation.
Subjects
Arabidopsis Proteins/metabolism; Arabidopsis/cytology; Phloem/cytology; Phloem/growth & development; Plant Roots/cytology; Transcription Factors/metabolism; Arabidopsis/genetics; Arabidopsis/metabolism; Arabidopsis Proteins/genetics; Cell Differentiation; GTP-Binding Proteins/genetics; GTP-Binding Proteins/metabolism; Meristem/cytology; Phloem/genetics; Phloem/metabolism; Plant Roots/genetics; Plant Roots/growth & development; Plant Roots/metabolism; RNA-Seq; Signal Transduction; Single-Cell Analysis; Transcription Factors/genetics; Transcriptome
ABSTRACT
BACKGROUND: Finding the subgraphs of a graph database that are isomorphic to a given query graph has practical applications in several fields, from cheminformatics to image understanding. Since subgraph isomorphism is a computationally hard problem, indexing techniques have been intensively exploited to speed up the process. Such systems filter out those graphs which cannot contain the query, and apply a subgraph isomorphism algorithm to each residual candidate graph. The applicability of such systems is limited to databases of small graphs, because their filtering power degrades on large graphs. RESULTS: In this paper, SING (Subgraph search In Non-homogeneous Graphs), a novel indexing system able to cope with large graphs, is presented. The method uses the notion of feature, which can be a small subgraph, subtree or path. Each graph in the database is annotated with the set of all its features. The key point is to make use of feature locality information. This idea is used to both improve the filtering performance and speed up the subgraph isomorphism task. CONCLUSIONS: Extensive tests on chemical compounds, biological networks and synthetic graphs show that the proposed system outperforms the most popular systems in query time over databases of medium and large graphs. Other specific tests show that the proposed system is effective for single large graphs.
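The filter-then-verify scheme described in the background can be sketched as follows. This illustration covers only the generic filtering step with label paths as features; it does not implement the feature-locality idea that distinguishes SING. The function names, the path length limit, and the toy graphs are assumptions.

```python
def path_features(adj, labels, max_len=3):
    """Label sequences of all simple paths with at most max_len nodes."""
    features = set()

    def walk(path):
        features.add(tuple(labels[n] for n in path))
        if len(path) < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in adj:
        walk([start])
    return features

def filter_candidates(index, query_features):
    """Keep only the graphs whose feature set contains every query feature."""
    return [name for name, feats in index.items() if query_features <= feats]

# Toy database of two labeled path graphs: C-O-N and C-C-O.
g1_adj = {0: {1}, 1: {0, 2}, 2: {1}}
g1_labels = {0: "C", 1: "O", 2: "N"}
g2_adj = {0: {1}, 1: {0, 2}, 2: {1}}
g2_labels = {0: "C", 1: "C", 2: "O"}

index = {
    "g1": path_features(g1_adj, g1_labels),
    "g2": path_features(g2_adj, g2_labels),
}

# Query: a single O-N edge.
query_adj = {0: {1}, 1: {0}}
query_labels = {0: "O", 1: "N"}
candidates = filter_candidates(index, path_features(query_adj, query_labels))
```

Only the surviving candidates would then be passed to a full subgraph isomorphism test. The filter is safe because any simple path in the query must also appear, with the same labels, in every graph that contains the query.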