1.
BMC Bioinformatics ; 25(1): 187, 2024 May 13.
Article in English | MEDLINE | ID: mdl-38741200

ABSTRACT

MOTIVATION: Long non-coding RNAs (lncRNAs) are a class of molecules involved in important biological processes. Extensive efforts have been made to gain a deeper understanding of disease mechanisms at the lncRNA level, guiding the detection of biomarkers for disease diagnosis, treatment, prognosis and prevention. Unfortunately, due to cost and time constraints, the number of possible disease-related lncRNAs verified by traditional biological experiments is very limited. Computational approaches for the prediction of disease-lncRNA associations make it possible to identify the most promising candidates to be verified in the laboratory, reducing cost and time. RESULTS: We propose novel approaches for the prediction of lncRNA-disease associations, all sharing the idea of exploring associations among lncRNAs, other intermediate molecules (e.g., miRNAs) and diseases, suitably represented by tripartite graphs. Indeed, while only a few lncRNA-disease associations are known so far, plenty of interactions between lncRNAs and other molecules, as well as associations of the latter with diseases, are available. The first approach presented here, NGH, relies on neighborhood analysis performed on a tripartite graph built upon lncRNAs, miRNAs and diseases. A second approach (CF) relies on collaborative filtering; a third approach (NGH-CF) is obtained by boosting NGH with collaborative filtering. The proposed approaches have been validated on both synthetic and real data, and compared against other methods from the literature. The results show that neighborhood analysis outperforms competitors, and that combining it with collaborative filtering further improves prediction accuracy, reaching an AUC of 0.966. AVAILABILITY: Source code and sample datasets are available at: https://github.com/marybonomo/LDAsPredictionApproaches.git.


Subject(s)
Computational Biology , RNA, Long Noncoding , RNA, Long Noncoding/genetics , Humans , Computational Biology/methods , Algorithms , MicroRNAs/genetics , MicroRNAs/metabolism , Genetic Predisposition to Disease/genetics
2.
Bioinformatics ; 39(4)2023 04 03.
Article in English | MEDLINE | ID: mdl-37021928

ABSTRACT

MOTIVATION: An interesting problem is to study how gene co-expression varies in two different populations, associated with healthy and unhealthy individuals, respectively. To this aim, two important aspects should be taken into account: (i) in some cases, pairs/groups of genes show collaborative behavior, emerging in the study of disorders and diseases; (ii) information coming from each single individual may be crucial to capture specific details underlying complex cellular mechanisms; therefore, it is important to avoid missing potentially powerful information associated with the single samples. RESULTS: Here, a novel approach is proposed in which two different input populations are considered and represented by two datasets of edge-labeled graphs. Each graph is associated with an individual, and the edge label is the co-expression value between the two genes associated with the nodes. Discriminative patterns among graphs belonging to different sample sets are searched for, based on a statistical notion of 'relevance' able to take into account important local similarities, as well as collaborative effects involving the co-expression among multiple genes. Four different gene expression datasets have been analyzed by the proposed approach, each associated with a different disease. An extensive set of experiments shows that the extracted patterns significantly characterize important differences between healthy and unhealthy samples, both in the cooperation and in the biological functionality of the involved genes/proteins. Moreover, the provided analysis confirms some results already presented in the literature on genes with a central role in the considered diseases, while also yielding novel and useful insights on this aspect. AVAILABILITY AND IMPLEMENTATION: The algorithm has been implemented using the Java programming language. The data underlying this article and the code are available at https://github.com/CriSe92/DiscriminativeSubgraphDiscovery.


Subject(s)
Algorithms , Proteins , Humans
3.
BMC Bioinformatics ; 23(1): 474, 2022 Nov 11.
Article in English | MEDLINE | ID: mdl-36368948

ABSTRACT

BACKGROUND: Huge amounts of molecular interaction data are continuously produced and stored in public databases. Although many bioinformatics tools have been proposed in the literature for their analysis, based on their modeling through different types of biological networks, several problems remain unsolved when the analysis is carried out on a large scale. RESULTS: We propose DIAMIN, a high-level software library to facilitate the development of applications for the efficient analysis of large-scale molecular interaction networks. DIAMIN relies on distributed computing, and it is implemented in Java on top of the Apache Spark framework. It delivers a set of functionalities implementing different tasks on an abstract representation of very large graphs, providing built-in support for methods and algorithms commonly used to analyze these networks. DIAMIN has been tested on data retrieved from two of the most used molecular interaction databases, proving to be highly efficient and scalable. As shown by the examples provided, DIAMIN can be exploited by users without any distributed programming experience, in order to perform various types of data analysis and to implement new algorithms based on its primitives. CONCLUSIONS: DIAMIN has proven successful in allowing users to solve specific biological problems that can be modeled through biological networks, by using its functionalities. The software is freely available, and this will hopefully allow its rapid diffusion through the scientific community, to address both specific data analysis and more complex tasks.


Subject(s)
Computational Biology , Software , Computational Biology/methods , Algorithms , Databases, Factual , Gene Library
4.
Brief Bioinform ; 23(3)2022 05 13.
Article in English | MEDLINE | ID: mdl-35381599

ABSTRACT

MOTIVATION: Biological network topology yields important insights into biological function, occurrence of diseases and drug design. In the last few years, different types of topological measures have been introduced and applied to infer the biological relevance of network components/interactions, according to their position within the network structure. Although comparisons of such measures have been previously proposed, the extent to which topology per se may lead to the extraction of novel biological knowledge has never been critically examined or formalized in the literature. RESULTS: We present a comparative analysis of nine prominent topological measures, based on compact views obtained from the rank they induce on a given input biological network. The goal is to understand their ability to correctly position nodes/edges in the rank, according to the functional knowledge implicitly encoded in biological networks. To this aim, both internal and external (gold standard) validation criteria are taken into account, and six networks involving three different organisms (yeast, worm and human) are included in the comparison. The results show that a distinct handful of best-performing measures can be identified for each of the considered organisms, independently of the reference gold standard. AVAILABILITY: Input files and code for the computation of the considered topological measures and K-haus distance are available at https://gitlab.com/MaryBonomo/ranking. CONTACT: simona.rombo@unipa.it. SUPPLEMENTARY INFORMATION: Supplementary data are available at Briefings in Bioinformatics online.


Subject(s)
Algorithms
6.
BMC Bioinformatics ; 20(Suppl 4): 124, 2019 Apr 18.
Article in English | MEDLINE | ID: mdl-30999847

ABSTRACT

BACKGROUND: RNA editing is an important mechanism for gene expression in plant organelles. It alters the direct transfer of genetic information from DNA to proteins, due to the introduction of differences between RNAs and the corresponding coding DNA sequences. Software tools that successfully search for genes in other organisms are not always able to perform this task correctly in plant organellar genomes. Moreover, the available software tools predicting RNA editing events utilise algorithms that do not account for events which may generate a novel start codon. RESULTS: We present FEDRO, a Java software tool implementing a novel strategy to generate candidate Open Reading Frames (ORFs) resulting from Cytidine to Uridine (C→U) editing substitutions which occur in the mitochondrial genome (mtDNA) of a given input plant. The goal is to predict putative proteins of plant mitochondria that have not yet been annotated. In order to validate the generated ORFs, a screening is performed by checking for sequence similarity or presence in active transcripts of the same or similar organisms. We illustrate the functionalities of our framework on a model organism. CONCLUSIONS: The proposed tool may also be used on other organisms and genomes. FEDRO is publicly available at http://math.unipa.it/rombo/FEDRO .


Subject(s)
Open Reading Frames/genetics , Oryza/genetics , RNA Editing/genetics , Software , Base Sequence , DNA, Mitochondrial/genetics , Genome, Mitochondrial
7.
BMC Bioinformatics ; 20(Suppl 4): 138, 2019 Apr 18.
Article in English | MEDLINE | ID: mdl-30999863

ABSTRACT

BACKGROUND: Distributed approaches based on the MapReduce programming paradigm have started to be proposed in the Bioinformatics domain, due to the large amount of data produced by next-generation sequencing techniques. However, the use of MapReduce and related Big Data technologies and frameworks (e.g., Apache Hadoop and Spark) does not necessarily produce satisfactory results, in terms of both efficiency and effectiveness. We discuss how the development of distributed and Big Data management technologies has affected the analysis of large datasets of biological sequences. Moreover, we show how the choice of different parameter configurations and the careful engineering of the software with respect to the specific framework under consideration may be crucial in order to achieve good performance, especially on very large amounts of data. We choose k-mer counting as a case study for our analysis, and Spark as the framework to implement FastKmer, a novel approach for the extraction of k-mer statistics from large collections of biological sequences, with arbitrary values of k. RESULTS: One of the most relevant contributions of FastKmer is the introduction of a module for balancing the statistics aggregation workload over the nodes of a computing cluster, in order to overcome data skew while allowing for full exploitation of the underlying distributed architecture. We also present the results of a comparative experimental analysis showing that our approach is currently the fastest among those based on Big Data technologies, while exhibiting very good scalability. CONCLUSIONS: We provide evidence that the usage of technologies such as Hadoop or Spark for the analysis of big datasets of biological sequences is productive only if the architectural details and the peculiar aspects of the considered framework are carefully taken into account in the algorithm design and implementation.
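
For readers unfamiliar with the case study named in the abstract above, k-mer counting means tallying every substring of length k across a collection of sequences. The following is a minimal single-machine sketch in Python; it is not FastKmer and does not use Spark, and the function name `count_kmers` and the sample reads are hypothetical, chosen only for illustration.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every length-k substring (k-mer) across a collection of sequences.

    Illustrative only: a sequential, in-memory baseline, not the distributed
    FastKmer algorithm described in the abstract.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Slide a window of width k; sequences shorter than k contribute nothing.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


if __name__ == "__main__":
    reads = ["ACGTAC", "GTACG"]  # hypothetical toy input
    for kmer, n in sorted(count_kmers(reads, 3).items()):
        print(kmer, n)
```

A distributed version would have to split this tally across cluster nodes and then merge the partial counts, which is where the workload balancing and data skew issues discussed in the abstract arise.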


Subject(s)
Data Analysis , Databases, Nucleic Acid , Genome , Statistics as Topic , Algorithms , Base Sequence , Software , Time Factors
8.
Bioinformatics ; 34(20): 3454-3460, 2018 10 15.
Article in English | MEDLINE | ID: mdl-30204840

ABSTRACT

Motivation: Although the nucleosome occupancy along a genome can be in part predicted by in vitro experiments, it has been recently observed that the chromatin organization presents important differences in vitro with respect to in vivo. Such differences mainly regard the hierarchical and regular structures of the nucleosome fiber, whose existence has long been assumed, and in part also observed in vitro, but which apparently do not occur in vivo. It is also well known that the DNA sequence has a role in determining the nucleosome occupancy. Therefore, an important issue is to understand if, and to what extent, the structural differences in the chromatin organization between in vitro and in vivo have a counterpart in terms of the underlying genomic sequences. Results: We present the first quantitative comparison between the in vitro and in vivo nucleosome maps of two model organisms (S. cerevisiae and C. elegans). The comparison is based on the construction of weighted k-mer dictionaries. Our findings show that there is a good level of sequence conservation between in vitro and in vivo in both organisms, in contrast to the above-mentioned important differences in chromatin structural organization. Moreover, our results provide evidence that the two organisms are predisposed differently to nucleosome occupancy in terms of sequence composition, both in vitro and in vivo. This leads to the conclusion that, although the notion of a genome encoding its own nucleosome occupancy is general, the intrinsic histone k-mer sequence preferences tend to be species-specific. Availability and implementation: The files containing the dictionaries and the main results of the analysis are available at http://math.unipa.it/rombo/material. Supplementary information: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome , Sequence Analysis , Animals , Caenorhabditis elegans/genetics , Chromatin/genetics , Eukaryotic Cells , Histones/genetics , Nucleosomes , Saccharomyces cerevisiae/genetics
9.
Article in English | MEDLINE | ID: mdl-28113780

ABSTRACT

Entropy, being closely related to repetitiveness and compressibility, is a widely used information-related measure to assess the degree of predictability of a sequence. Entropic profiles are based on information theory principles, and can be used to study the under-/over-representation of subwords, while also providing information about the scale of conserved DNA regions. Here, we focus on the algorithmic aspects related to entropic profiles. In particular, we propose linear time algorithms for their computation that rely on suffix-based data structures, more specifically on the truncated suffix tree (TST) and on the enhanced suffix array (ESA). We performed an extensive experimental campaign showing that our algorithms, besides being faster than state-of-the-art algorithms, make it possible to analyze longer sequences, even at high degrees of resolution.


Subject(s)
Algorithms , Computational Biology/methods , Entropy , Sequence Analysis, DNA/methods , Animals , DNA/genetics , Humans
10.
Bioinformatics ; 31(18): 2939-46, 2015 Sep 15.
Article in English | MEDLINE | ID: mdl-26007227

ABSTRACT

MOTIVATION: Information-theoretic and compositional analysis of biological sequences, in terms of k-mer dictionaries, has a well-established role in genomic and proteomic studies. This is much less the case in epigenomics, although the role of k-mers in chromatin organization and nucleosome positioning is particularly relevant. Fundamental questions concerning the informational content and compositional structure of nucleosome favoring and disfavoring sequences with respect to their basic building blocks still remain open. RESULTS: We present the first analysis of the role of k-mers in the composition of nucleosome enriched and depleted genomic regions (NER and NDR for short) that is: (i) exhaustive and within the bounds dictated by the information-theoretic content of the sample sets we use and (ii) informative for comparative epigenomics. We analyze four different organisms and we propose a paradigmatic formalization of k-mer dictionaries, providing two different and complementary views of the k-mers involved in NER and NDR. The first extends well-known studies in this area, its comparative nature being its major merit. The second, entirely novel, brings to light the rich variety of k-mers involved in influencing nucleosome positioning, for which an initial classification in terms of clusters is also provided. Although such a classification offers many insights, the following deserves to be singled out: short poly(dA:dT) tracts are reported in the literature as fundamental for nucleosome depletion; however, a global quantitative look reveals that their role is much less prominent than one would expect based on previous studies. AVAILABILITY AND IMPLEMENTATION: Dictionaries, clusters and Supplementary Material are available online at http://math.unipa.it/rombo/epigenomics/. CONTACT: simona.rombo@unipa.it SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Chromatin Assembly and Disassembly/genetics , Epigenomics , Nucleosomes/genetics , Sequence Analysis, DNA/methods , Animals , Genome , Humans
11.
Brief Bioinform ; 16(1): 118-36, 2015 Jan.
Article in English | MEDLINE | ID: mdl-24300112

ABSTRACT

We present here a compact overview of the data, models and methods proposed for the analysis of biological networks based on the search for significant repetitions. In particular, we concentrate on three problems widely studied in the literature: 'network alignment', 'network querying' and 'network motif extraction'. We provide (i) details of the experimental techniques used to obtain the main types of interaction data, (ii) descriptions of the models and approaches introduced to solve such problems and (iii) pointers to both the available databases and software tools. The intent is to lay out a useful roadmap for identifying suitable strategies to analyse cellular data, possibly based on the joint use of different interaction data types or analysis techniques.


Subject(s)
Computational Biology/methods , Models, Theoretical , Software
12.
Bioinformatics ; 30(10): 1343-52, 2014 May 15.
Article in English | MEDLINE | ID: mdl-24458952

ABSTRACT

MOTIVATION: Protein-protein interaction (PPI) networks are powerful models to represent the pairwise protein interactions of an organism. Clustering PPI networks can be useful for isolating groups of interacting proteins that participate in the same biological processes or that together perform specific biological functions. Evolutionary orthologies can be inferred this way, as well as functions and properties of as yet uncharacterized proteins. RESULTS: We present an overview of the main state-of-the-art clustering methods that have been applied to PPI networks over the past decade. We distinguish five specific categories of approaches, describe and compare their main features and then focus on one of them, i.e. population-based stochastic search. We provide an experimental evaluation, based on some validation measures widely used in the literature, of techniques in this class, which are as yet less explored than the others. In particular, we study how the capability of Genetic Algorithms (GAs) to extract clusters in PPI networks varies when different topology-based fitness functions are used, and we compare GAs with the main techniques in the other categories. The experimental campaign shows that predictions returned by GAs are often more accurate than those produced by the competing methods. Interesting issues still remain open about possible generalizations of GAs allowing for cluster overlapping. AVAILABILITY AND IMPLEMENTATION: We point out which methods and tools described here are publicly available. CONTACT: simona.rombo@math.unipa.it SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Protein Interaction Mapping/methods , Proteins/metabolism , Animals , Cluster Analysis , Humans , Proteins/genetics
13.
Brief Bioinform ; 15(3): 390-406, 2014 May.
Article in English | MEDLINE | ID: mdl-24347576

ABSTRACT

High-throughput sequencing technologies produce large collections of data, mainly DNA sequences with additional information, requiring the design of efficient and effective methodologies for both their compression and storage. In this context, we first provide a classification of the main techniques that have been proposed, according to three specific research directions that have emerged from the literature and, for each, we provide an overview of the current techniques. Finally, to make this review useful to researchers and technicians applying the existing software and tools, we include a synopsis of the main characteristics of the described approaches, including details on their implementation and availability. Performance of the various methods is also highlighted, although the state of the art does not lend itself to a consistent and coherent comparison among all the methods presented here.


Subject(s)
Computational Biology/methods , Data Compression/methods , High-Throughput Nucleotide Sequencing/methods , Algorithms , Data Compression/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Metagenomics/statistics & numerical data , Sequence Alignment , Software
14.
Article in English | MEDLINE | ID: mdl-22201069

ABSTRACT

Several approaches have been presented in the literature to cluster Protein-Protein Interaction (PPI) networks. They can be grouped into two main categories: those allowing a protein to participate in different clusters and those generating only nonoverlapping clusters. In both cases, a challenging task is to find a suitable compromise between the biological relevance of the results and a comprehensive coverage of the analyzed networks. Indeed, methods returning highly accurate results are often able to cover only small parts of the input PPI network, especially when poorly characterized networks are considered. We present a coclustering-based technique able to generate both overlapping and nonoverlapping clusters. The density of the clusters to search for can also be set by the user. We tested our method on the yeast and human networks, and compared it to five other well-known techniques on the same interaction data sets. The results showed that, for all the examples considered, our approach always reaches a good compromise between accuracy and network coverage. Furthermore, the behavior of our algorithm is not influenced by the structure of the input network, unlike all the techniques considered in the comparison, which returned very good results on the yeast network but rather poor ones on the human network.


Subject(s)
Cluster Analysis , Protein Interaction Maps , Proteins/chemistry , Proteins/metabolism , Algorithms , Humans , Protein Interaction Mapping/methods
15.
BMC Proc ; 5 Suppl 2: S1, 2011 May 28.
Article in English | MEDLINE | ID: mdl-21554757

ABSTRACT

BACKGROUND: Plants have played a special role in inositol polyphosphate (IP) research since the first IP, the fully phosphorylated inositol ring of phytic acid (IP6), was discovered in plant seeds. It is now known that phytic acid is further metabolized by the IP6 Kinases (IP6Ks) to generate IPs containing a pyrophosphate moiety. The IP6Ks are evolutionarily conserved enzymes identified in several mammalian, fungal and amoeba species. Although IP6K has not yet been identified in plant chromosomes, there are many clues suggesting its presence in plant cells. RESULTS: In this paper we propose a new approach to search for the plant IP6K gene, which led to the identification in the plant genome of a nucleotide sequence corresponding to a specific tag of the IP6K family. Such a tag has been found in all IP6K genes identified up to now, as well as in all genes belonging to the Inositol Polyphosphate Kinases superfamily (IPK). The tag sequence corresponds to the inositol-binding site of the enzyme, and it can be considered as characterizing all IPK genes. To this aim we applied a technique based on motif discovery. We exploited DLSME, a recently proposed software tool, which allows for the motif structure to be only partially specified by the user. First we applied the new method to the mitochondrial DNA (mtDNA) of plants, where such a gene could have been nested, possibly encrypted and hidden by virtue of the editing and/or trans-splicing processes. Then we looked for the gene in the nuclear genome of two model plants, Arabidopsis thaliana and Oryza sativa. CONCLUSIONS: The analysis we conducted in plant mitochondria provided the negative, though we argue relevant, result that IP6K does not actually occur in plant mtDNA. Very interestingly, the tag search in nuclear genomes led us to identify a promising sequence in chromosome 5 of Oryza sativa. Further analyses are in progress to confirm that this sequence actually corresponds to the mammalian IP6K gene.

16.
Article in English | MEDLINE | ID: mdl-21321368

ABSTRACT

Comparing and querying the protein-protein interaction (PPI) networks of different organisms is important to infer knowledge about conservation across species. Known methods that perform these tasks operate symmetrically, i.e., they do not assign a distinct role to the input PPI networks. However, in most cases, the input networks are indeed distinguishable on the basis of how well the corresponding organism is biologically characterized. In this paper a new idea is developed, that is, to exploit differences in the characterization of the organisms at hand in order to devise methods for comparing their PPI networks. We use the PPI network (called Master) of the best characterized organism as a fingerprint to guide the alignment process to the second input network (called Slave), so that generated results preferably retain the structural characteristics of the Master network. Technically, this is obtained by generating from the Master a finite automaton, called alignment model, which is then fed with (a linearization of) the Slave for the purpose of extracting, via the Viterbi algorithm, matching subgraphs. We propose an approach able to perform global alignment and network querying, and we apply it to PPI networks. We tested our method, showing that the results it returns are biologically relevant.


Subject(s)
Computational Biology/methods , Models, Biological , Protein Interaction Domains and Motifs , Protein Interaction Mapping , Algorithms , Sequence Alignment , Sequence Analysis, Protein
17.
Int J Data Min Bioinform ; 3(4): 431-53, 2009.
Article in English | MEDLINE | ID: mdl-20052906

ABSTRACT

We describe a method to search for similarities across protein-protein interaction networks of different organisms. The core of the technique consists in computing a maximum weight matching of bipartite graphs resulting from comparing the neighbourhoods of proteins belonging to different networks. Both quantitative and reliability information are exploited. We tested the method on the networks of S. cerevisiae, D. melanogaster and C. elegans. The experiments showed that the technique is able to detect functional orthologs when sequence similarity alone is not sufficient. They also demonstrated the capability of our approach to discover common biological processes involving uncharacterised proteins.
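
As a reminder of what the maximum weight matching mentioned in the abstract above computes, the sketch below solves a toy instance by exhaustive search in Python. It is an illustration under stated assumptions, not the authors' implementation: the function name `max_weight_matching`, the similarity matrix, and the convention that weights are non-negative with 0 meaning "no edge" are all hypothetical, and brute force is only feasible for very small neighbourhoods (practical tools would use a polynomial-time method such as the Hungarian algorithm).

```python
from itertools import permutations


def max_weight_matching(weights):
    """Exhaustively find a maximum-weight matching in a small bipartite graph.

    weights[i][j] is an assumed non-negative similarity between left node i
    and right node j (0 meaning "no edge"). Requires rows <= columns.
    Returns (best_total_weight, list of (left, right) pairs).
    """
    n_left = len(weights)
    n_right = len(weights[0]) if weights else 0
    if n_left > n_right:
        raise ValueError("transpose the matrix so that rows <= columns")
    best_score, best_pairs = float("-inf"), []
    # Try every injective assignment of left nodes to right nodes.
    for cols in permutations(range(n_right), n_left):
        score = sum(weights[i][j] for i, j in enumerate(cols))
        if score > best_score:
            best_score, best_pairs = score, list(enumerate(cols))
    return best_score, best_pairs


if __name__ == "__main__":
    # Hypothetical similarities between two proteins' neighbours in network A
    # (rows) and three neighbours in network B (columns).
    similarity = [[9, 1, 3],
                  [8, 7, 2]]
    print(max_weight_matching(similarity))
```

Note that a greedy choice of the single heaviest edge first is not always optimal, which is why the matching is computed globally over all pairs.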


Subject(s)
Computational Biology/methods , Protein Interaction Mapping/methods , Proteins/chemistry , Proteins/metabolism , Animals , Caenorhabditis elegans/metabolism , Drosophila/metabolism , Saccharomyces cerevisiae/metabolism