
1.

Large-Scale Analyses of Human Microbiomes Reveal Thousands of Small, Novel Genes.

Sberro, Hila; Fremin, Brayon J; Zlitni, Soumaya; Edfors, Fredrik; Greenfield, Nicholas; Snyder, Michael P; Pavlopoulos, Georgios A; Kyrpides, Nikos C; Bhatt, Ami S.

Cell ; 178(5): 1245-1259.e14, 2019 08 22.

Article in English | MEDLINE | ID: mdl-31402174

ABSTRACT

Small proteins are traditionally overlooked due to computational and experimental difficulties in detecting them. To systematically identify small proteins, we carried out a comparative genomics study on 1,773 human-associated metagenomes from four different body sites. We describe >4,000 conserved protein families, the majority of which are novel; ~30% of these protein families are predicted to be secreted or transmembrane. Over 90% of the small protein families have no known domain and almost half are not represented in reference genomes. We identify putative housekeeping, mammalian-specific, defense-related, and protein families that are likely to be horizontally transferred. We provide evidence of transcription and translation for a subset of these families. Our study suggests that small proteins are highly abundant and those of the human microbiome, in particular, may perform diverse functions that have not been previously reported.

Subject(s)

Microbiota , Proteins/metabolism , Amino Acid Sequence , Cell Communication , Host-Pathogen Interactions , Humans , Metagenome , Open Reading Frames/genetics , Proteins/chemistry , Ribosomal Proteins/chemistry , Ribosomal Proteins/metabolism , Sequence Alignment

2.

Unraveling the functional dark matter through global metagenomics.

Pavlopoulos, Georgios A; Baltoumas, Fotis A; Liu, Sirui; Selvitopi, Oguz; Camargo, Antonio Pedro; Nayfach, Stephen; Azad, Ariful; Roux, Simon; Call, Lee; Ivanova, Natalia N; Chen, I Min; Paez-Espino, David; Karatzas, Evangelos; Iliopoulos, Ioannis; Konstantinidis, Konstantinos; Tiedje, James M; Pett-Ridge, Jennifer; Baker, David; Visel, Axel; Ouzounis, Christos A; Ovchinnikov, Sergey; Buluç, Aydin; Kyrpides, Nikos C.

Nature ; 622(7983): 594-602, 2023 Oct.

Article in English | MEDLINE | ID: mdl-37821698

ABSTRACT

Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities [1,2]. Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes. Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database [3]. Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter.

Subject(s)

Metagenome , Metagenomics , Microbiology , Proteins , Cluster Analysis , Metagenome/genetics , Metagenomics/methods , Proteins/chemistry , Proteins/classification , Proteins/genetics , Databases, Protein , Protein Conformation

3.

NMPFamsDB: a database of novel protein families from microbial metagenomes and metatranscriptomes.

Baltoumas, Fotis A; Karatzas, Evangelos; Liu, Sirui; Ovchinnikov, Sergey; Sofianatos, Yorgos; Chen, I-Min; Kyrpides, Nikos C; Pavlopoulos, Georgios A.

Nucleic Acids Res ; 52(D1): D502-D512, 2024 Jan 05.

Article in English | MEDLINE | ID: mdl-37811892

ABSTRACT

The Novel Metagenome Protein Families Database (NMPFamsDB) is a database of metagenome- and metatranscriptome-derived protein families, whose members have no hits to proteins of reference genomes or Pfam domains. Each protein family is accompanied by multiple sequence alignments, Hidden Markov Models, taxonomic information, ecosystem and geolocation metadata, sequence and structure predictions, as well as 3D structure models predicted with AlphaFold2. In its current version, NMPFamsDB hosts over 100 000 protein families, each with at least 100 members. The reported protein families significantly expand (more than double) the number of known protein sequence clusters from reference genomes and reveal new insights into their habitat distribution, origins, functions and taxonomy. We expect NMPFamsDB to be a valuable resource for microbial proteome-wide analyses and for further discovery and characterization of novel functions. NMPFamsDB is publicly available in http://www.nmpfamsdb.org/ or https://bib.fleming.gr/NMPFamsDB.

Subject(s)

Databases, Protein , Metagenome , Proteins , Amino Acid Sequence , Databases, Factual , Ecosystem , Proteins/chemistry , Geography

4.

Flame (v2.0): advanced integration and interpretation of functional enrichment results from multiple sources.

Karatzas, Evangelos; Baltoumas, Fotis A; Aplakidou, Eleni; Kontou, Panagiota I; Stathopoulos, Panos; Stefanis, Leonidas; Bagos, Pantelis G; Pavlopoulos, Georgios A.

Bioinformatics ; 39(8), 2023 08 01.

Article in English | MEDLINE | ID: mdl-37540207

ABSTRACT

Functional enrichment is the process of identifying implicated functional terms from a given input list of genes or proteins. In this article, we present Flame (v2.0), a web tool which offers a combinatorial approach through merging and visualizing results from widely used functional enrichment applications while also allowing various flexible input options. In this version, Flame utilizes the aGOtool, g:Profiler, WebGestalt, and Enrichr pipelines and presents their outputs separately or in combination following a visual analytics approach. For intuitive representations and easier interpretation, it uses interactive plots such as parameterizable networks, heatmaps, barcharts, and scatter plots. Users can also: (i) handle multiple protein/gene lists and analyse union and intersection sets simultaneously through interactive UpSet plots, (ii) automatically extract genes and proteins from free text through text-mining and Named Entity Recognition (NER) techniques, (iii) upload single nucleotide polymorphisms (SNPs) and extract their relative genes, or (iv) analyse multiple lists of differentially expressed proteins/genes after selecting them interactively from a parameterizable volcano plot. Compared to the previous version of 197 supported organisms, Flame (v2.0) currently allows enrichment for 14 436 organisms. AVAILABILITY AND IMPLEMENTATION: Web Application: http://flame.pavlopouloslab.info. Code: https://github.com/PavlopoulosLab/Flame. Docker: https://hub.docker.com/r/pavlopouloslab/flame.

Subject(s)

Proteins , Software , Data Mining

5.

IL-6 Signaling Attenuates TNF-α Production by Plasmacytoid Dendritic Cells in Rheumatoid Arthritis.

Papadaki, Garyfalia; Goutakoli, Panagiota; Tiniakou, Ioanna; Grün, Joachim R; Grützkau, Andreas; Pavlopoulos, Georgios A; Iliopoulos, Ioannis; Bertsias, George; Boumpas, Dimitrios; Ospelt, Caroline; Reizis, Boris; Sidiropoulos, Prodromos; Verginis, Panayotis.

J Immunol ; 209(10): 1906-1917, 2022 11 15.

Article in English | MEDLINE | ID: mdl-36426957

ABSTRACT

Rheumatoid arthritis (RA) is characterized by autoimmune joint destruction with debilitating consequences. Despite treatment advancements with biologic therapies, a significant proportion of RA patients show an inadequate clinical response, and restoration of immune self-tolerance represents an unmet therapeutic need. We have previously described a tolerogenic phenotype of plasmacytoid dendritic cells (pDCs) in RA patients responding to anti-TNF-α agents. However, the molecular mechanisms involved in tolerogenic reprogramming of pDCs in RA remain elusive. In this study, guided by transcriptomic analysis of CD303+CD123+ pDCs from RA patients in remission, we revealed enhanced expression of IL-6R and its downstream signaling compared with healthy pDCs. Functional assessment demonstrated that IL-6R engagement resulted in marked reduction of TNF-α secretion by pDCs whereas intracellular TNF-α was significantly increased. Accordingly, pharmacologic inhibition of IL-6R signaling restored TNF-α secretion levels by pDCs. Mechanistic analysis demonstrated impaired activity and decreased lysosomal degradation of ADAM17 (a disintegrin and metalloproteinase 17) sheddase in pDCs, which is essential for TNF-α cleavage. Importantly, reduction of TNF-α secretion by IL-6-treated pDCs attenuated the inflammatory potential of RA patient-derived synovial fibroblasts. Collectively, these findings position pDCs as an important source of TNF-α in RA pathogenesis and unravel an anti-inflammatory mechanism of IL-6 by limiting the pDC-derived TNF-α secretion.

Subject(s)

Arthritis, Rheumatoid , Interleukin-6 , Humans , Tumor Necrosis Factor Inhibitors , Dendritic Cells , Signal Transduction , Tumor Necrosis Factor-alpha

6.

Arena3Dweb: interactive 3D visualization of multilayered networks.

Karatzas, Evangelos; Baltoumas, Fotis A; Panayiotou, Nikolaos A; Schneider, Reinhard; Pavlopoulos, Georgios A.

Nucleic Acids Res ; 49(W1): W36-W45, 2021 07 02.

Article in English | MEDLINE | ID: mdl-33885790

ABSTRACT

Efficient integration and visualization of heterogeneous biomedical information in a single view is a key challenge. In this study, we present Arena3Dweb, the first, fully interactive and dependency-free, web application which allows the visualization of multilayered graphs in 3D space. With Arena3Dweb, users can integrate multiple networks in a single view along with their intra- and inter-layer connections. For clearer and more informative views, users can choose between a plethora of layout algorithms and apply them on a set of selected layers either individually or in combination. Users can align networks and highlight node topological features, whereas each layer as well as the whole scene can be translated, rotated and scaled in 3D space. User-selected edge colors can be used to highlight important paths, while node positioning, coloring and resizing can be adjusted on-the-fly. In its current version, Arena3Dweb supports weighted and unweighted undirected graphs and is written in R, Shiny and JavaScript. We demonstrate the functionality of Arena3Dweb using two different use-case scenarios; one regarding drug repurposing for SARS-CoV-2 and one related to GPCR signaling pathways implicated in melanoma. Arena3Dweb is available at http://bib.fleming.gr:3838/Arena3D or http://bib.fleming.gr/Arena3D.

Subject(s)

Algorithms , Data Visualization , Internet , Protein Interaction Maps , Software , COVID-19/metabolism , Color , Drug Repositioning , Humans , Melanoma/drug therapy , Melanoma/metabolism , Programming Languages , Receptors, Endothelin/metabolism , SARS-CoV-2/metabolism , Signal Transduction , COVID-19 Drug Treatment

7.

ProteoSign v2: a faster and evolved user-friendly online tool for statistical analyses of differential proteomics.

Theodorakis, Evangelos; Antonakis, Andreas N; Baltsavia, Ismini; Pavlopoulos, Georgios A; Samiotaki, Martina; Amoutzias, Grigoris D; Theodosiou, Theodosios; Acuto, Oreste; Efstathiou, Georgios; Iliopoulos, Ioannis.

Nucleic Acids Res ; 49(W1): W573-W577, 2021 07 02.

Article in English | MEDLINE | ID: mdl-33963869

ABSTRACT

Bottom-up proteomics analyses have been proved over the last years to be a powerful tool in the characterization of the proteome and are crucial for understanding cellular and organism behaviour. Through differential proteomic analysis researchers can shed light on groups of proteins or individual proteins that play key roles in certain, normal or pathological conditions. However, several tools for the analysis of such complex datasets are powerful, but hard-to-use with steep learning curves. In addition, some other tools are easy to use, but are weak in terms of analytical power. Previously, we have introduced ProteoSign, a powerful, yet user-friendly open-source online platform for protein differential expression/abundance analysis designed with the end-proteomics user in mind. Part of ProteoSign's power stems from the utilization of the well-established Linear Models For Microarray Data (LIMMA) methodology. Here, we present a substantial upgrade of this computational resource, called ProteoSign v2, where we introduce major improvements, also based on user feedback. The new version offers more plot options, supports additional experimental designs, analyzes updated input datasets and performs a gene enrichment analysis of the differentially expressed proteins. We also introduce the deployment of the Docker technology and significantly increase the speed of a full analysis. ProteoSign v2 is available at http://bioinformatics.med.uoc.gr/ProteoSign.

Subject(s)

Proteomics/methods , Software , Data Interpretation, Statistical , Internet , Mass Spectrometry , Proteins/genetics , Proteins/metabolism

8.

Drug genetic associations with COVID-19 manifestations: a data mining and network biology approach.

Charitou, Theodosia; Kontou, Panagiota I; Tamposis, Ioannis A; Pavlopoulos, Georgios A; Braliou, Georgia G; Bagos, Pantelis G.

Pharmacogenomics J ; 22(5-6): 294-302, 2022 12.

Article in English | MEDLINE | ID: mdl-36171417

ABSTRACT

Available drugs have been used as an urgent attempt through clinical trials to minimize severe cases of hospitalizations with Coronavirus disease (COVID-19), however, there are limited data on common pharmacogenomics affecting concomitant medications response in patients with comorbidities. To identify the genomic determinants that influence COVID-19 susceptibility, we use a computational, statistical, and network biology approach to analyze relationships of ineffective concomitant medication with an adverse effect on patients. We statistically construct a pharmacogenetic/biomarker network with significant drug-gene interactions originating from gene-disease associations. Investigation of the predicted pharmacogenes encompassing the gene-disease-gene pharmacogenomics (PGx) network suggests that these genes could play a significant role in COVID-19 clinical manifestation due to their association with autoimmune, metabolic, neurological, cardiovascular, and degenerative disorders, some of which have been reported to be crucial comorbidities in a COVID-19 patient.

Subject(s)

COVID-19 Drug Treatment , Humans , Data Mining , Pharmacogenetics , Genomics

9.

Uncovering Earth's virome.

Paez-Espino, David; Eloe-Fadrosh, Emiley A; Pavlopoulos, Georgios A; Thomas, Alex D; Huntemann, Marcel; Mikhailova, Natalia; Rubin, Edward; Ivanova, Natalia N; Kyrpides, Nikos C.

Nature ; 536(7617): 425-30, 2016 08 25.

Article in English | MEDLINE | ID: mdl-27533034

ABSTRACT

Viruses are the most abundant biological entities on Earth, but challenges in detecting, isolating, and classifying unknown viruses have prevented exhaustive surveys of the global virome. Here we analysed over 5 Tb of metagenomic sequence data from 3,042 geographically diverse samples to assess the global distribution, phylogenetic diversity, and host specificity of viruses. We discovered over 125,000 partial DNA viral genomes, including the largest phage yet identified, and increased the number of known viral genes by 16-fold. Half of the predicted partial viral genomes were clustered into genetically distinct groups, most of which included genes unrelated to those in known viruses. Using CRISPR spacers and transfer RNA matches to link viral groups to microbial host(s), we doubled the number of microbial phyla known to be infected by viruses, and identified viruses that can infect organisms from different phyla. Analysis of viral distribution across diverse ecosystems revealed strong habitat-type specificity for the vast majority of viruses, but also identified some cosmopolitan groups. Our results highlight an extensive global viral diversity and provide detailed insight into viral habitat distribution and host-virus interactions.

Subject(s)

Earth, Planet , Ecosystem , Genome, Viral/genetics , Metagenomics , Viruses/genetics , Animals , Aquatic Organisms/virology , Bacteriophages/genetics , Biodiversity , Clustered Regularly Interspaced Short Palindromic Repeats/genetics , DNA, Viral/analysis , DNA, Viral/genetics , Datasets as Topic , Genes, Viral , Host Specificity , Host-Pathogen Interactions , Humans , Metagenome/genetics , Phylogeny , Phylogeography , RNA, Transfer/genetics , Sequence Analysis , Viruses/classification , Viruses/isolation & purification

10.

Prediction and Ranking of Biomarkers Using multiple UniReD.

Baltsavia, Ismini; Theodosiou, Theodosios; Papanikolaou, Nikolas; Pavlopoulos, Georgios A; Amoutzias, Grigorios D; Panagopoulou, Maria; Chatzaki, Ekaterini; Andreakos, Evangelos; Iliopoulos, Ioannis.

Int J Mol Sci ; 23(19), 2022 Sep 21.

Article in English | MEDLINE | ID: mdl-36232413

ABSTRACT

Protein-protein interactions (PPIs) are of key importance for understanding how cells and organisms function. Thus, in recent decades, many approaches have been developed for the identification and discovery of such interactions. These approaches addressed the problem of PPI identification either by an experimental point of view or by a computational one. Here, we present an updated version of UniReD, a computational prediction tool which takes advantage of biomedical literature aiming to extract documented, already published protein associations and predict undocumented ones. The usefulness of this computational tool has been previously evaluated by experimentally validating predicted interactions and by benchmarking it against public databases of experimentally validated PPIs. In its updated form, UniReD allows the user to provide a list of proteins of known implication in, e.g., a particular disease, as well as another list of proteins that are potentially associated with the proteins of the first list. UniReD then automatically analyzes both lists and ranks the proteins of the second list by their association with the proteins of the first list, thus serving as a potential biomarker discovery/validation tool.

Subject(s)

Protein Interaction Mapping , Proteins , Biomarkers , Computational Biology , Proteins/metabolism

11.

Distinct transcriptional profile of blood mononuclear cells in Behçet's disease: insights into the central role of neutrophil chemotaxis.

Verrou, Kleio-Maria; Vlachogiannis, Nikolaos I; Ampatziadis-Michailidis, Giannis; Moulos, Panagiotis; Pavlopoulos, Georgios A; Hatzis, Pantelis; Kollias, George; Sfikakis, Petros P.

Rheumatology (Oxford) ; 60(10): 4910-4919, 2021 10 02.

Article in English | MEDLINE | ID: mdl-33493315

ABSTRACT

OBJECTIVES: Both innate and adaptive immune responses are reportedly increased in Behçet's disease (BD), a chronic, relapsing systemic vasculitis lying at the intersection between autoinflammation and autoimmunity. To further study pathophysiologic molecular mechanisms operating in BD, we searched for transcriptome-wide changes in blood mononuclear cells from these patients. METHODS: We performed 3' mRNA next-generation sequencing-based genome-wide transcriptional profiling followed by analysis of differential expression signatures, Kyoto Encyclopedia of Genes and Genomes pathways, GO biological processes and transcription factor signatures. RESULTS: Differential expression analysis clustered the transcriptomes of 13 patients and one healthy subject separately from those of 10 healthy age/gender-matched controls and one patient. Among the total of 17 591 expressed protein-coding genes, 209 and 31 genes were significantly upregulated and downregulated, respectively, in BD vs controls by at least 2-fold. The most upregulated genes comprised an abundance of CC- and CXC-chemokines. Remarkably, the 5 out of top 10 upregulated biological processes involved leucocyte recruitment to peripheral tissues, especially for neutrophils. Moreover, NF-kB, TNF and IL-1 signalling pathways were prominently enhanced in BD, while transcription factor activity analysis suggested that the NF-kB p65/RELA subunit action underlies the observed differences in the BD transcriptome. CONCLUSION: This RNA-sequencing analysis in peripheral blood mononuclear cells derived from patients with BD does not support a major pathogenetic role for adaptive immunity-driven mechanisms, but clearly points to the action of aberrant innate immune responses with a central role played by upregulated neutrophil chemotaxis.

Subject(s)

Behcet Syndrome/immunology , Chemotaxis, Leukocyte , Leukocytes, Mononuclear/pathology , Neutrophils/pathology , Adult , Behcet Syndrome/pathology , Case-Control Studies , Female , Gene Expression Regulation , High-Throughput Nucleotide Sequencing , Humans , Leukocytes, Mononuclear/metabolism , Male , Middle Aged , Neutrophils/metabolism , Transcription Factors/metabolism , Transcriptome

12.

HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks.

Azad, Ariful; Pavlopoulos, Georgios A; Ouzounis, Christos A; Kyrpides, Nikos C; Buluç, Aydin.

Nucleic Acids Res ; 46(6): e33, 2018 04 06.

Article in English | MEDLINE | ID: mdl-29315405

ABSTRACT

Biological networks capture structural or functional properties of relevant entities such as molecules, proteins or genes. Characteristic examples are gene expression networks or protein-protein interaction networks, which hold information about functional affinities or structural similarities. Such networks have been expanding in size due to increasing scale and abundance of biological data. While various clustering algorithms have been proposed to find highly connected regions, Markov Clustering (MCL) has been one of the most successful approaches to cluster sequence similarity or expression networks. Despite its popularity, MCL's scalability to cluster large datasets still remains a bottleneck due to high running times and memory demands. Here, we present High-performance MCL (HipMCL), a parallel implementation of the original MCL algorithm that can run on distributed-memory computers. We show that HipMCL can efficiently utilize 2000 compute nodes and cluster a network of ~70 million nodes with ~68 billion edges in ~2.4 h. By exploiting distributed-memory environments, HipMCL clusters large-scale networks several orders of magnitude faster than MCL and enables clustering of even bigger networks. HipMCL is based on MPI and OpenMP and is freely available under a modified BSD license.

Subject(s)

Algorithms , Cluster Analysis , Computational Biology/methods , Gene Regulatory Networks , Markov Chains , Gene Expression , Protein Interaction Maps/genetics

13.

ProteoSign: an end-user online differential proteomics statistical analysis platform.

Efstathiou, Georgios; Antonakis, Andreas N; Pavlopoulos, Georgios A; Theodosiou, Theodosios; Divanach, Peter; Trudgian, David C; Thomas, Benjamin; Papanikolaou, Nikolas; Aivaliotis, Michalis; Acuto, Oreste; Iliopoulos, Ioannis.

Nucleic Acids Res ; 45(W1): W300-W306, 2017 07 03.

Article in English | MEDLINE | ID: mdl-28520987

ABSTRACT

Profiling of proteome dynamics is crucial for understanding cellular behavior in response to intrinsic and extrinsic stimuli and maintenance of homeostasis. Over the last 20 years, mass spectrometry (MS) has emerged as the most powerful tool for large-scale identification and characterization of proteins. Bottom-up proteomics, the most common MS-based proteomics approach, has always been challenging in terms of data management, processing, analysis and visualization, with modern instruments capable of producing several gigabytes of data out of a single experiment. Here, we present ProteoSign, a freely available web application, dedicated in allowing users to perform proteomics differential expression/abundance analysis in a user-friendly and self-explanatory way. Although several non-commercial standalone tools have been developed for post-quantification statistical analysis of proteomics data, most of them are not end-user appealing as they often require very stringent installation of programming environments, third-party software packages and sometimes further scripting or computer programming. To avoid this bottleneck, we have developed a user-friendly software platform accessible via a web interface in order to enable proteomics laboratories and core facilities to statistically analyse quantitative proteomics data sets in a resource-efficient manner. ProteoSign is available at http://bioinformatics.med.uoc.gr/ProteoSign and the source code at https://github.com/yorgodillo/ProteoSign.

Subject(s)

Proteomics/methods , Software , Data Interpretation, Statistical , Internet , Mass Spectrometry

14.

IMG/VR: a database of cultured and uncultured DNA Viruses and retroviruses.

Paez-Espino, David; Chen, I-Min A; Palaniappan, Krishna; Ratner, Anna; Chu, Ken; Szeto, Ernest; Pillay, Manoj; Huang, Jinghua; Markowitz, Victor M; Nielsen, Torben; Huntemann, Marcel; K Reddy, T B; Pavlopoulos, Georgios A; Sullivan, Matthew B; Campbell, Barbara J; Chen, Feng; McMahon, Katherine; Hallam, Steve J; Denef, Vincent; Cavicchioli, Ricardo; Caffrey, Sean M; Streit, Wolfgang R; Webster, John; Handley, Kim M; Salekdeh, Ghasem H; Tsesmetzis, Nicolas; Setubal, Joao C; Pope, Phillip B; Liu, Wen-Tso; Rivers, Adam R; Ivanova, Natalia N; Kyrpides, Nikos C.

Nucleic Acids Res ; 45(D1): D457-D465, 2017 01 04.

Article in English | MEDLINE | ID: mdl-27799466

ABSTRACT

Viruses represent the most abundant life forms on the planet. Recent experimental and computational improvements have led to a dramatic increase in the number of viral genome sequences identified primarily from metagenomic samples. As a result of the expanding catalog of metagenomic viral sequences, there exists a need for a comprehensive computational platform integrating all these sequences with associated metadata and analytical tools. Here we present IMG/VR (https://img.jgi.doe.gov/vr/), the largest publicly available database of 3908 isolate reference DNA viruses with 264 413 computationally identified viral contigs from >6000 ecologically diverse metagenomic samples. Approximately half of the viral contigs are grouped into genetically distinct quasi-species clusters. Microbial hosts are predicted for 20 000 viral sequences, revealing nine microbial phyla previously unreported to be infected by viruses. Viral sequences can be queried using a variety of associated metadata, including habitat type and geographic location of the samples, or taxonomic classification according to hallmark viral genes. IMG/VR has a user-friendly interface that allows users to interrogate all integrated data and interact by comparing with external sequences, thus serving as an essential resource in the viral genomics community.

Subject(s)

DNA Viruses/genetics , Databases, Genetic , Genome, Viral , Genomics/methods , Metagenomics/methods , Retroviridae/genetics , Software , Environmental Microbiology , Host-Pathogen Interactions , Metagenome , Sequence Analysis, DNA

15.

DrugQuest - a text mining workflow for drug association discovery.

Papanikolaou, Nikolas; Pavlopoulos, Georgios A; Theodosiou, Theodosios; Vizirianakis, Ioannis S; Iliopoulos, Ioannis.

BMC Bioinformatics ; 17 Suppl 5: 182, 2016 Jun 06.

Article in English | MEDLINE | ID: mdl-27295093

ABSTRACT

BACKGROUND: Text mining and data integration methods are gaining ground in the field of health sciences due to the exponential growth of bio-medical literature and information stored in biological databases. While such methods mostly try to extract bioentity associations from PubMed, very few of them are dedicated in mining other types of repositories such as chemical databases. RESULTS: Herein, we apply a text mining approach on the DrugBank database in order to explore drug associations based on the DrugBank "Description", "Indication", "Pharmacodynamics" and "Mechanism of Action" text fields. We apply Named Entity Recognition (NER) techniques on these fields to identify chemicals, proteins, genes, pathways, diseases, and we utilize the TextQuest algorithm to find additional biologically significant words. Using a plethora of similarity and partitional clustering techniques, we group the DrugBank records based on their common terms and investigate possible scenarios why these records are clustered together. Different views such as clustered chemicals based on their textual information, tag clouds consisting of Significant Terms along with the terms that were used for clustering are delivered to the user through a user-friendly web interface. CONCLUSIONS: DrugQuest is a text mining tool for knowledge discovery: it is designed to cluster DrugBank records based on text attributes in order to find new associations between drugs. The service is freely available at http://bioinformatics.med.uoc.gr/drugquest.

Subject(s)

Drug Discovery , User-Computer Interface , Algorithms , Cluster Analysis , Databases, Factual , Humans , Internet , Pharmaceutical Preparations/chemistry , Pharmaceutical Preparations/metabolism

16.

Protein-protein interaction predictions using text mining methods.

Papanikolaou, Nikolas; Pavlopoulos, Georgios A; Theodosiou, Theodosios; Iliopoulos, Ioannis.

Methods ; 74: 47-53, 2015 Mar.

Article in English | MEDLINE | ID: mdl-25448298

ABSTRACT

It is beyond any doubt that proteins and their interactions play an essential role in most complex biological processes. The understanding of their function individually, but also in the form of protein complexes is of a great importance. Nowadays, despite the plethora of various high-throughput experimental approaches for detecting protein-protein interactions, many computational methods aiming to predict new interactions have appeared and gained interest. In this review, we focus on text-mining based computational methodologies, aiming to extract information for proteins and their interactions from public repositories such as literature and various biological databases. We discuss their strengths, their weaknesses and how they complement existing experimental techniques by simultaneously commenting on the biological databases which hold such information and the benchmark datasets that can be used for evaluating new tools.

Subject(s)

Data Mining/methods , Databases, Protein , Protein Interaction Mapping/methods , Animals , Data Mining/trends , Databases, Protein/trends , Forecasting , Humans , Protein Interaction Mapping/trends

17.

BioTextQuest(+): a knowledge integration platform for literature mining and concept discovery.

Papanikolaou, Nikolas; Pavlopoulos, Georgios A; Pafilis, Evangelos; Theodosiou, Theodosios; Schneider, Reinhard; Satagopam, Venkata P; Ouzounis, Christos A; Eliopoulos, Aristides G; Promponas, Vasilis J; Iliopoulos, Ioannis.

Bioinformatics ; 30(22): 3249-56, 2014 Nov 15.

Article in English | MEDLINE | ID: mdl-25100685

ABSTRACT

SUMMARY: The iterative process of finding relevant information in biomedical literature and performing bioinformatics analyses might result in an endless loop for an inexperienced user, considering the exponential growth of scientific corpora and the plethora of tools designed to mine PubMed® and related biological databases. Herein, we describe BioTextQuest(+), a web-based interactive knowledge exploration platform with significant advances to its predecessor (BioTextQuest), aiming to bridge processes such as bioentity recognition, functional annotation, document clustering and data integration towards literature mining and concept discovery. BioTextQuest(+) enables PubMed and OMIM querying, retrieval of abstracts related to a targeted request and optimal detection of genes, proteins, molecular functions, pathways and biological processes within the retrieved documents. The front-end interface facilitates the browsing of document clustering per subject, the analysis of term co-occurrence, the generation of tag clouds containing highly represented terms per cluster and at-a-glance popup windows with information about relevant genes and proteins. Moreover, to support experimental research, BioTextQuest(+) addresses integration of its primary functionality with biological repositories and software tools able to deliver further bioinformatics services. The Google-like interface extends beyond simple use by offering a range of advanced parameterization for expert users. We demonstrate the functionality of BioTextQuest(+) through several exemplary research scenarios including author disambiguation, functional term enrichment, knowledge acquisition and concept discovery linking major human diseases, such as obesity and ageing.
AVAILABILITY: The service is accessible at http://bioinformatics.med.uoc.gr/biotextquest.
CONTACT: g.pavlopoulos@gmail.com or georgios.pavlopoulos@esat.kuleuven.be
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Data Mining/methods , Software , Authorship , Cluster Analysis , Disease/genetics , Genes , Humans , Internet , Medical Subject Headings , Proteins , PubMed , Publications

18.

Meander: visually exploring the structural variome using space-filling curves.

Pavlopoulos, Georgios A; Kumar, Parveen; Sifrim, Alejandro; Sakai, Ryo; Lin, Meng Lay; Voet, Thierry; Moreau, Yves; Aerts, Jan.

Nucleic Acids Res ; 41(11): e118, 2013 Jun.

Article in English | MEDLINE | ID: mdl-23605045

ABSTRACT

The introduction of next generation sequencing methods in genome studies has made it possible to shift research from a gene-centric approach to a genome wide view. Although methods and tools to detect single nucleotide polymorphisms are becoming more mature, methods to identify and visualize structural variation (SV) are still in their infancy. Most genome browsers can only compare a given sequence to a reference genome; therefore, direct comparison of multiple individuals still remains a challenge. Therefore, the implementation of efficient approaches to explore and visualize SVs and directly compare two or more individuals is desirable. In this article, we present a visualization approach that uses space-filling Hilbert curves to explore SVs based on both read-depth and pair-end information. An interactive open-source Java application, called Meander, implements the proposed methodology, and its functionality is demonstrated using two cases. With Meander, users can explore variations at different levels of resolution and simultaneously compare up to four different individuals against a common reference. The application was developed using Java version 1.6 and Processing.org and can be run on any platform. It can be found at http://homes.esat.kuleuven.be/~bioiuser/meander.

Subject(s)

Genomic Structural Variation , Software , Arabidopsis/genetics , Breast Neoplasms/genetics , Chromosomes , Female , Genomics , Humans
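
The Meander abstract above describes laying genomic data out along a space-filling Hilbert curve. As a rough illustration of that general idea only, the sketch below maps a one-dimensional genomic position to a cell on a square grid using the textbook Hilbert index-to-coordinate conversion. It is not taken from Meander's source code; the function names `hilbert_d2xy` and `position_to_cell`, and the simple equal-width binning of positions, are assumptions made for this example.

```python
from typing import Tuple


def hilbert_d2xy(order: int, d: int) -> Tuple[int, int]:
    """Convert index d along a Hilbert curve into (x, y) grid coordinates.

    The grid has 2**order cells per side, so d must lie in [0, 4**order).
    """
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate or flip the current quadrant so sub-curves join end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def position_to_cell(pos: int, genome_length: int, order: int) -> Tuple[int, int]:
    """Bin a 0-based genomic position into a Hilbert-curve grid cell.

    Hypothetical helper: positions are split into 4**order equal-width bins.
    """
    cells = (1 << order) ** 2
    d = pos * cells // genome_length
    return hilbert_d2xy(order, d)


if __name__ == "__main__":
    # Walk the 8 x 8 curve; neighbouring indices land in neighbouring cells.
    for d in range(8):
        print(d, hilbert_d2xy(3, d))
```

The property that makes this layout attractive for read-depth views is locality: positions that are close along the sequence stay close on the grid, so a contiguous region appears as a compact patch rather than a thin line.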

19.

A survey of k-mer methods and applications in bioinformatics.

Moeckel, Camille; Mareboina, Manvita; Konnaris, Maxwell A; Chan, Candace S Y; Mouratidis, Ioannis; Montgomery, Austin; Chantzi, Nikol; Pavlopoulos, Georgios A; Georgakopoulos-Soares, Ilias.

Comput Struct Biotechnol J ; 23: 2289-2303, 2024 Dec.

Article in English | MEDLINE | ID: mdl-38840832

ABSTRACT

The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
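
To make the terms in the abstract above concrete, the following is a minimal sketch of k-mer counting and of listing k-mers that are absent from a set of sequences. It is an illustration under simplifying assumptions, not a method from the review: the function names are hypothetical, only the forward strand is scanned, windows containing characters other than A, C, G, T are skipped, and absence is checked at one fixed k rather than searching for the shortest absent length, which is how nullomers are often defined.

```python
from collections import Counter
from itertools import product
from typing import Iterable, List

DNA = "ACGT"


def count_kmers(seq: str, k: int) -> Counter:
    """Count every length-k window of seq made up only of A, C, G, T."""
    seq = seq.upper()
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(DNA):
            counts[kmer] += 1
    return counts


def absent_kmers(seqs: Iterable[str], k: int) -> List[str]:
    """List, in sorted order, the DNA k-mers that occur in none of seqs.

    Enumerates all 4**k candidates, so this is only practical for small k.
    """
    seen = set()
    for seq in seqs:
        seen.update(count_kmers(seq, k))
    candidates = ("".join(p) for p in product(DNA, repeat=k))
    return [kmer for kmer in candidates if kmer not in seen]


if __name__ == "__main__":
    print(count_kmers("ACGTACGT", 2))
    print(len(absent_kmers(["ACGTACGT"], 2)))
```

Real tools typically replace the plain dictionary with compact encodings or probabilistic structures, since the number of possible k-mers grows as 4 to the power k.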

20.

kmerDB: A database encompassing the set of genomic and proteomic sequence information for each species.

Mouratidis, Ioannis; Baltoumas, Fotis A; Chantzi, Nikol; Patsakis, Michail; Chan, Candace S Y; Montgomery, Austin; Konnaris, Maxwell A; Aplakidou, Eleni; Georgakopoulos, George C; Das, Anshuman; Chartoumpekis, Dionysios V; Kovac, Jasna; Pavlopoulos, Georgios A; Georgakopoulos-Soares, Ilias.

Comput Struct Biotechnol J ; 23: 1919-1928, 2024 Dec.

Article in English | MEDLINE | ID: mdl-38711760

ABSTRACT

The decrease in sequencing expenses has facilitated the creation of reference genomes and proteomes for an expanding array of organisms. Nevertheless, no established repository that details organism-specific genomic and proteomic sequences of specific lengths, referred to as kmers, exists to our knowledge. In this article, we present kmerDB, a database accessible through an interactive web interface that provides kmer-based information from genomic and proteomic sequences in a systematic way. kmerDB currently contains 202,340,859,107 base pairs and 19,304,903,356 amino acids, spanning 54,039 and 21,865 reference genomes and proteomes, respectively, as well as 6,905,362 and 149,305,183 genomic and proteomic species-specific sequences, termed quasi-primes. Additionally, we provide access to 5,186,757 nucleic and 214,904,089 peptide sequences absent from every genome and proteome, termed primes. kmerDB features a user-friendly interface offering various search options and filters for easy parsing and searching. The service is available at: www.kmerdb.com.
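
The abstract above calls species-specific k-mers quasi-primes. As a toy illustration of that definition, the sketch below finds, for each species in a small collection, the k-mers that occur in that species and in no other. It says nothing about how kmerDB itself is built; the function names and the in-memory dictionary of sequences are assumptions for this example, and a real reference collection would need a far more memory-efficient approach.

```python
from collections import defaultdict
from typing import Dict, Set


def kmer_set(seq: str, k: int) -> Set[str]:
    """Return the distinct length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def species_specific_kmers(genomes: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each species to the k-mers found in it and in no other species.

    `genomes` is a hypothetical mapping of species name to sequence.
    """
    owners: Dict[str, Set[str]] = defaultdict(set)
    for species, seq in genomes.items():
        for kmer in kmer_set(seq, k):
            owners[kmer].add(species)

    specific: Dict[str, Set[str]] = {species: set() for species in genomes}
    for kmer, species_seen in owners.items():
        if len(species_seen) == 1:
            # Exactly one owner, so the k-mer is specific to that species.
            specific[next(iter(species_seen))].add(kmer)
    return specific


if __name__ == "__main__":
    toy = {"species_a": "ACGT", "species_b": "CGTT"}
    print(species_specific_kmers(toy, 2))
```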
