ABSTRACT
Millions of transcriptome samples were generated by the Library of Integrated Network-based Cellular Signatures (LINCS) program. When these data are processed into searchable signatures along with signatures extracted from Genotype-Tissue Expression (GTEx) and Gene Expression Omnibus (GEO), connections between drugs, genes, pathways and diseases can be illuminated. SigCom LINCS is a webserver that serves over a million gene expression signatures processed, analyzed, and visualized from LINCS, GTEx, and GEO. SigCom LINCS is built with Signature Commons, a cloud-agnostic skeleton Data Commons with a focus on serving searchable signatures. SigCom LINCS provides a rapid signature similarity search for mimickers and reversers given sets of up and down genes, a gene set, a single gene, or any search term. Additionally, users of SigCom LINCS can perform a metadata search to find and analyze subsets of signatures and find information about genes and drugs. SigCom LINCS is findable, accessible, interoperable, and reusable (FAIR) with metadata linked to standard ontologies and vocabularies. In addition, all the data and signatures within SigCom LINCS are available via a well-documented API. In summary, SigCom LINCS, available at https://maayanlab.cloud/sigcom-lincs, is a rich webserver resource for accelerating drug and target discovery in systems pharmacology.
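The mimicker and reverser search mentioned above can be illustrated with a minimal Python sketch. It assumes a simple overlap count between up and down gene sets; the abstract does not state how SigCom LINCS scores signatures, so `direction_score`, `rank_signatures`, and the toy signature names are hypothetical.

```python
def direction_score(query_up, query_down, sig_up, sig_down):
    """Positive when a signature mimics the query, negative when it reverses it."""
    query_up, query_down = set(query_up), set(query_down)
    sig_up, sig_down = set(sig_up), set(sig_down)
    # Genes that move in the same direction in the query and the signature.
    concordant = len(query_up & sig_up) + len(query_down & sig_down)
    # Genes that move in opposite directions.
    discordant = len(query_up & sig_down) + len(query_down & sig_up)
    return concordant - discordant


def rank_signatures(query_up, query_down, signatures):
    """Return (name, score) pairs, strongest mimicker first, strongest reverser last."""
    scored = [
        (name, direction_score(query_up, query_down, up, down))
        for name, (up, down) in signatures.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy signatures with made-up names; the gene symbols are placeholders only.
signatures = {
    "sig_A": ({"TP53", "MYC"}, {"EGFR", "KRAS"}),
    "sig_B": ({"EGFR", "KRAS"}, {"TP53", "MYC"}),
    "sig_C": ({"TP53"}, {"BRCA1"}),
}
ranked = rank_signatures({"TP53", "MYC"}, {"EGFR", "KRAS"}, signatures)
```

In this toy ranking the first entry would be read as a candidate mimicker and the last as a candidate reverser. A production search would likely weight genes and normalize for set size, which this sketch does not attempt.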
Subject(s)
Metadata; Transcriptome; Transcriptome/genetics; Search Engine
ABSTRACT
The Library of Integrated Network-based Cellular Signatures (LINCS) program is a national consortium funded by the NIH to generate a diverse and extensive reference library of cell-based perturbation-response signatures, along with novel data analytics tools to improve our understanding of human diseases at the systems level. In contrast to other large-scale data generation efforts, LINCS Data and Signature Generation Centers (DSGCs) employ a wide range of assay technologies cataloging diverse cellular responses. Integration of, and unified access to LINCS data has therefore been particularly challenging. The Big Data to Knowledge (BD2K) LINCS Data Coordination and Integration Center (DCIC) has developed data standards specifications, data processing pipelines, and a suite of end-user software tools to integrate and annotate LINCS-generated data, to make LINCS signatures searchable and usable for different types of users. Here, we describe the LINCS Data Portal (LDP) (http://lincsportal.ccs.miami.edu/), a unified web interface to access datasets generated by the LINCS DSGCs, and its underlying database, LINCS Data Registry (LDR). LINCS data served on the LDP contains extensive metadata and curated annotations. We highlight the features of the LDP user interface that is designed to enable search, browsing, exploration, download and analysis of LINCS data and related curated content.
Subject(s)
Databases, Factual; Cell Biology; Computational Biology; Data Curation; Databases, Genetic; Epigenomics; Humans; Metadata; Proteomics; Software; Systems Biology; User-Computer Interface
ABSTRACT
Enrichment analysis is a popular method for analyzing gene sets generated by genome-wide experiments. Here we present a significant update to one of the tools in this domain called Enrichr. Enrichr currently contains a large collection of diverse gene set libraries available for analysis and download. In total, Enrichr currently contains 180,184 annotated gene sets from 102 gene set libraries. New features have been added to Enrichr, including the ability to submit fuzzy sets, upload BED files, an improved application programming interface, and visualization of the results as clustergrams. Overall, Enrichr is a comprehensive resource for curated gene sets and a search engine that accumulates biological knowledge for further biological discoveries. Enrichr is freely available at: http://amp.pharm.mssm.edu/Enrichr.
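As a rough illustration of what enrichment analysis computes, the sketch below evaluates a hypergeometric over-representation p-value for the overlap between a query gene list and one annotated gene set. This is a common textbook approach and is assumed here for illustration only; the abstract does not specify how Enrichr scores gene sets, and the function name, arguments, and example numbers are hypothetical.

```python
from math import comb


def enrichment_p_value(overlap, query_size, set_size, background_size):
    """Probability of seeing at least `overlap` shared genes by chance.

    Models drawing `query_size` genes without replacement from a background
    of `background_size` genes, of which `set_size` belong to the gene set.
    """
    max_overlap = min(query_size, set_size)
    total = comb(background_size, query_size)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(set_size, k) * comb(background_size - set_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Made-up example: 12 of 200 query genes fall in a 300-gene set,
# against an assumed background of 20,000 genes.
p = enrichment_p_value(overlap=12, query_size=200, set_size=300, background_size=20000)
```

When many gene sets are tested at once, the resulting p-values would normally be corrected for multiple testing before being ranked.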
Subject(s)
Computational Biology/methods; Gene Library; Gene Ontology; User-Computer Interface; Benchmarking; Computational Biology/statistics & numerical data; Databases, Genetic; Gene Expression Profiling; Genome, Human; Humans; Internet; Molecular Sequence Annotation
ABSTRACT
The volume and diversity of data in biomedical research have been rapidly increasing in recent years. While such data hold significant promise for accelerating discovery, their use entails many challenges including: the need for adequate computational infrastructure, secure processes for data sharing and access, tools that allow researchers to find and integrate diverse datasets, and standardized methods of analysis. These are just some elements of a complex ecosystem that needs to be built to support the rapid accumulation of these data. The NIH Big Data to Knowledge (BD2K) initiative aims to facilitate digitally enabled biomedical research. Within the BD2K framework, the Commons initiative is intended to establish a virtual environment that will facilitate the use, interoperability, and discoverability of shared digital objects used for research. The BD2K Commons Framework Pilots Working Group (CFPWG) was established to clarify goals and work on pilot projects that address existing gaps toward realizing the vision of the BD2K Commons. This report reviews highlights from a two-day meeting involving the BD2K CFPWG to provide insights on trends and considerations in advancing Big Data science for biomedical research in the United States.
Subject(s)
Datasets as Topic; Information Dissemination; National Institutes of Health (U.S.); Biomedical Research; Humans; Knowledge; Translational Research, Biomedical; United States
ABSTRACT
BACKGROUND: Birth defects are functional and structural abnormalities that impact about 1 in 33 births in the United States. They have been attributed to genetic and other factors such as drugs, cosmetics, food, and environmental pollutants during pregnancy, but for most birth defects there are no known causes. METHODS: To further characterize associations between small molecule compounds and their potential to induce specific birth abnormalities, we gathered knowledge from multiple sources to construct a reproductive toxicity Knowledge Graph (ReproTox-KG) with a focus on associations between birth defects, drugs, and genes. Specifically, we gathered data from drug/birth-defect associations from co-mentions in published abstracts, gene/birth-defect associations from genetic studies, drug- and preclinical-compound-induced gene expression changes in cell lines, known drug targets, genetic burden scores for human genes, and placental crossing scores for small molecules. RESULTS: Using ReproTox-KG and semi-supervised learning (SSL), we scored >30,000 preclinical small molecules for their potential to cross the placenta and induce birth defects, and identified >500 birth-defect/gene/drug cliques that can be used to explain molecular mechanisms for drug-induced birth defects. The ReproTox-KG can be accessed via a web-based user interface available at https://maayanlab.cloud/reprotox-kg. This site enables users to explore the associations between birth defects, approved and preclinical drugs, and all human genes. CONCLUSIONS: ReproTox-KG provides a resource for exploring knowledge about the molecular mechanisms of birth defects with the potential of predicting the likelihood of genes and preclinical small molecules to induce birth defects.
While birth defects are common, for most birth defects there are no known causes. During pregnancy, developing babies are exposed to drugs, cosmetics, food, and environmental pollutants that may cause birth defects. However, exactly how these environmental factors are involved in producing birth defects is difficult to discern. Also, birth defects can be a consequence of the genes inherited from the parents. We combined general data about human genes and drugs with specific data previously implicating genes and drugs in inducing birth defects to create a knowledge graph representation that connects genes, drugs, and birth defects. This knowledge graph can be used to explore new links that may explain why birth defects occur, particularly those that result from a combination of inherited and environmental influences.
ABSTRACT
Motivation: Many biological and biomedical researchers commonly search for information about genes and drugs to gather knowledge from these resources. For the most part, such information is served as landing pages in disparate data repositories and web portals. Results: The Gene and Drug Landing Page Aggregator (GDLPA) provides users with access to 50 gene-centric and 19 drug-centric repositories, enabling them to retrieve landing pages corresponding to their gene and drug queries. Bringing these resources together into one dashboard that directs users to the landing pages across many resources can help centralize gene- and drug-centric knowledge, as well as raise awareness of available resources that may be missed when using standard search engines. To demonstrate the utility of GDLPA, case studies for the gene klotho and the drug remdesivir were developed. The first case study highlights the potential role of klotho as a drug target for aging and kidney disease, while the second study gathers knowledge regarding approval, usage, and safety for remdesivir, the first approved coronavirus disease 2019 therapeutic. Finally, based on our experience, we provide guidelines for developing effective landing pages for genes and drugs. Availability and implementation: GDLPA is open source and is available from: https://cfde-gene-pages.cloud/. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
ABSTRACT
The Library of Integrated Network-based Cellular Signatures (LINCS) was an NIH Common Fund program that aimed to expand our knowledge about human cellular responses to chemical, genetic, and microenvironment perturbations. Responses to perturbations were measured by transcriptomics, proteomics, cellular imaging, and other high content assays. The second phase of the LINCS program, which lasted 7 years, involved the engagement of six data and signature generation centers (DSGCs) and one data coordination and integration center (DCIC). The DSGCs and the DCIC developed several digital resources, including tools, databases, and workflows that aim to facilitate the use of the LINCS data and integrate this data with other publicly available data. The digital resources developed by the DSGCs and the DCIC can be used to gain new biological and pharmacological insights that can lead to the development of novel therapeutics. This protocol provides step-by-step instructions for processing the LINCS data into signatures, and utilizing the digital resources developed by the LINCS consortia for hypothesis generation and knowledge discovery. © 2022 The Authors. Current Protocols published by Wiley Periodicals LLC.
Basic Protocol 1: Navigating L1000 tools and data in CLUE.io
Basic Protocol 2: Computing signatures from the L1000 data with the CD method
Basic Protocol 3: Analyzing lists of differentially expressed genes and querying them against the L1000 data with BioJupies and the Bulk RNA-seq Appyter
Basic Protocol 4: Utilizing the L1000FWD resource for drug discovery
Basic Protocol 5: KINOMEscan and the KINOMEscan Appyter
Basic Protocol 6: LINCS P100 and GCP Proteomics Assays
Basic Protocol 7: The LINCS Joint Project (LJP)
Basic Protocol 8: The LINCS Data Portals and SigCom LINCS
Basic Protocol 9: Creating and analyzing signatures with iLINCS.
Subject(s)
Drug Discovery; Proteomics; Databases, Factual; Drug Discovery/methods; Gene Library; Humans; Transcriptome
ABSTRACT
Profiling samples from patients, tissues, and cells with genomics, transcriptomics, epigenomics, proteomics, and metabolomics ultimately produces lists of genes and proteins that need to be further analyzed and integrated in the context of known biology. Enrichr (Chen et al., 2013; Kuleshov et al., 2016) is a gene set search engine that enables the querying of hundreds of thousands of annotated gene sets. Enrichr uniquely integrates knowledge from many high-profile projects to provide synthesized information about mammalian genes and gene sets. The platform provides various methods to compute gene set enrichment, and the results are visualized in several interactive ways. This protocol provides a summary of the key features of Enrichr, which include using Enrichr programmatically and embedding an Enrichr button on any website. © 2021 Wiley Periodicals LLC.
Basic Protocol 1: Analyzing lists of differentially expressed genes from transcriptomics, proteomics and phosphoproteomics, GWAS studies, or other experimental studies
Basic Protocol 2: Searching Enrichr by a single gene or key search term
Basic Protocol 3: Preparing raw or processed RNA-seq data through BioJupies in preparation for Enrichr analysis
Basic Protocol 4: Analyzing gene sets for model organisms using modEnrichr
Basic Protocol 5: Using Enrichr in Geneshot
Basic Protocol 6: Using Enrichr in ARCHS4
Basic Protocol 7: Using the enrichment analysis visualization Appyter to visualize Enrichr results
Basic Protocol 8: Using the Enrichr API
Basic Protocol 9: Adding an Enrichr button to a website.
Subject(s)
Knowledge Discovery; Software; Animals; Computational Biology; Genomics; Humans; RNA-Seq
ABSTRACT
Although widely prevalent, Lyme disease is still under-diagnosed and misunderstood. Here we followed 73 acute Lyme disease patients and uninfected controls over a period of a year. At each visit, RNA-sequencing was applied to profile patients' peripheral blood mononuclear cells in addition to extensive clinical phenotyping. Based on the projection of the RNA-seq data into lower dimensions, we observe that the cases are separated from controls, and almost all cases never return to cluster with the controls over time. Enrichment analysis of the differentially expressed genes between clusters identifies up-regulation of immune response genes. This observation is also supported by deconvolution analysis to identify the changes in cell type composition due to Lyme disease infection. Importantly, we developed several machine learning classifiers that attempt to perform various Lyme disease classifications. We show that Lyme patients can be distinguished from the controls as well as from COVID-19 patients, but classification was not successful in distinguishing those patients with early Lyme disease cases that would advance to develop post-treatment persistent symptoms.
Subject(s)
Leukocytes, Mononuclear/immunology; Lyme Disease/genetics; Adult; COVID-19/genetics; COVID-19/immunology; Cytokines/genetics; Cytokines/immunology; Female; Follow-Up Studies; Humans; Leukocytes, Mononuclear/chemistry; Lyme Disease/blood; Lyme Disease/immunology; Machine Learning; Male; Middle Aged; Prospective Studies; RNA-Seq
ABSTRACT
Jupyter Notebooks have transformed the communication of data analysis pipelines by facilitating a modular structure that brings together code, markdown text, and interactive visualizations. Here, we extended Jupyter Notebooks to broaden their accessibility with Appyters. Appyters turn Jupyter Notebooks into fully functional standalone web-based bioinformatics applications. Appyters present to users an entry form enabling them to upload their data and set various parameters for a multitude of data analysis workflows. Once the form is filled, the Appyter executes the corresponding notebook in the cloud, producing the output without requiring the user to interact directly with the code. Appyters were used to create many bioinformatics web-based reusable workflows, including applications to build customized machine learning pipelines, analyze omics data, and produce publishable figures. These Appyters are served in the Appyters Catalog at https://appyters.maayanlab.cloud. In summary, Appyters enable the rapid development of interactive web-based bioinformatics applications.
ABSTRACT
The application of proteomic techniques to neuroscientific research provides an opportunity for a greater understanding of nervous system structure and function. As increasing amounts of neuroproteomic data become available, it is necessary to formulate methods to integrate these data in a meaningful way to obtain a more comprehensive picture of neuronal subcompartments. Furthermore, computational methods can be used to make biologically relevant predictions from large proteomic data sets. Here, we applied an integrated proteomics and systems biology approach to characterize the presynaptic (PRE) nerve terminal. For this, we carried out proteomic analyses of presynaptically enriched fractions, and generated a PRE literature-based protein-protein interaction network. We combined these with other proteomic analyses to generate a core list of 117 PRE proteins, and used graph theory-inspired algorithms to predict 92 additional components and a PRE complex containing 17 proteins. Some of these predictions were validated experimentally, indicating that the computational analyses can identify novel proteins and complexes in a subcellular compartment. We conclude that the combination of techniques (proteomics, data integration, and computational analyses) used in this study is useful in obtaining a comprehensive understanding of functional components, especially low-abundance entities and/or interactions in the PRE nerve terminal.
Subject(s)
Presynaptic Terminals/metabolism; Proteome/metabolism; Proteomics/methods; Animals; Chromatography, High Pressure Liquid; Databases, Protein; Hippocampus/chemistry; Immunohistochemistry; Male; Mice; Mice, Inbred C57BL; Rats; Rats, Sprague-Dawley; Reproducibility of Results; Signal Transduction; Tandem Mass Spectrometry
ABSTRACT
As more digital resources are produced by the research community, it is becoming increasingly important to harmonize and organize them for synergistic utilization. The findable, accessible, interoperable, and reusable (FAIR) guiding principles have prompted many stakeholders to consider strategies for tackling this challenge. The FAIRshake toolkit was developed to enable the establishment of community-driven FAIR metrics and rubrics paired with manual and automated FAIR assessments. FAIR assessments are visualized as an insignia that can be embedded within digital-resources-hosting websites. Using FAIRshake, a variety of biomedical digital resources were manually and automatically evaluated for their level of FAIRness.
Subject(s)
Information Dissemination/methods; Internet/trends; Online Systems/standards; Health Resources/standards; Humans
ABSTRACT
The Library of Integrated Network-Based Cellular Signatures (LINCS) is an NIH Common Fund program that catalogs how human cells globally respond to chemical, genetic, and disease perturbations. Resources generated by LINCS include experimental and computational methods, visualization tools, molecular and imaging data, and signatures. By assembling an integrated picture of the range of responses of human cells exposed to many perturbations, the LINCS program aims to better understand human disease and to advance the development of new therapies. Perturbations under study include drugs, genetic perturbations, tissue micro-environments, antibodies, and disease-causing mutations. Responses to perturbations are measured by transcript profiling, mass spectrometry, cell imaging, and biochemical methods, among other assays. The LINCS program focuses on cellular physiology shared among tissues and cell types relevant to an array of diseases, including cancer, heart disease, and neurodegenerative disorders. This Perspective describes LINCS technologies, datasets, tools, and approaches to data accessibility and reusability.
Subject(s)
Cataloging/methods; Systems Biology/methods; Computational Biology/methods; Databases, Chemical/standards; Gene Expression Profiling/methods; Gene Library; Humans; Information Storage and Retrieval/methods; National Health Programs; National Institutes of Health (U.S.)/standards; Transcriptome; United States
ABSTRACT
The global relationship between drugs that are approved for therapeutic use and the human genome is not known. We employed graph-theory methods to analyze the drugs approved by the Food and Drug Administration (FDA) and their known molecular targets. We used the FDA Approved Drug Products with Therapeutic Equivalence Evaluations 26th Edition Electronic Orange Book (EOB) to identify all FDA-approved drugs and their active ingredients. We then connected the list of active ingredients extracted from the EOB to those known human protein targets included in the DrugBank database and constructed a bipartite network. We computed network statistics and conducted Gene Ontology analysis on the drug targets and drug categories. We find that the drug to drug-target relationship in the bipartite network is scale-free. Several classes of proteins in the human genome appear to be better targets for drugs since they appear to be selectively enriched as drug targets for the currently FDA-approved drugs. These initial observations allow for development of an integrated research methodology to identify general principles of the drug discovery process.
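The bipartite construction described in this abstract can be sketched in a few lines of Python. The sketch assumes the input is a list of (drug, target) pairs; the helper names and the placeholder pairs are hypothetical and are not taken from the EOB or DrugBank.

```python
from collections import Counter, defaultdict


def build_bipartite(pairs):
    """Map each drug to its targets and each target to its drugs."""
    drug_to_targets = defaultdict(set)
    target_to_drugs = defaultdict(set)
    for drug, target in pairs:
        drug_to_targets[drug].add(target)
        target_to_drugs[target].add(drug)
    return drug_to_targets, target_to_drugs


def degree_distribution(adjacency):
    """Count how many nodes have each degree: {degree: number_of_nodes}."""
    return Counter(len(neighbors) for neighbors in adjacency.values())


# Hypothetical placeholder pairs used only to exercise the functions.
pairs = [
    ("drug_1", "target_A"),
    ("drug_1", "target_B"),
    ("drug_2", "target_A"),
    ("drug_3", "target_A"),
    ("drug_3", "target_C"),
]
drug_to_targets, target_to_drugs = build_bipartite(pairs)
drug_degrees = degree_distribution(drug_to_targets)
target_degrees = degree_distribution(target_to_drugs)
```

One informal check for scale-free behavior is to plot the degree counts on log-log axes and look for an approximately straight line; whether the authors used that check is not stated in the abstract.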
Subject(s)
Computational Biology; Databases, Factual; Drug Approval; United States Food and Drug Administration; Humans; Systems Biology; United States
ABSTRACT
Gene expression data are accumulating exponentially in public repositories. Reanalysis and integration of themed collections from these studies may provide new insights, but requires further human curation. Here we report a crowdsourcing project to annotate and reanalyse a large number of gene expression profiles from Gene Expression Omnibus (GEO). Through a massive open online course on Coursera, over 70 participants from over 25 countries identify and annotate 2,460 single-gene perturbation signatures, 839 disease versus normal signatures, and 906 drug perturbation signatures. All these signatures are unique and are manually validated for quality. Global analysis of these signatures confirms known associations and identifies novel associations between genes, diseases and drugs. The manually curated signatures are used as a training set to develop classifiers for extracting similar signatures from the entire GEO repository. We develop a web portal to serve these signatures for query, download and visualization.
ABSTRACT
Because of the complexity inherent in biological systems, many researchers frequently rely on a combination of global analysis and computational approaches to gain insight into both (i) how interacting components can produce complex system behaviors, and (ii) how changes in conditions may alter these behaviors. Because the biological details of a particular system are generally not taught along with the quantitative approaches that enable hypothesis generation and analysis of the system, we developed a course at Mount Sinai School of Medicine that introduces first-year graduate students to these computational principles and approaches. We anticipate that such approaches will apply throughout the biomedical sciences and that courses such as the one described here will become a core requirement of many graduate programs in the biological and biomedical sciences.
Subject(s)
Models, Biological; Systems Biology/education; Systems Biology/methods; Systems Biology/trends; Animals; Humans
ABSTRACT
BACKGROUND: Word-clouds recently emerged on the web as a solution for quickly summarizing text by maximizing the display of most relevant terms about a specific topic in the minimum amount of space. As biologists are faced with the daunting amount of new research data commonly presented in textual formats, word-clouds can be used to summarize and represent biological and/or biomedical content for various applications. RESULTS: Genes2WordCloud is a web application that enables users to quickly identify biological themes from gene lists and research-relevant text by constructing and displaying word-clouds. It provides users with several different options and ideas for the sources that can be used to generate a word-cloud. Different options for rendering and coloring the word-clouds give users the flexibility to quickly generate customized word-clouds of their choice. METHODS: Genes2WordCloud is a word-cloud generator and a word-cloud viewer that is based on WordCram implemented using Java, Processing, AJAX, MySQL, and PHP. Text is fetched from several sources and then processed to extract the most relevant terms with their computed weights based on word frequencies. Genes2WordCloud is freely available for use online; it is open source software and is available for installation on any web-site along with supporting documentation at http://www.maayanlab.net/G2W. CONCLUSIONS: Genes2WordCloud provides a useful way to summarize and visualize large amounts of textual biological data or to find biological themes from several different sources. The open source availability of the software enables users to implement customized word-clouds on their own web-sites and desktop applications.
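The frequency-based weighting mentioned in the METHODS can be sketched as follows. This is an illustrative Python version, not the tool's Java/PHP implementation; the tokenizer, the short stop-word list, and the scaling to the most frequent term are all assumptions.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real tool would use a longer one.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "by", "for"}


def term_weights(text, top_n=10):
    """Return the top_n terms with weights scaled to the most frequent term."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    if not counts:
        return []
    highest = max(counts.values())
    # Each weight is the term's count divided by the highest count.
    return [(term, count / highest) for term, count in counts.most_common(top_n)]


weights = term_weights(
    "Kinase signaling and kinase inhibition in the kinase network; signaling matters."
)
```

A renderer could then map each weight to a font size, so that the most frequent term is drawn largest.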
ABSTRACT
Signaling from G(i/o)-coupled G protein-coupled receptors (GPCRs), such as the serotonin 1B, cannabinoid 1, and dopamine D2 receptors, inhibits cAMP production by adenylyl cyclases and activates protein kinases, such as Src, mitogen-activated protein kinases 1 and 2, and Akt. Activation of these protein kinases results in stimulation of neurite outgrowth in the central nervous system (CNS) and in neuronal cell lines. This Connections Map traces downstream signaling pathways from G(i/o)-coupled GPCRs to key protein kinases and key transcription factors involved in neuronal differentiation. Components in the Science Signaling Connections Map are linked to Nature Molecule Pages. This interoperability provides ready access to detail that includes information about specific states for the nodes.
Subject(s)
Cell Differentiation; GTP-Binding Protein alpha Subunits, Gi-Go/metabolism; Animals; Cell Line; GTP Phosphohydrolases/chemistry; MAP Kinase Signaling System; Mice; Models, Biological; Models, Neurological; Neurites/metabolism; Neurons/metabolism; Receptors, G-Protein-Coupled/metabolism; Signal Transduction; ras Proteins/metabolism
ABSTRACT
BACKGROUND: Studies of cellular signaling indicate that signal transduction pathways combine to form large networks of interactions. Viewing protein-protein and ligand-protein interactions as graphs (networks), where biomolecules are represented as nodes and their interactions are represented as links, is a promising approach for integrating experimental results from different sources to achieve a systematic understanding of the molecular mechanisms driving cell phenotype. The emergence of large-scale signaling networks provides an opportunity for topological statistical analysis, while visualization of such networks represents a challenge. RESULTS: SNAVI is a Windows-based desktop application that implements standard network analysis methods to compute the clustering, connectivity distribution, and detection of network motifs, and provides means to visualize networks and network motifs. SNAVI is capable of generating linked web pages from network datasets loaded in text format. SNAVI can also create networks from lists of gene or protein names. CONCLUSION: SNAVI is a useful tool for analyzing, visualizing and sharing cell signaling data. SNAVI is open source free software. The installation may be downloaded from: http://snavi.googlecode.com. The source code can be accessed from: http://snavi.googlecode.com/svn/trunk.
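Two of the standard measures named in the RESULTS, clustering and connectivity distribution, can be sketched in Python for an undirected network. This is a minimal illustration of the textbook definitions, not SNAVI's code; the function names and the four-node example are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def local_clustering(graph, node):
    """Fraction of a node's neighbor pairs that are themselves linked."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return linked / (k * (k - 1) / 2)


def average_clustering(graph):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(graph, node) for node in graph) / len(graph)


def connectivity_distribution(graph):
    """Count how many nodes have each degree: {degree: number_of_nodes}."""
    return Counter(len(neighbors) for neighbors in graph.values())


# Hypothetical toy network: a triangle A-B-C with one extra link C-D.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
graph = build_graph(edges)
```

Signaling networks are often directed and signed, so a fuller treatment would track link direction and effect; the undirected form is used here only to keep the definitions short.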