
1.

Fast and accurate genome-wide predictions and structural modeling of protein-protein interactions using Galaxy.

Guerler, Aysam; Baker, Dannon; van den Beek, Marius; Gruening, Bjoern; Bouvier, Dave; Coraor, Nate; Shank, Stephen D; Zehr, Jordan D; Schatz, Michael C; Nekrutenko, Anton.

BMC Bioinformatics ; 24(1): 263, 2023 Jun 23.

Article En | MEDLINE | ID: mdl-37353753

BACKGROUND: Protein-protein interactions play a crucial role in almost all cellular processes. Identifying interacting proteins reveals insight into living organisms and yields novel drug targets for disease treatment. Here, we present a publicly available, automated pipeline to predict genome-wide protein-protein interactions and produce high-quality multimeric structural models. RESULTS: Application of our method to the Human and Yeast genomes yields protein-protein interaction networks similar in quality to common experimental methods. We identified and modeled Human proteins likely to interact with the papain-like protease of SARS-CoV2's non-structural protein 3. We also produced models of SARS-CoV2's spike protein (S) interacting with myelin-oligodendrocyte glycoprotein receptor and dipeptidyl peptidase-4. CONCLUSIONS: The presented method is capable of confidently identifying interactions while providing high-quality multimeric structural models for experimental validation. The interactome modeling pipeline is available at usegalaxy.org and usegalaxy.eu.

COVID-19 , Protein Interaction Mapping , Humans , RNA, Viral/metabolism , SARS-CoV-2 , Saccharomyces cerevisiae/metabolism

2.

Expanding the Galaxy's reference data.

VijayKrishna, Nagampalli; Joshi, Jayadev; Coraor, Nate; Hillman-Jackson, Jennifer; Bouvier, Dave; van den Beek, Marius; Eguinoa, Ignacio; Coppens, Frederik; Davis, John; Stolarczyk, Michal; Sheffield, Nathan C; Gladman, Simon; Cuccuru, Gianmauro; Grüning, Björn; Soranzo, Nicola; Rasche, Helena; Langhorst, Bradley W; Bernt, Matthias; Fornika, Dan; de Lima Morais, David Anderson; Barrette, Michel; van Heusden, Peter; Petrillo, Mauro; Puertas-Gallardo, Antonio; Patak, Alex; Hotz, Hans-Rudolf; Blankenberg, Daniel.

Bioinform Adv ; 2(1): vbac030, 2022.

Article En | MEDLINE | ID: mdl-35669346

Summary: Properly and effectively managing reference datasets is an important task for many bioinformatics analyses. Refgenie is a reference asset management system that allows users to easily organize, retrieve and share such datasets. Here, we describe the integration of refgenie into the Galaxy platform. Server administrators are able to configure Galaxy to make use of reference datasets made available on a refgenie instance. In addition, a Galaxy Data Manager tool has been developed to provide a graphical interface to refgenie's remote reference retrieval functionality. A large collection of reference datasets has also been made available using the CVMFS (CernVM File System) repository from GalaxyProject.org, with mirrors across the USA, Canada, Europe and Australia, enabling easy use outside of Galaxy. Availability and implementation: The ability of Galaxy to use refgenie assets was added to the core Galaxy framework in version 22.01, which is available from https://github.com/galaxyproject/galaxy under the Academic Free License version 3.0. The refgenie Data Manager tool can be installed via the Galaxy ToolShed, with source code managed at https://github.com/BlankenbergLab/galaxy-tools-blankenberg/tree/main/data_managers/data_manager_refgenie_pull and released using an MIT license. Access to existing data is also available through CVMFS, with instructions at https://galaxyproject.org/admin/reference-data-repo/. No new data were generated or analyzed in support of this research.

3.

Training Infrastructure as a Service.

Rasche, Helena; Hyde, Cameron; Davis, John; Gladman, Simon; Coraor, Nate; Bretaudeau, Anthony; Cuccuru, Gianmauro; Bacon, Wendi; Serrano-Solano, Beatriz; Hillman-Jackson, Jennifer; Hiltemann, Saskia; Zhou, Miaomiao; Grüning, Björn; Stubbs, Andrew.

Gigascience ; 12, 2022 12 28.

Article En | MEDLINE | ID: mdl-37395629

BACKGROUND: Hands-on training, whether in bioinformatics or other domains, often requires significant technical resources and knowledge to set up and run. Instructors must have access to powerful compute infrastructure that can support resource-intensive jobs running efficiently. Often this is achieved using a private server where there is no contention for the queue. However, this places a significant prerequisite knowledge or labor barrier for instructors, who must spend time coordinating deployment and management of compute resources. Furthermore, with the increase of virtual and hybrid teaching, where learners are located in separate physical locations, it is difficult to track student progress as efficiently as during in-person courses. FINDINGS: Originally developed by Galaxy Europe and the Gallantries project, together with the Galaxy community, we have created Training Infrastructure-as-a-Service (TIaaS), aimed at providing user-friendly training infrastructure to the global training community. TIaaS provides dedicated training resources for Galaxy-based courses and events. Event organizers register their course, after which trainees are transparently placed in a private queue on the compute infrastructure, which ensures jobs complete quickly, even when the main queue is experiencing high wait times. A built-in dashboard allows instructors to monitor student progress. CONCLUSIONS: TIaaS provides a significant improvement for instructors and learners, as well as infrastructure administrators. The instructor dashboard makes remote events not only possible but also easy. Students experience continuity of learning, as all training happens on Galaxy, which they can continue to use after the event. In the past 60 months, 504 training events with over 24,000 learners have used this infrastructure for Galaxy training.

Learning , Software , Humans , Europe , Computational Biology

4.

No more business as usual: Agile and effective responses to emerging pathogen threats require open data and open analytics.

Baker, Dannon; van den Beek, Marius; Blankenberg, Daniel; Bouvier, Dave; Chilton, John; Coraor, Nate; Coppens, Frederik; Eguinoa, Ignacio; Gladman, Simon; Grüning, Björn; Keener, Nicholas; Larivière, Delphine; Lonie, Andrew; Kosakovsky Pond, Sergei; Maier, Wolfgang; Nekrutenko, Anton; Taylor, James; Weaver, Steven.

PLoS Pathog ; 16(8): e1008643, 2020 08.

Article En | MEDLINE | ID: mdl-32790776

The current state of much of the Wuhan pneumonia virus (severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2]) research shows a regrettable lack of data sharing and considerable analytical obfuscation. This impedes global research cooperation, which is essential for tackling public health emergencies and requires unimpeded access to data, analysis tools, and computational infrastructure. Here, we show that community efforts in developing open analytical software tools over the past 10 years, combined with national investments into scientific computational infrastructure, can overcome these deficiencies and provide an accessible platform for tackling global health emergencies in an open and transparent manner. Specifically, we use all SARS-CoV-2 genomic data available in the public domain so far to (1) underscore the importance of access to raw data and (2) demonstrate that existing community efforts in curation and deployment of biomedical software can reliably support rapid, reproducible research during global health crises. All our analyses are fully documented at https://github.com/galaxyproject/SARS-CoV-2.

Betacoronavirus/pathogenicity , Coronavirus Infections/virology , Pneumonia, Viral/virology , Public Health , Severe Acute Respiratory Syndrome/virology , COVID-19 , Data Analysis , Humans , Pandemics , SARS-CoV-2

5.

Galaxy External Display Applications: closing a dataflow interoperability loop.

Blankenberg, Daniel; Chilton, John; Coraor, Nate.

Nat Methods ; 17(2): 123-124, 2020 02.

Article En | MEDLINE | ID: mdl-31959996

Datasets as Topic , Mobile Applications , User-Computer Interface , Programming Languages , Research Personnel

6.

Predicting runtimes of bioinformatics tools based on historical data: five years of Galaxy usage.

Tyryshkina, Anastasia; Coraor, Nate; Nekrutenko, Anton.

Bioinformatics ; 35(18): 3453-3460, 2019 09 15.

Article En | MEDLINE | ID: mdl-30698642

MOTIVATION: One of the many technical challenges that arise when scheduling bioinformatics analyses at scale is determining the appropriate amount of memory and processing resources. Both over- and under-allocation lead to an inefficient use of computational infrastructure. Over-allocation locks resources that could otherwise be used for other analyses. Under-allocation causes job failure and requires analyses to be repeated with a larger memory or runtime allowance. We address this challenge by using a historical dataset of bioinformatics analyses run on the Galaxy platform to demonstrate the feasibility of an online service for resource requirement estimation. RESULTS: Here we introduce the Galaxy job run dataset and test popular machine learning models on the task of resource usage prediction. We include three popular forest models: the extra trees regressor, the gradient boosting regressor and the random forest regressor, and find that random forests perform best in the runtime prediction task. We also present two methods of choosing walltimes for previously unseen jobs. Quantile regression forests are more accurate in their predictions, and grant the ability to improve performance by changing the confidence of the estimates. However, the sizes of the confidence intervals are variable and cannot be absolutely constrained. Random forest classifiers address this problem by providing control over the size of the prediction intervals with an accuracy that is comparable to that of the regressor. We show that estimating the memory requirements of a job is possible using the same methods, which, as far as we know, has not been done before. Such estimation can be highly beneficial for accurate resource allocation. AVAILABILITY AND IMPLEMENTATION: Source code available at https://github.com/atyryshkina/algorithm-performance-analysis, implemented in Python. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Computational Biology , Software , Machine Learning
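
Illustrative note on entry 6: the abstract describes choosing a walltime for a job so that it covers the job's runtime with a chosen confidence. The sketch below is not the authors' quantile regression forest; it is a much simpler stand-in that picks an empirical quantile of past runtimes for one tool. The function name `choose_walltime` and the sample runtimes are hypothetical and serve only to show how raising the confidence level trades longer reservations for fewer under-allocated jobs.

```python
import math


def choose_walltime(historical_runtimes, confidence=0.95):
    """Return the smallest observed runtime that covers at least
    `confidence` of the historical jobs (a plain empirical quantile).

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if not historical_runtimes:
        raise ValueError("need at least one historical runtime")
    if not 0 < confidence <= 1:
        raise ValueError("confidence must be in (0, 1]")
    ordered = sorted(historical_runtimes)
    # Position of the smallest value whose rank covers the requested share.
    index = math.ceil(confidence * len(ordered)) - 1
    return ordered[index]


if __name__ == "__main__":
    # Made-up runtimes (seconds) for a single tool, including slow outliers.
    runtimes_seconds = [30, 42, 45, 50, 61, 75, 90, 120, 300, 900]
    for level in (0.5, 0.95):
        print(level, choose_walltime(runtimes_seconds, level))
```

A low confidence level gives a short walltime that many jobs would exceed (under-allocation), while a high level is dominated by rare slow jobs (over-allocation); a learned model, as described in the abstract, would condition this choice on job parameters instead of using one pooled history.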

7.

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2018 update.

Afgan, Enis; Baker, Dannon; Batut, Bérénice; van den Beek, Marius; Bouvier, Dave; Cech, Martin; Chilton, John; Clements, Dave; Coraor, Nate; Grüning, Björn A; Guerler, Aysam; Hillman-Jackson, Jennifer; Hiltemann, Saskia; Jalili, Vahid; Rasche, Helena; Soranzo, Nicola; Goecks, Jeremy; Taylor, James; Nekrutenko, Anton; Blankenberg, Daniel.

Nucleic Acids Res ; 46(W1): W537-W544, 2018 07 02.

Article En | MEDLINE | ID: mdl-29790989

Galaxy (homepage: https://galaxyproject.org, main public server: https://usegalaxy.org) is a web-based scientific analysis platform used by tens of thousands of scientists across the world to analyze large biomedical datasets such as those found in genomics, proteomics, metabolomics and imaging. Started in 2005, Galaxy continues to focus on three key challenges of data-driven biomedical science: making analyses accessible to all researchers, ensuring analyses are completely reproducible, and making it simple to communicate analyses so that they can be reused and extended. During the last two years, the Galaxy team and the open-source community around Galaxy have made substantial improvements to Galaxy's core framework, user interface, tools, and training materials. Framework and user interface improvements now enable Galaxy to be used for analyzing tens of thousands of datasets, and >5500 tools are now available from the Galaxy ToolShed. The Galaxy community has led an effort to create numerous high-quality tutorials focused on common types of genomic analyses. The Galaxy developer and user communities continue to grow and be integral to Galaxy's development. The number of Galaxy public servers, developers contributing to the Galaxy framework and its tools, and users of the main Galaxy server have all increased substantially.

Genomics/statistics & numerical data , Metabolomics/statistics & numerical data , Molecular Imaging/statistics & numerical data , Proteomics/statistics & numerical data , User-Computer Interface , Datasets as Topic , Humans , Information Dissemination , International Cooperation , Internet , Reproducibility of Results

8.

Jupyter and Galaxy: Easing entry barriers into complex data analyses for biomedical researchers.

Grüning, Björn A; Rasche, Eric; Rebolledo-Jaramillo, Boris; Eberhard, Carl; Houwaart, Torsten; Chilton, John; Coraor, Nate; Backofen, Rolf; Taylor, James; Nekrutenko, Anton.

PLoS Comput Biol ; 13(5): e1005425, 2017 05.

Article En | MEDLINE | ID: mdl-28542180

What does it take to convert a heap of sequencing data into a publishable result? First, common tools are employed to reduce primary data (sequencing reads) to a form suitable for further analyses (i.e., the list of variable sites). The subsequent exploratory stage is much more ad hoc and requires the development of custom scripts and pipelines, making it problematic for biomedical researchers. Here, we describe a hybrid platform combining common analysis pathways with the ability to explore data interactively. It aims to fully encompass and simplify the "raw data-to-publication" pathway and make it reproducible.

Biomedical Research/methods , Biomedical Research/organization & administration , Computational Biology , High-Throughput Nucleotide Sequencing , Research Personnel , Software , Humans

9.

The Galaxy platform for accessible, reproducible and collaborative biomedical analyses: 2016 update.

Afgan, Enis; Baker, Dannon; van den Beek, Marius; Blankenberg, Daniel; Bouvier, Dave; Cech, Martin; Chilton, John; Clements, Dave; Coraor, Nate; Eberhard, Carl; Grüning, Björn; Guerler, Aysam; Hillman-Jackson, Jennifer; Von Kuster, Greg; Rasche, Eric; Soranzo, Nicola; Turaga, Nitesh; Taylor, James; Nekrutenko, Anton; Goecks, Jeremy.

Nucleic Acids Res ; 44(W1): W3-W10, 2016 07 08.

Article En | MEDLINE | ID: mdl-27137889

High-throughput data production technologies, particularly 'next-generation' DNA sequencing, have ushered in widespread and disruptive changes to biomedical research. Making sense of the large datasets produced by these technologies requires sophisticated statistical and computational methods, as well as substantial computational power. This has led to an acute crisis in life sciences, as researchers without informatics training attempt to perform computation-dependent analyses. Since 2005, the Galaxy project has worked to address this problem by providing a framework that makes advanced computational tools usable by non-experts. Galaxy seeks to make data-intensive research more accessible, transparent and reproducible by providing a Web-based environment in which users can perform computational analyses and have all of the details automatically tracked for later inspection, publication, or reuse. In this report we highlight recently added features enabling biomedical analyses on a large scale.

Computational Biology/statistics & numerical data , Datasets as Topic/statistics & numerical data , User-Computer Interface , Biomedical Research , Computational Biology/methods , Databases, Genetic , Humans , Internet , Reproducibility of Results

10.

NGS analyses by visualization with Trackster.

Goecks, Jeremy; Coraor, Nate; Nekrutenko, Anton; Taylor, James.

Nat Biotechnol ; 30(11): 1036-9, 2012 Nov.

Article En | MEDLINE | ID: mdl-23138293

Chromosome Mapping/methods , Computer Graphics , Data Mining/methods , Databases, Genetic , Sequence Analysis, DNA/methods , Software , User-Computer Interface

11.

Harnessing cloud computing with Galaxy Cloud.

Afgan, Enis; Baker, Dannon; Coraor, Nate; Goto, Hiroki; Paul, Ian M; Makova, Kateryna D; Nekrutenko, Anton; Taylor, James.

Nat Biotechnol ; 29(11): 972-4, 2011 Nov 08.

Article En | MEDLINE | ID: mdl-22068528

Computer Storage Devices/trends , Internet/standards , Sequence Analysis, DNA/methods , Software/trends , Computer Storage Devices/economics , DNA, Mitochondrial/genetics , Humans

12.

Galaxy CloudMan: delivering cloud compute clusters.

Afgan, Enis; Baker, Dannon; Coraor, Nate; Chapman, Brad; Nekrutenko, Anton; Taylor, James.

BMC Bioinformatics ; 11 Suppl 12: S4, 2010 Dec 21.

Article En | MEDLINE | ID: mdl-21210983

BACKGROUND: Widespread adoption of high-throughput sequencing has greatly increased the scale and sophistication of computational infrastructure needed to perform genomic research. An alternative to building and maintaining local infrastructure is "cloud computing", which, in principle, offers on-demand access to flexible computational infrastructure. However, cloud computing resources are not yet suitable for immediate "as is" use by experimental biologists. RESULTS: We present a cloud resource management system that makes it possible for individual researchers to compose and control an arbitrarily sized compute cluster on Amazon's EC2 cloud infrastructure without any informatics requirements. Within this system, an entire suite of biological tools packaged by the NERC Bio-Linux team (http://nebc.nerc.ac.uk/tools/bio-linux) is available for immediate consumption. The provided solution makes it possible, using only a web browser, to create a completely configured compute cluster ready to perform analysis in less than five minutes. Moreover, we provide an automated method for building custom deployments of cloud resources. This approach promotes reproducibility of results and, if desired, allows individuals and labs to add or customize an otherwise available cloud system to better meet their needs. CONCLUSIONS: The knowledge and effort required to deploy a compute cluster in the Amazon EC2 cloud are not trivial. The solution presented in this paper eliminates these barriers, making it possible for researchers to deploy exactly the amount of computing power they need, combined with a wealth of existing analysis software, to handle the ongoing data deluge.

Computational Biology/methods , Software , Cluster Analysis , Internet