ABSTRACT
MOTIVATION: Visualization is a powerful tool to analyze, understand and present big data. Computational biology, bioinformatics and molecular modeling require dedicated tools, tailored to very complex, highly multidimensional data. Over recent years, numerous tools have been developed for online presentation, but new challenges like the COVID-19 pandemic require new libraries that enable fast development of online tools for a better understanding of biomedical data and results. RESULTS: VisuaLife is a Python library that provides a new approach to visualization in a web browser. It offers 2D and 3D plotting capabilities as well as widgets designed to display the most common biological data types: nucleotide or protein sequences, 3D biomolecular structures and multiple sequence alignments. Components provided by the VisuaLife library can be assembled into a web application to create an analysis tool tailored to multidimensional analysis of a specific research problem. VisuaLife, to the best of our knowledge, is the most modern solution that allows one to implement such client-side interactivity in Python. AVAILABILITY AND IMPLEMENTATION: The git repository of the library is hosted at BitBucket: https://bitbucket.org/dgront/visualife/. A PyPI distribution is also provided for macOS and Linux. While basic examples are provided in the supporting materials, the full documentation is available at the ReadTheDocs website: https://visualife.readthedocs.io/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
ABSTRACT
Many scientific disciplines rely on computational methods for data analysis, model generation, and prediction. Implementing these methods is often accomplished by researchers with domain expertise but without formal training in software engineering or computer science. This arrangement has led to underappreciation of sustainability and maintainability of scientific software tools developed in academic environments. Some software tools have avoided this fate, including the scientific library Rosetta. We use this software and its community as a case study to show how modern software development can be accomplished successfully, irrespective of subject area. Rosetta is one of the largest software suites for macromolecular modeling, with 3.1 million lines of code and many state-of-the-art applications. Since the mid 1990s, the software has been developed collaboratively by the RosettaCommons, a community of academics from over 60 institutions worldwide with diverse backgrounds including chemistry, biology, physiology, physics, engineering, mathematics, and computer science. Developing this software suite has provided us with more than two decades of experience in how to effectively develop advanced scientific software in a global community with hundreds of contributors. Here we illustrate the functioning of this development community by addressing technical aspects (like version control, testing, and maintenance), community-building strategies, diversity efforts, software dissemination, and user support. We demonstrate how modern computational research can thrive in a distributed collaborative community. The practices described here are independent of subject area and can be readily adopted by other software development communities.
Subjects
Computational Biology/methods; Research/trends; Software/trends; Cooperative Behavior; Data Analysis; Engineering; Gene Library; Humans; Models, Molecular; Research Personnel; Social Behavior; User-Computer Interface
ABSTRACT
Cytochrome P450 monooxygenase CYP51 (sterol 14α-demethylase) is a well-known target of the azole drug fluconazole for treating cryptococcosis, a life-threatening fungal infection in immune-compromised patients in poor countries. Studies indicate that mutations in CYP51 confer fluconazole resistance on cryptococcal species. Despite the importance of CYP51 in these species, few studies on the structural analysis of CYP51 and its interactions with different azole drugs have been reported. We therefore performed in silico structural analysis of 11 CYP51s from cryptococcal species and other Tremellomycetes. Interactions of the 11 CYP51s with nine ligands (three substrates and six azoles) were modeled by Rosetta docking, using 10,000 combinations for each CYP51-ligand complex (11 CYP51s × 9 ligands = 99 complexes), and hierarchical agglomerative clustering was used to select the complexes. A web application for visualization of CYP51s' interactions with ligands was developed (http://bioshell.pl/azoledocking/). The study results indicated that Tremellomycetes CYP51s have a high preference for itraconazole, corroborating the in vitro effectiveness of itraconazole compared to fluconazole. Amino acids interacting with different ligands were found to be conserved across CYP51s, indicating that the procedure employed in this study is accurate and can be automated for studying P450-ligand interactions to cater for the growing number of P450s.
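The abstract names hierarchical agglomerative clustering as the step that selects complexes from the docking results, but gives no details. The following is a minimal sketch of that general technique, not the authors' protocol. The distance measure (ligand RMSD without superposition), the complete-linkage rule, the cutoff value and all function names are assumptions made for illustration.

```python
import math
from itertools import combinations


def rmsd(a, b):
    """RMSD between two equally sized lists of (x, y, z) atoms, no superposition."""
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))


def agglomerative_clusters(poses, cutoff):
    """Complete-linkage agglomerative clustering (assumed linkage rule).

    Starts with one cluster per pose and repeatedly merges the two closest
    clusters until the closest pair is farther apart than `cutoff`.
    Returns clusters as lists of pose indices.
    """
    dist = {(i, j): rmsd(poses[i], poses[j])
            for i, j in combinations(range(len(poses)), 2)}

    def d(i, j):
        return dist[(i, j) if i < j else (j, i)]

    clusters = [[i] for i in range(len(poses))]
    while len(clusters) > 1:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            # Complete linkage: cluster distance is the largest member distance.
            link = max(d(i, j) for i in clusters[x] for j in clusters[y])
            if best is None or link < best[0]:
                best = (link, x, y)
        if best[0] > cutoff:
            break
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy data: five two-atom "ligand poses" forming two obvious groups.
poses = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)],
    [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)],
    [(5.0, 5.2, 5.0), (6.0, 5.2, 5.0)],
    [(0.0, 0.1, 0.0), (1.0, 0.1, 0.0)],
]
clusters = agglomerative_clusters(poses, cutoff=1.0)
largest = max(clusters, key=len)
```

In a docking setting, the largest cluster (here `largest`) would be a natural candidate for the selected binding mode, although the selection criterion actually used in the study is not stated in the abstract.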
Subjects
Amino Acids/metabolism; Azoles/metabolism; Basidiomycota/enzymology; Cytochrome P-450 Enzyme System/metabolism; Fluconazole/metabolism; Fungal Proteins/metabolism; Itraconazole/metabolism; Amino Acids/chemistry; Antifungal Agents/chemistry; Antifungal Agents/metabolism; Azoles/chemistry; Computer Simulation; Cytochrome P-450 Enzyme System/chemistry; Fluconazole/chemistry; Fungal Proteins/chemistry; Itraconazole/chemistry; Ligands; Models, Molecular; Phylogeny; Protein Binding; Protein Conformation; Substrate Specificity
ABSTRACT
Cytochrome P450 monooxygenases (CYPs/P450s), heme-thiolate proteins, are well-known players in the generation of chemicals valuable to humans and as a drug target against pathogens. Understanding the evolution of P450s in a bacterial population is gaining momentum. In this study, we report a comprehensive analysis of P450s in the ancient group of the bacterial class Alphaproteobacteria. Genome data mining and annotation of P450s in 599 alphaproteobacterial species belonging to 164 genera revealed the presence of P450s in only 241 species belonging to 82 genera that are grouped into 143 P450 families and 214 P450 subfamilies, including 77 new P450 families. Alphaproteobacterial species have the highest average number of P450s compared to Firmicutes species and cyanobacterial species. The lowest percentage of alphaproteobacterial species P450s (2.4%) was found to be part of secondary metabolite biosynthetic gene clusters (BGCs), compared to other bacterial species, indicating that during evolution large numbers of P450s became part of BGCs in other bacterial species. Our study identified that some of the P450 families found in alphaproteobacterial species were passed to other bacterial species. This is the first study to report on the identification of CYP125 P450, cholesterol and cholest-4-en-3-one hydroxylase in alphaproteobacterial species (Phenylobacterium zucineum) and to predict cholesterol side-chain oxidation capability (based on homolog proteins) by P. zucineum.
Subjects
Alphaproteobacteria/genetics; Biosynthetic Pathways/genetics; Cytochrome P-450 Enzyme System/genetics; Multigene Family; Secondary Metabolism/genetics; Cholesterol/metabolism; Cyanobacteria/genetics; Cytochrome P-450 Enzyme System/metabolism; Data Mining; Evolution, Molecular; Firmicutes/genetics; Genome, Bacterial; Mycobacterium tuberculosis/genetics; Phylogeny; Streptomyces/genetics
ABSTRACT
The impact of lifestyle on shaping the genome content of an organism is a well-known phenomenon and cytochrome P450 enzymes (CYPs/P450s), heme-thiolate proteins that are ubiquitously present in organisms, are no exception. Recent studies focusing on a few bacterial species such as Streptomyces, Mycobacterium, Cyanobacteria and Firmicutes revealed that the impact of lifestyle affected the P450 repertoire in these species. However, this phenomenon needs to be understood in other bacterial species. We therefore performed genome data mining, annotation, phylogenetic analysis of P450s and their role in secondary metabolism in the bacterial class Gammaproteobacteria. Genome-wide data mining for P450s in 1261 Gammaproteobacterial species belonging to 161 genera revealed that only 169 species belonging to 41 genera have P450s. A total of 277 P450s found in 169 species grouped into 84 P450 families and 105 P450 subfamilies, where 38 new P450 families were found. Only 18% of P450s were found to be involved in secondary metabolism in Gammaproteobacterial species, as observed in Firmicutes as well. The pathogenic or commensal lifestyle of Gammaproteobacterial species influences them to such an extent that they have the lowest number of P450s compared to other bacterial species, indicating the impact of lifestyle on shaping the P450 repertoire. This study is the first report on comprehensive analysis of P450s in Gammaproteobacteria.
Subjects
Cytochrome P-450 Enzyme System/metabolism; Gammaproteobacteria/genetics; Gammaproteobacteria/metabolism; Computer Simulation; Cyanobacteria; Cytochrome P-450 Enzyme System/genetics; Cytochrome P-450 Enzyme System/physiology; Evolution, Molecular; Firmicutes; Genomics/methods; Multigene Family; Mycobacterium; Phylogeny; Secondary Metabolism/physiology; Streptomyces
ABSTRACT
Understanding protein structure and dynamics is crucial for investigating numerous biological processes. This, however, requires a proper description of molecular interactions, most notably hydrogen bonds, which are the driving force behind the folding of protein sequences into working molecules. Due to the multi-body character of this interaction, its proper mathematical formulation has been a matter of long debate in the literature. This description becomes even more complex in reduced protein models. In this contribution, we propose a novel hydrogen bond energy function definition that is based only on Cα positions and used for coarse-grained simulations. We show that this new method has the capability to recognize hydrogen bonds with over 80% accuracy and can successfully identify β-sheets in β-amyloid peptide simulations.
Subjects
Amyloid beta-Peptides; Molecular Dynamics Simulation; Hydrogen Bonding; Amyloid beta-Peptides/chemistry
ABSTRACT
The assignment of secondary structure elements in protein conformations is necessary to interpret a protein model that has been established by computational methods. The process essentially involves labeling the amino acid residues with H (Helix), E (Strand), or C (Coil, also known as Loop). When particular atoms are absent from an input protein structure, the procedure becomes more complicated, especially when only the alpha carbon locations are known. Various techniques have been tested and applied to this problem during the last forty years. The application of machine learning techniques is the most recent trend. This contribution presents the HECA classifier, which uses neural networks to assign protein secondary structure types. The technique exclusively employs Cα coordinates. The Keras (TensorFlow) library was used to implement and train the neural network model. The BioShell toolkit was used to calculate the neural network input features from raw coordinates. The study's findings show that neural network-based methods may be successfully used to take on structure assignment challenges when only Cα trace is available. Thanks to the careful selection of input features, our approach's accuracy (above 97%) exceeded that of the existing methods.
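The abstract states that input features were computed from raw Cα coordinates but does not list them. As an illustration only, the sketch below computes one plausible family of such features: distances between Cα atoms a few residues apart, which are shorter in compact helices than in extended strands. The choice of offsets, the function names and the idealized helix geometry (radius 2.3 Å, rise 1.5 Å, 100° per residue) are assumptions and are not taken from the HECA or BioShell code.

```python
import math


def ca_distance_features(ca, offsets=(2, 3, 4)):
    """Per-residue local features from a C-alpha trace (hypothetical feature set).

    For residue i, returns the distances from C-alpha i to C-alpha i+k for each
    k in `offsets`; entries that would run past the chain end are None.
    """
    feats = []
    for i in range(len(ca)):
        row = []
        for k in offsets:
            j = i + k
            row.append(math.dist(ca[i], ca[j]) if j < len(ca) else None)
        feats.append(row)
    return feats


def ideal_helix(n, radius=2.3, rise=1.5, turn_deg=100.0):
    """C-alpha trace of an idealized alpha-helix (assumed textbook geometry)."""
    step = math.radians(turn_deg)
    return [(radius * math.cos(i * step), radius * math.sin(i * step), i * rise)
            for i in range(n)]


helix = ideal_helix(8)
# A fully stretched chain, used only as an extreme "extended" reference.
extended = [(3.8 * i, 0.0, 0.0) for i in range(8)]

helix_feats = ca_distance_features(helix)
extended_feats = ca_distance_features(extended)
```

Rows such as `helix_feats[0]` could then be stacked into the input matrix of a classifier. A real pipeline would also need to handle chain breaks and the undefined entries at the chain termini.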
Subjects
Neural Networks, Computer; Proteins; Machine Learning; Protein Conformation; Protein Structure, Secondary; Proteins/chemistry
ABSTRACT
Each year vast international resources are wasted on irreproducible research. The scientific community has been slow to adopt standard software engineering practices, despite the increases in high-dimensional data, complexities of workflows, and computational environments. Here we show how scientific software applications can be created in a reproducible manner when simple design goals for reproducibility are met. We describe the implementation of a test server framework and 40 scientific benchmarks, covering numerous applications in Rosetta bio-macromolecular modeling. High performance computing cluster integration allows these benchmarks to run continuously and automatically. Detailed protocol captures are useful for developers and users of Rosetta and other macromolecular modeling tools. The framework and design concepts presented here are valuable for developers and users of any type of scientific software and for the scientific community to create reproducible methods. Specific examples highlight the utility of this framework, and the comprehensive documentation illustrates the ease of adding new tests in a matter of hours.
Subjects
Macromolecular Substances/chemistry; Molecular Docking Simulation; Proteins/chemistry; Software/standards; Benchmarking; Binding Sites; Humans; Ligands; Macromolecular Substances/metabolism; Protein Binding; Proteins/metabolism; Reproducibility of Results
ABSTRACT
BioShell is an open-source package for processing biological data, particularly focused on structural applications. The package provides parsers, data structures and algorithms for handling and analyzing macromolecular sequences, structures and sequence profiles. The most frequently used routines are accessible by a set of easy-to-use command line utilities for a Linux environment. The full functionality of the package assumes knowledge of C++ or Python to assemble an application using this software library. Since the last publication that announced the version 2.0, the package has been greatly expanded and rewritten in C++ standard 11 (C++11) to improve its modularity and efficiency. A new testing platform has been implemented to continuously test the correctness and integrity of the package. More than two hundred test programs have been published to provide simple examples that can be used as templates. This makes BioShell an easy to use library that greatly speeds up development of bioinformatics applications and web services without compromising computational efficiency.