ABSTRACT
The techniques used in protein production and structural biology have been developing rapidly, but techniques for recording the laboratory information produced have not kept pace. One approach is the development of laboratory information-management systems (LIMS), which typically use a relational database schema to model and store results from a laboratory workflow. The underlying philosophy and implementation of the Protein Information Management System (PiMS), a LIMS development specifically targeted at the flexible and unpredictable workflows of protein-production research laboratories of all scales, are described. PiMS is a web-based Java application that uses either Postgres or Oracle as the underlying relational database-management system. PiMS is available under a free licence to all academic laboratories either for local installation or for use as a managed service.
Subject(s)
Management Information Systems, Proteins/isolation & purification, Protein Databases, Proteins/genetics
ABSTRACT
Data management has been identified as a crucial issue in all large-scale experimental projects. In this type of project, many different persons manipulate multiple objects in different locations; thus, unless complete and accurate records are maintained, it is extremely difficult to understand exactly what has been done, when it was done, who did it, and what exact protocol was used. All of this information is essential for use in publications, reusing successful protocols, determining why a target has failed, and validating and optimizing protocols. Although data management solutions have been in place for certain focused activities (e.g., genome sequencing and microarray experiments), they are just emerging for more widespread projects, such as structural genomics, metabolomics, and systems biology as a whole. The complexity of experimental procedures, and the diversity and high rate of development of protocols used in a single center, or across various centers, have important consequences for the design of information management systems. Because procedures are carried out by both machines and hand, the system must be capable of handling data entry both from robotic systems and by means of a user-friendly interface. The information management system needs to be flexible so it can handle changes in existing protocols or newly added protocols. Because no commercial information management systems have had the needed features, most structural genomics groups have developed their own solutions. This chapter discusses the advantages of using a LIMS (laboratory information management system), for day-to-day management of structural genomics projects, and also for data mining. 
This chapter reviews different solutions currently in place or under development with emphasis on three systems developed by the authors: Xtrack, Sesame (developed at the Center for Eukaryotic Structural Genomics under the US Protein Structural Genomics Initiative), and HalX (developed at the Yeast Structural Genomics Laboratory, in collaboration with the European SPINE project).
Subject(s)
Computational Biology/methods, Database Management Systems, Genomics/methods, Proteins/chemistry, Software, Factual Databases
ABSTRACT
Structural genomics aims at the establishment of a universal protein-fold dictionary through systematic structure determination either by NMR or X-ray crystallography. In order to catch up with the explosive amount of protein sequence data, the structural biology laboratories are spurred to increase the speed of the structure-determination process. To achieve this goal, high-throughput robotic approaches are increasingly used in all the steps leading from cloning to data collection and even structure interpretation is becoming more and more automatic. The progress made in these areas has begun to have a significant impact on the more 'classical' structural biology laboratories, dramatically increasing the number of individual experiments. This automation creates the need for efficient data management. Here, a new piece of software, HalX, designed as an 'electronic lab book' that aims at (i) storage and (ii) easy access and use of all experimental data is presented. This should lead to much improved management and tracking of structural genomics experimental data.
Subject(s)
Protein Databases, Software, X-Ray Crystallography/methods
ABSTRACT
To address data management and data exchange problems in the nuclear magnetic resonance (NMR) community, the Collaborative Computing Project for the NMR community (CCPN) created a "Data Model" that describes all the different types of information needed in an NMR structural study, from molecular structure and NMR parameters to coordinates. This paper describes the development of a set of software applications that use the Data Model and its associated libraries, thus validating the approach. These applications are freely available and provide a pipeline for high-throughput analysis of NMR data. Three programs work directly with the Data Model: CcpNmr Analysis, an entirely new analysis and interactive display program, the CcpNmr FormatConverter, which allows transfer of data from programs commonly used in NMR to and from the Data Model, and the CLOUDS software for automated structure calculation and assignment (Carnegie Mellon University), which was rewritten to interact directly with the Data Model. The ARIA 2.0 software for structure calculation (Institut Pasteur) and the QUEEN program for validation of restraints (University of Nijmegen) were extended to provide conversion of their data to the Data Model. During these developments the Data Model has been thoroughly tested and used, demonstrating that applications can successfully exchange data via the Data Model. The software architecture developed by CCPN is now ready for new developments, such as integration with additional software applications and extensions of the Data Model into other areas of research.
Subject(s)
Protein Databases, Magnetic Resonance Spectroscopy/methods, Software, Computer Graphics, Magnetic Resonance Spectroscopy/instrumentation, Theoretical Models
ABSTRACT
Data management has emerged as one of the central issues in the high-throughput processes of taking a protein target sequence through to a protein sample. To simplify this task, and following extensive consultation with the international structural genomics community, we describe here a model of the data related to protein production. The model is suitable for both large and small facilities for use in tracking samples, experiments, and results through the many procedures involved. The model is described in Unified Modeling Language (UML). In addition, we present relational database schemas derived from the UML. These relational schemas are already in use in a number of data management projects.
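As an illustration only, the idea of tracking samples, experiments, and results in a relational schema can be sketched with an in-memory SQLite database. The table and column names below (target, sample, experiment) are hypothetical simplifications chosen for this sketch; they are not the published model or the schemas derived from its UML.

```python
import sqlite3

# Hypothetical, heavily simplified schema; table and column names are
# illustrative assumptions, not the published data model.
SCHEMA = """
CREATE TABLE target (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE sample (
    id INTEGER PRIMARY KEY,
    target_id INTEGER NOT NULL REFERENCES target(id),
    label TEXT NOT NULL
);
CREATE TABLE experiment (
    id INTEGER PRIMARY KEY,
    sample_id INTEGER NOT NULL REFERENCES sample(id),
    protocol TEXT NOT NULL,
    result TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up example rows: one target, one sample, two experiments.
conn.execute("INSERT INTO target (id, name) VALUES (1, 'target-A')")
conn.execute(
    "INSERT INTO sample (id, target_id, label) VALUES (1, 1, 'construct-1')"
)
conn.executemany(
    "INSERT INTO experiment (sample_id, protocol, result) VALUES (?, ?, ?)",
    [(1, "expression", "soluble"), (1, "purification", "failed")],
)

# Trace every experiment back to its target, in the order it was recorded.
history = conn.execute(
    """
    SELECT t.name, s.label, e.protocol, e.result
    FROM experiment AS e
    JOIN sample AS s ON s.id = e.sample_id
    JOIN target AS t ON t.id = s.target_id
    WHERE t.name = ?
    ORDER BY e.id
    """,
    ("target-A",),
).fetchall()
```

The join shows the kind of question such a schema is meant to answer: which procedures were applied to which sample of a given target, and with what outcome.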
Subject(s)
Genomics/methods, Protein Engineering/methods, Proteins/chemistry, Proteomics/methods, Algorithms, Amino Acid Sequence, Statistical Data Interpretation, Protein Databases, Internet, Biological Models, Programming Languages, Research, Software, Software Design, Systems Biology, Unified Medical Language System
ABSTRACT
MOTIVATION: The lack of standards for storage and exchange of data is a serious hindrance for the large-scale data deposition, data mining and program interoperability that is becoming increasingly important in bioinformatics. The problem lies not only in defining and maintaining the standards, but also in convincing scientists and application programmers with a wide variety of backgrounds and interests to adhere to them. RESULTS: We present a UML-based programming framework for the modeling of data and the automated production of software to manipulate that data. Our approach allows one to make an abstract description of the structure of the data used in a particular scientific field and then use it to generate fully functional computer code for data access and input/output routines for data storage, together with accompanying documentation. This code can be generated simultaneously for different programming languages from a single model, together with, for example, format descriptions and I/O libraries for XML and various relational databases. The framework is entirely general and could be applied in any subject area. We have used this approach to generate a data exchange standard for structural biology and analysis software for macromolecular NMR spectroscopy. AVAILABILITY: The framework is available under the GPL license, the data exchange standard with generated subroutine libraries under the LGPL license. Both may be found at http://www.ccpn.ac.uk; http://sourceforge.net/projects/ccpn CONTACT: ccpn@mole.bio.cam.ac.uk.
Subject(s)
Biopolymers/chemistry, Database Management Systems, Documentation/methods, Information Storage and Retrieval/methods, Biological Models, Chemical Models, Software, Unified Medical Language System, Biopolymers/analysis, Biopolymers/classification, Biopolymers/metabolism, Computer Simulation, Documentation/standards, Guidelines as Topic, Magnetic Resonance Spectroscopy/methods, Magnetic Resonance Spectroscopy/standards, Reference Standards, Science/methods
ABSTRACT
In recent years a large body of data has been obtained from Nuclear Magnetic Resonance and Circular Dichroism experiments on the influence of the amino acid sequence and various other parameters on the conformational state of peptides in solution. Interpreting the experimental data in terms of the conformational populations of the peptides remains a key problem, for which current solutions leave appreciable room for improvement. Considering that making this body of data available for surveys and analysis should be instrumental in tackling the problem, we undertook the development of Pescador: The 'PEptides in Solution ConformAtion Database: Online Resource'. Pescador contains data from NMR and CD spectroscopy on peptides in solution as well as information on the structural parameters derived from these data. It also features specialized Web-based tools for data deposition, and means for readily accessing the stored information for analysis purposes. To illustrate the use of the database in deriving information for the conformational analysis of peptides, we show how the alpha proton delta-values stored in Pescador and measured by NMR for different peptides in different laboratories can be used to derive a new set of 'random coil' chemical shift values. Firstly, we show these values to be very similar to those obtained experimentally for model peptides in water, and their variation with increasing Tri-Fluoro-Ethanol (TFE) concentration is similar to that reported for model peptides. We show, furthermore, that the chemical shift data in Pescador can be used to derive correction factors that take into account effects of neighboring residues. These correction factors compare favorably with those recently derived from a series of model GGXGG peptides (Schwarzinger et al., 2001). 
These encouraging results suggest that, as the quantity of NMR data on peptides deposited in Pescador increases, surveys of these data should be a valuable means of deriving key parameters for the analysis of peptide conformation.
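The abstract does not state how the 'random coil' values are computed. One simple possibility, assumed here purely for illustration, is to average the observed alpha proton chemical shifts per residue type across all deposited peptides. The records below are made-up numbers, not data from Pescador, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (residue type, observed H-alpha chemical shift in ppm).
# The values are invented for illustration and are not taken from Pescador.
observations = [
    ("ALA", 4.30), ("ALA", 4.34), ("ALA", 4.32),
    ("GLY", 3.95), ("GLY", 3.97),
]


def random_coil_estimates(records):
    """Return the mean H-alpha shift per residue type, rounded to 2 decimals.

    Plain averaging is an assumption of this sketch, not the published method.
    """
    by_residue = defaultdict(list)
    for residue, shift in records:
        by_residue[residue].append(shift)
    return {res: round(mean(vals), 2) for res, vals in by_residue.items()}


estimates = random_coil_estimates(observations)
```

A neighbor correction of the kind mentioned in the abstract could be sketched in the same way, by grouping the deviations from these estimates by the type of the adjacent residue; that step is omitted here because the abstract gives no detail on it.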