ABSTRACT
MOTIVATION: Research in systems biology is carried out through a combination of experiments and models. Several data standards have been adopted for representing models (Systems Biology Markup Language) and various types of relevant experimental data (such as FuGE and those of the Proteomics Standards Initiative). However, until now, there has been no standard way to associate a model and its entities with the corresponding datasets, or vice versa. Such a standard would provide a means to represent computational simulation results as well as to frame experimental data in the context of a particular model. Target applications include model-driven data analysis, parameter estimation, and sharing and archiving model simulations. RESULTS: We propose the Systems Biology Results Markup Language (SBRML), an XML-based language that associates a model with several datasets. Each dataset is represented as a series of values associated with model variables, and their corresponding parameter values. SBRML provides a flexible way of indexing the results to model parameter values, which supports both spreadsheet-like data and multidimensional data cubes. We present and discuss several examples of SBRML usage in applications such as enzyme kinetics, microarray gene expression and various types of simulation results. AVAILABILITY AND IMPLEMENTATION: The XML Schema file for SBRML is available at http://www.comp-sys-bio.org/SBRML under the Academic Free License (AFL) v3.0.
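To make the indexing idea concrete, here is a minimal sketch that serialises a small result set, indexed by a model parameter value, to an SBRML-like XML layout using Python's standard library. The element and attribute names, the model file name and the numbers are illustrative placeholders, not the official SBRML schema; consult the schema at http://www.comp-sys-bio.org/SBRML for the real structure.

```python
import xml.etree.ElementTree as ET

# Simulated steady-state concentrations of two model species at three
# values of a kinetic parameter (illustrative numbers only).
results = {
    0.1: {"glucose": 4.8, "pyruvate": 1.2},
    0.5: {"glucose": 3.1, "pyruvate": 2.6},
    1.0: {"glucose": 1.9, "pyruvate": 3.4},
}

# NOTE: the tag names below are hypothetical, chosen only to show how
# result values can be indexed to model parameter values.
root = ET.Element("resultSet", model="glycolysis_model.xml")
for vmax, values in results.items():
    point = ET.SubElement(root, "resultPoint")
    ET.SubElement(point, "parameterValue", parameter="Vmax", value=str(vmax))
    for species, conc in values.items():
        ET.SubElement(point, "variableValue", variable=species, value=str(conc))

print(ET.tostring(root, encoding="unicode"))
```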
Subjects
Software; Systems Biology/methods; Computational Biology/methods; Databases, Factual; Oligonucleotide Array Sequence Analysis

ABSTRACT
Sensing devices are increasingly being deployed to monitor the physical world around us. One class of application for which sensor data is pertinent is environmental decision support systems, e.g., flood emergency response. For these applications, the sensor readings need to be put in context by integrating them with other sources of data about the surrounding environment. Traditional systems for predicting and detecting floods rely on methods that need significant human resources. In this paper we describe a semantic sensor web architecture for integrating multiple heterogeneous datasets, including live and historic sensor data, databases, and map layers. The architecture provides mechanisms for discovering datasets, defining integrated views over them, continuously receiving data in real-time, and visualising the data on screen and interacting with it. Our approach makes extensive use of web service standards for querying and accessing data, and semantic technologies to discover and integrate datasets. We demonstrate the use of our semantic sensor web architecture in the context of a flood response planning web application that uses data from sensor networks monitoring the sea-state around the coast of England.
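As a rough illustration of the contextualisation step, the sketch below joins a batch of live sensor readings with a static metadata table describing each sensing station. In the architecture described above this integration happens through web services and semantic mappings over registered datasets; the station identifiers, readings, coordinates and alert threshold here are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    station_id: str
    wave_height_m: float
    timestamp: str

# Static metadata about the sensing stations (invented values; in the real
# architecture this would come from a discovered dataset, not a literal).
station_metadata = {
    "st-001": {"name": "Liverpool Bay", "lat": 53.5, "lon": -3.2},
    "st-002": {"name": "Dover Strait", "lat": 51.1, "lon": 1.3},
}

# A batch of live readings, e.g. as delivered by a sensor observation service.
observations = [
    Observation("st-001", 2.4, "2010-03-01T10:00:00Z"),
    Observation("st-002", 0.9, "2010-03-01T10:00:00Z"),
]

# Integrated view: readings placed in their geographic context, flagged
# when the wave height exceeds an (arbitrary) alert threshold.
for obs in observations:
    meta = station_metadata.get(obs.station_id, {})
    alert = obs.wave_height_m > 2.0
    print(f"{meta.get('name', obs.station_id)} ({meta.get('lat')}, {meta.get('lon')}): "
          f"{obs.wave_height_m} m at {obs.timestamp}{'  <-- ALERT' if alert else ''}")
```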
Subjects
Decision Support Techniques; Environmental Monitoring

ABSTRACT
The Human Proteome Organisation's Proteomics Standards Initiative has developed the GelML (gel electrophoresis markup language) data exchange format for representing gel electrophoresis experiments performed in proteomics investigations. The format closely follows the reporting guidelines for gel electrophoresis, which are part of the Minimum Information About a Proteomics Experiment (MIAPE) set of modules. GelML supports the capture of metadata (such as experimental protocols) and data (such as gel images) resulting from gel electrophoresis so that laboratories can be compliant with the MIAPE Gel Electrophoresis guidelines, while allowing such data sets to be exchanged or downloaded from public repositories. The format is sufficiently flexible to capture data from a broad range of experimental processes, and complements other PSI formats for MS data and the results of protein and peptide identifications to capture entire gel-based proteome workflows. GelML has resulted from the open standardisation process of PSI consisting of both public consultation and anonymous review of the specifications.
Subjects
Databases, Protein; Electrophoresis, Polyacrylamide Gel; Proteomics/methods; Software; Humans; Internet; Mass Spectrometry; Models, Chemical; Proteomics/standards; Reference Standards; User-Computer Interface

ABSTRACT
BACKGROUND: The behaviour of biological systems can be deduced from their mathematical models. However, multiple sources of data in diverse forms are required in the construction of a model in order to define its components and their biochemical reactions, and corresponding parameters. Automating the assembly and use of systems biology models is dependent upon data integration processes involving the interoperation of data and analytical resources. RESULTS: Taverna workflows have been developed for the automated assembly of quantitative parameterised metabolic networks in the Systems Biology Markup Language (SBML). An SBML model is built in a systematic fashion by the workflows, which start with the construction of a qualitative network using data from a MIRIAM-compliant genome-scale model of yeast metabolism. This is followed by parameterisation of the SBML model with experimental data from two repositories, the SABIO-RK enzyme kinetics database and a database of quantitative experimental results. The models are then calibrated and simulated in workflows that call out to COPASIWS, the web service interface to the COPASI software application for analysing biochemical networks. These systems biology workflows were evaluated for their ability to construct a parameterised model of yeast glycolysis. CONCLUSIONS: Distributed information about metabolic reactions that have been described to MIRIAM standards enables the automated assembly of quantitative systems biology models of metabolic networks based on user-defined criteria. Such data integration processes can be implemented as Taverna workflows to provide a rapid overview of the components and their relationships within a biochemical system.
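The sketch below shows the kind of SBML document such workflows assemble, using the python-libsbml bindings (assumed to be installed). The species, reaction and kinetic parameter values are placeholders; in the workflows they come from the yeast genome-scale model and from SABIO-RK rather than from literals.

```python
import libsbml

doc = libsbml.SBMLDocument(3, 1)
model = doc.createModel()
model.setId("yeast_glycolysis_fragment")

comp = model.createCompartment()
comp.setId("cytosol"); comp.setSize(1.0); comp.setConstant(True)

for sid, conc in [("glucose", 5.0), ("g6p", 0.1)]:
    sp = model.createSpecies()
    sp.setId(sid); sp.setCompartment("cytosol")
    sp.setInitialConcentration(conc)
    sp.setHasOnlySubstanceUnits(False)
    sp.setBoundaryCondition(False); sp.setConstant(False)

rxn = model.createReaction()
rxn.setId("hexokinase"); rxn.setReversible(False); rxn.setFast(False)
reac = rxn.createReactant(); reac.setSpecies("glucose"); reac.setStoichiometry(1.0); reac.setConstant(True)
prod = rxn.createProduct(); prod.setSpecies("g6p"); prod.setStoichiometry(1.0); prod.setConstant(True)

# Michaelis-Menten rate law with placeholder kinetic parameters; a real
# workflow would insert values retrieved from an enzyme kinetics database.
kl = rxn.createKineticLaw()
for pid, value in [("Vmax", 1.0), ("Km", 0.3)]:
    p = kl.createLocalParameter(); p.setId(pid); p.setValue(value)
kl.setMath(libsbml.parseL3Formula("Vmax * glucose / (Km + glucose)"))

print(libsbml.writeSBMLToString(doc))
```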
Subjects
Metabolic Networks and Pathways; Systems Biology/methods; Databases, Factual; Models, Biological

ABSTRACT
Proteomics, the study of the protein complement of a biological system, is generating increasing quantities of data from rapidly developing technologies employed in a variety of experimental workflows. Experimental processes, e.g. for comparative 2D gel studies or LC-MS/MS analyses of complex protein mixtures, involve a number of steps: from experimental design, through wet and dry lab operations, to publication of data in repositories and finally to data annotation and maintenance. The presence of inaccuracies throughout the processing pipeline, however, results in data that can be untrustworthy, thus offsetting the benefits of high-throughput technology. While researchers and practitioners are generally aware of some of the information quality issues associated with public proteomics data, there are few accepted criteria and guidelines for dealing with them. In this article, we highlight factors that impact on the quality of experimental data and review current approaches to information quality management in proteomics. Data quality issues are considered throughout the lifecycle of a proteomics experiment, from experiment design and technique selection, through data analysis, to archiving and sharing.
Subjects
Information Storage and Retrieval; Proteomics; Quality Control; Database Management Systems; Electrophoresis, Gel, Two-Dimensional; Information Storage and Retrieval/methods; Information Storage and Retrieval/standards; Mass Spectrometry; Proteins/analysis; Proteomics/instrumentation; Proteomics/methods; Proteomics/standards; Software

ABSTRACT
MOTIVATION: Most experimental evidence on kinetic parameters is buried in the literature, whose manual searching is complex, time consuming and partial. These shortcomings become particularly acute in systems biology, where these parameters need to be integrated into detailed, genome-scale, metabolic models. These problems are addressed by KiPar, a dedicated information retrieval system designed to facilitate access to the literature relevant for kinetic modelling of a given metabolic pathway in yeast. Searching for kinetic data in the context of an individual pathway offers modularity as a way of tackling the complexity of developing a full metabolic model. It is also suitable for large-scale mining, since multiple reactions and their kinetic parameters can be specified in a single search request, rather than one reaction at a time, which is unsuitable given the size of genome-scale models. RESULTS: We developed an integrative approach, combining public data and software resources for the rapid development of large-scale text mining tools targeting complex biological information. The user supplies input in the form of identifiers used in relevant data resources to refer to the concepts of interest, e.g. EC numbers, GO and SBO identifiers. By doing so, the user is freed from providing any other knowledge or terminology concerned with these concepts and their relations, since they are retrieved from these and cross-referenced resources automatically. The terminology acquired is used to index the literature by mapping concepts to their synonyms, and then to textual documents mentioning them. The indexing results and the previously acquired knowledge about relations between concepts are used to formulate complex search queries aiming at documents relevant to the user's information needs. The conceptual approach is demonstrated in the implementation of KiPar. Evaluation reveals that KiPar performs better than a Boolean search. The precision achieved for abstracts (60%) and full-text articles (48%) is considerably better than the baseline precision (44% and 24%, respectively). The baseline recall is improved by 36% for abstracts and by 100% for full text. It appears that full-text articles are a much richer source of information on kinetic data than are their abstracts. Finally, the combined results for abstracts and full text compared with the curated literature provide high values for relative recall (88%) and novelty ratio (92%), suggesting that the system is able to retrieve a high proportion of new documents. AVAILABILITY: Source code and documentation are available at: (http://www.mcisb.org/resources/kipar/).
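A much simplified sketch of the indexing idea follows: concepts supplied as identifiers are expanded to their synonyms, documents are indexed by the concepts they mention, and a query asks for documents covering both a reaction and a kinetic-parameter concept. The identifiers, synonym lists and documents are invented for the example; KiPar itself resolves terminology from the cross-referenced data resources automatically.

```python
from collections import defaultdict

# Synonyms per concept identifier (invented; KiPar retrieves these from
# resources such as enzyme and ontology databases automatically).
synonyms = {
    "EC:2.7.1.1": {"hexokinase", "atp:d-hexose 6-phosphotransferase"},
    "SBO:0000027": {"michaelis constant", "km"},
}

documents = {
    "doc1": "The Km of hexokinase for glucose was determined in yeast extracts.",
    "doc2": "Transcriptional regulation of HXK2 expression under stress.",
}

# Index documents by the concepts whose synonyms they mention.
index = defaultdict(set)
for doc_id, text in documents.items():
    lowered = text.lower()
    for concept, terms in synonyms.items():
        if any(term in lowered for term in terms):
            index[concept].add(doc_id)

# Query: documents mentioning both the enzyme and a kinetic-parameter concept.
hits = index["EC:2.7.1.1"] & index["SBO:0000027"]
print(sorted(hits))   # -> ['doc1']
```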
Subjects
Computational Biology/methods; Information Systems; Saccharomyces cerevisiae/metabolism; Software; Information Systems/standards; Metabolic Networks and Pathways; Systems Biology

ABSTRACT
The Functional Genomics Experiment data model (FuGE) has been developed to facilitate convergence of data standards for high-throughput, comprehensive analyses in biology. FuGE models the components of an experimental activity that are common across different technologies, including protocols, samples and data. FuGE provides a foundation for describing entire laboratory workflows and for the development of new data formats. The Microarray Gene Expression Data society and the Proteomics Standards Initiative have committed to using FuGE as the basis for defining their respective standards, and other standards groups, including the Metabolomics Standards Initiative, are evaluating FuGE in their development efforts. Adoption of FuGE by multiple standards bodies will enable uniform reporting of common parts of functional genomics workflows, simplify data-integration efforts and ease the burden on researchers seeking to fulfill multiple minimum reporting requirements. Such advances are important for transparent data management and mining in functional genomics and systems biology.
Subjects
Computational Biology; Computer Simulation/standards; Genomics/standards; Models, Biological; Oligonucleotide Array Sequence Analysis/standards; Proteomics/standards; Databases, Factual

ABSTRACT
Both the generation and the analysis of proteomics data are now widespread, and high-throughput approaches are commonplace. Protocols continue to increase in complexity as methods and technologies evolve and diversify. To encourage the standardized collection, integration, storage and dissemination of proteomics data, the Human Proteome Organization's Proteomics Standards Initiative develops guidance modules for reporting the use of techniques such as gel electrophoresis and mass spectrometry. This paper describes the processes and principles underpinning the development of these modules; discusses the ramifications for various interest groups such as experimentalists, funders, publishers and the private sector; addresses the issue of overlap with other reporting guidelines; and highlights the criticality of appropriate tools and resources in enabling 'MIAPE-compliant' reporting.
Subjects
Databases, Protein/standards; Gene Expression Profiling/standards; Genome, Human/genetics; Guidelines as Topic; Information Storage and Retrieval/standards; Proteomics/standards; Research/standards; Humans; Internationality

ABSTRACT
Despite the growing volumes of proteomic data, integration of the underlying results remains problematic owing to differences in formats, data captured, protein accessions and services available from the individual repositories. To address this, we present the ISPIDER Central Proteomic Database search (http://www.ispider.manchester.ac.uk/cgi-bin/ProteomicSearch.pl), an integration service offering novel search capabilities over leading, mature, proteomic repositories including PRoteomics IDEntifications database (PRIDE), PepSeeker, PeptideAtlas and the Global Proteome Machine. It enables users to search for proteins and peptides that have been characterised in mass spectrometry-based proteomics experiments from different groups, stored in different databases, and view the collated results with specialist viewers/clients. In order to overcome limitations imposed by the great variability in protein accessions used by individual laboratories, the European Bioinformatics Institute's Protein Identifier Cross-Reference (PICR) service is used to resolve accessions from different sequence repositories. Custom-built clients allow users to view peptide/protein identifications in different contexts from multiple experiments and repositories, as well as integration with the Dasty2 client supporting any annotations available from Distributed Annotation System servers. Further information on the protein hits may also be added via external web services able to take a protein as input. This web server offers the first truly integrated access to proteomics repositories and provides a unique service to biologists interested in mass spectrometry-based proteomics.
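The sketch below illustrates the accession-resolution step that makes cross-repository collation possible: identifications from two repositories, reported under different accession schemes, are grouped once each accession is mapped to a canonical identifier. The mapping table here is a local stand-in for the PICR service, and all accessions and peptide sequences are invented.

```python
from collections import defaultdict

# Peptide identifications as (accession, peptide) pairs per repository (invented).
repositories = {
    "PRIDE":        [("IPI00021440", "LVNELTEFAK"), ("IPI00021440", "YLYEIAR")],
    "PeptideAtlas": [("P02768",      "LVNELTEFAK"), ("P02768",      "QTALVELVK")],
}

# Stand-in for the PICR service: map repository-specific accessions to a
# canonical (e.g. UniProt) accession.  A real client would call PICR instead.
canonical = {"IPI00021440": "P02768", "P02768": "P02768"}

collated = defaultdict(lambda: defaultdict(set))
for repo, identifications in repositories.items():
    for accession, peptide in identifications:
        protein = canonical.get(accession, accession)
        collated[protein][repo].add(peptide)

for protein, by_repo in collated.items():
    print(protein)
    for repo, peptides in by_repo.items():
        print(f"  {repo}: {sorted(peptides)}")
```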
Subjects
Databases, Protein; Proteomics; Software; Computer Graphics; Internet; Mass Spectrometry; Systems Integration

ABSTRACT
LC-MS experiments can generate large quantities of data, for which a variety of database search engines are available to make peptide and protein identifications. Decoy databases are becoming widely used to place statistical confidence in result sets, allowing the false discovery rate (FDR) to be estimated. Different search engines produce different identification sets, so employing more than one search engine could result in an increased number of peptides (and proteins) being identified, if an appropriate mechanism for combining data can be defined. We have developed a search-engine-independent score based on FDR, called the FDR Score, which allows peptide identifications from different search engines to be combined. The results demonstrate that the observed FDR is significantly different when analysing the set of identifications made by all three search engines, by each pair of search engines or by a single search engine. Our algorithm assigns identifications to groups according to the set of search engines that have made the identification, and re-assigns the score (combined FDR Score). The combined FDR Score can differentiate between correct and incorrect peptide identifications with high accuracy, allowing on average 35% more peptide identifications to be made at a fixed FDR than using a single search engine.
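The sketch below shows the two underlying ideas in simplified form: estimating FDR from a target-decoy search by ranking identifications by score, and grouping identifications by the set of search engines that reported them so each group can be assessed separately. It is not the authors' exact combined FDR Score calculation; the scores and peptides are invented, and the engine names are arbitrary examples rather than necessarily those used in the study.

```python
from collections import defaultdict

def fdr_by_rank(psms):
    """Target-decoy FDR estimate: psms is a list of (score, is_decoy) with
    higher scores better; returns (score, estimated FDR) at each rank."""
    targets = decoys = 0
    curve = []
    for score, is_decoy in sorted(psms, key=lambda p: p[0], reverse=True):
        decoys += is_decoy
        targets += not is_decoy
        curve.append((score, decoys / max(targets, 1)))
    return curve

# Invented search scores; True marks a hit against the decoy database.
psms = [(62.0, False), (55.0, False), (41.0, True), (38.0, False), (20.0, True)]
print(fdr_by_rank(psms))

# Peptides grouped by the set of engines that identified them (invented data);
# the combined FDR Score approach re-assesses each such group separately.
peptide_engines = {
    "LVNELTEFAK": {"Mascot", "OMSSA", "X!Tandem"},
    "YLYEIAR": {"Mascot"},
    "QTALVELVK": {"OMSSA", "X!Tandem"},
}
groups = defaultdict(list)
for peptide, engines in peptide_engines.items():
    groups[frozenset(engines)].append(peptide)
for engines, peptides in groups.items():
    print(sorted(engines), "->", peptides)
```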
Subjects
Algorithms; Computational Biology/methods; Information Storage and Retrieval; Peptides/analysis; Proteomics/methods; Databases, Protein; Models, Statistical; Proteins/analysis; Reproducibility of Results; Software

ABSTRACT
BACKGROUND: High content live cell imaging experiments are able to track the cellular localisation of labelled proteins in multiple live cells over a time course. Experiments using high content live cell imaging will generate multiple large datasets that are often stored in an ad hoc manner. This hinders identification of previously gathered data that may be relevant to current analyses. Whilst solutions exist for managing image data, they are primarily concerned with storage and retrieval of the images themselves and not the data derived from the images. There is therefore a requirement for an information management solution that facilitates the indexing of experimental metadata and results of high content live cell imaging experiments. RESULTS: We have designed and implemented a data model and information management solution for the data gathered through high content live cell imaging experiments. Many of the experiments to be stored measure the translocation of fluorescently labelled proteins from cytoplasm to nucleus in individual cells. The functionality of this database has been enhanced by the addition of an algorithm that automatically annotates results of these experiments with the timings of translocations and periods of any oscillatory translocations as they are uploaded to the repository. Testing has shown the algorithm to perform well with a variety of previously unseen data. CONCLUSION: Our repository is a fully functional example of how high-throughput imaging data may be effectively indexed and managed to address the requirements of end users. By implementing the automated analysis of experimental results, we have provided a clear impetus for individuals to ensure that their data forms part of that which is stored in the repository. Although focused on imaging, the solution provided is sufficiently generic to be applied to other functional proteomics and genomics experiments. The software is available from: http://code.google.com/p/livecellim/
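A simplified sketch of the kind of annotation such an upload-time algorithm performs is shown below: given a per-cell time series of the nuclear-to-cytoplasmic intensity ratio, it reports the first translocation time (threshold crossing) and an approximate oscillation period from successive peaks. The threshold, sampling interval and trace values are invented, and this is a sketch of the idea rather than the repository's actual algorithm.

```python
def annotate_translocation(trace, times, threshold=1.5):
    """trace: nuclear/cytoplasmic intensity ratio per time point."""
    first = next((t for r, t in zip(trace, times) if r > threshold), None)

    # Local maxima above the threshold approximate successive nuclear entries.
    peaks = [times[i] for i in range(1, len(trace) - 1)
             if trace[i] > threshold and trace[i] >= trace[i - 1] and trace[i] > trace[i + 1]]
    period = None
    if len(peaks) >= 2:
        gaps = [b - a for a, b in zip(peaks, peaks[1:])]
        period = sum(gaps) / len(gaps)
    return {"first_translocation": first, "oscillation_period": period}

# Invented trace sampled every 5 minutes, showing two nuclear-entry events.
times = [5 * i for i in range(12)]
trace = [0.8, 0.9, 1.7, 2.1, 1.6, 1.0, 0.9, 1.8, 2.2, 1.4, 0.9, 0.8]
print(annotate_translocation(trace, times))
# -> {'first_translocation': 10, 'oscillation_period': 25.0}
```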
Subjects
Computational Biology/methods; Image Processing, Computer-Assisted/methods; Information Management/methods; Microscopy; Databases, Factual; Information Storage and Retrieval; Software

ABSTRACT
MOTIVATION: The Functional Genomics Experiment Object Model (FuGE) supports modelling of experimental processes either directly or through extensions that specialize FuGE for use in specific contexts. FuGE applications commonly include components that capture, store and search experiment descriptions, where the requirements of different applications have much in common. RESULTS: We describe a toolkit that supports data capture, storage and web-based search of FuGE experiment models; the toolkit can be used directly on FuGE compliant models or configured for use with FuGE extensions. The toolkit is illustrated using a FuGE extension standardized by the proteomics standards initiative, namely GelML. AVAILABILITY: The toolkit and a demonstration are available at http://code.google.com/p/fugetoolkit
Subjects
Computational Biology; Genomics/methods; Models, Genetic; Software; Internet

ABSTRACT
BACKGROUND: The systematic capture of appropriately annotated experimental data is a prerequisite for most bioinformatics analyses. Data capture is required not only for submission of data to public repositories, but also to underpin integrated analysis, archiving, and sharing - both within laboratories and in collaborative projects. The widespread requirement to capture data means that data capture and annotation are taking place at many sites, but the small scale of the literature on tools, techniques and experiences suggests that there is work to be done to identify good practice and reduce duplication of effort. RESULTS: This paper reports on experience gained in the deployment of the Pedro data capture tool in a range of representative bioinformatics applications. The paper makes explicit the requirements that have recurred when capturing data in different contexts, indicates how these requirements are addressed in Pedro, and describes case studies that illustrate where the requirements have arisen in practice. CONCLUSION: Data capture is a fundamental activity for bioinformatics; all biological data resources build on some form of data capture activity, and many require a blend of import, analysis and annotation. Recurring requirements in data capture suggest that model-driven architectures can be used to construct data capture infrastructures that can be rapidly configured to meet the needs of individual use cases. We have described how one such model-driven infrastructure, namely Pedro, has been deployed in representative case studies, and discussed the extent to which the model-driven approach has been effective in practice.
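The fragment below illustrates the model-driven idea in miniature: a declarative description of the fields to be captured drives both validation and record construction, so reconfiguring the capture tool means editing the model rather than the code. The field definitions are invented examples, not Pedro's actual model format.

```python
# Declarative model of a capture form: field name -> (type, required).
# Invented example; Pedro is configured from richer schema-based models.
sample_model = {
    "sample_id": (str, True),
    "organism": (str, True),
    "growth_temperature_c": (float, False),
}

def capture(model, raw_values):
    """Validate raw string inputs against the model and build a record."""
    record, errors = {}, []
    for field, (ftype, required) in model.items():
        value = raw_values.get(field)
        if value is None or value == "":
            if required:
                errors.append(f"missing required field: {field}")
            continue
        try:
            record[field] = ftype(value)
        except ValueError:
            errors.append(f"{field}: cannot interpret {value!r} as {ftype.__name__}")
    return record, errors

print(capture(sample_model, {"sample_id": "S1", "organism": "S. cerevisiae",
                             "growth_temperature_c": "30"}))
print(capture(sample_model, {"organism": "S. cerevisiae",
                             "growth_temperature_c": "warm"}))
```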
Subjects
Algorithms; Computational Biology/methods; Database Management Systems; Databases, Factual; Information Storage and Retrieval/methods; Software

ABSTRACT
BACKGROUND: Many bioinformatics applications rely on controlled vocabularies or ontologies to consistently interpret and seamlessly integrate information scattered across public resources. Experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. However, it is time-consuming and non-trivial to construct these resources manually. RESULTS: We describe a methodology for rapid development of controlled vocabularies, a study originally motivated by the need for vocabularies describing metabolomics technologies. We present case studies involving two controlled vocabularies (for nuclear magnetic resonance spectroscopy and gas chromatography) whose development is currently underway as part of the Metabolomics Standards Initiative. The initial vocabularies were compiled manually, providing a total of 243 and 152 terms. A total of 5,699 and 2,612 new terms were acquired automatically from the literature. The analysis of the results showed that full-text articles (especially the Materials and Methods sections) are the major source of technology-specific terms as opposed to paper abstracts. CONCLUSIONS: We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature. We adopted an integrative approach, combining relatively generic software and data resources for time- and cost-effective development of a text mining tool for expansion of controlled vocabularies across various domains, as a practical alternative to both manual term collection and tailor-made named entity recognition methods.
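A minimal sketch of corpus-based term acquisition follows: candidate terms from domain text (e.g. Materials and Methods sections) are ranked by how much more frequent they are there than in a background corpus. Real term extraction works over noun phrases and much larger corpora; the two "corpora" below are toy strings and the scoring is a simple smoothed frequency ratio chosen for illustration.

```python
import re
from collections import Counter

def frequencies(text):
    return Counter(re.findall(r"[a-z][a-z\-]+", text.lower()))

# Toy stand-ins for a domain corpus (Materials and Methods sections) and a
# general background corpus.
domain = """Samples were analysed by gas chromatography with a polar column.
Nuclear magnetic resonance spectra were acquired on a cryoprobe."""
background = """The results were discussed and the samples were stored.
The column of the table lists the resonance of public opinion."""

dom, bg = frequencies(domain), frequencies(background)
total_dom, total_bg = sum(dom.values()), sum(bg.values())

# Rank candidate terms by relative frequency ratio (with add-one smoothing).
scored = sorted(
    ((word, (count / total_dom) / ((bg[word] + 1) / (total_bg + 1)))
     for word, count in dom.items()),
    key=lambda item: item[1], reverse=True)
for word, score in scored[:8]:
    print(f"{word:15s} {score:.2f}")
```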
Subjects
Abstracting and Indexing/methods; Metabolism; User-Computer Interface; Vocabulary, Controlled; Chromatography, Gas; Information Storage and Retrieval/methods; MEDLINE; Magnetic Resonance Spectroscopy; Natural Language Processing; Pattern Recognition, Automated/methods; Systems Biology/instrumentation; Systems Biology/statistics & numerical data; Systems Integration; Technology; Terminology as Topic; United States

ABSTRACT
BACKGROUND: The number of sequenced fungal genomes is ever increasing, with about 200 genomes already fully sequenced or in progress. Only a small percentage of those genomes have been comprehensively studied, for example using techniques from functional genomics. Comparative analysis has proven to be a useful strategy for enhancing our understanding of evolutionary biology and of the less well understood genomes. However, the data required for these analyses tends to be distributed in various heterogeneous data sources, making systematic comparative studies a cumbersome task. Furthermore, comparative analyses benefit from close integration of derived data sets that cluster genes or organisms in a way that eases the expression of requests that clarify points of similarity or difference between species. DESCRIPTION: To support systematic comparative analyses of fungal genomes we have developed the e-Fungi database, which integrates a variety of data for more than 30 fungal genomes. Publicly available genome data, functional annotations, and pathway information have been integrated into a single data repository and complemented with results of comparative analyses, such as MCL and OrthoMCL cluster analysis, and predictions of signaling proteins and the sub-cellular localisation of proteins. To access the data, a library of analysis tasks is available through a web interface. The analysis tasks are motivated by recent comparative genomics studies, and aim to support the study of evolutionary biology as well as community efforts for improving the annotation of genomes. Web services for each query are also available, enabling the tasks to be incorporated into workflows. CONCLUSION: The e-Fungi database provides fungal biologists with a resource for comparative studies of a large range of fungal genomes. Its analysis library supports the comparative study of genome data, functional annotation, and results of large scale analyses over all the genomes stored in the database. The database is accessible at http://www.e-fungi.org.uk, as is the WSDL for the web services.
Subjects
Databases, Genetic; Genome, Fungal/genetics; Computational Biology/methods; Database Management Systems; Internet; User-Computer Interface

ABSTRACT
A broad range of mass spectrometers are used in mass spectrometry (MS)-based proteomics research. Each type of instrument possesses a unique design, data system and performance specifications, resulting in strengths and weaknesses for different types of experiments. Unfortunately, the native binary data formats produced by each type of mass spectrometer also differ and are usually proprietary. The diverse, nontransparent nature of the data structure complicates the integration of new instruments into preexisting infrastructure, impedes the analysis, exchange, comparison and publication of results from different experiments and laboratories, and prevents the bioinformatics community from accessing data sets required for software development. Here, we introduce the 'mzXML' format, an open, generic XML (extensible markup language) representation of MS data. We have also developed an accompanying suite of supporting programs. We expect that this format will facilitate data management, interpretation and dissemination in proteomics research.
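For orientation, here is a small sketch of consuming mzXML scans in Python. It assumes the third-party pyteomics package; the mzxml module and the exact dictionary keys used below are assumptions of this example rather than anything described in the paper, and the file path is a placeholder.

```python
# pip install pyteomics  (assumed third-party dependency)
from pyteomics import mzxml

def summarise_scans(path, max_scans=5):
    """Print basic information for the first few scans of an mzXML file."""
    with mzxml.read(path) as reader:
        for i, scan in enumerate(reader):
            if i >= max_scans:
                break
            mz = scan.get("m/z array", [])
            print(f"scan {scan.get('num')}: MS level {scan.get('msLevel')}, "
                  f"{len(mz)} peaks, RT {scan.get('retentionTime')}")

# summarise_scans("example.mzXML")   # path to a local mzXML file
```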
Subjects
Database Management Systems; Databases, Factual; Information Dissemination/methods; Information Storage and Retrieval/methods; Mass Spectrometry/methods; Proteomics/methods; User-Computer Interface; Information Storage and Retrieval/standards; Mass Spectrometry/standards; Proteome/analysis; Proteome/chemistry; Proteome/classification; Proteomics/standards; Software

ABSTRACT
Both the generation and the analysis of proteome data are becoming increasingly widespread, and the field of proteomics is moving incrementally toward high-throughput approaches. Techniques are also increasing in complexity as the relevant technologies evolve. A standard representation of both the methods used and the data generated in proteomics experiments, analogous to that of the MIAME (minimum information about a microarray experiment) guidelines for transcriptomics, and the associated MAGE (microarray gene expression) object model and XML (extensible markup language) implementation, has yet to emerge. This hinders the handling, exchange, and dissemination of proteomics data. Here, we present a UML (unified modeling language) approach to proteomics experimental data, describe XML and SQL (structured query language) implementations of that model, and discuss capture, storage, and dissemination strategies. These make explicit what data might be most usefully captured about proteomics experiments and provide complementary routes toward the implementation of a proteome repository.
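To make the complementary SQL route concrete, the sketch below creates a drastically reduced relational schema for proteomics experiments and runs one query over it, using Python's built-in sqlite3 module. The tables, columns and values are illustrative placeholders, not the published schema or real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE experiment (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE sample (id INTEGER PRIMARY KEY,
                     experiment_id INTEGER REFERENCES experiment(id),
                     organism TEXT);
CREATE TABLE identification (id INTEGER PRIMARY KEY,
                             sample_id INTEGER REFERENCES sample(id),
                             protein_accession TEXT, score REAL);
""")
conn.execute("INSERT INTO experiment VALUES (1, '2D gel comparison of growth conditions')")
conn.execute("INSERT INTO sample VALUES (1, 1, 'Saccharomyces cerevisiae')")
conn.executemany("INSERT INTO identification VALUES (?, ?, ?, ?)",
                 [(1, 1, "P00330", 88.5), (2, 1, "P00360", 42.0)])

# Which proteins were identified in which experiment, with what scores?
for row in conn.execute("""
    SELECT e.description, i.protein_accession, i.score
    FROM identification i
    JOIN sample s ON i.sample_id = s.id
    JOIN experiment e ON s.experiment_id = e.id
    ORDER BY i.score DESC"""):
    print(row)
```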
Subjects
Database Management Systems; Databases, Protein; Information Storage and Retrieval/methods; Proteins/chemistry; Proteomics/methods; Documentation/methods; Hypermedia; Information Dissemination/methods; Models, Molecular; Protein Conformation; Proteins/genetics; Proteins/metabolism; Sequence Analysis, Protein/methods; Software; Software Design; User-Computer Interface

ABSTRACT
The study of the metabolite complement of biological samples, known as metabolomics, is creating large amounts of data, and support for handling these data sets is required to facilitate meaningful analyses that will answer biological questions. We present a data model for plant metabolomics known as ArMet (architecture for metabolomics). It encompasses the entire experimental time line from experiment definition and description of biological source material, through sample growth and preparation to the results of chemical analysis. Such formal data descriptions, which specify the full experimental context, enable principled comparison of data sets, allow proper interpretation of experimental results, permit the repetition of experiments and provide a basis for the design of systems for data storage and transmission. The current design and example implementations are freely available (http://www.armet.org/). We seek to advance discussion and community adoption of a standard for metabolomics, which would promote principled collection, storage and transmission of experiment data.
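As a rough indication of what capturing the experimental time line looks like in code, the sketch below defines a few records spanning experiment definition, biological source material and chemical analysis. The class and field names, species and values are invented for illustration and do not reproduce ArMet's actual components.

```python
from dataclasses import dataclass, field, asdict

@dataclass
class BiologicalSource:
    species: str
    genotype: str
    growth_conditions: str

@dataclass
class ChemicalAnalysis:
    technique: str            # e.g. "GC-MS"
    sample_id: str
    detected_metabolites: dict = field(default_factory=dict)  # name -> relative abundance

@dataclass
class MetabolomicsExperiment:
    identifier: str
    description: str
    source: BiologicalSource
    analyses: list = field(default_factory=list)

experiment = MetabolomicsExperiment(
    identifier="EXP-001",
    description="Leaf metabolite response to cold stress",
    source=BiologicalSource("Arabidopsis thaliana", "Col-0", "21C, 16h light"),
    analyses=[ChemicalAnalysis("GC-MS", "S1", {"sucrose": 1.8, "proline": 3.2})],
)
print(asdict(experiment))
```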
Subjects
Database Management Systems; Databases, Factual/standards; Documentation/methods; Information Storage and Retrieval/methods; Plants/metabolism; Proteome/metabolism; Research Design; Documentation/standards; Internet; Proteomics/methods; Proteomics/standards; Research/standards; Software; User-Computer Interface

ABSTRACT
BACKGROUND: The proliferation of data repositories in bioinformatics has resulted in the development of numerous interfaces that allow scientists to browse, search and analyse the data that they contain. Interfaces typically support repository access by means of web pages, but other means are also used, such as desktop applications and command line tools. Interfaces often duplicate functionality amongst each other, and this implies that associated development activities are repeated in different laboratories. Interfaces developed by public laboratories are often created with limited developer resources. In such environments, reducing the time spent on creating user interfaces allows for a better deployment of resources for specialised tasks, such as data integration or analysis. Laboratories maintaining data resources are challenged to reconcile requirements for software that is reliable, functional and flexible with limitations on software development resources. RESULTS: This paper proposes a model-driven approach for the partial generation of user interfaces for searching and browsing bioinformatics data repositories. Inspired by the Model Driven Architecture (MDA) of the Object Management Group (OMG), we have developed a system that generates interfaces designed for use with bioinformatics resources. This approach helps laboratory domain experts decrease the amount of time they have to spend dealing with the repetitive aspects of user interface development. As a result, the amount of time they can spend on gathering requirements and helping develop specialised features increases. The resulting system is known as Pierre, and has been validated through its application to use cases in the life sciences, including the PEDRoDB proteomics database and the e-Fungi data warehouse. CONCLUSION: MDAs focus on generating software from models that describe aspects of service capabilities, and can be applied to support rapid development of repository interfaces in bioinformatics. The Pierre MDA is capable of supporting common database access requirements with a variety of auto-generated interfaces and across a variety of repositories. With Pierre, four kinds of interfaces are generated: web, stand-alone application, text-menu, and command line. The kinds of repositories with which Pierre interfaces have been used are relational, XML and object databases.
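The fragment below gives a toy impression of interface generation from a model: one declarative description of a repository's searchable fields is used to build both a command-line interface and a text menu. The field list is invented, and this is a sketch of the general model-driven idea rather than of Pierre's own templates or model format.

```python
import argparse

# Declarative model of one searchable resource (invented example).
model = {
    "resource": "protein_identifications",
    "fields": [
        {"name": "accession", "help": "protein accession to search for"},
        {"name": "organism", "help": "source organism"},
    ],
}

def generate_cli(model):
    """Build an argparse command-line interface from the model."""
    parser = argparse.ArgumentParser(description=f"Search {model['resource']}")
    for f in model["fields"]:
        parser.add_argument(f"--{f['name']}", help=f["help"])
    return parser

def generate_text_menu(model):
    """Build a numbered text menu listing the searchable fields."""
    lines = [f"Search {model['resource']} by:"]
    lines += [f"  {i}. {f['name']} - {f['help']}"
              for i, f in enumerate(model["fields"], 1)]
    return "\n".join(lines)

print(generate_text_menu(model))
print(vars(generate_cli(model).parse_args(["--accession", "P00330"])))
```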
Subjects
Computational Biology/methods; Databases, Factual; Models, Biological; Software Design; Computational Biology/trends; Databases, Factual/trends

ABSTRACT
BACKGROUND: Several data formats have been developed for large scale biological experiments, using a variety of methodologies. Most data formats contain a mechanism for allowing extensions to encode unanticipated data types. Extensions to data formats are important because the experimental methodologies tend to be fairly diverse and rapidly evolving, which hinders the creation of formats that will be stable over time. RESULTS: In this paper we review the data formats that exist in functional genomics, some of which have become de facto or de jure standards, with a particular focus on how each domain has been modelled, and how each format allows extensions. We describe the tasks that are frequently performed over data formats and analyse how well each task is supported by a particular modelling structure. CONCLUSION: From our analysis, we make recommendations as to the types of modelling structure that are most suitable for particular types of experimental annotation. There are several standards currently under development that we believe could benefit from systematically following a set of guidelines.