RESUMO
The Ensembl project (https://www.ensembl.org) makes key genomic data sets available to the entire scientific community without restrictions. Ensembl seeks to be a fundamental resource driving scientific progress by creating, maintaining and updating reference genome annotation and comparative genomics resources. This year we describe our new and expanded gene, variant and comparative annotation capabilities, which led to a 50% increase in the number of vertebrate genomes we support. We have also doubled the number of available human variants and added regulatory regions for many mouse cell types and developmental stages. Our data sets and tools are available via the Ensembl website as well as a through a RESTful webservice, Perl application programming interface and as data files for download.
Assuntos
Bases de Dados Genéticas , Genoma/genética , Genômica , Vertebrados/genética , Animais , Biologia Computacional/tendências , Humanos , Camundongos , Anotação de Sequência Molecular , SoftwareRESUMO
The Ensembl project has been aggregating, processing, integrating and redistributing genomic datasets since the initial releases of the draft human genome, with the aim of accelerating genomics research through rapid open distribution of public data. Large amounts of raw data are thus transformed into knowledge, which is made available via a multitude of channels, in particular our browser (http://www.ensembl.org). Over time, we have expanded in multiple directions. First, our resources describe multiple fields of genomics, in particular gene annotation, comparative genomics, genetics and epigenomics. Second, we cover a growing number of genome assemblies; Ensembl Release 90 contains exactly 100. Third, our databases feed simultaneously into an array of services designed around different use cases, ranging from quick browsing to genome-wide bioinformatic analysis. We present here the latest developments of the Ensembl project, with a focus on managing an increasing number of assemblies, supporting efforts in genome interpretation and improving our browser.
Assuntos
Bases de Dados Genéticas , Conjuntos de Dados como Assunto , Genoma , Disseminação de Informação , Animais , Epigenômica , Genoma Humano , Estudo de Associação Genômica Ampla , Genômica , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Anotação de Sequência Molecular , Vertebrados/genética , NavegadorRESUMO
Ensembl Genomes (http://www.ensemblgenomes.org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the Ensembl project (http://www.ensembl.org). Together, the two resources provide a consistent set of programmatic and interactive interfaces to a rich range of data including genome sequence, gene models, transcript sequence, genetic variation, and comparative analysis. This paper provides an update to the previous publications about the resource, with a focus on recent developments and expansions. These include the incorporation of almost 20 000 additional genome sequences and over 35 000 tracks of RNA-Seq data, which have been aligned to genomic sequence and made available for visualization. Other advances since 2015 include the release of the database in Resource Description Framework (RDF) format, a large increase in community-derived curation, a new high-performance protein sequence search, additional cross-references, improved annotation of non-protein-coding genes, and the launch of pre-release and archival sites. Collectively, these changes are part of a continuing response to the increasing quantity of publicly-available genome-scale data, and the consequent need to archive, integrate, annotate and disseminate these using automated, scalable methods.
Assuntos
Archaea/genética , Bactérias/genética , Bases de Dados Genéticas , Bases de Dados de Proteínas , Eucariotos/genética , Genômica , Sequência de Aminoácidos , Animais , Sequência de Bases , Mineração de Dados , Previsões , Genoma , Anotação de Sequência Molecular , RNA/genética , Interface Usuário-ComputadorRESUMO
The Ensembl software resources are a stable infrastructure to store, access and manipulate genome assemblies and their functional annotations. The Ensembl 'Core' database and Application Programming Interface (API) was our first major piece of software infrastructure and remains at the centre of all of our genome resources. Since its initial design more than fifteen years ago, the number of publicly available genomic, transcriptomic and proteomic datasets has grown enormously, accelerated by continuous advances in DNA-sequencing technology. Initially intended to provide annotation for the reference human genome, we have extended our framework to support the genomes of all species as well as richer assembly models. Cross-referenced links to other informatics resources facilitate searching our database with a variety of popular identifiers such as UniProt and RefSeq. Our comprehensive and robust framework storing a large diversity of genome annotations in one location serves as a platform for other groups to generate and maintain their own tailored annotation. We welcome reuse and contributions: our databases and APIs are publicly available, all of our source code is released with a permissive Apache v2.0 licence at http://github.com/Ensembl and we have an active developer mailing list ( http://www.ensembl.org/info/about/contact/index.html ). Database URL: http://www.ensembl.org.
Assuntos
Bases de Dados de Ácidos Nucleicos , Genoma Humano , Anotação de Sequência Molecular/métodos , Análise de Sequência de DNA/métodos , Interface Usuário-Computador , HumanosRESUMO
Ensembl (www.ensembl.org) is a database and genome browser for enabling research on vertebrate genomes. We import, analyse, curate and integrate a diverse collection of large-scale reference data to create a more comprehensive view of genome biology than would be possible from any individual dataset. Our extensive data resources include evidence-based gene and regulatory region annotation, genome variation and gene trees. An accompanying suite of tools, infrastructure and programmatic access methods ensure uniform data analysis and distribution for all supported species. Together, these provide a comprehensive solution for large-scale and targeted genomics applications alike. Among many other developments over the past year, we have improved our resources for gene regulation and comparative genomics, and added CRISPR/Cas9 target sites. We released new browser functionality and tools, including improved filtering and prioritization of genome variation, Manhattan plot visualization for linkage disequilibrium and eQTL data, and an ontology search for phenotypes, traits and disease. We have also enhanced data discovery and access with a track hub registry and a selection of new REST end points. All Ensembl data are freely released to the scientific community and our source code is available via the open source Apache 2.0 license.
Assuntos
Biologia Computacional/métodos , Bases de Dados Genéticas , Genômica/métodos , Ferramenta de Busca , Software , Navegador , Animais , Mineração de Dados , Evolução Molecular , Regulação da Expressão Gênica , Variação Genética , Genoma Humano , Humanos , Anotação de Sequência Molecular , Especificidade da Espécie , VertebradosRESUMO
The Ensembl project (http://www.ensembl.org) is a system for genome annotation, analysis, storage and dissemination designed to facilitate the access of genomic annotation from chordates and key model organisms. It provides access to data from 87 species across our main and early access Pre! websites. This year we introduced three newly annotated species and released numerous updates across our supported species with a concentration on data for the latest genome assemblies of human, mouse, zebrafish and rat. We also provided two data updates for the previous human assembly, GRCh37, through a dedicated website (http://grch37.ensembl.org). Our tools, in particular the VEP, have been improved significantly through integration of additional third party data. REST is now capable of larger-scale analysis and our regulatory data BioMart can deliver faster results. The website is now capable of displaying long-range interactions such as those found in cis-regulated datasets. Finally we have launched a website optimized for mobile devices providing views of genes, variants and phenotypes. Our data is made available without restriction and all code is available from our GitHub organization site (http://github.com/Ensembl) under an Apache 2.0 license.
Assuntos
Bases de Dados Genéticas , Genômica , Anotação de Sequência Molecular , Animais , Genes , Variação Genética , Humanos , Internet , Camundongos , Proteínas/genética , Ratos , Sequências Reguladoras de Ácido Nucleico , SoftwareRESUMO
MOTIVATION: We present a Web service to access Ensembl data using Representational State Transfer (REST). The Ensembl REST server enables the easy retrieval of a wide range of Ensembl data by most programming languages, using standard formats such as JSON and FASTA while minimizing client work. We also introduce bindings to the popular Ensembl Variant Effect Predictor tool permitting large-scale programmatic variant analysis independent of any specific programming language. AVAILABILITY AND IMPLEMENTATION: The Ensembl REST API can be accessed at http://rest.ensembl.org and source code is freely available under an Apache 2.0 license from http://github.com/Ensembl/ensembl-rest.
Assuntos
Biologia Computacional/métodos , Bases de Dados Factuais , Linguagens de Programação , Software , Variação Genética , Genômica , HumanosRESUMO
Ensembl (http://www.ensembl.org) is a genomic interpretation system providing the most up-to-date annotations, querying tools and access methods for chordates and key model organisms. This year we released updated annotation (gene models, comparative genomics, regulatory regions and variation) on the new human assembly, GRCh38, although we continue to support researchers using the GRCh37.p13 assembly through a dedicated site (http://grch37.ensembl.org). Our Regulatory Build has been revamped to identify regulatory regions of interest and to efficiently highlight their activity across disparate epigenetic data sets. A number of new interfaces allow users to perform large-scale comparisons of their data against our annotations. The REST server (http://rest.ensembl.org), which allows programs written in any language to query our databases, has moved to a full service alongside our upgraded website tools. Our online Variant Effect Predictor tool has been updated to process more variants and calculate summary statistics. Lastly, the WiggleTools package enables users to summarize large collections of data sets and view them as single tracks in Ensembl. The Ensembl code base itself is more accessible: it is now hosted on our GitHub organization page (https://github.com/Ensembl) under an Apache 2.0 open source license.
Assuntos
Bases de Dados de Ácidos Nucleicos , Genômica , Animais , Epigênese Genética , Variação Genética , Genoma Humano , Humanos , Internet , Camundongos , Anotação de Sequência Molecular , Sequências Reguladoras de Ácido Nucleico , SoftwareRESUMO
Ensembl (http://www.ensembl.org) creates tools and data resources to facilitate genomic analysis in chordate species with an emphasis on human, major vertebrate model organisms and farm animals. Over the past year we have increased the number of species that we support to 77 and expanded our genome browser with a new scrollable overview and improved variation and phenotype views. We also report updates to our core datasets and improvements to our gene homology relationships from the addition of new species. Our REST service has been extended with additional support for comparative genomics and ontology information. Finally, we provide updated information about our methods for data access and resources for user training.
Assuntos
Bases de Dados Genéticas , Genômica , Animais , Cordados/genética , Variação Genética , Humanos , Internet , Camundongos , Anotação de Sequência Molecular , Fenótipo , RatosRESUMO
CSpritz is a web server for the prediction of intrinsic protein disorder. It is a combination of previous Spritz with two novel orthogonal systems developed by our group (Punch and ESpritz). Punch is based on sequence and structural templates trained with support vector machines. ESpritz is an efficient single sequence method based on bidirectional recursive neural networks. Spritz was extended to filter predictions based on structural homologues. After extensive testing, predictions are combined by averaging their probabilities. The CSpritz website can elaborate single or multiple predictions for either short or long disorder. The server provides a global output page, for download and simultaneous statistics of all predictions. Links are provided to each individual protein where the amino acid sequence and disorder prediction are displayed along with statistics for the individual protein. As a novel feature, CSpritz provides information about structural homologues as well as secondary structure and short functional linear motifs in each disordered segment. Benchmarking was performed on the very recent CASP9 data, where CSpritz would have ranked consistently well with a Sw measure of 49.27 and AUC of 0.828. The server, together with help and methods pages including examples, are freely available at URL: http://protein.bio.unipd.it/cspritz/.
Assuntos
Conformação Proteica , Software , Motivos de Aminoácidos , Internet , Anotação de Sequência Molecular , Estrutura Secundária de Proteína , Homologia Estrutural de ProteínaRESUMO
BACKGROUND: Proteins, especially larger ones, are often composed of individual evolutionary units, domains, which have their own function and structural fold. Predicting domains is an important intermediate step in protein analyses, including the prediction of protein structures. RESULTS: We describe novel systems for the prediction of protein domain boundaries powered by Recursive Neural Networks. The systems rely on a combination of primary sequence and evolutionary information, predictions of structural features such as secondary structure, solvent accessibility and residue contact maps, and structural templates, both annotated for domains (from the SCOP dataset) and unannotated (from the PDB). We gauge the contribution of contact maps, and PDB and SCOP templates independently and for different ranges of template quality. We find that accurately predicted contact maps are informative for the prediction of domain boundaries, while the same is not true for contact maps predicted ab initio. We also find that gap information from PDB templates is informative, but, not surprisingly, less than SCOP annotations. We test both systems trained on templates of all qualities, and systems trained only on templates of marginal similarity to the query (less than 25% sequence identity). While the first batch of systems produces near perfect predictions in the presence of fair to good templates, the second batch outperforms or match ab initio predictors down to essentially any level of template quality.We test all systems in 5-fold cross-validation on a large non-redundant set of multi-domain and single domain proteins. The final predictors are state-of-the-art, with a template-less prediction boundary recall of 50.8% (precision 38.7%) within +/- 20 residues and a single domain recall of 80.3% (precision 78.1%). The SCOP-based predictors achieve a boundary recall of 74% (precision 77.1%) again within +/- 20 residues, and classify single domain proteins as such in over 85% of cases, when we allow a mix of bad and good quality templates. If we only allow marginal templates (max 25% sequence identity to the query) the scores remain high, with boundary recall and precision of 59% and 66.3%, and 80% of all single domain proteins predicted correctly. CONCLUSION: The systems presented here may prove useful in large-scale annotation of protein domains in proteins of unknown structure. The methods are available as public web servers at the address: http://distill.ucd.ie/shandy/ and we plan on running them on a multi-genomic scale and make the results public in the near future.
Assuntos
Algoritmos , Redes Neurais de Computação , Estrutura Terciária de Proteína , Proteínas/química , Bases de Dados de ProteínasRESUMO
BACKGROUND: Prediction of protein structures from their sequences is still one of the open grand challenges of computational biology. Some approaches to protein structure prediction, especially ab initio ones, rely to some extent on the prediction of residue contact maps. Residue contact map predictions have been assessed at the CASP competition for several years now. Although it has been shown that exact contact maps generally yield correct three-dimensional structures, this is true only at a relatively low resolution (3-4 A from the native structure). Another known weakness of contact maps is that they are generally predicted ab initio, that is not exploiting information about potential homologues of known structure. RESULTS: We introduce a new class of distance restraints for protein structures: multi-class distance maps. We show that C alpha trace reconstructions based on 4-class native maps are significantly better than those from residue contact maps. We then build two predictors of 4-class maps based on recursive neural networks: one ab initio, or relying on the sequence and on evolutionary information; one template-based, or in which homology information to known structures is provided as a further input. We show that virtually any level of sequence similarity to structural templates (down to less than 10%) yields more accurate 4-class maps than the ab initio predictor. We show that template-based predictions by recursive neural networks are consistently better than the best template and than a number of combinations of the best available templates. We also extract binary residue contact maps at an 8 A threshold (as per CASP assessment) from the 4-class predictors and show that the template-based version is also more accurate than the best template and consistently better than the ab initio one, down to very low levels of sequence identity to structural templates. Furthermore, we test both ab-initio and template-based 8 A predictions on the CASP7 targets using a pre-CASP7 PDB, and find that both predictors are state-of-the-art, with the template-based one far outperforming the best CASP7 systems if templates with sequence identity to the query of 10% or better are available. Although this is not the main focus of this paper we also report on reconstructions of C alpha traces based on both ab initio and template-based 4-class map predictions, showing that the latter are generally more accurate even when homology is dubious. CONCLUSION: Accurate predictions of multi-class maps may provide valuable constraints for improved ab initio and template-based prediction of protein structures, naturally incorporate multiple templates, and yield state-of-the-art binary maps. Predictions of protein structures and 8 A contact maps based on the multi-class distance map predictors described in this paper are freely available to academic users at the url http://distill.ucd.ie/.
Assuntos
Biologia Computacional/métodos , Modelos Moleculares , Redes Neurais de Computação , Proteínas/química , Algoritmos , Bases de Dados de ProteínasRESUMO
Protein topology representations such as residue contact maps are an important intermediate step towards ab initio prediction of protein structure, but the problem of predicting reliable contact maps is far from solved. One of the main pitfalls of existing contact map predictors is that they generally predict unphysical maps, i.e. maps that cannot be embedded into three-dimensional structures or, at best, violate a number of basic constraints observed in real protein structures, such as the maximum number of contacts for a residue. Here, we focus on the problem of learning to predict more "physical" contact maps. We do so by first predicting contact maps through a traditional system (XXStout), and then filtering these maps by an ensemble of artificial neural networks. The filter is provided as input not only the bare predicted map, but also a number of global or long-range features extracted from it. In a rigorous cross-validation test, we show that the filter greatly improves the predicted maps it is input. CASP7 results, on which we report here, corroborate this finding. Importantly, since the approach we present here is fully modular, it may be beneficial to any other ab initio contact map predictor.
Assuntos
Algoritmos , Modelos Químicos , Redes Neurais de Computação , Reconhecimento Automatizado de Padrão/métodos , Mapeamento de Interação de Proteínas/métodos , Proteínas/química , Sítios de Ligação , Simulação por Computador , Ligação ProteicaRESUMO
BACKGROUND: Structural properties of proteins such as secondary structure and solvent accessibility contribute to three-dimensional structure prediction, not only in the ab initio case but also when homology information to known structures is available. Structural properties are also routinely used in protein analysis even when homology is available, largely because homology modelling is lower throughput than, say, secondary structure prediction. Nonetheless, predictors of secondary structure and solvent accessibility are virtually always ab initio. RESULTS: Here we develop high-throughput machine learning systems for the prediction of protein secondary structure and solvent accessibility that exploit homology to proteins of known structure, where available, in the form of simple structural frequency profiles extracted from sets of PDB templates. We compare these systems to their state-of-the-art ab initio counterparts, and with a number of baselines in which secondary structures and solvent accessibilities are extracted directly from the templates. We show that structural information from templates greatly improves secondary structure and solvent accessibility prediction quality, and that, on average, the systems significantly enrich the information contained in the templates. For sequence similarity exceeding 30%, secondary structure prediction quality is approximately 90%, close to its theoretical maximum, and 2-class solvent accessibility roughly 85%. Gains are robust with respect to template selection noise, and significant for marginal sequence similarity and for short alignments, supporting the claim that these improved predictions may prove beneficial beyond the case in which clear homology is available. CONCLUSION: The predictive system are publicly available at the address http://distill.ucd.ie.
Assuntos
Inteligência Artificial , Modelos Químicos , Modelos Moleculares , Estrutura Secundária de Proteína , Proteínas/química , Proteínas/ultraestrutura , Análise de Sequência de Proteína/métodos , Sequência de Aminoácidos , Simulação por Computador , Sequência Consenso , Dados de Sequência Molecular , Solventes/químicaRESUMO
A significant step towards establishing the structure and function of a protein is the prediction of the local conformation of the polypeptide chain. In this article, we present systems for the prediction of three new alphabets of local structural motifs. The motifs are built by applying multidimensional scaling (MDS) and clustering to pair-wise angular distances for multiple phi-psi angle values collected from high-resolution protein structures. The predictive systems, based on ensembles of bidirectional recurrent neural network architectures, and trained on a large non-redundant set of protein structures, achieve 72%, 66%, and 60% correct motif prediction on an independent test set for di-peptides (six classes), tri-peptides (eight classes) and tetra-peptides (14 classes), respectively, 28-30% above baseline statistical predictors. We then build a further system, based on ensembles of two-layered bidirectional recurrent neural networks, to map structural motif predictions into a traditional 3-class (helix, strand, coil) secondary structure. This system achieves 79.5% correct prediction using the "hard" CASP 3-class assignment, and 81.4% with a more lenient assignment, outperforming a sophisticated state-of-the-art predictor (Porter) trained in the same experimental conditions. The structural motif predictor is publicly available at: http://distill.ucd.ie/porter+/.
Assuntos
Motivos de Aminoácidos , Biologia Computacional/métodos , Peptídeos/química , Proteínas/química , Bases de Dados de Proteínas , Estrutura Secundária de ProteínaRESUMO
BACKGROUND: We describe Distill, a suite of servers for the prediction of protein structural features: secondary structure; relative solvent accessibility; contact density; backbone structural motifs; residue contact maps at 6, 8 and 12 Angstrom; coarse protein topology. The servers are based on large-scale ensembles of recursive neural networks and trained on large, up-to-date, non-redundant subsets of the Protein Data Bank. Together with structural feature predictions, Distill includes a server for prediction of Calpha traces for short proteins (up to 200 amino acids). RESULTS: The servers are state-of-the-art, with secondary structure predicted correctly for nearly 80% of residues (currently the top performance on EVA), 2-class solvent accessibility nearly 80% correct, and contact maps exceeding 50% precision on the top non-diagonal contacts. A preliminary implementation of the predictor of protein Calpha traces featured among the top 20 Novel Fold predictors at the last CASP6 experiment as group Distill (ID 0348). The majority of the servers, including the Calpha trace predictor, now take into account homology information from the PDB, when available, resulting in greatly improved reliability. CONCLUSION: All predictions are freely available through a simple joint web interface and the results are returned by email. In a single submission the user can send protein sequences for a total of up to 32k residues to all or a selection of the servers. Distill is accessible at the address: http://distill.ucd.ie/distill/.
Assuntos
Bases de Dados de Proteínas , Internet , Conformação Proteica , Algoritmos , Motivos de Aminoácidos , Sequência de Aminoácidos , Dobramento de Proteína , Mapeamento de Interação de Proteínas , Estrutura Secundária de Proteína , Alinhamento de Sequência , SolventesRESUMO
Intrinsically disordered proteins have long stretches of their polypeptide chain, which do not adopt a single native structure composed of stable secondary and tertiary structure in the absence of binding partners. The prediction of intrinsically disordered regions in proteins from sequence is increasingly becoming of interest, as the presence of many such regions in the complete genome sequences are discovered and important functional roles are associated with them. We have developed a machine learning approach based on two support vector machines (SVM) to discriminate disordered regions from sequence. The SVM are trained and benchmarked on two sets, representing long and short disordered regions. A preliminary version of Spritz was shown to perform consistently well at the recent biannual CASP-6 experiment [Critical Assessment of Techniques for Protein Structure Prediction (CASP), 2004]. The fully developed Spritz method is freely available as a web server at http://distill.ucd.ie/spritz/ and http://protein.cribi.unipd.it/spritz/.
Assuntos
Inteligência Artificial , Conformação Proteica , Análise de Sequência de Proteína/métodos , Software , Internet , Interface Usuário-ComputadorRESUMO
DISULFIND is a server for predicting the disulfide bonding state of cysteines and their disulfide connectivity starting from sequence alone. Optionally, disulfide connectivity can be predicted from sequence and a bonding state assignment given as input. The output is a simple visualization of the assigned bonding state (with confidence degrees) and the most likely connectivity patterns. The server is available at http://disulfind.dsi.unifi.it/.
Assuntos
Cisteína/química , Proteínas/química , Análise de Sequência de Proteína/métodos , Software , Dissulfetos/química , Internet , Interface Usuário-ComputadorRESUMO
We develop and test machine learning methods for the prediction of coarse 3D protein structures, where a protein is represented by a set of rigid rods associated with its secondary structure elements (alpha-helices and beta-strands). First, we employ cascades of recursive neural networks derived from graphical models to predict the relative placements of segments. These are represented as discretized distance and angle maps, and the discretization levels are statistically inferred from a large and curated dataset. Coarse 3D folds of proteins are then assembled starting from topological information predicted in the first stage. Reconstruction is carried out by minimizing a cost function taking the form of a purely geometrical potential. We show that the proposed architecture outperforms simpler alternatives and can accurately predict binary and multiclass coarse maps. The reconstruction procedure proves to be fast and often leads to topologically correct coarse structures that could be exploited as a starting point for various protein modeling strategies. The fully integrated rod-shaped protein builder (predictor of contact maps + reconstruction algorithm) can be accessed at http://distill.ucd.ie/.
Assuntos
Algoritmos , Modelos Moleculares , Proteínas/química , Software , Estrutura Secundária de Proteína , Estrutura Terciária de ProteínaRESUMO
BACKGROUND: Protein topology representations such as residue contact maps are an important intermediate step towards ab initio prediction of protein structure. Although improvements have occurred over the last years, the problem of accurately predicting residue contact maps from primary sequences is still largely unsolved. Among the reasons for this are the unbalanced nature of the problem (with far fewer examples of contacts than non-contacts), the formidable challenge of capturing long-range interactions in the maps, the intrinsic difficulty of mapping one-dimensional input sequences into two-dimensional output maps. In order to alleviate these problems and achieve improved contact map predictions, in this paper we split the task into two stages: the prediction of a map's principal eigenvector (PE) from the primary sequence; the reconstruction of the contact map from the PE and primary sequence. Predicting the PE from the primary sequence consists in mapping a vector into a vector. This task is less complex than mapping vectors directly into two-dimensional matrices since the size of the problem is drastically reduced and so is the scale length of interactions that need to be learned. RESULTS: We develop architectures composed of ensembles of two-layered bidirectional recurrent neural networks to classify the components of the PE in 2, 3 and 4 classes from protein primary sequence, predicted secondary structure, and hydrophobicity interaction scales. Our predictor, tested on a non redundant set of 2171 proteins, achieves classification performances of up to 72.6%, 16% above a base-line statistical predictor. We design a system for the prediction of contact maps from the predicted PE. Our results show that predicting maps through the PE yields sizeable gains especially for long-range contacts which are particularly critical for accurate protein 3D reconstruction. The final predictor's accuracy on a non-redundant set of 327 targets is 35.4% and 19.8% for minimum contact separations of 12 and 24, respectively, when the top length/5 contacts are selected. On the 11 CASP6 Novel Fold targets we achieve similar accuracies (36.5% and 19.7%). This favourably compares with the best automated predictors at CASP6. CONCLUSION: Our final system for contact map prediction achieves state-of-the-art performances, and may provide valuable constraints for improved ab initio prediction of protein structures. A suite of predictors of structural features, including the PE, and PE-based contact maps, is available at http://distill.ucd.ie.