Search | Secretaria de Estado da Saúde

1.

SPIn: model selection for phylogenetic mixtures via linear invariants.

Kedzierska, A M; Drton, M; Guigó, R; Casanellas, M.

Mol Biol Evol ; 29(3): 929-37, 2012 Mar.

Article in English | MEDLINE | ID: mdl-22009060

ABSTRACT

In phylogenetic inference, an evolutionary model describes the substitution processes along each edge of a phylogenetic tree. Misspecification of the model has important implications for the analysis of phylogenetic data. Conventionally, however, the selection of a suitable evolutionary model is based on heuristics or relies on the choice of an approximate input tree. We introduce a method for model Selection in Phylogenetics based on linear INvariants (SPIn), which uses recent insights on linear invariants to characterize a model of nucleotide evolution for phylogenetic mixtures on any number of components. Linear invariants are constraints among the joint probabilities of the bases in the operational taxonomic units that hold irrespective of the tree topologies appearing in the mixtures. SPIn therefore requires no input tree and is designed to deal with nonhomogeneous phylogenetic data consisting of multiple sequence alignments showing different patterns of evolution, for example, concatenated genes, exons, and/or introns. Here, we report on the results of the proposed method evaluated on multiple sequence alignments simulated under a variety of single-tree and mixture settings for both continuous- and discrete-time models. In the simulations, SPIn successfully recovers the underlying evolutionary model and is shown to perform better than existing approaches.

Subjects

Evolution, Molecular; Models, Genetic; Phylogeny; Base Sequence; Computer Simulation; Markov Chains; Sequence Alignment

2.

Selenoprofiles: profile-based scanning of eukaryotic genome sequences for selenoprotein genes.

Mariotti, M; Guigó, R.

Bioinformatics ; 26(21): 2656-63, 2010 Nov 01.

Article in English | MEDLINE | ID: mdl-20861026

ABSTRACT

MOTIVATION: Selenoproteins are a group of proteins that contain selenocysteine (Sec), a rare amino acid inserted co-translationally into the protein chain. The Sec codon is UGA, which is normally a stop codon. In selenoproteins, UGA is recoded to Sec in the presence of specific features on selenoprotein gene transcripts. Due to the dual role of the UGA codon, selenoprotein prediction and annotation are difficult tasks, and even known selenoproteins are often misannotated in genome databases. RESULTS: We present a homology-based in silico method to scan genomes for members of the known eukaryotic selenoprotein families: selenoprofiles. The core of the method is a set of manually curated, highly reliable multiple sequence alignments of selenoprotein families, which are used as queries to scan genomic sequences. Results of the scan are processed through a number of steps to produce highly accurate predictions of selenoprotein genes with little or no human intervention. Selenoprofiles is a valuable tool for bioinformatic characterization of eukaryotic selenoproteomes, and can complement genome annotation pipelines. AVAILABILITY AND IMPLEMENTATION: Selenoprofiles is a Python-built pipeline that internally runs psitblastn, exonerate, genewise, SECISearch and a number of custom-made scripts and programs. The program is available at http://big.crg.cat/services/selenoprofiles. The predictions presented in this article are available through DAS at http://genome.crg.cat:9000/das/Selenoprofiles_ensembl.

Subjects

Genome; Selenoproteins/genetics; Codon, Terminator; Databases, Genetic; Regulatory Sequences, Nucleic Acid; Selenocysteine/chemistry; Selenoproteins/chemistry; Sequence Alignment

3.

Improving data and knowledge management to better integrate health care and research.

Cases, M; Furlong, L I; Albanell, J; Altman, R B; Bellazzi, R; Boyer, S; Brand, A; Brookes, A J; Brunak, S; Clark, T W; Gea, J; Ghazal, P; Graf, N; Guigó, R; Klein, T E; López-Bigas, N; Maojo, V; Mons, B; Musen, M; Oliveira, J L; Rowe, A; Ruch, P; Shabo, A; Shortliffe, E H; Valencia, A; van der Lei, J; Mayer, M A; Sanz, F.

J Intern Med ; 274(4): 321-8, 2013 Oct.

Article in English | MEDLINE | ID: mdl-23808970

Subjects

Biomedical Research/methods; Databases, Factual/standards; Delivery of Health Care/organization & administration; Knowledge Management/standards; Biomedical Research/standards; Delivery of Health Care/standards; Humans

4.

Distinctive sequence features in protein coding, genic non-coding, and intergenic human DNA.

Guigó, R; Fickett, J W.

J Mol Biol ; 253(1): 51-60, 1995 Oct 13.

Article in English | MEDLINE | ID: mdl-7473716

ABSTRACT

We have studied the behavior of a number of sequence statistics, mostly indicative of protein coding function, in a large set of human clone sequences randomly selected in the course of genome mapping (randomly selected clone sequences), and compared this with the behavior in known sequences containing genes (which we term genic sequences). As expected, given the higher coding density of the genic sequences, the sequence statistics studied behave in a substantially different manner in the randomly selected clone sequences (mostly intergenic DNA) and in the genic sequences. Strong differences in behavior of a number of such statistics are also observed, however, when the randomly selected clone sequences are compared with only the non-coding fraction of the genic sequences, suggesting that intergenic and genic non-coding DNA constitute two different classes of non-coding DNA. By studying the behavior of the sequence statistics in simulated DNA of different C+G content, we have observed that a number of them are strongly dependent on C+G content. Thus, most differences between intergenic and genic non-coding DNA can be explained by differences in C+G content. A+T-rich intergenic DNA appears to be at the compositional equilibrium expected under random mutation, while C+G richer non-coding genic DNA is far from this equilibrium. The results obtained in simulated DNA indicate, on the other hand, that a very large fraction of the variation in the coding statistics that underlie gene identification algorithms is due simply to C+G content, and is not directly related to protein coding function. It appears, thus, that the performance of gene-finding algorithms should be improved by carefully distinguishing the effects of protein coding function from those of mere base compositional variation on such coding statistics.

Subjects

Base Sequence/genetics; DNA/genetics; Genes/genetics; Algorithms; Base Composition; Databases, Factual; Discriminant Analysis; Humans; Open Reading Frames/genetics; Proteins/genetics
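
The abstract of this record attributes much of the variation in coding statistics to C+G content. As a minimal sketch of how that quantity can be tracked along a sequence, the following computes the C+G fraction in fixed-size windows. The function names, window size, and step are illustrative assumptions and are not taken from the paper.

```python
from typing import List


def cg_content(seq: str) -> float:
    """Fraction of characters in seq that are C or G (case-insensitive).

    Ambiguity codes such as N count toward the length but not toward C+G.
    """
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "CG" for base in seq) / len(seq)


def windowed_cg(seq: str, window: int, step: int) -> List[float]:
    """C+G fraction of each full window of the given size, moving by step."""
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        cg_content(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]
```

Comparing a coding statistic against these per-window values is one simple way to check whether the statistic mostly follows base composition.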

5.

Prediction of gene structure.

Guigó, R; Knudsen, S; Drake, N; Smith, T.

J Mol Biol ; 226(1): 141-57, 1992 Jul 05.

Article in English | MEDLINE | ID: mdl-1619647

ABSTRACT

We have developed a hierarchical rule base system for identifying genes in DNA sequences. Atomic sites (such as initiation codons, stop codons, acceptor sites and donor sites) are identified by a number of different methods and evaluated by a set of filters and rules chosen to maximize sensitivity; these are combined into higher-order gene elements (such as exons), evaluated, filtered and combined as equivalence classes into probable genes, which are evaluated and ranked. The system has been tested on an extensive collection of vertebrate genes smaller than 15,000 bases. Results obtained show that, on average, 88% of the predicted coding region for a transcription unit is actually coding, and 80% of the actual coding is correctly predicted. This will, in most applications, be sufficient for a search against protein sequence databases for the identification of probable gene function. In addition, the system provides a general test platform for both gene atomic site identification and the rules for their evaluation and assembly.

Subjects

Algorithms; Artificial Intelligence; Genes/genetics; RNA Splicing/genetics; Evaluation Studies as Topic; Exons/genetics; Introns/genetics; Probability; Software

6.

Assembling genes from predicted exons in linear time with dynamic programming.

Guigó, R.

J Comput Biol ; 5(4): 681-702, 1998.

Article in English | MEDLINE | ID: mdl-10072084

ABSTRACT

In a number of programs for gene structure prediction in higher eukaryotic genomic sequences, exon prediction is decoupled from gene assembly: a large pool of candidate exons is predicted and scored from features located in the query DNA sequence, and candidate genes are assembled from such a pool as sequences of nonoverlapping frame-compatible exons. Genes are scored as a function of the scores of the assembled exons, and the highest scoring candidate gene is assumed to be the most likely gene encoded by the query DNA sequence. Considering additive gene scoring functions, currently available algorithms to determine such a highest scoring candidate gene run in time proportional to the square of the number of predicted exons. Here, we present an algorithm whose running time grows only linearly with the size of the set of predicted exons. Polynomial algorithms rely on the fact that, while scanning the set of predicted exons, the highest scoring gene ending in a given exon can be obtained by appending the exon to the highest scoring among the highest scoring genes ending at each compatible preceding exon. The algorithm here relies on the simple fact that such highest scoring gene can be stored and updated. This requires scanning the set of predicted exons simultaneously by increasing acceptor and donor position. On the other hand, the algorithm described here does not assume an underlying gene structure model. Indeed, the definition of valid gene structures is externally defined in the so-called Gene Model. The Gene Model specifies simply which gene features are allowed immediately upstream which other gene features in valid gene structures. This allows for great flexibility in formulating the gene identification problem. In particular it allows for multiple-gene two-strand predictions and for considering gene features other than coding exons (such as promoter elements) in valid gene structures.

Subjects

Algorithms; Exons; Genes; Models, Genetic; Linear Models; Software
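
The abstract of this record describes scanning the predicted exons simultaneously by increasing acceptor and donor position while storing and updating the best gene so far. The sketch below illustrates only that idea under strong simplifying assumptions: exons are hypothetical `(start, end, score)` tuples, frame compatibility and the Gene Model are ignored, and the gene score is the plain sum of exon scores. It is not the published algorithm. The scan itself is linear in the number of exons; the two initial sorts are an extra cost in this sketch.

```python
from typing import List, Tuple

# Hypothetical exon record: (acceptor position, donor position, score).
Exon = Tuple[int, int, float]


def best_gene_score(exons: List[Exon]) -> float:
    """Highest total score of a chain of non-overlapping exons (0.0 if none)."""
    by_start = sorted(range(len(exons)), key=lambda i: exons[i][0])
    by_end = sorted(range(len(exons)), key=lambda i: exons[i][1])

    best_ending_at = [0.0] * len(exons)  # best gene ending in each exon
    best_closed = 0.0  # best gene ending strictly before the current start
    overall = 0.0
    j = 0  # pointer into by_end

    for i in by_start:
        start, _, score = exons[i]
        # Absorb every exon that ends before this one starts. Such an exon
        # also starts earlier, so its best_ending_at value is already final.
        while j < len(by_end) and exons[by_end[j]][1] < start:
            best_closed = max(best_closed, best_ending_at[by_end[j]])
            j += 1
        best_ending_at[i] = best_closed + score
        overall = max(overall, best_ending_at[i])
    return overall
```

Because `best_closed` is a single running maximum, each exon is appended to the best compatible predecessor without looking back over all earlier exons, which is where a quadratic approach spends its time.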

7.

Computational gene identification.

Guigó, R.

J Mol Med (Berl) ; 75(6): 389-93, 1997 Jun.

Article in English | MEDLINE | ID: mdl-9231878

Subjects

Genome, Human; Models, Statistical; Software; Cloning, Molecular; Humans; Sequence Analysis, DNA

8.

Long noncoding RNAs as enhancers of gene expression.

Ørom, U A; Derrien, T; Guigo, R; Shiekhattar, R.

Cold Spring Harb Symp Quant Biol ; 75: 325-31, 2010.

Article in English | MEDLINE | ID: mdl-21502407

ABSTRACT

The human genome contains thousands of long noncoding RNAs (ncRNAs) transcribed from diverse genomic locations. A large set of long ncRNAs is transcribed independent of protein-coding genes. We have used the GENCODE annotation of the human genome to identify 3019 long ncRNAs expressed in various human cell lines and tissue. This set of long ncRNAs responds to differentiation signals in primary human keratinocytes and is coexpressed with important regulators of keratinocyte development. Depletion of a number of these long ncRNAs leads to the repression of specific genes in their surrounding locus, supportive of an activating function for ncRNAs. Using reporter assays, we confirmed such activating function and show that such transcriptional enhancement is mediated through the long ncRNA transcripts. Our studies show that long ncRNAs exhibit functions similar to classically defined enhancers, through an RNA-dependent mechanism.

Subjects

Enhancer Elements, Genetic/genetics; Gene Expression Regulation; RNA, Untranslated/genetics; Cell Differentiation/genetics; Conserved Sequence/genetics; Genome, Human/genetics; Humans; Keratinocytes/metabolism; Molecular Sequence Annotation; Open Reading Frames/genetics; Snail Family Transcription Factors; Software; Transcription Factors/metabolism

9.

Computational gene identification: an open problem.

Guigó, R.

Comput Chem ; 21(4): 215-22, 1997.

Article in English | MEDLINE | ID: mdl-9415986

ABSTRACT

As the Human Genome Project enters the large-scale sequencing phase, computational gene identification methods are becoming essential for the automatic analysis and annotation of large uncharacterized genomic sequences. Currently available computer programs relying mainly on sequence coding statistics are of great use in pin-pointing regions in genomic sequences containing exons. Such programs perform rather poorly, however, when the problem is to fully elucidate gene structure. For this problem, the DNA sequence signals involved in the specification of the genes--start sites and splice sites--carry a lot of information, and simple methods relying on such information can predict gene structure with an accuracy to some extent comparable to that of other more sophisticated computational methods.

Subjects

Genes; Genetic Techniques; Human Genome Project; Base Sequence; DNA/chemistry; DNA/genetics; Exons; Humans; RNA Splicing; Software

10.

Evaluation of gene structure prediction programs.

Burset, M; Guigó, R.

Genomics ; 34(3): 353-67, 1996 Jun 15.

Article in English | MEDLINE | ID: mdl-8786136

ABSTRACT

We evaluate a number of computer programs designed to predict the structure of protein coding genes in genomic DNA sequences. Computational gene identification is set to play an increasingly important role in the development of the genome projects, as emphasis turns from mapping to large-scale sequencing. The evaluation presented here serves both to assess the current status of the problem and to identify the most promising approaches to ensure further progress. The programs analyzed were uniformly tested on a large set of vertebrate sequences with simple gene structure, and several measures of predictive accuracy were computed at the nucleotide, exon, and protein product levels. The results indicated that the predictive accuracy of the programs analyzed was lower than originally found. The accuracy was even lower when considering only those sequences that had recently been entered and that did not show any similarity to previously entered sequences. This indicates that the programs are overly dependent on the particularities of the examples they learn from. For most of the programs, accuracy in this test set ranged from 0.60 to 0.70 as measured by the Correlation Coefficient (where 1.0 corresponds to a perfect prediction and 0.0 is the value expected for a random prediction), and the average percentage of exons exactly identified was less than 50%. Only those programs including protein sequence database searches showed substantially greater accuracy. The accuracy of the programs was severely affected by relatively high rates of sequence errors. Since the set on which the programs were tested included only relatively short sequences with simple gene structure, the accuracy of the programs is likely to be even lower when used for large uncharacterized genomic sequences with complex structure. 
While in such cases, programs currently available may still be of great use in pinpointing the regions likely to contain exons, they are far from being powerful enough to elucidate its genomic structure completely.

Subjects

DNA/chemistry; Genes; Models, Genetic; Proteins/genetics; Software; Alternative Splicing; Animals; DNA/genetics; Exons; Humans; Information Systems; Mathematics; Probability; Protein Biosynthesis; Proteins/chemistry; Pseudogenes; Reproducibility of Results; Sensitivity and Specificity; Vertebrates
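
The abstract of this record reports accuracy as a Correlation Coefficient in which 1.0 is a perfect prediction and 0.0 is the value expected for a random one. The sketch below shows a nucleotide-level coefficient with that behaviour, using the standard correlation between two binary labelings (coding or not coding at each position). The abstract does not give the paper's exact definition, so the formula and the function name are assumptions.

```python
import math
from typing import Sequence


def nucleotide_cc(actual: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Correlation between actual and predicted per-nucleotide coding labels."""
    if len(actual) != len(predicted):
        raise ValueError("label sequences must have the same length")
    tp = sum(a and p for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    # The coefficient is undefined when either labeling is constant;
    # returning 0.0 in that case is a convention chosen for this sketch.
    return (tp * tn - fn * fp) / denom if denom else 0.0
```

A prediction that marks every coding base correctly and nothing else scores 1.0, while one that inverts every label scores -1.0.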

11.

GeneID in Drosophila.

Parra, G; Blanco, E; Guigó, R.

Genome Res ; 10(4): 511-5, 2000 Apr.

Article in English | MEDLINE | ID: mdl-10779490

ABSTRACT

GeneID is a program to predict genes in anonymous genomic sequences designed with a hierarchical structure. In the first step, splice sites, and start and stop codons are predicted and scored along the sequence using position weight matrices (PWMs). In the second step, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov model for coding DNA. In the last step, from the set of predicted exons, the gene structure is assembled, maximizing the sum of the scores of the assembled exons. In this paper we describe the derivation of PWMs for sites, and the Markov model of coding DNA in Drosophila melanogaster. We also compare other models of coding DNA with the Markov model. Finally, we present and discuss the results obtained when GeneID is used to predict genes in the Adh region. These results show that the accuracy of GeneID predictions is currently comparable to that of other existing tools but that GeneID is likely to be more efficient in terms of speed and memory usage.

Subjects

Databases, Factual; Drosophila melanogaster/genetics; Genes, Insect/genetics; Software; Alcohol Dehydrogenase/genetics; Algorithms; Animals; Computational Biology; Drosophila melanogaster/enzymology
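
The first two steps described in the abstract of this record (scoring sites with position weight matrices, then scoring an exon as the sum of its site scores plus a coding log-likelihood ratio) can be sketched as follows. This is an illustration only: the PWM representation, the function names, and the idea of a score threshold are assumptions, and none of this reproduces GeneID's actual code or parameters.

```python
from typing import Dict, List, Tuple

# Hypothetical PWM: one dict per position, mapping base -> log-odds score.
PWM = List[Dict[str, float]]


def score_site(site: str, pwm: PWM) -> float:
    """Sum of the per-position log-odds scores for a candidate site."""
    if len(site) != len(pwm):
        raise ValueError("site length must match PWM width")
    return sum(column[base] for base, column in zip(site, pwm))


def scan_sites(seq: str, pwm: PWM, threshold: float) -> List[Tuple[int, float]]:
    """Positions and scores of all windows scoring at least threshold."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        score = score_site(seq[i:i + width], pwm)
        if score >= threshold:
            hits.append((i, score))
    return hits


def exon_score(first_site: float, last_site: float, coding_llr: float) -> float:
    """Exon score as the two defining site scores plus the coding term."""
    return first_site + last_site + coding_llr
```

With a toy two-column matrix that favours `A` then `G`, `scan_sites` reports each `AG` occurrence in a sequence together with its score; the third step of the abstract, assembling exons into a gene, is a separate optimisation over these exon scores.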

12.

gff2ps: visualizing genomic annotations.

Abril, J F; Guigó, R.

Bioinformatics ; 16(8): 743-4, 2000 Aug.

Article in English | MEDLINE | ID: mdl-11099262

ABSTRACT

gff2ps is a program for visualizing annotations of genomic sequences. The program takes the annotated features on a genomic sequence in GFF format as input, and produces a visual output in PostScript. While it can be used in a very simple way, it also allows for a great degree of customization through a number of options and/or customization files.

Subjects

Sequence Analysis, DNA/methods; Sequence Analysis, RNA/methods; Software; Computational Biology; Genome

13.

Genome sequence comparisons: hurdles in the fast lane to functional genomics.

Wiehe, T; Guigó, R; Miller, W.

Brief Bioinform ; 1(4): 381-8, 2000 Nov.

Article in English | MEDLINE | ID: mdl-11465055

ABSTRACT

An important computational technique for extracting the wealth of information hidden in human genomic sequence data is to compare the sequence with that from the corresponding region of the mouse genome, looking for segments that are conserved over evolutionary time. Moreover, the approach generalises to comparison of sequences from any two related species. The underlying rationale (which is abundantly confirmed by observation) is that a random mutation in a functional region is usually deleterious to the organism, and hence unlikely to become fixed in the population, whereas mutations in a non-functional region are free to accumulate over time. The potential value of this approach is so attractive that the public and private projects to sequence the human genome are now turning to sequencing the mouse, and you will soon be able to compare the human and mouse sequences of your favourite genomic region. We are currently witnessing an explosion of computer tools for comparative analysis of two genomic sequences. Here the capabilities of two new network servers for comparing genomic sequences from any pair of closely related species are sketched. The Syntenic Gene Prediction Program SGP-I utilises sequence comparisons to enhance the ability to locate protein coding segments in genomic data. PipMaker attempts to determine all conserved genomic regions, regardless of their function.

Subjects

Computational Biology; Genome; Genomics/statistics & numerical data; Sequence Alignment/statistics & numerical data; Animals; Conserved Sequence; Evolution, Molecular; Genome, Human; Humans; Interleukin-13/genetics; Interleukin-4/genetics; Mice; Software

14.

Estimation of protein coding density in a corpus of DNA sequence data.

Fickett, J W; Guigó, R.

Nucleic Acids Res ; 21(12): 2837-44, 1993 Jun 25.

Article in English | MEDLINE | ID: mdl-8332493

ABSTRACT

A number of experimental methods have been reported for estimating the number of genes in a genome, or the closely related coding density of a genome, defined as the fraction of base pairs in codons. Recently, DNA sequence data representative of the genome as a whole have become available for several organisms, making the problem of estimating coding density amenable to sequence analytic methods. Estimates of coding density for a single genome vary widely, so that methods with characterized error bounds have become increasingly desirable. We present a method to estimate the protein coding density in a corpus of DNA sequence data, in which a 'coding statistic' is calculated for a large number of windows of the sequence under study, and the distribution of the statistic is decomposed into two normal distributions, assumed to be the distributions of the coding statistic in the coding and noncoding fractions of the sequence windows. The accuracy of the method is evaluated using known data and application is made to the yeast chromosome III sequence and to C. elegans cosmid sequences. It can also be applied to fragmentary data, for example a collection of short sequences determined in the course of STS mapping.

Subjects

Base Composition; Codon; DNA/chemistry; Proteins/genetics; Animals; Caenorhabditis elegans/genetics; Cosmids; DNA/analysis; Genes, Fungal; Humans; Sequence Analysis, DNA; Statistics as Topic
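
The abstract of this record describes decomposing the distribution of a per-window coding statistic into two normal distributions, one for coding and one for noncoding windows. The sketch below fits such a two-component mixture with expectation-maximisation and reads the weight of the higher-mean component as the fraction of coding windows. The abstract does not say how the decomposition was performed, so the use of EM, the assumption that coding windows score higher, and all names and simulated values here are assumptions.

```python
import math
import random
from typing import List, Tuple


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_normals(xs: List[float], iters: int = 100) -> Tuple[float, List[float], List[float]]:
    """EM fit of a two-normal mixture; returns (weight of component 1, means, sds)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n) or 1.0
    mu = [mean - sd, mean + sd]  # start the components one sd apart
    sigma = [sd, sd]
    w = 0.5
    for _ in range(iters):
        # E-step: probability that each value came from component 1.
        resp = []
        for x in xs:
            p0 = (1.0 - w) * normal_pdf(x, mu[0], sigma[0])
            p1 = w * normal_pdf(x, mu[1], sigma[1])
            total = p0 + p1
            resp.append(p1 / total if total > 0.0 else 0.5)
        n1 = sum(resp)
        n0 = n - n1
        if n0 < 1e-9 or n1 < 1e-9:
            break  # one component has vanished; keep the last estimate
        # M-step: re-estimate means, standard deviations, and the weight.
        mu = [
            sum((1.0 - r) * x for r, x in zip(resp, xs)) / n0,
            sum(r * x for r, x in zip(resp, xs)) / n1,
        ]
        sigma = [
            max(math.sqrt(sum((1.0 - r) * (x - mu[0]) ** 2 for r, x in zip(resp, xs)) / n0), 1e-6),
            max(math.sqrt(sum(r * (x - mu[1]) ** 2 for r, x in zip(resp, xs)) / n1), 1e-6),
        ]
        w = n1 / n
    return w, mu, sigma


def coding_density(xs: List[float]) -> float:
    """Estimated fraction of windows in the higher-mean (assumed coding) component."""
    w, mu, _ = fit_two_normals(xs)
    return w if mu[1] >= mu[0] else 1.0 - w


def simulate_windows(n_coding: int, n_noncoding: int, seed: int = 0) -> List[float]:
    """Toy statistic values: coding ~ N(3, 1), noncoding ~ N(0, 1) (made-up parameters)."""
    rng = random.Random(seed)
    coding = [rng.gauss(3.0, 1.0) for _ in range(n_coding)]
    noncoding = [rng.gauss(0.0, 1.0) for _ in range(n_noncoding)]
    return coding + noncoding
```

On simulated data such as `simulate_windows(600, 1400)`, `coding_density` should land near the true window fraction of 0.3. The estimate is a fraction of windows, which only approximates the fraction of base pairs in codons that the abstract defines as coding density.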

15.

Automatic evaluation of protein sequence functional patterns.

Guigó, R; Johansson, A; Smith, T F.

Comput Appl Biosci ; 7(3): 309-15, 1991 Jul.

Article in English | MEDLINE | ID: mdl-1913211

ABSTRACT

A procedure that automatically provides an evaluation of the diagnostic ability of a protein sequence functional pattern is described. The procedure relies on the identification of the closest definable set in terms of a (protein sequence) database functional annotation to the set of database instances containing a given pattern. Assuming annotation correctness and completeness in the protein sequence database, the degree of statistical association between these sets provides an appropriate measure of the diagnostic ability of the pattern. An experimental implementation of the procedure, using the NBRF/PIR protein database, has been applied to a diverse collection of published sequence patterns. Results obtained reveal that frequently it is not possible to define (in NBRF/PIR database terminology) the set of database instances containing a given pattern, suggesting either lack of pattern diagnostic ability or protein database annotation incompleteness and/or inconsistencies.

Subjects

Pattern Recognition, Automated; Proteins/chemistry; Algorithms; Databases, Factual; Electronic Data Processing; Gene Library; Methods

16.

Reconstruction of ancient molecular phylogeny.

Guigó, R; Muchnik, I; Smith, T F.

Mol Phylogenet Evol ; 6(2): 189-213, 1996 Oct.

Article in English | MEDLINE | ID: mdl-8899723

ABSTRACT

Support for contradictory phylogenies is often obtained when molecular sequence data from different genes is used to reconstruct phylogenetic histories. Contradictory phylogenies can result from many data anomalies including unrecognized paralogy. Paralogy, defined as the reconstruction of a phylogenetic tree from a mixture of genes generated by duplications, has generally not been formally included in phylogenetic reconstructions. Here we undertake the task of reconstructing a single most likely evolutionary relationship among a range of taxa from a large set of apparently inconsistent gene trees. Under the assumption that differences among gene trees can be explained by gene duplications, and consequent losses, we have developed a method to obtain the global phylogeny minimizing the total number of postulated duplications and losses and to trace back such individual gene duplications to global genome duplications. We have used this method to infer the most likely phylogenetic relationship among 16 major higher eukaryotic taxa from the sequences of 53 different genes. Only five independent genome duplication events need to be postulated in order to explain the inconsistencies among these trees.

Subjects

Phylogeny; Algorithms; Animals; Biological Evolution; Genes; Models, Biological; Multigene Family; Species Specificity

17.

Three-dimensional modelling of human cytochrome P450 1A2 and its interaction with caffeine and MeIQ.

Lozano, J J; López-de-Briñas, E; Centeno, N B; Guigó, R; Sanz, F.

J Comput Aided Mol Des ; 11(4): 395-408, 1997 Jul.

Article in English | MEDLINE | ID: mdl-9334905

ABSTRACT

The three-dimensional modelling of proteins is a useful tool to fill the gap between the number of sequenced proteins and the number of experimentally known 3D structures. However, when the degree of homology between the protein and the available 3D templates is low, model building becomes a difficult task and the reliability of the results depends critically on the correctness of the sequence alignment. For this reason, we have undertaken the modelling of human cytochrome P450 1A2 starting by a careful analysis of several sequence alignment strategies (multiple sequence alignments and the TOPITS threading technique). The best results were obtained using TOPITS followed by a manual refinement to avoid unlikely gaps. Because TOPITS uses secondary structure predictions, several methods that are available for this purpose (Levin, Gibrat, DPM, NnPredict, PHD, SOPM and NNSP) have also been evaluated on cytochromes P450 with known 3D structures. More reliable predictions on alpha-helices have been obtained with PHD, which is the method implemented in TOPITS. Thus, a 3D model for human cytochrome P450 1A2 has been built using the known crystal coordinates of P450 BM3 as the template. The model was refined using molecular mechanics computations. The model obtained shows a consistent location of the substrate recognition segments previously postulated for the CYP2 family members. The interaction of caffeine and a carcinogenic aromatic amine (MeIQ), which are characteristic P450 1A2 substrates, has been investigated. The substrates were solvated taking into account their molecular electrostatic potential distributions. The docking of the solvated substrates in the active site of the model was explored with the AUTODOCK programme, followed by molecular mechanics optimisation of the most interesting complexes. 
Stable complexes were obtained that could explain the oxidation of the considered substrates by cytochrome P450 1A2 and could offer an insight into the role played by water molecules.

Subjects

Bacterial Proteins; Caffeine/metabolism; Computer Simulation; Cytochrome P-450 CYP1A2/chemistry; Cytochrome P-450 CYP1A2/metabolism; Cytochrome P-450 Enzyme System/chemistry; Mixed Function Oxygenases/chemistry; Models, Molecular; Protein Conformation; Quinolines/metabolism; Amino Acid Sequence; Binding Sites; Caffeine/chemistry; Conserved Sequence; Humans; Molecular Sequence Data; NADPH-Ferrihemoprotein Reductase; Protein Structure, Secondary; Quinolines/chemistry; Sequence Alignment; Sequence Homology, Amino Acid; Software

18.

An assessment of gene prediction accuracy in large DNA sequences.

Guigó, R; Agarwal, P; Abril, J F; Burset, M; Fickett, J W.

Genome Res ; 10(10): 1631-42, 2000 Oct.

Article in English | MEDLINE | ID: mdl-11042160

ABSTRACT

One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, the accuracy and completeness of this data set is of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets in which the programs are tested are short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the approximately 200kb long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to quantitatively estimate this decline. However, the specificities of these techniques are still rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments. 
Our experiments suggest that though gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.

Subjects

Computational Biology/methods; DNA/chemistry; DNA/genetics; Genes/genetics; Base Composition; Chromosomes, Artificial/chemistry; Chromosomes, Artificial/genetics; Humans; Reproducibility of Results; Software

19.

Exon structure conservation despite low sequence similarity: a relic of dramatic events in evolution?

Betts, M J; Guigó, R; Agarwal, P; Russell, R B.

EMBO J ; 20(19): 5354-60, 2001 Oct 01.

Article in English | MEDLINE | ID: mdl-11574467

ABSTRACT

The evolutionary significance of introns remains a mystery. The current availability of several complete eukaryotic genomes permits new studies to probe the possible function of these peculiar genomic features. Here we investigate the degree to which gene structure (intron position, phase and length) is conserved between homologous protein domains. We find that for certain extracellular-signalling and nuclear domains, gene structures are similar even when protein sequence similarity is low or not significant and sequences can only be aligned with a knowledge of protein tertiary structure. In contrast, other domains, including most intracellular signalling modules, show little gene structure conservation. Intriguingly, many domains with conserved gene structures, such as cytokines, are involved in similar biological processes, such as the immune response. This suggests that gene structure conservation may be a record of key events in evolution, such as the origin of the vertebrate immune system or the duplication of nuclear receptors in nematodes. The results suggest ways to detect new and potentially very remote homologues, and to construct phylogenies for proteins with limited sequence similarity.

Subjects

Evolution, Molecular; Exons/genetics; Introns/genetics; Amino Acid Sequence; Cytokines/genetics; GTPase-Activating Proteins/genetics; Genes; Models, Molecular; Protein Structure, Secondary; Protein Structure, Tertiary

20.

SGP-1: prediction and validation of homologous genes based on sequence alignments.

Wiehe, T; Gebauer-Jung, S; Mitchell-Olds, T; Guigó, R.

Genome Res ; 11(9): 1574-83, 2001 Sep.

Article in English | MEDLINE | ID: mdl-11544202

ABSTRACT

Conventional methods of gene prediction rely on the recognition of DNA-sequence signals, the coding potential or the comparison of a genomic sequence with a cDNA, EST, or protein database. Reasons for limited accuracy in many circumstances are species-specific training and the incompleteness of reference databases. Lately, comparative genome analysis has attracted increasing attention. Several analysis tools that are based on human/mouse comparisons are already available. Here, we present a program for the prediction of protein-coding genes, termed SGP-1 (Syntenic Gene Prediction), which is based on the similarity of homologous genomic sequences. In contrast to most existing tools, the accuracy of SGP-1 depends little on species-specific properties such as codon usage or the nucleotide distribution. SGP-1 may therefore be applied to nonstandard model organisms in vertebrates as well as in plants, without the need for extensive parameter training. In addition to predicting genes in large-scale genomic sequences, the program may be useful to validate gene structure annotations from databases. To this end, SGP-1 output also contains comparisons between predicted and annotated gene structures in HTML format. The program can be accessed via a Web server at http://soft.ice.mpg.de/sgp-1. The source code, written in ANSI C, is available on request from the authors.

Subjects

Algorithms; Genes/genetics; Sequence Alignment/methods; Sequence Homology, Nucleic Acid; Animals; Brassica/genetics; Codon/genetics; Databases, Factual; Evolution, Molecular; Humans; Mice; RNA Splice Sites/genetics; Rats
