Results 1 - 20 of 25
1.
Sci Rep ; 13(1): 17203, 2023 10 11.
Article in English | MEDLINE | ID: mdl-37821494

ABSTRACT

Invasive plant pathogenic fungi have a global impact, with devastating economic and environmental effects on crops and forests. Biosurveillance, a critical component of threat mitigation, requires risk prediction based on fungal lifestyles and traits. Recent studies have revealed distinct genomic patterns associated with specific groups of plant pathogenic fungi. We sought to establish whether these phytopathogenic genomic patterns hold across diverse taxonomic and ecological groups from the Ascomycota and Basidiomycota, and furthermore, if those patterns can be used in a predictive capacity for biosurveillance. Using a supervised machine learning approach that integrates phylogenetic and genomic data, we analyzed 387 fungal genomes to test a proof-of-concept for the use of genomic signatures in predicting fungal phytopathogenic lifestyles and traits during biosurveillance activities. Our machine learning feature sets were derived from genome annotation data of carbohydrate-active enzymes (CAZymes), peptidases, secondary metabolite clusters (SMCs), transporters, and transcription factors. We found that machine learning could successfully predict fungal lifestyles and traits across taxonomic groups, with the best predictive performance coming from feature sets comprising CAZyme, peptidase, and SMC data. While phylogeny was an important component in most predictions, the inclusion of genomic data improved prediction performance for every lifestyle and trait tested. Plant pathogenicity was one of the best-predicted traits, showing the promise of predictive genomics for biosurveillance applications. Furthermore, our machine learning approach revealed expansions in the number of genes from specific CAZyme and peptidase families in the genomes of plant pathogens compared to non-phytopathogenic genomes (saprotrophs, endo- and ectomycorrhizal fungi). 
Such genomic feature profiles give insight into the evolution of fungal phytopathogenicity and could be useful to predict the risks of unknown fungi in future biosurveillance activities.


Subject(s)
Ascomycota; Genome, Fungal; Humans; Phylogeny; Genome, Fungal/genetics; Ascomycota/genetics; Genomics; Peptide Hydrolases/genetics; Life Style; Machine Learning
2.
Stud Mycol ; 96: 141-153, 2020 Jun.
Article in English | MEDLINE | ID: mdl-32206138

ABSTRACT

Dothideomycetes is the largest class of kingdom Fungi and comprises an incredible diversity of lifestyles, many of which have evolved multiple times. Plant pathogens represent a major ecological niche of the class Dothideomycetes and they are known to infect most major food crops and feedstocks for biomass and biofuel production. Studying the ecology and evolution of Dothideomycetes has significant implications for our fundamental understanding of fungal evolution, their adaptation to stress and host specificity, and practical implications with regard to the effects of climate change and on the food, feed, and livestock elements of the agro-economy. In this study, we present the first large-scale, whole-genome comparison of 101 Dothideomycetes introducing 55 newly sequenced species. The availability of whole-genome data produced a high-confidence phylogeny leading to reclassification of 25 organisms, provided a clearer picture of the relationships among the various families, and indicated that pathogenicity evolved multiple times within this class. We also identified gene family expansions and contractions across the Dothideomycetes phylogeny linked to ecological niches providing insights into genome evolution and adaptation across this group. Using machine-learning methods we classified fungi into lifestyle classes with >95 % accuracy and identified a small number of gene families that positively correlated with these distinctions. This can become a valuable tool for genome-based prediction of species lifestyle, especially for rarely seen and poorly studied species.

3.
Stud Mycol ; 91: 61-78, 2018 Sep.
Article in English | MEDLINE | ID: mdl-30425417

ABSTRACT

The fungal kingdom is too large to be discovered exclusively by classical genetics. The access to omics data opens a new opportunity to study the diversity within the fungal kingdom and how adaptation to new environments shapes fungal metabolism. Genomes are the foundation of modern science but their quality is crucial when analysing omics data. In this study, we demonstrate how one gold-standard genome can improve functional prediction across closely related species to be able to identify key enzymes, reactions and pathways with the focus on primary carbon metabolism. Based on this approach we identified alternative genes encoding various steps of the different sugar catabolic pathways, and as such provided leads for functional studies into this topic. We also revealed significant diversity with respect to genome content, although this did not always correlate to the ability of the species to use the corresponding sugar as a carbon source.

5.
Nature ; 452(7183): 88-92, 2008 Mar 06.
Article in English | MEDLINE | ID: mdl-18322534

ABSTRACT

Mycorrhizal symbioses--the union of roots and soil fungi--are universal in terrestrial ecosystems and may have been fundamental to land colonization by plants. Boreal, temperate and montane forests all depend on ectomycorrhizae. Identification of the primary factors that regulate symbiotic development and metabolic activity will therefore open the door to understanding the role of ectomycorrhizae in plant development and physiology, allowing the full ecological significance of this symbiosis to be explored. Here we report the genome sequence of the ectomycorrhizal basidiomycete Laccaria bicolor (Fig. 1) and highlight gene sets involved in rhizosphere colonization and symbiosis. This 65-megabase genome assembly contains approximately 20,000 predicted protein-encoding genes and a very large number of transposons and repeated sequences. We detected unexpected genomic features, most notably a battery of effector-type small secreted proteins (SSPs) with unknown function, several of which are only expressed in symbiotic tissues. The most highly expressed SSP accumulates in the proliferating hyphae colonizing the host root. The ectomycorrhizae-specific SSPs probably have a decisive role in the establishment of the symbiosis. The unexpected observation that the genome of L. bicolor lacks carbohydrate-active enzymes involved in degradation of plant cell walls, but maintains the ability to degrade non-plant cell wall polysaccharides, reveals the dual saprotrophic and biotrophic lifestyle of the mycorrhizal fungus that enables it to grow within both soil and living plant roots. The predicted gene inventory of the L. bicolor genome, therefore, points to previously unknown mechanisms of symbiosis operating in biotrophic mycorrhizal fungi. 
The availability of this genome provides an unparalleled opportunity to develop a deeper understanding of the processes by which symbionts interact with plants within their ecosystem to perform vital functions in the carbon and nitrogen cycles that are fundamental to sustainable plant productivity.


Subject(s)
Basidiomycota/genetics; Basidiomycota/physiology; Genome, Fungal/genetics; Mycorrhizae/genetics; Mycorrhizae/physiology; Plant Roots/microbiology; Symbiosis/physiology; Abies/microbiology; Abies/physiology; Basidiomycota/enzymology; Fungal Proteins/classification; Fungal Proteins/genetics; Fungal Proteins/metabolism; Gene Expression Regulation; Genes, Fungal/genetics; Hyphae/genetics; Hyphae/metabolism; Mycorrhizae/enzymology; Plant Roots/physiology; Symbiosis/genetics
6.
Science ; 313(5793): 1596-604, 2006 Sep 15.
Article in English | MEDLINE | ID: mdl-16973872

ABSTRACT

We report the draft genome of the black cottonwood tree, Populus trichocarpa. Integration of shotgun sequence assembly with genetic mapping enabled chromosome-scale reconstruction of the genome. More than 45,000 putative protein-coding genes were identified. Analysis of the assembled genome revealed a whole-genome duplication event; about 8000 pairs of duplicated genes from that event survived in the Populus genome. A second, older duplication event is indistinguishably coincident with the divergence of the Populus and Arabidopsis lineages. Nucleotide substitution, tandem gene duplication, and gross chromosomal rearrangement appear to proceed substantially more slowly in Populus than in Arabidopsis. Populus has more protein-coding genes than Arabidopsis, ranging on average from 1.4 to 1.6 putative Populus homologs for each Arabidopsis gene. However, the relative frequency of protein domains in the two genomes is similar. Overrepresented exceptions in Populus include genes associated with lignocellulosic wall biosynthesis, meristem development, disease resistance, and metabolite transport.


Subject(s)
Gene Duplication; Genome, Plant; Populus/genetics; Sequence Analysis, DNA; Arabidopsis/genetics; Chromosome Mapping; Computational Biology; Evolution, Molecular; Expressed Sequence Tags; Gene Expression; Genes, Plant; Oligonucleotide Array Sequence Analysis; Phylogeny; Plant Proteins/chemistry; Plant Proteins/genetics; Polymorphism, Single Nucleotide; Populus/growth & development; Populus/metabolism; Protein Structure, Tertiary; RNA, Plant/analysis; RNA, Untranslated/analysis
7.
Biochem Soc Trans ; 28(2): 269-75, 2000 Feb.
Article in English | MEDLINE | ID: mdl-10816141

ABSTRACT

The CATH database of protein structures contains approximately 18000 domains organized according to their (C)lass, (A)rchitecture, (T)opology and (H)omologous superfamily. Relationships between evolutionarily related structures (homologues) within the database have been used to test the sensitivity of various sequence search methods in order to identify relatives in GenBank and other sequence databases. Subsequent application of the most sensitive and efficient algorithms, gapped BLAST and the profile-based method Position-Specific Iterated Basic Local Alignment Search Tool (PSI-BLAST), could be used to assign structural data to between 22 and 36% of microbial genomes in order to improve functional annotation and enhance understanding of biological mechanism. However, on a cautionary note, an analysis of functional conservation within fold groups and homologous superfamilies in the CATH database revealed that, whilst function was conserved in nearly 55% of enzyme families, function had diverged considerably in some highly populated families. In these families, functional properties should be inherited far more cautiously and the probable effects of substitutions in key functional residues carefully assessed.


Subject(s)
Databases, Factual; Genome; Algorithms; Protein Conformation; Protein Structure, Tertiary; Structure-Activity Relationship
8.
Genome Res ; 10(4): 516-22, 2000 Apr.
Article in English | MEDLINE | ID: mdl-10779491

ABSTRACT

Ab initio gene identification in the genomic sequence of Drosophila melanogaster was obtained using (human gene predictor) and Fgenesh programs that have organism-specific parameters for human, Drosophila, plants, yeast, and nematode. We did not use information about cDNA/EST in most predictions to model a real situation for finding new genes because information about complete cDNA is often absent or based on very small partial fragments. We investigated the accuracy of gene prediction on different levels and designed several schemes to predict an unambiguous set of genes (annotation CGG1), a set of reliable exons (annotation CGG2), and the most complete set of exons (annotation CGG3). For 49 genes whose protein products have clear homologs in protein databases, predictions were recomputed by the Fgenesh+ program. The first annotation serves as the optimal computational description of a new sequence to be presented in a database. Reliable exons from the second annotation serve as good candidates for selecting the PCR primers for experimental work for gene structure verification. Our results show that we can identify approximately 90% of coding nucleotides with 20% false positives. At the exon level we accurately predicted 65% of exons and 89% including overlapping exons with 49% false positives. Optimizing accuracy of prediction, we designed a gene identification scheme using Fgenesh, which provided sensitivity (Sn) = 98% and specificity (Sp) = 86% at the base level, Sn = 81% (97% including overlapping exons) and Sp = 58% at the exon level and Sn = 72% and Sp = 39% at the gene level (estimating sensitivity on std1 set and specificity on std3 set). In general, these results showed that computational gene prediction can be a reliable tool for annotating new genomic sequences, giving accurate information on 90% of coding sequences with 14% false positives. However, exact gene prediction (especially at the gene level) needs additional improvement using gene prediction algorithms. The program was also tested for predicting genes of human Chromosome 22 (the last variant of Fgenesh can analyze the whole chromosome sequence). This analysis has demonstrated that 88% of the manually annotated exons in Chromosome 22 were among the ab initio predicted exons. The suite of gene identification programs is available through the WWW server of the Computational Genomics Group at http://genomic.sanger.ac.uk/gf.html.
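
The Sn and Sp figures quoted in this abstract can be illustrated with a small calculation. The following is a minimal sketch, assuming the convention common in the gene-prediction literature, where Sn = TP / (TP + FN) and Sp = TP / (TP + FP); the abstract itself does not spell out its definitions. The function names and the toy exon coordinates are hypothetical and are not taken from the paper.

```python
def positions(exons):
    """Expand inclusive (start, end) exon intervals into a set of base positions."""
    return {p for start, end in exons for p in range(start, end + 1)}


def base_level_accuracy(annotated, predicted):
    """Return (Sn, Sp) at the base level for two sets of coding positions.

    Assumed definitions (not confirmed by the abstract):
    Sn = TP / (TP + FN), Sp = TP / (TP + FP).
    """
    annotated, predicted = set(annotated), set(predicted)
    true_pos = len(annotated & predicted)
    sn = true_pos / len(annotated) if annotated else 0.0
    sp = true_pos / len(predicted) if predicted else 0.0
    return sn, sp


# Toy data: 20 annotated coding bases, 20 predicted bases, 18 shared.
annotated = positions([(10, 19), (40, 49)])
predicted = positions([(12, 19), (40, 51)])
sn, sp = base_level_accuracy(annotated, predicted)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

Exon-level and gene-level figures would need an exact-match comparison of interval boundaries rather than a set of positions, which this sketch does not attempt.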


Subject(s)
Algorithms; DNA/genetics; Drosophila melanogaster/genetics; Genes, Insect/genetics; Software; Animals; Computational Biology/methods; Databases, Factual; Humans
9.
Protein Sci ; 8(4): 771-7, 1999 Apr.
Article in English | MEDLINE | ID: mdl-10211823

ABSTRACT

We describe the results of a procedure for maximizing the number of sequences that can be reliably linked to a protein of known three-dimensional structure. Unlike other methods, which try to increase sensitivity through the use of fold recognition software, we only use conventional sequence alignment tools, but apply them in a manner that significantly increases the number of relationships detected. We analyzed 11 genomes and found that, depending on the genome, between 23 and 32% of the ORFs had significant matches to proteins of known structure. In all cases, the aligned region consisted of either >100 residues or >50% of the smaller sequence. Slightly higher percentages could be attained if smaller motifs were also included. This is significantly higher than most previously reported methods, even those that have a fold-recognition component. We survey the biochemical and structural characteristics of the most frequently occurring proteins, and discuss the extent to which alignment methods can realistically assign function to gene products.


Subject(s)
Protein Conformation; Sequence Analysis, DNA/methods; Algorithms; Computer Simulation; Databases, Factual; Protein Structure, Tertiary; Sensitivity and Specificity; Sequence Alignment/methods
10.
Protein Eng ; 12(2): 95-100, 1999 Feb.
Article in English | MEDLINE | ID: mdl-10195280

ABSTRACT

Using data from the CATH structure classification, we have assessed the blastp, fasta, smith-waterman and gapped-blast algorithms, developed a portable normalization scheme and identified safe thresholds for database searching. Of the four methods assessed, fasta, smith-waterman and gapped-blast perform similarly, whereas the sensitivity of blastp was much lower. Introduction of an intermediate sequence search substantially improved the results. When tested on a set of relationships that could not be identified by blastp, intermediate sequences were able to find double the number of relationships identified by the smith-waterman algorithm alone. However, we found that the benefit of using intermediates varied considerably between each family and depended not only on the number of available sequences, but also their diversity. In an attempt to increase sensitivity further, a multiple intermediate sequence search (MISS) procedure was developed. When assessed on 1906 cases from a wide range of homologous families that could not be detected by the previous approaches, MISS was able to identify 241 additional relationships. MISS uses the full extent of sequence diversity to detect additional relationships, but does not consider any structure-specific information. For this reason, it is more generally applicable than fold recognition and threading methods, which require a library of known structures.


Subject(s)
Databases, Factual; Sequence Alignment/methods; Sequence Homology, Amino Acid; Computer Simulation; Models, Statistical; Protein Conformation; Sensitivity and Specificity
11.
Nucleic Acids Res ; 27(1): 248-50, 1999 Jan 01.
Article in English | MEDLINE | ID: mdl-9847192

ABSTRACT

INFOGENE is a database of known and predicted gene structures with descriptions of basic functional signals and gene components. It makes it possible to create compilations of sequences with a given gene feature, as well as to accumulate and analyze predicted genes in finished and unfinished sequences from genome sequencing projects. Protein sequence similarity searches in the database of predicted proteins are offered through the BLASTP program. INFOGENE is implemented under the Sequence Retrieval System, which provides useful links with other informational databases. The database is available through the WWW server of the Computational Genomics Group at http://genomic.sanger.ac.uk/db.html


Subject(s)
Databases, Factual; Genes; Genome; Proteins/genetics; Sequence Analysis, DNA; Animals; Arabidopsis/genetics; Base Sequence; Drosophila/genetics; Exons/genetics; Expressed Sequence Tags; Human Genome Project; Humans; Information Storage and Retrieval; Internet; Mice; Proteins/chemistry; Sequence Homology, Amino Acid; Software
12.
Bioinformatics ; 14(5): 384-90, 1998 Jun.
Article in English | MEDLINE | ID: mdl-9682051

ABSTRACT

MOTIVATION: In cDNA sequencing projects, it is vital to know whether the protein coding region of a sequence is complete, or whether errors have occurred during library construction. Here we present a linear discriminant approach that predicts this completeness by estimating the probability of each ATG being the initiation codon. RESULTS: Because of the current shortage of full-length cDNA data on which to base this work, tests were performed on a non-redundant set of 660 initiation codon-containing DNA sequences that had been conceptually spliced into mRNA/cDNA. We also used an edited set of the same sequences that only contained the region following the initiation codon as a negative control. Using the criterion that only a single prediction is allowed for each sequence, a cut-off was selected at which discrimination of both positive and negative sets was equal. At this cut-off, 67% of each set could be correctly distinguished, with the correct ATG codon also being identified in the positive set. Reliability could be increased further by raising the cut-off or including homologues, the relative merits of which are discussed. AVAILABILITY: The prediction program, called ATGpr, and other data are available at http://www.hri.co.jp/atgpr CONTACT: swintech@hri.co.jp
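
The cut-off selection described in this abstract (one prediction per sequence, with the cut-off chosen so that the positive and negative sets are discriminated equally well) can be sketched as follows. This is an illustration under stated assumptions, not the ATGpr implementation: each sequence is represented by a single hypothetical score (for example, the score of its best-scoring ATG), and the function name and the toy scores are invented for the example.

```python
def equal_accuracy_cutoff(pos_scores, neg_scores):
    """Pick the cut-off at which both sets are discriminated most equally.

    A positive sequence counts as correct when its score is >= cut-off,
    a negative sequence when its score is < cut-off. Returns
    (cutoff, fraction_of_positives_correct, fraction_of_negatives_correct).
    """
    best = None
    for cutoff in sorted(set(pos_scores) | set(neg_scores)):
        pos_ok = sum(s >= cutoff for s in pos_scores) / len(pos_scores)
        neg_ok = sum(s < cutoff for s in neg_scores) / len(neg_scores)
        gap = abs(pos_ok - neg_ok)
        if best is None or gap < best[0]:
            best = (gap, cutoff, pos_ok, neg_ok)
    return best[1], best[2], best[3]


# Hypothetical per-sequence scores; not data from the paper.
pos = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]      # sequences containing the initiation codon
neg = [0.6, 0.5, 0.35, 0.25, 0.1, 0.05]   # edited sequences lacking it
cutoff, pos_ok, neg_ok = equal_accuracy_cutoff(pos, neg)
print(cutoff, round(pos_ok, 2), round(neg_ok, 2))
```

Raising the cut-off above this balance point trades coverage of the positive set for fewer false calls on the negative set, which is the trade-off the abstract alludes to when it mentions increasing reliability.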


Subject(s)
DNA, Complementary/genetics; Proteins/genetics; Sequence Analysis, DNA; Base Sequence; Codon, Initiator/genetics; Computational Biology; Databases, Factual; Humans; Open Reading Frames; RNA, Messenger/genetics
13.
J Mol Biol ; 268(1): 31-6, 1997 Apr 25.
Article in English | MEDLINE | ID: mdl-9149139

ABSTRACT

The accuracy of secondary structure prediction methods has been improved significantly by the use of aligned protein sequences. The PHD method and the NNSSP method reach 71 to 72% of sustained overall three-state accuracy when multiple sequence alignments are used with neural networks and nearest-neighbor algorithms, respectively. We introduce a variant of the nearest-neighbor approach that can achieve similar accuracy using a single sequence as the query input. We compute the 50 best non-intersecting local alignments of the query sequence with each sequence from a set of proteins with known 3D structures. Each position of the query sequence is aligned with the database amino acids in alpha-helical, beta-strand or coil states. The predicted type of secondary structure is selected as the type of aligned position with the maximal total score. On the dataset of 124 non-membrane non-homologous proteins, used earlier as a benchmark for secondary structure predictions, our method reaches an overall three-state accuracy of 71.2%. The performance accuracy is verified by an additional test on 461 non-homologous proteins giving an accuracy of 71.0%. The main strength of the method is the high level of prediction accuracy for proteins without any known homolog. Using multiple sequence alignments as input, the method has a prediction accuracy of 73.5%. Prediction of secondary structure by the SSPAL method is available via the Baylor College of Medicine World Wide Web server.


Subject(s)
Algorithms; Models, Molecular; Protein Structure, Secondary; Sequence Alignment/methods; Amino Acid Sequence; Databases, Factual; Molecular Sequence Data; Proteins/chemistry
14.
Comput Appl Biosci ; 13(1): 23-8, 1997 Feb.
Article in English | MEDLINE | ID: mdl-9088705

ABSTRACT

We have developed a computer program, POLYAH, and an algorithm for the identification of 3'-processing sites of human mRNA precursors. The algorithm is based on a linear discriminant function (LDF) trained to discriminate real poly(A) signal regions from other regions of human genes that possess an AATAAA sequence that is most likely non-functional. Various significant contextual characteristics of the sequences surrounding AATAAA signals were used as the parameters of the LDF. The accuracy of the method has been estimated on a set of 131 poly(A) regions and 1466 regions of human genes having the AATAAA sequence. When the threshold was set to predict 86% of poly(A) regions correctly, a specificity of 51% and a correlation coefficient of 0.62 were achieved. The precision of this approach is better than for the other methods and has been tested on a larger data set. POLYAH can be used through the World Wide Web (at the Gene-Finder Home page: URL http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html) or by sending files with uncharacterized human sequences to the University of Houston or Weizmann Institute of Science e-mail servers.


Subject(s)
Algorithms; RNA Precursors/metabolism; RNA Processing, Post-Transcriptional; RNA, Messenger/metabolism; Software; Base Sequence; Binding Sites; Computer Communication Networks; Computer Simulation; Databases, Factual; Discriminant Analysis; Humans; RNA Precursors/chemistry; RNA Precursors/genetics; RNA, Messenger/chemistry; RNA, Messenger/genetics
15.
Article in English | MEDLINE | ID: mdl-9322052

ABSTRACT

We present a suite of new programs for promoter, 3'-processing, splice site, coding exon and gene structure identification in genomic DNA of several model species. The human gene structure prediction program FGENEH, the exon prediction program FEXH and the splice site prediction program HSPL have been modified for sequence analysis of Drosophila (FGENED, FEXD and DSPL), C. elegans (FGENEN, FEXN and NSPL), yeast (FEXY and YSPL) and plant (FGENEA, FEXA and ASPL) genomic sequences. We recomputed all frequency and discriminant function parameters for these organisms and adjusted organism-specific minimal intron lengths. The accuracy of coding region prediction for these programs is similar to the observed accuracy of FEXH and FGENEH. We have developed the FEXHB and FGENEHB programs, which combine pattern recognition features with information about the similarity of predicted exons to known sequences in protein databases. These programs have approximately 10% higher average accuracy of coding region recognition. Two new programs for human promoter site prediction (TSSG and TSSW) have been developed which use the Gosh (1993) and Wingender (1994) databases of functional motifs, respectively. The POLYAH program was designed for prediction of 3'-processing regions in human genes and the CDSB program was developed for bacterial gene prediction. We have developed a new approach to predict multiple genes based on double dynamic programming, which is very important for the analysis of long genomic DNA fragments generated by genome sequencing projects. Analysis of uncharacterized sequences based on our methods is available through the University of Houston and Weizmann Institute of Science email servers and several Web pages at Baylor College of Medicine.


Subject(s)
Genome, Human; Genome; Software; Animals; DNA/genetics; Databases, Factual; Exons; Genes, Bacterial; Humans; Models, Genetic; Promoter Regions, Genetic; RNA Splicing; Sequence Alignment/methods; Sequence Alignment/statistics & numerical data
16.
J Mol Biol ; 247(1): 11-5, 1995 Mar 17.
Article in English | MEDLINE | ID: mdl-7897654

ABSTRACT

Recently Yi & Lander used a neural network and nearest-neighbor method with a scoring system that combined a sequence-similarity matrix with the local structural environment scoring scheme described by Bowie and co-workers for predicting protein secondary structure. We have improved their scoring system by taking into consideration N and C-terminal positions of alpha-helices and beta-strands and also beta-turns as distinctive types of secondary structure. Another improvement, which also decreases the time of computation, is achieved by restricting the database to a smaller subset of proteins that are similar to the query sequence. Using multiple sequence alignments rather than single sequences and a simple jury decision procedure, our method reaches a sustained overall three-state accuracy of 72.2%, which is better than that observed for the most accurate multilayered neural-network approach, tested on the same data set of 126 non-homologous protein chains.


Subject(s)
Protein Structure, Secondary; Proteins/chemistry; Algorithms; Biological Evolution; Sequence Alignment; Sequence Homology, Amino Acid; Software
17.
Article in English | MEDLINE | ID: mdl-7584460

ABSTRACT

Development of advanced techniques to identify gene structure is one of the main challenges of the Human Genome Project. Discriminant analysis was applied to the construction of recognition functions for various components of gene structure. Linear discriminant functions for splice sites, 5'-coding, internal exon, and 3'-coding region recognition have been developed. A gene structure prediction system, FGENE, has been developed based on the exon recognition functions. We compute a graph of mutual compatibility of different exons and present gene structure models as paths of this directed acyclic graph. For optimal model selection we apply a variant of the dynamic programming algorithm to search for the path in the graph with the maximal value of the corresponding discriminant functions. Prediction by FGENE for 185 complete human gene sequences has 81% exact exon recognition accuracy and 91% accuracy at the level of individual exon nucleotides, with a correlation coefficient (C) of 0.90. Testing FGENE on 35 genes not used in the development of discriminant functions shows 71% accuracy of exact exon prediction and 89% at the nucleotide level (C = 0.86). FGENE compares very favorably with the other programs currently used to predict protein-coding regions. Analysis of uncharacterized human sequences based on our methods for splice site (HSPL, RNASPL), internal exon (HEXON), all types of exons (FEXH) and human (FGENEH) and bacterial (CDSB) gene structure prediction, and recognition of human and bacterial sequences (HBR) (to test a library for E. coli contamination), is available through the University of Houston, the Weizmann Institute of Science network server and a WWW page of the Human Genome Center at Baylor College of Medicine.
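
The path search described in this abstract (gene models as paths through a directed acyclic graph of mutually compatible exons, with dynamic programming used to find the highest-scoring path) can be sketched as below. This is a simplified illustration rather than FGENE itself: the compatibility rule here only forbids overlap, whereas a real system would also check reading frame and exon type, and the function names, the scores and the candidate exons are all hypothetical.

```python
def best_gene_model(exons, compatible):
    """Find the maximum-score chain of compatible exons by dynamic programming.

    exons: non-empty list of (start, end, score), sorted by start.
    compatible(a, b): True if exon b may directly follow exon a.
    Returns (total_score, indices of the exons on the best path).
    """
    n = len(exons)
    best = [e[2] for e in exons]   # best score of any path ending at exon j
    prev = [None] * n              # back-pointer to the previous exon on that path
    for j in range(n):
        for i in range(j):
            candidate = best[i] + exons[j][2]
            if compatible(exons[i], exons[j]) and candidate > best[j]:
                best[j] = candidate
                prev[j] = i
    end = max(range(n), key=best.__getitem__)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = prev[node]
    return best[end], path[::-1]


def non_overlapping(a, b):
    """Simplified compatibility rule: exon a must end before exon b starts."""
    return a[1] < b[0]


# Hypothetical candidate exons: (start, end, discriminant score).
candidates = [(1, 100, 2.0), (50, 150, 3.5), (120, 200, 1.5),
              (160, 260, 2.5), (300, 380, 1.0)]
score, path = best_gene_model(candidates, non_overlapping)
print(score, path)
```

Because the exons are sorted by start and an edge only ever points to a later exon, a single left-to-right pass visits the graph in topological order, which is what makes the quadratic-time recurrence valid.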


Subject(s)
Algorithms; DNA/chemistry; DNA/genetics; Exons; Human Genome Project; Software; Base Sequence; Discriminant Analysis; Genes, Bacterial; Humans; Models, Genetic; Molecular Sequence Data; Open Reading Frames
18.
Nucleic Acids Res ; 22(24): 5156-63, 1994 Dec 11.
Article in English | MEDLINE | ID: mdl-7816600

ABSTRACT

A new method which predicts internal exon sequences in human DNA has been developed. The method is based on a splice site prediction algorithm that uses the linear discriminant function to combine information about significant triplet frequencies of various functional parts of splice site regions and preferences of oligonucleotides in protein coding and intron regions. The accuracy of our splice site recognition function is 97% for donor splice sites and 96% for acceptor splice sites. For exon prediction, we combine in a discriminant function the characteristics describing the 5'-intron region, donor splice site, coding region, acceptor splice site and 3'-intron region for each open reading frame flanked by GT and AG base pairs. The accuracy of precise internal exon recognition on a test set of 451 exon and 246693 pseudoexon sequences is 77% with a specificity of 79%. The recognition quality computed at the level of individual nucleotides is 89% for exon sequences and 98% for intron sequences. This corresponds to a correlation coefficient for exon prediction of 0.87. The precision of this approach is better than other methods and has been tested on a larger data set. We have also developed a means for predicting exon-exon junctions in cDNA sequences, which can be useful for selecting optimal PCR primers.
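
A linear discriminant function of the kind this abstract relies on can be sketched in a few lines. The example below is a generic two-feature Fisher discriminant in plain Python, offered only as an illustration: the two features (imagined here as a splice-site triplet score and a coding-potential score), the training points and all function names are hypothetical, and the paper's actual characteristics and parameters are not reproduced.

```python
def mean(rows):
    """Column means of a list of 2-feature rows."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]


def scatter(rows, m):
    """2x2 within-class scatter matrix around the class mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def fit_ldf(pos, neg):
    """Return (weights, threshold) with weights = S_w^-1 (mean_pos - mean_neg)."""
    mp, mn = mean(pos), mean(neg)
    sp, sn = scatter(pos, mp), scatter(neg, mn)
    s = [[sp[a][b] + sn[a][b] for b in range(2)] for a in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]   # assumed non-zero for this sketch
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    diff = [mp[0] - mn[0], mp[1] - mn[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Threshold placed halfway between the two projected class means.
    threshold = 0.5 * sum(w[k] * (mp[k] + mn[k]) for k in range(2))
    return w, threshold


def ldf_score(w, x):
    """Project a 2-feature point onto the discriminant direction."""
    return w[0] * x[0] + w[1] * x[1]


# Hypothetical training points: true sites versus pseudo-sites.
true_sites = [(0.9, 0.8), (0.8, 0.6), (0.7, 0.9), (0.6, 0.7)]
pseudo_sites = [(0.2, 0.3), (0.3, 0.1), (0.1, 0.4), (0.4, 0.2)]
w, threshold = fit_ldf(true_sites, pseudo_sites)
print(ldf_score(w, (0.75, 0.7)) > threshold)
```

A candidate is called a true site when its score exceeds the threshold; shifting the threshold trades sensitivity against specificity, which is how accuracy pairs such as those quoted in the abstract arise.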


Subject(s)
Algorithms; Base Composition; Exons/genetics; Open Reading Frames/genetics; Base Sequence; Databases, Factual; Discriminant Analysis; Humans; Introns/genetics; Molecular Sequence Data; Oligodeoxyribonucleotides/genetics; RNA Splicing/genetics
19.
Comput Appl Biosci ; 10(6): 661-9, 1994 Dec.
Article in English | MEDLINE | ID: mdl-7704665

ABSTRACT

All current methods of protein secondary structure prediction are based on evaluation of a single residue state. Although the accuracy of the best of them is approximately 60-70%, for reliable prediction of tertiary structure it is more useful to predict an approximate location of alpha-helix and beta-strand segments, especially prolonged ones. We have developed a simple method for protein secondary structure prediction which is oriented toward the location of secondary structure segments. The method uses linear discriminant analysis to assign segments of a given amino acid sequence a particular type of secondary structure, by taking into account the amino acid composition of internal parts of segments as well as their terminal and adjacent regions. Four linear discriminant functions were constructed for recognition of short and long alpha-helix and beta-strand segments, respectively. These functions combine three characteristics: hydrophobic moment, segment singlet, and pair preferences to an alpha-helix or beta-strand. The last two characteristics are calculated by summing the preference parameters of single residues and pairs of residues located in a segment and its adjacent regions. The final program SSP predicts all possible potential alpha-helices and beta-strands and resolves some possible overlap between them. Overall three-state (alpha, beta, c) prediction gives approximately 65.1% correctly predicted residues on 126 non-homologous proteins using the jackknife test procedure. Analysis of the prediction results shows a high prediction accuracy of long secondary structure segments (approximately 89% of alpha-helices of length > 8 and approximately 71% of beta-strands of length > 6 are correctly located with probability of correct prediction 0.82 and 0.78, respectively). (ABSTRACT TRUNCATED AT 250 WORDS)
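
Of the three characteristics named in this abstract, the hydrophobic moment is the easiest to illustrate. The sketch below uses the widely used formulation in which residue hydrophobicities are summed as vectors rotated by a fixed angle per residue (about 100 degrees for an alpha-helix, 160 to 180 degrees for a beta-strand). The abstract does not state which hydrophobicity scale or angle SSP uses, so the Kyte-Doolittle scale and the default angle here are assumptions, and the function name is hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values (an assumed choice of scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(segment, angle_deg=100.0):
    """Magnitude of the vector sum of residue hydrophobicities.

    Residue i contributes its hydropathy value along a direction rotated
    by i * angle_deg, so a large moment indicates an amphipathic segment
    with the periodicity implied by the chosen angle.
    """
    angle = math.radians(angle_deg)
    sin_sum = sum(KYTE_DOOLITTLE[aa] * math.sin(i * angle)
                  for i, aa in enumerate(segment))
    cos_sum = sum(KYTE_DOOLITTLE[aa] * math.cos(i * angle)
                  for i, aa in enumerate(segment))
    return math.hypot(sin_sum, cos_sum)


# An alternating hydrophobic/polar pattern is strongly amphipathic as a strand.
print(hydrophobic_moment("LKLK", angle_deg=180.0) >
      hydrophobic_moment("LLKK", angle_deg=180.0))
```

In a segment-based predictor this value would be one input to the discriminant function for each candidate segment, alongside the singlet and pair preference sums that the abstract describes.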


Subject(s)
Protein Structure, Secondary; Proteins/chemistry; Software; Algorithms; Amino Acid Sequence; Computer Simulation; Discriminant Analysis; Models, Molecular; Molecular Sequence Data; Molecular Structure; Protein Folding; Protein Structure, Tertiary; Sequence Alignment
20.
Article in English | MEDLINE | ID: mdl-7584412

ABSTRACT

Discriminant analysis is applied to the problem of recognizing 5'-, internal and 3'-exons in human DNA sequences. Specific recognition functions were developed for revealing exons of particular types. The method is based on a splice site prediction algorithm that uses the linear Fisher discriminant to combine information about significant triplet frequencies of various functional parts of splice site regions and preferences of oligonucleotides in protein coding and intron regions (Solovyev, Lawrence, 1994). The accuracy of our splice site recognition function is about 97%. A discriminant function for 5'-exon prediction includes hexanucleotide composition of the upstream region, triplet composition around the ATG codon, ORF coding potential, donor splice site potential and composition of the downstream intron region. For internal exon prediction, we combine in a discriminant function the characteristics describing the 5'-intron region, donor splice site, coding region, acceptor splice site and 3'-intron region for each open reading frame flanked by GT and AG base pairs. The accuracy of precise internal exon recognition on a test set of 451 exon and 246693 pseudoexon sequences is 77% with a specificity of 79% and a level of pseudoexon ORF prediction of 99.96%. The recognition quality computed at the level of individual nucleotides is 89% for exon sequences and 98% for intron sequences. A discriminant function for 3'-exon prediction includes octanucleotide composition of the upstream intron region, triplet composition around the stop codon, ORF coding potential, acceptor splice site potential and hexanucleotide composition of the downstream region. (ABSTRACT TRUNCATED AT 250 WORDS)


Subject(s)
Computer Simulation; Discriminant Analysis; Exons/genetics; Models, Theoretical; Sequence Analysis; Humans; Oligonucleotides/genetics