Results 1 - 20 of 39
1.
Mol Cell Proteomics ; 12(7): 1829-43, 2013 Jul.
Article in English | MEDLINE | ID: mdl-23481700

ABSTRACT

Peptides presented by human leukocyte antigen (HLA) molecules on the cell surface play a crucial role in adaptive immunology, mediating the communication between T cells and antigen presenting cells. Knowledge of these peptides is of pivotal importance in fundamental studies of T cell action and in cellular immunotherapy and transplantation. In this paper we present the in-depth identification and relative quantification of 14,500 peptide ligands constituting the HLA ligandome of B cells. This large number of identified ligands provides general insight into the presented peptide repertoire and antigen presentation. Our uniquely large set of HLA ligands allowed us to characterize in detail the peptides constituting the ligandome in terms of relative abundance, peptide length distribution, physicochemical properties, binding affinity to the HLA molecule, and presence of post-translational modifications. The presented B-lymphocyte ligandome is shown to be a rich source of information by the presence of minor histocompatibility antigens, virus-derived epitopes, and post-translationally modified HLA ligands, and it can be a good starting point for solving a wealth of specific immunological questions. These HLA ligands can form the basis for reversed immunology approaches to identify T cell epitopes based not on in silico predictions but on the bona fide eluted HLA ligandome.


Subjects
B Lymphocytes/metabolism , HLA Antigens/metabolism , Peptides/metabolism , Antigen Presentation , Transformed Cell Line , Human Herpesvirus 4/genetics , Humans , Ligands
2.
Nucleic Acids Res ; 40(Database issue): D394-9, 2012 Jan.
Article in English | MEDLINE | ID: mdl-22102581

ABSTRACT

ProRepeat (http://prorepeat.bioinformatics.nl/) is an integrated curated repository and analysis platform for in-depth research on the biological characteristics of amino acid tandem repeats. ProRepeat collects repeats from all proteins included in the UniProt knowledgebase, together with 85 completely sequenced eukaryotic proteomes contained within the RefSeq collection. It contains non-redundant perfect tandem repeats, approximate tandem repeats and simple, low-complexity sequences, covering the majority of the amino acid tandem repeat patterns found in proteins. The ProRepeat web interface allows querying the repeat database using repeat characteristics like repeat unit and length, number of repetitions of the repeat unit and position of the repeat in the protein. Users can also search for repeats by the characteristics of repeat-containing proteins, such as entry ID, protein description, sequence length, gene name and taxon. ProRepeat offers powerful analysis tools for finding biologically interesting properties of repeats, such as the strong position bias of leucine repeats in the N-terminus of eukaryotic protein sequences, the differences of repeat abundance among proteomes, the functional classification of repeat-containing proteins and GC content constraints of repeats' corresponding codons.
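
The notion of a perfect tandem repeat, described by repeat unit, number of repetitions and position, can be illustrated with a minimal Python sketch. This is not ProRepeat's own algorithm; the function names, the `max_unit_len` and `min_copies` parameters, and the example sequence are all assumptions made for illustration.

```python
def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter unit (e.g. 'LL')."""
    return not any(
        len(unit) % k == 0 and unit == unit[:k] * (len(unit) // k)
        for k in range(1, len(unit))
    )


def find_perfect_tandem_repeats(sequence, max_unit_len=3, min_copies=3):
    """Return (start, unit, copies) tuples; start is a 0-based position."""
    repeats = []
    for unit_len in range(1, max_unit_len + 1):
        pos = 0
        while pos + unit_len * min_copies <= len(sequence):
            unit = sequence[pos:pos + unit_len]
            copies = 1
            # Count how many exact copies of the unit follow back to back.
            while sequence[pos + copies * unit_len:pos + (copies + 1) * unit_len] == unit:
                copies += 1
            if copies >= min_copies and is_primitive(unit):
                repeats.append((pos, unit, copies))
                pos += copies * unit_len  # continue after the repeat
            else:
                pos += 1
    return repeats


# Hypothetical sequence with a leucine run near the N-terminus and an AQ repeat.
print(find_perfect_tandem_repeats("MLLLLLLAQAQAQK"))
# → [(1, 'L', 6), (7, 'AQ', 3)]
```

Approximate repeats and low-complexity regions, which the abstract also mentions, would need a scoring scheme that tolerates mismatches and are not covered by this sketch.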


Subjects
Protein Databases , Proteins/chemistry , Amino Acid Repetitive Sequences , Protein Sequence Analysis , User-Computer Interface
3.
Trends Genet ; 24(11): 539-51, 2008 Nov.
Article in English | MEDLINE | ID: mdl-18819722

ABSTRACT

Orthology is a key evolutionary concept in many areas of genomic research. It provides a framework for subjects as diverse as the evolution of genomes, gene functions, cellular networks and functional genome annotation. Although orthologous proteins usually perform equivalent functions in different species, establishing true orthologous relationships requires a phylogenetic approach, which combines both trees and graphs (networks) using reliable species phylogeny and available genomic data from more than two species, and an insight into the processes of molecular evolution. Here, we evaluate the available bioinformatics tools and provide a set of guidelines to aid researchers in choosing the most appropriate tool for any situation.


Subjects
Molecular Evolution , Genomics/methods , Phylogeny , Sequence Homology , Animals , Genetic Databases , Genome , Humans , Proteins/chemistry
4.
Immunogenetics ; 63(3): 143-53, 2011 Mar.
Article in English | MEDLINE | ID: mdl-21125265

ABSTRACT

T cell epitopes derived from polymorphic proteins or from proteins encoded by alternative reading frames (ARFs) play an important role in (tumor) immunology. Identification of these peptides is successfully performed with mass spectrometry. In a mass spectrometry-based approach, the recorded tandem mass spectra are matched against hypothetical spectra generated from known protein sequence databases. Commonly used protein databases contain a minimal level of redundancy, and thus, are not suitable data sources for searching polymorphic T cell epitopes, either in normal or ARFs. At the same time, however, these databases contain much non-polymorphic sequence information, thereby complicating the matching of recorded and theoretical spectra, and increasing the potential for finding false positives. Therefore, we created a database with peptides from ARFs and peptide variation arising from single nucleotide polymorphisms (SNPs). It is based on the human mRNA sequences from the well-annotated reference sequence (RefSeq) database and associated variation information derived from the Single Nucleotide Polymorphism Database (dbSNP). In this process, we removed all non-polymorphic information. Investigation of the frequency of SNPs in the dbSNP revealed that many SNPs are non-polymorphic "SNPs". Therefore, we removed those from our dedicated database, and this resulted in a comprehensive high quality database, which we coined the Human Short Peptide Variation Database (HSPVdb). The value of our HSPVdb is shown by identification of the majority of published polymorphic SNP- and/or ARF-derived epitopes from a mass spectrometry-based proteomics workflow, and by a large variety of polymorphic peptides identified as potential T cell epitopes in the HLA-ligandome presented by the Epstein-Barr virus cells.
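
To make the idea of SNP-derived peptide variation concrete, the following Python sketch enumerates the short peptides that span a single substituted residue. It is only an illustration under stated assumptions: the function `variant_peptides`, its parameters and the example protein are hypothetical, and the abstract does not describe how HSPVdb entries are actually generated.

```python
def variant_peptides(protein, position, alt_residue, length=9):
    """List every peptide of the given length that covers the substituted residue.

    position is 1-based; alt_residue replaces the reference residue there.
    """
    idx = position - 1
    variant = protein[:idx] + alt_residue + protein[idx + 1:]
    first = max(0, idx - length + 1)          # leftmost window covering idx
    last = min(idx, len(variant) - length)    # rightmost window that still fits
    return [variant[start:start + length] for start in range(first, last + 1)]


# Hypothetical 10-residue protein with a Y-to-H substitution at position 5.
print(variant_peptides("MKTAYIAKQR", position=5, alt_residue="H", length=3))
# → ['TAH', 'AHI', 'HIA']
```

Keeping only windows that contain the variant residue mirrors the abstract's point that non-polymorphic sequence is left out of the database.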


Subjects
Genetic Databases , T-Lymphocyte Epitopes/chemistry , Ligands , Peptides , Transformed Cell Line , T-Lymphocyte Epitopes/genetics , HLA Antigens/metabolism , Humans , Mass Spectrometry , Neoplasms/drug therapy , Neoplasms/immunology , Single Nucleotide Polymorphism
5.
Bioinformatics ; 26(19): 2482-3, 2010 Oct 01.
Article in English | MEDLINE | ID: mdl-20679333

ABSTRACT

Multi-netclust is a simple tool that allows users to extract connected clusters of data represented by different networks given in the form of matrices. The tool uses user-defined threshold values to combine the matrices, and uses a straightforward, memory-efficient graph algorithm to find clusters that are connected in all or in either of the networks. The tool is written in C/C++ and is available either as a form-based or as a command-line-based program running on Linux platforms. The algorithm is fast: processing a network of > 10^6 nodes and 10^8 edges takes only a few minutes on an ordinary computer. AVAILABILITY: http://www.bioinformatics.nl/netclust/.
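
The thresholding and clustering step can be sketched in Python with a union-find structure. This is an illustrative sketch, not the tool's C/C++ implementation: the function `connected_clusters`, the edge-dictionary input format, and the reading of "all" as "the edge passes the threshold in every network" are assumptions.

```python
def connected_clusters(networks, thresholds, mode="all"):
    """Cluster nodes from several weighted networks.

    networks:   list of dicts mapping (node_u, node_v) -> weight
    thresholds: one cut-off per network; edges below it are dropped
    mode:       "all" keeps edges passing in every network, "either" in any
    """
    passing = [
        {tuple(sorted(edge)) for edge, weight in net.items() if weight >= cutoff}
        for net, cutoff in zip(networks, thresholds)
    ]
    edges = set.intersection(*passing) if mode == "all" else set.union(*passing)
    nodes = {node for net in networks for edge in net for node in edge}
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the cluster root, compressing the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for u, v in edges:
        parent[find(u)] = find(v)  # merge the two clusters

    clusters = {}
    for node in nodes:
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())


# Two hypothetical similarity networks over the same four nodes.
net_a = {("a", "b"): 0.9, ("b", "c"): 0.8, ("c", "d"): 0.2}
net_b = {("a", "b"): 0.7, ("b", "c"): 0.1, ("c", "d"): 0.9}
print(connected_clusters([net_a, net_b], [0.5, 0.5], mode="all"))
# → [['a', 'b'], ['c'], ['d']]
print(connected_clusters([net_a, net_b], [0.5, 0.5], mode="either"))
# → [['a', 'b', 'c', 'd']]
```

Union-find needs memory proportional to the number of nodes only, since edges can be streamed one at a time, which is one plausible way to stay memory-efficient on large networks.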


Subjects
Cluster Analysis , Software , Algorithms , Factual Databases , User-Computer Interface
7.
Nucleic Acids Res ; 37(Web Server issue): W428-34, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19494185

ABSTRACT

Current protein sequence databases employ different classification schemes that often provide conflicting annotations, especially for poorly characterized proteins. ProGMap (Protein Group Mappings, http://www.bioinformatics.nl/progmap) is a web-tool designed to help researchers and database annotators to assess the coherence of protein groups defined in various databases and thereby facilitate the annotation of newly sequenced proteins. ProGMap is based on a non-redundant dataset of over 6.6 million protein sequences which is mapped to 240,000 protein group descriptions collected from UniProt, RefSeq, Ensembl, COG, KOG, OrthoMCL-DB, HomoloGene, TRIBES and PIRSF. ProGMap combines the underlying classification schemes via a network of links constructed by a fast and fully automated mapping approach originally developed for document classification. The web interface enables queries to be made using sequence identifiers, gene symbols, protein functions or amino acid and nucleotide sequences. For the latter query type BLAST similarity search and QuickMatch identity search services have been incorporated, for finding sequences similar (or identical) to a query sequence. ProGMap is meant to help users of high throughput methodologies who deal with partially annotated genomic data.


Subjects
Proteins/classification , Software , Protein Databases , Internet , Proteins/chemistry , Protein Sequence Analysis , Systems Integration , User-Computer Interface
8.
Mol Biol Evol ; 26(8): 1707-14, 2009 Aug.
Article in English | MEDLINE | ID: mdl-19429673

ABSTRACT

The members of the p24 protein family have an important but unclear role in transport processes in the early secretory pathway. The p24 family consists of four subfamilies (alpha, beta, gamma, and delta), whereby the exact composition of the family varies among species. Despite more than 15 years of p24 research, the vertebrate p24 family is still surprisingly ill characterized. Here, we describe the human, mouse, Xenopus, and zebrafish orthologues of 10 p24 family members and a new member that we term p24gamma(5). Of these eleven p24 family members, nine are conserved throughout the vertebrate lineage, whereas two (p24gamma(4) and p24delta(2)) occur in some but not all vertebrates. We further show that all p24 proteins are widely expressed in mouse, except for p24alpha(1) and p24gamma(5) that display restricted expression patterns. Thus, we present for the first time a comprehensive overview of the phylogeny and expression of the vertebrate p24 protein family.


Subjects
Carrier Proteins/genetics , Membrane Proteins/genetics , Vertebrates/genetics , Animals , Terminator Codon , Female , Gene Expression Regulation , Humans , Mice , Phylogeny
9.
Brief Bioinform ; 9(3): 220-31, 2008 May.
Article in English | MEDLINE | ID: mdl-18238804

ABSTRACT

The BioMoby project was initiated in 2001 from within the model organism database community. It aimed to standardize methodologies to facilitate information exchange and access to analytical resources, using a consensus driven approach. Six years later, the BioMoby development community is pleased to announce the release of the 1.0 version of the interoperability framework, registry Application Programming Interface and supporting Perl and Java code-bases. Together, these provide interoperable access to over 1400 bioinformatics resources worldwide through the BioMoby platform, and this number continues to grow. Here we highlight and discuss the features of BioMoby that make it distinct from other Semantic Web Service and interoperability initiatives, and that have been instrumental to its deployment and use by a wide community of bioinformatics service providers. The standard, client software, and supporting code libraries are all freely available at http://www.biomoby.org/.


Subjects
Computational Biology/methods , Database Management Systems , Factual Databases , Information Storage and Retrieval/methods , Internet , Programming Languages , Systems Integration
10.
In Silico Biol ; 10(3): 193-205, 2010.
Article in English | MEDLINE | ID: mdl-22430292

ABSTRACT

BACKGROUND: In the field of bioinformatics, interchangeable data formats based on XML are widely used. XML-type data is also at the core of most web services. With the increasing amount of data stored in XML comes the need for storing and accessing the data. In this paper we analyse the suitability of different database systems for storing and querying large datasets in general and Medline in particular. RESULTS: All reviewed database systems perform well when tested with small to medium sized datasets; however, when the full Medline dataset is queried, a large variation in query times is observed. CONCLUSIONS: There is not one system that is vastly superior to the others in this comparison and, depending on the database size and the query requirements, different systems are most suitable. The best all-round solution is the Oracle 11g database system using the new binary storage option. Alias-i's Lingpipe is a more lightweight, customizable and sufficiently fast solution. It does, however, require more initial configuration steps. For data with a changing XML structure, Sedna and BaseX as native XML database systems or MySQL with an XML-type column are suitable.


Subjects
Database Management Systems , Genetic Databases , Programming Languages , Computational Biology , Electronic Data Processing
11.
Nucleic Acids Res ; 36(Web Server issue): W255-9, 2008 Jul 01.
Article in English | MEDLINE | ID: mdl-18463137

ABSTRACT

With over 46 000 proteins, the Protein Data Bank (PDB) is the most important database with structural information of biological macromolecules. PDB files contain sequence and coordinate information. Residues present in the sequence can be absent from the coordinate section, which means their position in space is unknown. Similarity searches are routinely carried out against sequences taken from PDB SEQRES. However, no distinction is made there between residues that have a known or unknown position in the 3D protein structure. We present a FASTA sequence database that is produced by combining the sequence and coordinate information. All residues absent from the PDB coordinate section are masked with lower-case letters, thereby providing a view of these residues in the context of the entire protein sequence, which facilitates inspecting 'missing' regions. We also provide a masked version of the CATH domain database. A user-friendly BLAST interface is available for similarity searching. In contrast to standard (stand-alone) BLAST output, which only contains upper-case letters, our output retains the lower-case letters of the masked regions. Thus, our server can be used to perform BLAST searching case-sensitively. Here, we have applied it to the study of missing regions in their sequence context. SEQATOMS is available at http://www.bioinformatics.nl/tools/seqatoms/.
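
The masking convention is simple enough to show in a few lines of Python. This is a sketch of the idea only, not the SEQATOMS pipeline: the function `mask_missing_residues` and its inputs are hypothetical, and it assumes the set of residue positions with coordinates has already been extracted from the PDB file.

```python
def mask_missing_residues(seqres, observed_positions):
    """Lower-case every residue that has no coordinates.

    seqres:             full sequence as given in the SEQRES records
    observed_positions: 1-based positions that appear in the coordinate section
    """
    observed = set(observed_positions)
    return "".join(
        residue if position in observed else residue.lower()
        for position, residue in enumerate(seqres, start=1)
    )


# Hypothetical chain whose two terminal residues on each side lack coordinates.
print(mask_missing_residues("MKTAYIAK", observed_positions=[3, 4, 5, 6]))
# → mkTAYIak
```

Because the masked string keeps its full length, alignment coordinates stay identical to those of the unmasked SEQRES sequence.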


Subjects
Protein Databases , Sequence Alignment , Protein Sequence Analysis , Software , Internet , Tertiary Protein Structure
12.
Nucleic Acids Res ; 35(Web Server issue): W71-4, 2007 Jul.
Article in English | MEDLINE | ID: mdl-17485472

ABSTRACT

Here we present Primer3Plus, a new web interface to the popular Primer3 primer design program as an enhanced alternative for the CGI scripts that come with Primer3. Primer3 consists of a command line program and a web interface. The web interface is one large form showing all of the possible options. This makes the interface powerful, but at the same time confusing for occasional users. Primer3Plus provides an intuitive user interface using present-day web technologies and has been developed in close collaboration with molecular biologists and technicians regularly designing primers. It focuses on the task at hand, and hides detailed settings from the user until these are needed. We also added functionality to automate specific tasks like designing primers for cloning or step-wise sequencing. Settings and designed primer sequences can be stored locally for later use. Primer3Plus supports a range of common sequence formats, such as FASTA. Finally, primers selected by Primer3Plus can be sent to an order form, allowing tight integration into laboratory ordering systems. Moreover, the open architecture of Primer3Plus allows easy expansion or integration of external software packages. The Primer3Plus Perl source code is available under the GPL license from SourceForge. Primer3Plus is available at http://www.bioinformatics.nl/primer3plus.


Subjects
Computational Biology/methods , Database Management Systems , Genetic Techniques , Internet , Base Sequence , Molecular Cloning , DNA Primers , Molecular Sequence Data , Polymerase Chain Reaction/methods , Nucleic Acid Sequence Homology , User-Computer Interface
13.
Nucleic Acids Res ; 35(Database issue): D232-6, 2007 Jan.
Article in English | MEDLINE | ID: mdl-17142240

ABSTRACT

Protein classification by machine learning algorithms is now widely used in structural and functional annotation of proteins. The Protein Classification Benchmark collection (http://hydra.icgeb.trieste.it/benchmark) was created in order to provide standard datasets on which the performance of machine learning methods can be compared. It is primarily meant for method developers and users interested in comparing methods under standardized conditions. The collection contains datasets of sequences and structures, and each set is subdivided into positive/negative, training/test sets in several ways. There is a total of 6405 classification tasks, 3297 on protein sequences, 3095 on protein structures and 10 on protein coding regions in DNA. Typical tasks include the classification of structural domains in the SCOP and CATH databases based on their sequences or structures, as well as various functional and taxonomic classification problems. In the case of hierarchical classification schemes, the classification tasks can be defined at various levels of the hierarchy (such as classes, folds, superfamilies, etc.). For each dataset there are distance matrices available that contain all vs. all comparison of the data, based on various sequence or structure comparison methods, as well as a set of classification performance measures computed with various classifier algorithms.


Subjects
Artificial Intelligence , Protein Databases , Proteins/classification , Algorithms , Internet , Tertiary Protein Structure , Proteins/chemistry , Reproducibility of Results , Protein Sequence Analysis , User-Computer Interface
14.
BMC Genet ; 9: 23, 2008 Feb 28.
Article in English | MEDLINE | ID: mdl-18307806

ABSTRACT

BACKGROUND: Single nucleotide polymorphisms (SNPs) and small insertions or deletions (indels) are the most common type of polymorphisms and are frequently used for molecular marker development. Such markers have become very popular for all kinds of genetic analysis, including haplotype reconstruction. Haplotypes can be reconstructed for whole chromosomes but also for specific genes, based on the SNPs present. Haplotypes in the latter context represent the different alleles of a gene. The computational approach to SNP mining is becoming increasingly popular because of the continuously increasing number of sequences deposited in databases, which allows a more accurate identification of SNPs. Several software packages have been developed for SNP mining from databases. From these, QualitySNP is the only tool that combines SNP detection with the reconstruction of alleles, which results in a lower number of false positive SNPs and also works much faster than other programs. We have built a web-based SNP discovery and allele detection tool (HaploSNPer) based on QualitySNP. RESULTS: HaploSNPer is a flexible web-based tool for detecting SNPs and alleles in user-specified input sequences from both diploid and polyploid species. It includes BLAST for finding homologous sequences in public EST databases, CAP3 or PHRAP for aligning them, and QualitySNP for discovering reliable allelic sequences and SNPs. All possible and reliable alleles are detected by a mathematical algorithm using potential SNP information. Reliable SNPs are then identified based on the reconstructed alleles and on sequence redundancy. CONCLUSION: Thorough testing of HaploSNPer (and the underlying QualitySNP algorithm) has shown that EST information alone is sufficient for the identification of alleles and that reliable SNPs can be found efficiently. Furthermore, HaploSNPer supplies a user-friendly interface for visualization of SNPs and alleles.
HaploSNPer is available from http://www.bioinformatics.nl/tools/haplosnper/.


Subjects
Alleles , Haplotypes/genetics , Single Nucleotide Polymorphism/genetics , Algorithms , Animals , Cluster Analysis , Genetic Databases , Expressed Sequence Tags , Humans , Internet , Ploidies
15.
J Biochem Biophys Methods ; 70(6): 1215-23, 2008 Apr 24.
Article in English | MEDLINE | ID: mdl-17604112

ABSTRACT

Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. 
A combination of supervised and random sampling was used to construct model datasets, suitable for algorithm comparison.
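
The contrast with random cross-validation can be made concrete with a small Python sketch in which every test set is a whole known subtype, so the classifier is always evaluated on a subtype it never saw during training. This is a simplified illustration rather than the benchmark's construction procedure: the function `leave_one_subtype_out`, the record format and the example labels are assumptions, and negative-set handling is omitted.

```python
from collections import defaultdict


def leave_one_subtype_out(records):
    """Yield (held_out_subtype, train_ids, test_ids) for each known subtype.

    records: iterable of (protein_id, subtype) pairs belonging to one class.
    """
    by_subtype = defaultdict(list)
    for protein_id, subtype in records:
        by_subtype[subtype].append(protein_id)

    for held_out in sorted(by_subtype):
        test_ids = sorted(by_subtype[held_out])
        train_ids = sorted(
            protein_id
            for subtype, ids in by_subtype.items()
            if subtype != held_out
            for protein_id in ids
        )
        yield held_out, train_ids, test_ids


# Hypothetical members of one superfamily, labelled by family.
records = [("p1", "famA"), ("p2", "famA"), ("p3", "famB"), ("p4", "famC")]
for held_out, train_ids, test_ids in leave_one_subtype_out(records):
    print(held_out, train_ids, test_ids)
# famA ['p3', 'p4'] ['p1', 'p2']
# famB ['p1', 'p2', 'p4'] ['p3']
# famC ['p1', 'p2', 'p3'] ['p4']
```

Under a random split, close relatives of a test protein usually remain in the training set, which is one reason random schemes tend to give more optimistic performance estimates.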


Subjects
Algorithms , Proteins/analysis , Proteins/classification , Proteins/chemistry , Protein Sequence Analysis
16.
Nucleic Acids Res ; 34(Web Server issue): W104-9, 2006 Jul 01.
Article in English | MEDLINE | ID: mdl-16844970

ABSTRACT

Phylogenetic analysis and examination of protein domains allow accurate genome annotation and are invaluable to study proteins and protein complex evolution. However, two sequences can be homologous without sharing statistically significant amino acid or nucleotide identity, presenting a challenging bioinformatics problem. We present TreeDomViewer, a visualization tool available as a web-based interface that combines phylogenetic tree description, multiple sequence alignment and InterProScan data of sequences and generates a phylogenetic tree projecting the corresponding protein domain information onto the multiple sequence alignment. Thereby it makes use of existing domain prediction tools such as InterProScan. TreeDomViewer adopts an evolutionary perspective on how domain structure of two or more sequences can be aligned and compared, to subsequently infer the function of an unknown homolog. This provides insight into the function assignment of, in terms of amino acid substitution, very divergent but yet closely related family members. Our tool produces an interactive scalable vector graphics image that provides orthological relationship and domain content of proteins of interest at a glance. In addition, PDF, JPEG or PNG formatted output is also provided. These features make TreeDomViewer a valuable addition to the annotation pipeline of unknown genes or gene products. TreeDomViewer is available at http://www.bioinformatics.nl/tools/treedom/.


Subjects
Computer Graphics , Phylogeny , Tertiary Protein Structure , Proteins/classification , Software , Internet , Proteins/genetics , Sequence Alignment , Protein Sequence Analysis , Software Design
18.
BMC Bioinformatics ; 7: 438, 2006 Oct 09.
Article in English | MEDLINE | ID: mdl-17029635

ABSTRACT

BACKGROUND: Single nucleotide polymorphisms (SNPs) are important tools in studying complex genetic traits and genome evolution. Computational strategies for SNP discovery make use of the large number of sequences present in public databases (in most cases as expressed sequence tags (ESTs)) and are considered to be faster and more cost-effective than experimental procedures. A major challenge in computational SNP discovery is distinguishing allelic variation from sequence variation between paralogous sequences, in addition to recognizing sequencing errors. For the majority of the public EST sequences, trace or quality files are lacking which makes detection of reliable SNPs even more difficult because it has to rely on sequence comparisons only. RESULTS: We have developed a new algorithm to detect reliable SNPs and insertions/deletions (indels) in EST data, both with and without quality files. Implemented in a pipeline called QualitySNP, it uses three filters for the identification of reliable SNPs. Filter 1 screens for all potential SNPs and identifies variation between or within genotypes. Filter 2 is the core filter that uses a haplotype-based strategy to detect reliable SNPs. Clusters with potential paralogs as well as false SNPs caused by sequencing errors are identified. Filter 3 screens SNPs by calculating a confidence score, based upon sequence redundancy and quality. Non-synonymous SNPs are subsequently identified by detecting open reading frames of consensus sequences (contigs) with SNPs. The pipeline includes a data storage and retrieval system for haplotypes, SNPs and alignments. QualitySNP's versatility is demonstrated by the identification of SNPs in EST datasets from potato, chicken and humans. CONCLUSION: QualitySNP is an efficient tool for SNP detection, storage and retrieval in diploid as well as polyploid species. It is available for running on Linux or UNIX systems. 
The program, test data, and user manual are available at http://www.bioinformatics.nl/tools/snpweb/ and as Additional files.


Subjects
Chromosome Mapping/methods , DNA Mutational Analysis/methods , DNA Transposable Elements/genetics , Expressed Sequence Tags , Gene Deletion , Single Nucleotide Polymorphism/genetics , Software , Base Sequence , Genetic Databases , Diploidy , Molecular Sequence Data , Polyploidy
19.
BMC Bioinformatics ; 7: 444, 2006 Oct 12.
Article in English | MEDLINE | ID: mdl-17038163

ABSTRACT

BACKGROUND: In the past years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search is not only determined by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score as compared to the e-value, using simulated data. We were interested in whether this could be validated when applied to existing, evolutionarily related protein sequences. RESULTS: All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as reference. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH especially has very high scores. CONCLUSION: The compute-intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations give generally better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons.
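
A Monte-Carlo Z-score compares the observed alignment score with scores obtained after shuffling one of the sequences, Z = (S - mean) / standard deviation, which also shows why it is compute-intensive: every comparison requires many extra alignments. The Python sketch below illustrates this under simplifying assumptions: the scoring scheme (match +2, mismatch -1, linear gap -2), the number of shuffles and all function names are illustrative choices, whereas real implementations use substitution matrices and affine gap penalties.

```python
import random
import statistics


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score, keeping only one row of the matrix."""
    previous = [0] * (len(b) + 1)
    best = 0
    for residue_a in a:
        current = [0]
        for j, residue_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if residue_a == residue_b else mismatch)
            current.append(max(0, diagonal, previous[j] + gap, current[j - 1] + gap))
            best = max(best, current[j])
        previous = current
    return best


def z_score(observed, null_scores):
    """Number of standard deviations the observed score lies above the null mean."""
    spread = statistics.stdev(null_scores)
    if spread == 0:
        return 0.0  # degenerate null distribution; no meaningful Z-score
    return (observed - statistics.mean(null_scores)) / spread


def monte_carlo_z(query, target, shuffles=100, seed=0):
    """Z-score of the query/target alignment against shuffled versions of target."""
    rng = random.Random(seed)
    observed = smith_waterman_score(query, target)
    residues = list(target)
    null_scores = []
    for _ in range(shuffles):
        rng.shuffle(residues)  # preserves composition, destroys order
        null_scores.append(smith_waterman_score(query, "".join(residues)))
    return z_score(observed, null_scores)


print(smith_waterman_score("ACDE", "ACDE"))  # → 8
print(z_score(10, [2, 4, 6]))                # → 3.0
```

With 100 shuffles, each pairwise comparison costs about 101 alignments instead of one, whereas an e-value is derived from the single observed score and precomputed statistical parameters.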


Subjects
Protein Databases/statistics & numerical data , Sequence Alignment/methods , Structural Homology of Proteins , Base Sequence/genetics , Humans , Sequence Alignment/statistics & numerical data , Protein Sequence Analysis/methods , Protein Sequence Analysis/statistics & numerical data
20.
Eur J Hum Genet ; 14(5): 535-42, 2006 May.
Article in English | MEDLINE | ID: mdl-16493445

ABSTRACT

A number of large-scale efforts are underway to define the relationships between genes and proteins in various species. However, few attempts have been made to systematically classify all such relationships at the phenotype level. Also, it is unknown whether such a phenotype map would carry biologically meaningful information. We have used text mining to classify over 5000 human phenotypes contained in the Online Mendelian Inheritance in Man database. We find that similarity between phenotypes reflects biological modules of interacting functionally related genes. These similarities are positively correlated with a number of measures of gene function, including relatedness at the level of protein sequence, protein motifs, functional annotation, and direct protein-protein interaction. Phenotype grouping reflects the modular nature of human disease genetics. Thus, phenotype mapping may be used to predict candidate genes for diseases as well as functional relations between genes and proteins. Such predictions will further improve if a unified system of phenotype descriptors is developed. The phenotype similarity data are accessible through a web interface at http://www.cmbi.ru.nl/MimMiner/.
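
One common way to turn free-text phenotype descriptions into a similarity score is the cosine between their term-count vectors. The abstract does not state which measure was used, so the Python sketch below is only an assumed illustration: the function `cosine_similarity`, the whitespace tokenisation and the example descriptions are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two bag-of-words term-count vectors."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description shares nothing with anything
    return dot / (norm_a * norm_b)


# Hypothetical phenotype descriptions.
shared = cosine_similarity("progressive muscle weakness", "muscle weakness cardiomyopathy")
unrelated = cosine_similarity("progressive muscle weakness", "retinal degeneration")
print(round(shared, 3), unrelated)
```

A controlled vocabulary in place of raw words would make such scores less sensitive to synonyms, in line with the abstract's call for a unified system of phenotype descriptors.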


Subjects
Chromosome Mapping/methods , Genetic Databases , Genetic Predisposition to Disease , Human Genome , Genetic Vectors , Genotype , Humans , Genetic Models , Statistical Models , Multigene Family , Phenotype