Search | VHL (BVS) Search Portal

NovelFam3000--uncharacterized human protein domains conserved across model organisms.

Kemmer, Danielle; Podowski, Raf M; Arenillas, David; Lim, Jonathan; Hodges, Emily; Roth, Peggy; Sonnhammer, Erik L L; Höög, Christer; Wasserman, Wyeth W.

BMC Genomics ; 7: 48, 2006 Mar 13.

Article in English | MEDLINE | ID: mdl-16533400

ABSTRACT

BACKGROUND: Despite significant efforts from the research community, an extensive portion of the proteins encoded by human genes lack an assigned cellular function. Most metazoan proteins are composed of structural and/or functional domains, of which many appear in multiple proteins. Once a domain is characterized in one protein, the presence of a similar sequence in an uncharacterized protein serves as a basis for inference of function. Thus knowledge of a domain's function, or the protein within which it arises, can facilitate the analysis of an entire set of proteins. DESCRIPTION: From the Pfam domain database, we extracted uncharacterized protein domains represented in proteins from humans, worms, and flies. A data centre was created to facilitate the analysis of the uncharacterized domain-containing proteins. The centre both provides researchers with links to dispersed internet resources containing gene-specific experimental data and enables them to post relevant experimental results or comments. For each human gene in the system, a characterization score is posted, allowing users to track the progress of characterization over time or to identify for study uncharacterized domains in well-characterized genes. As a test of the system, a subset of 39 domains was selected for analysis and the experimental results posted to the NovelFam3000 system. For 25 human protein members of these 39 domain families, detailed sub-cellular localizations were determined. Specific observations are presented based on the analysis of the integrated information provided through the online NovelFam3000 system. CONCLUSION: Consistent experimental results between multiple members of a domain family allow for inferences of the domain's functional role. We unite bioinformatics resources and experimental data in order to accelerate the functional characterization of scarcely annotated domain families.

Subjects

Protein Databases , Tertiary Protein Structure , Animals , Caenorhabditis elegans/genetics , Computational Biology , Drosophila melanogaster/genetics , Genomics , Humans , Internet , Proteome/analysis , Sequence Homology , Systems Integration , User-Computer Interface

NotI flanking sequences: a tool for gene discovery and verification of the human genome.

Kutsenko, Alexey S; Gizatullin, Rinat Z; Al-Amin, Ali N; Wang, Fuli; Kvasha, Sergei M; Podowski, Raf M; Matushkin, Yuri G; Gyanchandani, Anita; Muravenko, Olga V; Levitsky, Viktor G; Kolchanov, Nikolay A; Protopopov, Alexei I; Kashuba, Vladimir I; Kisselev, Lev L; Wasserman, Wyeth; Wahlestedt, Claes; Zabarovsky, Eugene R.

Nucleic Acids Res ; 30(14): 3163-70, 2002 Jul 15.

Article in English | MEDLINE | ID: mdl-12136098

ABSTRACT

A set of 22 551 unique human NotI flanking sequences (16.2 Mb) was generated. More than 40% of the set had regions with significant similarity to known proteins and expressed sequences. The data demonstrate that regions flanking NotI sites are less likely to form nucleosomes efficiently and resemble promoter regions. The draft human genome sequence contained 55.7% of the NotI flanking sequences, Celera's database contained matches to 57.2% of the clones and all public databases (including non-human and previously sequenced NotI flanks) matched 89.2% of the NotI flanking sequences (identity ≥90% over at least 50 bp, data from December 2001). The data suggest that the shotgun sequencing approach used to generate the draft human genome sequence resulted in a bias against cloning and sequencing of NotI flanks. A rough estimation (based primarily on chromosomes 21 and 22) is that the human genome contains 15 000-20 000 NotI sites, of which 6000-9000 are unmethylated in any particular cell. The results of the study suggest that the existing tools for computational determination of CpG islands fail to identify a significant fraction of functional CpG islands, and unmethylated DNA stretches with a high frequency of CpG dinucleotides can be found even in regions with low CG content.
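The matching criterion quoted in this abstract (identity of at least 90% over at least 50 bp) can be illustrated with a minimal sketch. The `Hit` record, its field names, and both function names are hypothetical and are not taken from the paper; the sketch only shows how such a threshold could be applied to a list of alignment hits to obtain the fraction of query sequences with a match.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Hit:
    """One alignment hit for a query sequence (hypothetical record layout)."""
    query_id: str
    percent_identity: float  # 0-100
    alignment_length: int    # in base pairs


def passes_threshold(hit: Hit, min_identity: float = 90.0, min_length: int = 50) -> bool:
    """True if the hit meets the identity and length cut-offs quoted in the abstract."""
    return hit.percent_identity >= min_identity and hit.alignment_length >= min_length


def matched_fraction(hits: Iterable[Hit], total_queries: int) -> float:
    """Fraction of query sequences with at least one hit passing the threshold."""
    matched = {h.query_id for h in hits if passes_threshold(h)}
    return len(matched) / total_queries


# Toy data, invented purely for illustration.
example_hits = [
    Hit("a", 95.0, 60),   # passes
    Hit("b", 89.9, 200),  # identity too low
    Hit("c", 99.0, 49),   # alignment too short
    Hit("a", 91.0, 50),   # passes, but query "a" is already counted
]
```

Counting each query once, rather than each hit, keeps the result comparable to the percentages of flanking sequences reported in the abstract.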

Subjects

DNA/metabolism , Type II Site-Specific Deoxyribonucleases/metabolism , DNA Sequence Analysis/methods , Transformed Cell Line , Human Chromosomes Pair 21/genetics , Human Chromosomes Pair 22/genetics , CpG Islands/genetics , DNA/chemistry , DNA/genetics , Nucleic Acid Databases , Genes/genetics , Human Genome , Humans , Molecular Sequence Data , Nucleic Acid Repetitive Sequences/genetics

Suregene, a scalable system for automated term disambiguation of gene and protein names.

Podowski, Raf M; Cleary, John G; Goncharoff, Nicholas T; Amoutzias, Gregory; Hayes, William S.

J Bioinform Comput Biol ; 3(3): 743-70, 2005 Jun.

Article in English | MEDLINE | ID: mdl-16108092

ABSTRACT

Researchers, hindered by a lack of standard gene and protein-naming conventions, endure long, sometimes fruitless, literature searches. A system that is able to automatically assign gene names to their LocusLink ID (LLID) in previously unseen MEDLINE abstracts is described. The system is based on supervised learning and builds a model for each LLID. The training sets for all LLIDs are extracted automatically from MEDLINE references in the LocusLink and SwissProt databases. A validation was done of the performance for all 20,546 human genes with LLIDs. Of these, 7,344 produced good quality models (F-measure >0.7, nearly 60% of which were >0.9) and 13,202 did not, mainly due to insufficient numbers of known document references. A hand validation of MEDLINE documents for a set of 66 genes agreed well with the system's internal accuracy assessment. It is concluded that it is possible to achieve high quality gene disambiguation using scalable automated techniques.
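The abstract reports model quality as an F-measure with cut-offs at 0.7 and 0.9. As a minimal sketch, assuming the usual balanced F-measure (the harmonic mean of precision and recall, which the abstract does not state explicitly), the score and the two cut-offs could be computed as follows. The function names and the labels "high", "good", and "insufficient" are illustrative and are not taken from the paper.

```python
def f_measure(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-measure from true-positive, false-positive and false-negative counts.

    beta = 1.0 gives the balanced F1 score (an assumption here).
    """
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def model_quality(f: float) -> str:
    """Bucket a per-gene model by the cut-offs quoted in the abstract (labels are illustrative)."""
    if f > 0.9:
        return "high"
    if f > 0.7:
        return "good"
    return "insufficient"


# Toy counts, invented for illustration: precision = recall = 0.8.
score = f_measure(tp=8, fp=2, fn=2)
print(round(score, 3), model_quality(score))
```

Because the system builds one model per LLID, a score of this kind would be computed separately for each gene.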

Subjects

Algorithms , Genes , MEDLINE , Natural Language Processing , Proteins/classification , Software , Terminology as Topic , Protein Databases , Humans , Controlled Vocabulary

Gene characterization index: assessing the depth of gene annotation.

Kemmer, Danielle; Podowski, Raf M; Yusuf, Dimas; Brumm, Jochen; Cheung, Warren; Wahlestedt, Claes; Lenhard, Boris; Wasserman, Wyeth W.

PLoS One ; 3(1): e1440, 2008 Jan 23.

Article in English | MEDLINE | ID: mdl-18213364

ABSTRACT

BACKGROUND: We introduce the Gene Characterization Index, a bioinformatics method for scoring the extent to which a protein-encoding gene is functionally described. Inherently a reflection of human perception, the Gene Characterization Index is applied for assessing the characterization status of individual genes, thus serving the advancement of both genome annotation and applied genomics research by rapid and unbiased identification of groups of uncharacterized genes for diverse applications such as directed functional studies and delineation of novel drug targets. METHODOLOGY/PRINCIPAL FINDINGS: The scoring procedure is based on a global survey of researchers, who assigned characterization scores from 1 (poor) to 10 (extensive) for a sample of genes based on major online resources. By evaluating the survey as training data, we developed a bioinformatics procedure to assign gene characterization scores to all genes in the human genome. We analyzed snapshots of functional genome annotation over a period of 6 years to assess temporal changes reflected by the increase of the average Gene Characterization Index. Applying the Gene Characterization Index to genes within pharmaceutically relevant classes, we confirmed known drug targets as high-scoring genes and revealed potentially interesting novel targets with low characterization indexes. Removing known drug targets and genes linked to sequence-related patent filings from the entirety of indexed genes, we identified sets of low-scoring genes particularly suited for further experimental investigation. CONCLUSIONS/SIGNIFICANCE: The Gene Characterization Index is intended to serve as a tool to the scientific community and granting agencies for focusing resources and efforts on unexplored areas of the genome. The Gene Characterization Index is available from http://cisreg.ca/gci/.

Subjects

Computational Biology , Human Genome , Humans

AZuRE, a scalable system for automated term disambiguation of gene and protein names.

Podowski, Raf M; Cleary, John G; Goncharoff, Nicholas T; Amoutzias, Gregory; Hayes, William S.

Proc IEEE Comput Syst Bioinform Conf ; : 415-24, 2004.

Article in English | MEDLINE | ID: mdl-16448034

ABSTRACT

Researchers, hindered by a lack of standard gene and protein-naming conventions, endure long, sometimes fruitless, literature searches. A system is described which is able to automatically assign gene names to their LocusLink ID (LLID) in previously unseen MEDLINE abstracts. The system is based on supervised learning and builds a model for each LLID. The training sets for all LLIDs are extracted automatically from MEDLINE references in the LocusLink and SwissProt databases. A validation was done of the performance for all 20,546 human genes with LLIDs. Of these, 7,344 produced good quality models (F-measure > 0.7, nearly 60% of which were > 0.9) and 13,202 did not, mainly due to insufficient numbers of known document references. A hand validation of MEDLINE documents for a set of 66 genes agreed well with the system's internal accuracy assessment. It is concluded that it is possible to achieve high quality gene disambiguation using scalable automated techniques.

Subjects

Protein Databases , Information Storage and Retrieval/methods , MEDLINE , Natural Language Processing , Proteins/classification , Software , Terminology as Topic , Genes , User-Computer Interface , Controlled Vocabulary
