ABSTRACT
We describe a procedure for recruiting new members to enlarge protein family databases. The procedure makes use of the UniRef50 clusters produced by UniProt: current family entries recruit additional members from the UniRef50 clusters to which they belong. Only those additional UniRef50 members that are not fragments and whose length is within a restricted range relative to the original entry are recruited. The enriched dataset is then limited to genomes from selected clades. We used the COG database, which is used for genome annotation and for studies of phylogenetics and gene evolution, as a model. To validate the method, a UniRef-Enriched COG (UECOG), COG0151, was tested with six distinct procedures that compare recruited members with their recruiters: PSI-BLAST, secondary structure overlap (SOV), Seed Linkage, COGnitor, shared domain content, and neighbor-joining single-linkage. We observed that the first four agree in their validations. Presently, the UniRef50-based recruitment procedure enriches the COG database for Archaea, Bacteria, and the bacterial subgroups Actinobacteria, Firmicutes, Proteobacteria, and other bacteria by 2.2-, 8.0-, 7.0-, 8.8-, 8.7-, and 4.2-fold, respectively, in terms of sequences, and also considerably increases the number of species.
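The recruitment rule described above (same UniRef50 cluster, not a fragment, length within a restricted range of the recruiting entry) can be sketched as a simple filter. This is only an illustration: the abstract does not state the length range, so the 0.8 to 1.2 ratio bounds below are assumed placeholder values, and the `Member` record and `recruit` function are hypothetical names, not part of UniProt or COG tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """Hypothetical record for one sequence in a UniRef50 cluster."""
    accession: str
    length: int
    is_fragment: bool


def recruit(seed: Member, cluster: list[Member],
            min_ratio: float = 0.8, max_ratio: float = 1.2) -> list[Member]:
    """Return cluster members recruited by `seed`.

    A member is recruited when it is not the seed itself, is not a
    fragment, and its length lies within [min_ratio, max_ratio] times the
    seed length. The default ratios are assumptions for illustration; the
    abstract only says the range is "restricted".
    """
    lower = seed.length * min_ratio
    upper = seed.length * max_ratio
    return [
        m for m in cluster
        if m.accession != seed.accession
        and not m.is_fragment
        and lower <= m.length <= upper
    ]


# Made-up accessions and lengths, purely for demonstration.
seed = Member("SEED1", 100, False)
cluster = [
    seed,
    Member("CAND1", 95, False),   # similar length, complete sequence
    Member("CAND2", 50, False),   # too short relative to the seed
    Member("CAND3", 110, True),   # flagged as a fragment
    Member("CAND4", 115, False),  # similar length, complete sequence
]
recruited = [m.accession for m in recruit(seed, cluster)]
```

The subsequent restriction to genomes from selected clades would be one more predicate of the same kind, applied to whatever taxonomy field the records carry.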
Subjects
Computational Biology/methods , Protein Databases , Reproducibility of Results
ABSTRACT
The KEGG Orthology (KO) database was tested as a source for automated annotation of expressed sequence tags (ESTs). We used a control experiment in which every EST was assigned to its cognate protein, and an annotation experiment in which the ESTs were annotated by proteins from other organisms. From the results, we assigned each annotation to one of three classes: correct, changed, or speculated. Correct annotation ranged from 57% (Caenorhabditis elegans) to 81% (Homo sapiens). Although changed annotation was low (1% in H. sapiens to 9% in Arabidopsis thaliana), speculation was very high (18% in H. sapiens to 38% in C. elegans). We propose eliminating part of the speculated annotation by using the KEGG GENES database to enrich KO clusters, which decreased the speculation from 38% to 2% in C. elegans. Thus, the KO database still requires some effort in moving sequences from KEGG GENES to KO to improve annotation performance.
Subjects
Cluster Analysis , Genetic Databases , Expressed Sequence Tags , Animals , Arabidopsis/genetics , Caenorhabditis elegans/genetics , Computational Biology/methods , Drosophila melanogaster/genetics , Humans , DNA Sequence Analysis/methods
ABSTRACT
We show here an example of the application of a novel method, MUTIC (model utilization-based clustering), for identifying complex interactions between genes or gene categories based on gene expression data. The method deals with binary categorical data, which consist of a set of gene expression profiles divided into two biologically meaningful categories. It does not require data from multiple time points. Gene expression profiles are represented by feature vectors whose component features are either gene expression values or averaged expression values corresponding to Gene Ontology or Protein Information Resource categories. A supervised learning algorithm (genetic programming) is used to learn an ensemble of classification models distinguishing the two categories based on the feature vectors corresponding to their members. Each feature is associated with a "model utilization vector", which has an entry for each high-quality classification model found, indicating whether or not the feature was used in that model. These utilization vectors are then clustered using a variant of hierarchical clustering called Omniclust. The result is a set of model utilization-based clusters, in which features are gathered together if they are often considered together by classification models, which may be because they are co-expressed or for subtler reasons involving multi-gene interactions. The MUTIC method is illustrated here by applying it to a dataset on gene expression in prostate cancer and control samples. Compared to traditional expression-based clustering, MUTIC yields clusters that have higher mathematical quality (in the sense of homogeneity and separation) and that also yield novel insights into the underlying biological processes.
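The "model utilization vector" idea can be made concrete with a small sketch. It assumes each high-quality classification model is summarized simply as the set of features it uses, and it compares utilization vectors with a Jaccard distance. Both choices are assumptions for illustration: the abstract does not specify the distance measure, and the Omniclust clustering step itself is not reproduced here. The function names and the toy feature names are hypothetical.

```python
def utilization_vectors(models: list[set[str]],
                        features: list[str]) -> dict[str, list[int]]:
    """Map each feature to a 0/1 vector with one entry per model.

    Entry i is 1 when the feature is used by model i, else 0.
    """
    return {f: [1 if f in m else 0 for m in models] for f in features}


def jaccard_distance(u: list[int], v: list[int]) -> float:
    """Jaccard distance between two binary vectors (0.0 = identical usage).

    Assumed here as one plausible dissimilarity; not stated in the source.
    """
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return 1.0 - both / either if either else 0.0


# Toy ensemble: three models, each listed by the features it uses.
models = [{"g1", "g2"}, {"g1", "g2", "g3"}, {"g3"}]
vectors = utilization_vectors(models, ["g1", "g2", "g3"])

# g1 and g2 are always used together, so their distance is zero;
# g1 and g3 co-occur in only one of the three models that use either.
d_g1_g2 = jaccard_distance(vectors["g1"], vectors["g2"])
d_g1_g3 = jaccard_distance(vectors["g1"], vectors["g3"])
```

A pairwise distance matrix built this way could be passed to any hierarchical clustering routine; features that end up together are those that the models tend to use jointly, whether or not their expression values are correlated.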
Subjects
Neoplastic Gene Expression Regulation , Genetic Techniques , Prostatic Neoplasms/genetics , Cluster Analysis , Humans , Male
ABSTRACT
T-type Ca2+ channels are important for cell signaling in a variety of cells. We report here the electrophysiological and molecular characteristics of the whole-cell Ca2+ current in GH3 clonal pituitary cells. The current inactivation at 0 mV was described by a single exponential function with a time constant of 18.32 ± 1.87 ms (N = 16). The I-V relationship measured with Ca2+ as a charge carrier was shifted to the left when we applied a conditioning pre-pulse of up to -120 mV, indicating that a low voltage-activated current may be present in GH3 cells. Transient currents were first activated at -50 mV and peaked around -20 mV. The half-maximal activation voltages and the slope factors for the two conditions were -35.02 ± 2.4 and 6.7 ± 0.3 mV (pre-pulse of -120 mV, N = 15), and -27.0 ± 0.97 and 7.5 ± 0.7 mV (pre-pulse of -40 mV, N = 9). The 8-mV shift in the activation mid-point was statistically significant (P < 0.05). The tail currents decayed bi-exponentially, suggesting two different T-type Ca2+ channel populations. RT-PCR revealed the presence of α1G (CaV3.1) and α1I (CaV3.3) T-type Ca2+ channel mRNA transcripts.
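The reported half-maximal activation voltages and slope factors are the two parameters of a Boltzmann-type activation curve, and the inactivation time constant is the parameter of a single-exponential decay. The sketch below evaluates both with the values quoted above. It assumes the standard Boltzmann form and a decay toward zero, neither of which the abstract states explicitly; the function and constant names are hypothetical.

```python
import math


def boltzmann_activation(v_mv: float, v_half_mv: float, slope_mv: float) -> float:
    """Fraction of maximal activation at potential v_mv.

    Assumes the standard form 1 / (1 + exp((V_half - V) / k)).
    """
    return 1.0 / (1.0 + math.exp((v_half_mv - v_mv) / slope_mv))


def remaining_fraction(t_ms: float, tau_ms: float) -> float:
    """Fraction of current left after t_ms of single-exponential decay.

    Assumes the current decays toward zero (no sustained component).
    """
    return math.exp(-t_ms / tau_ms)


# (V_half in mV, slope factor in mV), taken from the abstract.
PREPULSE_MINUS_120 = (-35.02, 6.7)
PREPULSE_MINUS_40 = (-27.0, 7.5)

# Shift of the activation mid-point between the two conditions.
shift_mv = PREPULSE_MINUS_40[0] - PREPULSE_MINUS_120[0]

# By construction, activation is half-maximal at V_half.
at_midpoint = boltzmann_activation(-35.02, *PREPULSE_MINUS_120)

# After one time constant (18.32 ms at 0 mV) a fraction 1/e remains.
after_one_tau = remaining_fraction(18.32, 18.32)
```

The computed shift of about 8 mV matches the difference in activation mid-points reported in the abstract.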
Subjects
T-Type Calcium Channels/physiology , Pituitary Gland/cytology , T-Type Calcium Channels/genetics , Cell Line , Clone Cells , Electrophysiology , Humans , Reverse Transcriptase Polymerase Chain Reaction