1.
bioRxiv ; 2024 Jan 12.
Article in English | MEDLINE | ID: mdl-38260683

ABSTRACT

Folate is a vitamin required for cell growth and is present in fortified foods in the form of folic acid to prevent congenital abnormalities. The impact of low folate status on life-long health is poorly understood. We found that limiting folate levels with the folate antagonist methotrexate increased the lifespan of yeast and worms. We then restricted folate intake in aged mice and measured various health metrics, metabolites, and gene expression signatures. Limiting folate intake decreased anabolic biosynthetic processes in mice and enhanced metabolic plasticity. Despite reduced serum folate levels in mice with limited folic acid intake, these animals maintained their weight and adiposity late in life, and we did not observe adverse health outcomes. These results argue that the effectiveness of folate dietary interventions may vary depending on an individual's age and sex. A higher folate intake is advantageous during the early stages of life to support cell divisions needed for proper development. However, a lower folate intake later in life may result in healthier aging.

2.
Theor Appl Genet ; 136(7): 155, 2023 Jun 17.
Article in English | MEDLINE | ID: mdl-37329482

ABSTRACT

KEY MESSAGE: A novel locus was discovered on chromosome 7 associated with a lesion mimic in maize; this lesion mimic had a quantitative and heritable phenotype and was predicted better via subset genomic markers than whole genome markers across diverse environments. Lesion mimics are a phenotype of leaf micro-spotting in maize (Zea mays L.), which can be early signs of biotic or abiotic stresses. Dissecting its inheritance is helpful to understand how these loci behave across different genetic backgrounds. Here, 538 maize recombinant inbred lines (RILs) segregating for a novel lesion mimic were quantitatively phenotyped in Georgia, Texas, and Wisconsin. These RILs were derived from three bi-parental crosses using a tropical pollinator (Tx773) as the common parent crossed with three inbreds (LH195, LH82, and PB80). While this lesion mimic was heritable across three environments based on phenotypic ([Formula: see text] = 0.68) and genomic ([Formula: see text] = 0.91) data, transgressive segregation was observed. A genome-wide association study identified a single novel locus on chromosome 7 (at 70.6 Mb) also covered by a quantitative trait locus interval (69.3-71.0 Mb), explaining 11-15% of the variation, depending on the environment. One candidate gene identified in this region, Zm00001eb308070, is related to the abscisic acid pathway involved in cell death. Genomic predictions were applied to genome-wide markers (39,611 markers) contrasted with a marker subset (51 markers). Population structure explained more variation than environment in genomic prediction, but other substantial genetic background effects were additionally detected. Subset markers explained substantially less genetic variation (24.9%) for the lesion mimic than whole genome markers (55.4%) in the model, yet predicted the lesion mimic better (0.56-0.66 vs. 0.26-0.29).
These results indicate this lesion mimic phenotype was less affected by environment than by epistasis and genetic background effects, which explain its transgressive segregation.


Subjects
Genome-Wide Association Study , Zea mays , Zea mays/genetics , Genetic Epistasis , Chromosome Mapping , Phenotype , Genetic Background , Single Nucleotide Polymorphism
3.
Am J Trop Med Hyg ; 105(5): 1227-1229, 2021 Sep 20.
Article in English | MEDLINE | ID: mdl-34544043

ABSTRACT

To better understand the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variant lineage distribution in a college campus population, we carried out viral genome surveillance over a 7-week period from January to March 2021. Among the sequences were three novel viral variants: BV-1 with a B.1.1.7/20I genetic background and an additional spike mutation Q493R, associated with a mild but longer-than-usual COVID-19 case in a college-age person, BV-2 with a T478K mutation on a 20B genetic background, and BV-3, an apparent recombinant lineage. This work highlights the potential of an undervaccinated younger population as a reservoir for the spread and generation of novel variants. This also demonstrates the value of whole genome sequencing as a routine disease surveillance tool.


Subjects
COVID-19/virology , Disease Reservoirs/virology , Mutation , SARS-CoV-2/genetics , Students/statistics & numerical data , Universities , Adult , COVID-19/etiology , Viral Genome , Humans , Neutralization Tests , SARS-CoV-2/immunology , SARS-CoV-2/isolation & purification , Young Adult
4.
PLoS One ; 13(3): e0193757, 2018.
Article in English | MEDLINE | ID: mdl-29579071

ABSTRACT

BACKGROUND: Protein superfamilies can be divided into subfamilies of proteins with different functional characteristics. Their sequences can be classified hierarchically, which is part of sequence function assignation. Typically, there are no clear subfamily hallmarks that would allow pattern-based function assignation, so this task is mostly achieved based on the similarity principle. This is hampered by the lack of a score cut-off that is both sensitive and specific. RESULTS: HMMER Cut-off Threshold Tool (HMMERCTTER) adds a reliable cut-off threshold to the popular HMMER. Using a high quality superfamily phylogeny, it clusters a set of training sequences such that the cluster-specific HMMER profiles show cluster or subfamily member detection with 100% precision and recall (P&R), thereby generating a specific threshold as inclusion cut-off. Profiles and thresholds are then used as classifiers to screen a target dataset. Iterative inclusion of novel sequences to groups and the corresponding HMMER profiles results in high sensitivity while specificity is maintained by imposing 100% P&R self detection. In three presented case studies of protein superfamilies, classification of large datasets with 100% precision was achieved with over 95% recall. Limits and caveats are presented and explained. CONCLUSIONS: HMMERCTTER is a promising protein superfamily sequence classifier provided high quality training datasets are used. It provides a decision support system that aids in the difficult task of sequence function assignation in the twilight zone of sequence similarity. All relevant data and source codes are available from the Github repository at the following URL: https://github.com/BBCMdP/HMMERCTTER.
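
The 100% precision and recall condition described in the abstract can be illustrated with a minimal sketch. This is not the HMMERCTTER code: the function names and the plain lists of scores are assumptions made for illustration, and the sketch only shows when a score cut-off that separates cluster members from non-members exists.

```python
def perfect_cutoff(member_scores, nonmember_scores):
    """Return a cut-off giving 100% precision and recall, or None.

    Hypothetical helper: scores are profile scores of training sequences
    against one cluster-specific profile (higher means a better match).
    """
    lowest_member = min(member_scores)
    highest_nonmember = max(nonmember_scores, default=float("-inf"))
    # If any non-member scores at least as high as the weakest member,
    # no single threshold can separate the two groups perfectly.
    if lowest_member <= highest_nonmember:
        return None
    # The weakest member's score accepts all members and no non-member.
    return lowest_member


def is_member(score, cutoff):
    """Classify a target sequence by comparing its score to the cut-off."""
    return score >= cutoff
```

Under these assumptions, a cluster whose member and non-member scores overlap yields no cut-off, which corresponds to a cluster that fails the 100% P&R requirement.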


Subjects
Computational Biology/methods , Proteins/chemistry , Supervised Machine Learning , Amino Acid Sequence , Cluster Analysis , Proteomics
5.
Genomics ; 109(5-6): 438-445, 2017 Oct.
Article in English | MEDLINE | ID: mdl-28694080

ABSTRACT

It is usually assumed that co-expressed genes suggest co-regulation in the underlying regulatory network. Determining sets of co-expressed genes is an important task, based on some criteria of similarity. This task is usually performed by clustering algorithms, where the genes are clustered into meaningful groups based on their expression values in a set of experiments. In this work, we propose a method to find sets of co-expressed genes, based on cluster validation indices as a measure of similarity for individual gene groups, and a combination of variants of hierarchical clustering to generate the candidate groups. We evaluated its ability to retrieve significant sets on simulated correlated and real genomics data, where the performance is measured based on its detection ability of co-regulated sets against a full search. Additionally, we analyzed the quality of the best ranked groups using an online bioinformatics tool that provides network information for the selected genes.


Subjects
Computational Biology/methods , Genetic Association Studies/methods , Algorithms , Animals , Cluster Analysis , Gene Expression Profiling/methods , Gene Regulatory Networks , Humans , Oligonucleotide Array Sequence Analysis/methods
6.
Article in English | MEDLINE | ID: mdl-19407354

ABSTRACT

A more complete understanding of the alterations in cellular regulatory and control mechanisms that occur in the various forms of cancer has been one of the central targets of the genomic and proteomic methods that allow surveys of the abundance and/or state of cellular macromolecules. This preference is driven both by the intractability of cancer to generic therapies, assumed to be due to the highly varied molecular etiologies observed in cancer, and by the opportunity to discern and dissect the regulatory and control interactions presented by the highly diverse assortment of perturbations of regulation and control that arise in cancer. Exploiting the opportunities for inference on the regulatory and control connections offered by these revealing system perturbations is fraught with the practical problems that arise from the way biological systems operate. Two classes of regulatory action in biological systems are particularly inimical to inference, convergent regulation, where a variety of regulatory actions result in a common set of control responses (crosstalk), and divergent regulation, where a single regulatory action produces entirely different sets of control responses, depending on cellular context (conditioning). We have constructed a coarse mathematical model of the propagation of regulatory influence in such distributed, context-sensitive regulatory networks that allows a quantitative estimation of the amount of crosstalk and conditioning associated with a candidate regulatory gene taken from a set of genes that have been profiled over a series of samples where the candidate's activity varies.


Subjects
Neoplastic Gene Expression Regulation , Genetic Models , Statistical Models , Oligonucleotide Array Sequence Analysis , Signal Transduction/genetics , Algorithms , Tumor Cell Line , Genes , Humans , Reproducibility of Results
7.
Curr Genomics ; 10(6): 430-45, 2009 Sep.
Article in English | MEDLINE | ID: mdl-20190957

ABSTRACT

The development of microarray technology has enabled scientists to measure the expression of thousands of genes simultaneously, resulting in a surge of interest in several disciplines throughout biology and medicine. While data clustering has been used for decades in image processing and pattern recognition, in recent years it has joined this wave of activity as a popular technique to analyze microarrays. To illustrate its application to genomics, clustering applied to genes from a set of microarray data groups together those genes whose expression levels exhibit similar behavior throughout the samples, and when applied to samples it offers the potential to discriminate pathologies based on their differential patterns of gene expression. Although clustering has now been used for many years in the context of gene expression microarrays, it has remained highly problematic. The choice of a clustering algorithm and validation index is not a trivial one, more so when applying them to high throughput biological or medical data. Factors to consider when choosing an algorithm include the nature of the application, the characteristics of the objects to be analyzed, the expected number and shape of the clusters, and the complexity of the problem versus computational power available. In some cases a very simple algorithm may be appropriate to tackle a problem, but many situations may require a more complex and powerful algorithm better suited for the job at hand. In this paper, we will cover the theoretical aspects of clustering, including error and learning, followed by an overview of popular clustering algorithms and classical validation indices. We also discuss the relative performance of these algorithms and indices and conclude with examples of the application of clustering to computational biology.

8.
Article in English | MEDLINE | ID: mdl-18483613

ABSTRACT

Is it better to design a classifier and estimate its error on the full sample or to design a classifier on a training subset and estimate its error on the holdout test subset? Full-sample design provides the better classifier; nevertheless, one might choose holdout with the hope of better error estimation. A conservative criterion to decide the best course is to aim at a classifier whose error is less than a given bound. Then the choice between full-sample and holdout designs depends on which possesses the smaller expected bound. Using this criterion, we examine the choice between holdout and several full-sample error estimators using covariance models and a patient-data model. Full-sample design consistently outperforms holdout design. The relation between the two designs is revealed via a decomposition of the expected bound into the sum of the expected true error and the expected conditional standard deviation of the true error.

9.
Nucleic Acids Res ; 35(10): e72, 2007.
Article in English | MEDLINE | ID: mdl-17478523

ABSTRACT

Microarray gene expression data becomes more valuable as our confidence in the results grows. Guaranteeing data quality becomes increasingly important as microarrays are being used to diagnose and treat patients (1-4). The MAQC Quality Control Consortium, the FDA's Critical Path Initiative, NCI's caBIG and others are implementing procedures that will broadly enhance data quality. As GEO continues to grow, its usefulness is constrained by the level of correlation across experiments and general applicability. Although RNA preparation and array platform play important roles in data accuracy, pre-processing is a user-selected factor that has an enormous effect. Normalization of expression data is necessary, but the methods have specific and pronounced effects on precision, accuracy and historical correlation. As a case study, we present a microarray calibration process using normalization as the adjustable parameter. We examine the impact of eight normalizations across both Agilent and Affymetrix expression platforms on three expression readouts: (1) sensitivity and power, (2) functional/biological interpretation and (3) feature selection and classification error. The reader is encouraged to measure their own discordant data, whether cross-laboratory, cross-platform or across any other variance source, and to use their results to tune the adjustable parameters of their laboratory to ensure increased correlation.


Subjects
Gene Expression Profiling/standards , Oligonucleotide Array Sequence Analysis/standards , Calibration , Gene Expression Profiling/methods , Humans , Liver/metabolism , Lung/metabolism , Oligonucleotide Array Sequence Analysis/methods , Quality Control , Spleen/metabolism , Statistical Distributions
10.
Genomics ; 90(2): 176-85, 2007 Aug.
Article in English | MEDLINE | ID: mdl-17521869

ABSTRACT

Computational approaches were used to define structural and functional determinants of a putative genetic regulatory network of murine LINE-1 (long interspersed nuclear element-1), an active mammalian retrotransposon that uses RNA intermediates to populate new sites throughout the genome. Polymerase (RNA) II polypeptide E AI845735 and mouse DNA homologous to Drosophila per fragment M12039 were identified as primary attractors. siRNA knockdown of the aryl hydrocarbon receptor NM_013464 modulated gene expression within the network, including LINE-1, Sgpl1, Sdcbp, and Mgst1. Genes within the network did not exhibit physical proximity and instead were dispersed throughout the genome. The potential impact of individual members of the network on the global dynamical behavior of LINE-1 was examined from a theoretical and empirical framework.


Subjects
Gene Regulatory Networks , Long Interspersed Nucleotide Elements/genetics , Algorithms , Animals , Computational Biology , Genome , Genomics , HeLa Cells , Humans , Mice , Biological Models , Genetic Models , RNA/metabolism , Aryl Hydrocarbon Receptors/genetics
11.
Bioinformatics ; 23(1): 57-63, 2007 Jan 01.
Article in English | MEDLINE | ID: mdl-17062589

ABSTRACT

MOTIVATION: The technology to genotype single nucleotide polymorphisms (SNPs) at extremely high densities provides for hypothesis-free genome-wide scans for common polymorphisms associated with complex disease. However, we find that some errors introduced by commonly employed genotyping algorithms may lead to inflation of false associations between markers and phenotype. RESULTS: We have developed a novel SNP genotype calling program, SNiPer-High Density (SNiPer-HD), for highly accurate genotype calling across hundreds of thousands of SNPs. The program employs an expectation-maximization (EM) algorithm with parameters based on a training sample set. The algorithm choice allows for highly accurate genotyping for most SNPs. Also, we introduce a quality control metric for each assayed SNP, such that poor-behaving SNPs can be filtered using a metric correlating to genotype class separation in the calling algorithm. SNiPer-HD is superior to the standard dynamic modeling algorithm and is complementary and non-redundant to other algorithms, such as BRLMM. Implementing multiple algorithms together may provide highly accurate genotyping calls, without inflation of false positives due to systematically miscalled SNPs. A reliable and accurate set of SNP genotypes for increasingly dense panels will eliminate some false association signals and false negative signals, allowing for rapid identification of disease susceptibility loci for complex traits. AVAILABILITY: SNiPer-HD is available at TGen's website: http://www.tgen.org/neurogenomics/data.


Subjects
Algorithms , Computational Biology/methods , Oligonucleotide Array Sequence Analysis/methods , Single Nucleotide Polymorphism/genetics , Chromosome Mapping , Genetic Databases , False Positive Reactions , Gene Expression Profiling , Genotype , Humans , Genetic Models , Statistical Models , Multigene Family , Reproducibility of Results , DNA Sequence Analysis , White People/genetics
12.
Cancer Inform ; 2: 189-96, 2007 Feb 16.
Article in English | MEDLINE | ID: mdl-19458767

ABSTRACT

The issue of wide feature-set variability has recently been raised in the context of expression-based classification using microarray data. This paper addresses this concern by demonstrating the natural manner in which many feature sets of a certain size chosen from a large collection of potential features can be so close to being optimal that they are statistically indistinguishable. Feature-set optimality is inherently related to sample size because it only arises on account of the tendency for diminished classifier accuracy as the number of features grows too large for satisfactory design from the sample data. The paper considers optimal feature sets in the framework of a model in which the features are grouped in such a way that intra-group correlation is substantial whereas inter-group correlation is minimal, the intent being to model the situation in which there are groups of highly correlated co-regulated genes and there is little correlation between the co-regulated groups. This is accomplished by using a block model for the covariance matrix that reflects these conditions. Focusing on linear discriminant analysis, we demonstrate how these assumptions can lead to very large numbers of close-to-optimal feature sets.
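
The block covariance structure mentioned in the abstract can be sketched as follows. This is only an illustration under stated assumptions: the function name, the equal group sizes, the single shared correlation `rho`, and an inter-group correlation of exactly zero (the abstract says only "minimal") are all choices made here, not details taken from the paper.

```python
def block_covariance(n_groups, group_size, rho, variance=1.0):
    """Build a block-diagonal covariance matrix as nested lists.

    Features in the same group share covariance rho * variance;
    features in different groups are uncorrelated (an assumption).
    """
    dim = n_groups * group_size
    cov = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        for j in range(dim):
            if i == j:
                cov[i][j] = variance
            elif i // group_size == j // group_size:
                # Same group: substantial intra-group correlation.
                cov[i][j] = rho * variance
    return cov
```

With a high `rho`, any feature in a group is nearly interchangeable with its group mates, which gives an intuition for why many distinct feature sets can perform almost equally well.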

13.
Article in English | MEDLINE | ID: mdl-18309365

ABSTRACT

The modeling of genetic regulatory networks is becoming increasingly widespread in the study of biological systems. In the abstract, one would prefer quantitatively comprehensive models, such as a differential-equation model, to coarse models; however, in practice, detailed models require more accurate measurements for inference and more computational power to analyze than coarse-scale models. It is crucial to address the issue of model complexity in the framework of a basic scientific paradigm: the model should be of minimal complexity to provide the necessary predictive power. Addressing this issue requires a metric by which to compare networks. This paper proposes the use of a classical measure of difference between amplitude distributions for periodic signals to compare two networks according to the differences of their trajectories in the steady state. The metric is applicable to networks with both continuous and discrete values for both time and state, and it possesses the critical property that it allows the comparison of networks of different natures. We demonstrate application of the metric by comparing a continuous-valued reference network against simplified versions obtained via quantization.

14.
Am J Hum Genet ; 80(1): 126-39, 2007 Jan.
Article in English | MEDLINE | ID: mdl-17160900

ABSTRACT

We report the development and validation of experimental methods, study designs, and analysis software for pooling-based genomewide association (GWA) studies that use high-throughput single-nucleotide-polymorphism (SNP) genotyping microarrays. We first describe a theoretical framework for establishing the effectiveness of pooling genomic DNA as a low-cost alternative to individually genotyping thousands of samples on high-density SNP microarrays. Next, we describe software called "GenePool," which directly analyzes SNP microarray probe intensity data and ranks SNPs by increased likelihood of being genetically associated with a trait or disorder. Finally, we apply these methods to experimental case-control data and demonstrate successful identification of published genetic susceptibility loci for a rare monogenic disease (sudden infant death with dysgenesis of the testes syndrome), a rare complex disease (progressive supranuclear palsy), and a common complex disease (Alzheimer disease) across multiple SNP genotyping platforms. On the basis of these theoretical calculations and their experimental validation, our results suggest that pooling-based GWA studies are a logical first step for determining whether major genetic associations exist in diseases with high heritability.


Subjects
Human Genome , Genetic Models , Single Nucleotide Polymorphism/genetics , Software , Alzheimer Disease/genetics , Case-Control Studies , Computer Simulation , Genetic Markers , Genotype , Gonadal Dysgenesis/genetics , Humans , Male , Oligonucleotide Array Sequence Analysis , Research Design , Progressive Supranuclear Palsy/genetics , Syndrome , Testis/abnormalities
15.
Bioinformatics ; 22(7): 837-42, 2006 Apr 01.
Article in English | MEDLINE | ID: mdl-16428263

ABSTRACT

MOTIVATION: Given a large set of potential features, such as the set of all gene-expression values from a microarray, it is necessary to find a small subset with which to classify. The task of finding an optimal feature set of a given size is inherently combinatoric because to assure optimality all feature sets of a given size must be checked. Thus, numerous suboptimal feature-selection algorithms have been proposed. There are strong impediments to evaluate feature-selection algorithms using real data when data are limited, a common situation in genetic classification. The difficulty is compound. First, there are no class-conditional distributions from which to draw data points, only a single small labeled sample. Second, there are no test data with which to estimate the feature-set errors, and one must depend on a training-data-based error estimator. Finally, there is no optimal feature set with which to compare the feature sets found by the algorithms. RESULTS: This paper describes a genetic test bed for the evaluation of feature-selection algorithms. It begins with a large biological feature-label dataset that is used as an empirical distribution and, using massively parallel computation, finds the top feature sets of various sizes based on a given sample size and classification rule. The user can draw random samples from the data, apply a proposed algorithm, and evaluate the proficiency of the proposed algorithm via three different measures (code provided). A key feature of the test bed is that, once a dataset is input, a single command creates the entire test bed relative to the dataset. The particular dataset used for the first version of the test bed comes from a microarray-based classification study that analyzes a large number of microarrays, prepared with RNA from breast tumor samples from each of 295 patients. 
AVAILABILITY: The software and supplementary material are available at http://public.tgen.org/tgen-cb/support/testbed/ CONTACT: edward@ece.tamu.edu.


Subjects
Algorithms , Computer Simulation , Gene Expression Profiling/methods , Breast Neoplasms , Data Collection , Genetic Databases , Female , Humans , Statistical Models , Oligonucleotide Array Sequence Analysis/methods
16.
Transplantation ; 77(8): 1301-4, 2004 Apr 27.
Article in English | MEDLINE | ID: mdl-15114103

ABSTRACT

BACKGROUND: The influence of islet transportation on pancreatic islet allotransplantation in type 1 diabetic patients was evaluated within the GRAGIL network. PATIENTS AND METHODS: From December 2001 to April 2003, 16 human pancreatic islet transplants were performed in 9 type 1 diabetic patients with an established kidney graft (functioning for at least 6 months) in four centers of the GRAGIL network. Islet isolation was performed in a core laboratory in Geneva, and the islet preparations were shipped by ambulance to each center for transplantation. One month after transplantation, the efficiency of the graft was assessed according to islet transportation time (ITT): ITT less than 2 hours (group 1, n=5), and ITT greater than 4.5 hours (group 2, n=4, median 5 hours). RESULTS: Primary graft dysfunction was observed in one patient in group 1 after one month. Two patients became insulin independent in groups 1 and 2. All other patients in both groups had a plasma C-peptide level greater than 0.5 ng/ml. The HbA1c level and the exogenous insulin needs decreased in both groups. CONCLUSIONS: ITT does not seem to influence the efficiency of pancreatic islet allotransplantation in type 1 diabetic patients. These results emphasize the scope for multicenter networks such as the GRAGIL group.


Subjects
Type 1 Diabetes Mellitus/surgery , Islets of Langerhans Transplantation/methods , Tissue and Organ Procurement/methods , Adult , C-Peptide/blood , Creatinine/blood , Female , France , Glycated Hemoglobin/metabolism , Graft Survival , Humans , Islets of Langerhans Transplantation/physiology , Male , Middle Aged , Switzerland , Time Factors , Transportation
17.
Environ Health Perspect ; 112(4): 403-12, 2004 Mar.
Article in English | MEDLINE | ID: mdl-15033587

ABSTRACT

The co-expression of genes coupled to additive probabilistic relationships was used to identify gene sets predictive of the complex biological interactions regulated by ligands of the aryl hydrocarbon receptor (Ahr). To maximize the number of possible gene-gene combinations, data sets from murine embryonic kidney, fetal heart, and vascular smooth muscle cells challenged in vitro with ligands of the Ahr were used to create predictor/training data sets. Biologically relevant gene predictor sets were calculated for Ahr, cytochrome P450 1B1, insulin-like growth factor-binding protein-5, lysyl oxidase, and osteopontin. Transcript levels were categorized into ternary expressions and target genes selected from the data set and tested for all possible combinations using three gene sets as predictors of transitional level. The goodness of prediction for each set was quantified using a multivariate nonlinear coefficient of determination. Evidence is presented that predictor gene combinations can be effectively used to resolve gene-gene interactions regulated by Ahr ligands. Key words: aryl hydrocarbon receptor, bioinformatics, gene networks, genomics. Environ Health Perspect 112:403-412 (2004). [Online 14 January 2004]
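
A coefficient of determination for ternary data is commonly written as (e0 - e_opt) / e0, where e0 is the error of the best guess of the target without predictors and e_opt is the error of the best predictor built from the predictor genes. The sketch below assumes that form and, as a further simplification, estimates both errors as misclassification counts on the same sample; the function name and the estimation details are illustrative assumptions, not the procedure used in the paper.

```python
from collections import Counter, defaultdict


def coefficient_of_determination(predictors, target):
    """Estimate (e0 - e_opt) / e0 from ternary data (-1, 0, +1).

    predictors: one tuple of predictor-gene levels per sample.
    target: the target-gene level for each sample.
    Errors are counted on the training sample itself (an assumption).
    """
    n = len(target)
    # e0: errors made by always guessing the most frequent target level.
    base_errors = n - Counter(target).most_common(1)[0][1]
    if base_errors == 0:
        return 0.0  # a constant target leaves nothing to improve
    # e_opt: for each observed predictor pattern, guess the majority level.
    by_pattern = defaultdict(Counter)
    for pattern, level in zip(predictors, target):
        by_pattern[tuple(pattern)][level] += 1
    predictor_errors = sum(
        sum(counts.values()) - counts.most_common(1)[0][1]
        for counts in by_pattern.values()
    )
    return (base_errors - predictor_errors) / base_errors
```

A value near 1 means the three predictor genes almost fully determine the target level in the sample, while a value near 0 means they add little beyond guessing the most common level.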


Subjects
Gene Expression Regulation , Aryl Hydrocarbon Receptors/genetics , Aryl Hydrocarbon Receptors/physiology , Animals , Environmental Pollutants/toxicity , Genomics , Ligands , Mice , Inbred C57BL Mice , Oligonucleotide Array Sequence Analysis , Predictive Value of Tests , Genetic Transcription
18.
Bioinformatics ; 19(8): 944-51, 2003 May 22.
Article in English | MEDLINE | ID: mdl-12761056

ABSTRACT

MOTIVATION: A major problem of pattern classification is estimation of the Bayes error when only small samples are available. One way to estimate the Bayes error is to design a classifier based on some classification rule applied to sample data, estimate the error of the designed classifier, and then use this estimate as an estimate of the Bayes error. Relative to the Bayes error, the expected error of the designed classifier is biased high, and this bias can be severe with small samples. RESULTS: This paper provides a correction for the bias by subtracting a term derived from the representation of the estimation error. It does so for Boolean classifiers, these being defined on binary features. Although the general theory applies to any Boolean classifier, a model is introduced to reduce the number of parameters. A key point is that the expected correction is conservative. Properties of the corrected estimate are studied via simulation. The correction applies to binary predictors because they are mathematically identical to Boolean classifiers. In this context the correction is adapted to the coefficient of determination, which has been used to measure nonlinear multivariate relations between genes and design genetic regulatory networks. An application using gene-expression data from a microarray experiment is provided on the website http://gspsnap.tamu.edu/smallsample/ (user: 'smallsample', password: 'smallsample').


Subjects
Bayes Theorem , Statistical Models , Oligonucleotide Array Sequence Analysis/methods , Automated Pattern Recognition , Sequence Alignment/methods , Algorithms , Computer Simulation , Gene Expression Regulation/genetics , Genetic Models , Quality Control , Reproducibility of Results , Sample Size , Sensitivity and Specificity , Computer-Assisted Signal Processing
19.
Cancer Lett ; 191(2): 193-202, 2003 Mar 10.
Article in English | MEDLINE | ID: mdl-12618333

ABSTRACT

Here, we describe the identification of three human genes with altered expression in thyroid diseases. One of them corresponds to insulin-like growth factor binding protein 5 (IGFBP5), which has already been described as overexpressed in other cancers and, for the first time, is identified as overexpressed in thyroid tumors. The other genes, named 44 and 199, are ESTs with yet unknown function and were mapped on human chromosomes seven and four, respectively. We determined by RT-PCR the expression level of these genes in ten samples of disease-free thyroid, ten of goiter, nine of papillary carcinoma, ten of adenoma and seven of follicular carcinoma, and the significance of observed differences was statistically determined. IGFBP-5 and gene 44 were significantly overexpressed in papillary carcinoma when compared to normal and goiter. Genes 44 and 199 were differentially expressed in follicular carcinoma and adenoma when compared to normal thyroid tissue.


Subjects
Follicular Adenocarcinoma/genetics , Adenoma/genetics , Papillary Carcinoma/genetics , Expressed Sequence Tags , Goiter/genetics , Insulin-Like Growth Factor Binding Protein 5/genetics , Thyroid Neoplasms/genetics , Follicular Adenocarcinoma/metabolism , Follicular Adenocarcinoma/pathology , Adenoma/metabolism , Adenoma/pathology , Southern Blotting , Papillary Carcinoma/metabolism , Papillary Carcinoma/pathology , Human Chromosomes Pair 4/genetics , Human Chromosomes Pair 7/genetics , DNA Primers/chemistry , Differential Diagnosis , Gene Expression Profiling , Goiter/metabolism , Goiter/pathology , Humans , Insulin-Like Growth Factor Binding Protein 5/metabolism , Messenger RNA/analysis , Reverse Transcriptase Polymerase Chain Reaction , Thyroid Gland/metabolism , Thyroid Neoplasms/metabolism , Thyroid Neoplasms/pathology
20.
J Comput Biol ; 9(1): 105-26, 2002.
Article in English | MEDLINE | ID: mdl-11911797

ABSTRACT

There are many algorithms to cluster sample data points based on nearness or a similarity measure. Often the implication is that points in different clusters come from different underlying classes, whereas those in the same cluster come from the same class. Stochastically, the underlying classes represent different random processes. The inference is that clusters represent a partition of the sample points according to which process they belong. This paper discusses a model-based clustering toolbox that evaluates cluster accuracy. Each random process is modeled as its mean plus independent noise, sample points are generated, the points are clustered, and the clustering error is the number of points clustered incorrectly according to the generating random processes. Various clustering algorithms are evaluated based on process variance and the key issue of the rate at which algorithmic performance improves with increasing numbers of experimental replications. The model means can be selected by hand to test the separability of expected types of biological expression patterns. Alternatively, the model can be seeded by real data to test the expected precision of that output or the extent of improvement in precision that replication could provide. In the latter case, a clustering algorithm is used to form clusters, and the model is seeded with the means and variances of these clusters. Other algorithms are then tested relative to the seeding algorithm. Results are averaged over various seeds. Output includes error tables and graphs, confusion matrices, principal-component plots, and validation measures. Five algorithms are studied in detail: K-means, fuzzy C-means, self-organizing maps, hierarchical Euclidean-distance-based and correlation-based clustering. The toolbox is applied to gene-expression clustering based on cDNA microarrays using real data. Expression profile graphics are generated and error analysis is displayed within the context of these profile graphics. 
A large amount of generated output is available over the web.
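
The clustering error defined in the abstract, the number of points clustered incorrectly relative to the generating random processes, can be sketched as below. Because cluster labels are arbitrary, the sketch takes the minimum mismatch count over all relabelings of the clusters; that matching step, the function name, and the assumption that both label sets use the integers 0..k-1 are illustrative choices, not details of the toolbox.

```python
from itertools import permutations


def clustering_error(process_ids, cluster_ids, k):
    """Count points assigned to the wrong generating process.

    process_ids: index (0..k-1) of the process that generated each point.
    cluster_ids: index (0..k-1) of the cluster each point was placed in.
    Tries every mapping of cluster labels to process labels, so it is
    only practical for a small number of clusters k.
    """
    best = len(process_ids)
    for mapping in permutations(range(k)):
        mismatches = sum(
            1
            for process, cluster in zip(process_ids, cluster_ids)
            if mapping[cluster] != process
        )
        best = min(best, mismatches)
    return best
```

Dividing the returned count by the number of points gives an error rate that can be compared across algorithms, variances, and numbers of replications.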


Subjects
Oligonucleotide Array Sequence Analysis/methods , Computational Biology , DNA Fingerprinting , Gene Expression Regulation , Genetic Markers , Humans