Results 1 - 20 of 58
1.
BMC Bioinformatics ; 17(1): 213, 2016 May 13.
Article in English | MEDLINE | ID: mdl-27177941

ABSTRACT

BACKGROUND: Next-generation sequencing (NGS) technologies have provided researchers with vast possibilities in various biological and biomedical research areas. Efficient data mining strategies are in high demand for large-scale comparative and evolutionary studies to be performed on the large amounts of data derived from NGS projects. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. METHODS: We report a novel procedure to analyse NGS data using topic modeling. It consists of four major procedures: NGS data retrieval, preprocessing, topic modeling, and data mining using Latent Dirichlet Allocation (LDA) topic outputs. The NGS data set of the Salmonella enterica strains was used as a case study to show the workflow of this procedure. The perplexity measurement of the topic numbers and the convergence efficiencies of Gibbs sampling were calculated and discussed for achieving the best result from the proposed procedure. RESULTS: The topics output by LDA could be treated as features of Salmonella strains to accurately describe the genetic diversity of the fliC gene in various serotypes. The results of a two-way hierarchical clustering and data matrix analysis on LDA-derived matrices successfully classified Salmonella serotypes based on the NGS data. CONCLUSION: The implementation of topic modeling in NGS data analysis provides a new way to elucidate genetic information from NGS data, and identify the gene-phenotype relationships and biomarkers, especially in the era of biological and medical big data.
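
The perplexity measurement mentioned in the METHODS can be illustrated with a minimal Python sketch. It assumes the usual definition (the exponential of the negative mean per-token log-likelihood on held-out data); the function name and the toy probabilities are hypothetical and are not taken from the paper.

```python
import math


def perplexity(token_probs):
    """Perplexity of held-out tokens given their model probabilities.

    token_probs: probability that a fitted topic model assigns to each
    held-out token (hypothetical input; every value must lie in (0, 1]).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    log_likelihood = sum(math.log(p) for p in token_probs)
    return math.exp(-log_likelihood / len(token_probs))


# A model that spreads probability evenly over four outcomes is as
# uncertain as a fair four-way choice.
uniform_case = perplexity([0.25] * 8)
```

Lower perplexity indicates a better fit to held-out data, which is why it is commonly tracked while varying the number of topics.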


Subjects
Algorithms; Data Mining/methods; High-Throughput Nucleotide Sequencing/methods; Biomarkers/analysis; Cluster Analysis; Models, Theoretical; Polymorphism, Single Nucleotide/genetics; Salmonella/classification; Salmonella/genetics; Serotyping
2.
BMC Public Health ; 16: 279, 2016 Mar 19.
Article in English | MEDLINE | ID: mdl-26993983

ABSTRACT

BACKGROUND: Both adolescent substance use and adolescent depression are major public health problems, and they tend to co-occur. Thousands of articles on adolescent substance use or depression have been published. Extracting information from such large accumulated collections is labor intensive and time consuming. Topic modeling offers a computational tool to find relevant topics by capturing meaningful structure among collections of documents. METHODS: In this study, a total of 17,723 abstracts from PubMed published from 2000 to 2014 on adolescent substance use and depression were downloaded as the objects of analysis, and Latent Dirichlet allocation (LDA) was applied to perform text mining on the dataset. Word clouds were used to visually display the content of topics and demonstrate the distribution of vocabularies over each topic. RESULTS: The LDA topics recaptured the PubMed search keywords and further revealed relevant issues: intervention programs; links between adolescent substance use and adolescent depression, such as sexual experience and violence; and risk factors for adolescent substance use, such as family factors and peer networks. Using trend analysis to explore the dynamics of the proportion of topics, we found that brain research emerged as a hot issue according to the coefficient of the trend test. CONCLUSIONS: Topic modeling has the ability to segregate a large collection of articles into distinct themes, and it could be used as a tool to understand the literature, not only by recapturing known facts but also by discovering other relevant topics.


Subjects
Data Mining/methods; Depression/epidemiology; Substance-Related Disorders/epidemiology; Adolescent; Adolescent Behavior; Humans
3.
BMC Bioinformatics ; 16 Suppl 13: S8, 2015.
Article in English | MEDLINE | ID: mdl-26424364

ABSTRACT

BACKGROUND: Topic modelling is an active research field in machine learning. While mainly used to build models from unstructured textual data, it offers an effective means of data mining where samples represent documents, and different biological endpoints or omics data represent words. Latent Dirichlet Allocation (LDA) is the most commonly used topic modelling method across a wide range of technical fields. However, model development can be arduous and tedious, and requires burdensome and systematic sensitivity studies in order to find the best set of model parameters. Often, time-consuming subjective evaluations are needed to compare models. Currently, research has yielded no easy way to choose the proper number of topics in a model beyond a major iterative approach. METHODS AND RESULTS: Based on analysis of variation of statistical perplexity during topic modelling, a heuristic approach is proposed in this study to estimate the most appropriate number of topics. Specifically, the rate of perplexity change (RPC) as a function of the number of topics is proposed as a suitable selector. We test the stability and effectiveness of the proposed method for three markedly different types of ground-truth datasets: Salmonella next generation sequencing, pharmacological side effects, and textual abstracts on computational biology and bioinformatics (TCBB) from PubMed. CONCLUSION: The proposed RPC-based method is demonstrated to choose the best number of topics in three numerical experiments of widely different data types, and for databases of very different sizes. The work required was markedly less arduous than if full systematic sensitivity studies had been carried out with number of topics as a parameter. We understand that additional investigation is needed to substantiate the method's theoretical basis, and to establish its generalizability in terms of dataset characteristics.
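
The idea of using the rate of perplexity change as a selector can be sketched in a few lines of Python. The abstract does not give the formula or the selection rule, so both are assumptions here: RPC is taken as the absolute perplexity difference between consecutive candidates divided by the difference in topic numbers, and the chosen candidate is the first one whose RPC is smaller than the next. The function names and perplexity values are hypothetical.

```python
def rate_of_perplexity_change(topic_counts, perplexities):
    """RPC for each pair of consecutive candidate topic numbers.

    Assumed form: |P_i - P_(i-1)| / (t_i - t_(i-1)).
    """
    if len(topic_counts) != len(perplexities) or len(topic_counts) < 2:
        raise ValueError("need two or more matching (topics, perplexity) pairs")
    rpc = []
    for i in range(1, len(topic_counts)):
        step = topic_counts[i] - topic_counts[i - 1]
        if step <= 0:
            raise ValueError("topic_counts must be strictly increasing")
        rpc.append(abs(perplexities[i] - perplexities[i - 1]) / step)
    return rpc


def pick_topic_number(topic_counts, perplexities):
    """Assumed rule: first candidate whose RPC is below the following RPC."""
    rpc = rate_of_perplexity_change(topic_counts, perplexities)
    for i in range(len(rpc) - 1):
        if rpc[i] < rpc[i + 1]:
            # rpc[i] describes the step that ends at topic_counts[i + 1].
            return topic_counts[i + 1]
    return None  # no change point inside the scanned range


# Illustrative numbers only, not results from the study.
topics = [10, 20, 30, 40, 50]
perp = [900.0, 700.0, 640.0, 600.0, 500.0]
chosen = pick_topic_number(topics, perp)
```

Scanning a coarse grid of topic numbers this way needs one model fit per candidate, which is far cheaper than a full sensitivity study.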


Subjects
Computational Biology/methods; Data Mining/methods; Heuristics/physiology; Databases, Factual; High-Throughput Nucleotide Sequencing
4.
Chem Res Toxicol ; 28(9): 1784-95, 2015 Sep 21.
Article in English | MEDLINE | ID: mdl-26308263

ABSTRACT

Bisphenol A (BPA) replacement compounds are released to the environment and cause widespread human exposure. However, a lack of thorough safety evaluations on the BPA replacement compounds has raised public concerns. We assessed the endocrine disruption potential of BPA replacement compounds in the market to assist their safety evaluations. A literature search was conducted to ascertain the BPA replacement compounds in use. Available experimental estrogenic activity data of these compounds were extracted from the Estrogenic Activity Database (EADB) to assess their estrogenic potential. An in silico model was developed to predict the estrogenic activity of compounds lacking experimental data. Molecular dynamics (MD) simulations were performed to understand the mechanisms by which the estrogenic compounds bind to and activate the estrogen receptor (ER). Forty-five BPA replacement compounds were identified in the literature. Seven were more estrogenic and five less estrogenic than BPA, while six were nonestrogenic in EADB. A two-tier in silico model was developed based on molecular docking to predict the estrogenic activity of the 27 compounds lacking data. Eleven were predicted as ER binders and 16 as nonbinders. MD simulations revealed hydrophobic contacts and hydrogen bonds as the main interactions between ER and the estrogenic compounds.


Subjects
Benzhydryl Compounds/toxicity; Endocrine Disruptors/toxicity; Estrogens/pharmacology; Phenols/toxicity; Computer Simulation; Databases, Chemical; Molecular Dynamics Simulation
5.
BMC Bioinformatics ; 15 Suppl 11: S4, 2014.
Article in English | MEDLINE | ID: mdl-25349983

ABSTRACT

BACKGROUND: Endocrine disrupting chemicals (EDCs) are exogenous compounds that interfere with the endocrine system of vertebrates, often through direct or indirect interactions with nuclear receptor proteins. Estrogen receptors (ERs) are particularly important protein targets and many EDCs are ER binders, capable of altering normal homeostatic transcription and signaling pathways. An estrogenic xenobiotic can bind ER as either an agonist or antagonist to increase or inhibit transcription, respectively. The receptor conformations in the complexes of ER bound with agonists and antagonists are different and dependent on interactions with co-regulator proteins that vary across tissue type. Assessment of chemical endocrine disruption potential depends not only on binding affinity to ERs, but also on changes that may alter the receptor conformation and its ability to subsequently bind DNA response elements and initiate transcription. Using both agonist and antagonist conformations of ERα, we developed an in silico approach that can be used to differentiate agonist versus antagonist status of potential binders. METHODS: The approach combined separate molecular docking models for ER agonist and antagonist conformations. The ability of this approach to differentiate agonists and antagonists was first evaluated using true agonists and antagonists extracted from the crystal structures available in the Protein Data Bank (PDB), and then further validated using a larger set of ligands from the literature. The usefulness of the approach was demonstrated with enrichment analysis in data sets with a large number of decoy ligands. RESULTS: The performance of the individual agonist and antagonist docking models was found to be comparable to that of similar models in the literature. When combined in a competitive docking approach, they provided the ability to discriminate agonists from antagonists with good accuracy, as well as the ability to efficiently select true agonists and antagonists from decoys during enrichment analysis. CONCLUSION: This approach enables evaluation of potential ER biological function changes caused by chemicals bound to the receptor which, in turn, allows the assessment of a chemical's endocrine disrupting potential. The approach can be used not only by regulatory authorities to perform risk assessments on potential EDCs but also by the industry in drug discovery projects to screen for potential agonists and antagonists.
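
One way to picture a competitive docking decision is the Python sketch below. It is an illustration under stated assumptions, not the authors' scoring scheme: each ligand is assumed to have one docking score per receptor conformation, more negative scores are assumed to mean better binding, and `binder_cutoff` is a hypothetical threshold.

```python
def classify_ligand(agonist_score, antagonist_score, binder_cutoff=-6.0):
    """Label a ligand from its docking scores in the two conformations.

    Assumptions: more negative means better binding, and a ligand whose
    best score is worse than binder_cutoff (hypothetical value) is
    treated as a non-binder.
    """
    best = min(agonist_score, antagonist_score)
    if best > binder_cutoff:
        return "non-binder"
    if agonist_score == antagonist_score:
        return "undetermined"
    # The conformation that accommodates the ligand better "wins".
    return "agonist" if agonist_score < antagonist_score else "antagonist"


# Made-up scores for three hypothetical ligands.
labels = [
    classify_ligand(-9.1, -7.4),
    classify_ligand(-6.5, -10.2),
    classify_ligand(-3.0, -4.0),
]
```

Applying the same cutoff to decoys gives a simple basis for the enrichment analysis described above, since decoys should mostly fall into the non-binder class.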


Subjects
Endocrine Disruptors/chemistry; Estrogen Receptor Antagonists/chemistry; Estrogen Receptor alpha/agonists; Estrogen Receptor alpha/antagonists & inhibitors; Estrogens/chemistry; Molecular Docking Simulation/methods; Computer Simulation; Endocrine Disruptors/metabolism; Estrogen Receptor Antagonists/metabolism; Estrogen Receptor alpha/chemistry; Estrogen Receptor alpha/metabolism; Estrogens/metabolism; Ligands
6.
BMC Bioinformatics ; 15 Suppl 11: S6, 2014.
Article in English | MEDLINE | ID: mdl-25350283

ABSTRACT

BACKGROUND: Due to a significant decline in the costs associated with next-generation sequencing, it has become possible to decipher the genetic architecture of a population by sequencing a large number of individuals to a deep coverage. The Korean Personal Genomes Project (KPGP) recently sequenced 35 Korean genomes at high coverage using the Illumina HiSeq platform and made the deep sequencing data publicly available, providing the scientific community opportunities to decipher the genetic architecture of the Korean population. METHODS: In this study, we used two single nucleotide variant (SNV) calling pipelines: mapping the raw reads obtained from whole genome sequencing of 35 Korean individuals in KPGP using BWA and SOAP2, followed by SNV calling using SAMtools and SOAPsnp, respectively. The consensus SNVs obtained from the two SNV pipelines were used to represent the SNVs of the Korean population. We compared these SNVs to those from 17 other populations provided by the HapMap consortium and the 1000 Genomes Project (1KGP) and identified SNVs that were only present in the Korean population. We studied the mutation spectrum and analyzed the genes of non-synonymous SNVs only detected in the Korean population. RESULTS: We detected a total of 8,555,726 SNVs in the 35 Korean individuals and identified 1,213,613 SNVs detected in at least one Korean individual (SNV-1) and 12,640 in all 35 Korean individuals (SNV-35) but not in 17 other populations. In contrast with the SNVs common to other populations in HapMap and 1KGP, the Korean-only SNVs had high percentages of non-silent variants, emphasizing the unique roles of these Korean-only SNVs in the Korean population. Specifically, we identified 8,361 non-synonymous Korean-only SNVs, of which 58 SNVs existed in all 35 Korean individuals. The 5,754 genes of non-synonymous Korean-only SNVs were highly enriched in some metabolic pathways. We found that adhesion is the top disease term associated with SNV-1 and Nelson syndrome is the only disease term associated with SNV-35. We found that a significant number of Korean-only SNVs are in genes that are associated with the drug term of adenosine. CONCLUSION: We identified the SNVs that were found in the Korean population but not seen in other populations, and explored the corresponding genes and pathways as well as the associated disease terms and drug terms. The results expand our knowledge of the genetic architecture of the Korean population, which will benefit the implementation of personalized medicine for the Korean population.
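
The consensus and population-specific filtering steps map naturally onto set operations. The Python sketch below is a minimal illustration that assumes the consensus is taken per individual and that each SNV is represented by a hashable key; the function name, key format, and toy data are hypothetical.

```python
def population_specific_snvs(pipeline_a, pipeline_b, other_populations):
    """Return (snv_1, snv_all) for SNVs absent from other populations.

    pipeline_a, pipeline_b: dict of individual -> set of SNV keys called
    by each pipeline (same individuals in both). other_populations: set
    of SNV keys reported in any reference population.
    """
    if not pipeline_a:
        raise ValueError("need at least one individual")
    # Consensus per individual: keep only calls both pipelines agree on.
    consensus = {ind: pipeline_a[ind] & pipeline_b[ind] for ind in pipeline_a}
    in_any = set().union(*consensus.values())
    in_all = set.intersection(*consensus.values())
    return in_any - other_populations, in_all - other_populations


# Toy data: two individuals and SNV keys written as "chrom:pos:ref>alt".
calls_a = {
    "K1": {"chr1:100:A>G", "chr1:200:C>T", "chr2:50:G>A"},
    "K2": {"chr1:100:A>G", "chr2:50:G>A", "chr3:7:T>C"},
}
calls_b = {
    "K1": {"chr1:100:A>G", "chr1:200:C>T"},
    "K2": {"chr1:100:A>G", "chr2:50:G>A", "chr3:7:T>C"},
}
seen_elsewhere = {"chr1:200:C>T"}
snv_1, snv_all = population_specific_snvs(calls_a, calls_b, seen_elsewhere)
```

With 35 individuals, `snv_1` and `snv_all` would correspond to the SNV-1 and SNV-35 categories described in the RESULTS.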


Subjects
Asian People/genetics; Polymorphism, Single Nucleotide; Disease/genetics; Gene Ontology; Genetic Association Studies; Genome, Human; High-Throughput Nucleotide Sequencing; Humans; Korea; Mutation; Sequence Alignment; Sequence Analysis, DNA; Software
7.
Chem Res Toxicol ; 27(9): 1528-36, 2014 Sep 15.
Article in English | MEDLINE | ID: mdl-25083553

ABSTRACT

Toxicogenomics (TGx) endeavors to elucidate the underlying molecular mechanisms through exploring gene expression profiles in response to toxic substances. RNA-Seq is increasingly regarded as a more powerful alternative to microarrays in TGx studies. However, realizing RNA-Seq's full potential requires novel approaches to extracting information from the complex TGx data. Considering read counts as the number of times a word occurs in a document, gene expression profiles from RNA-Seq are analogous to a word-by-document matrix used in text mining. Topic modeling, which aims to discover the latent structures in text corpora, would be helpful for exploring RNA-Seq based TGx data. In this study, topic modeling was applied to a typical RNA-Seq based TGx data set to discover hidden functional modules. The RNA-Seq based gene expression profiles were transformed into "documents", on which latent Dirichlet allocation (LDA) was used to build a topic model. We found that samples treated with compounds sharing the same modes of action (MoAs) could be clustered based on topic similarities. The topic most relevant to each cluster was identified as a "marker" topic, which was interpreted by gene enrichment analysis with MoAs and then confirmed by compound and pathway associations mined from the literature. To further validate the "marker" topics, we tested topic transferability from RNA-Seq to microarrays. The RNA-Seq based gene expression profile of a topic specifically associated with the peroxisome proliferator-activated receptor (PPAR) signaling pathway was used to query samples with similar expression profiles in two different microarray data sets, yielding an accuracy of about 85%. This proof-of-concept study demonstrates the applicability of topic modeling to discover functional modules in RNA-Seq data and suggests a valuable computational tool for leveraging information within TGx data in the RNA-Seq era.
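
The analogy between read counts and word counts can be made concrete with a short Python sketch. It shows one plausible way to turn per-sample read counts into "documents" and a document-term matrix; the function names, the optional `scale` divisor, and the gene symbols are illustrative assumptions rather than details from the study.

```python
def counts_to_document(gene_counts, scale=1):
    """Expand {gene: read_count} into a list of gene "words".

    scale is a hypothetical divisor that keeps documents short when
    raw read counts are very large.
    """
    document = []
    for gene, count in sorted(gene_counts.items()):
        document.extend([gene] * (int(count) // scale))
    return document


def document_term_matrix(samples):
    """samples: {sample_id: {gene: read_count}} -> (genes, rows).

    Each row lists one sample's counts in the shared gene order, which
    is the input layout expected by typical LDA implementations.
    """
    genes = sorted({gene for counts in samples.values() for gene in counts})
    rows = {
        sample: [counts.get(gene, 0) for gene in genes]
        for sample, counts in samples.items()
    }
    return genes, rows


# Toy counts for two hypothetical samples.
toy = {"s1": {"Acox1": 2}, "s2": {"Cyp4a1": 5, "Acox1": 1}}
genes, rows = document_term_matrix(toy)
```

The resulting matrix can be passed to any LDA implementation, with samples playing the role of documents and genes the role of vocabulary terms.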


Subjects
RNA/chemistry; Toxicogenetics; Cluster Analysis; Oligonucleotide Array Sequence Analysis; Peroxisome Proliferator-Activated Receptors/genetics; Peroxisome Proliferator-Activated Receptors/metabolism; Receptors, Estrogen/genetics; Receptors, Estrogen/metabolism; Sequence Analysis, RNA; Signal Transduction; Transcriptome
8.
BMC Bioinformatics ; 14 Suppl 14: S6, 2013.
Article in English | MEDLINE | ID: mdl-24266910

ABSTRACT

BACKGROUND: An important mechanism of endocrine activity is chemicals entering target cells via transport proteins and then interacting with hormone receptors such as the estrogen receptor (ER). α-Fetoprotein (AFP) is a major transport protein in rodent serum that can bind and sequester estrogens, thus preventing their entry into the target cell, where they could otherwise induce ER-mediated endocrine activity. Recently, we reported rat AFP binding affinities for a large set of structurally diverse chemicals, including 53 binders and 72 non-binders. However, the lack of three-dimensional (3D) structures of rat AFP hinders further understanding of the structural dependence for binding. Therefore, a 3D structure of rat AFP was built using homology modeling in order to elucidate rat AFP-ligand binding modes through docking analyses and molecular dynamics (MD) simulations. METHODS: Homology modeling was first applied to build a 3D structure of rat AFP. Molecular docking and Molecular Mechanics-Generalized Born Surface Area (MM-GBSA) scoring were then used to examine potential rat AFP ligand binding modes. MD simulations and free energy calculations were performed to refine models of binding modes. RESULTS: A rat AFP tertiary structure was first obtained using homology modeling and MD simulations. The rat AFP-ligand binding modes of 13 structurally diverse, representative binders were calculated using molecular docking, MM-GBSA ranking, and MD simulations. The key residues for rat AFP-ligand binding were postulated through analyzing the binding modes. CONCLUSION: The optimized 3D rat AFP structure and associated ligand binding modes shed light on rat AFP-ligand binding interactions that, in turn, provide a means to estimate binding affinity of unknown chemicals. Our results will assist in the evaluation of the endocrine disruption potential of chemicals.


Subjects
alpha-Fetoproteins/chemistry; Amino Acid Sequence; Animals; Binding Sites; Ligands; Models, Molecular; Molecular Dynamics Simulation; Molecular Sequence Data; Protein Structure, Tertiary; Rabbits; Rats; Sequence Alignment; alpha-Fetoproteins/metabolism
9.
Chem Res Toxicol ; 25(11): 2553-66, 2012 Nov 19.
Article in English | MEDLINE | ID: mdl-23013281

ABSTRACT

Endocrine disrupting chemicals interfere with the endocrine system in animals, including humans, to exert adverse effects. One of the mechanisms of endocrine disruption is through the binding of receptors such as the estrogen receptor (ER) in target cells. The concentration of any chemical in serum is important for its entry into the target cells to bind the receptors. α-Fetoprotein (AFP) is a major transport protein in rodent serum that can bind with estrogens and thus change a chemical's availability for entrance into the target cell. Sequestration of an estrogen in the serum can alter the chemical's potential for disrupting estrogen receptor-mediated responses. To better understand endocrine disruption, we developed a competitive binding assay using rat amniotic fluid, which contains very high levels of AFP, and measured the binding to the rat AFP for 125 structurally diverse chemicals, most of which are known to bind ER. Fifty-three chemicals were able to bind the rat AFP in the assay, while 72 chemicals were determined to be nonbinders. Observations from closely examining the relationship between the binding data and structures of the tested chemicals are rationally explained in a manner consistent with proposed binding regions of rat AFP in the literature. The data reported here represent the largest data set of structurally diverse chemicals tested for rat AFP binding. The data assist in elucidating binding interactions and mechanisms between chemicals and rat AFP and, in turn, assist in the evaluation of the endocrine disrupting potential of chemicals.


Subjects
Organic Chemicals/pharmacology; alpha-Fetoproteins/metabolism; Animals; Binding, Competitive/drug effects; Dose-Response Relationship, Drug; Female; Molecular Structure; Organic Chemicals/chemistry; Rats; Rats, Sprague-Dawley; Structure-Activity Relationship; alpha-Fetoproteins/chemistry
10.
BMC Bioinformatics ; 12 Suppl 10: S3, 2011 Oct 18.
Article in English | MEDLINE | ID: mdl-22166133

ABSTRACT

BACKGROUND: Genomic biomarkers play an increasing role in both preclinical and clinical applications. Development of genomic biomarkers with microarrays is an area of intensive investigation. However, despite sustained and continuing effort, developing microarray-based predictive models (i.e., genomic biomarkers) capable of reliable prediction for an observed or measured outcome (i.e., endpoint) of unknown samples in preclinical and clinical practice remains a considerable challenge. No straightforward guidelines exist for selecting a single model that will perform best when presented with unknown samples. In the second phase of the MicroArray Quality Control (MAQC-II) project, 36 analysis teams produced a large number of models for 13 preclinical and clinical endpoints. Before external validation was performed, each team nominated one model per endpoint (referred to here as 'nominated models') from which MAQC-II experts selected 13 'candidate models' to represent the best model for each endpoint. Both the nominated and candidate models from MAQC-II provide benchmarks to assess other methodologies for developing microarray-based predictive models. METHODS: We developed a simple ensemble method by taking a number of the top performing models from cross-validation and developing an ensemble model for each of the MAQC-II endpoints. We compared the ensemble models with both nominated and candidate models from MAQC-II using blinded external validation. RESULTS: For 10 of the 13 MAQC-II endpoints originally analyzed by the MAQC-II data analysis team from the National Center for Toxicological Research (NCTR), the ensemble models achieved equal or better predictive performance than the NCTR nominated models. Additionally, the ensemble models had performance comparable to the MAQC-II candidate models. Most ensemble models also had better performance than the nominated models generated by five other MAQC-II data analysis teams that analyzed all 13 endpoints. CONCLUSIONS: Our findings suggest that an ensemble method can often attain a higher average predictive performance in an external validation set than a corresponding "optimized" model method. Using an ensemble method to determine a final model is a potentially important supplement to the good modeling practices recommended by the MAQC-II project for developing microarray-based genomic biomarkers.
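
A simple ensemble built from the top cross-validated models might look like the Python sketch below. The abstract does not say how the models were combined, so majority voting is an assumption here, as are the function name, the `top_n` parameter, and the toy classifiers.

```python
from collections import Counter


def ensemble_predict(models, sample, top_n=3):
    """Majority vote over the top_n models ranked by cross-validation score.

    models: list of (cv_score, predict_fn) pairs, where predict_fn maps
    a sample to a class label. Use an odd top_n with two classes so the
    vote cannot tie.
    """
    ranked = sorted(models, key=lambda m: m[0], reverse=True)[:top_n]
    votes = Counter(predict(sample) for _, predict in ranked)
    return votes.most_common(1)[0][0]


# Four hypothetical classifiers with made-up cross-validation scores.
toy_models = [
    (0.81, lambda x: int(x["g1"] > 1.0)),
    (0.78, lambda x: int(x["g2"] > 0.5)),
    (0.75, lambda x: int(x["g1"] + x["g2"] > 2.0)),
    (0.60, lambda x: 0),
]
vote = ensemble_predict(toy_models, {"g1": 1.4, "g2": 0.9})
```

The second assertion case in practice is the interesting one: the single best model and the ensemble can disagree, which is where the ensemble offers protection against an over-tuned model.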


Subjects
Models, Genetic; Neoplasms/genetics; Oligonucleotide Array Sequence Analysis/methods; Toxicogenetics/methods; Gene Expression Profiling/methods; Humans; Meta-Analysis as Topic; Quality Control
12.
Chem Res Toxicol ; 24(9): 1486-93, 2011 Sep 19.
Article in English | MEDLINE | ID: mdl-21834575

ABSTRACT

RNA-Seq has been increasingly used for the quantification and characterization of transcriptomes. The ongoing development of the technology promises more accurate measurement of gene expression. However, its benefits over widely accepted microarray technologies have not been adequately assessed, especially in toxicogenomics studies. The goal of this study is to enhance the scientific community's understanding of the advantages and challenges of RNA-Seq in the quantification of gene expression by comparing analysis results from RNA-Seq and microarray data from a toxicogenomics study. A typical toxicogenomics study design was used to compare the performance of an RNA-Seq approach (Illumina Genome Analyzer II) to a microarray-based approach (Affymetrix Rat Genome 230 2.0 arrays) for detecting differentially expressed genes (DEGs) in the kidneys of rats treated with aristolochic acid (AA), a carcinogenic and nephrotoxic chemical most notably used for weight loss. We studied the comparability of the RNA-Seq and microarray data in terms of absolute gene expression, gene expression patterns, differentially expressed genes, and biological interpretation. We found that RNA-Seq was more sensitive in detecting genes with low expression levels, while similar gene expression patterns were observed for both platforms. Moreover, although the overlap of the DEGs was only 40-50%, the biological interpretation was largely consistent between the RNA-Seq and microarray data. RNA-Seq maintained a consistent biological interpretation with time-tested microarray platforms while generating more sensitive results. However, there is clearly a need for future investigations to better understand the advantages and limitations of RNA-Seq in toxicogenomics studies and environmental health research.


Subjects
Aristolochic Acids/toxicity; Carcinogens/toxicity; Gene Expression Profiling/methods; Kidney/drug effects; Oligonucleotide Array Sequence Analysis/methods; Sequence Analysis, RNA/methods; Animals; Carcinogenicity Tests/methods; Gene Expression Regulation/drug effects; Kidney/metabolism; Rats; Toxicogenetics/methods
13.
BMC Bioinformatics ; 11 Suppl 6: S5, 2010 Oct 07.
Article in English | MEDLINE | ID: mdl-20946616

ABSTRACT

BACKGROUND: Endocrine disruptors (EDs) and their broad range of potential adverse effects in humans and other animals have been a concern for nearly two decades. Many putative EDs are widely used in commercial products regulated by the Food and Drug Administration (FDA) such as food packaging materials, ingredients of cosmetics, medical and dental devices, and drugs. The Endocrine Disruptor Knowledge Base (EDKB) project was initiated in the mid-1990s by the FDA as a resource for the study of EDs. The EDKB database, a component of the project, contains data across multiple assay types for chemicals across a broad structural diversity. This paper demonstrates the utility of the EDKB database, an integral part of the EDKB project, for understanding and prioritizing EDs for testing. RESULTS: The EDKB database currently contains 3,257 records of over 1,800 EDs from different assays including estrogen receptor binding, androgen receptor binding, uterotropic activity, cell proliferation, and reporter gene assays. Information for each compound such as chemical structure, assay type, potency, etc. is organized to enable efficient searching. A user-friendly interface provides rapid navigation, Boolean searches on EDs, and both spreadsheet and graphical displays for viewing results. The search engine implemented in the EDKB database enables searching by one or more of the following fields: chemical structure (including exact search and similarity search), name, molecular formula, CAS registry number, experiment source, molecular weight, etc. The data can be cross-linked to other publicly available and related databases including TOXNET, Cactus, ChemIDplus, ChemACX, Chem Finder, and NCI DTP. CONCLUSION: The EDKB database enables scientists and regulatory reviewers to quickly access ED data from multiple assays for specific or similar compounds. The data have been used to categorize chemicals according to potential risks for endocrine activity, thus providing a basis for prioritizing chemicals for more definitive but expensive testing. The EDKB database is publicly available and can be found online at http://edkb.fda.gov/webstart/edkb/index.html.


Subjects
Databases, Factual; Endocrine Disruptors/chemistry; Endocrine Disruptors/toxicity; Food Contamination/prevention & control; Food Packaging; Government Regulation; Humans; Knowledge Bases; Search Engine
14.
Anal Biochem ; 385(2): 203-7, 2009 Feb 15.
Article in English | MEDLINE | ID: mdl-19059192

ABSTRACT

Quality control of a microarray experiment has become an important issue for both research and regulation. External RNA controls (ERCs), which can be either added at the total RNA level (tERCs) or introduced right before hybridization (cERCs), are designed and recommended by commercial microarray platforms for assessment of performance of a microarray experiment. However, the utility of ERCs has not been fully realized mainly due to the lack of sufficient data resources. The US Food and Drug Administration (FDA)-led community-wide Microarray Quality Control (MAQC) study generated a large amount of microarray data with implementation of ERCs across several commercial microarray platforms. The utility of ERCs in quality control by assessing the ERCs' concentration-response behavior was investigated in the MAQC study. In this work, an ERC-based correlation analysis was conducted to assess the quality of a microarray experiment. We found that the pairwise correlations of tERCs are sample independent, indicating that the array data obtained from different biological samples can be treated as technical replicates in analysis of tERCs. Consequently, the commonly used quality control method of applying correlation analysis on technical replicates can be adopted for assessing array performance based on different biological samples using tERCs. The proposed approach is sensitive in identifying outlying assays and is not dependent on the choice of normalization method.
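
The correlation-based check can be sketched in Python as follows. This is a minimal illustration, not the procedure from the paper: it assumes each array contributes one signal value per tERC probe, summarizes each array by the median of its Pearson correlations with all other arrays, and uses a hypothetical threshold `min_median_r`.

```python
import math
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant signals")
    return cov / (spread_x * spread_y)


def flag_outlier_arrays(terc_signals, min_median_r=0.9):
    """Flag arrays whose median pairwise tERC correlation is low.

    terc_signals: {array_id: [signal per tERC probe]} (hypothetical
    layout). The median keeps one bad array from dragging down the
    summary of the good ones.
    """
    ids = list(terc_signals)
    flagged = []
    for a in ids:
        rs = [pearson(terc_signals[a], terc_signals[b]) for b in ids if b != a]
        if median(rs) < min_median_r:
            flagged.append(a)
    return flagged


# Toy signals for four arrays; array "D" is scrambled on purpose.
signals = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [1.1, 2.0, 2.9, 4.2, 5.1],
    "C": [1.0, 2.1, 3.0, 3.9, 5.2],
    "D": [5.0, 1.0, 4.0, 2.0, 3.0],
}
outliers = flag_outlier_arrays(signals)
```

Because Pearson correlation is unchanged by linear rescaling of either array, a check of this kind is insensitive to scale-type normalization choices.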


Subjects
Oligonucleotide Array Sequence Analysis/standards; Quality Control; RNA/analysis; Reference Standards
15.
Curr Opin Biotechnol ; 19(1): 10-8, 2008 Feb.
Article in English | MEDLINE | ID: mdl-18155896

ABSTRACT

Over a few short years, microarray gene expression profiling has permeated most areas of biomedical research. Microarrays are now poised to enter the more demanding realm of clinical applications. The prospect of using microarray data to derive biomarkers of disease or toxicity, predict prognosis, or select treatments raises the validity and reliability bar substantially higher. The potential future payoffs are huge in terms of faster approval of more efficacious and safer medical interventions, and a more personalized implementation of them. Arriving at the future sooner rather than later is the motivation for the FDA-led MicroArray Quality Control (MAQC) project. The widespread collaboration aims to assess achievable technical performance of microarrays and capabilities and limitations of methods for microarray data analysis.


Subjects
Oligonucleotide Array Sequence Analysis/standards; Biotechnology; Data Interpretation, Statistical; Gene Expression Profiling/methods; Gene Expression Profiling/standards; Gene Expression Profiling/statistics & numerical data; Models, Statistical; Oligonucleotide Array Sequence Analysis/methods; Oligonucleotide Array Sequence Analysis/statistics & numerical data; Quality Control; RNA/genetics; RNA/standards; Reference Standards; Reproducibility of Results; United States; United States Food and Drug Administration
16.
Methods Mol Biol ; 563: 379-98, 2009.
Article in English | MEDLINE | ID: mdl-19597796

ABSTRACT

A robust bioinformatics capability is widely acknowledged as central to realizing the promises of toxicogenomics. Successful application of toxicogenomic approaches, such as DNA microarrays, inextricably relies on appropriate data management, the ability to extract knowledge from massive amounts of data, and the availability of functional information for data interpretation. At the FDA's National Center for Toxicological Research (NCTR), we are developing public microarray data management and analysis software, called ArrayTrack, that is also used in the routine review of genomic data submitted to the FDA. ArrayTrack stores a full range of information related to DNA microarrays and clinical and non-clinical studies as well as the digested data derived from proteomics and metabonomics experiments. In addition, ArrayTrack provides a rich collection of functional information about genes, proteins, and pathways drawn from various public biological databases for facilitating data interpretation. Many data analysis and visualization tools are available with ArrayTrack for individual platform data analysis, multiple omics data integration, and integrated analysis of omics data with study data. Importantly, gene expression data, functional information, and analysis methods are fully integrated so that the data analysis and interpretation process is simplified and enhanced. Using ArrayTrack, users can select an analysis method from the ArrayTrack tool box, apply the method to selected microarray data, and link the analysis results directly to individual gene, pathway, and Gene Ontology analysis. ArrayTrack is publicly available online (http://www.fda.gov/nctr/science/centers/toxicoinformatics/ArrayTrack/index.htm) and the prospective user can also request a local installation version by contacting the authors.


Subjects
Computational Biology/methods , Oligonucleotide Array Sequence Analysis , Pharmacogenetics/methods , Software , Databases, Genetic , Toxicogenetics/methods , United States , United States Food and Drug Administration
17.
BMC Bioinformatics ; 9 Suppl 9: S9, 2008 Aug 12.
Article in English | MEDLINE | ID: mdl-18793473

ABSTRACT

BACKGROUND: Advances in DNA microarray technology portend that molecular signatures derived from microarrays will eventually be used in clinical environments and personalized medicine. Derivation of biomarkers is a large step beyond hypothesis generation and imposes considerably more stringency for accuracy in identifying informative gene subsets to differentiate phenotypes. The inherent nature of microarray data, with few samples and replicates compared to the large number of genes, requires identifying informative genes prior to classifier construction. However, improving the ability to identify differentiating genes remains a challenge in bioinformatics. RESULTS: A new hybrid gene selection approach was investigated and tested with nine publicly available microarray datasets. The new method identifies a Very Important Pool (VIP) of genes from the broad patterns of gene expression data. The method uses a bagging sampling principle, where the re-sampled arrays are used to identify the most informative genes. Frequency of selection is used in a repetitive process to identify the VIP genes. The putative informative genes are selected using two methods, t-statistic and discriminatory analysis. In the t-statistic method, the informative genes are identified based on p-values. In the discriminatory analysis, disjoint Principal Component Analyses (PCAs) are conducted for each class of samples, and genes with high discrimination power (DP) are identified. The VIP gene selection approach was compared with the p-value ranking approach. The genes identified by the VIP method but not by the p-value ranking approach are also related to the disease investigated. More importantly, these genes are part of the pathways derived from the common genes shared by both the VIP and p-value ranking methods. Moreover, the binary classifiers built from these genes are statistically equivalent to those built from the top 50 p-value ranked genes in distinguishing different types of samples.
CONCLUSION: The VIP gene selection approach could identify additional subsets of informative genes that would not always be selected by the p-value ranking method. These genes are likely to be additional true positives since they are a part of pathways identified by the p-value ranking method and expected to be related to the relevant biology. Therefore, these additional genes derived from the VIP method potentially provide valuable biological insights.
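
The bagging-and-frequency idea described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only the t-statistic branch (the disjoint PCA branch is omitted), it ranks genes by the absolute Welch t-statistic instead of by p-value, and the function names, the `top_k` and `min_freq` parameters, and their default values are all hypothetical.

```python
import math
import random
from collections import Counter
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic for two lists of expression values."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        # Degenerate resample with no spread in either class.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def vip_genes(expr, labels, n_rounds=100, top_k=2, min_freq=0.8, seed=0):
    """Genes ranked in the top_k by |t| in at least min_freq of the rounds.

    expr:   dict mapping gene name -> list of expression values (one per array)
    labels: list of 0/1 class labels, one per array (at least 2 per class)
    """
    rng = random.Random(seed)
    idx0 = [i for i, y in enumerate(labels) if y == 0]
    idx1 = [i for i, y in enumerate(labels) if y == 1]
    counts = Counter()
    for _ in range(n_rounds):
        # Bagging step: resample arrays with replacement within each class.
        s0 = [rng.choice(idx0) for _ in idx0]
        s1 = [rng.choice(idx1) for _ in idx1]
        scores = {
            gene: abs(welch_t([vals[i] for i in s0], [vals[i] for i in s1]))
            for gene, vals in expr.items()
        }
        ranked = sorted(scores, key=scores.get, reverse=True)
        counts.update(ranked[:top_k])
    # Frequency of selection decides membership in the pool.
    return sorted(g for g, c in counts.items() if c / n_rounds >= min_freq)
```

Calling `vip_genes(expr, labels)` returns the names of genes that were selected repeatedly across resampled array sets, so a gene that ranks highly only in a single favourable resample is excluded.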


Subjects
Algorithms , Gene Expression Profiling/methods , Gene Pool , Genes/genetics , Genetic Markers/genetics , Oligonucleotide Array Sequence Analysis/methods , Proteome/genetics
18.
BMC Bioinformatics ; 9 Suppl 9: S17, 2008 Aug 12.
Article in English | MEDLINE | ID: mdl-18793462

ABSTRACT

BACKGROUND: Genome-wide association studies (GWAS) aim to identify genetic variants (usually single nucleotide polymorphisms [SNPs]) across the entire human genome that are associated with phenotypic traits such as disease status and drug response. Highly accurate and reproducible genotype calling is paramount since errors introduced by calling algorithms can lead to inflation of false associations between genotype and phenotype. Most genotype calling algorithms currently used for GWAS are based on multiple arrays. Because hundreds of gigabytes (GB) of raw data are generated from a GWAS, the samples are typically partitioned into batches containing subsets of the entire dataset for genotype calling. High call rates and accuracies have been achieved. However, the effects of batch size (i.e., number of chips analyzed together) and of batch composition (i.e., the choice of chips in a batch) on call rate and accuracy, as well as the propagation of these effects into the significantly associated SNPs identified, have not been investigated. In this paper, we analyzed both the batch size and batch composition for effects on the genotype calling algorithm BRLMM using raw data of 270 HapMap samples analyzed with the Affymetrix Human Mapping 500K array set. RESULTS: Using data from 270 HapMap samples interrogated with the Affymetrix Human Mapping 500K array set, three different batch sizes and three different batch compositions were used for genotyping using the BRLMM algorithm. Comparative analysis of the calling results and the corresponding lists of significant SNPs identified through association analysis revealed that both batch size and composition affected genotype calling results and significantly associated SNPs. Batch size and batch composition effects were more severe on samples and SNPs with lower call rates than on those with higher call rates, and on heterozygous genotype calls than on homozygous genotype calls.
CONCLUSION: Batch size and composition affect the genotype calling results in GWAS using BRLMM. The larger the differences in batch sizes, the larger the effect. The more homogeneous the samples in the batches, the more consistent the genotype calls. The inconsistency propagates to the lists of significantly associated SNPs identified in downstream association analysis. Thus, uniform and large batch sizes should be used to make genotype calls for GWAS. In addition, samples of high homogeneity should be placed into the same batch.
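
To make the two quantities compared in this abstract concrete, the sketch below computes a call rate and the concordance between two sets of genotype calls for the same SNPs, for example calls produced under two different batch configurations. It does not implement BRLMM. The genotype encoding (`"AA"`, `"AB"`, `"BB"`, with `"NC"` for a no-call) and the function names are assumptions made for illustration.

```python
def call_rate(calls, no_call="NC"):
    """Fraction of SNPs that received a genotype call."""
    if not calls:
        raise ValueError("calls must not be empty")
    return sum(c != no_call for c in calls) / len(calls)


def concordance(calls_a, calls_b, no_call="NC"):
    """Fraction of identical calls among SNPs called in both runs.

    Returns None when no SNP was called in both runs.
    """
    if len(calls_a) != len(calls_b):
        raise ValueError("both runs must cover the same SNPs")
    both = [(a, b) for a, b in zip(calls_a, calls_b)
            if a != no_call and b != no_call]
    if not both:
        return None
    return sum(a == b for a, b in both) / len(both)
```

Restricting the inputs to heterozygous calls in one run, or to SNPs below a chosen call-rate threshold, would allow the same functions to be used for the stratified comparisons the abstract mentions.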


Subjects
Algorithms , Chromosome Mapping/methods , Genome, Human/genetics , Haplotypes , Oligonucleotide Array Sequence Analysis/methods , Polymorphism, Single Nucleotide/genetics , Software , Base Sequence , DNA Mutational Analysis/methods , Genotype , Humans , Molecular Sequence Data
19.
BMC Bioinformatics ; 9 Suppl 9: S10, 2008 Aug 12.
Article in English | MEDLINE | ID: mdl-18793455

ABSTRACT

BACKGROUND: Reproducibility is a fundamental requirement in scientific experiments. Some recent publications have claimed that microarrays are unreliable because lists of differentially expressed genes (DEGs) are not reproducible in similar experiments. Meanwhile, new statistical methods for identifying DEGs continue to appear in the scientific literature. The resultant variety of existing and emerging methods exacerbates confusion and continuing debate in the microarray community on the appropriate choice of methods for identifying reliable DEG lists. RESULTS: Using the data sets generated by the MicroArray Quality Control (MAQC) project, we investigated the impact of a few widely used gene selection procedures on the reproducibility of DEG lists. We present comprehensive results from inter-site comparisons using the same microarray platform, cross-platform comparisons using multiple microarray platforms, and comparisons between microarray results and those from TaqMan, widely regarded as the "standard" gene expression platform. Our results demonstrate that (1) previously reported discordance between DEG lists could simply result from ranking and selecting DEGs solely by statistical significance (P) derived from widely used simple t-tests; (2) when fold change (FC) is used as the ranking criterion with a non-stringent P-value cutoff filtering, the DEG lists become much more reproducible, especially when fewer genes are selected as differentially expressed, as is the case in most microarray studies; and (3) the instability of short DEG lists solely based on P-value ranking is an expected mathematical consequence of the high variability of the t-values; the more stringent the P-value threshold, the less reproducible the DEG list is. These observations are also consistent with results from extensive simulation calculations.
CONCLUSION: We recommend the use of FC-ranking plus a non-stringent P cutoff as a straightforward baseline practice in order to generate more reproducible DEG lists. Specifically, the P-value cutoff should not be stringent (too small) and FC should be as large as possible. Our results provide practical guidance for choosing appropriate FC and P-value cutoffs when selecting a given number of DEGs. The FC criterion enhances reproducibility, whereas the P criterion balances sensitivity and specificity.
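
The recommended practice, filtering with a non-stringent P cutoff and then ranking by fold change, can be sketched as follows. This is an illustration under stated assumptions, not the MAQC analysis code: the input layout (log2 fold change and a precomputed p-value per gene), the default cutoff of 0.05, and the overlap measure (shared genes relative to the longer list) are all choices made here.

```python
def select_degs(genes, n_top, p_cutoff=0.05):
    """Select n_top genes by |log2 fold change| among those passing the P filter.

    genes: dict mapping gene name -> (log2_fold_change, p_value)
    """
    passing = [g for g, (_, p) in genes.items() if p < p_cutoff]
    passing.sort(key=lambda g: abs(genes[g][0]), reverse=True)
    return passing[:n_top]


def percent_overlap(list_a, list_b):
    """Percentage of genes shared by two DEG lists, relative to the longer list."""
    if not list_a or not list_b:
        return 0.0
    shared = len(set(list_a) & set(list_b))
    return 100.0 * shared / max(len(list_a), len(list_b))
```

Applying `select_degs` to results from two test sites and passing the two lists to `percent_overlap` gives one simple reproducibility figure. A gene with a very small p-value but a negligible fold change passes the filter yet ranks last, which is the behaviour the abstract argues for.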


Subjects
Algorithms , Data Interpretation, Statistical , Gene Expression Profiling/methods , Genes/genetics , Oligonucleotide Array Sequence Analysis/methods , Computer Simulation , Models, Genetic , Models, Statistical , Reproducibility of Results , Sensitivity and Specificity
20.
OMICS ; 11(1): 14-24, 2007.
Article in English | MEDLINE | ID: mdl-17411393

ABSTRACT

Dye-specific bias effects, commonly observed in the two-color microarray platform, are normally corrected using the dye swap design. This design, however, is relatively expensive and labor-intensive. We propose a self-self hybridization design as an alternative to the dye swap design. In this design, the treated and control samples are labeled with Cy5 and Cy3 (or Cy3 and Cy5), respectively, without dye swap, along with a set of self-self hybridizations on the control sample. We compare this design with the dye swap design through investigation of mouse primary hepatocytes treated with three peroxisome proliferator-activated receptor-alpha (PPARalpha) agonists at three dose levels. Using Agilent's Whole Mouse Genome microarray, differentially expressed genes (DEGs) were determined for both the self-self hybridization and dye swap designs. The DEG concordance between the two designs was over 80% for each dose treatment and chemical. Furthermore, 90% of DEG-associated biological pathways were common to both designs, indicating that biological interpretations would be consistent. The reduced labor and expense of the self-self hybridization design make it an efficient substitute for the dye swap design. For example, in larger toxicogenomic studies, the self-self hybridization design requires only about half the chips needed for the dye swap design.
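
The "about half the chips" statement can be checked with simple arithmetic. The sketch below assumes that a dye swap design uses two arrays per treated sample (one per dye orientation) and that the self-self design uses one array per treated sample plus a fixed set of self-self control hybridizations. The replicate count and the number of self-self arrays in the usage note are hypothetical and are not taken from the study.

```python
def chips_dye_swap(n_conditions, n_replicates):
    """Two arrays (forward and swapped labeling) per treated sample."""
    return 2 * n_conditions * n_replicates


def chips_self_self(n_conditions, n_replicates, n_self_self):
    """One array per treated sample plus the self-self control arrays."""
    return n_conditions * n_replicates + n_self_self
```

With nine conditions (three agonists at three dose levels), four replicates, and four self-self arrays, the counts are 72 and 40 chips respectively. The ratio approaches one half as the number of conditions grows while the self-self set stays fixed.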


Subjects
Gene Expression Regulation , Nucleic Acid Hybridization , Oligonucleotide Array Sequence Analysis/methods , Animals , Carbocyanines/pharmacology , Fluorescent Dyes/pharmacology , Genome , Genomics , Hepatocytes/metabolism , Mice , Models, Genetic