ABSTRACT
BACKGROUND: Modern genomic and proteomic studies reveal that many diseases are heterogeneous, comprising multiple different subtypes. The common notion that one biomarker can be predictive for all patients may need to be replaced by an understanding that each subtype has its own set of unique biomarkers, affecting how discovery studies are designed and analyzed. METHODS: We used Monte Carlo simulation to measure and compare the performance of eight selection methods with homogeneous and heterogeneous diseases using both single-stage and two-stage designs. We also applied the selection methods in an actual proteomic biomarker screening study of heterogeneous breast cancer cases. RESULTS: Different selection methods were optimal, and more than two-fold larger sample sizes were needed for heterogeneous diseases compared with homogeneous diseases. We also found that for larger studies, two-stage designs can achieve nearly the same statistical power as single-stage designs at significantly reduced cost. CONCLUSIONS: We found that disease heterogeneity profoundly affected biomarker performance. We report sample size requirements and provide guidance on the design and analysis of biomarker discovery studies for both homogeneous and heterogeneous diseases. IMPACT: We have shown that studies to identify biomarkers for the early detection of heterogeneous disease require different statistical selection methods and larger sample sizes than if the disease were homogeneous. These findings provide a methodologic platform for biomarker discovery of heterogeneous diseases.
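As a rough illustration of why heterogeneity raises sample size requirements, the following Python sketch runs a small Monte Carlo simulation. It is not the simulation used in the study. The Welch t statistic, the selection threshold of 3.0, the unit effect size, and the 25% subtype fraction are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random
import statistics


def t_statistic(cases, controls):
    """Welch two-sample t statistic (mean of cases minus mean of controls)."""
    mean_diff = statistics.fmean(cases) - statistics.fmean(controls)
    se = (statistics.variance(cases) / len(cases)
          + statistics.variance(controls) / len(controls)) ** 0.5
    return mean_diff / se


def simulate_power(n, subtype_fraction, shift=1.0, threshold=3.0,
                   n_trials=2000, seed=0):
    """Estimate how often a true biomarker exceeds the selection threshold.

    Assumed model (hypothetical): controls ~ N(0, 1). Each case belongs to
    the marker-positive subtype with probability `subtype_fraction`, in which
    case its level is N(shift, 1); otherwise it looks like a control.
    subtype_fraction=1.0 corresponds to a homogeneous disease.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        controls = [rng.gauss(0.0, 1.0) for _ in range(n)]
        cases = [
            rng.gauss(shift if rng.random() < subtype_fraction else 0.0, 1.0)
            for _ in range(n)
        ]
        if t_statistic(cases, controls) > threshold:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    for fraction in (1.0, 0.25):
        print(fraction, simulate_power(n=50, subtype_fraction=fraction))
```

Under these assumptions, a marker expressed in only a quarter of cases is selected far less often than one expressed in all cases at the same sample size, which is the qualitative effect the abstract describes.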
Subjects
Biomarkers, Tumor/metabolism , Breast Neoplasms/metabolism , Breast Neoplasms/pathology , Genomics/methods , Models, Biological , Proteomics/methods , Biomarkers, Tumor/genetics , Breast Neoplasms/genetics , Case-Control Studies , Female , Humans , Monte Carlo Method , Research Design
ABSTRACT
Peptide and protein identification via tandem mass spectrometry (MS/MS) lies at the heart of proteomic characterization of biological samples. Several algorithms are able to search, score, and assign peptides to large MS/MS datasets. Most popular methods, however, underutilize the intensity information available in the tandem mass spectrum due to the complex nature of the peptide fragmentation process, thus contributing to loss of potential identifications. We present a novel probabilistic scoring algorithm called Context-Sensitive Peptide Identification (CSPI) based on highly flexible Input-Output Hidden Markov Models (IO-HMM) that capture the influence of peptide physicochemical properties on their observed MS/MS spectra. We use several local and global properties of peptides and their fragment ions from literature. Comparison with two popular algorithms, Crux (re-implementation of SEQUEST) and X!Tandem, on multiple datasets of varying complexity, shows that peptide identification scores from our models are able to achieve greater discrimination between true and false peptides, identifying up to ~25% more peptides at a False Discovery Rate (FDR) of 1%. We evaluated two alternative normalization schemes for fragment ion-intensities, a global rank-based and a local window-based. Our results indicate the importance of appropriate normalization methods for learning superior models. Further, combining our scores with Crux using a state-of-the-art procedure, Percolator, we demonstrate the utility of using scoring features from intensity-based models, identifying ~4-8% additional identifications over Percolator at 1% FDR. IO-HMMs offer a scalable and flexible framework with several modeling choices to learn complex patterns embedded in MS/MS data.
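The abstract does not specify how the two normalization schemes are defined, so the Python sketch below shows one plausible reading of each, not the CSPI implementation. It assumes that "global rank-based" means replacing each intensity by its rank within the whole spectrum, and that "local window-based" means scaling each intensity by the largest peak in its fixed-width m/z window. The function names and the 100 m/z window width are hypothetical.

```python
def rank_normalize(peaks):
    """Global rank-based normalization (assumed definition).

    peaks: list of (mz, intensity) tuples.
    Returns (mz, rank / n) per peak, so the most intense peak maps to 1.0.
    Tied intensities receive distinct ranks in input order.
    """
    n = len(peaks)
    order = sorted(range(n), key=lambda i: peaks[i][1])
    normalized = [None] * n
    for rank, i in enumerate(order, start=1):
        normalized[i] = (peaks[i][0], rank / n)
    return normalized


def window_normalize(peaks, width=100.0):
    """Local window-based normalization (assumed definition).

    Each intensity is divided by the maximum intensity among the peaks
    that fall in the same m/z window of the given width.
    """
    window_max = {}
    for mz, intensity in peaks:
        key = int(mz // width)
        window_max[key] = max(window_max.get(key, 0.0), intensity)
    normalized = []
    for mz, intensity in peaks:
        top = window_max[int(mz // width)]
        normalized.append((mz, intensity / top if top > 0 else 0.0))
    return normalized


if __name__ == "__main__":
    # Hypothetical spectrum: (m/z, raw intensity)
    spectrum = [(150.1, 10.0), (180.2, 40.0), (320.5, 5.0), (390.0, 20.0)]
    print(rank_normalize(spectrum))
    print(window_normalize(spectrum))
```

The rank-based version discards absolute intensity differences across the whole spectrum, while the window-based version preserves relative intensities among neighboring fragment ions; which behaves better for model learning is an empirical question of the kind the abstract reports.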
Subjects
Markov Chains , Peptides/analysis , Tandem Mass Spectrometry , Algorithms , Databases, Protein , Reproducibility of Results , Sensitivity and Specificity , Software
ABSTRACT
OBJECTIVE: To study the decision whether to issue a boil-water advisory immediately in response to a spike in sales of diarrhea remedies or to wait 72 hours for the results of definitive testing of water and people. METHODS: Decision analysis. RESULTS: In the base-case analysis, the optimal decision is test-and-wait. If the cost of issuing a boil-water advisory is less than 13.92 cents per person per day, the optimal decision is to issue the boil-water advisory immediately. CONCLUSIONS: Decisions based on surveillance data that are suggestive but not conclusive about the existence of a disease outbreak can be modeled.
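The kind of threshold reported above can be illustrated with a minimal expected-cost comparison in Python. This is a sketch of a generic two-branch decision model, not the model used in the study: the outbreak probability, per-person illness cost, and advisory efficacy are hypothetical inputs, the function names are invented for illustration, and the example values are not intended to reproduce the 13.92-cent figure.

```python
def advisory_cost_threshold(p_outbreak, illness_cost, advisory_efficacy,
                            wait_days=3):
    """Per-person, per-day advisory cost below which issuing is preferred.

    Assumed model (hypothetical): waiting exposes each person to an expected
    illness cost of p_outbreak * illness_cost; an immediate advisory averts
    the fraction `advisory_efficacy` of that cost but must be paid for on
    each of the `wait_days` days (72 hours = 3 days).
    """
    return p_outbreak * illness_cost * advisory_efficacy / wait_days


def choose_action(advisory_cost_per_day, p_outbreak, illness_cost,
                  advisory_efficacy, wait_days=3):
    """Return the branch with the lower expected cost per person."""
    cost_issue = (advisory_cost_per_day * wait_days
                  + p_outbreak * illness_cost * (1 - advisory_efficacy))
    cost_wait = p_outbreak * illness_cost
    return "issue advisory" if cost_issue < cost_wait else "test and wait"


if __name__ == "__main__":
    # All inputs are hypothetical, in dollars per person.
    params = dict(p_outbreak=0.1, illness_cost=6.0, advisory_efficacy=0.5)
    print(advisory_cost_threshold(**params))
    print(choose_action(0.05, **params))
    print(choose_action(0.20, **params))
```

In this simplified model the decision flips exactly where the daily advisory cost equals the expected illness cost averted per day, which is the same structure as a cost threshold expressed in cents per person per day.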
Subjects
Decision Support Techniques , Disease Outbreaks/prevention & control , Water Purification , Water Supply , Antidiarrheals/therapeutic use , Cryptosporidiidae , Diarrhea/prevention & control , Diarrhea/therapy , Humans , Nonprescription Drugs/therapeutic use , Population Surveillance , Water/parasitology
ABSTRACT
In the search for genetic determinants of complex disease, two approaches to association analysis are most often employed, testing single loci or testing a small group of loci jointly via haplotypes for their relationship to disease status. It is still debatable which of these approaches is more favourable, and under what conditions. The former has the advantage of simplicity but suffers severely when alleles at the tested loci are not in linkage disequilibrium (LD) with liability alleles; the latter should capture more of the signal encoded in LD, but is far from simple. The complexity of haplotype analysis could be especially troublesome for association scans over large genomic regions, which, in fact, is becoming the standard design. For these reasons, the authors have been evaluating statistical methods that bridge the gap between single-locus and haplotype-based tests. In this article, they present one such method, which uses non-parametric regression techniques embodied by Bayesian adaptive regression splines (BARS). For a set of markers falling within a common genomic region and a corresponding set of single-locus association statistics, the BARS procedure integrates these results into a single test by examining the class of smooth curves consistent with the data. The non-parametric BARS procedure generally finds no signal when no liability allele exists in the tested region (ie it achieves the specified size of the test) and it is sensitive enough to pick up signals when a liability allele is present. The BARS procedure provides a robust and potentially powerful alternative to classical tests of association, diminishes the multiple testing problem inherent in those tests and can be applied to a wide range of data types, including genotype frequencies estimated from pooled samples.
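The idea of pooling neighboring single-locus statistics into a smooth curve can be sketched in Python with a much simpler smoother than BARS. The example below uses a Gaussian kernel (Nadaraya-Watson) regression as a stand-in. It illustrates only the smoothing step, not the Bayesian spline fitting or the calibration of the resulting test, and the marker positions, statistics, bandwidth, and function name are all hypothetical.

```python
import math


def kernel_smooth(positions, stats, bandwidth):
    """Gaussian-kernel weighted average of single-locus statistics.

    positions: marker positions (e.g. in kb) along the genomic region.
    stats: one association statistic per marker (e.g. chi-square values).
    Returns the smoothed statistic evaluated at each marker position.
    This is a stand-in for BARS, not an implementation of it.
    """
    smoothed = []
    for x0 in positions:
        weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
                   for x in positions]
        total = sum(weights)
        smoothed.append(sum(w * s for w, s in zip(weights, stats)) / total)
    return smoothed


if __name__ == "__main__":
    # Hypothetical region with elevated statistics around 30-50 kb.
    positions = [0, 10, 20, 30, 40, 50, 60]
    stats = [0.5, 1.2, 0.8, 6.0, 7.5, 5.5, 1.0]
    curve = kernel_smooth(positions, stats, bandwidth=10.0)
    peak = max(range(len(curve)), key=curve.__getitem__)
    print(positions[peak], curve[peak])
```

A single summary of the fitted curve, such as its peak height, can then be evaluated once per region instead of testing every marker separately, which is how a smoothing approach can reduce the multiple testing burden mentioned above.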