Pesquisa | Portal de Pesquisa da BVS Enfermagem

1.

A powerful approach to identify replicable variants in genome-wide association studies.

Li, Yan; Lei, Haochen; Wen, Xiaoquan; Cao, Hongyuan.

Am J Hum Genet ; 111(5): 966-978, 2024 05 02.

Artigo em Inglês | MEDLINE | ID: mdl-38701746

RESUMO

Replicability is the cornerstone of modern scientific research. Reliable identifications of genotype-phenotype associations that are significant in multiple genome-wide association studies (GWASs) provide stronger evidence for the findings. Current replicability analysis relies on the independence assumption among single-nucleotide polymorphisms (SNPs) and ignores the linkage disequilibrium (LD) structure. We show that such a strategy may produce either overly liberal or overly conservative results in practice. We develop an efficient method, ReAD, to detect replicable SNPs associated with the phenotype from two GWASs accounting for the LD structure. The local dependence structure of SNPs across two heterogeneous studies is captured by a four-state hidden Markov model (HMM) built on two sequences of p values. By incorporating information from adjacent locations via the HMM, our approach provides more accurate SNP significance rankings. ReAD is scalable, platform independent, and more powerful than existing replicability analysis methods with effective false discovery rate control. Through analysis of datasets from two asthma GWASs and two ulcerative colitis GWASs, we show that ReAD can identify replicable genetic loci that existing methods might otherwise miss.

Assuntos

Asma , Estudo de Associação Genômica Ampla , Desequilíbrio de Ligação , Polimorfismo de Nucleotídeo Único , Estudo de Associação Genômica Ampla/métodos , Humanos , Asma/genética , Cadeias de Markov , Colite Ulcerativa/genética , Reprodutibilidade dos Testes , Fenótipo , Genótipo

2.

Prog-Plot - a visual method to determine functional relationships for false discovery rate regression methods.

Bello, Nicolás; López-Kleine, Liliana.

J Cell Sci ; 136(1)2023 01 01.

Artigo em Inglês | MEDLINE | ID: mdl-36482762

RESUMO

Multiple test corrections are a fundamental step in the analysis of differentially expressed genes, as the number of tests performed would otherwise inflate the false discovery rate (FDR). Recent methods for P-value correction involve a regression model in order to include covariates that are informative of the power of the test. Here, we present Progressive proportions plot (Prog-Plot), a visual tool to identify the functional relationship between the covariate and the proportion of P-values consistent with the null hypothesis. The relationship between the proportion of P-values and the covariate to be included is needed, but there are no available tools to verify it. The approach presented here aims at having an objective way to specify regression models instead of relying on prior knowledge.

3.

A novel strategy for designing the magic shotguns for distantly related target pairs.

Luo, Yongchao; Wang, Panpan; Mou, Minjie; Zheng, Hanqi; Hong, Jiajun; Tao, Lin; Zhu, Feng.

Brief Bioinform ; 24(1)2023 01 19.

Artigo em Inglês | MEDLINE | ID: mdl-36631399

RESUMO

Due to its promising capacity in improving drug efficacy, polypharmacology has emerged to be a new theme in the drug discovery of complex disease. In the process of novel multi-target drugs (MTDs) discovery, in silico strategies come to be quite essential for the advantage of high throughput and low cost. However, current researchers mostly aim at typical closely related target pairs. Because of the intricate pathogenesis networks of complex diseases, many distantly related targets are found to play crucial role in synergistic treatment. Therefore, an innovational method to develop drugs which could simultaneously target distantly related target pairs is of utmost importance. At the same time, reducing the false discovery rate in the design of MTDs remains to be the daunting technological difficulty. In this research, effective small molecule clustering in the positive dataset, together with a putative negative dataset generation strategy, was adopted in the process of model constructions. Through comprehensive assessment on 10 target pairs with hierarchical similarity-levels, the proposed strategy turned out to reduce the false discovery rate successfully. Constructed model types with much smaller numbers of inhibitor molecules gained considerable yields and showed better false-hit controllability than before. To further evaluate the generalization ability, an in-depth assessment of high-throughput virtual screening on ChEMBL database was conducted. As a result, this novel strategy could hierarchically improve the enrichment factors for each target pair (especially for those distantly related/unrelated target pairs), corresponding to target pair similarity-levels.

Assuntos

Descoberta de Drogas , Polifarmacologia , Descoberta de Drogas/métodos , Ensaios de Triagem em Larga Escala

4.

On the use of tandem mass spectra acquired from samples of evolutionarily distant organisms to validate methods for false discovery rate estimation.

Madej, Dominik; Lam, Henry.

Proteomics ; 24(15): e2300398, 2024 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-38491400

RESUMO

Estimating the false discovery rate (FDR) of peptide identifications is a key step in proteomics data analysis, and many methods have been proposed for this purpose. Recently, an entrapment-inspired protocol to validate methods for FDR estimation appeared in articles showcasing new spectral library search tools. That validation approach involves generating incorrect spectral matches by searching spectra from evolutionarily distant organisms (entrapment queries) against the original target search space. Although this approach may appear similar to the solutions using entrapment databases, it represents a distinct conceptual framework whose correctness has not been verified yet. In this viewpoint, we first discussed the background of the entrapment-based validation protocols and then conducted a few simple computational experiments to verify the assumptions behind them. The results reveal that entrapment databases may, in some implementations, be a reasonable choice for validation, while the assumptions underpinning validation protocols based on entrapment queries are likely to be violated in practice. This article also highlights the need for well-designed frameworks for validating FDR estimation methods in proteomics.

Assuntos

Bases de Dados de Proteínas , Proteômica , Espectrometria de Massas em Tandem , Proteômica/métodos , Espectrometria de Massas em Tandem/métodos , Peptídeos/análise , Animais , Humanos , Reprodutibilidade dos Testes

5.

Precursor deconvolution error estimation: The missing puzzle piece in false discovery rate in top-down proteomics.

Jeong, Kyowon; Kaulich, Philipp T; Jung, Wonhyeuk; Kim, Jihyung; Tholey, Andreas; Kohlbacher, Oliver.

Proteomics ; 24(3-4): e2300068, 2024 Feb.

Artigo em Inglês | MEDLINE | ID: mdl-37997224

RESUMO

Top-down proteomics (TDP) directly analyzes intact proteins and thus provides more comprehensive qualitative and quantitative proteoform-level information than conventional bottom-up proteomics (BUP) that relies on digested peptides and protein inference. While significant advancements have been made in TDP in sample preparation, separation, instrumentation, and data analysis, reliable and reproducible data analysis still remains one of the major bottlenecks in TDP. A key step for robust data analysis is the establishment of an objective estimation of proteoform-level false discovery rate (FDR) in proteoform identification. The most widely used FDR estimation scheme is based on the target-decoy approach (TDA), which has primarily been established for BUP. We present evidence that the TDA-based FDR estimation may not work at the proteoform-level due to an overlooked factor, namely the erroneous deconvolution of precursor masses, which leads to incorrect FDR estimation. We argue that the conventional TDA-based FDR in proteoform identification is in fact protein-level FDR rather than proteoform-level FDR unless precursor deconvolution error rate is taken into account. To address this issue, we propose a formula to correct for proteoform-level FDR bias by combining TDA-based FDR and precursor deconvolution error rate.

Assuntos

Peptídeos , Proteômica , Proteínas de Ligação a DNA

6.

Target-decoy false discovery rate estimation using Crema.

Lin, Andy; See, Donavan; Fondrie, William E; Keich, Uri; Noble, William Stafford.

Proteomics ; 24(8): e2300084, 2024 Apr.

Artigo em Inglês | MEDLINE | ID: mdl-38380501

RESUMO

Assigning statistical confidence estimates to discoveries produced by a tandem mass spectrometry proteomics experiment is critical to enabling principled interpretation of the results and assessing the cost/benefit ratio of experimental follow-up. The most common technique for computing such estimates is to use target-decoy competition (TDC), in which observed spectra are searched against a database of real (target) peptides and a database of shuffled or reversed (decoy) peptides. TDC procedures for estimating the false discovery rate (FDR) at a given score threshold have been developed for application at the level of spectra, peptides, or proteins. Although these techniques are relatively straightforward to implement, it is common in the literature to skip over the implementation details or even to make mistakes in how the TDC procedures are applied in practice. Here we present Crema, an open-source Python tool that implements several TDC methods of spectrum-, peptide- and protein-level FDR estimation. Crema is compatible with a variety of existing database search tools and provides a straightforward way to obtain robust FDR estimates.

Assuntos

Algoritmos , Peptídeos , Bases de Dados de Proteínas , Peptídeos/química , Proteínas/análise , Proteômica/métodos

7.

Control of false discoveries in grouped hypothesis testing for eQTL data.

Rudra, Pratyaydipta; Zhou, Yi-Hui; Nobel, Andrew; Wright, Fred A.

BMC Bioinformatics ; 25(1): 147, 2024 Apr 11.

Artigo em Inglês | MEDLINE | ID: mdl-38605284

RESUMO

BACKGROUND: Expression quantitative trait locus (eQTL) analysis aims to detect the genetic variants that influence the expression of one or more genes. Gene-level eQTL testing forms a natural grouped-hypothesis testing strategy with clear biological importance. Methods to control family-wise error rate or false discovery rate for group testing have been proposed earlier, but may not be powerful or easily apply to eQTL data, for which certain structured alternatives may be defensible and may enable the researcher to avoid overly conservative approaches. RESULTS: In an empirical Bayesian setting, we propose a new method to control the false discovery rate (FDR) for grouped hypotheses. Here, each gene forms a group, with SNPs annotated to the gene corresponding to individual hypotheses. The heterogeneity of effect sizes in different groups is considered by the introduction of a random effects component. Our method, entitled Random Effects model and testing procedure for Group-level FDR control (REG-FDR), assumes a model for alternative hypotheses for the eQTL data and controls the FDR by adaptive thresholding. As a convenient alternate approach, we also propose Z-REG-FDR, an approximate version of REG-FDR, that uses only Z-statistics of association between genotype and expression for each gene-SNP pair. The performance of Z-REG-FDR is evaluated using both simulated and real data. Simulations demonstrate that Z-REG-FDR performs similarly to REG-FDR, but with much improved computational speed. CONCLUSION: Our results demonstrate that the Z-REG-FDR method performs favorably compared to other methods in terms of statistical power and control of FDR. It can be of great practical use for grouped hypothesis testing for eQTL analysis or similar problems in statistical genomics due to its fast computation and ability to be fit using only summary data.

Assuntos

Genômica , Locos de Características Quantitativas , Simulação por Computador , Teorema de Bayes , Genótipo

8.

Modifying the false discovery rate procedure based on the information theory under arbitrary correlation structure and its performance in high-dimensional genomic data.

Rastaghi, Sedighe; Saki, Azadeh; Tabesh, Hamed.

BMC Bioinformatics ; 25(1): 57, 2024 Feb 05.

Artigo em Inglês | MEDLINE | ID: mdl-38317067

RESUMO

BACKGROUND: Controlling the False Discovery Rate (FDR) in Multiple Comparison Procedures (MCPs) has widespread applications in many scientific fields. Previous studies show that the correlation structure between test statistics increases the variance and bias of FDR. The objective of this study is to modify the effect of correlation in MCPs based on the information theory. We proposed three modified procedures (M1, M2, and M3) under strong, moderate, and mild assumptions based on the conditional Fisher Information of the consecutive sorted test statistics for controlling the false discovery rate under arbitrary correlation structure. The performance of the proposed procedures was compared with the Benjamini-Hochberg (BH) and Benjamini-Yekutieli (BY) procedures in simulation study and real high-dimensional data of colorectal cancer gene expressions. In the simulation study, we generated 1000 differential multivariate Gaussian features with different levels of the correlation structure and screened the significance features by the FDR controlling procedures, with strong control on the Family Wise Error Rates. RESULTS: When there was no correlation between 1000 simulated features, the performance of the BH procedure was similar to the three proposed procedures. In low to medium correlation structures the BY procedure is too conservative. The BH procedure is too liberal, and the mean number of screened features was constant at the different levels of the correlation between features. The mean number of screened features by proposed procedures was between BY and BH procedures and reduced when the correlations increased. Where the features are highly correlated the number of screened features by proposed procedures reached the Bonferroni (BF) procedure, as expected. In real data analysis the BY, BH, M1, M2, and M3 procedures were done to screen gene expressions of colorectal cancer. To fit a predictive model based on the screened features the Efficient Bayesian Logistic Regression (EBLR) model was used. The fitted EBLR models based on the screened features by M1 and M2 procedures have minimum entropies and are more efficient than BY and BH procedures. CONCLUSION: The modified proposed procedures based on information theory, are much more flexible than BH and BY procedures for the amount of correlation between test statistics. The modified procedures avoided screening the non-informative features and so the number of screened features reduced with the increase in the level of correlation.

Assuntos

Neoplasias Colorretais , Teoria da Informação , Humanos , Teorema de Bayes , Genômica , Simulação por Computador

9.

Exact Integral Formulas for False Discovery Rate and the Variance of False Discovery Proportion.

Sadygov, Rovshan G; Zhu, Justin X; Deberneh, Henock M.

J Proteome Res ; 23(6): 2298-2305, 2024 Jun 07.

Artigo em Inglês | MEDLINE | ID: mdl-38809146

RESUMO

Multiple hypothesis testing is an integral component of data analysis for large-scale technologies such as proteomics, transcriptomics, or metabolomics, for which the false discovery rate (FDR) and positive FDR (pFDR) have been accepted as error estimation and control measures. The pFDR is the expectation of false discovery proportion (FDP), which refers to the ratio of the number of null hypotheses to that of all rejected hypotheses. In practice, the expectation of ratio is approximated by the ratio of expectation; however, the conditions for transforming the former into the latter have not been investigated. This work derives exact integral expressions for the expectation (pFDR) and variance of FDP. The widely used approximation (ratio of expectations) is shown to be a particular case (in the limit of a large sample size) of the integral formula for pFDR. A recurrence formula is provided to compute the pFDR for a predefined number of null hypotheses. The variance of FDP was approximated for a practical application in peptide identification using forward and reversed protein sequences. The simulations demonstrate that the integral expression exhibits better accuracy than the approximate formula in the case of a small number of hypotheses. For large sample sizes, the pFDRs obtained by the integral expression and approximation do not differ substantially. Applications to proteomics data sets are included.

Assuntos

Proteômica , Proteômica/métodos , Algoritmos , Reações Falso-Positivas , Peptídeos/análise , Peptídeos/química , Peptídeos/metabolismo , Simulação por Computador , Humanos

10.

False discovery rate: the Achilles' heel of proteogenomics.

Aggarwal, Suruchi; Raj, Anurag; Kumar, Dhirendra; Dash, Debasis; Yadav, Amit Kumar.

Brief Bioinform ; 23(5)2022 09 20.

Artigo em Inglês | MEDLINE | ID: mdl-35534181

RESUMO

Proteogenomics refers to the integrated analysis of the genome and proteome that leverages mass-spectrometry (MS)-based proteomics data to improve genome annotations, understand gene expression control through proteoforms and find sequence variants to develop novel insights for disease classification and therapeutic strategies. However, proteogenomic studies often suffer from reduced sensitivity and specificity due to inflated database size. To control the error rates, proteogenomics depends on the target-decoy search strategy, the de-facto method for false discovery rate (FDR) estimation in proteomics. The proteogenomic databases constructed from three- or six-frame nucleotide database translation not only increase the search space and compute-time but also violate the equivalence of target and decoy databases. These searches result in poorer separation between target and decoy scores, leading to stringent FDR thresholds. Understanding these factors and applying modified strategies such as two-pass database search or peptide-class-specific FDR can result in a better interpretation of MS data without introducing additional statistical biases. Based on these considerations, a user can interpret the proteogenomics results appropriately and control false positives and negatives in a more informed manner. In this review, first, we briefly discuss the proteogenomic workflows and limitations in database construction, followed by various considerations that can influence potential novel discoveries in a proteogenomic study. We conclude with suggestions to counter these challenges for better proteogenomic data interpretation.

Assuntos

Proteogenômica , Bases de Dados de Proteínas , Nucleotídeos , Peptídeos/química , Proteogenômica/métodos , Proteoma , Proteômica/métodos

11.

MERIT: Controlling Monte-Carlo error rate in large-scale Monte-Carlo hypothesis testing.

Li, Yunxiao; Hu, Yi-Juan; Satten, Glen A.

Stat Med ; 43(2): 279-295, 2024 01 30.

Artigo em Inglês | MEDLINE | ID: mdl-38124426

RESUMO

The use of Monte-Carlo (MC) p $$ p $$ -values when testing the significance of a large number of hypotheses is now commonplace. In large-scale hypothesis testing, we will typically encounter at least some p $$ p $$ -values near the threshold of significance, which require a larger number of MC replicates than p $$ p $$ -values that are far from the threshold. As a result, some incorrect conclusions can be reached due to MC error alone; for hypotheses near the threshold, even a very large number (eg, 1 0 6 $$ 1{0}^6 $$ ) of MC replicates may not be enough to guarantee conclusions reached using MC p $$ p $$ -values. Gandy and Hahn (GH)6-8 have developed the only method that directly addresses this problem. They defined a Monte-Carlo error rate (MCER) to be the probability that any decisions on accepting or rejecting a hypothesis based on MC p $$ p $$ -values are different from decisions based on ideal p $$ p $$ -values; their method then makes decisions by controlling the MCER. Unfortunately, the GH method is frequently very conservative, often making no rejections at all and leaving a large number of hypotheses "undecided". In this article, we propose MERIT, a method for large-scale MC hypothesis testing that also controls the MCER but is more statistically efficient than the GH method. Through extensive simulation studies, we demonstrate that MERIT controls the MCER while making more decisions that agree with the ideal p $$ p $$ -values than GH does. We also illustrate our method by an analysis of gene expression data from a prostate cancer study.

Assuntos

Projetos de Pesquisa , Humanos , Simulação por Computador , Probabilidade , Método de Monte Carlo

12.

Local false discovery rate estimation with competition-based procedures for variable selection.

Sun, Xiaoya; Fu, Yan.

Stat Med ; 43(1): 61-88, 2024 01 15.

Artigo em Inglês | MEDLINE | ID: mdl-37927105

RESUMO

Multiple hypothesis testing has been widely applied to problems dealing with high-dimensional data, for example, the selection of important variables or features from a large number of candidates while controlling the error rate. The most prevailing measure of error rate used in multiple hypothesis testing is the false discovery rate (FDR). In recent years, the local false discovery rate (fdr) has drawn much attention, due to its advantage of accessing the confidence of individual hypotheses. However, most methods estimate fdr through P $$ P $$ -values or statistics with known null distributions, which are sometimes unavailable or unreliable. Adopting the innovative methodology of competition-based procedures, for example, the knockoff filter, this paper proposes a new approach, named TDfdr, to fdr estimation, which is free of P $$ P $$ -values or known null distributions. Extensive simulation studies demonstrate that TDfdr can accurately estimate the fdr with two competition-based procedures. We applied the TDfdr method to two real biomedical tasks. One is to identify significantly differentially expressed proteins related to the COVID-19 disease, and the other is to detect mutations in the genotypes of HIV-1 that are associated with drug resistance. Higher discovery power was observed compared to existing popular methods.

Assuntos

Algoritmos , Projetos de Pesquisa , Humanos , Simulação por Computador

13.

Anomaly Detection and Correction in Dense Functional Data Within Electronic Medical Records.

Kuwaye, Daren; Cho, Hyunkeun Ryan.

Stat Med ; 2024 Sep 03.

Artigo em Inglês | MEDLINE | ID: mdl-39225196

RESUMO

In medical research, the accuracy of data from electronic medical records (EMRs) is critical, particularly when analyzing dense functional data, where anomalies can severely compromise research integrity. Anomalies in EMRs often arise from human errors in data measurement and entry, and increase in frequency with the volume of data. Despite the established methods in computer science, anomaly detection in medical applications remains underdeveloped. We address this deficiency by introducing a novel tool for identifying and correcting anomalies specifically in dense functional EMR data. Our approach utilizes studentized residuals from a mean-shift model, and therefore assumes that the data adheres to a smooth functional trajectory. Additionally, our method is tailored to be conservative, focusing on anomalies that signify actual errors in the data collection process while controlling for false discovery rates and type II errors. To support widespread implementation, we provide a comprehensive R package, ensuring that our methods can be applied in diverse settings. Our methodology's efficacy has been validated through rigorous simulation studies and real-world applications, confirming its ability to accurately identify and correct errors, thus enhancing the reliability and quality of medical data analysis.

14.

Adjusting for false discoveries in constraint-based differential metabolic flux analysis.

Galuzzi, Bruno G; Milazzo, Luca; Damiani, Chiara.

J Biomed Inform ; 150: 104597, 2024 02.

Artigo em Inglês | MEDLINE | ID: mdl-38272432

RESUMO

One of the critical steps to characterize metabolic alterations in multifactorial diseases, as well as their heterogeneity across different patients, is the identification of reactions that exhibit significantly different usage (or flux) between cohorts. However, since metabolic fluxes cannot be determined directly, researchers typically use constraint-based metabolic network models, customized on post-genomics datasets. The use of random sampling within the feasible region of metabolic networks is becoming more prevalent for comparing these networks. While many algorithms have been proposed and compared for efficiently and uniformly sampling the feasible region of metabolic networks, their impact on the risk of making false discoveries when comparing different samples has not been investigated yet, and no sampling strategy has been so far specifically designed to mitigate the problem. To be able to precisely assess the False Discovery Rate (FDR), in this work we compared different samples obtained from the very same metabolic model. We compared the FDR obtained for different model scales, sample sizes, parameters of the sampling algorithm, and strategies to filter out non-significant variations. To be able to compare the largely used hit-and-run strategy with the much less investigated corner-based strategy, we first assessed the intrinsic capability of current corner-based algorithms and of a newly proposed one to visit all vertices of a constraint-based region. We show that false discoveries can occur at high rates even for large samples of small-scale networks. However, we demonstrate that a statistical test based on the empirical null distribution of Kullback-Leibler divergence can effectively correct for false discoveries. We also show that our proposed corner-based algorithm is more efficient than state-of-the-art alternatives and much less prone to false discoveries than hit-and-run strategies. We report that the differences in the marginal distributions obtained with the two strategies are related to but not fully explained by differences in sample standard deviation, as previously thought. Overall, our study provides insights into the impact of sampling strategies on FDR in metabolic network analysis and offers new guidelines for more robust and reproducible analyses.

Assuntos

Análise do Fluxo Metabólico , Modelos Biológicos , Humanos , Algoritmos , Redes e Vias Metabólicas , Genômica

15.

Multiattribute Glycan Identification and FDR Control for Glycoproteomics.

Polasky, Daniel A; Geiszler, Daniel J; Yu, Fengchao; Nesvizhskii, Alexey I.

Mol Cell Proteomics ; 21(3): 100205, 2022 03.

Artigo em Inglês | MEDLINE | ID: mdl-35091091

RESUMO

Rapidly improving methods for glycoproteomics have enabled increasingly large-scale analyses of complex glycopeptide samples, but annotating the resulting mass spectrometry data with high confidence remains a major bottleneck. We recently introduced a fast and sensitive glycoproteomics search method in our MSFragger search engine, which reports glycopeptides as a combination of a peptide sequence and the mass of the attached glycan. In samples with complex glycosylation patterns, converting this mass to a specific glycan composition is not straightforward; however, as many glycans have similar or identical masses. Here, we have developed a new method for determining the glycan composition of N-linked glycopeptides fragmented by collisional or hybrid activation that uses multiple sources of information from the spectrum, including observed glycan B-type (oxonium) and Y-type ions and mass and precursor monoisotopic selection errors to discriminate between possible glycan candidates. Combined with false discovery rate estimation for the glycan assignment, we show that this method is capable of specifically and sensitively identifying glycans in complex glycopeptide analyses and effectively controls the rate of false glycan assignments. The new method has been incorporated into the PTM-Shepherd modification analysis tool to work directly with the MSFragger glyco search in the FragPipe graphical user interface, providing a complete computational pipeline for annotation of N-glycopeptide spectra with false discovery rate control of both peptide and glycan components that is both sensitive and robust against false identifications.

Assuntos

Proteômica , Espectrometria de Massas em Tandem , Glicopeptídeos/química , Glicosilação , Polissacarídeos/análise , Proteômica/métodos

16.

Reanalysis of ProteomicsDB Using an Accurate, Sensitive, and Scalable False Discovery Rate Estimation Approach for Protein Groups.

The, Matthew; Samaras, Patroklos; Kuster, Bernhard; Wilhelm, Mathias.

Mol Cell Proteomics ; 21(12): 100437, 2022 12.

Artigo em Inglês | MEDLINE | ID: mdl-36328188

RESUMO

Estimating false discovery rates (FDRs) of protein identification continues to be an important topic in mass spectrometry-based proteomics, particularly when analyzing very large datasets. One performant method for this purpose is the Picked Protein FDR approach which is based on a target-decoy competition strategy on the protein level that ensures that FDRs scale to large datasets. Here, we present an extension to this method that can also deal with protein groups, that is, proteins that share common peptides such as protein isoforms of the same gene. To obtain well-calibrated FDR estimates that preserve protein identification sensitivity, we introduce two novel ideas. First, the picked group target-decoy and second, the rescued subset grouping strategies. Using entrapment searches and simulated data for validation, we demonstrate that the new Picked Protein Group FDR method produces accurate protein group-level FDR estimates regardless of the size of the data set. The validation analysis also uncovered that applying the commonly used Occam's razor principle leads to anticonservative FDR estimates for large datasets. This is not the case for the Picked Protein Group FDR method. Reanalysis of deep proteomes of 29 human tissues showed that the new method identified up to 4% more protein groups than MaxQuant. Applying the method to the reanalysis of the entire human section of ProteomicsDB led to the identification of 18,000 protein groups at 1% protein group-level FDR. The analysis also showed that about 1250 genes were represented by ≥2 identified protein groups. To make the method accessible to the proteomics community, we provide a software tool including a graphical user interface that enables merging results from multiple MaxQuant searches into a single list of identified and quantified protein groups.

Assuntos

Peptídeos , Espectrometria de Massas em Tandem , Humanos , Espectrometria de Massas em Tandem/métodos , Bases de Dados de Proteínas , Software , Proteoma , Algoritmos

17.

DeepLINK: Deep learning inference using knockoffs with applications to genomics.

Zhu, Zifan; Fan, Yingying; Kong, Yinfei; Lv, Jinchi; Sun, Fengzhu.

Proc Natl Acad Sci U S A ; 118(36)2021 09 07.

Artigo em Inglês | MEDLINE | ID: mdl-34480002

RESUMO

We propose a deep learning-based knockoffs inference framework, DeepLINK, that guarantees the false discovery rate (FDR) control in high-dimensional settings. DeepLINK is applicable to a broad class of covariate distributions described by the possibly nonlinear latent factor models. It consists of two major parts: an autoencoder network for the knockoff variable construction and a multilayer perceptron network for feature selection with the FDR control. The empirical performance of DeepLINK is investigated through extensive simulation studies, where it is shown to achieve FDR control in feature selection with both high selection power and high prediction accuracy. We also apply DeepLINK to three real data applications to demonstrate its practical utility.

Assuntos

Biologia Computacional/métodos , Aprendizado Profundo , Genômica , Algoritmos , Simulação por Computador , Redes Neurais de Computação

18.

False discovery rate control in genome-wide association studies with population structure.

Sesia, Matteo; Bates, Stephen; Candès, Emmanuel; Marchini, Jonathan; Sabatti, Chiara.

Proc Natl Acad Sci U S A ; 118(40)2021 10 05.

Artigo em Inglês | MEDLINE | ID: mdl-34580220

RESUMO

We present a comprehensive statistical framework to analyze data from genome-wide association studies of polygenic traits, producing interpretable findings while controlling the false discovery rate. In contrast with standard approaches, our method can leverage sophisticated multivariate algorithms but makes no parametric assumptions about the unknown relation between genotypes and phenotype. Instead, we recognize that genotypes can be considered as a random sample from an appropriate model, encapsulating our knowledge of genetic inheritance and human populations. This allows the generation of imperfect copies (knockoffs) of these variables that serve as ideal negative controls, correcting for linkage disequilibrium and accounting for unknown population structure, which may be due to diverse ancestries or familial relatedness. The validity and effectiveness of our method are demonstrated by extensive simulations and by applications to the UK Biobank data. These analyses confirm our method is powerful relative to state-of-the-art alternatives, while comparisons with other studies validate most of our discoveries. Finally, fast software is made available for researchers to analyze Biobank-scale datasets.

Assuntos

Genoma Humano/genética , Algoritmos , Estudo de Associação Genômica Ampla/métodos , Genótipo , Humanos , Desequilíbrio de Ligação/genética , Herança Multifatorial/genética , Fenótipo , Software

19.

False Discovery Rate Control for Lesion-Symptom Mapping With Heterogeneous Data via Weighted p-Values.

Zheng, Siyu; McLain, Alexander C; Habiger, Joshua; Rorden, Christopher; Fridriksson, Julius.

Biom J ; 66(6): e202300198, 2024 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-39162085

RESUMO

Lesion-symptom mapping studies provide insight into what areas of the brain are involved in different aspects of cognition. This is commonly done via behavioral testing in patients with a naturally occurring brain injury or lesions (e.g., strokes or brain tumors). This results in high-dimensional observational data where lesion status (present/absent) is nonuniformly distributed, with some voxels having lesions in very few (or no) subjects. In this situation, mass univariate hypothesis tests have severe power heterogeneity where many tests are known a priori to have little to no power. Recent advancements in multiple testing methodologies allow researchers to weigh hypotheses according to side information (e.g., information on power heterogeneity). In this paper, we propose the use of p-value weighting for voxel-based lesion-symptom mapping studies. The weights are created using the distribution of lesion status and spatial information to estimate different non-null prior probabilities for each hypothesis test through some common approaches. We provide a monotone minimum weight criterion, which requires minimum a priori power information. Our methods are demonstrated on dependent simulated data and an aphasia study investigating which regions of the brain are associated with the severity of language impairment among stroke survivors. The results demonstrate that the proposed methods have robust error control and can increase power. Further, we showcase how weights can be used to identify regions that are inconclusive due to lack of power.

Assuntos

Biometria , Humanos , Biometria/métodos , Afasia/fisiopatologia , Encéfalo/diagnóstico por imagem , Mapeamento Encefálico/métodos , Reações Falso-Positivas

20.

Controlling for false discoveries subsequently to large scale one-way ANOVA testing in proteomics: Practical considerations.

Burger, Thomas.

Proteomics ; 23(18): e2200406, 2023 09.

Artigo em Inglês | MEDLINE | ID: mdl-37357151

RESUMO

In discovery proteomics, as well as many other "omic" approaches, the possibility to test for the differential abundance of hundreds (or of thousands) of features simultaneously is appealing, despite requiring specific statistical safeguards, among which controlling for the false discovery rate (FDR) has become standard. Moreover, when more than two biological conditions or group treatments are considered, it has become customary to rely on the one-way analysis of variance (ANOVA) framework, where a first global differential abundance landscape provided by an omnibus test can be subsequently refined using various post-hoc tests (PHTs). However, the interactions between the FDR control procedures and the PHTs are complex, because both correspond to different types of multiple test corrections (MTCs). This article surveys various ways to orchestrate them in a data processing workflow and discusses their pros and cons.

Assuntos

Proteômica , Proteômica/métodos , Análise de Variância

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA