Search | BVS Regional Portal

1.

Spatial landmark detection and tissue registration with deep learning.

Ekvall, Markus; Bergenstråhle, Ludvig; Andersson, Alma; Czarnewski, Paulo; Olegård, Johannes; Käll, Lukas; Lundeberg, Joakim.

Nat Methods ; 21(4): 673-679, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38438615

ABSTRACT

Spatial landmarks are crucial in describing histological features between samples or sites, tracking regions of interest in microscopy, and registering tissue samples within a common coordinate framework. Although other studies have explored unsupervised landmark detection, existing methods are not well-suited for histological image data as they often require a large number of images to converge, are unable to handle nonlinear deformations between tissue sections and are ineffective for z-stack alignment, other modalities beyond image data or multimodal data. We address these challenges by introducing effortless landmark detection, a new unsupervised landmark detection and registration method using neural-network-guided thin-plate splines. Our proposed method is evaluated on a diverse range of datasets including histology and spatially resolved transcriptomics, demonstrating superior performance in both accuracy and stability compared to existing approaches.

Subject(s)

Deep Learning; Image Processing, Computer-Assisted/methods

2.

Automated model building and protein identification in cryo-EM maps.

Jamali, Kiarash; Käll, Lukas; Zhang, Rui; Brown, Alan; Kimanius, Dari; Scheres, Sjors H W.

Nature ; 628(8007): 450-457, 2024 Apr.

Article in English | MEDLINE | ID: mdl-38408488

ABSTRACT

Interpreting electron cryo-microscopy (cryo-EM) maps with atomic models requires high levels of expertise and labour-intensive manual intervention in three-dimensional computer graphics programs. Here we present ModelAngelo, a machine-learning approach for automated atomic model building in cryo-EM maps. By combining information from the cryo-EM map with information from protein sequence and structure in a single graph neural network, ModelAngelo builds atomic models for proteins that are of similar quality to those generated by human experts. For nucleotides, ModelAngelo builds backbones with similar accuracy to those built by humans. By using its predicted amino acid probabilities for each residue in hidden Markov model sequence searches, ModelAngelo outperforms human experts in the identification of proteins with unknown sequences. ModelAngelo will therefore remove bottlenecks and increase objectivity in cryo-EM structure determination.

Subject(s)

Cryoelectron Microscopy; Machine Learning; Models, Molecular; Proteins; Amino Acid Sequence; Cryoelectron Microscopy/methods; Cryoelectron Microscopy/standards; Markov Chains; Neural Networks, Computer; Protein Conformation; Proteins/chemistry; Proteins/ultrastructure; Computer Graphics

3.

Pathway analysis through mutual information.

Jeuken, Gustavo S; Käll, Lukas.

Bioinformatics ; 40(1), 2024 01 02.

Article in English | MEDLINE | ID: mdl-38195928

ABSTRACT

MOTIVATION: In pathway analysis, we aim to establish a connection between the activity of a particular biological pathway and a difference in phenotype. There are many available methods to perform pathway analysis; many of them rely on an upstream differential expression analysis, and many model the relations between the abundances of the analytes in a pathway as linear relationships. RESULTS: Here, we propose a new method for pathway analysis, MIPath, that relies on information theoretical principles and, therefore, does not model the association between pathway activity and phenotype, resulting in relatively few assumptions. For this, we construct a graph of the data points for each pathway using a nearest-neighbor approach and score the association between the structure of this graph and the phenotype of these same samples using Mutual Information while adjusting for the effects of random chance in each score. The initial nearest-neighbor approach evades individual gene-level comparisons, hence making the method scalable and less vulnerable to missing values. These properties make our method particularly useful for single-cell data. We benchmarked our method on several single-cell datasets, comparing it to established and new methods, and found that it produces robust, reproducible, and meaningful scores. AVAILABILITY AND IMPLEMENTATION: Source code is available at https://github.com/statisticalbiotechnology/mipath, or through Python Package Index as "mipathway."

Subject(s)

Software; Phenotype; Cluster Analysis

4.

The Association of Biomolecular Resource Facilities Proteome Informatics Research Group Study on Metaproteomics (iPRG-2020).

Jagtap, Pratik D; Hoopmann, Michael R; Neely, Benjamin A; Harvey, Antony; Käll, Lukas; Perez-Riverol, Yasset; Abajorga, Milky K; Thomas, Julie A; Weintraub, Susan T; Palmblad, Magnus.

J Biomol Tech ; 34(3), 2023 Sep 30.

Article in English | MEDLINE | ID: mdl-37969874

ABSTRACT

Metaproteomics research using mass spectrometry data has emerged as a powerful strategy to understand the mechanisms underlying microbiome dynamics and the interaction of microbiomes with their immediate environment. Recent advances in sample preparation, data acquisition, and bioinformatics workflows have greatly contributed to progress in this field. In 2020, the Association of Biomolecular Resource Facilities Proteome Informatics Research Group launched a collaborative study to assess the bioinformatics options available for metaproteomics research. The study was conducted in 2 phases. In the first phase, participants were provided with mass spectrometry data files and were asked to identify the taxonomic composition and relative taxa abundances in the samples without being supplied with any protein sequence databases. The most challenging question asked of the participants was to postulate the nature of any biological phenomena that may have taken place in the samples, such as interactions among taxonomic species. In the second phase, participants were provided with a protein sequence database composed of the species present in the sample and were asked to answer the same set of questions as for phase 1. In this report, we summarize the data processing methods and tools used by participants, including database searching and software tools used for taxonomic and functional analysis. This study provides insights into the status of metaproteomics bioinformatics in participating laboratories and core facilities.

Subject(s)

Proteome; Proteomics; Humans; Proteomics/methods; Software; Computational Biology; Databases, Protein

5.

Spatial multimodal analysis of transcriptomes and metabolomes in tissues.

Vicari, Marco; Mirzazadeh, Reza; Nilsson, Anna; Shariatgorji, Reza; Bjärterot, Patrik; Larsson, Ludvig; Lee, Hower; Nilsson, Mats; Foyer, Julia; Ekvall, Markus; Czarnewski, Paulo; Zhang, Xiaoqun; Svenningsson, Per; Käll, Lukas; Andrén, Per E; Lundeberg, Joakim.

Nat Biotechnol ; 2023 Sep 04.

Article in English | MEDLINE | ID: mdl-37667091

ABSTRACT

We present a spatial omics approach that combines histology, mass spectrometry imaging and spatial transcriptomics to facilitate precise measurements of mRNA transcripts and low-molecular-weight metabolites across tissue regions. The workflow is compatible with commercially available Visium glass slides. We demonstrate the potential of our method using mouse and human brain samples in the context of dopamine and Parkinson's disease.

6.

Retention Time and Fragmentation Predictors Increase Confidence in Identification of Common Variant Peptides.

Skiadopoulou, Dafni; Vasícek, Jakub; Kuznetsova, Ksenia; Bouyssié, David; Käll, Lukas; Vaudel, Marc.

J Proteome Res ; 22(10): 3190-3199, 2023 Oct 06.

Article in English | MEDLINE | ID: mdl-37656829

ABSTRACT

Precision medicine focuses on adapting care to the individual profile of patients, for example, accounting for their unique genetic makeup. Being able to account for the effect of genetic variation on the proteome holds great promise toward this goal. However, identifying the protein products of genetic variation using mass spectrometry has proven very challenging. Here we show that the identification of variant peptides can be improved by the integration of retention time and fragmentation predictors into a unified proteogenomic pipeline. By combining these intrinsic peptide characteristics using the search-engine post-processor Percolator, we demonstrate improved discrimination power between correct and incorrect peptide-spectrum matches. Our results demonstrate that the drop in performance that is induced when expanding a protein sequence database can be compensated, hence enabling efficient identification of genetic variation products in proteomics data. We anticipate that this enhancement of proteogenomic pipelines can provide a more refined picture of the unique proteome of patients and thereby contribute to improving patient care.

7.

Automated model building and protein identification in cryo-EM maps.

Jamali, Kiarash; Käll, Lukas; Zhang, Rui; Brown, Alan; Kimanius, Dari; Scheres, Sjors H W.

bioRxiv ; 2023 Oct 17.

Article in English | MEDLINE | ID: mdl-37292681

ABSTRACT

Interpreting electron cryo-microscopy (cryo-EM) maps with atomic models requires high levels of expertise and labour-intensive manual intervention. We present ModelAngelo, a machine-learning approach for automated atomic model building in cryo-EM maps. By combining information from the cryo-EM map with information from protein sequence and structure in a single graph neural network, ModelAngelo builds atomic models for proteins that are of similar quality to those generated by human experts. For nucleotides, ModelAngelo builds backbones with accuracy similar to that of humans. By using its predicted amino acid probabilities for each residue in hidden Markov model sequence searches, ModelAngelo outperforms human experts in the identification of proteins with unknown sequences. ModelAngelo will thus remove bottlenecks and increase objectivity in cryo-EM structure determination.

8.

Triqler for Protein Summarization of Data from Data-Independent Acquisition Mass Spectrometry.

Truong, Patrick; The, Matthew; Käll, Lukas.

J Proteome Res ; 22(4): 1359-1366, 2023 04 07.

Article in English | MEDLINE | ID: mdl-36988210

ABSTRACT

A frequent goal, or subgoal, when processing data from a quantitative shotgun proteomics experiment is a list of proteins that are differentially abundant under the examined experimental conditions. Unfortunately, obtaining such a list is a challenging process, as the mass spectrometer analyzes the proteolytic peptides of a protein rather than the proteins themselves. We have previously designed a Bayesian hierarchical probabilistic model, Triqler, for combining peptide identification and quantification errors into probabilities of proteins being differentially abundant. However, the model was developed for data from data-dependent acquisition. Here, we show that Triqler is also compatible with data-independent acquisition data after applying minor alterations for the missing value distribution. Furthermore, we find that it has better performance than a set of compared state-of-the-art protein summarization tools when evaluated on data-independent acquisition data.

Subject(s)

Peptides; Proteins; Bayes Theorem; Proteins/analysis; Peptides/analysis; Mass Spectrometry/methods; Proteomics/methods

9.

Toward an Integrated Machine Learning Model of a Proteomics Experiment.

Neely, Benjamin A; Dorfer, Viktoria; Martens, Lennart; Bludau, Isabell; Bouwmeester, Robbin; Degroeve, Sven; Deutsch, Eric W; Gessulat, Siegfried; Käll, Lukas; Palczynski, Pawel; Payne, Samuel H; Rehfeldt, Tobias Greisager; Schmidt, Tobias; Schwämmle, Veit; Uszkoreit, Julian; Vizcaíno, Juan Antonio; Wilhelm, Mathias; Palmblad, Magnus.

J Proteome Res ; 22(3): 681-696, 2023 03 03.

Article in English | MEDLINE | ID: mdl-36744821

ABSTRACT

In recent years machine learning has made extensive progress in modeling many aspects of mass spectrometry data. We brought together proteomics data generators, repository managers, and machine learning experts in a workshop with the goals to evaluate and explore machine learning applications for realistic modeling of data from multidimensional mass spectrometry-based proteomics analysis of any sample or organism. Following this sample-to-data roadmap helped identify knowledge gaps and define needs. Being able to generate bespoke and realistic synthetic data has legitimate and important uses in system suitability, method development, and algorithm benchmarking, while also posing critical ethical questions. The interdisciplinary nature of the workshop informed discussions of what is currently possible and future opportunities and challenges. In the following perspective we summarize these discussions in the hope of conveying our excitement about the potential of machine learning in proteomics and to inspire future research.

Subject(s)

Machine Learning; Proteomics; Proteomics/methods; Algorithms; Mass Spectrometry

10.

Integrating Identification and Quantification Uncertainty for Differential Protein Abundance Analysis with Triqler.

The, Matthew; Käll, Lukas.

Methods Mol Biol ; 2426: 91-117, 2023.

Article in English | MEDLINE | ID: mdl-36308686

ABSTRACT

Protein quantification for shotgun proteomics is a complicated process where errors can be introduced in each of the steps. Triqler is a Python package that estimates and integrates errors of the different parts of the label-free protein quantification pipeline into a single Bayesian model. Specifically, it weighs the quantitative values by the confidence we have in the correctness of the corresponding PSM. Furthermore, it treats missing values in a way that reflects their uncertainty relative to observed values. Finally, it combines these error estimates in a single differential abundance FDR that not only reflects the errors and uncertainties in quantification but also in identification. In this tutorial, we show how to (1) generate input data for Triqler from quantification packages such as MaxQuant and Quandenser, (2) run Triqler and what the different options are, (3) interpret the results, (4) investigate the posterior distributions of a protein of interest in detail, and (5) verify that the hyperparameter estimations are sensible.

Subject(s)

Proteins; Proteomics; Bayes Theorem; Uncertainty; Proteomics/methods; Software

11.

A Comprehensive Evaluation of Consensus Spectrum Generation Methods in Proteomics.

Luo, Xiyang; Bittremieux, Wout; Griss, Johannes; Deutsch, Eric W; Sachsenberg, Timo; Levitsky, Lev I; Ivanov, Mark V; Bubis, Julia A; Gabriels, Ralf; Webel, Henry; Sanchez, Aniel; Bai, Mingze; Käll, Lukas; Perez-Riverol, Yasset.

J Proteome Res ; 21(6): 1566-1574, 2022 06 03.

Article in English | MEDLINE | ID: mdl-35549218

ABSTRACT

Spectrum clustering is a powerful strategy to minimize redundant mass spectra by grouping them based on similarity, with the aim of forming groups of mass spectra from the same repeatedly measured analytes. Each such group of near-identical spectra can be represented by its so-called consensus spectrum for downstream processing. Although several algorithms for spectrum clustering have been adequately benchmarked and tested, the influence of the consensus spectrum generation step is rarely evaluated. Here, we present an implementation and benchmark of common consensus spectrum algorithms, including spectrum averaging, spectrum binning, the most similar spectrum, and the best-identified spectrum. We have analyzed diverse public data sets using two different clustering algorithms (spectra-cluster and MaRaCluster) to evaluate how the consensus spectrum generation procedure influences downstream peptide identification. The BEST and BIN methods were found to be the most reliable methods for consensus spectrum generation, including for data sets with post-translational modifications (PTM) such as phosphorylation. All source code and data of the present study are freely available on GitHub at https://github.com/statisticalbiotechnology/representative-spectra-benchmark.

Subject(s)

Proteomics; Tandem Mass Spectrometry; Algorithms; Cluster Analysis; Consensus; Databases, Protein; Proteomics/methods; Software; Tandem Mass Spectrometry/methods

12.

Prosit Transformer: A transformer for Prediction of MS2 Spectrum Intensities.

Ekvall, Markus; Truong, Patrick; Gabriel, Wassim; Wilhelm, Mathias; Käll, Lukas.

J Proteome Res ; 21(5): 1359-1364, 2022 05 06.

Article in English | MEDLINE | ID: mdl-35413196

ABSTRACT

Machine learning has been an integral part of interpreting data from mass spectrometry (MS)-based proteomics for a long time. Relatively recently, a machine-learning architecture, the Transformer, has proven successful in other areas of bioinformatics. Furthermore, the implementation of Transformers within bioinformatics has become relatively convenient due to transfer learning, i.e., adapting a network trained for other tasks to new functionality. Transfer learning makes these relatively large networks more accessible, as it generally requires less data and the training time improves substantially. We implemented a Transformer based on the pretrained model TAPE to predict MS2 intensities. TAPE is a general model trained to predict missing residues from protein sequences. Despite it being trained for a different task, we could modify its behavior by adding a prediction head at the end of the TAPE model and fine-tuning it using the spectrum intensities from the training set of the well-known predictor Prosit. We demonstrate that the predictor, which we call Prosit Transformer, outperforms the recurrent neural-network-based predictor Prosit, increasing the median angular similarity on its hold-out set from 0.908 to 0.929. We believe that Transformers will significantly increase prediction accuracy for other types of predictions within MS-based proteomics.
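The abstract reports angular similarity without defining it. A minimal sketch follows, assuming the normalized spectral angle commonly used to compare predicted and observed fragment intensities; the function name and the example vectors are illustrative and not taken from the paper.

```python
import math


def angular_similarity(predicted, observed):
    """Normalized spectral angle between two intensity vectors.

    Returns 1.0 when the vectors point in the same direction and 0.0 when
    they are orthogonal. Assumed definition: 1 - 2 * arccos(cosine) / pi.
    """
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    if norm_p == 0.0 or norm_o == 0.0:
        return 0.0  # no signal in one spectrum: treat as no similarity
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_o)))
    return 1.0 - 2.0 * math.acos(cosine) / math.pi


if __name__ == "__main__":
    # Hypothetical intensities for the same fragment ions, in the same order.
    print(angular_similarity([0.9, 0.1, 0.4], [1.0, 0.0, 0.5]))
```

Under this definition the score depends only on the direction of the intensity vectors, so rescaling either spectrum leaves it unchanged.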

Subject(s)

Machine Learning; Neural Networks, Computer; Amino Acid Sequence; Mass Spectrometry; Proteomics

13.

Survival analysis of pathway activity as a prognostic determinant in breast cancer.

Jeuken, Gustavo S; Tobin, Nicholas P; Käll, Lukas.

PLoS Comput Biol ; 18(3): e1010020, 2022 03.

Article in English | MEDLINE | ID: mdl-35344554

ABSTRACT

High throughput biology enables the measurement of relative concentrations of thousands of biomolecules from, e.g., tissue samples. The process leaves the investigator with the problem of how best to interpret the potentially large number of differences between samples. Many activities in a cell depend on ordered reactions involving multiple biomolecules, often referred to as pathways. It hence makes sense to study differences between samples in terms of altered pathway activity, using so-called pathway analysis. Traditional pathway analysis gives significance to differences in the pathway components' concentrations between sample groups; however, less frequently used methods for estimating individual samples' pathway activities have also been suggested. Here we demonstrate that such a method can be used for pathway-based survival analysis. Specifically, we investigate the pathway activities' association with patients' survival time based on the transcription profiles of the METABRIC dataset. Our implementation shows that pathway activities are better prognostic markers for survival time in METABRIC than the individual transcripts. We also demonstrate that we can regress out the effect of individual pathways on other pathways, which allows us to estimate the other pathways' residual pathway activity on survival. Furthermore, we illustrate how one can visualize the often interdependent measures over hierarchical pathway databases using sunburst plots.

Subject(s)

Breast Neoplasms; Breast Neoplasms/metabolism; Female; Humans; Prognosis; Survival Analysis

14.

Putting Humpty Dumpty Back Together Again: What Does Protein Quantification Mean in Bottom-Up Proteomics?

Plubell, Deanna L; Käll, Lukas; Webb-Robertson, Bobbie-Jo; Bramer, Lisa M; Ives, Ashley; Kelleher, Neil L; Smith, Lloyd M; Montine, Thomas J; Wu, Christine C; MacCoss, Michael J.

J Proteome Res ; 21(4): 891-898, 2022 04 01.

Article in English | MEDLINE | ID: mdl-35220718

ABSTRACT

Bottom-up proteomics provides peptide measurements and has been invaluable for moving proteomics into large-scale analyses. Commonly, a single quantitative value is reported for each protein-coding gene by aggregating peptide quantities into protein groups following protein inference or parsimony. However, given the complexity of both RNA splicing and post-translational protein modification, it is overly simplistic to assume that all peptides that map to a singular protein-coding gene will demonstrate the same quantitative response. By assuming that all peptides from a protein-coding sequence are representative of the same protein, we may miss the discovery of important biological differences. To capture the contributions of existing proteoforms, we need to reconsider the practice of aggregating protein values to a single quantity per protein-coding gene.

Subject(s)

Proteins; Proteomics; Peptides/genetics; Peptides/metabolism; Protein Processing, Post-Translational; Proteins/metabolism; Proteome/genetics; Proteome/metabolism

15.

Interpretation of the DOME Recommendations for Machine Learning in Proteomics and Metabolomics.

Palmblad, Magnus; Böcker, Sebastian; Degroeve, Sven; Kohlbacher, Oliver; Käll, Lukas; Noble, William Stafford; Wilhelm, Mathias.

J Proteome Res ; 21(4): 1204-1207, 2022 04 01.

Article in English | MEDLINE | ID: mdl-35119864

ABSTRACT

Machine learning is increasingly applied in proteomics and metabolomics to predict molecular structure, function, and physicochemical properties, including behavior in chromatography, ion mobility, and tandem mass spectrometry. These must be described in sufficient detail to apply or evaluate the performance of trained models. Here we look at and interpret the recently published and general DOME (Data, Optimization, Model, Evaluation) recommendations for conducting and reporting on machine learning in the specific context of proteomics and metabolomics.

Subject(s)

Metabolomics; Proteomics; Machine Learning; Metabolomics/methods; Proteomics/methods; Tandem Mass Spectrometry

16.

Finding haplotypic signatures in proteins.

Vasícek, Jakub; Skiadopoulou, Dafni; Kuznetsova, Ksenia G; Wen, Bo; Johansson, Stefan; Njølstad, Pål R; Bruckner, Stefan; Käll, Lukas; Vaudel, Marc.

Gigascience ; 12, 2022 12 28.

Article in English | MEDLINE | ID: mdl-37919975

ABSTRACT

BACKGROUND: The nonrandom distribution of alleles of common genomic variants produces haplotypes, which are fundamental in medical and population genetic studies. Consequently, protein-coding genes with different co-occurring sets of alleles can encode different amino acid sequences: protein haplotypes. These protein haplotypes are present in biological samples and detectable by mass spectrometry, but they are not accounted for in proteomic searches. Consequently, the impact of haplotypic variation on the results of proteomic searches and the discoverability of peptides specific to haplotypes remain unknown. FINDINGS: Here, we study how common genetic haplotypes influence the proteomic search space and investigate the possibility to match peptides containing multiple amino acid substitutions to a publicly available data set of mass spectra. We found that for 12.42% of the discoverable amino acid substitutions encoded by common haplotypes, 2 or more substitutions may co-occur in the same peptide after tryptic digestion of the protein haplotypes. We identified 352 spectra that matched to such multivariant peptides, and out of the 4,582 amino acid substitutions identified, 6.37% were covered by multivariant peptides. However, the evaluation of the reliability of these matches remains challenging, suggesting that refined error rate estimation procedures are needed for such complex proteomic searches. CONCLUSIONS: As these procedures become available and the ability to analyze protein haplotypes increases, we anticipate that proteomics will provide new information on the consequences of common variation, across tissues and time.

Subject(s)

Proteins; Proteomics; Proteomics/methods; Haplotypes; Reproducibility of Results; Proteins/genetics; Peptides

17.

Triqler for MaxQuant: Enhancing Results from MaxQuant by Bayesian Error Propagation and Integration.

The, Matthew; Käll, Lukas.

J Proteome Res ; 20(4): 2062-2068, 2021 04 02.

Article in English | MEDLINE | ID: mdl-33661646

ABSTRACT

Error estimation for differential protein quantification by label-free shotgun proteomics is challenging due to the multitude of error sources, each contributing uncertainty to the final results. We have previously designed a Bayesian model, Triqler, to combine such error terms into one combined quantification error. Here we present an interface for Triqler that takes MaxQuant results as input, allowing quick reanalysis of already processed data. We demonstrate that Triqler outperforms the original processing for a large set of both engineered and clinically/biologically relevant data sets. Triqler and its interface to MaxQuant are available as a Python module under an Apache 2.0 license from https://pypi.org/project/triqler/.

Subject(s)

Proteomics; Software; Bayes Theorem; Proteins

18.

The one-carbon pool controls mitochondrial energy metabolism via complex I and iron-sulfur clusters.

Rosenberger, Florian A; Moore, David; Atanassov, Ilian; Moedas, Marco F; Clemente, Paula; Végvári, Ákos; Fissi, Najla El; Filograna, Roberta; Bucher, Anna-Lena; Hinze, Yvonne; The, Matthew; Hedman, Erik; Chernogubova, Ekaterina; Begzati, Arjana; Wibom, Rolf; Jain, Mohit; Nilsson, Roland; Käll, Lukas; Wedell, Anna; Freyer, Christoph; Wredenberg, Anna.

Sci Adv ; 7(8), 2021 02.

Article in English | MEDLINE | ID: mdl-33608280

ABSTRACT

Induction of the one-carbon cycle is an early hallmark of mitochondrial dysfunction and cancer metabolism. Vital intermediary steps are localized to mitochondria, but it remains unclear how one-carbon availability connects to mitochondrial function. Here, we show that the one-carbon metabolite and methyl group donor S-adenosylmethionine (SAM) is pivotal for energy metabolism. A gradual decline in mitochondrial SAM (mitoSAM) causes hierarchical defects in fly and mouse, comprising loss of mitoSAM-dependent metabolites and impaired assembly of the oxidative phosphorylation system. Complex I stability and iron-sulfur cluster biosynthesis are directly controlled by mitoSAM levels, while other protein targets are predominantly methylated outside of the organelle before import. The mitoSAM pool follows its cytosolic production, establishing mitochondria as responsive receivers of one-carbon units. Thus, we demonstrate that cellular methylation potential is required for energy metabolism, with direct relevance for pathophysiology, aging, and cancer.

19.

Performing Selection on a Monotonic Function in Lieu of Sorting Using Layer-Ordered Heaps.

Lucke, Kyle; Pennington, Jake; Kreitzberg, Patrick; Käll, Lukas; Serang, Oliver.

J Proteome Res ; 20(4): 1849-1854, 2021 04 02.

Article in English | MEDLINE | ID: mdl-33529032

ABSTRACT

Nonparametric statistical tests are an integral part of scientific experiments in a diverse range of fields. When performing such tests, it is standard to sort values; however, this requires Ω(n log(n)) time to sort n values. Thus given enough data, sorting becomes the computational bottleneck, even with very optimized implementations such as the C++ standard library routine, std::sort. Frequently, a nonparametric statistical test is only used to partition values above and below a threshold in the sorted ordering, where the threshold corresponds to a significant statistical result. Linear-time selection and partitioning algorithms cannot be directly used because the selection and partitioning are performed on the transformed statistical significance values rather than on the sorted statistics. Usually, those transformed statistical significance values (e.g., the p value when investigating the family-wise error rate and q values when investigating the false discovery rate (FDR)) can only be computed at a threshold. Because this threshold is unknown, this leads to sorting the data. Layer-ordered heaps, which can be constructed in O(n), only partially sort values and thus can be used to get around the slow runtime required to fully sort. Here we introduce a layer-ordering-based method for selection and partitioning on the transformed values (e.g., p values or q values). We demonstrate the use of this method to partition peptides using an FDR threshold. This approach is applied to speed up Percolator, a postprocessing algorithm used in mass-spectrometry-based proteomics to evaluate the quality of peptide-spectrum matches (PSMs), by >70% on data sets with 100 million PSMs.
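To make the sorting bottleneck concrete, here is a minimal sketch of the sort-based baseline that the abstract says the layer-ordered approach avoids. It is not the layer-ordered heap method itself. As an assumption for illustration it uses Benjamini-Hochberg q-values computed from p-values, whereas Percolator derives its statistics from target and decoy matches; the function names are hypothetical.

```python
def bh_qvalues(p_values):
    """Benjamini-Hochberg q-values; the full sort is the O(n log n) step."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        q_values[index] = running_min
    return q_values


def select_below(p_values, threshold=0.01):
    """Indices of the values whose q-value is at or below the threshold."""
    q_values = bh_qvalues(p_values)
    return [i for i, q in enumerate(q_values) if q <= threshold]


if __name__ == "__main__":
    example = [0.01, 0.04, 0.03, 0.5]
    print(bh_qvalues(example))
    print(select_below(example, threshold=0.05))
```

The only output of interest is the partition into accepted and rejected values, yet the baseline orders every value first; that is the cost the abstract's method is designed to avoid.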

Subject(s)

Proteomics; Tandem Mass Spectrometry; Algorithms; Databases, Protein; Peptides; Software

20.

Parallelized calculation of permutation tests.

Ekvall, Markus; Höhle, Michael; Käll, Lukas.

Bioinformatics ; 36(22-23): 5392-5397, 2021 04 01.

Article in English | MEDLINE | ID: mdl-33289531

ABSTRACT

MOTIVATION: Permutation tests offer a straightforward framework to assess the significance of differences in sample statistics. A significant advantage of permutation tests is that relatively few assumptions about the distribution of the test statistic are needed, as they rely on the assumption of exchangeability of the group labels. They have great value, as they allow a sensitivity analysis to determine the extent to which the assumed broad sample distribution of the test statistic applies. However, in this situation, permutation tests are rarely applied because naïve implementations are too slow, with a running time that grows exponentially with the sample size. Nevertheless, continued development in the 1980s introduced dynamic programming algorithms that compute exact permutation tests in polynomial time. Despite this significant running time reduction, the exact test has not yet become one of the predominant statistical tests for medium sample sizes. Here, we propose a computational parallelization of one such dynamic programming-based permutation test, the Green algorithm, which makes the permutation test more attractive. RESULTS: Parallelization of the Green algorithm was found to be possible through a non-trivial rearrangement of the structure of the algorithm. A speed-up of orders of magnitude is achievable by executing the parallelized algorithm on a GPU. We demonstrate that the execution time essentially becomes a non-issue, even for sample sizes as high as hundreds of samples. This improvement makes our method an attractive alternative to, e.g., the widely used asymptotic Mann-Whitney U-test. AVAILABILITY AND IMPLEMENTATION: Python 3 code is available from the GitHub repository https://github.com/statisticalbiotechnology/parallelPermutationTest under an Apache 2.0 license. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
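The polynomial-time idea mentioned in the abstract can be sketched as follows. This is a generic subset-sum counting recursion in the spirit of dynamic-programming permutation tests, not the Green algorithm as parallelized by the authors. It assumes non-negative integer observations and uses the sum of the first group as the test statistic; all names are illustrative.

```python
from itertools import combinations
from math import comb


def exact_permutation_pvalue(group_a, group_b):
    """One-sided exact p-value: P(sum of a relabelled group A >= observed sum).

    Assumes non-negative integer observations so sums can index a table.
    """
    pooled = list(group_a) + list(group_b)
    m = len(group_a)
    observed = sum(group_a)
    total = sum(pooled)
    # counts[k][s] = number of ways to pick k pooled values that sum to s
    counts = [[0] * (total + 1) for _ in range(m + 1)]
    counts[0][0] = 1
    for value in pooled:
        # Go through k downwards so each observation is used at most once.
        for k in range(m, 0, -1):
            for s in range(total, value - 1, -1):
                counts[k][s] += counts[k - 1][s - value]
    extreme = sum(counts[m][observed:])
    return extreme / comb(len(pooled), m)


def brute_force_pvalue(group_a, group_b):
    """Reference implementation that enumerates every relabelling."""
    pooled = list(group_a) + list(group_b)
    m = len(group_a)
    observed = sum(group_a)
    sums = [sum(c) for c in combinations(pooled, m)]
    return sum(s >= observed for s in sums) / len(sums)


if __name__ == "__main__":
    a, b = [7, 9, 8], [1, 2, 3, 4]
    print(exact_permutation_pvalue(a, b), brute_force_pvalue(a, b))
```

The table has (m + 1) × (total + 1) entries and is updated once per observation, so the cost is polynomial in the sample size and the range of the data, whereas the brute-force reference enumerates every relabelling.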

Subject(s)

Algorithms; Statistics, Nonparametric
