ABSTRACT
RNA-binding proteins (RBPs) control RNA metabolism to orchestrate gene expression and, when dysfunctional, underlie human diseases. Proteome-wide discovery efforts predict thousands of RBP candidates, many of which lack canonical RNA-binding domains (RBDs). Here, we present a hybrid ensemble RBP classifier (HydRA), which leverages information from both intermolecular protein interactions and internal protein sequence patterns to predict RNA-binding capacity with unparalleled specificity and sensitivity using support vector machines (SVMs), convolutional neural networks (CNNs), and Transformer-based protein language models. Occlusion mapping by HydRA robustly detects known RBDs and predicts hundreds of uncharacterized RNA-binding associated domains. Enhanced CLIP (eCLIP) for HydRA-predicted RBP candidates reveals transcriptome-wide RNA targets and confirms RNA-binding activity for HydRA-predicted RNA-binding associated domains. HydRA accelerates construction of a comprehensive RBP catalog and expands the diversity of RNA-binding associated domains.
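The following is an illustrative sketch, not HydRA's released code (function names and weights are invented here), of the basic ensemble step such a hybrid classifier rests on: per-protein probabilities from an SVM, a CNN and a protein language model are combined into one score, here by a weighted average.

```python
# Hypothetical sketch of the ensemble step: average the RNA-binding
# probabilities produced by three component models.
import numpy as np

def ensemble_rbp_score(svm_prob, cnn_prob, plm_prob, weights=(1.0, 1.0, 1.0)):
    """Weighted average of three component probabilities (all in [0, 1])."""
    probs = np.array([svm_prob, cnn_prob, plm_prob], dtype=float)
    w = np.array(weights, dtype=float)
    return float(np.dot(w, probs) / w.sum())

# Three hypothetical component scores for one candidate protein:
print(ensemble_rbp_score(0.91, 0.78, 0.85))  # ~0.85
```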
Subjects
Deep Learning, Hydra, Animals, Humans, RNA/metabolism, Protein Binding, Binding Sites/genetics, Hydra/genetics, Hydra/metabolism
ABSTRACT
Missing values (MVs) can adversely impact data analysis and machine-learning model development. We propose a novel mixed-model method for missing value imputation (MVI). This method, ProJect (short for Protein inJection), is a powerful and meaningful improvement over existing MVI methods such as Bayesian principal component analysis (PCA), probabilistic PCA, local least squares and quantile regression imputation of left-censored data. We rigorously tested ProJect on various high-throughput data types, including genomics and mass spectrometry (MS)-based proteomics. Specifically, we utilized renal cancer (RC) data acquired using DIA-SWATH, ovarian cancer (OC) data acquired using DIA-MS, and bladder (BladderBatch) and glioblastoma (GBM) microarray gene expression datasets. Our results demonstrate that ProJect consistently performs better than the other referenced MVI methods. It achieves the lowest normalized root mean square error (on average, scoring 45.92% less error in RC_C, 27.37% in RC_full, 29.22% in OC, 23.65% in BladderBatch and 20.20% in GBM relative to the closest competing method) and the lowest Procrustes sum of squared errors (Procrustes SS; 79.71% less error in RC_C, 38.36% in RC_full, 18.13% in OC, 74.74% in BladderBatch and 30.79% in GBM compared to the next best method). ProJect also leads with the highest correlation coefficient among all types of MV combinations (0.64% higher in RC_C, 0.24% in RC_full, 0.55% in OC, 0.39% in BladderBatch and 0.27% in GBM versus the second-best performing method). ProJect's key strength is its ability to handle different types of MVs commonly found in real-world data. Unlike most MVI methods that are designed to handle only one type of MV, ProJect employs a decision-making algorithm that first determines whether an MV is missing at random (MAR) or missing not at random (MNAR). It then employs targeted imputation strategies for each MV type, resulting in more accurate and reliable imputation outcomes. An R implementation of ProJect is available at https://github.com/miaomiao6606/ProJect.
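To make the MAR/MNAR distinction concrete, here is a toy Python sketch, not ProJect's actual R algorithm: it flags a feature's missing values as MNAR when the feature's observed intensities sit in the low tail (suggesting left-censoring) and as MAR otherwise, then imputes the two types differently. The quantile threshold and both imputation rules are illustrative assumptions.

```python
import numpy as np

def impute_mixed(X, censor_quantile=0.25):
    """X: samples x features matrix with NaNs for missing values."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    global_cutoff = np.nanquantile(X, censor_quantile)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if not miss.any():
            continue
        if col_means[j] < global_cutoff:           # low-intensity feature: treat as MNAR
            out[miss, j] = np.nanmin(X[:, j]) / 2  # left-censored: small-value imputation
        else:                                      # otherwise assume MAR
            out[miss, j] = col_means[j]            # mean imputation as a stand-in
    return out
```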
Subjects
Algorithms, Genomics, Bayes Theorem, Oligonucleotide Array Sequence Analysis/methods, Mass Spectrometry/methods
ABSTRACT
Artificial intelligence (AI) and machine learning (ML) models are increasingly deployed on biomedical and health data to shed light on biological mechanisms, predict disease outcomes, and support clinical decision-making. However, ensuring model validity is challenging. The 10 quick tips described here discuss useful practices for checking AI/ML models from two perspectives: the user's and the developer's.
Subjects
Computational Biology, Machine Learning, Humans, Computational Biology/methods, Artificial Intelligence, Reproducibility of Results, Algorithms
ABSTRACT
BACKGROUND: With the rise of publicly available genomic data repositories, it is now common for scientists to rely on computational models and preprocessed data, either as controls or to discover new knowledge. However, different repositories adhere to different principles and guidelines, and data processing plays a significant role in the quality of the resulting datasets. Two popular repositories for transcription factor (TF) binding site data - ENCODE and Cistrome - process the same biological samples in alternative ways, and their results are not always consistent. Moreover, the output format of the processing (BED narrowPeak) exposes a feature, the signalValue, which is seldom used in consistency checks but can offer valuable insight into the quality of the data. RESULTS: We provide evidence that data points with high signalValues (top 25% of values) are more likely to be consistent between ENCODE and Cistrome in the human cell lines K562, GM12878 and HepG2. In addition, we show that filtering according to these high values improves the quality of predictions for a machine learning algorithm that detects transcription factor interactions based only on positional information. Finally, we provide a set of practices and guidelines, based on the signalValue feature, for scientists who wish to compare and merge narrowPeaks from ENCODE and Cistrome. CONCLUSIONS: The signalValue is an informative feature that can be effectively used to highlight consistent areas of overlap between different sources of TF binding sites that expose it. Its applicability extends downstream to positional machine learning algorithms, making it a powerful tool for performance tweaking and data aggregation.
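A minimal sketch of the filtering practice recommended above, assuming a standard 10-column BED narrowPeak file in which signalValue is the seventh column; the file paths and the top-25% rule are the only inputs.

```python
import numpy as np

def filter_top_signal(narrowpeak_path, out_path, top_fraction=0.25):
    with open(narrowpeak_path) as fin:
        rows = [line.rstrip("\n").split("\t") for line in fin if line.strip()]
    signals = np.array([float(r[6]) for r in rows])      # signalValue column
    cutoff = np.quantile(signals, 1.0 - top_fraction)    # top-25% threshold
    with open(out_path, "w") as out:
        for r, s in zip(rows, signals):
            if s >= cutoff:
                out.write("\t".join(r) + "\n")
```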
Subjects
Transcription Factors, Humans, Transcription Factors/metabolism, Transcription Factors/genetics, Binding Sites, Protein Binding, Computational Biology/methods, Machine Learning, Genetic Databases, Algorithms, Genomics/methods
ABSTRACT
In mass spectrometry (MS)-based proteomics, protein inference from identified peptides (protein fragments) is a critical step. We present ProInfer (Protein Inference), a novel protein assembly method that takes advantage of information in biological networks. ProInfer assists recovery of proteins supported only by ambiguous peptides (peptides that map to more than one candidate protein) and enhances the statistical confidence for proteins supported by both unique and ambiguous peptides. Consequently, ProInfer rescues weakly supported proteins, thereby improving proteome coverage. Evaluated on THP1 cell line, lung cancer and RAW267.4 datasets, ProInfer consistently infers the largest number of true positives in comparison with the mainstream protein inference tools Fido, EPIFANY and PIA. ProInfer is also adept at retrieving differentially expressed proteins, signifying its usefulness for functional analysis and phenotype profiling. The source code of ProInfer is available at https://github.com/PennHui2016/ProInfer.
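As a toy illustration of the network idea, not ProInfer's actual algorithm, the sketch below boosts the score of a protein supported only by ambiguous peptides using the scores of its high-confidence neighbours in a biological network; the mixing weight alpha and the function name are invented.

```python
# Toy illustration (not ProInfer's algorithm): mix a protein's own
# peptide evidence with the support of its network neighbours.
def network_boosted_score(base_score, neighbour_scores, alpha=0.5):
    if not neighbour_scores:
        return base_score
    support = sum(neighbour_scores) / len(neighbour_scores)
    return (1 - alpha) * base_score + alpha * support

# A weakly supported protein with two high-confidence network neighbours:
print(network_boosted_score(0.30, [0.95, 0.88]))  # 0.6075
```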
Subjects
Algorithms, Peptides, Peptides/chemistry, Proteome/analysis, Mass Spectrometry, Proteomics/methods, Protein Databases, Software
ABSTRACT
BACKGROUND: Most previous research on the environmental epidemiology of childhood atopic eczema, rhinitis and wheeze is limited in the scope of risk factors studied. Our study adopted a machine learning approach to explore the role of the exposome, starting as early as the preconception phase. METHODS: We performed a combined analysis of two multi-ethnic Asian birth cohorts, the Growing Up in Singapore Towards healthy Outcomes (GUSTO) and the Singapore PREconception Study of long Term maternal and child Outcomes (S-PRESTO) cohorts. Interviewer-administered questionnaires were used to collect information on demography, lifestyle and the development of childhood atopic eczema, rhinitis and wheeze. Model training was performed using XGBoost, genetic algorithm and logistic regression models, and the top variables with the highest importance were identified. Additive explanation values were identified and entered into a final multiple logistic regression model. Generalised structural equation modelling with maternal and child blood micronutrients, metabolites and cytokines was performed to explain possible mechanisms. RESULTS: The final study population included 1151 mother-child pairs. Our findings suggest that these childhood diseases are likely programmed in utero by the preconception and pregnancy exposomes through inflammatory pathways. We identified preconception alcohol consumption and maternal depressive symptoms during pregnancy as key modifiable maternal environmental exposures that increased eczema and rhinitis risk. Our mechanistic model suggested that higher maternal blood neopterin and child blood dimethylglycine protected against early childhood wheeze. After birth, early infection was a key driver of atopic eczema and rhinitis development. CONCLUSION: Preconception and antenatal exposomes can programme atopic eczema, rhinitis and wheeze development in utero. Reducing maternal alcohol consumption during preconception and supporting maternal mental health during pregnancy may prevent atopic eczema and rhinitis by promoting an optimal antenatal environment. Our findings suggest a need to include preconception environmental exposures in future research to counter the earliest precursors of disease development in children.
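The two-stage modelling described in METHODS can be sketched as follows, with scikit-learn's gradient boosting standing in for XGBoost and synthetic data standing in for the cohort variables; the top-5 cut-off is illustrative.

```python
# Hedged sketch: rank variables by boosted-tree importance, then refit
# a multiple logistic regression on the top-ranked variables.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))                 # 20 candidate exposures
y = (X[:, 3] + 0.8 * X[:, 7] + rng.normal(size=300) > 0).astype(int)

gbm = GradientBoostingClassifier(random_state=0).fit(X, y)
top = np.argsort(gbm.feature_importances_)[::-1][:5]   # top 5 variables

final = LogisticRegression(max_iter=1000).fit(X[:, top], y)
print("top variables:", top, "coefficients:", final.coef_.round(2))
```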
Subjects
Atopic Dermatitis, Exposome, Machine Learning, Respiratory Sounds, Rhinitis, Humans, Atopic Dermatitis/epidemiology, Female, Rhinitis/epidemiology, Male, Preschool Child, Singapore/epidemiology, Pregnancy, Maternal Exposure, Child, Adult, Prenatal Exposure Delayed Effects/epidemiology, Infant, Cohort Studies
ABSTRACT
Tree- and linear-shaped cell differentiation trajectories have been widely observed in developmental biology and can also be inferred through computational methods from single-cell RNA-sequencing datasets. However, trajectories with complicated topologies such as loops, disparate lineages and bifurcating hierarchies remain difficult to infer accurately. Here, we introduce a density-based trajectory inference method capable of constructing diverse shapes of topological patterns, including the most intriguing bifurcations. The novelty of our method lies in two steps: one exploits overlapping probability distributions to identify transition states of cells and determine connectability between cell clusters; the other infers a stable trajectory through base-topology-guided iterative fitting. Our method precisely reconstructed various benchmark reference trajectories. As a case study to demonstrate practical usefulness, our method was tested on single-cell RNA sequencing profiles of blood cells of SARS-CoV-2-infected patients. We not only rediscovered the linear trajectory bridging the transition from IgM plasmablast cells to developing neutrophils, but also found a previously undiscovered lineage that is rigorously supported by differential gene expression analysis.
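One simple way to quantify "overlapping probability distributions" between two cell clusters, shown here purely as an illustration of the idea rather than the paper's exact procedure, is the Bhattacharyya coefficient between two 1-D Gaussians; clusters with high overlap would be candidates for connection through transition-state cells.

```python
import math

def bhattacharyya_gauss(mu1, s1, mu2, s2):
    """Overlap of two univariate Gaussians: 1 = identical, 0 = disjoint."""
    db = 0.25 * math.log(0.25 * (s1**2 / s2**2 + s2**2 / s1**2 + 2)) \
         + 0.25 * (mu1 - mu2) ** 2 / (s1**2 + s2**2)
    return math.exp(-db)

print(bhattacharyya_gauss(0.0, 1.0, 1.0, 1.2))   # strongly overlapping
print(bhattacharyya_gauss(0.0, 1.0, 8.0, 1.2))   # essentially disjoint
```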
Subjects
COVID-19, Single-Cell Analysis, Humans, Single-Cell Analysis/methods, SARS-CoV-2, COVID-19/genetics, Cell Differentiation/genetics
ABSTRACT
BACKGROUND: Current protein family modeling methods, such as profile hidden Markov models (pHMM), k-mer based methods and deep learning-based methods, do not provide very accurate protein function prediction for proteins in the twilight zone, due to low sequence similarity to reference proteins with known functions. RESULTS: We present a novel method, EnsembleFam, aiming at better function prediction for proteins in the twilight zone. EnsembleFam extracts the core characteristics of a protein family using similarity and dissimilarity features calculated from sequence homology relations. EnsembleFam trains three separate Support Vector Machine (SVM) classifiers for each family using these features, and an ensemble prediction is made to classify novel proteins into these families. Extensive experiments are conducted using the Clusters of Orthologous Groups (COG) dataset and the G Protein-Coupled Receptor (GPCR) dataset. EnsembleFam not only outperforms state-of-the-art methods on the overall dataset but also provides much more accurate predictions for twilight zone proteins. CONCLUSIONS: EnsembleFam, a machine learning method to model protein families, can be used to better identify members with very low sequence homology. Using EnsembleFam, protein functions can be predicted from sequence information alone with better accuracy than state-of-the-art methods.
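A minimal sketch of the per-family ensemble, with synthetic matrices standing in for the similarity/dissimilarity feature views; the three-view majority vote mirrors the description above, but all data and settings here are invented.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
views = [rng.normal(size=(200, 10)) for _ in range(3)]  # three feature views
y = rng.integers(0, 2, size=200)                        # in-family vs. not

clfs = [SVC().fit(V, y) for V in views]                 # one SVM per view

def ensemble_predict(test_views):
    """Majority vote across the three per-view SVM predictions."""
    votes = np.array([clf.predict(V) for clf, V in zip(clfs, test_views)])
    return (votes.mean(axis=0) >= 0.5).astype(int)

print(ensemble_predict([V[:5] for V in views]))         # labels for 5 proteins
```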
Subjects
Proteins, Support Vector Machine, Humans, Proteins/metabolism
ABSTRACT
MOTIVATION: Existing genome assembly evaluation metrics provide only limited insight into specific aspects of genome assembly quality, and sometimes even disagree with each other. For better integrative comparison between assemblies, we propose here a new genome assembly evaluation metric, Pairwise Distance Reconstruction (PDR). It derives from a common concern in genetic studies, and takes completeness, contiguity and correctness into consideration. We also propose an approximate implementation to accelerate PDR computation. RESULTS: Our results on publicly available datasets affirm PDR's ability to integratively assess the quality of a genome assembly. In fact, this is guaranteed by its definition. The results also indicate that the error introduced by the approximation is extremely small and thus negligible. AVAILABILITY AND IMPLEMENTATION: https://github.com/XLuyu/PDR. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Genome, High-Throughput Nucleotide Sequencing, Software, DNA Sequence Analysis
ABSTRACT
MOTIVATION: Infection with strains of different subtypes and the subsequent crossover reading between the two strands of genomic RNA by the viral reverse transcriptase within host cells are the main causes of the vast HIV-1 sequence diversity. Such inter-subtype genomic recombinants can become circulating recombinant forms (CRFs) after widespread transmission in a population. Complete prediction of all the subtype sources of a CRF strain is a complicated machine learning problem. It is also difficult to understand whether a strain is an emerging new subtype and, if so, how to accurately identify the new components of the genetic source. RESULTS: We introduce a multi-label learning algorithm for the complete prediction of multiple sources of a CRF sequence, as well as the prediction of its chronological number. The prediction is strengthened by a voting of various multi-label learning methods to avoid biased decisions. In our pipeline, frequency and position features of the sequences are both extracted to capture signature patterns of pure subtypes and CRFs. The method was applied to 7185 HIV-1 sequences, comprising 5530 pure subtype sequences and 1655 CRF sequences. Results demonstrate that the method can achieve very high accuracy (reaching 99%) in the prediction of the complete set of labels of HIV-1 recombinant forms. The few wrong predictions are actually incomplete predictions that are very close to the complete set of genuine labels. AVAILABILITY AND IMPLEMENTATION: https://github.com/Runbin-tang/The-source-of-HIV-CRFs-prediction. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
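The sketch below illustrates the general recipe, k-mer frequency features plus a one-vs-rest multi-label classifier, with invented sequences and labels; the published method additionally uses position features and a vote over several multi-label learners.

```python
from itertools import product
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

KMERS = ["".join(p) for p in product("ACGT", repeat=3)]

def kmer_freq(seq, k=3):
    """Normalized 3-mer frequency vector of a nucleotide sequence."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:
            counts[kmer] += 1
    total = max(1, len(seq) - k + 1)
    return np.array([counts[m] / total for m in KMERS])

# Toy training data: random sequences, random multi-label subtype sets.
rng = np.random.default_rng(2)
seqs = ["".join(rng.choice(list("ACGT"), 500)) for _ in range(60)]
X = np.vstack([kmer_freq(s) for s in seqs])
Y = rng.integers(0, 2, size=(60, 4))     # 4 possible subtype labels

model = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(model.predict(X[:2]))              # predicted label sets
```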
Subjects
HIV Infections, HIV-1, Genetic Variation, HIV Infections/genetics, HIV-1/genetics, Humans, Molecular Epidemiology, Phylogeny
ABSTRACT
BACKGROUND: A pair of genes is defined as synthetically lethal if defects in both cause the death of the cell but a defect in only one of the two is compatible with cell viability. Ideally, if A and B are two synthetic lethal genes, inhibiting B should kill cancer cells with a defect in A and have no effect on normal cells. Thus, synthetic lethality can be exploited for highly selective cancer therapies, which need to exploit differences between normal and cancer cells. RESULTS: In this paper, we present a new method for predicting synthetic lethal (SL) gene pairs. As neighbouring genes in the genome have highly correlated profiles of copy number alterations (CNAs), our method clusters proximal genes with a similar CNA profile, then predicts mutually exclusive group pairs, and finally identifies the SL gene pairs within each group pair. For mutual-exclusivity testing we use a graph-based method which takes into account the mutation frequencies of different subjects and genes. We use two different methods for selecting the pair of SL genes: the first is based on gene essentiality measured in various conditions by means of the 'Gene Activity Ranking Profile' (GARP) score; the second leverages the annotations of genes to biological pathways. CONCLUSIONS: This method is unique among current SL prediction approaches: it reduces false-positive SL predictions compared to previous methods, and it allows establishing explicit collateral lethality relationships of gene pairs within mutually exclusive group pairs.
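For intuition, a much-simplified mutual-exclusivity check between two mutation profiles is sketched below using a one-sided Fisher's exact test; the paper's graph-based test, which accounts for per-subject and per-gene mutation frequencies, is more sophisticated.

```python
import numpy as np
from scipy.stats import fisher_exact

def mutual_exclusivity_p(mut_a, mut_b):
    """mut_a, mut_b: boolean mutation indicators over the same samples."""
    both = int(np.sum(mut_a & mut_b))
    only_a = int(np.sum(mut_a & ~mut_b))
    only_b = int(np.sum(~mut_a & mut_b))
    neither = int(np.sum(~mut_a & ~mut_b))
    # 'less' asks whether co-occurrence is under-represented
    _, p = fisher_exact([[both, only_a], [only_b, neither]],
                        alternative="less")
    return p

a = np.array([1, 0, 1, 0, 1, 0, 1, 0], dtype=bool)
b = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=bool)
print(mutual_exclusivity_p(a, b))   # small p: mutually exclusive pattern
```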
Subjects
DNA Copy Number Variations, Lethal Genes, DNA
ABSTRACT
Mass spectrometry (MS)-based proteomics has undergone rapid advancements in recent years, creating challenging problems for bioinformatics. We focus on four aspects where bioinformatics plays a crucial role (and where proteomics is needed for clinical application): peptide-spectrum matching (PSM) based on the new data-independent acquisition (DIA) paradigm, resolving missing proteins (MPs), dealing with biological and technical heterogeneity in data, and statistical feature selection (SFS). DIA is a brute-force strategy that provides greater width and depth but, because it indiscriminately captures spectra such that signal from multiple peptides is mixed, getting good PSMs is difficult. We consider two strategies: simplification of DIA spectra to pseudo-data-dependent acquisition spectra or, alternatively, brute-force search of each DIA spectrum against known reference libraries. The MP problem arises when proteins are never (or inconsistently) detected by MS. When observed in at least one sample, imputation methods can be used to guess the approximate protein expression level. If never observed at all, network/protein complex-based contextualization provides an independent prediction platform. Data heterogeneity is a difficult problem with two dimensions: technical (batch effects), which should be removed, and biological (including demography and disease subpopulations), which should be retained. Simple normalization is seldom sufficient, while batch effect-correction algorithms may create errors. Batch effect-resistant normalization methods are a viable alternative. Finally, SFS is vital for practical applications. While many methods exist, there is no single best method, and both upstream (e.g. normalization) and downstream processing (e.g. multiple-testing correction) are performance confounders. We also discuss signal detection when class effects are weak.
Subjects
Computational Biology/methods, Proteomics/statistics & numerical data, Algorithms, Computational Biology/statistics & numerical data, Protein Databases/statistics & numerical data, Humans, Peptides/chemistry, Proteins/chemistry, Software, Tandem Mass Spectrometry/statistics & numerical data
ABSTRACT
MOTIVATION: A maximal match between two genomes is a contiguous, non-extendable subsequence common to the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs in a maximal match, it breaks the maximal match into shorter match segments. The coding cost of using these broken segments for reference-based genome compression is much higher than that of using a maximal match which is allowed to contain mutations. RESULTS: We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. memRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme; it then extends these matches to cover the mismatches (mutations) and their neighbouring maximal matches to form long MCMs. Experiments reveal that memRGC boosts compression performance by an average of 27% in reference-based genome compression. memRGC is also better than the best state-of-the-art methods on all of the benchmark datasets, sometimes by 50%. Moreover, memRGC uses much less memory and fewer decompression resources, while providing comparable compression speed. These advantages are of significant benefit to genome data storage and transmission. AVAILABILITY AND IMPLEMENTATION: https://github.com/yuansliu/memRGC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
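A toy sketch of the mutation-containing match idea: starting from an exact seed match, keep extending while absorbing isolated mismatches. The seeding scheme, mismatch budget and all names here are invented for illustration; memRGC's actual extension logic differs.

```python
def extend_with_mutations(ref, tgt, i, j, length, max_mismatch=3):
    """Extend a match ref[i:i+length] == tgt[j:j+length] across mutations."""
    mismatches = 0
    while i + length < len(ref) and j + length < len(tgt):
        if ref[i + length] == tgt[j + length]:
            length += 1
        elif mismatches < max_mismatch:
            mismatches += 1          # absorb one mutated base
            length += 1
        else:
            break
    return length, mismatches

ref = "ACGTACGTAAGG"
tgt = "ACGTACGTCAGG"
print(extend_with_mutations(ref, tgt, 0, 0, 8))  # (12, 1): one mutation absorbed
```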
Subjects
Data Compression, Software, Algorithms, Genome, High-Throughput Nucleotide Sequencing, Mutation, DNA Sequence Analysis
ABSTRACT
MOTIVATION: K-mers along with their frequencies have served as an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, etc., attracting intensive studies in k-mer counting. However, the output of k-mer counters is itself large; very often, it is too large to fit into main memory, severely limiting usability. RESULTS: We introduce a novel idea of encoding k-mers as well as their frequencies, achieving good memory saving and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure that encodes counted k-mers by coupled bit arrays: one for k-mer representation and the other for frequency encoding. Experiments on five real datasets show that, with 7 hash functions, the average memory-saving ratio on all 31-mers is as high as 13.81 compared with the raw input. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude. AVAILABILITY AND IMPLEMENTATION: The source code of our algorithm is available at github.com/lzhLab/kmcEx. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
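The sketch below illustrates the coupled-array idea in Python, not kmcEx's actual layout: one bit array answers k-mer membership, and a parallel counter array indexed by the same hash positions encodes the frequency, read back with a count-min-style minimum. Sizes, hashing and the class name are invented.

```python
import hashlib

class CoupledKmerFilter:
    def __init__(self, m=1 << 20, num_hashes=7):
        self.m, self.k = m, num_hashes
        self.present = bytearray(m // 8 + 1)      # k-mer representation bits
        self.counts = [0] * m                     # frequency encoding

    def _positions(self, kmer):
        for i in range(self.k):
            h = hashlib.blake2b(kmer.encode(), digest_size=8,
                                salt=str(i).encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, kmer, freq):
        for p in self._positions(kmer):
            self.present[p // 8] |= 1 << (p % 8)
            self.counts[p] = max(self.counts[p], freq)

    def query(self, kmer):
        freqs = []
        for p in self._positions(kmer):
            if not (self.present[p // 8] >> (p % 8)) & 1:
                return 0                          # definitely absent
            freqs.append(self.counts[p])
        return min(freqs)                         # count-min style read

f = CoupledKmerFilter()
f.add("ACGTACGTACGTACGTACGTACGTACGTACG", 12)
print(f.query("ACGTACGTACGTACGTACGTACGTACGTACG"))  # 12 (small FP risk)
```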
Subjects
Algorithms, Software, Sequence Alignment, DNA Sequence Analysis
ABSTRACT
Accurate risk assignment in childhood acute lymphoblastic leukaemia is essential to avoid under- or over-treatment. We hypothesized that time-series gene expression profiles (GEPs) of bone marrow samples during remission-induction therapy can measure the response and be used for relapse prediction. We computed the time-series changes from diagnosis to Day 8 of remission-induction, termed the Effective Response Metric (ERM-D8), and tested its ability to predict relapse against contemporary risk assignment methods, including National Cancer Institute (NCI) criteria, genetics and minimal residual disease (MRD). ERM-D8 was trained on a set of 131 patients and validated on an independent set of 79 patients. In the independent blinded test set, unfavourable ERM-D8 patients had a >3-fold increased risk of relapse compared to favourable ERM-D8 patients (5-year cumulative incidence of relapse 38.1% vs. 10.6%; P = 2.5 × 10⁻³). ERM-D8 remained predictive of relapse [P = 0.05; hazard ratio 4.09, 95% confidence interval (CI) 1.03-16.23] after adjusting for NCI criteria, genetics, Day 8 peripheral response and Day 33 MRD. ERM-D8 improved risk stratification in favourable genetics subgroups (P = 0.01) and Day 33 MRD-positive patients (P = 1.7 × 10⁻³). We conclude that our novel metric, ERM-D8, based on time-series GEPs after 8 days of remission-induction therapy, can independently predict relapse even after adjusting for NCI risk, genetics, Day 8 peripheral blood response and MRD.
Subjects
Gene Expression Profiling, Leukemic Gene Expression Regulation, Precursor Cell Lymphoblastic Leukemia-Lymphoma/blood, Precursor Cell Lymphoblastic Leukemia-Lymphoma/mortality, Child, Preschool Child, Disease-Free Survival, Female, Humans, Infant, Male, Precursor Cell Lymphoblastic Leukemia-Lymphoma/genetics, Predictive Value of Tests, Recurrence, Risk Assessment, Survival Rate
ABSTRACT
MOTIVATION: The rapidly increasing number of genomes generated by high-throughput sequencing platforms and assembly algorithms is accompanied by problems in data storage, compression and communication. Traditional compression algorithms are unable to meet the demand for high compression ratios due to the intrinsically challenging features of DNA sequences, such as a small alphabet size, frequent repeats and palindromes. Reference-based lossless compression, in which only the differences between two similar genomes are stored, is a promising approach with high compression ratios. RESULTS: We present a high-performance referential genome compression algorithm named HiRGC. It is based on a 2-bit encoding scheme and an advanced greedy-matching search on a hash table. We compare the performance of HiRGC with four state-of-the-art compression methods on a benchmark dataset of eight human genomes. HiRGC takes <30 min to compress about 21 gigabytes of each set of the seven target genomes into 96-260 megabytes, achieving compression ratios of 82 to 217 times. This performance is at least 1.9 times better than that of the best competing algorithm in its best case. Our compression speed is also at least 2.9 times faster. HiRGC is stable and robust in dealing with different reference genomes. In contrast, the competing methods' performance varies widely on different reference genomes. More experiments on 100 human genomes from the 1000 Genomes Project and on genomes of several other species again demonstrate that HiRGC's performance is consistently excellent. AVAILABILITY AND IMPLEMENTATION: The C++ and Java source code of our algorithm is freely available for academic and non-commercial use. It can be downloaded from https://github.com/yuansliu/HiRGC. CONTACT: jinyan.li@uts.edu.au. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
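For concreteness, here is a minimal sketch of the 2-bit packing that such an encoding scheme rests on: A, C, G and T map to 00, 01, 10 and 11, so four bases fit in one byte. This is a generic illustration, not HiRGC's C++/Java implementation.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack(seq):
    """Pack a DNA string into an integer, two bits per base."""
    bits = 0
    for ch in seq:
        bits = (bits << 2) | CODE[ch]
    return bits, len(seq)

def unpack(bits, n):
    """Recover the DNA string from its 2-bit packed form."""
    out = []
    for shift in range(2 * (n - 1), -1, -2):
        out.append(BASE[(bits >> shift) & 3])
    return "".join(out)

packed, n = pack("GATTACA")
print(unpack(packed, n))   # GATTACA
```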
Subjects
Data Compression/methods, Human Genome, High-Throughput Nucleotide Sequencing/methods, Software, Algorithms, Humans, DNA Sequence Analysis/methods
ABSTRACT
MOTIVATION: Next-generation sequencing platforms have produced huge amounts of sequence data. This is revolutionizing every aspect of genetic and genomic research. However, these sequence datasets contain quite a number of machine-induced errors; e.g. the substitution error rate can be as high as 2.5%. Existing error-correction methods are still far from perfect. In fact, more errors are sometimes introduced than corrected, especially by the prevalent k-mer based methods. Existing methods have also made limited use of on-demand cloud computing. RESULTS: We introduce an error-correction method named MEC, which uses a two-layered MapReduce technique to achieve high correction performance. In the first layer, all the input sequences are mapped to groups to identify candidate erroneous bases in parallel. In the second layer, the erroneous bases at the same position are linked together from all the groups to make statistically reliable corrections. Experiments on real and simulated datasets show that our method remarkably outperforms existing methods. Its per-position error rate is consistently the lowest, and its correction gain is always the highest. AVAILABILITY AND IMPLEMENTATION: The source code is available at bioinformatics.gxu.edu.cn/ngs/mec. CONTACT: wongls@comp.nus.edu.sg or jinyan.li@uts.edu.au. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
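A toy two-layer sketch of the idea, not MEC's actual MapReduce implementation: reads are first mapped into groups by a shared prefix k-mer, then bases at each position are linked across the group and lone dissenting bases are corrected to the group consensus. The grouping key, vote threshold and names are invented.

```python
from collections import Counter, defaultdict

def correct_reads(reads, k=4, min_votes=3):
    groups = defaultdict(list)
    for r in reads:
        groups[r[:k]].append(r)                 # layer 1: map reads to groups
    corrected = []
    for group in groups.values():
        for r in group:
            fixed = list(r)
            for pos in range(len(r)):           # layer 2: per-position vote
                votes = Counter(g[pos] for g in group if len(g) > pos)
                base, n = votes.most_common(1)[0]
                if n >= min_votes and votes[r[pos]] == 1:
                    fixed[pos] = base           # lone dissenter: correct it
            corrected.append("".join(fixed))
    return corrected

reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
print(correct_reads(reads))   # the lone 'T' at position 4 becomes 'A'
```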
Subjects
Algorithms, High-Throughput Nucleotide Sequencing, DNA Sequence Analysis/methods, Base Sequence, Reproducibility of Results
ABSTRACT
Identifying reproducible yet relevant protein features in proteomics data is a major challenge. Analysis at the level of protein complexes can resolve this issue, and we have developed a suite of feature-selection methods collectively referred to as Rank-Based Network Analysis (RBNA). RBNAs differ in their individual statistical test setups but are similar in the sense that they deploy rank-defined weights among the proteins of each sample, a procedure known as gene fuzzy scoring. Currently, no RBNA exists for paired-sample scenarios where both control and test tissues originate from the same source (e.g. the same patient). It is expected that paired tests, when used appropriately, are more powerful than approaches intended for unpaired samples. We report that the class-paired RBNA, PPFSNET, dominates in both simulated and real data scenarios. Moreover, for the first time, we explicitly incorporate batch-effect resistance as an additional evaluation criterion for feature-selection approaches. Batch effects are class-irrelevant variations arising from different handlers or processing times, and they can obfuscate analysis. We demonstrate that PPFSNET and an earlier RBNA, PFSNET, are particularly resistant to batch effects and only select features strongly correlated with class but not batch.
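Gene fuzzy scoring can be sketched as below, with illustrative percentile settings (the published RBNAs use their own thresholds): per sample, top-ranked proteins receive weight 1, a fuzzy zone decays linearly to 0, and the rest contribute nothing.

```python
import numpy as np

def gene_fuzzy_scores(expr, top=0.05, fuzzy=0.10):
    """expr: 1-D expression values for one sample. Returns rank weights."""
    n = len(expr)
    ranks = np.empty(n)
    ranks[np.argsort(-expr)] = np.arange(1, n + 1)   # 1 = highest expression
    q = ranks / n
    # weight 1 for q <= top, linear decay over (top, fuzzy), 0 beyond
    return np.clip((fuzzy - q) / (fuzzy - top), 0.0, 1.0)

sample = np.random.default_rng(3).lognormal(size=1000)
w = gene_fuzzy_scores(sample)
print((w == 1).sum(), ((w > 0) & (w < 1)).sum())   # 50 at weight 1, 49 fuzzy
```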
ABSTRACT
Protein complex-based feature selection (PCBFS) provides unparalleled reproducibility with high phenotypic relevance on proteomics data. Currently, there are five PCBFS paradigms, but not all representative methods have been implemented or made readily available. To allow general users to take advantage of these methods, we developed the R-package NetProt, which provides implementations of representative feature-selection methods. NetProt also provides methods for generating simulated differential data and generating pseudocomplexes for complex-based performance benchmarking. The NetProt open source R package is available for download from https://github.com/gohwils/NetProt/releases/, and online documentation is available at http://rpubs.com/gohwils/204259.
Subjects
Multiprotein Complexes/analysis, Proteomics/methods, Benchmarking, Computational Biology/methods, Humans, Methods, Reproducibility of Results, Software
ABSTRACT
BACKGROUND: In proteomics, batch effects are technical sources of variation that confound proper analysis, preventing effective deployment in clinical and translational research. RESULTS: Using simulated and real data, we demonstrate that existing batch effect-correction methods do not always eradicate all batch effects. Worse still, they may alter data integrity and introduce false positives. Although principal component analysis (PCA) is commonly used for detecting batch effects, the principal components (PCs) themselves may be used as differential features, from which relevant differential proteins may be effectively traced. Batch effects are removable by identifying PCs highly correlated with batch but not class effect. However, neither PC-based nor existing batch effect-correction methods deal well with subtle batch effects, which are difficult to eradicate; both also involve data transformation and/or projection, which is error-prone. To address this, we introduce the concept of batch effect-resistant methods and demonstrate how such methods, incorporating protein complexes, are particularly resistant to batch effects without compromising data integrity. CONCLUSIONS: Protein complex-based analyses are powerful, offering unparalleled differential protein-selection reproducibility and high prediction accuracy. We demonstrate for the first time their innate resistance against batch effects, even subtle ones. As complex-based analyses require no prior data transformation (e.g. batch-effect correction), data integrity is protected. Individual checks on top-ranked protein complexes confirm strong association with phenotype classes and not batch. Therefore, the constituent proteins of these complexes are more likely to be clinically relevant.
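The PC-screening idea described above can be sketched as follows on synthetic data: fit a PCA, then test each component's association with batch and with class (one-way ANOVA here), flagging components that track batch but not class as candidates for removal. The thresholds and the injected batch shift are invented for illustration.

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 200))
batch = np.repeat([0, 1], 30)
cls = np.tile([0, 1], 30)
X[batch == 1] += 1.5                      # inject a pure batch effect

pcs = PCA(n_components=5).fit_transform(X)
for i in range(5):
    p_batch = f_oneway(pcs[batch == 0, i], pcs[batch == 1, i]).pvalue
    p_class = f_oneway(pcs[cls == 0, i], pcs[cls == 1, i]).pvalue
    flag = "batch-only" if p_batch < 0.01 and p_class > 0.05 else ""
    print(f"PC{i + 1}: p_batch={p_batch:.3g} p_class={p_class:.3g} {flag}")
```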