Results 1 - 20 of 2,710

1.
Annu Rev Immunol ; 40: 45-74, 2022 04 26.
Article in English | MEDLINE | ID: mdl-35471840

ABSTRACT

The transformative success of antibodies targeting the PD-1 (programmed death 1)/B7-H1 (B7 homolog 1) pathway (anti-PD therapy) has revolutionized cancer treatment. However, only a fraction of patients with solid tumors and some hematopoietic malignancies respond to anti-PD therapy, and the reasons for failure in the remaining patients are less well understood. By dissecting the mechanisms underlying this resistance, current studies reveal that the tumor microenvironment is a major site of resistance. Furthermore, the resistance mechanisms appear to be highly heterogeneous. Here, we discuss recent human cancer data identifying mechanisms of resistance to anti-PD therapy. We review evidence for immune-based resistance mechanisms such as loss of neoantigens, defects in antigen presentation and interferon signaling, immune inhibitory molecules, and exclusion of T cells. We also review the clinical evidence for emerging mechanisms of resistance, such as alterations in metabolism, microbiota, and epigenetics. Finally, we discuss strategies to overcome anti-PD therapy resistance and emphasize the need to develop additional immunotherapies based on the concept of normalization cancer immunotherapy.


Subject(s)
Neoplasms , Programmed Cell Death 1 Receptor , Animals , B7-H1 Antigen , Humans , Immunotherapy , Neoplasms/drug therapy , Neoplasms/metabolism , T-Lymphocytes , Tumor Microenvironment
2.
Cell ; 179(2): 527-542.e19, 2019 10 03.
Article in English | MEDLINE | ID: mdl-31585086

ABSTRACT

Much of current molecular and cell biology research relies on the ability to purify cell types by fluorescence-activated cell sorting (FACS). FACS typically requires labeling the cell types of interest with antibodies or fluorescent transgenic constructs. However, antibody availability is often limited, and genetic manipulation is labor intensive or, in the case of primary human tissue, impossible. To date, no systematic method exists to enrich for cell types without a priori knowledge of cell-type markers. Here, we propose GateID, a computational method that combines single-cell transcriptomics with FACS index sorting to purify cell types of choice using only native cellular properties such as cell size, granularity, and mitochondrial content. We validate GateID by purifying various cell types from zebrafish kidney marrow and the human pancreas to high purity without resorting to specific antibodies or transgenes.
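The core idea lends itself to a simple illustration. Below is a minimal sketch, not the published GateID implementation: given index-sorted FACS parameters and scRNA-seq cluster labels for the same cells, grid-search axis-aligned rectangular gates that maximize the purity of a target cluster. Function names, step counts, and the minimum-yield threshold are all illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def best_rectangular_gate(params, labels, target, n_steps=15, min_cells=20):
    """params: (n_cells, n_params) index-sort values; labels: cluster ids."""
    best_purity, best_gate = 0.0, None
    for i, j in combinations(range(params.shape[1]), 2):
        xi, xj = params[:, i], params[:, j]
        # Candidate thresholds taken from each parameter's quantiles.
        qi = np.quantile(xi, np.linspace(0, 1, n_steps))
        qj = np.quantile(xj, np.linspace(0, 1, n_steps))
        for lo_i, hi_i in combinations(qi, 2):
            for lo_j, hi_j in combinations(qj, 2):
                inside = (xi >= lo_i) & (xi <= hi_i) & (xj >= lo_j) & (xj <= hi_j)
                if inside.sum() < min_cells:  # require a workable yield
                    continue
                purity = (labels[inside] == target).mean()
                if purity > best_purity:
                    best_purity, best_gate = purity, (i, j, lo_i, hi_i, lo_j, hi_j)
    return best_purity, best_gate
```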


Subject(s)
Cell Separation/methods , Flow Cytometry/methods , Software , Transcriptome , Animals , Humans , Kidney/cytology , Pancreas/cytology , Single-Cell Analysis , Zebrafish/anatomy & histology
3.
Cell ; 175(2): 313-326, 2018 10 04.
Article in English | MEDLINE | ID: mdl-30290139

ABSTRACT

Harnessing an antitumor immune response has been a fundamental strategy in cancer immunotherapy. For over a century, efforts have primarily focused on amplifying the immune activation mechanisms that humans employ to eliminate invaders such as viruses and bacteria. This "immune enhancement" strategy often results in rare objective responses and frequent immune-related adverse events (irAEs). In the last decade, however, cancer immunotherapies targeting the B7-H1/PD-1 pathway (anti-PD therapy) have achieved higher objective response rates with far fewer irAEs. This more favorable response-to-toxicity profile stems from distinct mechanisms of action that restore tumor-induced immune deficiency selectively in the tumor microenvironment, here termed "immune normalization," and it has led to FDA approval of anti-PD therapy in more than 10 cancer indications and facilitated its combination with different therapies. In this article, we highlight the principles of immune normalization, with the ultimate goal of guiding better designs for future cancer immunotherapies.


Subject(s)
Immunotherapy/methods , Neoplasms/immunology , Neoplasms/therapy , B7-H1 Antigen/drug effects , B7-H1 Antigen/immunology , CTLA-4 Antigen/immunology , Combined Modality Therapy/methods , Humans , Immunotherapy/trends , Programmed Cell Death 1 Receptor/drug effects , Programmed Cell Death 1 Receptor/immunology , Tumor Microenvironment/drug effects
4.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38279652

ABSTRACT

Cleavage Under Targets and Release Using Nuclease (CUT&RUN) is a recent development for epigenome mapping, but its unique methodology can hamper proper quantitative analyses. As traditional normalization approaches have been shown to be inaccurate, we sought to derive endogenous normalization factors from regions of the human genome with constant nonspecific signal. This constancy was assessed by applying Shannon's information entropy, and the resulting set of normalizer regions, which we named the "Greenlist," was extensively validated using publicly available datasets. We demonstrate here that Greenlist normalization outperforms the current top standards and remains consistent across different experimental setups, cell lines, and antibodies; the approach can even be applied to different species or to CUT&Tag. Requiring no additional experimental steps and no added cost, this approach can be universally applied to CUT&RUN experiments to greatly minimize the interference of technical variation with the biological epigenome changes of interest.
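As a rough sketch of the entropy-based selection described here (assumptions, not the published Greenlist code): bin the genome, compute each bin's signal share across many datasets, and keep the bins whose across-dataset distribution is closest to uniform, as scored by Shannon entropy. The bin size, top fraction, and matrix layout are illustrative.

```python
import numpy as np

def greenlist_bins(counts, top_fraction=0.01):
    """counts: (n_bins, n_datasets) read counts in fixed genomic bins."""
    # Each dataset contributes a per-bin signal share.
    prop = counts / counts.sum(axis=0, keepdims=True)
    # Per-bin distribution across datasets; Shannon entropy is maximal
    # when a bin's relative signal is identical in every dataset.
    p = prop / prop.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        entropy = -np.nansum(p * np.log(p), axis=1)
    n_keep = max(1, int(top_fraction * len(entropy)))
    return np.argsort(entropy)[-n_keep:]  # most "constant" bins

# Per-dataset scale factors would then be the total signal falling
# inside the selected bins:
# size_factors = counts[greenlist_bins(counts)].sum(axis=0)
```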


Subject(s)
Epigenome , Genomics , Humans , Genome
5.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38770720

ABSTRACT

The normalization of RNA sequencing data is a primary step for downstream analysis. The most popular normalization methods are the trimmed mean of M-values (TMM) and DESeq. TMM trims away extreme log fold changes and normalizes the raw read counts based on the remaining, non-differentially expressed genes. The major problem with TMM, however, is that the trimming thresholds applied to the M-values are heuristic. This paper estimates an adaptive trimming value for TMM based on Jaeckel's estimator, with each sample in turn acting as the reference when computing the scale factor of every other sample. The presented approach is validated on SEQC, MAQC2, MAQC3, PICKRELL, and two simulated datasets with two-group and three-group conditions, varying the percentage of differential expression and the number of replicates. Compared with various state-of-the-art methods, the approach performs better in terms of area under the receiver operating characteristic curve and differential expression detection.
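To make the quantity being adapted concrete, here is a simplified TMM-style scale factor with fixed trim fractions and an unweighted mean (the standard method uses precision weights, and the paper replaces the fixed trim with a Jaeckel-estimator-based adaptive choice):

```python
import numpy as np

def tmm_factor(sample, ref, trim_m=0.30, trim_a=0.05):
    """Scale factor for `sample` against `ref` (raw count vectors)."""
    keep = (sample > 0) & (ref > 0)
    s = sample[keep] / sample.sum()
    r = ref[keep] / ref.sum()
    M = np.log2(s / r)            # gene-wise log fold change
    A = 0.5 * np.log2(s * r)      # gene-wise average abundance
    # The heuristic step: fixed trim fractions on M and A.
    lo_m, hi_m = np.quantile(M, [trim_m, 1 - trim_m])
    lo_a, hi_a = np.quantile(A, [trim_a, 1 - trim_a])
    trimmed = (M > lo_m) & (M < hi_m) & (A > lo_a) & (A < hi_a)
    return 2.0 ** M[trimmed].mean()
```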


Subject(s)
RNA-Seq , RNA-Seq/methods , Humans , Algorithms , Sequence Analysis, RNA/methods , Computational Biology/methods , Gene Expression Profiling/methods , ROC Curve , Software
6.
Mol Cell Proteomics ; 23(5): 100768, 2024 May.
Article in English | MEDLINE | ID: mdl-38621647

ABSTRACT

Mass spectrometry (MS)-based single-cell proteomics (SCP) provides an opportunity to explore biological variability within cells without being limited by antibody availability. The field is developing rapidly, with a focus on instrument advancement, sample preparation refinement, and signal boosting methods; however, optimal data processing and analysis are rarely investigated, and they pose an arduous challenge because of the high proportion of missing values and batch effects. Here, we introduce a quantification quality control step that strengthens the identification of differentially expressed proteins (DEPs) by considering variation both within and across SCP datasets. Combining quantification quality control with isobaric matching between runs (IMBR) and PSM-level normalization, an additional 12% of proteins and 19% of peptides were quantified, with more than 90% of proteins/peptides containing valid values. Quantification quality control reduced quantification variation and q-values and yielded more apparent cell type separation. In addition, we found that PSM-level normalization performed similarly to protein-level normalizations but kept the original data profile without requiring additional data manipulation. As proof of concept for our refined pipeline, six uniquely identified DEPs exhibiting varied fold changes and playing critical roles in melanoma and monocyte function were selected for validation by immunoblotting. Five of the six validated DEPs showed trends identical to the SCP dataset, emphasizing the feasibility of combining IMBR, cell quality control, and PSM-level normalization in SCP analysis, which will benefit future SCP studies.
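A minimal sketch of what PSM-level normalization can look like, assuming median centering of log intensities per run (the centering statistic and matrix layout are assumptions, not the paper's exact procedure):

```python
import numpy as np

def psm_level_normalize(log_intensity):
    """log_intensity: (n_psms, n_runs) matrix, NaN for missing values."""
    run_median = np.nanmedian(log_intensity, axis=0, keepdims=True)
    grand_median = np.nanmedian(log_intensity)
    # Shift every run to a common center; protein-level summaries are
    # computed afterwards from these adjusted PSMs, so the original
    # data profile is preserved.
    return log_intensity - run_median + grand_median
```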


Subject(s)
Proteomics , Quality Control , Single-Cell Analysis , Single-Cell Analysis/methods , Proteomics/methods , Humans , Mass Spectrometry/methods , Data Analysis , Proteome/metabolism
7.
J Neurosci ; 44(24)2024 Jun 12.
Article in English | MEDLINE | ID: mdl-38641406

ABSTRACT

Faces and bodies are processed in separate but adjacent regions in the primate visual cortex. Yet the functional significance of dividing the whole person into areas dedicated to its face and body components, and of their neighboring locations, remains unknown. Here we hypothesized that this separation and proximity, together with a normalization mechanism, generate clutter-tolerant representations of the face, body, and whole person when presented in complex multi-category scenes. To test this hypothesis, we conducted an fMRI study in which we presented images of a person within a multi-category scene to human male and female participants and assessed the contribution of each component to the response to the scene. Our results revealed a clutter-tolerant representation of the whole person in areas selective for both faces and bodies, typically located at the border between the two category-selective regions. Regions exclusively selective for faces or bodies demonstrated clutter-tolerant representations of their preferred category, corroborating earlier findings. Thus, the adjacent locations of face- and body-selective areas provide hardwired machinery for decluttering the whole person, without the need for a dedicated population of person-selective neurons. This distinct yet proximal functional organization of category-selective brain regions enhances the representation of the socially significant whole person, along with its face and body components, within multi-category scenes.


Subject(s)
Facial Recognition , Magnetic Resonance Imaging , Humans , Male , Female , Adult , Young Adult , Facial Recognition/physiology , Brain Mapping , Pattern Recognition, Visual/physiology , Photic Stimulation/methods , Visual Cortex/physiology , Visual Cortex/diagnostic imaging , Brain/physiology , Brain/diagnostic imaging
8.
Plant J ; 118(5): 1241-1257, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38289828

ABSTRACT

RNA sequencing is widely used to investigate changes in gene expression at the transcription level in plants. Most plant RNA-seq analysis pipelines base their normalization on the assumption that total transcript levels do not vary between samples. However, this assumption has not been demonstrated. In fact, many common experimental treatments and genetic alterations affect transcription efficiency or RNA stability, resulting in unequal transcript abundance. Adding synthetic RNA controls is a simple way to correct for variation in total mRNA levels. However, adding spike-ins appropriately is challenging with complex plant tissue, and carefully considering how they are added is essential to their successful use. We demonstrate that using external RNA spike-ins as a normalization control produces results that differ from traditional normalization methods, even between two times of day in untreated plants. We illustrate the use of RNA spike-ins with 3' RNA-seq and present a normalization pipeline that accounts for differences in total transcript levels. We evaluate the effect of normalization methods on the identification of differentially expressed genes in the context of the time of day and the response to chilling stress in sorghum.
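The arithmetic behind spike-in normalization is compact enough to show. A minimal sketch (illustrative, not the paper's pipeline): scale each sample by its recovered synthetic-control counts instead of assuming equal totals, centering the factors on their geometric mean so they multiply to roughly one.

```python
import numpy as np

def spikein_size_factors(counts, is_spike):
    """counts: (n_genes, n_samples); is_spike: boolean mask over genes."""
    spike_totals = counts[is_spike].sum(axis=0).astype(float)
    # Center on the geometric mean so factors multiply to ~1.
    return spike_totals / np.exp(np.log(spike_totals).mean())

# counts_norm = counts[~is_spike] / spikein_size_factors(counts, is_spike)
```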


Subject(s)
Gene Expression Regulation, Plant , RNA, Plant , RNA, Plant/genetics , Sequence Analysis, RNA/methods , Gene Expression Profiling/methods , RNA, Messenger/genetics , RNA, Messenger/metabolism , Arabidopsis/genetics
9.
Am J Hum Genet ; 109(11): 1974-1985, 2022 11 03.
Article in English | MEDLINE | ID: mdl-36206757

ABSTRACT

The analysis of single-cell RNA-sequencing (scRNA-seq) data almost always begins with generating a low-dimensional embedding of the data by principal-component analysis (PCA). Because scRNA-seq data are count data, log transformation is routinely applied to correct skewness prior to PCA, a step that is often argued to introduce bias. Alternatively, studies have proposed methods that directly assume a count model and use approximately normally distributed count residuals for PCA. Despite the theoretical advantage of directly modeling count data, these methods are extremely slow for large datasets; indeed, as data size grows, even standard log normalization becomes inefficient. Here, we present FastRNA, a highly efficient solution for PCA of scRNA-seq data based on a count model accounting for both batches and cell size factors. Although we assume the same general count model as previous methods, our method uses two orders of magnitude less time and memory than the other count-based methods and an order of magnitude less time and memory than standard log normalization. This is achieved by a unique algebraic optimization that completely avoids forming the large dense residual matrix in memory. Our method has the added benefit that batch effects are eliminated from the data prior to PCA. Generating batch-accounted PCs for an atlas-scale dataset of 2 million cells takes less than a minute and 1 GB of memory with our method.
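The key trick, avoiding the dense centered matrix, is a general one and can be sketched with a matrix-free SVD. This is generic implicit-centering PCA under the stated idea, not FastRNA's full batch-and-size-factor model: the centering is folded into the matrix-vector products, so only the sparse matrix and a mean vector are ever held in memory.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

def centered_pca(X, k=50):
    """X: (n_cells, n_genes) sparse count/residual matrix; top-k scores."""
    n, m = X.shape
    mu = np.asarray(X.mean(axis=0)).ravel()   # column means

    def mv(v):                                # (X - 1 mu^T) @ v
        v = np.ravel(v)
        return X @ v - (mu @ v) * np.ones(n)

    def rmv(u):                               # (X - 1 mu^T).T @ u
        u = np.ravel(u)
        return X.T @ u - mu * u.sum()

    op = LinearOperator((n, m), matvec=mv, rmatvec=rmv, dtype=np.float64)
    U, s, _ = svds(op, k=k)
    order = np.argsort(s)[::-1]
    return U[:, order] * s[order]             # cell embeddings
```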


Subject(s)
RNA , Single-Cell Analysis , Humans , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Principal Component Analysis , Exome Sequencing , Gene Expression Profiling
10.
Brief Bioinform ; 24(2)2023 03 19.
Article in English | MEDLINE | ID: mdl-36759336

ABSTRACT

Chromatin interaction assays, particularly Hi-C, enable detailed studies of genome architecture in multiple organisms and model systems, resulting in a deeper understanding of the epigenetic mechanisms regulating gene expression. However, the analysis and interpretation of Hi-C data remain challenging due to technical biases, limiting direct comparisons of datasets obtained in different experiments and laboratories. As a result, removing biases from Hi-C-generated chromatin contact matrices is a critical data analysis step. Our novel approach, HiConfidence, eliminates biases from Hi-C data by weighting chromatin contacts according to their consistency between replicates, so that low-quality replicates do not substantially influence the result. The algorithm is effective for the analysis of global changes in chromatin structures such as compartments and topologically associating domains. We apply HiConfidence to several Hi-C datasets with significant technical biases that could not be analyzed effectively using existing methods and obtain meaningful biological conclusions. In particular, HiConfidence aids in the study of how changes in histone acetylation patterns affect chromatin organization in Drosophila melanogaster S2 cells. The method is freely available at GitHub: https://github.com/victorykobets/HiConfidence.
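One way to picture replicate-consistency weighting (an illustrative formula in the spirit described; the actual HiConfidence weighting scheme may differ): down-weight contact-matrix pixels whose replicate values disagree before combining replicates.

```python
import numpy as np

def consistency_weighted(rep1, rep2, eps=1e-9):
    """rep1, rep2: Hi-C contact matrices from two replicates."""
    total = rep1 + rep2 + eps
    # Weight is 1 where replicates agree exactly and approaches 0 where
    # one replicate carries nearly all of a pixel's signal.
    weight = 1.0 - np.abs(rep1 - rep2) / total
    return weight * (rep1 + rep2) / 2.0, weight
```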


Subject(s)
Drosophila melanogaster , Genome , Animals , Drosophila melanogaster/genetics , Chromatin/genetics , Chromosomes , Bias
11.
Brief Bioinform ; 24(5)2023 09 20.
Article in English | MEDLINE | ID: mdl-37598422

ABSTRACT

The advent of single-cell RNA sequencing (scRNA-seq) technologies has enabled gene expression profiling at single-cell resolution, making it possible to quantify and compare transcriptional variability among individual cells. Although alterations in transcriptional variability have been observed in various biological states, statistical methods for quantifying and testing differential variability between groups of cells are still lacking. To identify best practices in differential variability analysis of single-cell gene expression data, we propose and compare 12 statistical pipelines using different combinations of methods for normalization, feature selection, dimensionality reduction, and variability calculation. Using high-quality synthetic scRNA-seq datasets, we benchmarked the proposed pipelines and found that the most powerful and accurate pipeline performs simple library size normalization, retains all genes in the analysis, and uses denSNE-based distances to cluster medoids as the variability measure. By applying this pipeline to scRNA-seq datasets of COVID-19 and autism patients, we identified changes in cellular variability between patients with different severity status and between patients and healthy controls.
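The shape of the winning pipeline can be sketched simply. In this sketch, plain Euclidean distance to the medoid stands in for the paper's denSNE-based distances; both steps shown (library size normalization, distance-to-medoid variability) are simplified assumptions.

```python
import numpy as np

def library_size_normalize(counts):
    """counts: (n_cells, n_genes); scale every cell to the median depth."""
    depth = counts.sum(axis=1, keepdims=True)
    return counts / depth * np.median(depth)

def group_variability(X):
    """Mean distance of a group's cells to the group medoid."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoid = np.argmin(d.sum(axis=1))  # cell minimizing total distance
    return d[medoid].mean()
```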


Subject(s)
COVID-19 , Humans , COVID-19/genetics , Gene Expression Profiling/methods , Gene Expression , Sequence Analysis, RNA/methods , Cluster Analysis
12.
Methods ; 230: 91-98, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39097179

ABSTRACT

DNA N6-methyladenine (6mA) plays an important role in many biological processes, and accurately identifying 6mA sites helps us understand its biological effects more comprehensively. Traditional experimental methods are very labor-intensive, and traditional machine learning methods fall short as 6mA databases grow progressively larger, so we propose a deep learning method, a multi-scale convolutional model based on global response normalization (CG6mA), for the 6mA site prediction problem. Tested against other methods on three different kinds of benchmark datasets, our model delivers better prediction results.
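For readers unfamiliar with global response normalization (GRN), here is a minimal numpy sketch of the layer as popularized by ConvNeXt V2, which is assumed here to be the flavor CG6mA builds on: channels compete through their global L2 energies, with a residual connection preserved.

```python
import numpy as np

def grn(x, gamma=1.0, beta=0.0, eps=1e-6):
    """x: (batch, length, channels) feature map from a 1D conv stack."""
    gx = np.linalg.norm(x, axis=1, keepdims=True)      # global L2 per channel
    nx = gx / (gx.mean(axis=-1, keepdims=True) + eps)  # divisive competition
    return gamma * (x * nx) + beta + x                 # residual preserved
```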


Subject(s)
Adenosine , DNA Methylation , Deep Learning , Adenosine/analogs & derivatives , Adenosine/chemistry , Adenosine/genetics , Humans , DNA/chemistry , DNA/genetics , Computational Biology/methods
13.
Methods ; 222: 1-9, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38128706

ABSTRACT

The development of single-cell RNA sequencing (scRNA-seq) has provided new perspectives for studying biological problems at the single-cell level. One of the key issues in scRNA-seq data analysis is dividing cells into clusters to discover the heterogeneity and diversity of cells. However, scRNA-seq data are high-dimensional, sparse, and noisy, which challenges existing single-cell clustering methods. In this study, we propose a joint learning framework (JLONMFSC) for clustering scRNA-seq data. In our method, the dimension of the original data is reduced to minimize the effect of noise. In addition, graph-regularized matrix factorization is used to learn local features, and low-rank representation (LRR) subspace clustering is used to learn global features. Finally, joint learning of the local and global features produces the clustering results. We compare the proposed algorithm with eight state-of-the-art algorithms on six datasets, and the experimental results demonstrate that JLONMFSC achieves better performance on all datasets. The code is available at https://github.com/lanbiolab/JLONMFSC.


Subject(s)
Gene Expression Profiling , Single-Cell Gene Expression Analysis , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Algorithms , Cluster Analysis
14.
Proc Natl Acad Sci U S A ; 119(40): e2120581119, 2022 10 04.
Article in English | MEDLINE | ID: mdl-36161961

ABSTRACT

Divisive normalization is a canonical computation in the brain, observed across neural systems, that is often considered to be an implementation of the efficient coding principle. We provide a theoretical result that makes the conditions under which divisive normalization is an efficient code analytically precise: We show that, in a low-noise regime, encoding an n-dimensional stimulus via divisive normalization is efficient if and only if its prevalence in the environment is described by a multivariate Pareto distribution. We generalize this multivariate analog of histogram equalization to allow for arbitrary metabolic costs of the representation, and show how different assumptions on costs are associated with different shapes of the distributions that divisive normalization efficiently encodes. Our result suggests that divisive normalization may have evolved to efficiently represent stimuli with Pareto distributions. We demonstrate that this efficiently encoded distribution is consistent with stylized features of naturalistic stimulus distributions such as their characteristic conditional variance dependence, and we provide empirical evidence suggesting that it may capture the statistics of filter responses to naturalistic images. Our theoretical finding also yields empirically testable predictions across sensory domains on how the divisive normalization parameters should be tuned to features of the input distribution.
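For reference, the canonical divisive normalization computation and one standard multivariate Pareto (Lomax) parameterization are shown below; the paper's exact cost-dependent family and exponents may differ in detail, so treat these as representative forms rather than the authors' equations.

```latex
\[
  y_i \;=\; \frac{x_i}{\sigma + \sum_{j=1}^{n} x_j},
  \qquad
  \Pr\left(X_1 > x_1, \ldots, X_n > x_n\right)
  \;=\; \left(1 + \sum_{i=1}^{n} \frac{x_i}{\theta_i}\right)^{-\alpha}.
\]
```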


Subject(s)
Brain , Models, Neurological , Neurons , Brain/physiology , Neurons/physiology
15.
BMC Bioinformatics ; 25(1): 221, 2024 Jun 20.
Article in English | MEDLINE | ID: mdl-38902629

ABSTRACT

BACKGROUND: Extracellular vesicle-derived miRNAs (EV-miRNAs) have the potential to serve as biomarkers for the diagnosis of various diseases. miRNA microarrays are widely used to quantify circulating EV-miRNA levels, and the preprocessing of miRNA microarray data is critical for analytical accuracy and reliability. Although microarray data have been used in various studies, the effects of preprocessing have not been studied for Toray's 3D-Gene chip, a widely used measurement platform. We aimed to evaluate batch effects, missing value imputation accuracy, and the influence of preprocessing on measured values across 18 different preprocessing pipelines for EV-miRNA microarray data from two amyotrophic lateral sclerosis cohorts profiled with 3D-Gene technology. RESULTS: Eighteen pipelines with different types and orders of missing value imputation and normalization were used to preprocess the 3D-Gene microarray EV-miRNA data. Notably, batch effects were suppressed in all pipelines that used the batch effect correction method ComBat. Furthermore, pipelines using missForest for missing value imputation showed high agreement with measured values, whereas imputing missing data with constant values showed low agreement. CONCLUSIONS: This study highlights the importance of selecting an appropriate preprocessing strategy for EV-miRNA microarray data generated with 3D-Gene technology. These findings emphasize the importance of validating preprocessing approaches, particularly batch effect correction and missing value imputation, for reliable data analysis in biomarker discovery and disease research.
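A minimal sketch of one pipeline shape the study evaluates (imputation, then normalization, then batch correction). KNN imputation stands in for missForest and per-batch mean alignment stands in for ComBat; both substitutions are deliberate simplifications, labeled as such.

```python
import numpy as np
from sklearn.impute import KNNImputer

def preprocess(expr, batch):
    """expr: (n_samples, n_mirnas) log signal with NaNs; batch: labels."""
    imputed = KNNImputer(n_neighbors=5).fit_transform(expr)
    # Normalize: center each sample at its median signal.
    centered = imputed - np.median(imputed, axis=1, keepdims=True)
    # Crude batch adjustment: align each batch's feature-wise means
    # with the global means (a stand-in for ComBat).
    for b in np.unique(batch):
        rows = batch == b
        centered[rows] -= centered[rows].mean(axis=0) - centered.mean(axis=0)
    return centered
```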


Subject(s)
Extracellular Vesicles , MicroRNAs , Oligonucleotide Array Sequence Analysis , Extracellular Vesicles/metabolism , Extracellular Vesicles/genetics , MicroRNAs/genetics , MicroRNAs/metabolism , Humans , Oligonucleotide Array Sequence Analysis/methods , Amyotrophic Lateral Sclerosis/genetics , Amyotrophic Lateral Sclerosis/metabolism , Gene Expression Profiling/methods
16.
BMC Bioinformatics ; 25(1): 19, 2024 Jan 12.
Article in English | MEDLINE | ID: mdl-38216877

ABSTRACT

In experiments with significant perturbations to transcription, nascent RNA sequencing protocols depend on external spike-ins for reliable normalization. Unlike in RNA-seq, these spike-ins are not standardized and, in many cases, depend on a run-on reaction that is assumed to have constant efficiency across samples. To assess the validity of this assumption, we analyze a large number of published nascent RNA spike-ins to quantify their variability across existing normalization methods. Furthermore, we develop a new biologically informed Bayesian model, which we term Virtual Spike-In (VSI), to estimate the error in spike-in based normalization estimates. We apply this method both to published external spike-ins and to reads at the 3' end of long genes, building on prior work from Mahat (Mol Cell 62(1):63-78, 2016. https://doi.org/10.1016/j.molcel.2016.02.025) and Vihervaara (Nat Commun 8(1):255, 2017. https://doi.org/10.1038/s41467-017-00151-0). We find that spike-ins in existing nascent RNA experiments are typically undersequenced, with high variability between samples. Furthermore, we show that this high variability can have significant downstream effects on analysis, complicating the biological interpretation of results.
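To see why undersequenced spike-ins translate into unstable normalization, a Poisson parametric bootstrap (a simple stand-in for the paper's Bayesian VSI model; the function and its inputs are illustrative) shows how wide the size-factor interval becomes when spike-in counts are low:

```python
import numpy as np

def spikein_factor_interval(spike_counts, ref_counts, n_boot=2000, seed=0):
    """Counting-noise interval for the size factor sample / reference."""
    rng = np.random.default_rng(seed)
    # Resample each spike-in species' count under Poisson noise.
    draws = rng.poisson(spike_counts, size=(n_boot, spike_counts.size))
    ref_draws = rng.poisson(ref_counts, size=(n_boot, ref_counts.size))
    factors = draws.sum(axis=1) / ref_draws.sum(axis=1)
    return np.quantile(factors, [0.025, 0.5, 0.975])
```

With totals in the tens of reads the 95% interval spans tens of percent, while deeply sequenced spike-ins give a tight interval, which is the undersequencing effect the abstract describes.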


Subject(s)
RNA , RNA/genetics , Bayes Theorem , Sequence Analysis, RNA , RNA-Seq
17.
BMC Bioinformatics ; 25(1): 304, 2024 Sep 16.
Article in English | MEDLINE | ID: mdl-39285319

ABSTRACT

BACKGROUND: In high-throughput sequencing studies, sequencing depth, which quantifies the total number of reads, varies across samples. Unequal sequencing depth can obscure true biological signals of interest and prevent direct comparisons between samples. To remove variability due to differential sequencing depth, taxa counts are usually normalized before downstream analysis. However, most existing normalization methods scale counts using size factors that are sample specific but not taxa specific, which can result in over- or under-correction for some taxa. RESULTS: We developed TaxaNorm, a novel normalization method based on a zero-inflated negative binomial model. This method assumes that the effects of sequencing depth on mean and dispersion vary across taxa. Incorporating the zero-inflation component better captures the nature of microbiome data. We also propose two corresponding diagnostic tests on the varying sequencing depth effect for validation. We find that TaxaNorm achieves performance comparable to existing methods in most simulated downstream-analysis scenarios and reaches higher power in some cases; specifically, it balances power and false discovery control well. When applied to a real dataset, TaxaNorm shows improved performance in correcting technical bias. CONCLUSION: TaxaNorm addresses both sample- and taxon-specific bias by introducing an appropriate regression framework for microbiome data, which aids data interpretation and visualization. The 'TaxaNorm' R package is freely available through the CRAN repository https://CRAN.R-project.org/package=TaxaNorm and the source code can be downloaded at https://github.com/wangziyue57/TaxaNorm .
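The core regression idea, a taxon-specific depth effect, can be sketched with a plain negative binomial GLM; the published model additionally includes zero inflation and depth-dependent dispersion, which this simplification omits:

```python
import numpy as np
import statsmodels.api as sm

def taxon_normalize(y, depth):
    """y: one taxon's counts across samples; depth: total reads/sample."""
    X = sm.add_constant(np.log(depth))
    fit = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=1.0)).fit()
    mu = fit.fittedvalues          # expected count given this sample's depth
    return y / mu                  # taxon-specific normalized abundance
```

Because the slope on log depth is estimated per taxon, two taxa in the same sample can receive different effective size factors, which is exactly what sample-level scaling cannot do.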


Subject(s)
High-Throughput Nucleotide Sequencing , Microbiota , Microbiota/genetics , High-Throughput Nucleotide Sequencing/methods , Humans , Algorithms
18.
BMC Bioinformatics ; 25(1): 136, 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38549046

ABSTRACT

BACKGROUND: Cross-platform normalization seeks to minimize technological bias between microarray and RNA-seq whole-transcriptome data. Incorporating multiple gene expression platforms permits external validation of experimental findings and augments training sets for machine learning models. Here, we compare the performance of Feature Specific Quantile Normalization (FSQN) to a previously used but unvalidated and uncharacterized method we label Feature Specific Mean Variance Normalization (FSMVN). We evaluate the performance of these methods for bidirectional normalization in the context of nested feature selection. RESULTS: FSQN and FSMVN provided clinically equivalent bidirectional model performance with and without feature selection for colon CMS and breast PAM50 classification. Using principal component analysis, we determined that these methods eliminate batch effects related to technological platform. Without feature selection, no statistical difference was identified between the performance of FSQN and FSMVN on cross-platform data compared to within-platform distributions. Under optimal feature selection conditions, the balanced accuracy of FSQN and FSMVN was statistically equivalent to within-platform performance in multivariable linear regression analysis. FSQN and FSMVN also provided performance similar to within-platform distributions as the number of selected genes used to create models decreased. CONCLUSIONS: In the context of generating supervised machine learning classifiers for molecular subtypes, FSQN and FSMVN are equally effective. Under optimal modeling conditions, both provide equivalent model accuracy on cross-platform normalized data compared to within-platform data. Cross-platform data should still be approached with caution, as subtle performance differences may exist depending on the classification problem and the training and testing distributions.
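FSQN itself is simple enough to sketch: for each gene independently, map the target platform's values onto the reference platform's empirical distribution for that same gene. This is a generic per-feature quantile mapping consistent with the method's name, not the authors' code.

```python
import numpy as np

def fsqn(target, reference):
    """target, reference: (n_samples, n_genes) expression matrices."""
    out = np.empty_like(target, dtype=float)
    n = target.shape[0]
    grid = (np.arange(n) + 0.5) / n                      # target quantiles
    for j in range(target.shape[1]):
        ranks = target[:, j].argsort().argsort()         # 0..n-1 ranks
        ref_sorted = np.sort(reference[:, j])
        ref_grid = (np.arange(ref_sorted.size) + 0.5) / ref_sorted.size
        # Each target sample inherits the reference value at its quantile.
        out[:, j] = np.interp(grid[ranks], ref_grid, ref_sorted)
    return out
```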


Subject(s)
Gene Expression Profiling , Transcriptome , Gene Expression Profiling/methods , Microarray Analysis , Linear Models
19.
BMC Bioinformatics ; 25(1): 181, 2024 May 08.
Article in English | MEDLINE | ID: mdl-38720247

ABSTRACT

BACKGROUND: RNA sequencing combined with machine learning techniques has provided a modern approach to the molecular classification of cancer. Class predictors, reflecting the disease class, can be constructed for known tissue types using the gene expression measurements extracted from cancer patients. One challenge for current cancer predictors is that they often have suboptimal performance estimates when integrating molecular datasets generated in different labs. Often the data are of variable quality, procured differently, and contaminated with unwanted noise that hampers a predictive model's ability to extract useful information. Data preprocessing methods can be applied to reduce these systematic variations and harmonize the datasets before they are used to build a machine learning model for resolving tissue of origin. RESULTS: We investigated the impact of data preprocessing steps (focusing on normalization, batch effect correction, and data scaling) through trial and comparison. Our goal was to improve the cross-study prediction of tissue of origin for common cancers on large-scale RNA-seq datasets derived from thousands of patients and over a dozen tumor types. The results showed that the choice of data preprocessing operations affected the performance of the classifier models constructed for tissue of origin prediction in cancer. CONCLUSION: Using TCGA as a training set, we demonstrated that batch effect correction improved performance, measured by weighted F1-score, in resolving tissue of origin against an independent GTEx test dataset. On the other hand, data preprocessing worsened classification performance when the independent test dataset was aggregated from separate studies in ICGC and GEO. Therefore, based on our findings with these publicly available large-scale RNA-seq datasets, applying data preprocessing techniques in a machine learning pipeline is not always appropriate.


Subject(s)
Machine Learning , Neoplasms , RNA-Seq , Humans , RNA-Seq/methods , Neoplasms/genetics , Transcriptome/genetics , Sequence Analysis, RNA/methods , Gene Expression Profiling/methods , Computational Biology/methods
20.
J Physiol ; 602(19): 4713-4728, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39234878

ABSTRACT

Physiologists often express the change in the value of a measurement made on two occasions as a ratio of the initial value. This is usually motivated by an assumption that, when there is initial variation among individual cases, the absolute change fails to capture the true extent of the alteration that has occurred in attaining the final value. While it may appear reasonable to use ratios to standardize the magnitude of change in this way, the perils of doing so have been widely documented: ratios frequently have intractable statistical properties, both when taken in isolation and when analysed using techniques such as regression. A new method of computing a standardized metric of change, based on principal components analysis (PCA), is described. It exploits the collinearity within sets of initial, absolute change and final values. When these sets define variables subjected to PCA, the standardized measure of change is obtained as the product of the loading of absolute change onto the first principal component (PC1) and the eigenvalue of PC1. It is demonstrated that a sample drawn from a population of these standardized measures approximates a normal distribution (unlike the corresponding ratios), lies within the same range as the ratios, and preserves their rank order. It is also shown that this method can be used to express the magnitude of a physiological response in an experimental condition relative to that obtained in a control condition. KEY POINTS: The intractable statistical properties of ratios and the perils of using ratios to standardize the magnitude of change are well known. A new method of computing a standardized metric, based on principal components analysis (PCA), is described, which exploits the collinearity within sets of initial, absolute change and final values. A sample drawn from a population of these PCA-derived measures approximates a normal distribution (unlike the corresponding ratios), lies within the same range as the ratios, and preserves their rank order. The method can also be applied to express the magnitude of a physiological response in an experimental condition relative to a control condition.
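The computation as described maps directly onto a few lines of linear algebra. A minimal sketch under stated assumptions (the eigenvector sign convention and the use of the covariance matrix are choices made here, not taken from the paper):

```python
import numpy as np

def standardized_change(initial, final):
    """PC1-based change metric for paired measurements."""
    change = final - initial
    M = np.column_stack([initial, change, final])
    eigvals, eigvecs = np.linalg.eigh(np.cov(M, rowvar=False))
    pc1 = np.argmax(eigvals)
    # Fix the arbitrary eigenvector sign so the final-value loading is
    # positive (a convention assumed here).
    vec = eigvecs[:, pc1] * np.sign(eigvecs[2, pc1])
    return vec[1] * eigvals[pc1]   # change loading x PC1 eigenvalue
```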


Subject(s)
Principal Component Analysis , Principal Component Analysis/methods , Animals , Humans , Physiology/methods , Physiology/standards