Results 1 - 20 of 31
1.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38271483

ABSTRACT

The advent of single-cell sequencing technologies has revolutionized cell biology studies. However, integrative analyses of diverse single-cell data face serious challenges, including technological noise, sample heterogeneity, and different modalities and species. To address these problems, we propose scCorrector, a variational autoencoder-based model that can integrate single-cell data from different studies and map them into a common space. Specifically, we designed a Study Specific Adaptive Normalization for each study in the decoder to implement these features. scCorrector substantially achieves competitive and robust performance compared with state-of-the-art methods and brings novel insights under various circumstances (e.g. various batches, multi-omics, cross-species, and development stages). In addition, the integration of single-cell data and spatial data makes it possible to transfer information between different studies, which greatly expands the narrow range of genes covered by MERFISH technology. In summary, scCorrector can efficiently integrate multi-study single-cell datasets, thereby providing broad opportunities to tackle challenges emerging from noisy resources.

2.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34929734

ABSTRACT

Since their selection as the method of the year in 2013, single-cell technologies have become mature enough to provide answers to complex research questions. With the growth of single-cell profiling technologies, there has also been a significant increase in data collected from single-cell profiling, resulting in computational challenges to process these massive and complicated datasets. To address these challenges, deep learning (DL) is positioned as a competitive alternative for single-cell analyses besides the traditional machine learning approaches. Here, we survey a total of 25 DL algorithms and their applicability for a specific step in the single-cell RNA-seq processing pipeline. Specifically, we establish a unified mathematical representation of variational autoencoder, autoencoder, generative adversarial network and supervised DL models, compare the training strategies and loss functions for these models, and relate the loss functions of these models to specific objectives of the data processing step. Such a presentation will allow readers to choose suitable algorithms for their particular objective at each step in the pipeline. We envision that this survey will serve as an important information portal for learning the application of DL for scRNA-seq analysis and inspire innovative uses of DL to address a broader range of new challenges in emerging multi-omics and spatial single-cell sequencing.


Subjects
Deep Learning , RNA-Seq/methods , Single-Cell Analysis/methods , Algorithms , Cluster Analysis , Gene Expression Profiling/methods , Humans , Machine Learning , Sequence Analysis, RNA/methods , Transcriptome
3.
BMC Bioinformatics ; 24(1): 481, 2023 Dec 16.
Article in English | MEDLINE | ID: mdl-38104057

ABSTRACT

BACKGROUND: The rapid emergence of single-cell RNA-seq (scRNA-seq) data presents remarkable opportunities for broad investigations through integration analyses. However, most integration models are black boxes that lack interpretability or are hard to train. RESULTS: To address the above issues, we propose scInterpreter, a deep learning-based interpretable model. scInterpreter substantially outperforms other state-of-the-art (SOTA) models on multiple benchmark datasets. In addition, scInterpreter is extensible and can integrate and annotate atlas scRNA-seq data. We evaluated the robustness of scInterpreter in a variety of situations. Through comparison experiments, we found that with a knowledge prior, the training process can be significantly accelerated. Finally, we conducted interpretability analysis for each dimension (pathway) of cell representation in the embedding space. CONCLUSIONS: The results showed that the cell representations obtained by scInterpreter are full of biological significance. Through weight sorting, we found several new genes related to pathways in the PBMC dataset. In general, scInterpreter is an effective and interpretable integration tool. It is expected that scInterpreter will bring great convenience to the study of single-cell transcriptomics.


Subjects
Leukocytes, Mononuclear , Single-Cell Gene Expression Analysis , Sequence Analysis, RNA/methods , Leukocytes, Mononuclear/metabolism , Single-Cell Analysis/methods , Gene Expression Profiling/methods , Cluster Analysis
4.
J Proteome Res ; 22(2): 471-481, 2023 02 03.
Article in English | MEDLINE | ID: mdl-36695565

ABSTRACT

Recent surges in large-scale mass spectrometry (MS)-based proteomics studies demand a concurrent rise in methods to facilitate reliable and reproducible data analysis. Quantification of proteins in MS analysis can be affected by variations in technical factors such as sample preparation and data acquisition conditions leading to batch effects, which add noise to the data set. This may in turn affect the effectiveness of any biological conclusions derived from the data. Here we present Batch-effect Identification, Representation, and Correction of Heterogeneous data (BIRCH), a workflow for analysis and correction of batch effect through an automated, versatile, and easy-to-use web-based tool with the goal of eliminating technical variation. BIRCH also supports diagnosis of the data to check for the presence of batch effects, feasibility of batch correction, and imputation to deal with missing values in the data set. To illustrate the relevance of the tool, we explore two case studies, including an iPSC-derived cell study and a Covid vaccine study to show different context-specific use cases. Ultimately this tool can be used as an extremely powerful approach for eliminating technical bias while retaining biological bias, toward understanding disease mechanisms and potential therapeutics.


Subjects
COVID-19 , Proteomics , Humans , Proteomics/methods , Betula , Workflow , COVID-19 Vaccines , Mass Spectrometry/methods
5.
BMC Genomics ; 24(1): 228, 2023 May 02.
Article in English | MEDLINE | ID: mdl-37131143

ABSTRACT

BACKGROUND: Single-cell RNA sequencing is a state-of-the-art technology to understand gene expression in complex tissues. With the growing amount of data being generated, the standardization and automation of data analysis are critical to generating hypotheses and discovering biological insights. RESULTS: Here, we present scRNASequest, a semi-automated single-cell RNA-seq (scRNA-seq) data analysis workflow which allows (1) preprocessing from raw UMI count data, (2) harmonization by one or multiple methods, (3) reference-dataset-based cell type label transfer and embedding projection, (4) multi-sample, multi-condition single-cell level differential gene expression analysis, and (5) seamless integration with cellxgene VIP for visualization and with CellDepot for data hosting and sharing by generating compatible h5ad files. CONCLUSIONS: We developed scRNASequest, an end-to-end pipeline for single-cell RNA-seq data analysis, visualization, and publishing. The source code under MIT open-source license is provided at https://github.com/interactivereport/scRNASequest . We also prepared a bookdown tutorial for the installation and detailed usage of the pipeline: https://interactivereport.github.io/scRNAsequest/tutorial/docs/ . Users have the option to run it on a local computer with a Linux/Unix system including MacOS, or interact with SGE/Slurm schedulers on high-performance computing (HPC) clusters.


Subjects
Ecosystem , Gene Expression Profiling , Gene Expression Profiling/methods , Single-Cell Gene Expression Analysis , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Software , Publishing
6.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34308480

ABSTRACT

In single-cell analyses, cell types are conventionally identified based on the expression of known marker genes, whose identification is time-consuming and irreproducible. To solve this issue, many supervised approaches have been developed to identify cell types based on the rapid accumulation of public datasets. However, these approaches are sensitive to batch effects or biological variations since the data distributions differ in cross-platform or cross-species predictions. In this study, we developed scAdapt, a virtual adversarial domain adaptation network, to transfer cell labels between datasets with batch effects. scAdapt used both the labeled source and unlabeled target data to train an enhanced classifier and aligned the labeled source centroids and pseudo-labeled target centroids to generate a joint embedding. scAdapt was demonstrated to outperform existing methods for classification in simulated, cross-platform, cross-species, spatial transcriptomic and COVID-19 immune datasets. Further quantitative evaluations and visualizations for the aligned embeddings confirm the superiority in cell mixing and the ability to preserve discriminative cluster structure present in the original datasets.


Subjects
COVID-19/genetics , SARS-CoV-2/genetics , Single-Cell Analysis , Transcriptome/genetics , COVID-19/virology , Humans , RNA-Seq , SARS-CoV-2/isolation & purification , Species Specificity , Exome Sequencing
7.
BMC Bioinformatics ; 22(1): 309, 2021 Jun 08.
Article in English | MEDLINE | ID: mdl-34103004

ABSTRACT

BACKGROUND: Single-cell RNA sequencing (scRNA-Seq) experiments are gaining ground to study the molecular processes that drive normal development as well as the onset of different pathologies. Finding an effective and efficient low-dimensional representation of the data is one of the most important steps in the downstream analysis of scRNA-Seq data, as it could provide a better identification of known or putatively novel cell-types. Another step that still poses a challenge is the integration of different scRNA-Seq datasets. Though standard computational pipelines to gain knowledge from scRNA-Seq data exist, a further improvement could be achieved by means of machine learning approaches. RESULTS: Autoencoders (AEs) have been effectively used to capture the non-linearities among gene interactions of scRNA-Seq data, so that the deployment of AE-based tools might represent the way forward in this context. We introduce here scAEspy, a unifying tool that embodies: (1) four of the most advanced AEs, (2) two novel AEs that we developed on purpose, (3) different loss functions. We show that scAEspy can be coupled with various batch-effect removal tools to integrate data by different scRNA-Seq platforms, in order to better identify the cell-types. We benchmarked scAEspy against the most used batch-effect removal tools, showing that our AE-based strategies outperform the existing solutions. CONCLUSIONS: scAEspy is a user-friendly tool that enables using the most recent and promising AEs to analyse scRNA-Seq data by only setting up two user-defined parameters. Thanks to its modularity, scAEspy can be easily extended to accommodate new AEs to further improve the downstream analysis of scRNA-Seq data. Considering the relevant results we achieved, scAEspy can be considered as a starting point to build a more comprehensive toolkit designed to integrate multi single-cell omics.


Subjects
RNA , Single-Cell Analysis , Machine Learning , RNA/genetics , Sequence Analysis, RNA , Exome Sequencing
8.
Expert Rev Proteomics ; 18(10): 835-843, 2021 10.
Article in English | MEDLINE | ID: mdl-34602016

ABSTRACT

INTRODUCTION: Mass spectrometry-based proteomics is actively embracing quantitative, single-cell level analyses. Indeed, recent advances in sample preparation and mass spectrometry (MS) have enabled the emergence of quantitative MS-based single-cell proteomics (SCP). While exciting and promising, SCP still has many rough edges. The current analysis workflows are custom and built from scratch. The field is therefore craving for standardized software that promotes principled and reproducible SCP data analyses. AREAS COVERED: This special report is the first step toward the formalization and standardization of SCP data analysis. scp, the software that accompanies this work, successfully replicates one of the landmark SCP studies and is applicable to other experiments and designs. We created a repository containing the replicated workflow with comprehensive documentation in order to favor further dissemination and improvements of SCP data analyses. EXPERT OPINION: Replicating SCP data analyses uncovers important challenges in SCP data analysis. We describe two such challenges in detail: batch correction and data missingness. We provide the current state-of-the-art and illustrate the associated limitations. We also highlight the intimate dependence that exists between batch effects and data missingness and offer avenues for dealing with these exciting challenges.


Subjects
Proteomics , Software , Computational Biology , Mass Spectrometry , Workflow
9.
BMC Bioinformatics ; 20(1): 268, 2019 May 28.
Article in English | MEDLINE | ID: mdl-31138121

ABSTRACT

BACKGROUND: Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects will result in some biologically meaningful signals being lost. Thus, a central challenge is assessing if the removal of unwanted technical variation harms the biological signal that is of interest to the researcher. RESULTS: We describe a novel framework, B-CeF, to evaluate the effectiveness of batch correction methods and their tendency toward over or under correction. The approach is based on comparing co-expression of adjusted gene-gene pairs to a-priori knowledge of highly confident gene-gene associations based on thousands of unrelated experiments derived from an external reference. Our framework includes three steps: (1) data adjustment with the desired methods (2) calculating gene-gene co-expression measurements for adjusted datasets (3) evaluating the performance of the co-expression measurements against a gold standard. Using the framework, we evaluated five batch correction methods applied to RNA-seq data of six representative tissue datasets derived from the GTEx project. CONCLUSIONS: Our framework enables the evaluation of batch correction methods to better preserve the original biological signal. We show that using a multiple linear regression model to correct for known confounders outperforms factor analysis-based methods that estimate hidden confounders. The code is publicly available as an R package.
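
The abstract above reports that a multiple linear regression on known confounders performed best. The following is a minimal sketch of that general idea, not the B-CeF R package itself: the function name `regress_out_confounders`, the array layout, and the use of NumPy are assumptions made for illustration.

```python
import numpy as np


def regress_out_confounders(expression, confounders):
    """Subtract the fitted linear contribution of known confounders.

    expression:  (n_samples, n_genes) array of expression values.
    confounders: (n_samples,) or (n_samples, n_covariates) array of known
                 covariates (hypothetical examples: a 0/1 batch indicator,
                 an RNA quality score).
    Returns an array shaped like `expression`; the per-gene intercept is kept.
    """
    expression = np.asarray(expression, dtype=float)
    confounders = np.asarray(confounders, dtype=float)
    if confounders.ndim == 1:
        # Treat a single covariate as a one-column matrix.
        confounders = confounders[:, None]
    # Design matrix: intercept column followed by the known confounders.
    design = np.column_stack([np.ones(len(confounders)), confounders])
    # Ordinary least squares, fitted for all genes at once.
    coef, *_ = np.linalg.lstsq(design, expression, rcond=None)
    # Remove only the confounder part of the fit, not the intercept.
    return expression - confounders @ coef[1:]


if __name__ == "__main__":
    # Toy data: one gene, two batches, batch 1 shifted upward by 5 units.
    expr = np.array([[1.0], [2.0], [6.0], [7.0]])
    batch = np.array([0, 0, 1, 1])
    print(regress_out_confounders(expr, batch).ravel())
```

Gene-gene co-expression on the adjusted matrix (step 2 of the framework) could then be computed with `np.corrcoef(adjusted, rowvar=False)`; comparing those values against a gold standard is outside the scope of this sketch.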


Subjects
Algorithms , Computational Biology/methods , Databases, Genetic , Epistasis, Genetic , Genes , Area Under Curve , Gene Expression Regulation , Humans , ROC Curve , Subcutaneous Fat/metabolism
10.
BMC Genomics ; 17: 469, 2016 06 22.
Article in English | MEDLINE | ID: mdl-27334613

ABSTRACT

BACKGROUND: Illumina's HumanMethylation450 arrays provide the most cost-effective means of high-throughput DNA methylation analysis. As with other types of microarray platforms, technical artifacts are a concern, including background fluorescence, dye-bias from the use of two color channels, bias caused by type I/II probe design, and batch effects. Several approaches and pipelines have been developed, either targeting a single issue or designed to address multiple biases through a combination of methods. We evaluate the effect of combining separate approaches to improve signal processing. RESULTS: In this study nine processing methods, including both within- and between-array methods, are applied and compared in four datasets. For technical replicates, we found both within- and between-array methods did a comparable job in reducing variance across replicates. For evaluating biological differences, within-array processing always improved differential DNA methylation signal detection over no processing, and always benefitted from performing background correction first. Combinations of within-array procedures were always among the best performing methods, with a slight advantage appearing for the between-array method Funnorm when batch effects explained more variation in the data than the methylation alterations between cases and controls. However, when this occurred, RUVm, a new batch correction method, noticeably improved reproducibility of differential methylation results over any of the signal-processing methods alone. CONCLUSIONS: The comparisons in our study provide valuable insights in preprocessing HumanMethylation450 BeadChip data. We found the within-array combination of Noob + BMIQ always improved signal sensitivity, and when combined with the RUVm batch-correction method, outperformed all other approaches in performing differential DNA methylation analysis. The effect of the data processing method, in any given data set, was a function of both the signal and noise.


Subjects
DNA Methylation , Epigenesis, Genetic , Epigenomics/methods , Oligonucleotide Array Sequence Analysis/methods , Alzheimer Disease/genetics , Brain/metabolism , Female , High-Throughput Nucleotide Sequencing , Humans , Male , ROC Curve , Reproducibility of Results
11.
Genome Biol ; 25(1): 212, 2024 Aug 09.
Article in English | MEDLINE | ID: mdl-39123269

ABSTRACT

BACKGROUND: Spatial transcriptomics (ST) is advancing our understanding of complex tissues and organisms. However, building a robust clustering algorithm to define spatially coherent regions in a single tissue slice and aligning or integrating multiple tissue slices originating from diverse sources for essential downstream analyses remains challenging. Numerous clustering, alignment, and integration methods have been specifically designed for ST data by leveraging its spatial information. The absence of comprehensive benchmark studies complicates the selection of methods and future method development. RESULTS: In this study, we systematically benchmark a variety of state-of-the-art algorithms with a wide range of real and simulated datasets of varying sizes, technologies, species, and complexity. We analyze the strengths and weaknesses of each method using diverse quantitative and qualitative metrics and analyses, including eight metrics for spatial clustering accuracy and contiguity, uniform manifold approximation and projection visualization, layer-wise and spot-to-spot alignment accuracy, and 3D reconstruction, which are designed to assess method performance as well as data quality. The code used for evaluation is available on our GitHub. Additionally, we provide online notebook tutorials and documentation to facilitate the reproduction of all benchmarking results and to support the study of new methods and new datasets. CONCLUSIONS: Our analyses lead to comprehensive recommendations that cover multiple aspects, helping users to select optimal tools for their specific needs and guide future method development.


Subjects
Algorithms , Benchmarking , Cluster Analysis , Animals , Gene Expression Profiling/methods , Transcriptome , Humans , Software , Sequence Alignment/methods
12.
Genome Biol ; 25(1): 89, 2024 04 08.
Article in English | MEDLINE | ID: mdl-38589921

ABSTRACT

Advancements in cytometry technologies have enabled quantification of up to 50 proteins across millions of cells at single cell resolution. Analysis of cytometry data routinely involves tasks such as data integration, clustering, and dimensionality reduction. While numerous tools exist, many require extensive run times when processing large cytometry data containing millions of cells. Existing solutions, such as random subsampling, are inadequate as they risk excluding rare cell subsets. To address this, we propose SuperCellCyto, an R package that builds on the SuperCell tool which groups highly similar cells into supercells. SuperCellCyto is available on GitHub ( https://github.com/phipsonlab/SuperCellCyto ) and Zenodo ( https://doi.org/10.5281/zenodo.10521294 ).


Subjects
Research , Single-Cell Analysis , Cluster Analysis , Software
13.
Front Genet ; 15: 1369628, 2024.
Article in English | MEDLINE | ID: mdl-38903761

ABSTRACT

Genotype-to-phenotype mapping is an essential problem in the current genomic era. While qualitative case-control predictions have received significant attention, less emphasis has been placed on predicting quantitative phenotypes. This emerging field holds great promise in revealing intricate connections between microbial communities and host health. However, the presence of heterogeneity in microbiome datasets poses a substantial challenge to the accuracy of predictions and undermines the reproducibility of models. To tackle this challenge, we investigated 22 normalization methods that aimed at removing heterogeneity across multiple datasets, conducted a comprehensive review of them, and evaluated their effectiveness in predicting quantitative phenotypes in three simulation scenarios and 31 real datasets. The results indicate that none of these methods demonstrate significant superiority in predicting quantitative phenotypes or attain a noteworthy reduction in Root Mean Squared Error (RMSE) of the predictions. Given the frequent occurrence of batch effects and the satisfactory performance of batch correction methods in predicting datasets affected by these effects, we strongly recommend utilizing batch correction methods as the initial step in predicting quantitative phenotypes. In summary, the performance of normalization methods in predicting metagenomic data remains a dynamic and ongoing research area. Our study contributes to this field by undertaking a comprehensive evaluation of diverse methods and offering valuable insights into their effectiveness in predicting quantitative phenotypes.
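
For readers unfamiliar with the metric named in the abstract above, Root Mean Squared Error is the square root of the mean squared difference between observed and predicted values. A minimal sketch follows; the function name `rmse` and the toy numbers are illustrative and are not taken from the study.

```python
import math


def rmse(observed, predicted):
    """Root Mean Squared Error between two equal-length sequences of numbers."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    # Hypothetical quantitative phenotype values and model predictions.
    print(rmse([22.0, 27.5, 31.0], [23.0, 26.5, 33.0]))
```

A lower value indicates predictions closer to the observed phenotype, in the same units as the phenotype itself.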

14.
bioRxiv ; 2024 Feb 28.
Article in English | MEDLINE | ID: mdl-37745478

ABSTRACT

High-throughput image-based profiling platforms are powerful technologies capable of collecting data from billions of cells exposed to thousands of perturbations in a time- and cost-effective manner. Therefore, image-based profiling data has been increasingly used for diverse biological applications, such as predicting drug mechanism of action or gene function. However, batch effects pose severe limitations to community-wide efforts to integrate and interpret image-based profiling data collected across different laboratories and equipment. To address this problem, we benchmarked seven high-performing scRNA-seq batch correction techniques, representing diverse approaches, using a newly released Cell Painting dataset, the largest publicly accessible image-based dataset. We focused on five different scenarios with varying complexity, and we found that Harmony, a mixture-model based method, consistently outperformed the other tested methods. Our proposed framework, benchmark, and metrics can additionally be used to assess new batch correction methods in the future. Overall, this work paves the way for improvements that allow the community to make best use of public Cell Painting data for scientific discovery.

15.
Mol Cells ; 46(2): 106-119, 2023 Feb 28.
Article in English | MEDLINE | ID: mdl-36859475

ABSTRACT

With the increased number of single-cell RNA sequencing (scRNA-seq) datasets in public repositories, integrative analysis of multiple scRNA-seq datasets has become commonplace. Batch effects among different datasets are inevitable because of differences in cell isolation and handling protocols, library preparation technology, and sequencing platforms. To remove these batch effects for effective integration of multiple scRNA-seq datasets, a number of methodologies have been developed based on diverse concepts and approaches. These methods have proven useful for examining whether cellular features, such as cell subpopulations and marker genes, identified from a certain dataset, are consistently present, or whether their condition-dependent variations, such as increases in cell subpopulations in particular disease-related conditions, are consistently observed in different datasets generated under similar or distinct conditions. In this review, we summarize the concepts and approaches of the integration methods and their pros and cons as has been reported in previous literature.


Subjects
Single-Cell Gene Expression Analysis , Gene Library
16.
Article in English | MEDLINE | ID: mdl-37122388

ABSTRACT

Large scale -omics datasets can provide new insights into normal and disease-related biology when analyzed through a systems biology framework. However, technical artefacts present in most -omics datasets due to variations in sample preparation, batching, platform settings, personnel, and other experimental procedures prevent useful analyses of such data without prior adjustment for these technical factors. Here, we demonstrate a tunable median polish of ratio (TAMPOR) approach for batch effect correction and agglomeration of multiple, multi-batch, site-specific cohorts into a single analyte abundance data matrix that is suitable for systems biology analyses. We illustrate the utility and versatility of TAMPOR through four distinct use cases where the method has been applied to different proteomic datasets, some of which contain a specific defect that must be addressed prior to analysis. We compare quality control metrics and sources of variance before and after application of TAMPOR to show that TAMPOR is effective at removing batch effects and other unwanted sources of variance in -omics data. We also show how TAMPOR can be used to harmonize -omics datasets even when the data are acquired using different analytical approaches. TAMPOR is a powerful and flexible approach for cleaning and harmonization of -omics data prior to downstream systems biology analysis.
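
The abstract above builds on the idea of a median polish. As background only, here is a sketch of the classic Tukey median polish on a small matrix (rows could be analytes, columns samples, values log abundances). It is not the TAMPOR algorithm, whose ratio step and tuning options are not described in the abstract; the function name and parameters are assumptions for illustration.

```python
from statistics import median


def median_polish(matrix, n_iter=10, tol=1e-9):
    """Classic Tukey median polish.

    Decomposes matrix[i][j] into overall + row_eff[i] + col_eff[j] + resid[i][j]
    by alternately sweeping out row medians and column medians.
    Returns (overall, row_eff, col_eff, resid).
    """
    resid = [[float(v) for v in row] for row in matrix]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep the median of each row into the row effects.
        row_med = [median(row) for row in resid]
        for i in range(n_rows):
            row_eff[i] += row_med[i]
            resid[i] = [v - row_med[i] for v in resid[i]]
        # Move the median of the column effects into the overall term.
        shift = median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep the median of each column into the column effects.
        col_med = [median(resid[i][j] for i in range(n_rows)) for j in range(n_cols)]
        for j in range(n_cols):
            col_eff[j] += col_med[j]
            for i in range(n_rows):
                resid[i][j] -= col_med[j]
        # Move the median of the row effects into the overall term.
        shift = median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        # Stop once a full sweep no longer changes anything.
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


if __name__ == "__main__":
    # Toy additive matrix: value = 10 + row offset + column offset.
    data = [[7, 9, 11], [8, 10, 12], [9, 11, 13]]
    print(median_polish(data))
```

In a batch-correction setting, the column effects capture per-sample shifts and the residuals hold what remains after row and column trends are removed; how TAMPOR uses these quantities is specified in the original publication, not here.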

17.
Cell Rep Methods ; 3(9): 100581, 2023 09 25.
Article in English | MEDLINE | ID: mdl-37708894

ABSTRACT

Gene expression dynamics provide directional information for trajectory inference from single-cell RNA sequencing data. Traditional approaches compute RNA velocity using strict modeling assumptions about transcription and splicing of RNA. This can fail in scenarios where multiple lineages have distinct gene dynamics or where rates of transcription and splicing are time dependent. We present "LatentVelo," an approach to compute a low-dimensional representation of gene dynamics with deep learning. LatentVelo embeds cells into a latent space with a variational autoencoder and models differentiation dynamics on this "dynamics-based" latent space with neural ordinary differential equations. LatentVelo infers a latent regulatory state that controls the dynamics of an individual cell to model multiple lineages. LatentVelo can predict latent trajectories, describing the inferred developmental path for individual cells rather than just local RNA velocity vectors. The dynamics-based embedding batch corrects cell states and velocities, outperforming comparable autoencoder batch correction methods that do not consider gene expression dynamics.


Subjects
Gene Expression Profiling , Transcriptome , Transcriptome/genetics , Cell Differentiation/genetics , RNA , RNA Splicing/genetics
18.
Epigenetics Chromatin ; 16(1): 1, 2023 01 07.
Article in English | MEDLINE | ID: mdl-36609459

ABSTRACT

BACKGROUND: Many human disease phenotypes manifest differently by sex, making the development of methods for incorporating X and Y-chromosome data into analyses vital. Unfortunately, X and Y chromosome data are frequently excluded from large-scale analyses of the human genome and epigenome due to analytical complexity associated with sex chromosome dosage differences between XX and XY individuals, and the impact of X-chromosome inactivation (XCI) on the epigenome. As such, little attention has been given to considering the methods by which sex chromosome data may be included in analyses of DNA methylation (DNAme) array data. RESULTS: With Illumina Infinium HumanMethylation450 DNAme array data from 634 placental samples, we investigated the effects of probe filtering, normalization, and batch correction on DNAme data from the X and Y chromosomes. Processing steps were evaluated in both mixed-sex and sex-stratified subsets of the analysis cohort to identify whether including both sexes impacted processing results. We found that identification of probes that have a high detection p-value, or that are non-variable, should be performed in sex-stratified data subsets to avoid over- and under-estimation of the quantity of probes eligible for removal, respectively. All normalization techniques investigated returned X and Y DNAme data that were highly correlated with the raw data from the same samples. We found no difference in batch correction results after application to mixed-sex or sex-stratified cohorts. Additionally, we identify two analytical methods suitable for XY chromosome data, the choice between which should be guided by the research question of interest, and we performed a proof-of-concept analysis studying differential DNAme on the X and Y chromosome in the context of placental acute chorioamnionitis. Finally, we provide an annotation of probe types that may be desirable to filter in X and Y chromosome analyses, including probes in repetitive elements, the X-transposed region, and cancer-testis gene promoters. CONCLUSION: While there may be no single "best" approach for analyzing DNAme array data from the X and Y chromosome, analysts must consider key factors during processing and analysis of sex chromosome data to accommodate the underlying biology of these chromosomes, and the technical limitations of DNA methylation arrays.


Subjects
DNA Methylation , Placenta , Male , Humans , Female , Pregnancy , Y Chromosome/genetics , X Chromosome Inactivation , Phenotype
19.
Comput Struct Biotechnol J ; 20: 4369-4375, 2022.
Article in English | MEDLINE | ID: mdl-36051874

ABSTRACT

Mass-spectrometry-based proteomics presents some unique challenges for batch effect correction. Batch effects are technical sources of variation that can confound analysis and are usually non-biological in nature. As proteomic analysis involves several stages of data transformation from spectra to protein, the decision on when and on what to apply batch correction is often unclear. Here, we explore several relevant issues pertinent to batch effect correction considerations. The first involves applications of batch effect correction requiring prior knowledge on batch factors and exploring data to uncover new/unknown batch factors. The second considers recent literature that suggests there is no single best batch effect correction algorithm; i.e., instead of a best approach, one may instead ask what a suitable approach is. The third section considers issues of batch effect detection. And finally, we look at potential developments for proteomic-specific batch effect correction methods and how to do better functional evaluations on batch corrected data.

20.
Front Genet ; 13: 1009316, 2022.
Article in English | MEDLINE | ID: mdl-36386846

ABSTRACT

Large-scale comprehensive single-cell experiments are often resource-intensive and require the involvement of many laboratories and/or taking measurements at various times. This inevitably leads to batch effects: systematic variations in the data that might occur due to different technology platforms, reagent lots, or handling personnel. Such technical differences confound biological variations of interest and need to be corrected during the data integration process. Data integration is a challenging task due to the overlapping of biological and technical factors, which makes it difficult to distinguish their individual contribution to the overall observed effect. Moreover, the choice of integration method may impact the downstream analyses, including searching for differentially expressed genes. From the existing data integration methods, we selected only those that return the full expression matrix. We evaluated six methods in terms of their influence on the performance of differential gene expression analysis in two single-cell datasets with the same biological study design that differ only in the way the measurement was done: one dataset manifests strong batch effects due to the measurements of each sample at a different time. Integrated data were visualized using the UMAP method. The evaluation was done both on individual gene level using parametric and non-parametric approaches for finding differentially expressed genes and on gene set level using gene set enrichment analysis. As an evaluation metric, we used two correlation coefficients, Pearson and Spearman, of the obtained test statistics between reference, test, and corrected studies. Visual comparison of UMAP plots highlighted ComBat-seq, limma, and MNN, which reduced batch effects and preserved differences between biological conditions. Most of the tested methods changed the data distribution after integration, which negatively impacts the use of parametric methods for the analysis. Two algorithms, MNN and Scanorama, gave very poor results in terms of differential analysis on gene and gene set levels. Finally, we highlight ComBat-seq because it led to the highest correlation of test statistics between the reference and corrected datasets among the tested methods. Moreover, it does not distort the original distribution of gene expression data, so it can be used in all types of downstream analyses.
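
The evaluation metric described in the abstract above, the Pearson and Spearman correlation of per-gene test statistics between two studies, can be sketched as follows. This is an illustrative, standard-library-only version; the function names and the toy statistics are assumptions and do not come from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical per-gene test statistics from a reference and a corrected study.
    reference_stats = [2.1, -0.4, 3.3, 0.8, -1.7]
    corrected_stats = [1.9, -0.1, 2.8, 1.1, -2.0]
    print(pearson(reference_stats, corrected_stats))
    print(spearman(reference_stats, corrected_stats))
```

Values close to 1 would suggest that a correction method preserved the ordering and magnitude of differential expression signals relative to the reference; Spearman depends only on the ordering, so it is less sensitive to changes in the distribution of the statistics.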
