Results 1 - 20 of 88
1.
Front Cell Infect Microbiol ; 14: 1405699, 2024.
Article in English | MEDLINE | ID: mdl-39071165

ABSTRACT

Introduction: Microbiome-based clinical applications that improve diagnosis related to oral health are of great interest to precision dentistry. Predictive studies on the salivary microbiome are scarce and of low methodological quality (small sample sizes, lack of biological heterogeneity, and absence of a validation process). None of them evaluates the impact of confounding factors such as batch effects (BEs). This is the first 16S multi-batch study to analyze the salivary microbiome at the amplicon sequence variant (ASV) level in terms of differential abundance and machine learning models, in periodontally healthy and periodontitis patients, before and after removing BEs. Methods: Saliva was collected from 124 patients (50 healthy, 74 periodontitis) in our setting. Sequencing of the V3-V4 16S rRNA gene region was performed on an Illumina MiSeq. In parallel, searches were conducted in four databases to identify previous Illumina V3-V4 sequencing studies on the salivary microbiome. Investigations that met predefined criteria were included in the analysis, and our own and external sequences were processed using the same bioinformatics protocol. The statistical analysis was performed in the R-Bioconductor environment. Results: The elimination of BEs reduced the number of ASVs with differential abundance between the groups by approximately one-third (before = 265; after = 190). Before removing BEs, the model constructed using all study samples (796) comprised 16 ASVs (0.16%) and had an area under the curve (AUC) of 0.944, sensitivity of 90.73%, and specificity of 87.16%. The model built using two-thirds of the specimens (training = 531) comprised 35 ASVs (0.36%) and had an AUC of 0.955, sensitivity of 86.54%, and specificity of 90.06% after being validated on the remaining one-third (test = 265).
After removing BEs, the models required more ASVs (all samples = 200, 2.03%; training = 100, 1.01%) to obtain a slightly lower AUC (all = 0.935; test = 0.947), lower sensitivity (all = 81.79%; test = 78.85%), and similar specificity (all = 91.51%; test = 90.68%). Conclusions: The removal of BEs controls false-positive ASVs in the differential abundance analysis. However, their elimination requires a considerably larger number of predictor taxa to achieve optimal performance, creating less robust classifiers. As all the provided models can accurately discriminate health from periodontitis, with good-to-excellent sensitivities and specificities, the salivary microbiome demonstrates potential clinical applicability as a precision diagnostic tool for periodontitis.
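The train/validate split and the AUC/sensitivity/specificity reporting described above can be sketched on synthetic data. Everything below (the abundance table, the effect sizes, the naive linear score) is an illustrative assumption, not the study's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an ASV abundance table: 300 saliva samples x 50 ASVs,
# with the first 5 ASVs enriched in the "periodontitis" group (hypothetical).
n_samples, n_asvs = 300, 50
y = rng.integers(0, 2, n_samples)              # 0 = healthy, 1 = periodontitis
X = np.log1p(rng.lognormal(0.0, 1.0, (n_samples, n_asvs)))
X[y == 1, :5] += 1.0                           # hypothetical disease-associated shift

# Two-thirds training / one-third held-out test, mirroring the design above.
split = 2 * n_samples // 3
X_tr, y_tr, X_te, y_te = X[:split], y[:split], X[split:], y[split:]

# Naive linear score: project onto the difference of training group means.
w = X_tr[y_tr == 1].mean(axis=0) - X_tr[y_tr == 0].mean(axis=0)
score = X_te @ w

# AUC via the rank-sum (Mann-Whitney) identity.
ranks = score.argsort().argsort() + 1
n1, n0 = (y_te == 1).sum(), (y_te == 0).sum()
auc = (ranks[y_te == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

thr = np.median(score)
sensitivity = (score[y_te == 1] > thr).mean()
specificity = (score[y_te == 0] <= thr).mean()
print(f"AUC={auc:.3f}  sensitivity={sensitivity:.2%}  specificity={specificity:.2%}")
```

With real ASV tables one would add compositional normalization and a validated feature-selection step before trusting any such numbers.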


Subject(s)
Biomarkers , Microbiota , Periodontitis , RNA, Ribosomal, 16S , Saliva , Humans , Saliva/microbiology , RNA, Ribosomal, 16S/genetics , Periodontitis/microbiology , Periodontitis/diagnosis , Female , Adult , Male , Biomarkers/analysis , Middle Aged , Machine Learning , Bacteria/isolation & purification , Bacteria/genetics , Bacteria/classification , High-Throughput Nucleotide Sequencing , Computational Biology , Sequence Analysis, DNA , DNA, Bacterial/genetics
2.
Curr Protoc ; 4(6): e1055, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38837690

ABSTRACT

Data harmonization involves combining data from multiple independent sources and processing the data to produce one uniform dataset. Merging separate genotype or whole-genome sequencing datasets has been proposed as a strategy to increase the statistical power of association tests by increasing the effective sample size. However, data harmonization is not a widely adopted strategy due to the difficulties with merging data (including confounding produced by batch effects and population stratification). Detailed data harmonization protocols are scarce and often conflicting. Moreover, data harmonization protocols that accommodate samples of admixed ancestry are practically non-existent. Existing data harmonization procedures must be modified to ensure the heterogeneous ancestry of admixed individuals is incorporated into additional downstream analyses without confounding results. Here, we propose a set of guidelines for merging multi-platform genetic data from admixed samples that can be adopted by any investigator with elementary bioinformatics experience. We have applied these guidelines to aggregate 1544 tuberculosis (TB) case-control samples from six separate in-house datasets and conducted a genome-wide association study (GWAS) of TB susceptibility. The GWAS performed on the merged dataset had improved power over analyzing the datasets individually and produced summary statistics free from bias introduced by batch effects and population stratification. © 2024 Wiley Periodicals LLC. Basic Protocol 1: Processing separate datasets comprising array genotype data Alternate Protocol 1: Processing separate datasets comprising array genotype and whole-genome sequencing data Alternate Protocol 2: Performing imputation using a local reference panel Basic Protocol 2: Merging separate datasets Basic Protocol 3: Ancestry inference using ADMIXTURE and RFMix Basic Protocol 4: Batch effect correction using pseudo-case-control comparisons.
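One concrete pre-merge check in the spirit of these protocols is to flag variants whose allele frequencies diverge sharply between the datasets being combined, since such variants often reflect genotyping-batch artifacts rather than biology. The function name, threshold, and simulated genotypes below are illustrative assumptions:

```python
import numpy as np

def flag_batch_snps(geno_a, geno_b, max_freq_diff=0.1):
    """Flag SNPs whose allele frequencies differ sharply between two datasets
    about to be merged: a crude screen for batch artifacts. geno_* are
    (samples x SNPs) arrays of 0/1/2 minor-allele counts on a shared coding."""
    freq_a = geno_a.mean(axis=0) / 2.0
    freq_b = geno_b.mean(axis=0) / 2.0
    return np.abs(freq_a - freq_b) > max_freq_diff

rng = np.random.default_rng(1)
a = rng.binomial(2, 0.3, (500, 100))          # dataset A: 500 samples, 100 SNPs
b = rng.binomial(2, 0.3, (500, 100))          # dataset B, same population
b[:, 0] = rng.binomial(2, 0.8, 500)           # simulate one batch-artifact SNP
flags = flag_batch_snps(a, b)
print(flags.sum(), "suspect SNP(s) at index:", np.where(flags)[0])
```

In practice such screening complements, rather than replaces, the pseudo-case-control batch comparison described in Basic Protocol 4.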


Subject(s)
Genome-Wide Association Study , Humans , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Genomics/methods , Genomics/standards , Tuberculosis/genetics , Case-Control Studies , Guidelines as Topic , Genetic Predisposition to Disease
3.
Adv Sci (Weinh) ; 11(26): e2306770, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38711214

ABSTRACT

Integrating multiple single-cell datasets is essential for a comprehensive understanding of cell heterogeneity. Batch effects are the undesired systematic variations among technologies or experimental laboratories that distort biological signals and hinder the integration of single-cell datasets. However, existing methods typically rely on a selected dataset as a reference, leading to inconsistent integration performance with different references, or embed cells into an uninterpretable low-dimensional feature space. To overcome these limitations, Beaconet, a reference-free method for integrating multiple single-cell transcriptomic datasets in the original molecular space by aligning the global distribution of each batch using an adversarial correction network, is presented. Through extensive comparisons with 13 state-of-the-art methods, it is demonstrated that Beaconet can effectively remove batch effects while preserving biological variation and is superior in overall performance to existing unsupervised methods using all possible references. Furthermore, Beaconet performs integration in the original molecular feature space, enabling the characterization of cell types and downstream differential expression analysis directly on integrated data with gene-expression features. Additionally, when applied to large-scale atlas data integration, Beaconet shows notable advantages in both time and space efficiency. In summary, Beaconet serves as an effective and efficient batch effect removal tool that can facilitate the integration of single-cell datasets in a reference-free and molecular-feature-preserving mode.


Subject(s)
Gene Expression Profiling , Single-Cell Analysis , Transcriptome , Single-Cell Analysis/methods , Transcriptome/genetics , Gene Expression Profiling/methods , Humans , Computational Biology/methods , Animals
4.
Front Neurol ; 15: 1306546, 2024.
Article in English | MEDLINE | ID: mdl-38440115

ABSTRACT

Background: Dopamine transporter single-photon emission computed tomography (DAT-SPECT) is a crucial tool for evaluating patients with Parkinson's disease (PD). However, its application is limited by inter-site variability in large multisite clinical trials. To overcome this limitation, a conventional correction method employs linear regression with phantom scanning, which is effective yet available only in a prospective manner. An alternative, although relatively underexplored, involves retrospective modeling using a statistical method known as "combatting batch effects when combining batches of gene expression microarray data" (ComBat). Methods: We analyzed DAT-SPECT-specific binding ratios (SBRs) derived from 72 healthy older adults and 81 patients with PD registered at four clinical sites. We applied both the prospective correction and the retrospective ComBat correction to the original SBRs. Next, we compared the performance of the original and the two corrected SBRs in differentiating the PD patients from the healthy controls. Diagnostic accuracy was assessed using the area under the receiver operating characteristic curve (AUC-ROC). Results: The original SBRs were 6.13 ± 1.54 (mean ± standard deviation) and 2.03 ± 1.41 in the control and PD groups, respectively. After the prospective correction, the mean SBRs were 6.52 ± 1.06 and 2.40 ± 0.99 in the control and PD groups, respectively. After the retrospective ComBat correction, the SBRs were 5.25 ± 0.89 and 2.01 ± 0.73 in the control and PD groups, respectively, resulting in substantial changes in mean values with smaller variances. The original SBRs demonstrated fair performance in differentiating PD from controls (Hedges's g = 2.76; AUC-ROC = 0.936). Both correction methods improved discrimination performance. The ComBat-corrected SBRs demonstrated comparable performance (g = 3.99 and AUC-ROC = 0.987) to the prospectively corrected SBRs (g = 4.32 and AUC-ROC = 0.992) for discrimination.
Conclusion: Although we confirmed that SBRs fairly discriminated PD from healthy older adults without any correction, the correction methods improved their discrimination performance in a multisite setting. Our results support the utility of harmonization methods with ComBat for consolidating SBR-based diagnosis or stratification of PD in multisite studies. Nonetheless, given the substantial changes in the mean values of ComBat-corrected SBRs, caution is advised when interpreting them.
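ComBat's core idea, a location/scale adjustment of each site toward a pooled reference, can be illustrated in a few lines. This sketch omits ComBat's covariate protection and empirical-Bayes shrinkage, so it is a conceptual stand-in rather than the method used in the study; the SBR values and site offsets are simulated assumptions:

```python
import numpy as np

def harmonize_location_scale(values, sites):
    """Crude ComBat-flavoured harmonization: remove per-site mean/variance
    differences, then restore the pooled location and scale. Real ComBat
    additionally protects covariates (e.g. diagnosis) and pools site
    estimates with empirical Bayes; this sketch does neither."""
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    grand_mu, grand_sd = values.mean(), values.std()
    for s in np.unique(sites):
        m = sites == s
        out[m] = (values[m] - values[m].mean()) / values[m].std() * grand_sd + grand_mu
    return out

rng = np.random.default_rng(2)
sites = np.repeat([0, 1, 2, 3], 40)                    # four clinical sites
sbr = rng.normal(4.0, 1.0, 160) + sites * 0.5          # simulated site-dependent offset
adj = harmonize_location_scale(sbr, sites)
site_means = [adj[sites == s].mean() for s in range(4)]
print("per-site means after harmonization:", np.round(site_means, 3))
```

Centering sites without modeling diagnosis, as this toy version does, would remove real group differences whenever case/control proportions differ across sites, one reason the authors advise caution when interpreting corrected SBR values.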

5.
Comput Struct Biotechnol J ; 23: 1094-1105, 2024 Dec.
Article in English | MEDLINE | ID: mdl-38495555

ABSTRACT

Most complex biological regulatory activities occur in three dimensions (3D). To better analyze biological processes, it is essential not only to decipher the molecular information of numerous cells but also to understand how their spatial contexts influence their behavior. With the development of spatially resolved transcriptomics (SRT) technologies, SRT datasets are being generated to simultaneously characterize gene expression and spatial arrangement within tissues, organs, or organisms. To fully leverage spatial information, the focus extends beyond individual two-dimensional (2D) slices. Two tasks, slice alignment and data integration, have been introduced to establish correlations between multiple slices, enhancing the effectiveness of downstream tasks. Numerous related methods have now been developed. In this review, we first elucidate the details and principles behind several representative methods. Then we report the testing results of these methods on various SRT datasets and assess their performance in representative downstream tasks. Insights into the strengths and weaknesses of each method and the reasons behind their performance are discussed. Finally, we provide an outlook on future developments. The code and details of the experiments are publicly available at https://github.com/YangLabHKUST/SRT_alignment_and_integration.

6.
bioRxiv ; 2024 Jan 03.
Article in English | MEDLINE | ID: mdl-38260566

ABSTRACT

Background: Principal component analysis (PCA), a standard approach to analysis and visualization of large datasets, is commonly used in biomedical research for detecting similarities and differences among groups of samples. We initially used conventional PCA as a tool for critical quality control of batch and trend effects in multi-omic profiling data produced by The Cancer Genome Atlas (TCGA) project of the NCI. We found, however, that conventional PCA visualizations were often hard to interpret when inter-batch differences were moderate in comparison with intra-batch differences; it was also difficult to quantify batch effects objectively. We, therefore, sought enhancements to make the method more informative in those and analogous settings. Results: We have developed algorithms and a toolbox of enhancements to conventional PCA that improve the detection, diagnosis, and quantitation of differences between or among groups, e.g., groups of molecularly profiled biological samples. The enhancements include (i) computed group centroids; (ii) sample-dispersion rays; (iii) differential coloring of centroids, rays, and sample data points; (iv) trend trajectories; and (v) a novel separation index (DSC) for quantitation of differences among groups. Conclusions: PCA-Plus has been our most useful single tool for analyzing, visualizing, and quantitating batch effects, trend effects, and class differences in molecular profiling data of many types: mRNA expression, microRNA expression, DNA methylation, and DNA copy number. An early version of PCA-Plus has been used as the central graphical visualization in our MBatch package for near-real-time surveillance of data for analysis working groups in more than 70 TCGA, PanCancer Atlas, PanCancer Analysis of Whole Genomes, and Genome Data Analysis Network projects of the NCI. The algorithms and software are generic, hence applicable more generally to other types of multivariate data as well.
PCA-Plus is freely available as a downloadable R package at our MBatch website.
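The centroid-plus-separation-index idea can be imitated with ordinary PCA. The index below is a simplified between/within dispersion ratio written for illustration; it is not MBatch's exact DSC computation, and the simulated batches are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
# Two simulated batches of profiling data with a modest batch shift.
batch1 = rng.normal(0.0, 1.0, (60, 20))
batch2 = rng.normal(0.6, 1.0, (60, 20))
X = np.vstack([batch1, batch2])
labels = np.array([0] * 60 + [1] * 60)

# PCA via SVD of the centered matrix; keep the first two components.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T

centroids = np.array([scores[labels == g].mean(axis=0) for g in (0, 1)])
grand = scores.mean(axis=0)

# Simplified separation index in the spirit of MBatch's DSC: dispersion of
# group centroids around the grand centroid over mean within-group dispersion.
between = np.linalg.norm(centroids - grand, axis=1).mean()
within = np.mean([np.linalg.norm(scores[labels == g] - centroids[g], axis=1).mean()
                  for g in (0, 1)])
dsc_like = between / within
print(f"group centroids:\n{centroids}\nseparation index: {dsc_like:.3f}")
```

Overlaying centroids and rays from each sample to its group centroid on the score plot reproduces the kind of annotated PCA view the toolbox describes.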

7.
BMC Genom Data ; 25(1): 8, 2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38254005

ABSTRACT

BACKGROUND: Recent advancements in next-generation sequencing (NGS) technology have brought significant improvements in sequencing speed and data throughput, enabling the simultaneous analysis of a greater number of samples within a single sequencing run. This technology has proven particularly valuable for microbial community profiling, offering a powerful tool for characterizing the microbial composition of a sample at the species level. This profiling process typically involves sequencing 16S ribosomal RNA (rRNA) gene fragments. By scaling up the analysis to accommodate a substantial number of samples, sometimes as many as 2,000, it becomes possible to achieve cost-efficiency and minimize the introduction of potential batch effects. Our primary objective was to devise an approach for the comprehensive analysis of 1,711 samples sourced from diverse origins, including oropharyngeal swabs, mouth cavity swabs, dental swabs, and human fecal samples, based on 16S rRNA metagenomic sequencing conducted on the Illumina MiSeq and HiSeq platforms. RESULTS: We have designed a custom set of 10-base-pair indices specifically tailored for the preparation of libraries from amplicons derived from the V3-V4 region of the 16S rRNA gene. These indices are instrumental in analyzing the microbial composition of clinical samples through sequencing on the Illumina MiSeq and HiSeq platforms. Our custom index set enables the consolidation of a significant number of libraries and their efficient sequencing in a single run. CONCLUSIONS: The unique array of 10-base-pair indices that we have developed, in conjunction with our sequencing methodology, will prove highly valuable to laboratories engaged in sequencing on Illumina platforms or utilizing Illumina-compatible kits.


Subject(s)
Culture , High-Throughput Nucleotide Sequencing , Humans , RNA, Ribosomal, 16S/genetics , Feces , Laboratories
8.
BMC Bioinformatics ; 24(1): 459, 2023 Dec 07.
Article in English | MEDLINE | ID: mdl-38057718

ABSTRACT

BACKGROUND: Variability in datasets is not only the product of biological processes: it is also the product of technical biases. ComBat and ComBat-Seq are among the most widely used tools for correcting those technical biases, called batch effects, in microarray and RNA-Seq expression data, respectively. RESULTS: In this technical note, we present a new Python implementation of ComBat and ComBat-Seq. While the mathematical framework is strictly the same, we show here that our implementations: (i) produce similar results in terms of batch effect correction; (ii) are as fast as or faster than the original implementations in R; and (iii) offer new tools for the bioinformatics community to participate in their development. pyComBat is implemented in the Python language and is distributed under the GPL-3.0 license ( https://www.gnu.org/licenses/gpl-3.0.en.html ) as a module of the inmoose package. Source code is available at https://github.com/epigenelabs/inmoose and the Python package at https://pypi.org/project/inmoose . CONCLUSIONS: We present a new Python implementation of the state-of-the-art tools ComBat and ComBat-Seq for the correction of batch effects in microarray and RNA-Seq data. This new implementation, based on the same mathematical frameworks as ComBat and ComBat-Seq, offers similar power for batch effect correction at reduced computational cost.


Subject(s)
Computational Biology , Software , Bayes Theorem , Computational Biology/methods , RNA-Seq
9.
Epigenetics ; 18(1): 2257437, 2023 12.
Article in English | MEDLINE | ID: mdl-37731367

ABSTRACT

Background: Recent studies have identified thousands of associations between DNA methylation CpGs and complex diseases/traits, emphasizing the critical role of epigenetics in understanding disease aetiology and identifying biomarkers. However, association analyses based on methylation array data are susceptible to batch/slide effects, which can lead to inflated false positive rates or reduced statistical power. Results: We use multiple DNA methylation datasets based on the popular Illumina Infinium MethylationEPIC BeadChip array to describe consistent patterns and the joint distribution of slide effects across CpGs, confirming and extending previous results. The susceptible CpGs overlap with the Illumina Infinium HumanMethylation450 BeadChip array content. Conclusions: Our findings reveal systematic patterns in slide effects. The observations provide further insights into the characteristics of these effects and can improve existing adjustment approaches.


Subject(s)
DNA Methylation , Epigenesis, Genetic , Epigenomics , Multifactorial Inheritance
10.
Imaging Neurosci (Camb) ; 1: 1-16, 2023 Aug 01.
Article in English | MEDLINE | ID: mdl-37719839

ABSTRACT

Combining data collected from multiple study sites is becoming common and is advantageous to researchers to increase the generalizability and replicability of scientific discoveries. However, at the same time, unwanted inter-scanner biases are commonly observed across neuroimaging data collected from multiple study sites or scanners, rendering difficulties in integrating such data to obtain reliable findings. While several methods for handling such unwanted variations have been proposed, most of them use univariate approaches that could be too simple to capture all sources of scanner-specific variations. To address these challenges, we propose a novel multivariate harmonization method called RELIEF (REmoval of Latent Inter-scanner Effects through Factorization) for estimating and removing both explicit and latent scanner effects. Our method is the first approach to introduce the simultaneous dimension reduction and factorization of interlinked matrices to a data harmonization context, which provides a new direction in methodological research for correcting inter-scanner biases. Analyzing diffusion tensor imaging (DTI) data from the Social Processes Initiative in Neurobiology of the Schizophrenia (SPINS) study and conducting extensive simulation studies, we show that RELIEF outperforms existing harmonization methods in mitigating inter-scanner biases and retaining biological associations of interest to increase statistical power. RELIEF is publicly available as an R package.

11.
Drug Discov Today ; 28(9): 103661, 2023 09.
Article in English | MEDLINE | ID: mdl-37301250

ABSTRACT

In data-processing pipelines, upstream steps can influence downstream processes because of their sequential nature. Among these data-processing steps, batch effect (BE) correction (BEC) and missing value imputation (MVI) are crucial for ensuring data suitability for advanced modeling and reducing the likelihood of false discoveries. Although BEC-MVI interactions are not well studied, they are ultimately interdependent. Batch sensitization can improve the quality of MVI. Conversely, accounting for missingness also improves proper BE estimation in BEC. Here, we discuss how BEC and MVI are interconnected and interdependent. We show how batch sensitization can improve any MVI and bring attention to the idea of BE-associated missing values (BEAMs). Finally, we discuss how batch-class imbalance problems can be mitigated by borrowing ideas from machine learning.
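A minimal way to check for BE-associated missing values (BEAMs) is to compare per-batch missing-value rates for a feature; strongly unequal rates suggest that imputation and batch correction should not be designed independently. The simulation below is illustrative, not a method from the article:

```python
import numpy as np

def batch_missing_rates(data, batch):
    """Per-batch missing-value rates for one feature. Large differences
    between batches hint at BE-associated missing values (BEAMs), where
    imputation and batch-effect correction become interdependent."""
    return {int(b): float(np.isnan(data[batch == b]).mean()) for b in np.unique(batch)}

rng = np.random.default_rng(6)
batch = np.repeat([0, 1], 100)
x = rng.normal(10.0, 1.0, 200)
# Simulate a BEAM: values go missing preferentially in batch 1.
x[(batch == 1) & (rng.random(200) < 0.4)] = np.nan
rates = batch_missing_rates(x, batch)
print("missing rate per batch:", rates)
```

A naive imputer fit on the pooled data would here borrow mostly batch-0 values to fill batch-1 gaps, which is exactly the BEC-MVI interaction the review warns about.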


Subject(s)
Electronic Data Processing
12.
BMC Bioinformatics ; 24(1): 182, 2023 May 03.
Article in English | MEDLINE | ID: mdl-37138207

ABSTRACT

Despite the availability of batch effect correction algorithms (BECAs), no comprehensive tool exists for microbiome datasets that combines batch correction with evaluation of the results. This work outlines the development of the Microbiome Batch Effects Correction Suite, which integrates several BECAs and evaluation metrics into a software package for the statistical computing framework R.


Subject(s)
Microbiota , Software , Algorithms
13.
Metabolites ; 13(5)2023 May 16.
Article in English | MEDLINE | ID: mdl-37233706

ABSTRACT

Untargeted metabolomics is an important tool in studying health and disease and is employed in fields such as biomarker discovery and drug development, as well as precision medicine. Although significant technical advances have been made in the field of mass-spectrometry-driven metabolomics, instrumental drifts, such as fluctuations in retention time and signal intensity, remain a challenge, particularly in large untargeted metabolomics studies. It is therefore crucial to account for these variations during data processing to ensure high-quality data. Here, we provide recommendations for an optimal data processing workflow using intrastudy quality control (QC) samples that identifies errors resulting from instrumental drifts, such as shifts in retention time and metabolite intensities. Furthermore, we provide an in-depth comparison of the performance of three popular batch-effect correction methods of different complexity. The performance of the batch-effect correction methods was evaluated using different evaluation metrics based on QC samples and a machine learning approach based on biological samples. The method TIGER demonstrated the best overall performance, reducing the relative standard deviation of the QCs and the dispersion ratio the most and achieving the highest area under the receiver operating characteristic curve with three different probabilistic classifiers (logistic regression, random forest, and support vector machine). In summary, our recommendations will help to generate high-quality data that are suitable for further downstream processing, leading to more accurate and meaningful insights into the underlying biological processes.
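The role of intrastudy QC samples can be illustrated with a toy per-batch median-scaling correction; real methods such as TIGER are far more sophisticated, so this is only a sketch of why repeated QC injections make drift measurable and correctable. The RSD metric matches the one used in the text, while the data and batch structure are assumptions:

```python
import numpy as np

def qc_median_correct(intensities, batch, is_qc):
    """Per-batch intensity correction using pooled QC samples: scale each
    batch so its QC median matches the global QC median. A minimal stand-in
    for the QC-based batch/drift corrections compared in the text."""
    intensities = np.asarray(intensities, dtype=float)
    global_med = np.median(intensities[is_qc])
    out = intensities.copy()
    for b in np.unique(batch):
        m = batch == b
        out[m] *= global_med / np.median(intensities[m & is_qc])
    return out

rng = np.random.default_rng(4)
batch = np.repeat([0, 1, 2], 30)
drift = np.array([1.0, 1.4, 0.7])[batch]          # simulated per-batch signal drift
is_qc = np.tile([True] + [False] * 9, 9)          # every 10th injection is a QC
x = rng.normal(1000.0, 50.0, 90) * drift
corrected = qc_median_correct(x, batch, is_qc)

def rsd(v):  # relative standard deviation (%), the QC metric used in the text
    return v.std() / v.mean() * 100

print(f"QC RSD before: {rsd(x[is_qc]):.1f}%  after: {rsd(corrected[is_qc]):.1f}%")
```

A drop in QC RSD after correction is precisely the kind of QC-based evaluation metric the comparison in the text relies on.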

14.
J Mol Evol ; 91(3): 293-310, 2023 06.
Article in English | MEDLINE | ID: mdl-37237236

ABSTRACT

The phrase "survival of the fittest" has become an iconic descriptor of how natural selection works. And yet, precisely measuring fitness, even for single-celled microbial populations growing in controlled laboratory conditions, remains a challenge. While numerous methods exist to perform these measurements, including recently developed methods utilizing DNA barcodes, all methods are limited in their precision to differentiate strains with small fitness differences. In this study, we rule out some major sources of imprecision, but still find that fitness measurements vary substantially from replicate to replicate. Our data suggest that very subtle and difficult to avoid environmental differences between replicates create systematic variation across fitness measurements. We conclude by discussing how fitness measurements should be interpreted given their extreme environment dependence. This work was inspired by the scientific community who followed us and gave us tips as we live tweeted a high-replicate fitness measurement experiment at #1BigBatch.


Subject(s)
Genetic Fitness , Selection, Genetic
15.
BMC Bioinformatics ; 24(1): 86, 2023 Mar 07.
Article in English | MEDLINE | ID: mdl-36882691

ABSTRACT

BACKGROUND: We developed a novel approach to minimize batch effects when assigning samples to batches. Our algorithm selects a batch allocation, among all possible ways of assigning samples to batches, that minimizes differences in average propensity score between batches. This strategy was compared to randomization and stratified randomization in a case-control study (30 per group) with a covariate (case vs control, represented as β1, set to be null) and two biologically relevant confounding variables (age, represented as β2, and hemoglobin A1c (HbA1c), represented as β3). Gene expression values were obtained from a publicly available dataset of expression data obtained from pancreas islet cells. Batch effects were simulated as twice the median biological variation across the gene expression dataset and were added to the publicly available dataset to simulate a batch effect condition. Bias was calculated as the absolute difference between observed betas under the batch allocation strategies and the true beta (no batch effects). Bias was also evaluated after adjustment for batch effects using ComBat as well as a linear regression model. In order to understand performance of our optimal allocation strategy under the alternative hypothesis, we also evaluated bias at a single gene associated with both age and HbA1c levels in the 'true' dataset (CAPN13 gene). RESULTS: Pre-batch correction, under the null hypothesis (β1), maximum absolute bias and root mean square (RMS) of maximum absolute bias were minimized using the optimal allocation strategy. Under the alternative hypothesis (β2 and β3 for the CAPN13 gene), maximum absolute bias and RMS of maximum absolute bias were also consistently lower using the optimal allocation strategy. ComBat and the regression batch adjustment methods performed well as the bias estimates moved towards the true values in all conditions under both the null and alternative hypotheses.
Although the differences between methods were less pronounced following batch correction, estimates of bias (average and RMS) were consistently lower using the optimal allocation strategy under both the null and alternative hypotheses. CONCLUSIONS: Our algorithm provides an extremely flexible and effective method for assigning samples to batches by exploiting knowledge of covariates prior to sample allocation.
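The allocation idea can be sketched as a search over candidate sample-to-batch assignments that minimizes covariate imbalance. The paper optimizes average propensity scores over all possible allocations, whereas this toy version balances a single covariate over randomly sampled equal-size allocations:

```python
import numpy as np

def best_allocation(covariate, n_batches, n_trials=2000, seed=0):
    """Random search over sample-to-batch allocations, keeping the one that
    minimizes the spread of per-batch covariate means. A simplified stand-in
    for the propensity-score-balancing allocation described in the text."""
    covariate = np.asarray(covariate, dtype=float)
    rng = np.random.default_rng(seed)
    base = np.arange(len(covariate)) % n_batches   # equal batch sizes
    best_alloc, best_spread = None, np.inf
    for _ in range(n_trials):
        alloc = rng.permutation(base)
        means = [covariate[alloc == b].mean() for b in range(n_batches)]
        spread = max(means) - min(means)
        if spread < best_spread:
            best_alloc, best_spread = alloc, spread
    return best_alloc, best_spread

age = np.random.default_rng(5).uniform(30, 70, 60)   # hypothetical covariate
alloc, spread = best_allocation(age, n_batches=3)
print(f"optimized spread of per-batch mean age: {spread:.2f} years")
```

Because the covariates are known before any assay is run, this kind of pre-allocation costs nothing at the bench and reduces the burden placed on downstream corrections such as ComBat.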


Subject(s)
Algorithms , Health Status , Propensity Score , Case-Control Studies , Glycated Hemoglobin , Humans
16.
Methods Mol Biol ; 2426: 197-242, 2023.
Article in English | MEDLINE | ID: mdl-36308691

ABSTRACT

msmsTests is an R/Bioconductor package providing functions for statistical tests on label-free LC-MS/MS data by spectral counts. These functions aim at discovering differentially expressed proteins between two biological conditions. Three tests are available: Poisson GLM regression, quasi-likelihood GLM regression, and the negative binomial test of the edgeR package. The three models admit blocking factors to control for nuisance variables. To ensure a good level of reproducibility, a post-test filter is available, in which (1) a minimum effect size considered biologically relevant and (2) a minimum expression level in the most abundant condition may be set. A companion package, msmsEDA, provides functions to explore datasets based on MS/MS spectral counts. The provided graphics help in identifying outliers and the presence of potential batch factors, and in checking the effects of different normalization strategies. This protocol illustrates the use of both packages on two examples: a purely spike-in experiment of 48 human proteins in a standard yeast cell lysate, and a cancer cell-line secretome dataset requiring a biological normalization.


Subject(s)
Proteomics , Software , Humans , Chromatography, Liquid , Proteomics/methods , Reproducibility of Results , Tandem Mass Spectrometry/methods , Saccharomyces cerevisiae
17.
Statistics (Ber) ; 57(5): 987-1009, 2023.
Article in English | MEDLINE | ID: mdl-38283617

ABSTRACT

The multi-center study design is increasingly used to borrow strength from multiple research groups to obtain broadly applicable and reproducible study findings. Regression analysis is widely used for analyzing multi-group studies; however, in many large-scale collaborative studies, some of the large number of regression predictors are nonlinear and/or measured with batch effects. Also, the group compositions of the nonlinear predictors are potentially heterogeneous across different centers. Conventional pooled data analysis ignores the interplay between nonlinearity and batch effects, group composition heterogeneity, measurement error, and other data incoherence in the multi-center setting, which can cause biased regression estimates and misleading outcomes. In this paper, we propose an integrated partially linear regression model (IPLM)-based analysis to account for predictor nonlinearity, general batch effects, group composition heterogeneity, high-dimensional covariates, potential measurement error in covariates, and combinations of these complexities simultaneously. A local-linear-regression-based approach is employed to estimate the nonlinear component, and a regularization procedure is introduced to identify the predictors' effects, which can be either homogeneous or heterogeneous across groups. In particular, when the effects of all predictors are homogeneous across the study centers, the proposed IPLM automatically reduces to a single parsimonious partially linear model for all centers. The proposed method has asymptotic estimation and variable-selection consistency, including with high-dimensional covariates. Moreover, it has a fast computing algorithm, and its effectiveness is supported by numerical simulation studies. A multi-center Alzheimer's disease research project is used to illustrate the proposed IPLM-based analysis.

18.
J Extracell Biol ; 2(6): e91, 2023 Jun.
Article in English | MEDLINE | ID: mdl-38938917

ABSTRACT

Small RNA (sRNA) profiling of Extracellular Vesicles (EVs) by Next-Generation Sequencing (NGS) often delivers poor outcomes, independently of reagents, platforms or pipelines used, which contributes to poor reproducibility of studies. Here we analysed pre/post-sequencing quality controls (QC) to predict issues potentially biasing biological sRNA-sequencing results from purified human milk EVs, human and mouse EV-enriched plasma and human paraffin-embedded tissues. Although different RNA isolation protocols and NGS platforms were used in these experiments, all datasets had samples characterized by a marked removal of reads after pre-processing. The extent of read loss between individual samples within a dataset did not correlate with isolated RNA quantity or sequenced base quality. Rather, cDNA electropherograms revealed the presence of a constant peak whose intensity correlated with the degree of read loss and, remarkably, with the percentage of adapter dimers, which were found to be overrepresented sequences in high read-loss samples. The analysis through a QC pipeline, which allowed us to monitor quality parameters in a step-by-step manner, provided compelling evidence that adapter dimer contamination was the main factor causing batch effects. We concluded this study by summarising peer-reviewed published workflows that perform consistently well in avoiding adapter dimer contamination towards a greater likelihood of sequencing success.

19.
Front Mol Biosci ; 9: 930204, 2022.
Article in English | MEDLINE | ID: mdl-36438654

ABSTRACT

Untargeted metabolomics studies are unbiased but identifying the same feature across studies is complicated by environmental variation, batch effects, and instrument variability. Ideally, several studies that assay the same set of metabolic features would be used to select recurring features to pursue for identification. Here, we developed an anchored experimental design. This generalizable approach enabled us to integrate three genetic studies consisting of 14 test strains of Caenorhabditis elegans prior to the compound identification process. An anchor strain, PD1074, was included in every sample collection, resulting in a large set of biological replicates of a genetically identical strain that anchored each study. This enables us to estimate treatment effects within each batch and apply straightforward meta-analytic approaches to combine treatment effects across batches without the need for estimation of batch effects and complex normalization strategies. We collected 104 test samples for three genetic studies across six batches to produce five analytical datasets from two complementary technologies commonly used in untargeted metabolomics. Here, we use the model system C. elegans to demonstrate that an augmented design combined with experimental blocks and other metabolomic QC approaches can be used to anchor studies and enable comparisons of stable spectral features across time without the need for compound identification. This approach is generalizable to systems where the same genotype can be assayed in multiple environments and provides biologically relevant features for downstream compound identification efforts. All methods are included in the newest release of the publicly available SECIMTools based on the open-source Galaxy platform.
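The anchor-strain logic, estimating each treatment effect against the in-batch anchor and then combining effects across batches, reduces to standard inverse-variance fixed-effect meta-analysis. The per-batch effect estimates and standard errors below are hypothetical numbers:

```python
import numpy as np

def fixed_effect_meta(effects, ses):
    """Inverse-variance fixed-effect meta-analysis: combine per-batch
    treatment effects (each estimated against the in-batch anchor/control)
    without ever modeling batch effects directly."""
    effects, ses = np.asarray(effects, float), np.asarray(ses, float)
    w = 1.0 / ses**2                       # weight = inverse variance
    pooled = np.sum(w * effects) / np.sum(w)
    pooled_se = np.sqrt(1.0 / np.sum(w))
    return pooled, pooled_se

# Hypothetical per-batch estimates of one feature's strain-vs-anchor effect.
pooled, se = fixed_effect_meta([0.8, 1.1, 0.9], [0.3, 0.4, 0.25])
print(f"pooled effect = {pooled:.3f} +/- {se:.3f}")
```

Because every batch contains the anchor, batch-level offsets cancel inside each per-batch contrast, which is why no explicit batch-effect model or normalization step is needed before pooling.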

20.
Comput Struct Biotechnol J ; 20: 4369-4375, 2022.
Article in English | MEDLINE | ID: mdl-36051874

ABSTRACT

Mass-spectrometry-based proteomics presents some unique challenges for batch effect correction. Batch effects are technical sources of variation that can confound analysis and are usually non-biological in nature. As proteomic analysis involves several stages of data transformation from spectra to proteins, the decision on when and to what to apply batch correction is often unclear. Here, we explore several issues pertinent to batch effect correction. The first involves applications of batch effect correction that require prior knowledge of batch factors, and exploring data to uncover new or unknown batch factors. The second considers recent literature suggesting that there is no single best batch effect correction algorithm; instead of a best approach, one may instead ask what a suitable approach is. The third section considers issues of batch effect detection. Finally, we look at potential developments in proteomics-specific batch effect correction methods and how to perform better functional evaluations on batch-corrected data.
