ABSTRACT
Here we demonstrate that the large language model GPT-4 can accurately annotate cell types using marker gene information in single-cell RNA sequencing analysis. When evaluated across hundreds of tissue and cell types, GPT-4 generates cell type annotations exhibiting strong concordance with manual annotations. This capability can considerably reduce the effort and expertise required for cell type annotation. Additionally, we have developed an R software package GPTCelltype for GPT-4's automated cell type annotation.
Subjects
Single-Cell Gene Expression Analysis , Software , Animals , Humans , Mice , Molecular Sequence Annotation/methods , RNA-Seq/methods , Single-Cell Gene Expression Analysis/methods
ABSTRACT
PD-1 blockade unleashes CD8 T cells [1], including those specific for mutation-associated neoantigens (MANA), but factors in the tumour microenvironment can inhibit these T cell responses. Single-cell transcriptomics have revealed global T cell dysfunction programs in tumour-infiltrating lymphocytes (TIL). However, the majority of TIL do not recognize tumour antigens [2], and little is known about transcriptional programs of MANA-specific TIL. Here, we identify MANA-specific T cell clones using the MANA functional expansion of specific T cells assay [3] in neoadjuvant anti-PD-1-treated non-small cell lung cancers (NSCLC). We use their T cell receptors as a 'barcode' to track and analyse their transcriptional programs in the tumour microenvironment using coupled single-cell RNA sequencing and T cell receptor sequencing. We find both MANA- and virus-specific clones in TIL, regardless of response, and MANA-, influenza- and Epstein-Barr virus-specific TIL each have unique transcriptional programs. Despite exposure to cognate antigen, MANA-specific TIL express an incompletely activated cytolytic program. MANA-specific CD8 T cells have hallmark transcriptional programs of tissue-resident memory (TRM) cells, but low levels of interleukin-7 receptor (IL-7R) and are functionally less responsive to interleukin-7 (IL-7) compared with influenza-specific TRM cells. Compared with those from responding tumours, MANA-specific clones from non-responding tumours express T cell receptors with markedly lower ligand-dependent signalling, are largely confined to HOBIT^high TRM subsets, and coordinately upregulate checkpoints, killer inhibitory receptors and inhibitors of T cell activation. These findings provide important insights for overcoming resistance to PD-1 blockade.
Subjects
Neoplasm Antigens/immunology , Non-Small-Cell Lung Carcinoma/drug therapy , Gene Expression Regulation , Immune Checkpoint Inhibitors/therapeutic use , Lung Neoplasms/drug therapy , Lung Neoplasms/immunology , Tumor-Infiltrating Lymphocytes/immunology , Tumor-Infiltrating Lymphocytes/metabolism , Neoplasm Antigens/genetics , CD8-Positive T-Lymphocytes/immunology , Non-Small-Cell Lung Carcinoma/genetics , Non-Small-Cell Lung Carcinoma/immunology , Cultured Cells , Humans , Immunologic Memory , Lung Neoplasms/genetics , Programmed Cell Death 1 Receptor/antagonists & inhibitors , RNA-Seq , Interleukin-7 Receptors/immunology , Single-Cell Analysis , Transcriptome/genetics , Tumor Microenvironment
ABSTRACT
SUMMARY: In the exploratory data analysis of single-cell or spatial genomic data, single cells or spatial spots are often visualized using a two-dimensional plot where cell clusters or spot clusters are marked with different colors. With tens of clusters, current visualization methods often assign visually similar colors to spatially neighboring clusters, making it hard to distinguish between clusters. To address this issue, we developed Palo, which optimizes the color palette assignment for single-cell and spatial data in a spatially aware manner. Palo identifies pairs of clusters that are spatially neighboring to each other and assigns visually distinct colors to those neighboring pairs. We demonstrate that Palo leads to improved visualization in real single-cell and spatial genomic datasets. AVAILABILITY AND IMPLEMENTATION: The Palo R package is freely available at GitHub (https://github.com/Winnie09/Palo) and Zenodo (https://doi.org/10.5281/zenodo.6562505). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
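The idea of assigning visually distinct colors to spatially neighboring clusters can be illustrated with a minimal Python sketch. This is not Palo's algorithm or API (Palo is an R package): the function name `assign_colors`, the nearest-centroid definition of "neighboring", the Euclidean RGB distance, and the brute-force search are all assumptions chosen for illustration, and the search is only practical for a handful of clusters.

```python
from itertools import permutations
from math import dist


def assign_colors(centroids, palette, n_neighbors=1):
    """Hypothetical sketch: pick the color assignment that maximizes the
    smallest color distance between spatially neighboring clusters.

    centroids: dict mapping cluster name -> (x, y) position
    palette:   list of (r, g, b) tuples, at least one per cluster
    """
    names = sorted(centroids)

    # Neighboring pairs: each cluster paired with its nearest cluster(s).
    pairs = set()
    for a in names:
        others = [b for b in names if b != a]
        others.sort(key=lambda b: dist(centroids[a], centroids[b]))
        for b in others[:n_neighbors]:
            pairs.add(tuple(sorted((a, b))))

    # Exhaustively try every assignment (assumption: few clusters).
    best, best_score = None, -1.0
    for perm in permutations(palette, len(names)):
        colors = dict(zip(names, perm))
        score = min((dist(colors[a], colors[b]) for a, b in pairs), default=0.0)
        if score > best_score:
            best, best_score = colors, score
    return best, pairs


# Toy example: B sits between A and C, so B should get the color that
# differs most from the other two.
centroids = {"A": (0, 0), "B": (1, 0), "C": (10, 0)}
palette = [(255, 0, 0), (250, 10, 0), (0, 0, 255)]
colors, pairs = assign_colors(centroids, palette)
```

In this toy input the two reds are nearly indistinguishable, so the sketch gives blue to the cluster that neighbors both of the others. A perceptual color space would be a more faithful distance than raw RGB.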
Subjects
Genomics , Software , Genome
ABSTRACT
BACKGROUND: Maternal prenatal smoking is known to alter offspring DNA methylation (DNAm). However, there are no effective interventions to mitigate smoking-induced DNAm alteration. OBJECTIVES: This study investigated whether 1-carbon nutrients (folate and vitamins B6 and B12) can protect against prenatal smoking-induced offspring DNAm alterations in the aryl hydrocarbon receptor repressor (AHRR) (cg05575921), GFI1 (cg09935388), and CYP1A1 (cg05549655) genes. METHODS: This study included mother-newborn dyads from a racially diverse US birth cohort. The cord blood DNAm at the above 3 sites was derived from a previous study using the Illumina Infinium MethylationEPIC BeadChip. Maternal smoking was assessed by self-report and plasma biomarkers (hydroxycotinine and cotinine). Maternal plasma folate and vitamin B6 and B12 concentrations were obtained shortly after delivery. Linear regressions, Bayesian kernel machine regression, and quantile g-computation were applied to test the study hypothesis by adjusting for covariables and multiple testing. RESULTS: The study included 834 mother-newborn dyads (16.7% of newborns exposed to maternal smoking). DNAm at cg05575921 (AHRR) and at cg09935388 (GFI1) was inversely associated with maternal smoking biomarkers in a dose-response fashion (all P < 7.01 × 10^-13). In contrast, cg05549655 (CYP1A1) was positively associated with maternal smoking biomarkers (P < 2.4 × 10^-6). Folate concentrations only affected DNAm levels at cg05575921 (AHRR, P = 0.014). Regression analyses showed that compared with offspring with low hydroxycotinine exposure (<0.494) and adequate maternal folate concentrations (quartiles 2-4), an offspring with high hydroxycotinine exposure (≥0.494) and low folate concentrations (quartile 1) had a significant reduction in DNAm at cg05575921 (M-value, β ± SE = -0.801 ± 0.117, P = 1.44 × 10^-11), whereas adequate folate concentrations could cut smoking-induced hypomethylation by almost half. Exposure mixture models further supported the protective role of adequate folate concentrations against smoking-induced aryl hydrocarbon receptor repressor (AHRR) hypomethylation. CONCLUSIONS: This study found that adequate maternal folate can attenuate maternal smoking-induced offspring AHRR cg05575921 hypomethylation, which has been previously linked to a range of pediatric and adult diseases.
Subjects
DNA Methylation , Aryl Hydrocarbon Receptors , Adult , Pregnancy , Female , Humans , Newborn Infant , Child , Aryl Hydrocarbon Receptors/genetics , Folic Acid , Micronutrients , Cytochrome P-450 CYP1A1/genetics , Bayes Theorem , Basic Helix-Loop-Helix Transcription Factors/genetics , Basic Helix-Loop-Helix Transcription Factors/metabolism , Repressor Proteins/genetics , Repressor Proteins/metabolism , Smoking , Vitamins , Biomarkers
ABSTRACT
BACKGROUND/OBJECTIVES: Low-level, in-utero exposure to toxic metals such as lead (Pb) and mercury (Hg) is widespread in the US and worldwide; and, individually, was found to be obesogenic in children. To address the literature gaps on the health effects of co-exposure to low-level toxic metals and the lack of intervention strategy, we aimed to investigate the association between in-utero co-exposure to Hg, Pb, cadmium (Cd) and childhood overweight or obesity (OWO) and whether adequate maternal micronutrients (selenium (Se) and folate) can be protective. SUBJECTS/METHODS: This study included 1442 mother-child pairs from the Boston Birth Cohort, a predominantly urban, low-income, Black, and Hispanic population, who were enrolled at birth and followed prospectively up to age 15 years. Bayesian kernel machine regression (BKMR) was applied to estimate individual and joint effects of exposures to metals and micronutrients on childhood OWO while adjusting for pertinent covariables. Stratified analyses by maternal OWO and micronutrient status were performed to identify sensitive subgroups. RESULTS: In this sample of understudied US children, low-level in-utero co-exposure to Hg, Pb, and Cd was widespread. Besides individual positive associations of maternal Hg and Pb exposure with offspring OWO, BKMR clearly indicated a positive dose-response association between in-utero co-exposure to the three toxic metals and childhood OWO. Notably, the metal mixture-OWO association was more pronounced in children born to mothers with OWO; and in such a setting, the association was greatly attenuated if mothers had higher Se and folate levels. CONCLUSIONS: In this prospective cohort of US children at high-risk of toxic metal exposure and OWO, we demonstrated that among children born to mothers with OWO, low-level in-utero co-exposure to Hg, Pb, and Cd increased the risk of childhood OWO; and that adequate maternal Se and folate levels mitigated the risk of childhood OWO. 
CLINICAL TRIAL REGISTRY NUMBER AND WEBSITE WHERE IT WAS OBTAINED: NCT03228875.
Subjects
Metals , Micronutrients , Pediatric Obesity , Prenatal Exposure Delayed Effects , Adolescent , Bayes Theorem , Cadmium/toxicity , Child , Preschool Child , Female , Folic Acid , Humans , Infant , Newborn Infant , Lead/toxicity , Mercury/toxicity , Metals/toxicity , Overweight/epidemiology , Pediatric Obesity/epidemiology , Pregnancy , Prenatal Exposure Delayed Effects/epidemiology , Prospective Studies
ABSTRACT
It is known that many driver nodes are required to control complex biological networks. Previous studies imply that O(N) driver nodes are required in both linear complex network and Boolean network models with N nodes if an arbitrary state is specified as the target. In order to cope with this intrinsic difficulty, we consider a special case of the control problem in which the targets are restricted to attractors. For this special case, we mathematically prove under the uniform distribution of states in basins that the expected number of driver nodes is only O(log_2 N + log_2 M) for controlling Boolean networks, where M is the number of attractors. Since it is expected that M is not very large in many practical networks, the new model requires a much smaller number of driver nodes. This result is based on discovery of novel relationships between control problems on Boolean networks and the coupon collector's problem, a well-known concept in combinatorics. We also provide lower bounds on the number of driver nodes as well as simulation results using artificial and realistic network data, which support our theoretical findings.
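For readers unfamiliar with the coupon collector's problem mentioned above, the following minimal Python sketch computes the classical expected number of uniform random draws needed to see all n coupon types, n(1 + 1/2 + ... + 1/n), and checks it by simulation. It illustrates only the textbook problem; it does not reproduce the paper's proof or its mapping to driver nodes, and the function names are chosen here for illustration.

```python
from fractions import Fraction
import random


def expected_draws(n):
    """Exact expected number of uniform draws to collect all n coupon
    types: n * H_n, where H_n is the n-th harmonic number."""
    return n * sum(Fraction(1, k) for k in range(1, n + 1))


def simulate_draws(n, trials=2000, seed=0):
    """Monte Carlo estimate of the same quantity."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        seen = set()
        draws = 0
        while len(seen) < n:
            seen.add(rng.randrange(n))  # draw one coupon uniformly
            draws += 1
        total += draws
    return total / trials


exact = expected_draws(10)        # grows like n * ln(n)
estimate = simulate_draws(10)     # should be close to float(exact)
```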
Subjects
Biological Models , Theoretical Models , Algorithms , Systems Biology/methods
ABSTRACT
BACKGROUND: Abnormalities in glycan biosynthesis have been conclusively related to various diseases, whereas the complexity of the glycosylation process has impeded the quantitative analysis of biochemical experimental data for the identification of glycoforms contributing to disease. To overcome this limitation, the automatic construction of glycosylation reaction networks in silico is a critical step. RESULTS: In this paper, a framework K2014 is developed to automatically construct N-glycosylation networks in MATLAB with the involvement of the 27 most-known enzyme reaction rules of 22 enzymes, as an extension of previous model KB2005. A toolbox named Glycosylation Network Analysis Toolbox (GNAT) is applied to define network properties systematically, including linkages, stereochemical specificity and reaction conditions of enzymes. Our network shows a strong ability to predict a wider range of glycans produced by the enzymes encountered in the Golgi Apparatus in human cell expression systems. CONCLUSIONS: Our results demonstrate a better understanding of the underlying glycosylation process and the potential of systems glycobiology tools for analyzing conventional biochemical or mass spectrometry-based experimental data quantitatively in a more realistic and practical way.
Subjects
Biosynthetic Pathways , Computer Simulation , Glycomics/methods , Biological Models , Polysaccharides/biosynthesis , Glycosylation , Humans , Hydrolases/metabolism , Mass Spectrometry , Transferases/metabolism
ABSTRACT
Image classification plays a pivotal role in analyzing biomedical images, serving as a cornerstone for both biological research and clinical diagnostics. We demonstrate that large multimodal models (LMMs), like GPT-4, excel in one-shot learning, generalization, interpretability, and text-driven image classification across diverse biomedical tasks. These tasks include the classification of tissues, cell types, cellular states, and disease status. LMMs stand out from traditional single-modal classification approaches, which often require large training datasets and offer limited interpretability.
ABSTRACT
Generative Pre-trained Transformers (GPT) are powerful language models that have great potential to transform biomedical research. However, they are known to suffer from artificial hallucinations and provide false answers that are seemingly correct in some situations. We developed GeneTuring, a comprehensive QA database with 600 questions in genomics, and manually scored 10,800 answers returned by six GPT models, including GPT-3, ChatGPT, and New Bing. New Bing has the best overall performance and significantly reduces the level of AI hallucination compared to other models, thanks to its ability to recognize its incapacity in answering questions. We argue that improving incapacity awareness is equally important as improving model accuracy to address AI hallucination.
ABSTRACT
Cell type annotation is an essential step in single-cell RNA-seq analysis. However, it is a time-consuming process that often requires expertise in collecting canonical marker genes and manually annotating cell types. Automated cell type annotation methods typically require the acquisition of high-quality reference datasets and the development of additional pipelines. We assessed the performance of GPT-4, a highly potent large language model, for cell type annotation, and demonstrated that it can automatically and accurately annotate cell types by utilizing marker gene information generated from standard single-cell RNA-seq analysis pipelines. Evaluated across hundreds of tissue types and cell types, GPT-4 generates cell type annotations exhibiting strong concordance with manual annotations and has the potential to considerably reduce the effort and expertise needed in cell type annotation. We also developed GPTCelltype, an open-source R software package to facilitate cell type annotation by GPT-4.
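As a rough illustration of how marker gene lists from a standard differential-expression step might be turned into a text query for a language model, here is a minimal Python sketch. It is not the GPTCelltype implementation (GPTCelltype is an R package): the function name `build_annotation_prompt`, the prompt wording, and the `top_n` parameter are hypothetical, and the sketch stops at building the prompt string without calling any model API.

```python
def build_annotation_prompt(markers_by_cluster, tissue, top_n=10):
    """Hypothetical helper: format the top marker genes of each cluster
    into a single prompt asking for one cell type per cluster.

    markers_by_cluster: dict mapping cluster id -> list of marker genes,
                        already ordered from strongest to weakest.
    """
    header = (
        f"Identify the cell type of each {tissue} cell cluster from its "
        "marker genes. Give one cell type name per line, in order."
    )
    lines = [header]
    for cluster, genes in markers_by_cluster.items():
        # Keep only the top_n strongest markers for this cluster.
        lines.append(f"{cluster}: {', '.join(genes[:top_n])}")
    return "\n".join(lines)


# Toy marker lists (illustrative input, not data from the study).
prompt = build_annotation_prompt(
    {"0": ["CD3D", "CD3E", "IL7R"], "1": ["MS4A1", "CD79A"]},
    tissue="human PBMC",
    top_n=2,
)
```

The returned string would then be sent to the model, and each line of the reply mapped back to its cluster.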
ABSTRACT
Cell type annotation is an essential step in single-cell RNA-seq analysis. However, it is a time-consuming process that often requires expertise in collecting canonical marker genes and manually annotating cell types. Automated cell type annotation methods typically require the acquisition of high-quality reference datasets and the development of additional pipelines. We demonstrate that GPT-4, a highly potent large language model, can automatically and accurately annotate cell types by utilizing marker gene information generated from standard single-cell RNA-seq analysis pipelines. Evaluated across hundreds of tissue types and cell types, GPT-4 generates cell type annotations exhibiting strong concordance with manual annotations, and has the potential to considerably reduce the effort and expertise needed in cell type annotation.
ABSTRACT
When analyzing data from in situ RNA detection technologies, cell segmentation is an essential step in identifying cell boundaries, assigning RNA reads to cells, and studying the gene expression and morphological features of cells. We developed a deep-learning-based method, GeneSegNet, that integrates both gene expression and imaging information to perform cell segmentation. GeneSegNet also employs a recursive training strategy to deal with noisy training labels. We show that GeneSegNet significantly improves cell segmentation performances over existing methods that either ignore gene expression information or underutilize imaging information.
Subjects
Deep Learning , X-Ray Computed Tomography , RNA , Gene Expression , Computer-Assisted Image Processing/methods
ABSTRACT
Pseudotime analysis with single-cell RNA-sequencing (scRNA-seq) data has been widely used to study dynamic gene regulatory programs along continuous biological processes. While many methods have been developed to infer the pseudotemporal trajectories of cells within a biological sample, it remains a challenge to compare pseudotemporal patterns with multiple samples (or replicates) across different experimental conditions. Here, we introduce Lamian, a comprehensive and statistically-rigorous computational framework for differential multi-sample pseudotime analysis. Lamian can be used to identify changes in a biological process associated with sample covariates, such as different biological conditions while adjusting for batch effects, and to detect changes in gene expression, cell density, and topology of a pseudotemporal trajectory. Unlike existing methods that ignore sample variability, Lamian draws statistical inference after accounting for cross-sample variability and hence substantially reduces sample-specific false discoveries that are not generalizable to new samples. Using both real scRNA-seq and simulation data, including an analysis of differential immune response programs between COVID-19 patients with different disease severity levels, we demonstrate the advantages of Lamian in decoding cellular gene expression programs in continuous biological processes.
Subjects
Gene Expression Profiling , Single-Cell Gene Expression Analysis , Humans , Gene Expression Profiling/methods , Single-Cell Analysis/methods , RNA Sequence Analysis/methods , Computer Simulation
ABSTRACT
The diversity of genetic programs and cellular plasticity of glioma-associated myeloid cells, and thus their contribution to tumor growth and immune evasion, is poorly understood. We performed single-cell RNA-sequencing of immune and tumor cells from 33 glioma patients of varying tumor grades. We identified two populations characteristic of myeloid-derived suppressor cells (MDSC), unique to glioblastoma (GBM) and absent in grades II and III tumors: i) an early progenitor population (E-MDSC) characterized by strong upregulation of multiple catabolic, anabolic, oxidative stress, and hypoxia pathways typically observed within tumor cells themselves, and ii) a monocytic MDSC (M-MDSC) population. The E-MDSCs geographically co-localize with a subset of highly metabolic glioma stem-like tumor cells with a mesenchymal program in the pseudopalisading region, a pathognomonic feature of GBMs associated with poor prognosis. Ligand-receptor interaction analysis revealed symbiotic cross-talk between the stem-like tumor cells and E-MDSCs in GBM, whereby glioma stem cells produce chemokines attracting E-MDSCs, which in turn produce growth and survival factors for the tumor cells. Our large-scale single-cell analysis elucidated unique MDSC populations as key facilitators of GBM progression and mediators of tumor immunosuppression, suggesting that targeting these specific myeloid compartments, including their metabolic programs, may be a promising therapeutic intervention in this deadly cancer. One-Sentence Summary: Aggressive glioblastoma harbors two unique myeloid populations capable of promoting stem-like properties of tumor cells and suppressing T cell function in the tumor microenvironment.
ABSTRACT
Regulatory T cells (Treg) are conventionally viewed as suppressors of endogenous and therapy-induced antitumor immunity; however, their role in modulating responses to immune checkpoint blockade (ICB) is unclear. In this study, we integrated single-cell RNA-seq/T cell receptor sequencing (TCRseq) of >73,000 tumor-infiltrating Treg (TIL-Treg) from anti-PD-1-treated and treatment-naive non-small cell lung cancers (NSCLC) with single-cell analysis of tumor-associated antigen (TAA)-specific Treg derived from a murine tumor model. We identified 10 subsets of human TIL-Treg, most of which have high concordance with murine TIL-Treg subsets. Only one subset selectively expresses high levels of TNFRSF4 (OX40) and TNFRSF18 (GITR), whose engagement by cognate ligand mediated proliferative programs and NF-κB activation, as well as multiple genes involved in Treg suppression, including LAG3. Functionally, the OX40^hi GITR^hi subset is the most highly suppressive ex vivo, and its higher representation among total TIL-Treg correlated with resistance to PD-1 blockade. Unexpectedly, in the murine tumor model, we found that virtually all TIL-Treg-expressing T cell receptors that are specific for TAA fully develop a distinct TH1-like signature over a 2-week period after entry into the tumor, down-regulating FoxP3 and up-regulating expression of TBX21 (Tbet), IFNG, and certain proinflammatory granzymes. Transfer learning of a gene score from the murine TAA-specific TH1-like Treg subset to the human single-cell dataset revealed a highly analogous subcluster that was enriched in anti-PD-1-responding tumors. These findings demonstrate that TIL-Treg partition into multiple distinct transcriptionally defined subsets with potentially opposing effects on ICB-induced antitumor immunity and suggest that TAA-specific TIL-Treg may positively contribute to antitumor responses.
Subjects
Non-Small-Cell Lung Carcinoma , Lung Neoplasms , Humans , Animals , Mice , Lung Neoplasms/genetics , Granzymes , Signal Transduction , Single-Cell Analysis
ABSTRACT
Visualizing low-dimensional representations with scatterplots is a crucial step in analyzing single-cell genomic data. However, this visualization has significant biases. The first bias arises when visualizing the gene expression levels or the cell identities. The scatterplot only shows a subset of cells plotted last, and the cells plotted earlier are masked and unseen. The second bias arises when comparing the cell-type compositions across samples. The scatterplot is biased by the unbalanced total number of cells across samples. We developed SCUBI, an unbiased method that visualizes the aggregated information of cells within non-overlapping squares to address the first bias and visualizes the differences of cell proportions across samples to address the second bias. We show that SCUBI presents a more faithful visual representation of the information in a real single-cell RNA sequencing (RNA-seq) dataset and has the potential to change how low-dimensional representations are visualized in single-cell genomic data.
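The two ideas described above, aggregating cells within non-overlapping squares and comparing per-sample cell proportions rather than raw counts, can be sketched in a few lines of Python. This is an illustrative reading of the abstract, not SCUBI's code or API: the function names, the use of a simple mean as the aggregate, and the square side length parameter `side` are assumptions.

```python
from collections import defaultdict
from math import floor


def aggregate_in_squares(coords, values, side):
    """Mean value (e.g. expression of one gene) of the cells falling in
    each non-overlapping square of width `side` on the 2D embedding."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for (x, y), v in zip(coords, values):
        key = (floor(x / side), floor(y / side))
        sums[key] += v
        counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}


def square_proportions(coords, side):
    """Fraction of one sample's cells that fall in each square."""
    counts = defaultdict(int)
    for x, y in coords:
        counts[(floor(x / side), floor(y / side))] += 1
    return {k: c / len(coords) for k, c in counts.items()}


def proportion_difference(coords_a, coords_b, side):
    """Per-square difference in cell proportions between two samples,
    which does not depend on the samples' total cell numbers."""
    pa = square_proportions(coords_a, side)
    pb = square_proportions(coords_b, side)
    return {k: pa.get(k, 0.0) - pb.get(k, 0.0) for k in set(pa) | set(pb)}


# Toy data: three cells, two of which share a square.
means = aggregate_in_squares(
    [(0.1, 0.1), (0.4, 0.2), (1.5, 0.5)], [1.0, 3.0, 5.0], side=1.0
)

# Sample A has 4 cells and sample B has 10, but both are split evenly
# between the same two squares, so the proportion difference is zero.
sample_a = [(0.1, 0.1), (0.2, 0.2), (1.5, 0.5), (1.6, 0.5)]
sample_b = [(0.5, 0.5)] * 5 + [(1.5, 0.5)] * 5
diff = proportion_difference(sample_a, sample_b, side=1.0)
```

Every cell contributes to its square's aggregate, so the result does not depend on plotting order, and the proportion comparison removes the effect of unequal sample sizes.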
Subjects
Gene Expression Profiling , Single-Cell Analysis , Gene Expression Profiling/methods , RNA Sequence Analysis/methods , Single-Cell Analysis/methods , Genomics , Bias
ABSTRACT
Background: Most studies on the association of in utero exposure to cigarette smoking and childhood overweight or obesity (OWO) were based on maternal self-reported smoking status, and few were based on objective biomarkers. The concordance of self-report smoking, and maternal and cord blood biomarkers of cigarette smoking as well as their effects on children's long-term risk of overweight and obesity are unclear. Methods: In this study, we analyzed data from 2351 mother-child pairs in the Boston Birth Cohort, a sample of US predominantly Black, indigenous, and people of color (BIPOC) that enrolled children at birth and followed prospectively up to age 18 years. In utero smoking exposure was measured by maternal self-report and by maternal and cord plasma biomarkers of smoking: cotinine and hydroxycotinine. We assessed the individual and joint associations of each smoking exposure measure and maternal OWO with childhood OWO using multinomial logistic regressions. We used nested logistic regressions to investigate the childhood OWO prediction performance when adding maternal and cord plasma biomarkers as input covariates on top of self-reported data. Results: Our results demonstrated that in utero cigarette smoking exposure defined by self-report and by maternal or cord metabolites was consistently associated with increased risk of long-term child OWO. Children with cord hydroxycotinine in the fourth quartile (vs. first quartile) had 1.66 (95% confidence interval [CI] 1.03-2.66) times the odds for overweight and 1.57 (95% CI 1.05-2.36) times the odds for obesity. The combined effect of maternal OWO and smoking on offspring risk of obesity is 3.66 (95% CI 2.37-5.67) if using self-reported smoking. Adding maternal and cord plasma biomarker information to self-reported data improved the prediction accuracy of long-term child OWO risk. 
Conclusions: This longitudinal birth cohort study of US BIPOC underscored the role of maternal smoking as an obesogen for offspring OWO risk. Our findings call for public health intervention strategies to focus on maternal smoking - as a highly modifiable target, including smoking cessation and countermeasures (such as optimal nutrition) that may alleviate the increasing obesity burden in the United States and globally.
ABSTRACT
Pseudotime analysis with single-cell RNA-sequencing (scRNA-seq) data has been widely used to study dynamic gene regulatory programs along continuous biological processes. While many computational methods have been developed to infer the pseudo-temporal trajectories of cells within a biological sample, methods that compare pseudo-temporal patterns with multiple samples (or replicates) across different experimental conditions are lacking. Lamian is a comprehensive and statistically-rigorous computational framework for differential multi-sample pseudotime analysis. It can be used to identify changes in a biological process associated with sample covariates, such as different biological conditions, and also to detect changes in gene expression, cell density, and topology of a pseudotemporal trajectory. Unlike existing methods that ignore sample variability, Lamian draws statistical inference after accounting for cross-sample variability and hence substantially reduces sample-specific false discoveries that are not generalizable to new samples. Using both simulations and real scRNA-seq data, including an analysis of differential immune response programs between COVID-19 patients with different disease severity levels, we demonstrate the advantages of Lamian in decoding cellular gene expression programs in continuous biological processes.
ABSTRACT
BACKGROUND: Maternal smoking affects more than half a million pregnancies each year in the US and is known to result in fetal growth restriction as measured by lower birthweight and its associated long-term consequences. Maternal smoking also has been linked to altered fetal DNA methylation (DNAm). However, what remains largely unexplored is whether these DNAm alterations are merely markers of smoking exposure or if they also have implications for health outcomes. This study tested the hypothesis that fetal DNAm mediates the effect of maternal smoking on newborn birthweight. METHODS: This study included mother-newborn pairs from a US predominantly urban, low-income multi-ethnic birth cohort. DNAm in cord blood was determined using the Illumina Infinium MethylationEPIC BeadChip. After standard quality control and normalization procedures, an epigenome-wide association study (EWAS) of maternal smoking was performed using linear regression models, controlling for maternal age, education, race, parity, pre-pregnancy body mass index, alcohol consumption, gestational age, maternal pregestational/gestational diabetes, child sex, cord blood cell compositions and batch effects. To quantify the degree to which cord DNAm mediates the smoking-birthweight association, the VanderWeele-Vansteelandt approach for a single mediator and a structural equation model for multiple mediators were used, adjusting for pertinent covariates. RESULTS: The study included 954 mother-newborn pairs. Among mothers, 165 (17.3%) ever smoked before or during pregnancy. Newborns with smoking exposure had on average 258 g lower birthweight than newborns without exposure (P < 0.001). Using a false discovery rate (FDR) < 0.05 as the significance cutoff, the EWAS identified 38 differentially methylated CpG sites associated with maternal smoking. Of those, 17 CpG sites were mapped to previously reported genes: GFI1, AHRR, CYP1A1, and CNTNAP2; 8 of those, located in the first three genes, were Bonferroni significantly associated with newborn birthweight and mediated the smoking-birthweight association. The combined mediation effect of the three genes explained 67.8% of the smoking-birthweight association. CONCLUSIONS: Our study not only lends further support that maternal smoking alters fetal DNAm in a multiethnic population, but also suggests that fetal DNAm substantially mediates the maternal smoking-birthweight association. Our findings, if further validated, indicate that DNAm modification is likely an important pathway by which maternal smoking impairs fetal growth and, perhaps, even long-term health outcomes.