ABSTRACT
Transfer RNAs (tRNAs) are a central component for the biological synthesis of proteins, and they are among the most highly conserved and frequently transcribed genes in all living things. Despite their clear significance for fundamental cellular processes, the forces governing tRNA evolution are poorly understood. We present evidence that transcription-associated mutagenesis and strong purifying selection are key determinants of patterns of sequence variation within and surrounding tRNA genes in humans and diverse model organisms. Remarkably, the mutation rate at broadly expressed cytosolic tRNA loci is likely between 7 and 10 times greater than the nuclear genome average. Furthermore, evolutionary analyses provide strong evidence that tRNA genes, but not their flanking sequences, experience strong purifying selection acting against this elevated mutation rate. We also find a strong correlation between tRNA expression levels and the mutation rates in their immediate flanking regions, suggesting a simple method for estimating individual tRNA gene activity. Collectively, this study illuminates the extreme competing forces in tRNA gene evolution and indicates that mutations at tRNA loci contribute disproportionately to mutational load and have unexplored fitness consequences in human populations.
Subject(s)
Arabidopsis/genetics , Helminth Genes , Plant Genes , Mutation , Helminth RNA/genetics , Plant RNA/genetics , Transfer RNA/genetics , Animals , Drosophila melanogaster , Mice
ABSTRACT
Identification of Alzheimer's disease (AD) onset risk can facilitate interventions before irreversible disease progression. We demonstrate that electronic health records from the University of California, San Francisco, followed by knowledge networks (for example, SPOKE), allow for (1) prediction of AD onset, (2) prioritization of biological hypotheses and (3) contextualization of sex dimorphism. We trained random forest models and predicted AD onset on a cohort of 749 individuals with AD and 250,545 controls with a mean area under the receiver operating characteristic curve of 0.72 (7 years prior) to 0.81 (1 day prior). We further harnessed matched cohort models to identify conditions with predictive power before AD onset. Knowledge networks highlight shared genes between multiple top predictors and AD (for example, APOE, ACTB, IL6 and INS). Genetic colocalization analysis supports AD association with hyperlipidemia at the APOE locus, as well as a stronger female AD association with osteoporosis at a locus near MS4A6A. We therefore show how clinical data can be utilized for early AD prediction and identification of personalized biological hypotheses.
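As an illustration of the performance metric quoted above, the following is a minimal sketch of how an area under the receiver operating characteristic curve can be computed from predicted risk scores. It uses the rank-based (Mann-Whitney) formulation and only the Python standard library. The function name `auroc` and the toy labels and scores are hypothetical and are not taken from the study, which may have computed the metric differently.

```python
def auroc(labels, scores):
    """Rank-based AUROC: the probability that a randomly chosen case
    receives a higher score than a randomly chosen control.
    Ties between a case and a control count as half a win."""
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical toy data: 1 = later AD diagnosis, 0 = control.
labels = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
example_auc = auroc(labels, scores)
print(f"AUROC = {example_auc:.3f}")
```

The pairwise loop is quadratic in cohort size, so it is suitable only for illustration; a cohort of the size described above would call for a sort-based implementation.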
Subject(s)
Alzheimer Disease , Male , Humans , Female , Alzheimer Disease/diagnosis , Electronic Health Records , Apolipoproteins E/genetics , San Francisco
ABSTRACT
Background: Preterm birth (PTB) is the leading cause of infant mortality and follows multiple biological pathways, many of which are poorly understood. Some PTBs result from medically indicated labor following complications from hypertension and/or diabetes, while many others are spontaneous with unknown causes. Previously, investigation of potential risk factors has been limited by lack of data on maternal medical history and the difficulty of classifying PTBs as indicated or spontaneous. Here, we leverage electronic health record (EHR) data (patient health information including demographics, diagnoses, and medications) and a supplemental curated pregnancy database to overcome these limitations. Novel associations may provide new insight into the pathophysiology of PTB as well as help identify individuals who would be at risk of PTB. Methods: We quantified associations between maternal diagnoses and preterm birth using logistic regression controlling for maternal age and socioeconomic factors within a University of California, San Francisco (UCSF), EHR cohort with 10,643 births (n_term = 9,692, n_spontaneous_preterm = 449, n_indicated_preterm = 418) and maternal pre-conception diagnosis phenotypes derived from International Classification of Diseases (ICD) 9 and 10 codes. Results: Eighteen conditions were significantly and robustly (false discovery rate (FDR) < 0.05) associated with PTBs compared to term births. We discovered known (hypertension, diabetes, and chronic kidney disease) and less established (blood, cardiac, gynecological, and liver conditions) associations. Type 1 diabetes was the most significant overall association (adjusted p = 1.6 × 10⁻¹⁴, adjusted OR = 7 (95% CI 5, 12)), and the odds ratios for the significant phenotypes ranged from 3 to 13. We further carried out analysis stratified by spontaneous vs. indicated PTB. No phenotypes were significantly associated with spontaneous PTB; however, the results for indicated PTB largely recapitulated the phenotype associations with all PTBs. Conclusions: Our study underscores the limitations of approaches that combine indicated and spontaneous births. When combined, significant associations were almost entirely driven by indicated PTBs, although our spontaneous and indicated groups were of a similar size. Investigating the spontaneous population has the potential to reveal new pathways and understanding of the heterogeneity of PTB.
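To make the false discovery rate threshold in the Results concrete, here is a minimal sketch of a multiple-testing adjustment. It assumes the Benjamini-Hochberg procedure, which the abstract does not name, and the p-values below are hypothetical placeholders rather than study results.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per diagnosis phenotype.
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)
significant = [q < 0.05 for q in adjusted]
print(adjusted, significant)
```

With this toy input, the third and fourth phenotypes have raw p-values below 0.05 but do not pass the adjusted threshold, which is the kind of filtering an FDR cutoff performs across many tested diagnoses.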
ABSTRACT
Despite the occurrence of wildfires quadrupling over the past four decades, the health effects associated with wildfire smoke exposures during pregnancy remain unknown. Particulate matter less than 2.5 µm in diameter (PM2.5) is among the major pollutants emitted in wildfire smoke. Previous studies found PM2.5 associated with lower birthweight; however, the relationship between wildfire-specific PM2.5 and birthweight is uncertain. Our study of 7923 singleton births in San Francisco between January 1, 2017 and March 12, 2020 examines associations between wildfire smoke exposure during pregnancy and birthweight. We linked daily estimates of wildfire-specific PM2.5 to maternal residence at the ZIP code level. We used linear and log-binomial regression to examine the relationship between wildfire smoke exposure by trimester and birthweight and adjusted for gestational age, maternal age, race/ethnicity, and educational attainment. We stratified by infant sex to examine potential effect modification. Exposure to wildfire-specific PM2.5 during the second trimester of pregnancy was associated with increased risk of large for gestational age (OR = 1.13; 95% CI: 1.03, 1.24), as was the number of days of wildfire-specific PM2.5 above 5 µg/m³ in the second trimester (OR = 1.03; 95% CI: 1.01, 1.06). We found consistent results for wildfire smoke exposure in the second trimester and increased continuous birthweight-for-gestational age z-score. Differences by infant sex were not consistent. Counter to our hypothesis, results suggest that wildfire smoke exposures are associated with increased risk for higher birthweight. We observed the strongest associations during the second trimester. These investigations should be expanded to other populations exposed to wildfire smoke and aim to identify vulnerable communities. Additional research is needed to clarify the biological mechanisms in this relationship between wildfire smoke exposure and adverse birth outcomes.
ABSTRACT
Recurrent pregnancy loss (RPL), defined as 2 or more pregnancy losses, affects 5-6% of ever-pregnant individuals. Approximately half of these cases have no identifiable explanation. To generate hypotheses about RPL etiologies, we implemented a case-control study comparing the history of over 1,600 diagnoses between RPL and live-birth patients, leveraging the University of California San Francisco (UCSF) and Stanford University electronic health record databases. In total, our study included 8,496 RPL (UCSF: 3,840, Stanford: 4,656) and 53,278 Control (UCSF: 17,259, Stanford: 36,019) patients. Menstrual abnormalities and infertility-associated diagnoses were significantly positively associated with RPL in both medical centers. Age-stratified analysis revealed that the majority of RPL-associated diagnoses had higher odds ratios for patients younger than 35 than for patients aged 35 or older. While Stanford results were sensitive to controlling for healthcare utilization, UCSF results were stable across analyses with and without utilization. Intersecting significant results between medical centers was an effective filter to identify associations that are robust across center-specific utilization patterns.
ABSTRACT
Aggressive breast cancers portend a poor prognosis, but current polygenic risk scores (PRSs) for breast cancer do not reliably predict aggressive cancers. Aggressiveness can be effectively recapitulated using tumor gene expression profiling. Thus, we sought to develop a PRS for the risk of recurrence score weighted on proliferation (ROR-P), an established prognostic signature. Using 2363 breast cancers with tumor gene expression data and single nucleotide polymorphism (SNP) genotypes, we examined the associations between ROR-P and known breast cancer susceptibility SNPs using linear regression models. We constructed PRSs based on varying p-value thresholds and selected the optimal PRS based on model r² in 5-fold cross-validation. We then used Cox proportional hazards regression to test the ROR-P PRS's association with breast cancer-specific survival in two independent cohorts totaling 10,196 breast cancers and 785 events. In meta-analysis of these cohorts, higher ROR-P PRS was associated with worse survival, HR per SD = 1.13 (95% CI 1.06-1.21, p = 4.0 × 10⁻⁴). The ROR-P PRS had a similar magnitude of effect on survival as a comparator PRS for estrogen receptor (ER)-negative versus positive cancer risk (PRS_ER-/ER+). Furthermore, its effect was minimally attenuated when adjusted for PRS_ER-/ER+, suggesting that the ROR-P PRS provides additional prognostic information beyond ER status. In summary, we used integrated analysis of germline SNP and tumor gene expression data to construct a PRS associated with aggressive tumor biology and worse survival. These findings could potentially enhance risk stratification for breast cancer screening and prevention.
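The construction of a PRS at a given p-value threshold can be sketched as a weighted sum of allele dosages. This is a generic illustration under stated assumptions, not the authors' pipeline: the SNP identifiers, effect sizes, p-values and dosages are hypothetical, and the weights are assumed to be per-SNP regression coefficients.

```python
def polygenic_risk_score(dosages, effects, p_values, p_threshold):
    """Sum dosage * effect size over SNPs whose association p-value
    passes the threshold and that are genotyped in this individual."""
    return sum(
        dosages[snp] * effects[snp]
        for snp in effects
        if p_values[snp] <= p_threshold and snp in dosages
    )


# Hypothetical per-SNP effect sizes and association p-values.
effects = {"snp_a": 0.10, "snp_b": -0.05, "snp_c": 0.20}
p_values = {"snp_a": 0.001, "snp_b": 0.04, "snp_c": 0.30}
# Hypothetical allele dosages (0, 1 or 2 copies) for one individual.
dosages = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

# A stricter threshold keeps fewer SNPs in the score.
strict_score = polygenic_risk_score(dosages, effects, p_values, 0.01)
lenient_score = polygenic_risk_score(dosages, effects, p_values, 0.05)
print(strict_score, lenient_score)
```

Repeating this over a grid of thresholds yields one candidate score per threshold, from which a single score can then be chosen by a criterion such as cross-validated model fit.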
ABSTRACT
Although prematurity is the single largest cause of death in children under 5 years of age, the current definition of prematurity, based on gestational age, lacks the precision needed for guiding care decisions. Here, we propose a longitudinal risk assessment for adverse neonatal outcomes in newborns based on a deep learning model that uses electronic health records (EHRs) to predict a wide range of outcomes over a period starting shortly before conception and ending months after birth. By linking the EHRs of the Lucile Packard Children's Hospital and the Stanford Healthcare Adult Hospital, we developed a cohort of 22,104 mother-newborn dyads delivered between 2014 and 2018. Maternal and newborn EHRs were extracted and used to train a multi-input multitask deep learning model, featuring a long short-term memory neural network, to predict 24 different neonatal outcomes. An additional cohort of 10,250 mother-newborn dyads delivered at the same Stanford Hospitals from 2019 to September 2020 was used to validate the model. Areas under the receiver operating characteristic curve at delivery exceeded 0.9 for 10 of the 24 neonatal outcomes considered and were between 0.8 and 0.9 for 7 additional outcomes. Moreover, comprehensive association analysis identified multiple known associations between various maternal and neonatal features and specific neonatal outcomes. This study used linked EHRs from more than 30,000 mother-newborn dyads and would serve as a resource for the investigation and prediction of neonatal outcomes. An interactive website is available for independent investigators to leverage this unique dataset: https://maternal-child-health-associations.shinyapps.io/shiny_app/.
Subject(s)
Infant Health , Premature Infant , Adult , Child , Newborn Infant , Humans , Preschool Child , Gestational Age , Morbidity , Risk Assessment
ABSTRACT
OBJECTIVES: Over the past few years, challenges from the pandemic have led to an explosion of data sharing and algorithmic development efforts in the areas of molecular measurements, clinical data, and digital health. We aim to characterize and describe recent advanced computational approaches in translational bioinformatics across these domains in the context of issues or progress related to equity and inclusion. METHODS: We conducted a literature assessment of the trends and approaches in translational bioinformatics in the past few years. RESULTS: We present a review of recent computational approaches across molecular, clinical, and digital realms. We discuss applications of phenotyping, disease subtype characterization, predictive modeling, biomarker discovery, and treatment selection. We consider these methods and applications through the lens of equity and inclusion in biomedicine. CONCLUSION: Equity and inclusion should be incorporated at every step of translational bioinformatics projects, including project design, data collection, model creation, and clinical implementation. These considerations, coupled with the exciting breakthroughs in big data and machine learning, are pivotal to reach the goals of precision medicine for all.
Subject(s)
Biomedical Research , Precision Medicine , Computational Biology , Big Data , Machine Learning
ABSTRACT
BACKGROUND: The reproducibility of gene expression measured by RNA sequencing (RNA-Seq) is dependent on the sequencing depth. While unmapped or non-exonic reads do not contribute to gene expression quantification, duplicate reads contribute to the quantification but are not informative for reproducibility. We show that mapped, exonic, non-duplicate (MEND) reads are a useful measure of reproducibility of RNA-Seq datasets used for gene expression analysis. FINDINGS: In bulk RNA-Seq datasets from 2,179 tumors in 48 cohorts, the fraction of reads that contribute to the reproducibility of gene expression analysis varies greatly. Unmapped reads constitute 1-77% of all reads (median [IQR], 3% [3-6%]); duplicate reads constitute 3-100% of mapped reads (median [IQR], 27% [13-43%]); and non-exonic reads constitute 4-97% of mapped, non-duplicate reads (median [IQR], 25% [16-37%]). MEND reads constitute 0-79% of total reads (median [IQR], 50% [30-61%]). CONCLUSIONS: Because not all reads in an RNA-Seq dataset are informative for reproducibility of gene expression measurements and the fraction of reads that are informative varies, we propose reporting a dataset's sequencing depth in MEND reads, which definitively inform the reproducibility of gene expression, rather than total, mapped, or exonic reads. We provide a Docker image containing (i) the existing required tools (RSeQC, sambamba, and samblaster) and (ii) a custom script to calculate MEND reads from RNA-Seq data files. We recommend that all RNA-Seq gene expression experiments, sensitivity studies, and depth recommendations use MEND units for sequencing depth.
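The four percentages reported in the FINDINGS use different denominators, which the following minimal sketch makes explicit. It is not the custom script shipped in the Docker image; the function name `read_fractions` and the read counts are hypothetical, and the counts are assumed to come from upstream alignment and duplicate-marking tools.

```python
def read_fractions(total, mapped, mapped_nondup, mend):
    """Compute each read category as a fraction of its parent category.

    total         : all sequenced reads
    mapped        : reads that aligned to the reference
    mapped_nondup : mapped reads left after removing duplicates
    mend          : mapped, non-duplicate reads that fall in exons
    """
    if total <= 0 or not (0 <= mend <= mapped_nondup <= mapped <= total):
        raise ValueError("counts must be nested: mend <= nondup <= mapped <= total")

    def frac(part, whole):
        # Guard against an empty parent category.
        return part / whole if whole else 0.0

    return {
        "unmapped_of_total": frac(total - mapped, total),
        "duplicate_of_mapped": frac(mapped - mapped_nondup, mapped),
        "non_exonic_of_mapped_nondup": frac(mapped_nondup - mend, mapped_nondup),
        "mend_of_total": frac(mend, total),
    }


# Hypothetical counts for a single RNA-Seq sample.
example = read_fractions(
    total=100_000_000,
    mapped=95_000_000,
    mapped_nondup=70_000_000,
    mend=50_000_000,
)
for name, value in example.items():
    print(f"{name}: {value:.1%}")
```

Because each intermediate fraction has a different denominator, the first three values do not sum with the MEND fraction to 100%; only `mend_of_total` is expressed relative to all sequenced reads.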