ABSTRACT
We tested the hypothesis that underrepresented students in active-learning classrooms experience narrower achievement gaps than underrepresented students in traditional lecturing classrooms, averaged across all science, technology, engineering, and mathematics (STEM) fields and courses. We conducted a comprehensive search for both published and unpublished studies that compared the performance of underrepresented students to their overrepresented classmates in active-learning and traditional-lecturing treatments. This search resulted in data on student examination scores from 15 studies (9,238 total students) and data on student failure rates from 26 studies (44,606 total students). Bayesian regression analyses showed that on average, active learning reduced achievement gaps in examination scores by 33% and narrowed gaps in passing rates by 45%. The reported proportion of time that students spend on in-class activities was important, as only classes that implemented high-intensity active learning narrowed achievement gaps. Sensitivity analyses showed that the conclusions are robust to sampling bias and other issues. To explain the extensive variation in efficacy observed among studies, we propose the heads-and-hearts hypothesis, which holds that meaningful reductions in achievement gaps only occur when course designs combine deliberate practice with inclusive teaching. Our results support calls to replace traditional lecturing with evidence-based, active-learning course designs across the STEM disciplines and suggest that innovations in instructional strategies can increase equity in higher education.
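A minimal sketch of what a relative gap reduction means, for readers who want the arithmetic behind a statement such as "reduced achievement gaps by 33%". The function name and the gap values are hypothetical and are not taken from the study, which estimated its effects with Bayesian regression rather than with this simple ratio.

```python
def relative_gap_reduction(gap_traditional: float, gap_active: float) -> float:
    """Fraction by which an achievement gap shrinks under active learning.

    Both gaps must be measured on the same scale (for example, the
    difference in mean exam score between two student groups).
    """
    if gap_traditional == 0:
        raise ValueError("there is no gap to reduce")
    return (gap_traditional - gap_active) / gap_traditional


# Hypothetical illustration only: a gap of 0.60 units under lecturing
# and 0.40 units under active learning.
reduction = relative_gap_reduction(0.60, 0.40)
print(f"relative reduction: {reduction:.0%}")
```

A value of 0 means the gap is unchanged, 1 means it is eliminated, and a negative value means it widened.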
Subject(s)
Achievement , Minority Groups/education , Problem-Based Learning , Educational Measurement , Engineering/education , Humans , Mathematics/education , Science/education , Students , Technology/education , United States , Universities
ABSTRACT
BACKGROUND: Research infrastructures such as biorepositories are essential to facilitate genomics and its growing applications in health research and translational medicine in Africa. Using a cervical cancer cohort, this study describes the establishment of a biorepository consisting of biospecimens and matched phenotype data for use in genomic association analysis and pharmacogenomics research. METHOD: Women aged > 18 years with a recent histologically confirmed cervical cancer diagnosis were recruited. A workflow pipeline was developed to collect, store, and analyse biospecimens, comprising donor recruitment and informed consent, followed by data and biospecimen collection, nucleic acid extraction, storage of genomic DNA, genetic characterization, data integration, data analysis, and data interpretation. The biospecimen and data storage infrastructure included shared -20 °C to -80 °C freezers, lockable cupboards, a secured access-controlled laptop, and password-protected online data storage on OneDrive. Biospecimen and data storage, transfer, and sharing complied with local and international biospecimen and data protection laws and policies, to ensure donor privacy, trust, and benefits for the wider community. RESULTS: This initial establishment of the biorepository recruited 410 women with cervical cancer. The mean (± SD) age of the donors was 52 (± 12) years, comprising stage I (15%), stage II (44%), stage III (47%) and stage IV (6%) disease. The biorepository includes whole blood and corresponding genomic DNA from 311 (75.9%) donors, and tumour biospecimens and corresponding tumour DNA from 258 (62.9%) donors. Datasets included information on sociodemographic characteristics, lifestyle, family history, clinical information, and HPV genotype. Treatment response was followed up for 12 months and covered treatment-induced toxicities, survival vs. mortality, and disease status (disease-free survival, progression, or relapse) 12 months after therapy commencement. CONCLUSION: The current work highlights a framework for developing a cancer genomics cohort-based biorepository on a limited budget. Such a resource plays a central role in advancing genomics research towards the implementation of personalised management of cancer.
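The sample-coverage percentages in the RESULTS can be reproduced from the reported counts. The sketch below uses only numbers stated in the abstract (410 donors, 311 with blood-derived DNA, 258 with tumour DNA); the helper name is hypothetical.

```python
TOTAL_DONORS = 410  # women recruited, as reported in the abstract


def percent_of_cohort(count: int, total: int = TOTAL_DONORS) -> float:
    """Share of the cohort, in percent, rounded to one decimal place."""
    return round(100 * count / total, 1)


blood_dna_pct = percent_of_cohort(311)   # whole blood and genomic DNA
tumour_dna_pct = percent_of_cohort(258)  # tumour biospecimens and tumour DNA
print(blood_dna_pct, tumour_dna_pct)     # prints: 75.9 62.9
```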
Subject(s)
Biomedical Research , Uterine Cervical Neoplasms , Humans , Female , Uterine Cervical Neoplasms/drug therapy , Uterine Cervical Neoplasms/genetics , Pharmacogenetics , Zimbabwe , Neoplasm Recurrence, Local , Biological Specimen Banks , Specimen Handling
ABSTRACT
Preterm birth (PTB) complications are the leading cause of long-term morbidity and mortality in children. By using whole blood samples, we integrated whole-genome sequencing (WGS), RNA sequencing (RNA-seq), and DNA methylation data for 270 PTB and 521 control families. We analyzed this combined dataset to identify genomic variants associated with PTB and performed secondary analyses to identify variants associated with very early PTB (VEPTB) as well as other subcategories of disease that may contribute to PTB. We identified differentially expressed genes (DEGs) and methylated genomic loci and performed expression and methylation quantitative trait loci analyses to link genomic variants to these expression and methylation changes. We performed enrichment tests to identify overlaps between new and known PTB candidate gene systems. We identified 160 significant genomic variants associated with PTB-related phenotypes. The most significant variants, DEGs, and differentially methylated loci were associated with VEPTB. Integration of all data types identified a set of 72 candidate biomarker genes for VEPTB, encompassing new genes and those previously associated with PTB. Notably, PTB-associated genes RAB31 and RBPJ were identified by all three data types (WGS, RNA-seq, and methylation). Pathways associated with VEPTB include EGFR and prolactin signaling pathways, inflammation- and immunity-related pathways, chemokine signaling, IFN-γ signaling, and Notch1 signaling. Progress in identifying molecular components of a complex disease is aided by integrated analyses of multiple molecular data types and clinical data. With these data, and by stratifying PTB by subphenotype, we have identified associations between VEPTB and the underlying biology.
Subject(s)
Genetic Predisposition to Disease/genetics , Premature Birth/genetics , DNA Methylation/genetics , Female , Genomics/methods , Humans , Infant, Newborn , Male , Phenotype , Polymorphism, Single Nucleotide/genetics , Signal Transduction/genetics , Whole Genome Sequencing/methods
ABSTRACT
The biological role of extracellular vesicles (EVs) in diffuse large B-cell lymphoma (DLBCL) initiation and progression remains largely unknown. We characterized EVs secreted by 5 DLBCL cell lines, a primary DLBCL tumor, and a normal control B-cell sample, optimized their purification, and analyzed their content. We found that DLBCLs secreted large quantities of CD63, Alix, TSG101, and CD81 EVs, which can be extracted using an ultracentrifugation-based method and traced by their cell of origin surface markers. We also showed that tumor-derived EVs can be exchanged between lymphoma cells, normal tonsillar cells, and HK stromal cells. We then examined the content of EVs, focusing on isolation of high-quality total RNA. We sequenced the total RNA and analyzed the nature of RNA species, including coding and noncoding RNAs. We compared whole-cell and EV-derived RNA composition in benign and malignant B cells and discovered that transcripts from EVs were involved in many critical cellular functions. Finally, we performed mutational analysis and found that mutations detected in EVs exquisitely represented mutations in the cell of origin. These results enhance our understanding and enable future studies of the role that EVs may play in the pathogenesis of DLBCL, particularly with regard to the exchange of genomic information. The current findings open a new strategy for liquid biopsy approaches in disease monitoring.
Subject(s)
Extracellular Vesicles/metabolism , Lymphoma, Large B-Cell, Diffuse/metabolism , Neoplasm Proteins/metabolism , RNA, Neoplasm/metabolism , Cell Line, Tumor , Extracellular Vesicles/genetics , Extracellular Vesicles/pathology , Humans , Lymphoma, Large B-Cell, Diffuse/genetics , Lymphoma, Large B-Cell, Diffuse/pathology , Neoplasm Proteins/genetics , RNA, Neoplasm/genetics
ABSTRACT
Diffuse large B-cell lymphoma (DLBCL) is the most common aggressive form of non-Hodgkin lymphoma with variable biology and clinical behavior. The current classification does not fully explain the biological and clinical heterogeneity of DLBCLs. In this study, we carried out genomewide DNA methylation profiling of 140 DLBCL samples and 10 normal germinal center B cells using the HpaII tiny fragment enrichment by ligation-mediated polymerase chain reaction assay and hybridization to a custom Roche NimbleGen promoter array. We defined methylation disruption as a main epigenetic event in DLBCLs and designed a method for measuring the methylation variability of individual cases. We then used a novel approach for unsupervised hierarchical clustering based on the extent of DNA methylation variability. This approach identified 6 clusters (A-F). The extent of methylation variability was associated with survival outcomes, with significant differences in overall and progression-free survival. The novel clusters are characterized by disruption of specific biological pathways such as cytokine-mediated signaling, ephrin signaling, and pathways associated with apoptosis and cell-cycle regulation. In a subset of patients, we profiled gene expression and genomic variation to investigate their interplay with methylation changes. This study is the first to identify novel epigenetic clusters of DLBCLs and their aberrantly methylated genes, molecular associations, and survival.
Subject(s)
DNA Methylation/genetics , Epigenesis, Genetic , Gene Expression Regulation, Neoplastic , Genetic Variation/genetics , Lymphoma, Large B-Cell, Diffuse/genetics , Lymphoma, Large B-Cell, Diffuse/mortality , Neoplasm Proteins/genetics , Case-Control Studies , Cells, Cultured , Follow-Up Studies , Humans , Lymphoma, Large B-Cell, Diffuse/classification , Prognosis , Survival Rate
ABSTRACT
Importance: Racially minoritized and socioeconomically disadvantaged populations are currently underrepresented in clinical trials. Data-driven, quantitative analyses and strategies are required to help address this inequity. Objective: To systematically analyze the geographical distribution of self-identified racial and socioeconomic demographics within commuting distance to cancer clinical trial centers and other hospitals in the US. Design, Setting, and Participants: This longitudinal quantitative study used data from the US Census 2020 Decennial and American Community Survey (which collects data from all US residents), OpenStreetMap, National Cancer Institute-designated Cancer Centers list, Nature Index of Cancer Research Health Institutions, National Trial registry, and National Homeland Infrastructure Foundation-Level Data. Statistical analyses were performed on data collected between 2006 and 2020. Main Outcomes and Measures: Population distributions of socioeconomic deprivation indices and self-identified race within 30-, 60-, and 120-minute 1-way driving commute times from US cancer trial sites. Map overlay of high deprivation index and high diversity areas with existing hospitals, existing major cancer trial centers, and commuting distance to the closest cancer trial center. Results: The 78 major US cancer trial centers that are involved in 94% of all US cancer trials and included in this study were found to be located in areas with socioeconomically more affluent populations with higher proportions of self-identified White individuals (+10.1% unpaired mean difference; 95% CI, +6.8% to +13.7%) compared with the national average. 
The top 10th percentile of all US hospitals has catchment populations with a range of absolute sum difference from 2.4% to 35% from one-third each of Asian/multiracial/other (Asian alone, American Indian or Alaska Native alone, Native Hawaiian or Other Pacific Islander alone, some other race alone, population of 2 or more races), Black or African American, and White populations. Currently available data are sufficient to identify diverse census tracts within preset commuting times (30, 60, or 120 minutes) from all hospitals in the US (N = 7623). Maps are presented for each US city with more than 500 000 inhabitants, which display all prospective hospitals and major cancer trial sites within commutable distance to racially diverse and socioeconomically disadvantaged populations. Conclusion and Relevance: This study identified biases in the sociodemographics of populations living within commuting distance to US-based cancer trial sites and enables the determination of more equitably commutable prospective satellite hospital sites that could be mobilized for enhanced racial and socioeconomic representation in clinical trials. The maps generated in this work may inform the design of future clinical trials or investigations in enrollment and retention strategies for clinical trials; however, other recruitment barriers still need to be addressed to ensure racial and socioeconomic demographics within the geographical vicinity of a clinical site can translate to equitable trial participant representation.
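The "absolute sum difference from one-third each" can be read as a distance between a catchment's three-group composition and an even split. The sketch below implements that reading as the sum of absolute deviations from 1/3. This is an assumption about the metric, not a formula quoted from the study, and the function name, group keys, and shares are hypothetical.

```python
def abs_sum_difference(shares: dict[str, float]) -> float:
    """Sum of |share - 1/3| over three population groups.

    0.0 means each group makes up exactly one-third of the catchment;
    larger values mean the composition is further from an even split.
    """
    if len(shares) != 3:
        raise ValueError("expected exactly three groups")
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    target = 1.0 / 3.0
    return sum(abs(share - target) for share in shares.values())


# Hypothetical catchment composition, for illustration only.
hypothetical_catchment = {
    "asian_multiracial_other": 0.30,
    "black_or_african_american": 0.35,
    "white": 0.35,
}
score = abs_sum_difference(hypothetical_catchment)
print(f"absolute sum difference: {score:.1%}")
```

Under this reading, a catchment at the low end of the reported range (2.4%) is close to an even three-way split, while one at the high end (35%) is much further from it.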
Subject(s)
Clinical Trials as Topic , Health Services Accessibility , Neoplasms , Travel , Humans , United States , Travel/statistics & numerical data , Health Services Accessibility/statistics & numerical data , Clinical Trials as Topic/statistics & numerical data , Neoplasms/therapy , Neoplasms/ethnology , Socioeconomic Factors , Time Factors , Cancer Care Facilities/statistics & numerical data , Longitudinal Studies
ABSTRACT
Hepatocellular carcinoma (HCC) is the third leading cause of death from cancer worldwide but is often diagnosed at an advanced incurable stage. Yet, despite the urgent need for blood-based biomarkers for early detection, few studies capture ongoing biology to identify risk-stratifying biomarkers. We address this gap using the TGF-β pathway because of its biological role in liver disease and cancer, established through rigorous animal models and human studies. Using machine learning methods with blood levels of 108 proteomic markers in the TGF-β family, we found a pattern that differentiates HCC from non-HCC in a cohort of 216 patients with cirrhosis, which we refer to as TGF-β based Protein Markers for Early Detection of HCC (TPEARLE) comprising 31 markers. Notably, 20 of the patients with cirrhosis alone presented an HCC-like pattern, suggesting that they may be a group with as yet undetected HCC or at high risk for developing HCC. In addition, we found two other biologically relevant markers, Myostatin and Pyruvate Kinase M2 (PKM2), which were significantly associated with HCC. We tested these for risk stratification of HCC in multivariable models adjusted for demographic and clinical variables, as well as batch and site. These markers reflect ongoing biology in the liver. They potentially indicate the presence of HCC early in its evolution and before it is manifest as a detectable lesion, thereby providing a set of markers that may be able to stratify risk for HCC.
ABSTRACT
Genetic ancestry-oriented cancer research requires the ability to perform accurate and robust genetic ancestry inference from existing cancer-derived data, including whole-exome sequencing, transcriptome sequencing, and targeted gene panels, very often in the absence of matching cancer-free genomic data. Here we examined the feasibility and accuracy of computational inference of genetic ancestry relying exclusively on cancer-derived data. A data synthesis framework was developed to optimize and assess the performance of the ancestry inference for any given input cancer-derived molecular profile. In its core procedure, the ancestral background of the profiled patient is replaced with one of any number of individuals with known ancestry. The data synthesis framework is applicable to multiple profiling platforms, making it possible to assess the performance of inference specifically for a given molecular profile and separately for each continental-level ancestry; this ability extends to all ancestries, including those without statistically sufficient representation in the existing cancer data. The inference procedure was demonstrated to be accurate and robust in a wide range of sequencing depths. Testing of the approach in four representative cancer types and across three molecular profiling modalities showed that continental-level ancestry of patients can be inferred with high accuracy, as quantified by its agreement with the gold standard of deriving ancestry from matching cancer-free molecular data. This study demonstrates that vast amounts of existing cancer-derived molecular data are potentially amenable to ancestry-oriented studies of the disease without requiring matching cancer-free genomes or patient self-reported ancestry. 
SIGNIFICANCE: The development of a computational approach that enables accurate and robust ancestry inference from cancer-derived molecular profiles without matching cancer-free data provides a valuable methodology for genetic ancestry-oriented cancer research.
Subject(s)
Neoplasms , Transcriptome , Humans , Genome, Human , Genomics , Gene Expression Profiling , Polymorphism, Single Nucleotide , Neoplasms/genetics
ABSTRACT
Synthetic lethal interactions (SLIs), genetic interactions in which the simultaneous inactivation of two genes leads to a lethal phenotype, are promising targets for therapeutic intervention in cancer, as exemplified by the recent success of PARP inhibitors in treating BRCA1/2-deficient tumors. We present SL-Cloud, a new component of the Institute for Systems Biology Cancer Gateway in the Cloud (ISB-CGC), that provides an integrated framework of cloud-hosted data resources and curated workflows to enable facile prediction of SLIs. This resource addresses two main challenges related to SLI inference: the need to wrangle and preprocess large multi-omic datasets and the availability of multiple comparable prediction approaches. SL-Cloud enables customizable computational inference of SLIs and testing of prediction approaches across multiple datasets. We anticipate that cancer researchers will find utility in this tool for discovery of SLIs to support further investigation into potential drug targets for anticancer therapies.
Subject(s)
Cloud Computing , Neoplasms , Humans , Neoplasms/genetics , Systems Biology , Multiomics
ABSTRACT
Differential mRNA expression between ancestry groups can be explained by both genetic and environmental factors. We outline a computational workflow to determine the extent to which germline genetic variation explains cancer-specific molecular differences across ancestry groups. Using multi-omics datasets from The Cancer Genome Atlas (TCGA), we enumerate ancestry-informative markers colocalized with cancer-type-specific expression quantitative trait loci (e-QTLs) at ancestry-associated genes. This approach is generalizable to other settings with paired germline genotyping and mRNA expression data for a multi-ethnic cohort. For complete details on the use and execution of this protocol, please refer to Carrot-Zhang et al. (2020), Robertson et al. (2021), and Sayaman et al. (2021).
Subject(s)
Neoplasms , Quantitative Trait Loci , Gene Expression , Germ Cells , Humans , Neoplasms/genetics , Quantitative Trait Loci/genetics , RNA, Messenger
ABSTRACT
Hepatocellular carcinoma (HCC) is the most common primary liver cancer, and its incidence continues to rise in many parts of the world due to a concomitant rise in many associated risk factors, such as alcohol use and obesity. Although early-stage HCC is potentially curable through liver resection, liver-directed therapies, or transplantation, patients usually present with intermediate to advanced disease, which continues to be associated with a poor prognosis. This is because HCC is a cancer with significant complexities, including substantial clinical, histopathologic, and genomic heterogeneity. However, the scientific community has made a major effort to better characterize HCC in these respects through tissue sampling and histological classification, whole-genome sequencing, and the development of viable animal models. These efforts ultimately aim to develop clinically relevant biomarkers and discover molecular targets for new therapies. For example, until recently, there was only one approved systemic therapy for advanced or metastatic HCC, sorafenib. Through these efforts, several additional targeted therapies have gained approval in the United States, although much progress is still needed. This review focuses on the link between the characterization of HCC pathogenesis and current and future HCC management.
ABSTRACT
High-throughput data can be used in conjunction with clinical information to develop predictive models. Automating the process of developing, evaluating and testing such predictive models on different datasets would minimize operator errors and facilitate the comparison of different modeling approaches on the same dataset. Complete automation would also yield unambiguous documentation of the process followed to develop each model. We present the BDVal suite of programs that fully automate the construction of predictive classification models from high-throughput data and generate detailed reports about the model construction process. We have used BDVal to construct models from microarray and proteomics data, as well as from DNA-methylation datasets. The programs are designed for scalability and support the construction of thousands of alternative models from a given dataset and prediction task. AVAILABILITY AND IMPLEMENTATION: The BDVal programs are implemented in Java, provided under the GNU General Public License and freely available at http://bdval.campagnelab.org.
Subject(s)
Computational Biology/methods , Models, Biological , Software , Algorithms , DNA Methylation , Databases, Genetic
ABSTRACT
Cellular and molecular aberrations contribute to the disparity of human cancer incidence and etiology between ancestry groups. Multiomics profiling in The Cancer Genome Atlas (TCGA) allows for querying of the molecular underpinnings of ancestry-specific discrepancies in human cancer. Here, we provide a protocol for integrative associative analysis of ancestry with molecular correlates, including somatic mutations, DNA methylation, mRNA transcription, miRNA transcription, and pathway activity, using TCGA data. This protocol can be generalized to analyze other cancer cohorts and human diseases. For complete details on the use and execution of this protocol, please refer to Carrot-Zhang et al. (2020).
Subject(s)
Genomics/methods , Models, Genetic , Neoplasms/genetics , DNA Methylation/genetics , Databases, Genetic , Female , Humans , Male , MicroRNAs/genetics , Transcription, Genetic/genetics
ABSTRACT
We evaluated ancestry effects on mutation rates, DNA methylation, and mRNA and miRNA expression among 10,678 patients across 33 cancer types from The Cancer Genome Atlas. We demonstrated that cancer subtypes and ancestry-related technical artifacts are important confounders that have been insufficiently accounted for. Once accounted for, ancestry-associated differences spanned all molecular features and hundreds of genes. Biologically significant differences were usually tissue specific but not specific to cancer. However, admixture and pathway analyses suggested some of these differences are causally related to cancer. Specific findings included increased FBXW7 mutations in patients of African origin, decreased VHL and PBRM1 mutations in renal cancer patients of African origin, and decreased immune activity in bladder cancer patients of East Asian origin.
Subject(s)
DNA Methylation , Ethnicity/genetics , Genetic Predisposition to Disease , MicroRNAs/genetics , Mutation , Neoplasm Proteins/genetics , Neoplasms/genetics , DNA-Binding Proteins/genetics , F-Box-WD Repeat-Containing Protein 7/genetics , Gene Expression Regulation, Neoplastic , Genetics, Population , Genome, Human , Genomics , High-Throughput Nucleotide Sequencing , Humans , Neoplasms/ethnology , Neoplasms/pathology , Transcription Factors/genetics , Von Hippel-Lindau Tumor Suppressor Protein/genetics
ABSTRACT
Formaldehyde is a ubiquitous DNA damaging agent, with human exposures occurring from both exogenous and endogenous sources. Formaldehyde exposure can result in multiple types of DNA damage, including DNA-protein crosslinks, and thus is representative of other exposures that induce DNA-protein crosslinks, such as cigarette smoke, automobile exhaust, wood smoke, metals, ionizing radiation, and certain chemotherapeutics. Our objective in this study was to identify the genes necessary to mitigate formaldehyde toxicity following chronic exposure in human cells. We used siRNAs that targeted 320 genes representing all major human DNA repair and damage response pathways to assess cell proliferation following siRNA depletion and subsequent formaldehyde treatment. Three unrelated human cell lines frequently used in genotoxicity studies (SW480, U-2 OS and GM00639) were used to identify common pathways involved in mitigating formaldehyde sensitivity. Although there were gene-specific differences among the cell lines, four inter-related cellular pathways were determined to mitigate formaldehyde toxicity: homologous recombination, DNA double-strand break repair, ionizing radiation response and DNA replication. Additional insight into cell line-specific response patterns was obtained by using a combination of exome sequencing and Cancer Cell Line Encyclopedia genomic data. The results of this DNA damage repair pathway-focused siRNA screen for formaldehyde toxicity in human cells provide a foundation for detailed mechanistic analyses of pathway-specific involvement in the response to environmentally-induced DNA-protein crosslinks and, more broadly, genotoxicity studies using human and other mammalian cell lines.
Subject(s)
DNA Damage , DNA Repair/drug effects , DNA Repair/genetics , Formaldehyde/toxicity , RNA Interference , Cell Line , Cell Proliferation/drug effects , Cell Proliferation/genetics , Genomics , Humans
ABSTRACT
DNA damage repair (DDR) pathways modulate cancer risk, progression, and therapeutic response. We systematically analyzed somatic alterations to provide a comprehensive view of DDR deficiency across 33 cancer types. Mutations with accompanying loss of heterozygosity were observed in over 1/3 of DDR genes, including TP53 and BRCA1/2. Other prevalent alterations included epigenetic silencing of the direct repair genes EXO5, MGMT, and ALKBH3 in ~20% of samples. Homologous recombination deficiency (HRD) was present at varying frequency in many cancer types, most notably ovarian cancer. However, in contrast to ovarian cancer, HRD was associated with worse outcomes in several other cancers. Protein structure-based analyses allowed us to predict functional consequences of rare, recurrent DDR mutations. A new machine-learning-based classifier developed from gene expression data allowed us to identify alterations that phenocopy deleterious TP53 mutations. These frequent DDR gene alterations in many human cancers have functional consequences that may determine cancer progression and guide therapy.
Subject(s)
Genome, Human , Neoplasms/genetics , Recombinational DNA Repair , Cell Line, Tumor , DNA Damage , Gene Silencing , Humans , Loss of Heterozygosity , Machine Learning , Mutation , Neoplasms/classification , Tumor Suppressor Proteins/genetics , Tumor Suppressor Proteins/metabolism
ABSTRACT
Changes in DNA methylation are required for the formation of germinal centers (GCs), but the mechanisms of such changes are poorly understood. Activation-induced cytidine deaminase (AID) has been recently implicated in DNA demethylation through its deaminase activity coupled with DNA repair. We investigated the epigenetic function of AID in vivo in germinal center B cells (GCBs) isolated from wild-type (WT) and AID-deficient (Aicda(-/-)) mice. We determined that the transit of B cells through the GC is associated with marked locus-specific loss of methylation and increased methylation diversity, both of which are lost in Aicda(-/-) animals. Differentially methylated cytosines (DMCs) between GCBs and naive B cells (NBs) are enriched in genes that are targeted for somatic hypermutation (SHM) by AID, and these genes form networks required for B cell development and proliferation. Finally, we observed significant conservation of AID-dependent epigenetic reprogramming between mouse and human B cells.
Subject(s)
B-Lymphocytes/metabolism , Cytidine Deaminase/metabolism , Epigenesis, Genetic , Germinal Center/metabolism , Animals , B-Lymphocytes/cytology , B-Lymphocytes/immunology , Cell Differentiation , Cell Movement , Cell Proliferation , Conserved Sequence , Cytidine Deaminase/genetics , Cytidine Deaminase/immunology , Cytosine/metabolism , DNA Methylation , Germinal Center/cytology , Germinal Center/immunology , Humans , Lymphocyte Activation , Mice , Mice, Inbred BALB C , Mice, Knockout
ABSTRACT
Large biological datasets are being produced at a rapid pace and create substantial storage challenges, particularly in the domain of high-throughput sequencing (HTS). Most approaches currently used to store HTS data are either unable to quickly adapt to the requirements of new sequencing or analysis methods (because they do not support schema evolution), or fail to provide state-of-the-art compression of the datasets. We have devised new approaches to store HTS data that support seamless data schema evolution and compress datasets substantially better than existing approaches. Building on these new approaches, we discuss and demonstrate how a multi-tier data organization can dramatically reduce the storage, computational and network burden of collecting, analyzing, and archiving large sequencing datasets. For instance, we show that spliced RNA-Seq alignments can be stored in less than 4% of the size of a BAM file with perfect data fidelity. Compared to the previous compression state of the art, these methods reduce dataset size by more than 40% when storing exome, gene expression or DNA methylation datasets. The approaches have been integrated in a comprehensive suite of software tools (http://goby.campagnelab.org) that support common analyses for a range of high-throughput sequencing assays.
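The two storage claims use different measures: "less than 4% of the size" is a size ratio, while "more than 40%" is a percent reduction relative to a baseline. A minimal sketch of both calculations follows. The file sizes are hypothetical placeholders, not measurements from the paper, and the helper names are illustrative and not part of the Goby tools.

```python
def size_ratio_percent(new_bytes: int, reference_bytes: int) -> float:
    """Size of the new file as a percentage of the reference file."""
    if reference_bytes <= 0:
        raise ValueError("reference size must be positive")
    return 100.0 * new_bytes / reference_bytes


def percent_reduction(new_bytes: int, reference_bytes: int) -> float:
    """How much smaller the new file is than the reference, in percent."""
    return 100.0 - size_ratio_percent(new_bytes, reference_bytes)


# Hypothetical sizes in bytes, chosen only to illustrate the arithmetic.
bam_size = 10_000_000_000        # a BAM file
compact_size = 380_000_000       # the same alignments in a compact format
previous_best_size = 1_000_000_000
new_method_size = 580_000_000

ratio = size_ratio_percent(compact_size, bam_size)
reduction = percent_reduction(new_method_size, previous_best_size)
print(f"compact file is {ratio:.1f}% of the BAM size")
print(f"new method is {reduction:.1f}% smaller than the previous best")
```

With these placeholder sizes, the first figure falls under the 4% threshold and the second exceeds the 40% threshold, which is the pattern the abstract describes.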