ABSTRACT
Accurate indel calling plays an important role in precision medicine, and a benchmarking indel set is essential for thoroughly evaluating the indel calling performance of bioinformatics pipelines. A reference sample with a set of known-positive variants was developed in the FDA-led Sequencing Quality Control Phase 2 (SEQC2) project, but the known-positive set contained only a limited number of indels. This project sought to provide an enriched set of known indels that would be more translationally relevant by focusing on additional cancer-related regions. A thorough manual review process completed by 42 reviewers, two advisors, and a judging panel of three researchers significantly enriched the known indel set by an additional 516 indels. The extended benchmarking indel set spans a wide range of variant allele frequencies (VAFs), with 87% of the indels having a VAF below 20% in reference Sample A. Reference Sample A and the indel set can therefore be used for comprehensive benchmarking of indel calling across a wider range of VAFs, particularly at the low end. Indel length also varied, but the majority of indels were shorter than 10 base pairs (bp). Most of the indels were within coding regions, with the remainder in gene regulatory regions. Although high confidence can be derived from the robust study design and meticulous human review, this extended indel set has not undergone orthogonal validation. The extended benchmarking indel set, along with the indels in the previously published known-positive set, was the truth set used to benchmark indel calling pipelines in a community challenge hosted on the precisionFDA platform. This benchmarking indel set and the reference samples can be used for a comprehensive evaluation of indel calling pipelines. Additionally, the insights and solutions obtained during the manual review process can aid in improving the performance of these pipelines.
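The VAF and indel-length summaries quoted above can be illustrated with a minimal sketch. It assumes each indel is described by its reference and alternate alleles plus read counts; the `IndelCall` class, the `fraction_below` helper, and the four example calls are hypothetical illustrations and are not taken from the SEQC2 data or the authors' pipeline.

```python
from dataclasses import dataclass


@dataclass
class IndelCall:
    """Hypothetical record for one indel call (not the study's data model)."""
    ref: str        # reference allele
    alt: str        # alternate allele
    alt_reads: int  # reads supporting the alternate allele
    depth: int      # total reads covering the site

    @property
    def vaf(self) -> float:
        # Variant allele frequency: alternate-supporting reads / total depth.
        return self.alt_reads / self.depth if self.depth else 0.0

    @property
    def length(self) -> int:
        # Indel length in base pairs: difference in allele lengths.
        return abs(len(self.alt) - len(self.ref))


def fraction_below(calls, vaf_cutoff=0.20):
    """Share of calls whose VAF is strictly below the cutoff."""
    if not calls:
        return 0.0
    return sum(c.vaf < vaf_cutoff for c in calls) / len(calls)


# Made-up example calls, for illustration only.
calls = [
    IndelCall("A", "ATG", 5, 100),    # 2 bp insertion
    IndelCall("CTT", "C", 30, 100),   # 2 bp deletion
    IndelCall("G", "GA", 12, 80),     # 1 bp insertion
    IndelCall("TAAAA", "T", 9, 90),   # 4 bp deletion
]

low_vaf_share = fraction_below(calls)
short_share = sum(c.length < 10 for c in calls) / len(calls)
print(f"VAF < 20%: {low_vaf_share:.0%}; length < 10 bp: {short_share:.0%}")
```

The same two ratios, computed over the full truth set, would correspond to the "87% below 20% VAF" and "majority under 10 bp" figures reported in the abstract.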
Subjects
Benchmarking, High-Throughput Nucleotide Sequencing, Humans, Computational Biology, Quality Control, INDEL Mutation, Polymorphism, Single Nucleotide
ABSTRACT
BACKGROUND: Structural brain imaging metrics and gene expression biomarkers have previously been used for Alzheimer's disease (AD) diagnosis and prognosis, but none of these studies explored the integration of imaging and gene expression biomarkers for predicting mild cognitive impairment (MCI)-to-AD conversion 1-2 years into the future. OBJECTIVE: We investigated the advantages of combining gene expression and structural brain imaging features for predicting MCI-to-AD conversion. Selection of the differentially expressed genes (DEGs) for classifying cognitively normal (CN) controls and AD patients was benchmarked against previously reported results. METHODS: The current work proposes integrating brain imaging and blood gene expression data from two public datasets (ADNI and ANM) to predict MCI-to-AD conversion. A novel pipeline for combining gene expression data from multiple platforms is proposed and evaluated in the two independent patient cohorts. RESULTS: Combining DEGs and imaging biomarkers for predicting MCI-to-AD conversion yielded a receiver operating characteristic (ROC) area under the curve (AUC) of 0.832-0.876, which exceeded the 0.808-0.840 AUC obtained using the imaging features alone. Using only three DEGs, the CN versus AD predictive model achieved cross-validation AUCs of 0.718, 0.858, and 0.873 for the ADNI, ANM1, and ANM2 datasets, respectively. CONCLUSION: For the first time, we show that combining gene expression and imaging biomarkers yields better predictive performance than using imaging metrics alone. A novel pipeline for combining gene expression data from multiple platforms is proposed and shown to produce consistent results in the two independent patient cohorts. Using improved feature selection, we show that predictive models with fewer gene expression probes can achieve competitive performance.
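As an illustration of the ROC AUC metric used to compare the models above, the following is a minimal sketch that computes AUC as the probability that a randomly chosen converter scores higher than a randomly chosen non-converter (ties counted as one half). The `roc_auc` function, the labels, and both score lists are hypothetical toy values; they are not the study's data, models, or results.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 (1 = converted to AD in this toy example)
    scores: predicted risk scores, higher meaning more likely to convert
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy scores for six hypothetical MCI subjects (illustration only).
labels = [1, 1, 1, 0, 0, 0]
imaging_only = [0.9, 0.6, 0.4, 0.5, 0.3, 0.2]
combined = [0.9, 0.7, 0.6, 0.5, 0.3, 0.2]

auc_imaging = roc_auc(labels, imaging_only)
auc_combined = roc_auc(labels, combined)
print(f"imaging only: {auc_imaging:.3f}, combined: {auc_combined:.3f}")
```

In this toy setup the combined scores rank every converter above every non-converter, so their AUC is higher than that of the imaging-only scores; the abstract reports the analogous comparison on the real cohorts.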
Subjects
Alzheimer Disease, Cognitive Dysfunction, Alzheimer Disease/diagnostic imaging, Alzheimer Disease/genetics, Biomarkers, Brain/diagnostic imaging, Cognitive Dysfunction/diagnostic imaging, Cognitive Dysfunction/genetics, Disease Progression, Gene Expression, Humans, Magnetic Resonance Imaging/methods
ABSTRACT
BACKGROUND: Cytosine modifications in DNA such as 5-methylcytosine (5mC) underlie a broad range of developmental processes, maintain cellular lineage specification, and can define or stratify types of cancer and other diseases. However, the wide variety of approaches available to interrogate these modifications has created a need for harmonized materials, methods, and rigorous benchmarking to improve genome-wide methylome sequencing applications in clinical and basic research. Here, we present a multi-platform assessment and cross-validated resource for epigenetics research from the FDA's Epigenomics Quality Control Group. RESULTS: Each sample is processed in multiple replicates by three whole-genome bisulfite sequencing (WGBS) protocols (TruSeq DNA methylation, Accel-NGS MethylSeq, and SPLAT), oxidative bisulfite sequencing (TrueMethyl), an enzymatic deamination method (EMSeq), targeted methylation sequencing (Illumina Methyl Capture EPIC), single-molecule long-read nanopore sequencing from Oxford Nanopore Technologies, and 850k Illumina methylation arrays. After rigorous quality assessment, comparison to Illumina EPIC methylation microarrays, and testing on a range of algorithms (Bismark, BitMapperBS, and bwa-meth), we find overall high concordance between assays, but also differences in efficiency of read mapping, CpG capture, coverage, and platform performance, as well as variable performance across 26 microarray normalization algorithms. CONCLUSIONS: The data provided herein can guide the use of these DNA reference materials in epigenomics research, as well as provide best practices for experimental design in future studies. By leveraging seven human cell lines that are designated as publicly available reference materials, these data can be used as a baseline to advance epigenomics research.
Subjects
Epigenesis, Genetic, Epigenomics/methods, Quality Control, 5-Methylcytosine, Algorithms, CpG Islands, DNA/genetics, DNA Methylation, Epigenome, Genome, Human, High-Throughput Nucleotide Sequencing, Humans, Sequence Alignment, Sequence Analysis, DNA/methods, Sulfites, Whole Genome Sequencing/methods
ABSTRACT
Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene expression analysis.