ABSTRACT
INTRODUCTION: Whole Exome Sequencing (WES) has emerged as an efficient tool in clinical cancer diagnostics to broaden the scope from panel-based diagnostics to screening of all genes and enabling robust determination of complex biomarkers in a single analysis. METHODS: To assess concordance, six formalin-fixed paraffin-embedded (FFPE) tissue specimens and four commercial reference standards were analyzed by WES as matched tumor-normal DNA at 21 NGS centers in Germany, each employing local wet-lab and bioinformatics. Somatic and germline variants, copy-number alterations (CNAs), and complex biomarkers were investigated. Somatic variant calling was performed in 494 diagnostically relevant cancer genes. The raw data were collected and re-analyzed with a central bioinformatic pipeline to separate wet- and dry-lab variability. RESULTS: The mean positive percentage agreement (PPA) of somatic variant calling was 76 % while the positive predictive value (PPV) was 89 % in relation to a consensus list of variants found by at least five centers. Variant filtering was identified as the main cause for divergent variant calls. Adjusting filter criteria and re-analysis increased the PPA to 88 % for all and 97 % for the clinically relevant variants. CNA calls were concordant for 82 % of genomic regions. Homologous recombination deficiency (HRD), tumor mutational burden (TMB), and microsatellite instability (MSI) status were concordant for 94 %, 93 %, and 93 % of calls, respectively. Variability of CNAs and complex biomarkers did not decrease considerably after harmonization of the bioinformatic processing and was hence attributed mainly to wet-lab differences. CONCLUSION: Continuous optimization of bioinformatic workflows and participating in round robin tests are recommended.
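The agreement metrics reported above can be illustrated with a minimal sketch. It assumes that each center's somatic calls are available as a set of variant identifiers, that the consensus list is the set of variants reported by at least five centers, and that PPA and PPV are computed per center against that list. The function names, the center names, and the toy variant labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def consensus_variants(calls_by_center, min_centers=5):
    """Return the variants reported by at least `min_centers` centers."""
    counts = Counter(
        variant
        for calls in calls_by_center.values()
        for variant in set(calls)
    )
    return {variant for variant, n in counts.items() if n >= min_centers}


def ppa_ppv(calls, consensus):
    """Return (PPA, PPV) of one center's calls relative to the consensus set.

    PPA = consensus variants found by the center / all consensus variants
    PPV = consensus variants found by the center / all variants the center called
    """
    calls = set(calls)
    true_positives = len(calls & consensus)
    ppa = true_positives / len(consensus) if consensus else float("nan")
    ppv = true_positives / len(calls) if calls else float("nan")
    return ppa, ppv


# Hypothetical toy data: six centers, four candidate variants.
calls_by_center = {
    "center_1": {"v1", "v2", "v3"},
    "center_2": {"v1", "v2"},
    "center_3": {"v1", "v2", "v4"},
    "center_4": {"v1", "v2"},
    "center_5": {"v1", "v3"},
    "center_6": {"v1", "v2", "v3"},
}

consensus = consensus_variants(calls_by_center, min_centers=5)
per_center = {
    name: ppa_ppv(calls, consensus) for name, calls in calls_by_center.items()
}
mean_ppa = sum(ppa for ppa, _ in per_center.values()) / len(per_center)
mean_ppv = sum(ppv for _, ppv in per_center.values()) / len(per_center)
```

In this toy example only v1 and v2 reach the five-center threshold, so a center that misses v2 loses PPA and a center that reports an extra variant loses PPV.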
Subject(s)
Benchmarking , DNA Copy Number Variations , Exome Sequencing , Neoplasms , Precision Medicine , Humans , Exome Sequencing/methods , Germany , Precision Medicine/methods , Precision Medicine/standards , Neoplasms/genetics , Biomarkers, Tumor/genetics , Computational Biology/methods
ABSTRACT
BACKGROUND: A major obstacle faced by families with rare diseases is obtaining a genetic diagnosis. The average "diagnostic odyssey" lasts over five years and causal variants are identified in under 50%, even when capturing variants genome-wide. To aid in the interpretation and prioritization of the vast number of variants detected, computational methods are proliferating. Which tools are most effective remains unclear. To evaluate the performance of computational methods, and to encourage innovation in method development, we designed a Critical Assessment of Genome Interpretation (CAGI) community challenge to place variant prioritization models head-to-head in a real-life clinical diagnostic setting. METHODS: We utilized genome sequencing (GS) data from families sequenced in the Rare Genomes Project (RGP), a direct-to-participant research study on the utility of GS for rare disease diagnosis and gene discovery. Challenge predictors were provided with a dataset of variant calls and phenotype terms from 175 RGP individuals (65 families), including 35 solved training set families with causal variants specified, and 30 unlabeled test set families (14 solved, 16 unsolved). We tasked teams to identify causal variants in as many families as possible. Predictors submitted variant predictions with estimated probability of causal relationship (EPCR) values. Model performance was determined by two metrics: a weighted score based on the rank position of causal variants, and the maximum F-measure, based on precision and recall of causal variants across all EPCR values. RESULTS: Sixteen teams submitted predictions from 52 models, some with manual review incorporated. Top performers recalled causal variants in up to 13 of 14 solved families within the top 5 ranked variants. Newly discovered diagnostic variants were returned to two previously unsolved families following confirmatory RNA sequencing, and two novel disease gene candidates were entered into Matchmaker Exchange. In one example, RNA sequencing demonstrated aberrant splicing due to a deep intronic indel in ASNS, identified in trans with a frameshift variant in an unsolved proband with phenotypes consistent with asparagine synthetase deficiency. CONCLUSIONS: Model methodology and performance were highly variable. Models weighing call quality, allele frequency, predicted deleteriousness, segregation, and phenotype were effective in identifying causal variants, and models open to phenotype expansion and non-coding variants were able to capture more difficult diagnoses and discover new diagnoses. Overall, computational models can significantly aid variant prioritization. For use in diagnostics, detailed review and conservative assessment of prioritized variants against established criteria are needed.
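The "within the top 5 ranked variants" result can be read as a top-k recall over families. The sketch below is one possible reading, not the challenge's scoring code: it assumes that predictions are ranked by descending EPCR and that a family counts as recalled when at least one of its causal variants appears among the k highest-ranked predictions. The function name, family labels, and variant labels are hypothetical.

```python
def families_recalled_in_top_k(predictions, causal, k=5):
    """Count families with a causal variant among the k highest-EPCR predictions.

    predictions: {family_id: [(variant_id, epcr), ...]}
    causal:      {family_id: set of causal variant_ids}
    Assumption: a single causal variant in the top k is enough to count a family.
    """
    recalled = 0
    for family, truth in causal.items():
        ranked = sorted(
            predictions.get(family, []),
            key=lambda pair: pair[1],
            reverse=True,  # highest EPCR first
        )
        top_k = {variant for variant, _ in ranked[:k]}
        if top_k & truth:
            recalled += 1
    return recalled


# Hypothetical toy data for two solved families.
predictions = {
    "fam_A": [("a1", 0.9), ("a2", 0.4), ("a3", 0.1)],
    "fam_B": [("b1", 0.2), ("b2", 0.8), ("b3", 0.5)],
}
causal = {"fam_A": {"a1"}, "fam_B": {"b1"}}

recalled_top2 = families_recalled_in_top_k(predictions, causal, k=2)
recalled_top3 = families_recalled_in_top_k(predictions, causal, k=3)
```

With k=2 only fam_A is recalled, because the causal variant of fam_B is ranked third; widening the window to k=3 recalls both families.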
Subject(s)
Rare Diseases , Humans , Rare Diseases/genetics , Rare Diseases/diagnosis , Genome, Human/genetics , Genetic Variation/genetics , Computational Biology/methods , Phenotype
ABSTRACT
Small blue round cell sarcomas (SBRCSs) are a heterogeneous group of tumors with overlapping morphologic features but markedly varying prognosis. They are characterized by distinct chromosomal alterations, particularly rearrangements leading to gene fusions, whose detection currently represents the most reliable diagnostic marker. Ewing sarcomas are the most common SBRCSs, defined by gene fusions involving EWSR1 and transcription factors of the ETS family, and the most frequent non-EWSR1-rearranged SBRCSs harbor a CIC rearrangement. Unfortunately, identifying CIC::DUX4 translocation events, the most common CIC rearrangement, remains challenging. Here, we present a machine-learning approach to support SBRCS diagnosis that relies on gene expression profiles measured via targeted sequencing. The analyses on a curated cohort of 69 soft-tissue tumors showed markedly distinct expression patterns for SBRCS subgroups. A random forest classifier trained on Ewing sarcoma and CIC-rearranged cases predicted probabilities of being CIC-rearranged >0.9 for CIC-rearranged-like sarcomas and <0.6 for other SBRCSs. Testing on a retrospective cohort of 1335 routine diagnostic cases identified 15 candidate CIC-rearranged tumors with a probability >0.75, all of which were supported by expert histopathologic reassessment. Furthermore, the multigene random forest classifier appeared advantageous over using high ETV4 expression alone, previously proposed as a surrogate to identify CIC rearrangement. Taken together, the expression-based classifier can offer valuable support for SBRCS pathologic diagnosis.
Subject(s)
Sarcoma, Small Cell , Sarcoma , Soft Tissue Neoplasms , Humans , Retrospective Studies , Sarcoma, Small Cell/diagnosis , Sarcoma, Small Cell/genetics , Sarcoma, Small Cell/pathology , Transcription Factors/genetics , Sarcoma/genetics , Soft Tissue Neoplasms/genetics , Sequence Analysis, RNA , Oncogene Proteins, Fusion/genetics , Biomarkers, Tumor/genetics , Biomarkers, Tumor/analysis
ABSTRACT
Background: A major obstacle faced by rare disease families is obtaining a genetic diagnosis. The average "diagnostic odyssey" lasts over five years, and causal variants are identified in under 50%. The Rare Genomes Project (RGP) is a direct-to-participant research study on the utility of genome sequencing (GS) for diagnosis and gene discovery. Families are consented for sharing of sequence and phenotype data with researchers, allowing development of a Critical Assessment of Genome Interpretation (CAGI) community challenge, placing variant prioritization models head-to-head in a real-life clinical diagnostic setting. Methods: Predictors were provided with a dataset of phenotype terms and variant calls from GS of 175 RGP individuals (65 families), including 35 solved training set families, with causal variants specified, and 30 test set families (14 solved, 16 unsolved). The challenge tasked teams with identifying the causal variants in as many test set families as possible. Ranked variant predictions were submitted with estimated probability of causal relationship (EPCR) values. Model performance was determined by two metrics: a weighted score based on rank position of true positive causal variants, and the maximum F-measure, based on precision and recall of causal variants across EPCR thresholds. Results: Sixteen teams submitted predictions from 52 models, some with manual review incorporated. Top-performing teams recalled the causal variants in up to 13 of 14 solved families by prioritizing high-quality variant calls that were rare, predicted deleterious, segregating correctly, and consistent with reported phenotype. In unsolved families, newly discovered diagnostic variants were returned to two families following confirmatory RNA sequencing, and two prioritized novel disease gene candidates were entered into Matchmaker Exchange. In one example, RNA sequencing demonstrated aberrant splicing due to a deep intronic indel in ASNS, identified in trans with a frameshift variant, in an unsolved proband with phenotype overlap with asparagine synthetase deficiency. Conclusions: By objective assessment of variant predictions, we provide insights into current state-of-the-art algorithms and platforms for genome sequencing analysis for rare disease diagnosis and explore areas for future optimization. Identification of diagnostic variants in unsolved families promotes synergy between researchers with clinical and computational expertise as a means of advancing the field of clinical genome interpretation.
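The maximum F-measure across EPCR thresholds can be sketched as a simple threshold sweep. The abstract does not state whether precision and recall were pooled across families or how ties were handled, so the version below is an assumption: it pools all scored variants, treats each distinct EPCR value as a candidate threshold, and uses the balanced F1 form of the F-measure. The function name and the toy variant labels are hypothetical.

```python
def max_f_measure(scored_variants, causal):
    """Return the best F1 over all EPCR thresholds.

    scored_variants: list of (variant_id, epcr) pairs
    causal:          set of true causal variant_ids
    A variant is predicted causal when its EPCR is >= the threshold.
    """
    best = 0.0
    thresholds = sorted({epcr for _, epcr in scored_variants})
    for threshold in thresholds:
        predicted = {v for v, epcr in scored_variants if epcr >= threshold}
        true_positives = len(predicted & causal)
        if true_positives == 0:
            continue  # F1 is 0 (or undefined) at this threshold
        precision = true_positives / len(predicted)
        recall = true_positives / len(causal)
        f1 = 2 * precision * recall / (precision + recall)
        best = max(best, f1)
    return best


# Hypothetical toy data: four scored variants, two of them causal.
scored = [("v1", 0.95), ("v2", 0.80), ("v3", 0.60), ("v4", 0.30)]
truth = {"v1", "v3"}

f_max = max_f_measure(scored, truth)
```

In this toy example the best trade-off occurs at a threshold of 0.60, where both causal variants are recovered alongside one false positive (precision 2/3, recall 1).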