ABSTRACT
canSAR (http://cansar.icr.ac.uk) is the largest public, freely available, integrative translational research and drug discovery knowledgebase for oncology. canSAR integrates vast multidisciplinary data spanning genomic, protein, pharmacological, drug and chemical data with structural biology, protein networks and more. It also provides unique data, curation and annotation and, crucially, AI-informed target assessment for drug discovery. canSAR is widely used internationally by academia and industry. Here we describe significant developments and enhancements to the data, web interface and infrastructure of canSAR in the form of the new implementation of the system: canSARblack. We demonstrate new functionality that aids translational hypothesis generation and experimental design, and show how canSAR can be adapted and utilised outside oncology.
Subjects
Computational Biology/methods; Databases, Genetic; Drug Discovery/methods; Knowledge Bases; Neoplasms/genetics; Translational Research, Biomedical/methods; Antineoplastic Agents/chemistry; Antineoplastic Agents/therapeutic use; Data Mining/methods; Genomics/methods; Humans; Internet; Medical Oncology/methods; Molecular Structure; Neoplasms/metabolism; Proteomics/methods; User-Computer Interface
ABSTRACT
Large-scale population analyses coupled with advances in technology have demonstrated that the human genome is more diverse than originally thought. To date, this diversity has largely been uncovered using short-read whole-genome sequencing. However, these short-read approaches fail to give a complete picture of a genome. They struggle to identify structural events, cannot access repetitive regions, and fail to resolve the human genome into haplotypes. Here, we describe an approach that retains long-range information while maintaining the advantages of short reads. Starting from ~1 ng of high molecular weight DNA, we produce barcoded short-read libraries. Novel informatic approaches allow the barcoded short reads to be associated with their original long molecules, producing a novel data type known as "Linked-Reads". This approach allows for simultaneous detection of small and large variants from a single library. In this manuscript, we show the advantages of Linked-Reads over standard short-read approaches for reference-based analysis. Linked-Reads allow mapping to 38 Mb of sequence not accessible to short reads, adding sequence in 423 difficult-to-sequence genes, including the disease-relevant genes STRC, SMN1, and SMN2. Both Linked-Read whole-genome and whole-exome sequencing identify complex structural variations, including balanced events and single exon deletions and duplications. Further, Linked-Reads extend the region of high-confidence calls by 68.9 Mb. The data presented here show that Linked-Reads provide a scalable approach for comprehensive genome analysis that is not possible using short reads alone.
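The abstract does not describe how barcoded reads are reassigned to their source molecules. As a rough illustration only, and not the authors' algorithm, the idea can be pictured as grouping aligned reads by barcode and chromosome and splitting each group wherever consecutive reads are too far apart to plausibly share a molecule. The function name, the read tuple layout and the 50 kb `max_gap` default are all assumptions made for this sketch.

```python
from collections import defaultdict


def group_reads_into_molecules(reads, max_gap=50_000):
    """Cluster barcoded reads into inferred molecules.

    reads: iterable of (barcode, chrom, pos) tuples (hypothetical layout).
    max_gap: assumed maximum distance between consecutive reads from the
             same molecule; the value is illustrative, not from the source.
    """
    positions_by_key = defaultdict(list)
    for barcode, chrom, pos in reads:
        positions_by_key[(barcode, chrom)].append(pos)

    molecules = []
    for (barcode, chrom), positions in sorted(positions_by_key.items()):
        positions.sort()
        start = prev = positions[0]
        n_reads = 1
        for pos in positions[1:]:
            if pos - prev > max_gap:
                # Gap too large: close the current molecule, start a new one.
                molecules.append(
                    {"barcode": barcode, "chrom": chrom,
                     "start": start, "end": prev, "n_reads": n_reads}
                )
                start = pos
                n_reads = 0
            prev = pos
            n_reads += 1
        molecules.append(
            {"barcode": barcode, "chrom": chrom,
             "start": start, "end": prev, "n_reads": n_reads}
        )
    return molecules
```

Reads that share a barcode but lie far apart are reported as separate molecules, which is the property that lets short reads carry long-range information.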
Subjects
Genome-Wide Association Study/methods; Polymorphism, Genetic; Whole Genome Sequencing/methods; Cell Line; Genome, Human; Humans; Intercellular Signaling Peptides and Proteins; Membrane Proteins/genetics; Survival of Motor Neuron 1 Protein/genetics; Survival of Motor Neuron 2 Protein/genetics
ABSTRACT
BACKGROUND: Wilms tumour is the most common childhood renal cancer and is genetically heterogeneous. While several Wilms tumour predisposition genes have been identified, there is strong evidence that further predisposition genes are likely to exist. Our study aim was to identify new predisposition genes for Wilms tumour. METHODS: In this exome sequencing study, we analysed lymphocyte DNA from 890 individuals with Wilms tumour, including 91 affected individuals from 49 familial Wilms tumour pedigrees. We used the protein-truncating variant prioritisation method to prioritise potential disease-associated genes for further assessment. We evaluated new predisposition genes in exome sequencing data that we generated in 334 individuals with 27 other childhood cancers and in exome data from The Cancer Genome Atlas obtained from 7632 individuals with 28 adult cancers. FINDINGS: We identified constitutional cancer-predisposing mutations in 33 individuals with childhood cancer. The three identified genes with the strongest signal in the protein-truncating variant prioritisation analyses were TRIM28, FBXW7, and NYNRIN. 21 of 33 individuals had a mutation in TRIM28; there was a strong parent-of-origin effect, with all ten inherited mutations being maternally transmitted (p=0·00098). We also found a strong association with the rare epithelial subtype of Wilms tumour, with 14 of 16 tumours being epithelial or epithelial predominant. There were no TRIM28 mutations in individuals with other childhood or adult cancers. We identified truncating FBXW7 mutations in four individuals with Wilms tumour and a de-novo non-synonymous FBXW7 mutation in a child with a rhabdoid tumour. Biallelic truncating mutations in NYNRIN were identified in three individuals with Wilms tumour, which is highly unlikely to have occurred by chance (p<0·0001). Finally, we identified two de-novo KDM3B mutations, supporting the role of KDM3B as a childhood cancer predisposition gene. 
INTERPRETATION: The four new Wilms tumour predisposition genes identified (TRIM28, FBXW7, NYNRIN, and KDM3B) are involved in diverse biological processes and, together with the other 17 known Wilms tumour predisposition genes, account for about 10% of Wilms tumour cases. The overlap between these 21 constitutionally mutated predisposition genes and 20 genes somatically mutated in Wilms tumour is limited, consisting of only four genes. We recommend that all individuals with Wilms tumour be offered genetic testing and, in particular, that those with epithelial Wilms tumour be offered TRIM28 genetic testing. Only a third of the familial Wilms tumour clusters we analysed were attributable to known genes, indicating that further Wilms tumour predisposition factors await discovery. FUNDING: Wellcome Trust.
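The abstract does not name the test behind the parent-of-origin p value. The reported figure is consistent with a one-sided binomial test under the assumption that each inherited mutation is equally likely to come from either parent, which the following sketch works through. The function name is illustrative.

```python
from math import comb


def binomial_tail_at_least(k, n, p=0.5):
    """Probability of observing k or more successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# All ten inherited TRIM28 mutations were maternally transmitted.
# Assumed null hypothesis: maternal and paternal transmission equally likely.
p_value = binomial_tail_at_least(10, 10)
print(f"{p_value:.5f}")  # 0.00098, i.e. (1/2)**10 = 1/1024
```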
Subjects
Genes, Wilms Tumor; Wilms Tumor/genetics; Adolescent; Adult; Child; Child, Preschool; F-Box-WD Repeat-Containing Protein 7/genetics; Female; Genetic Markers; Genetic Predisposition to Disease; Humans; Jumonji Domain-Containing Histone Demethylases/genetics; Male; Middle Aged; Mutation; Prognosis; Tripartite Motif-Containing Protein 28/genetics; United Kingdom/epidemiology; Exome Sequencing; Wilms Tumor/diagnosis; Wilms Tumor/mortality; Young Adult
ABSTRACT
Evaluating, optimising and benchmarking next generation sequencing (NGS) variant calling performance are essential requirements for clinical, commercial and academic NGS pipelines. Such assessments should be performed in a consistent, transparent and reproducible fashion, using independently, orthogonally generated data. Here we present ICR142 Benchmarker, a tool to generate outputs for assessing germline base substitution and indel calling performance using the ICR142 NGS validation series, a dataset of Illumina platform-based exome sequence data from 142 samples together with Sanger sequence data at 704 sites. ICR142 Benchmarker provides summary and detailed information on the sensitivity, specificity and false detection rates of variant callers. ICR142 Benchmarker also automatically generates a single-page report highlighting key performance metrics and how performance compares to widely used open-source tools. We used ICR142 Benchmarker with VCF files output by GATK, OpEx and DeepVariant to create a benchmark for variant calling performance. This evaluation revealed pipeline-specific differences and shared challenges in variant calling, for example in detecting indels in short repeating sequence motifs. We next used ICR142 Benchmarker to perform regression testing with DeepVariant versions 0.5.2 and 0.6.1. This showed that v0.6.1 improves variant calling performance, but there was evidence of minor changes in indel calling behaviour that may benefit from attention. The data also allowed us to evaluate filters to optimise DeepVariant calling, and we recommend using 30 as the QUAL threshold for base substitution calls when using DeepVariant v0.6.1. Finally, we used ICR142 Benchmarker with VCF files from two commercial variant calling providers to facilitate optimisation of their in-house pipelines and to provide transparent benchmarking of their performance.
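The recommended QUAL threshold could be applied with a small filter over VCF records such as the sketch below. It is not part of ICR142 Benchmarker. It assumes that calls with QUAL of at least 30 are kept (the abstract does not say whether the bound is inclusive), that the filter applies only to single-base substitutions, and that records with a missing QUAL are retained. Function names are hypothetical.

```python
def is_base_substitution(ref, alt):
    """True if REF and every ALT allele are single nucleotides."""
    bases = {"A", "C", "G", "T", "N"}
    return ref.upper() in bases and all(a.upper() in bases for a in alt.split(","))


def filter_vcf_lines(lines, min_qual=30.0):
    """Drop base substitution records whose QUAL is below min_qual.

    Header lines, indels and records with missing QUAL (".") pass through.
    VCF columns are tab-separated: CHROM, POS, ID, REF, ALT, QUAL, ...
    """
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alt, qual = fields[3], fields[4], fields[5]
        low_quality = qual != "." and float(qual) < min_qual
        if is_base_substitution(ref, alt) and low_quality:
            continue
        kept.append(line)
    return kept
```

Indels are deliberately left untouched here because the recommendation in the text concerns base substitution calls only.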
ICR142 Benchmarker consistently and transparently analyses variant calling performance based on the ICR142 NGS validation series, using the standard VCF input and outputting informative metrics to enable user understanding of pipeline performance. ICR142 Benchmarker is freely available at https://github.com/RahmanTeamDevelopment/ICR142_Benchmarker/releases.
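The three metrics named above can be computed from confusion-matrix counts as in the following sketch. The definitions used are the conventional ones and are an assumption here, since the abstract does not spell out how ICR142 Benchmarker defines them; in particular, "false detection rate" is taken to mean false positives as a fraction of all positive calls. The class and function names, and the counts in the example, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # variant present by Sanger and called
    fp: int  # variant absent by Sanger but called
    tn: int  # variant absent by Sanger and not called
    fn: int  # variant present by Sanger but not called


def ratio(numerator, denominator):
    """Safe division that returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def summarise(c):
    """Return sensitivity, specificity and false detection rate."""
    return {
        "sensitivity": ratio(c.tp, c.tp + c.fn),
        "specificity": ratio(c.tn, c.tn + c.fp),
        "false_detection_rate": ratio(c.fp, c.tp + c.fp),
    }


# Hypothetical counts for illustration only; not results from the paper.
example = summarise(Counts(tp=90, fp=5, tn=95, fn=10))
```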
ABSTRACT
The analytical sensitivity of a next generation sequencing (NGS) test reflects the ability of the test to detect real sequence variation. The evaluation of analytical sensitivity relies on the availability of gold-standard, validated benchmarking datasets. For NGS analysis the availability of suitable datasets has been limited. Most laboratories undertake small-scale evaluations using in-house data, and/or rely on in silico generated datasets to evaluate the performance of NGS variant detection pipelines. Cancer predisposition genes (CPGs), such as BRCA1 and BRCA2, are amongst the most widely tested genes in clinical practice today. Hundreds of providers across the world now offer CPG testing using NGS methods. Validating and comparing the analytical sensitivity of CPG tests has proved difficult, due to the absence of comprehensive, orthogonally validated benchmarking datasets of CPG pathogenic variants. To address this we present the ICR639 CPG NGS validation series. This dataset comprises data from 639 individuals. Each individual has sequencing data generated using the TruSight Cancer Panel (TSCP), a targeted NGS assay for the analysis of CPGs, together with orthogonally generated data showing the presence of at least one CPG pathogenic variant per individual. The set consists of 645 pathogenic variants in total. There is strong representation of the most challenging types of variants to detect, with 339 indels, including 16 complex indels and 24 longer than five base pairs, and 74 exon copy number variations (CNVs), including 23 single-exon CNVs. The series includes pathogenic variants in 31 CPGs, including 502 pathogenic variants in BRCA1 or BRCA2, making this an important comprehensive validation dataset for providers of BRCA1 and BRCA2 NGS testing. We have deposited the TSCP FASTQ files of the ICR639 series in the European Genome-phenome Archive (EGA) under accession number EGAD00001004134.