ABSTRACT
The current state of much of the Wuhan pneumonia virus (severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2]) research shows a regrettable lack of data sharing and considerable analytical obfuscation. This impedes global research cooperation, which is essential for tackling public health emergencies and requires unimpeded access to data, analysis tools, and computational infrastructure. Here, we show that community efforts in developing open analytical software tools over the past 10 years, combined with national investments into scientific computational infrastructure, can overcome these deficiencies and provide an accessible platform for tackling global health emergencies in an open and transparent manner. Specifically, we use all SARS-CoV-2 genomic data available in the public domain so far to (1) underscore the importance of access to raw data and (2) demonstrate that existing community efforts in curation and deployment of biomedical software can reliably support rapid, reproducible research during global health crises. All our analyses are fully documented at https://github.com/galaxyproject/SARS-CoV-2.
Subjects
Betacoronavirus/pathogenicity , Coronavirus Infections/virology , Pneumonia, Viral/virology , Public Health , Severe Acute Respiratory Syndrome/virology , COVID-19 , Data Analysis , Humans , Pandemics , SARS-CoV-2
ABSTRACT
Many initiatives have addressed the global need to upskill biologists in bioinformatics tools and techniques. Australia is not unique in its requirement for such training, but due to its large size and relatively small and geographically dispersed population, Australia faces specific challenges. A combined training approach was implemented by the authors to overcome these challenges. The "hybrid" method combines guidance from experienced trainers with the benefits of both webinar-style delivery and concurrent face-to-face hands-on practical exercises in classrooms. Since 2017, the hybrid method has been used to conduct 9 hands-on bioinformatics training sessions at international scale in which over 800 researchers have been trained in diverse topics on a range of software platforms. The method has become a key tool to ensure scalable and more equitable delivery of short-course bioinformatics training across Australia and can be easily adapted to other locations, topics, or settings.
Subjects
Computational Biology/education , Education, Distance/methods , Australia , Biomedical Research/education , Biomedical Research/methods , Biomedical Research/organization & administration , Computational Biology/methods , Computational Biology/organization & administration , Humans
ABSTRACT
EMBL Australia Bioinformatics Resource (EMBL-ABR) is a developing national research infrastructure, providing bioinformatics resources and support to life science and biomedical researchers in Australia. EMBL-ABR comprises 10 geographically distributed national nodes with one coordinating hub, with current funding provided through Bioplatforms Australia and the University of Melbourne for its initial 2-year development phase. The EMBL-ABR mission is to: (1) increase Australia's capacity in bioinformatics and data sciences; (2) contribute to the development of training in bioinformatics skills; (3) showcase Australian data sets at an international level; and (4) enable engagement in international programs. The activities of EMBL-ABR are focussed on six key areas, aligning with comparable international initiatives such as ELIXIR, CyVerse and NIH Commons. These key areas (Tools, Data, Standards, Platforms, Compute and Training) are described in this article.
Subjects
Biological Science Disciplines , Biomedical Research , Computational Biology/education , Computational Biology/methods , Data Curation/methods , Australia , Humans
ABSTRACT
Cloud computing is a common platform for delivering software to end users. However, the process of making complex-to-deploy applications available across different cloud providers requires isolated and uncoordinated application-specific solutions, often locking-in developers to a particular cloud provider. Here, we present the CloudLaunch application as a uniform platform for discovering and deploying applications for different cloud providers. CloudLaunch allows arbitrary applications to be added to a catalog with each application having its own customizable user interface and control over the launch process, while preserving cloud-agnosticism so that authors can easily make their applications available on multiple clouds with minimal effort. It then provides a uniform interface for launching available applications by end users across different cloud providers. Architecture details are presented along with examples of different deployable applications that highlight architectural features.
ABSTRACT
BACKGROUND: Breast cancer risk for BRCA1 and BRCA2 pathogenic mutation carriers is modified by risk factors that cluster in families, including genetic modifiers of risk. We considered genetic modifiers of risk for carriers of high-risk mutations in other breast cancer susceptibility genes. METHODS: In a family known to carry the high-risk mutation PALB2:c.3113G>A (p.Trp1038*), whole-exome sequencing was performed on germline DNA from four affected women, three of whom were mutation carriers. RESULTS: RNASEL:p.Glu265* was identified in one of the PALB2 carriers who had two primary invasive breast cancer diagnoses before 50 years. Gene-panel testing of BRCA1, BRCA2, PALB2 and RNASEL in the Australian Breast Cancer Family Registry identified five carriers of RNASEL:p.Glu265* in 591 early onset breast cancer cases. Three of the five women (60%) carrying RNASEL:p.Glu265* also carried a pathogenic mutation in a breast cancer susceptibility gene compared with 30 carriers of pathogenic mutations in the 586 non-carriers of RNASEL:p.Glu265* (5%) (p < 0.002). Taqman genotyping demonstrated that the allele frequency of RNASEL:p.Glu265* was similar in affected and unaffected Australian women, consistent with other populations. CONCLUSION: Our study suggests that RNASEL:p.Glu265* may be a genetic modifier of risk for early-onset breast cancer predisposition in carriers of high-risk mutations. Much larger case-case and case-control studies are warranted to test the association observed in this report.
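The comparison reported above (3 of 5 RNASEL:p.Glu265* carriers versus 30 of 586 non-carriers also carrying a pathogenic mutation) can be checked with an exact test. The abstract does not name the test that was used, so the following is only a minimal sketch that assumes a one-sided Fisher exact (hypergeometric) test on the stated counts; the function name is illustrative.

```python
from math import comb


def hypergeom_upper_tail(k_obs, draws, successes, population):
    """P(X >= k_obs) when `draws` items are sampled without replacement
    from `population` items, of which `successes` carry the trait."""
    total = comb(population, draws)
    upper = min(draws, successes)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(k_obs, upper + 1)
    )
    return favourable / total


# Counts taken from the abstract: 591 early-onset cases, 5 RNASEL carriers
# (3 of whom carry another pathogenic mutation) and 586 non-carriers
# (30 of whom carry a pathogenic mutation).
cases = 591
rnasel_carriers = 5
pathogenic_total = 3 + 30

# Assumption: one-sided Fisher exact test; the abstract does not say which test was used.
p_value = hypergeom_upper_tail(3, rnasel_carriers, pathogenic_total, cases)
print(f"one-sided exact p = {p_value:.4f}")
```

Under these assumptions the exact p-value falls below the 0.002 threshold quoted in the abstract, although with only five carriers the estimate remains fragile, which is in line with the authors' call for larger studies.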
Subjects
Breast Neoplasms/genetics , Endoribonucleases/genetics , Genetic Predisposition to Disease/genetics , Adult , Age of Onset , Australia , BRCA1 Protein/genetics , BRCA2 Protein/genetics , Female , Heterozygote , Humans , Middle Aged , Mutation , Pedigree , Young Adult
ABSTRACT
BACKGROUND: Computational bioinformatics workflows are extensively used to analyse genomics data, with different approaches available to support implementation and execution of these workflows. Reproducibility is one of the core principles for any scientific workflow and remains a challenge that is not fully addressed. This is due to incomplete understanding of reproducibility requirements and assumptions of workflow definition approaches. Provenance information should be tracked and used to capture all these requirements, supporting reusability of existing workflows. RESULTS: We have implemented a complex but widely deployed bioinformatics workflow using three representative approaches to workflow definition and execution. Through implementation, we identified assumptions implicit in these approaches that ultimately produce insufficient documentation of workflow requirements, resulting in failed execution of the workflow. This study proposes a set of recommendations that aims to mitigate these assumptions and guides the scientific community to accomplish reproducible science, hence addressing the reproducibility crisis. CONCLUSIONS: Reproducing, adapting or even repeating a bioinformatics workflow in any environment requires substantial technical knowledge of the workflow execution environment, resolution of analysis assumptions and rigorous compliance with reproducibility requirements. Towards these goals, we propose conclusive recommendations that, along with an explicit declaration of workflow specification, would result in enhanced reproducibility of computational genomic analyses.
Subjects
Computational Biology/methods , Genomics , High-Throughput Nucleotide Sequencing , Internet , Reproducibility of Results , Sequence Analysis, DNA , User-Computer Interface
ABSTRACT
BACKGROUND: Global measures of peripheral blood DNA methylation have been associated with risk of some malignancies, including breast, bladder, and gastric cancer. Here, we examined genome-wide measures of peripheral blood DNA methylation in prostate cancer and its non-aggressive and aggressive disease forms. METHODS: We used a matched, case-control study of 687 incident prostate cancer samples, nested within a larger prospective cohort study. DNA methylation was measured in pre-diagnostic, peripheral blood samples using the Illumina Infinium HM450K BeadChip. Genome-wide measures of DNA methylation were computed as the median M-value of all CpG sites and according to CpG site location and regulatory function. We used conditional logistic regression to test for associations between genome-wide measures of DNA methylation and risk of prostate cancer and its subtypes, and by time between blood draw and diagnosis. RESULTS: We observed no associations between the genome-wide measure of DNA methylation based on all CpG sites and risk of prostate cancer or aggressive disease. Risk of non-aggressive disease was associated with higher methylation of CpG islands (OR = 0.80; 95%CI = 0.68-0.94), promoter regions (OR = 0.79; 95%CI = 0.66-0.93), and high density CpG regions (OR = 0.80; 95%CI = 0.68-0.94). Additionally, higher methylation of all CpGs (OR = 0.66; 95%CI = 0.48-0.89), CpG shores (OR = 0.62; 95%CI = 0.45-0.84), and regulatory regions (OR = 0.68; 95% CI = 0.51-0.91) was associated with a reduced risk of overall prostate cancer within 5 years of blood draw but not thereafter. CONCLUSIONS: A reduced risk of overall prostate cancer within 5 years of blood draw and non-aggressive prostate cancer was associated with higher genome-wide methylation of peripheral blood DNA. While these data have no immediate clinical utility, with further work they may provide insight into the early events of prostate carcinogenesis. Prostate 77:471-478, 2017. © 2017 Wiley Periodicals, Inc.
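The genome-wide measure described in the METHODS above (the median M-value across CpG sites) can be illustrated with a short sketch. It assumes the standard conversion from a methylation beta value to an M-value, M = log2(beta / (1 - beta)); the clamping constant, the function names, and the sample values are made up for illustration and do not reflect the authors' preprocessing.

```python
import math
from statistics import median


def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta value in [0, 1] to an M-value,
    log2(beta / (1 - beta)). `eps` is an illustrative clamp that
    avoids taking the log of zero or dividing by zero."""
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def genome_wide_m(betas):
    """Median M-value over a collection of CpG-site beta values."""
    return median(beta_to_m(b) for b in betas)


# Made-up beta values for five CpG sites in one sample.
sample_betas = [0.5, 0.8, 0.2, 0.9, 0.1]
summary = genome_wide_m(sample_betas)
print(summary)
```

The same function could be applied to a subset of sites (for example CpG islands or promoter regions) to obtain the location-specific measures the abstract mentions, given an annotation that assigns each site to a category.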
Subjects
CpG Islands/physiology , DNA Methylation/physiology , Genome-Wide Association Study/methods , Prostatic Neoplasms/blood , Prostatic Neoplasms/genetics , Adult , Aged , Case-Control Studies , Cohort Studies , Humans , Male , Middle Aged , Prospective Studies , Prostatic Neoplasms/diagnosis , Risk Factors
ABSTRACT
BACKGROUND: Global DNA methylation has been reported to be associated with urothelial cell carcinoma (UCC) by studies using blood samples collected at diagnosis. Using the Illumina HumanMethylation450 assay, we derived genome-wide measures of blood DNA methylation and assessed them for their prospective association with UCC risk. METHODS: We used 439 case-control pairs from the Melbourne Collaborative Cohort Study matched on age, sex, country of birth, DNA sample type, and collection period. Conditional logistic regression was used to compute odds ratios (OR) of UCC risk per s.d. of each genome-wide measure of DNA methylation and 95% confidence intervals (CIs), adjusted for potential confounders. We also investigated associations by disease subtype, sex, smoking, and time since blood collection. RESULTS: The risk of superficial UCC was decreased for individuals with higher levels of our genome-wide DNA methylation measure (OR=0.71, 95% CI: 0.54-0.94; P=0.02). This association was particularly strong for current smokers at sample collection (OR=0.47, 95% CI: 0.27-0.83). Intermediate levels of our genome-wide measure were associated with decreased risk of invasive UCC. Some variation was observed between UCC subtypes and the location and regulatory function of the CpGs included in the genome-wide measures of methylation. CONCLUSIONS: Higher levels of our genome-wide DNA methylation measure were associated with decreased risk of superficial UCC and intermediate levels were associated with reduced risk of invasive disease. These findings require replication by other prospective studies.
Subjects
Carcinoma, Transitional Cell/genetics , DNA Methylation , DNA/blood , Urologic Neoplasms/genetics , Adult , Aged , Blood Specimen Collection , Carcinoma, Transitional Cell/blood , Carcinoma, Transitional Cell/epidemiology , Carcinoma, Transitional Cell/pathology , Case-Control Studies , CpG Islands , Diet , Female , Follow-Up Studies , Genome-Wide Association Study , Humans , Incidence , Male , Middle Aged , Neoplasm Invasiveness , Prospective Studies , Risk , Risk Factors , Smoking/epidemiology , Time Factors , Urologic Neoplasms/blood , Urologic Neoplasms/epidemiology , Urologic Neoplasms/pathology , Victoria/epidemiology
ABSTRACT
Aberrant DNA methylation is a key feature of breast carcinoma. We aimed to test the association between breast cancer risk and epigenome-wide methylation in DNA from peripheral blood. Nested case-control study within the prospective Melbourne Collaborative Cohort Study. DNA was extracted from before-diagnosis blood samples (420 incident cases and matched controls). Methylation was measured with the Illumina Infinium Human Methylation 450 BeadChip array. Odds ratio (OR) for epigenome-wide methylation, quantified as the mean beta values across the CpGs, in relation to breast cancer risk were estimated using conditional logistic regression. Overall, the OR for breast cancer was 0.42 (95% CI 0.20-0.90) for the top versus bottom quartile of epigenome-wide DNA methylation and the OR for a one standard deviation increment was 0.69 (95% CI 0.50-0.95; test for linear trend, p = 0.02). Epigenome-wide DNA methylation of CpGs within functional promoters was associated with an increased risk, whereas epigenome-wide DNA methylation of genomic regions outside promoters was associated with decreased risk (test for heterogeneity, p = 0.0002). The increased risk associated with epigenome-wide DNA methylation in functional promoters did not vary by time between blood collection and diagnosis, whereas the inverse association with epigenome-wide DNA methylation outside functional promoters was strongest when the interval from blood collection to diagnosis was less than 5 years and weakest for the longest interval. Epigenome-wide methylation in DNA extracted from peripheral blood collected before diagnosis may have potential utility as markers of breast cancer risk and for early detection.
Subjects
Biomarkers, Tumor/blood , Breast Neoplasms/blood , Breast Neoplasms/genetics , DNA Methylation/genetics , Adult , Aged , Breast Neoplasms/pathology , Case-Control Studies , CpG Islands/genetics , Epigenomics , Female , Genome-Wide Association Study , Humans , Middle Aged , Neoplasm Staging , Promoter Regions, Genetic
ABSTRACT
BACKGROUND: Characterising genetic diversity through the analysis of massively parallel sequencing (MPS) data offers enormous potential to significantly improve our understanding of the genetic basis for observed phenotypes, including predisposition to and progression of complex human disease. Great challenges remain in resolving genetic variants that are genuine from the millions of artefactual signals. RESULTS: FAVR is a suite of new methods designed to work with commonly used MPS analysis pipelines to assist in the resolution of some of the issues related to the analysis of the vast amount of resulting data, with a focus on relatively rare genetic variants. To the best of our knowledge, no equivalent method has previously been described. The most important and novel aspect of FAVR is the use of signatures in comparator sequence alignment files during variant filtering, and annotation of variants potentially shared between individuals. The FAVR methods use these signatures to facilitate filtering of (i) platform and/or mapping-specific artefacts, (ii) common genetic variants, and, where relevant, (iii) artefacts derived from imbalanced paired-end sequencing, as well as annotation of genetic variants based on evidence of co-occurrence in individuals. We applied conventional variant calling to whole-exome sequencing datasets, produced using both SOLiD and TruSeq chemistries, with or without downstream processing by FAVR methods. We demonstrate a 3-fold smaller rare single nucleotide variant shortlist with no detected reduction in sensitivity. This analysis included Sanger sequencing of rare variant signals not evident in dbSNP131, assessment of known variant signal preservation, and comparison of observed and expected rare variant numbers across a range of first cousin pairs.
The principles described herein were applied in our recent publication identifying XRCC2 as a new breast cancer risk gene and have been made publicly available as a suite of software tools. CONCLUSIONS: FAVR is a platform-agnostic suite of methods that significantly enhances the analysis of large volumes of sequencing data for the study of rare genetic variants and their influence on phenotypes.
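The general idea of filtering candidate rare variants against signals in comparator samples can be sketched as follows. This is a deliberately simplified illustration and not FAVR's implementation: FAVR inspects signatures in comparator alignment files, whereas this sketch assumes the comparator evidence has already been reduced to variant tuples, and the function name and the `max_comparators` threshold are hypothetical.

```python
from collections import Counter


def filter_by_comparators(candidates, comparator_calls, max_comparators=1):
    """Keep candidate variants observed in at most `max_comparators`
    comparator samples. A variant seen in many unrelated comparators is
    more likely a common variant or a platform/mapping artefact."""
    seen = Counter()
    for calls in comparator_calls:
        seen.update(set(calls))  # count each comparator at most once per variant
    return [variant for variant in candidates if seen[variant] <= max_comparators]


# Hypothetical variants as (chromosome, position, reference, alternate).
candidates = [
    ("chr1", 1000, "A", "G"),
    ("chr2", 500, "C", "T"),
    ("chr3", 42, "G", "A"),
]
comparators = [
    [("chr1", 1000, "A", "G")],
    [("chr1", 1000, "A", "G"), ("chr3", 42, "G", "A")],
    [("chr1", 1000, "A", "G")],
]

shortlist = filter_by_comparators(candidates, comparators)
print(shortlist)
```

Here the chr1 variant is dropped because it recurs in all three comparators, while the other two remain on the shortlist. In practice the threshold would depend on the number of comparators and on how rare the variants of interest are expected to be.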
Subjects
Genetic Variation , High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA , Software , Breast Neoplasms/genetics , Exome , Female , Humans , Molecular Sequence Annotation , Phenotype , Sequence Alignment
ABSTRACT
The COVID-19 pandemic is the first global health crisis to occur in the age of big genomic data. Although data generation capacity is well established and sufficiently standardized, analytical capacity is not. To establish analytical capacity it is necessary to pull together global computational resources and deliver the best open source tools and analysis workflows within a ready-to-use, universally accessible resource. Such a resource should not be controlled by a single research group, institution, or country. Instead it should be maintained by a community of users and developers who ensure that the system remains operational and populated with current tools. A community is also essential for facilitating the types of discourse needed to establish best analytical practices. Bringing together public computational research infrastructure from the USA, Europe, and Australia, we developed a distributed data analysis platform that accomplishes these goals. It is immediately accessible to anyone in the world and is designed for the analysis of rapidly growing collections of deep sequencing datasets. We demonstrate its utility by detecting allelic variants in high-quality existing SARS-CoV-2 sequencing datasets and by continuous reanalysis of COG-UK data. All workflows, data, and documentation are available at https://covid19.galaxyproject.org.
ABSTRACT
BACKGROUND: The automation of data analysis in the form of scientific workflows has become a widely adopted practice in many fields of research. Computationally driven data-intensive experiments using workflows enable automation, scaling, adaptation, and provenance support. However, there are still several challenges associated with the effective sharing, publication, and reproducibility of such workflows due to the incomplete capture of provenance and lack of interoperability between different technical (software) platforms. RESULTS: Based on best-practice recommendations identified from the literature on workflow design, sharing, and publishing, we define a hierarchical provenance framework to achieve uniformity in provenance and support comprehensive and fully re-executable workflows equipped with domain-specific information. To realize this framework, we present CWLProv, a standard-based format to represent any workflow-based computational analysis to produce workflow output artefacts that satisfy the various levels of provenance. We use open source community-driven standards, interoperable workflow definitions in Common Workflow Language (CWL), structured provenance representation using the W3C PROV model, and resource aggregation and sharing as workflow-centric research objects generated along with the final outputs of a given workflow enactment. We demonstrate the utility of this approach through a practical implementation of CWLProv and evaluation using real-life genomic workflows developed by independent groups. CONCLUSIONS: The underlying principles of the standards utilized by CWLProv enable semantically rich and executable research objects that capture computational workflows with retrospective provenance such that any platform supporting CWL will be able to understand the analysis, reuse the methods for partial reruns, or reproduce the analysis to validate the published findings.
Subjects
Genomics , Models, Theoretical , Workflow , Humans , Software
ABSTRACT
We have previously reported Hi-Plex, a multiplex PCR methodology for building targeted DNA sequencing libraries that offers a low-cost protocol compatible with high-throughput processing. Here, we detail an improved protocol, Hi-Plex2, that more effectively enables the robust construction of small-to-medium panel-size libraries while maintaining low cost, simplicity and accuracy benefits of the Hi-Plex platform. Hi-Plex2 was applied to three panels, comprising 291, 740 and 1193 amplicons, targeting genes associated with risk for breast and/or colon cancer. We show substantial reduction of off-target amplification to enable library construction for small-to-medium-sized design panels not possible using the previous Hi-Plex chemistry.
Subjects
Multiplex Polymerase Chain Reaction/methods , Sequence Analysis, DNA/methods , DNA Primers/genetics , Electrophoresis, Agar Gel/methods , Gene Library , High-Throughput Nucleotide Sequencing/methods , Humans
ABSTRACT
With clouds becoming a standard target for deploying applications, it is more important than ever to be able to seamlessly utilise resources and services from multiple providers. Proprietary vendor APIs make this challenging and lead to conditional code being written to accommodate various API differences, requiring application authors to deal with these complexities and to test their applications against each supported cloud. In this paper, we describe an open source Python library called CloudBridge that provides a simple, uniform, and extensible API for multiple clouds. The library defines a standard 'contract' that all supported providers must implement, and an extensive suite of conformance tests to ensure that any exposed behavior is uniform across cloud providers, thus allowing applications to confidently utilise any of the supported clouds without any cloud-specific code or testing.
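The combination of a provider 'contract' and a shared conformance suite can be sketched in a few lines. The code below does not use CloudBridge's actual API: the interface, the two in-memory providers, and the check are all hypothetical stand-ins chosen to show the pattern of one abstract interface, several implementations, and a single test run against each.

```python
from abc import ABC, abstractmethod


class StorageProvider(ABC):
    """Hypothetical contract that every provider must implement."""

    @abstractmethod
    def create_bucket(self, name: str) -> str:
        """Create a bucket and return its name."""

    @abstractmethod
    def list_buckets(self) -> list:
        """Return all bucket names in sorted order."""


class FakeCloudA(StorageProvider):
    """In-memory stand-in for one vendor; keeps names in a list."""

    def __init__(self):
        self._buckets = []

    def create_bucket(self, name):
        self._buckets.append(name)
        return name

    def list_buckets(self):
        return sorted(self._buckets)


class FakeCloudB(StorageProvider):
    """In-memory stand-in for another vendor; keeps buckets in a dict."""

    def __init__(self):
        self._store = {}

    def create_bucket(self, name):
        self._store[name] = {}
        return name

    def list_buckets(self):
        return sorted(self._store)


def conformance_check(provider: StorageProvider) -> bool:
    """One provider-independent test: a created bucket must be listed."""
    before = provider.list_buckets()
    created = provider.create_bucket("demo")
    after = provider.list_buckets()
    return created == "demo" and after == sorted(before + ["demo"])


# Run the same check against every implementation of the contract.
results = {cls.__name__: conformance_check(cls()) for cls in (FakeCloudA, FakeCloudB)}
print(results)
```

Because application code depends only on the abstract interface, and every implementation must pass the same checks, the application needs no provider-specific branches. That is the property the abstract attributes to the library's design.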
ABSTRACT
BACKGROUND: Ion channels are well characterised in model organisms, principally because of the availability of functional genomic tools and datasets for these species. This contrasts with the situation, for example, for parasites of humans and animals, whose genomic and biological uniqueness means that many genes and their products cannot be annotated. As ion channels are recognised as important drug targets in mammals, the accurate identification and classification of parasite channels could provide major prospects for defining unique targets for designing novel and specific anti-parasite therapies. Here, we established a reliable bioinformatic pipeline for the identification and classification of ion channels encoded in the genome of the cancer-causing liver fluke Opisthorchis viverrini, and extended its application to related flatworms affecting humans. METHODS: We built an ion channel identification and classification pipeline (called MuSICC), employing an optimised support vector machine (SVM) model and using the Kyoto Encyclopaedia of Genes and Genomes (KEGG) classification system. Ion channel proteins were first identified and grouped according to amino acid sequence similarity to classified ion channels and the presence and number of ion channel-like conserved and transmembrane domains. Predicted ion channels were then classified to sub-family using an SVM model, trained using ion channel features. RESULTS: Following an evaluation of this pipeline (MuSICC), which demonstrated a classification sensitivity of 95.2% and accuracy of 70.5% for known ion channels, we applied it to effectively identify and classify ion channels in selected parasitic flatworms. CONCLUSIONS: MuSICC provides a practical and effective tool for the identification and classification of ion channels of parasitic flatworms, and should be applicable to a broad range of organisms that are evolutionarily distant from taxa whose ion channels are functionally characterised.
Subjects
Computational Biology/methods , Ion Channels/classification , Ion Channels/genetics , Parasitology/methods , Platyhelminths/enzymology , Platyhelminths/genetics , Animals
ABSTRACT
Small non-coding RNAs are widely recognized as key modulators in many biological processes, and are emerging as promising biomarkers for several diseases. These RNA species are transcribed in cells and can be packaged in extracellular vesicles, which are small vesicles released from many biotypes, and are involved in intercellular communication. The advent of next-generation sequencing (NGS) technology for high-throughput profiling has further advanced the biological insights of non-coding RNA on a genome-wide scale, and NGS has become the preferred approach for the discovery and quantification of non-coding RNA species. Despite the routine practice of NGS, the processing of large data sets poses difficulty for analysis before conducting downstream experiments. Often, the current analysis tools are designed for specific RNA species, such as microRNA, and offer limited flexibility for modifying parameters for optimization. An analysis tool that allows for maximum control of different software is essential for drawing concrete conclusions about differentially expressed transcripts. Here, we developed a one-touch integrated small RNA analysis pipeline (iSRAP) research tool that is composed of widely used tools for rapid profiling of small RNAs. The performance test of iSRAP using publicly available and in-house data sets shows its ability to comprehensively profile small RNAs of various classes and to analyse differentially expressed small RNAs. iSRAP offers comprehensive analysis of small RNA sequencing data that supports informed decisions on the downstream analyses of small RNA studies, including those of extracellular vesicles such as exosomes.
ABSTRACT
BACKGROUND: Analyzing high throughput genomics data is a complex and compute intensive task, generally requiring numerous software tools and large reference data sets, tied together in successive stages of data transformation and visualisation. A computational platform enabling best practice genomics analysis ideally meets a number of requirements, including: a wide range of analysis and visualisation tools, closely linked to large user and reference data sets; workflow platform(s) enabling accessible, reproducible, portable analyses, through a flexible set of interfaces; highly available, scalable computational resources; and flexibility and versatility in the use of these resources to meet demands and expertise of a variety of users. Access to an appropriate computational platform can be a significant barrier to researchers, as establishing such a platform requires a large upfront investment in hardware, experience, and expertise. RESULTS: We designed and implemented the Genomics Virtual Laboratory (GVL) as a middleware layer of machine images, cloud management tools, and online services that enable researchers to build arbitrarily sized compute clusters on demand, pre-populated with fully configured bioinformatics tools, reference datasets and workflow and visualisation options. The platform is flexible in that users can conduct analyses through web-based (Galaxy, RStudio, IPython Notebook) or command-line interfaces, and add/remove compute nodes and data resources as required. Best-practice tutorials and protocols provide a path from introductory training to practice. The GVL is available on the OpenStack-based Australian Research Cloud (http://nectar.org.au) and the Amazon Web Services cloud. The principles, implementation and build process are designed to be cloud-agnostic. CONCLUSIONS: This paper provides a blueprint for the design and implementation of a cloud-based Genomics Virtual Laboratory. 
We discuss scope, design considerations and technical and logistical constraints, and explore the value added to the research community through the suite of services and resources provided by our implementation.
Subjects
Cloud Computing , Computational Biology/methods , Genomics/methods , User-Computer Interface , Animals , Databases, Genetic , Humans , Software
ABSTRACT
The benefits of implementing high throughput sequencing in the clinic are quickly becoming apparent. However, few freely available bioinformatics pipelines have been built from the ground up with clinical genomics in mind. Here we present Cpipe, a pipeline designed specifically for clinical genetic disease diagnostics. Cpipe was developed by the Melbourne Genomics Health Alliance, an Australian initiative to promote common approaches to genomics across healthcare institutions. As such, Cpipe has been designed to provide fast, effective and reproducible analysis, while also being highly flexible and customisable to meet the individual needs of diverse clinical settings. Cpipe is being shared with the clinical sequencing community as an open source project and is available at http://cpipeline.org.
ABSTRACT
Tumour heterogeneity in primary prostate cancer is a well-established phenomenon. However, how the subclonal diversity of tumours changes during metastasis and progression to lethality is poorly understood. Here we reveal the precise direction of metastatic spread across four lethal prostate cancer patients using whole-genome and ultra-deep targeted sequencing of longitudinally collected primary and metastatic tumours. We find one case of metastatic spread to the surgical bed causing local recurrence, and another case of cross-metastatic site seeding combining with dynamic remoulding of subclonal mixtures in response to therapy. By ultra-deep sequencing end-stage blood, we detect both metastatic and primary tumour clones, even years after removal of the prostate. Analysis of mutations associated with metastasis reveals an enrichment of TP53 mutations, and additional sequencing of metastases from 19 patients demonstrates that acquisition of TP53 mutations is linked with the expansion of subclones with metastatic potential which we can detect in the blood.