ABSTRACT
BACKGROUND: Blood-based methods using cell-free DNA (cfDNA) are under development as an alternative to existing screening tests. However, early-stage detection of cancer using tumor-derived cfDNA has proven challenging because of the small proportion of cfDNA derived from tumor tissue in early-stage disease. A machine learning approach to discover signatures in cfDNA, potentially reflective of both tumor and non-tumor contributions, may represent a promising direction for the early detection of cancer. METHODS: Whole-genome sequencing was performed on cfDNA extracted from plasma samples (N = 546 colorectal cancer and 271 non-cancer controls). Reads aligning to protein-coding gene bodies were extracted, and read counts were normalized. cfDNA tumor fraction was estimated using IchorCNA. Machine learning models were trained using k-fold cross-validation and confounder-based cross-validations to assess generalization performance. RESULTS: In a colorectal cancer cohort heavily weighted towards early-stage cancer (80% stage I/II), we achieved a mean AUC of 0.92 (95% CI 0.91-0.93) with a mean sensitivity of 85% (95% CI 83-86%) at 85% specificity. Sensitivity generally increased with tumor stage and increasing tumor fraction. Stratification by age, sequencing batch, and institution demonstrated the impact of these confounders and provided a more accurate assessment of generalization performance. CONCLUSIONS: A machine learning approach using cfDNA achieved high sensitivity and specificity in a large, predominantly early-stage, colorectal cancer cohort. The possibility of systematic technical and institution-specific biases warrants similar confounder analyses in other studies. Prospective validation of this machine learning method and evaluation of a multi-analyte approach are underway.
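The confounder-based cross-validation mentioned in METHODS can be illustrated with a small sketch. It assumes each confounder (here, institution) is treated as a group label and that every sample from one group is held out together, so the model is always tested on a group it never saw in training. The abstract does not spell out the study's actual procedure; the function name `leave_one_group_out` and the institution labels are hypothetical.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx) for each distinct group label.

    All samples sharing a label land in the test fold together, so no
    group contributes to both training and testing in the same split.
    """
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)
    for label in sorted(by_group):
        test_idx = by_group[label]
        train_idx = [i for i in range(len(groups)) if groups[i] != label]
        yield label, train_idx, test_idx


# Hypothetical institution labels for six samples (not data from the study).
institutions = ["A", "A", "B", "C", "B", "C"]
splits = list(leave_one_group_out(institutions))
for held_out, train_idx, test_idx in splits:
    print(held_out, train_idx, test_idx)
```

The same splitter could be applied to sequencing batch or binned age by passing those labels instead. Comparing scores from such splits with ordinary k-fold scores gives a rough sense of how much a confounder inflates apparent performance.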
Subjects
Tumor Biomarkers, Circulating Tumor DNA, Colorectal Neoplasms/genetics, Colorectal Neoplasms/pathology, Human Genome, Genomics, Machine Learning, Aged, Aged 80 and Over, Colorectal Neoplasms/blood, Computational Biology/methods, Female, Gene Expression Profiling, Genomics/methods, Humans, Male, Middle Aged, Neoplasm Staging, ROC Curve, Reproducibility of Results, Transcriptome
ABSTRACT
SUMMARY: Efficient workflows to shepherd clinically generated genomic data through the multiple stages of a next-generation sequencing pipeline are of critical importance in translational biomedical science. Here we present COSMOS, a Python library for workflow management that allows formal description of pipelines and partitioning of jobs. In addition, it includes a user interface for tracking the progress of jobs, abstraction of the queuing system and fine-grained control over the workflow. Workflows can be created on traditional computing clusters as well as cloud-based services. AVAILABILITY AND IMPLEMENTATION: Source code is available for academic non-commercial research purposes. Links to code and documentation are provided at http://lpm.hms.harvard.edu and http://wall-lab.stanford.edu. CONTACT: dpwall@stanford.edu or peter_tonellato@hms.harvard.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Programming Languages
ABSTRACT
In this overview of biomedical computing in the cloud, we discussed two primary ways to use the cloud (a single instance or a cluster), provided a detailed example using NGS mapping, and highlighted the associated costs. While many users new to the cloud may assume that entry is as straightforward as uploading an application and selecting an instance type and storage options, we illustrated that substantial up-front effort is required before an application can make full use of the cloud's vast resources. Our intention was to provide a set of best practices and to illustrate how they apply to a typical application pipeline for biomedical informatics, while keeping them general enough to extrapolate to other types of computational problems. Our mapping example was intended to illustrate how to develop a scalable project, not to compare and contrast alignment algorithms for read mapping and genome assembly. Indeed, with a newer aligner such as Bowtie, it is possible to map the entire African genome using one m2.2xlarge instance in 48 hours for a total cost of approximately $48 in computation time. In our example, we were not concerned with data transfer rates, which are heavily influenced by the amount of available bandwidth, connection latency, and network availability. When transferring large amounts of data to the cloud, bandwidth limitations can be a major bottleneck, and in some cases it is more efficient to simply mail a storage device containing the data to AWS (http://aws.amazon.com/importexport/). More information about cloud computing, detailed cost analysis, and security can be found in the references.
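The cost figure quoted above (one instance, 48 hours, about $48) implies a rate of roughly $1.00 per instance-hour. The following sketch shows how such a compute-cost estimate might be worked out. The function name `compute_cost_usd`, the whole-hour billing rule, and the 8-instance cluster scenario are assumptions for illustration, not figures from the paper; storage and data-transfer charges are ignored.

```python
import math


def compute_cost_usd(n_instances, hours_per_instance, rate_usd_per_hour):
    """Estimate compute cost, assuming each instance is billed in whole hours."""
    billed_hours = math.ceil(hours_per_instance)  # assumption: partial hours round up
    return n_instances * billed_hours * rate_usd_per_hour


# Rate implied by the text: $48 / (1 instance * 48 h) = $1.00 per instance-hour.
implied_rate = 48.0 / (1 * 48)

# The single-instance case described in the text.
single_instance = compute_cost_usd(1, 48, implied_rate)

# Hypothetical cluster: 8 instances finishing in 6 h each (perfect scaling assumed).
cluster = compute_cost_usd(8, 6, implied_rate)

print(f"implied rate: ${implied_rate:.2f}/h")
print(f"single instance: ${single_instance:.2f}, cluster: ${cluster:.2f}")
```

Under these assumptions a perfectly scaled cluster costs the same as a single instance while finishing sooner. In practice, imperfect scaling and rounding up of partial hours would raise the cluster's cost.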
Subjects
Information Storage and Retrieval/methods, Internet, Software, Computational Biology, Computer Security, Information Storage and Retrieval/economics
ABSTRACT
A cell-free DNA (cfDNA) assay would be a promising approach to early cancer diagnosis, especially for patients with dense tissues. Consistent cfDNA signatures have been observed for many carcinogens. Recently, investigations of cfDNA as a reliable early detection bioassay have presented a powerful opportunity for detecting dense tissue screening complications early. We performed a prospective study to evaluate the potential of characterizing cfDNA as a central element in the early detection of dense tissue breast cancer (BC). Plasma samples were collected from 32 consenting subjects with dense tissue and positive mammograms, 20 with positive biopsies and 12 with negative biopsies. After screening and before biopsy, cfDNA was extracted, and whole-genome next-generation sequencing (NGS) was performed on all samples. Copy number alteration (CNA) and single nucleotide polymorphism (SNP)/insertion/deletion (Indel) analyses were performed to characterize cfDNA. In the positive-positive subjects (cases), a total of 5 CNAs overlapped with 5 previously reported BC-related oncogenes (KSR2, MAP2K4, MSI2, CANT1 and MSI2). In addition, 1 SNP was detected in KMT2C, a BC oncogene, and 9 others were detected in or near 10 genes (SERAC1, DAGLB, MACF1, NVL, FBXW4, FANK1, KCTD4, CAVIN1; ATP6V0A1 and ZBTB20-AS1) previously associated with non-BC cancers. For the positive-negative subjects (screening), 3 CNAs were detected in BC genes (ACVR2A, CUL3 and PIK3R1), and 5 SNPs were identified in 6 non-BC cancer genes (SNIP1, TBC1D10B, PANK1, PRKCA and RUNX2; SUPT3H). This study presents evidence of the potential of using cfDNA somatic variants as dense tissue BC biomarkers from a noninvasive liquid bioassay for early cancer detection.
Subjects
Breast Neoplasms, Cell-Free Nucleic Acids, Signal Transducing Adaptor Proteins/genetics, Biological Assay, Tumor Biomarkers/genetics, Breast Neoplasms/diagnosis, Breast Neoplasms/genetics, Cell-Free Nucleic Acids/genetics, Early Detection of Cancer, Female, Humans, Mutation, Prospective Studies, RNA-Binding Proteins/genetics
ABSTRACT
BACKGROUND: Some clinically important genetic variants are not easily evaluated with next-generation sequencing (NGS) methods due to technical challenges arising from high-similarity copies (e.g., PMS2, SMN1/SMN2, GBA1, HBA1/HBA2, CYP21A2), repetitive short sequences (e.g., ARX polyalanine repeats, FMR1 AGG interruptions in CGG repeats, CFTR poly-T/TG repeats), and other complexities (e.g., MSH2 Boland inversions). METHODS: We customized our NGS processes to detect the technically challenging variants mentioned above with adaptations including target enrichment and bioinformatic masking of similar sequences. Adaptations were validated with samples of known genotypes. RESULTS: Our adaptations provided high-sensitivity and high-specificity detection for most of the variants and provided a high-sensitivity primary assay to be followed with orthogonal disambiguation for the others. The sensitivity of the NGS adaptations was 100% for all of the technically challenging variants. Specificity was 100% for those in PMS2, GBA1, SMN1/SMN2, and HBA1/HBA2, and for the MSH2 Boland inversion; 97.8%-100% for CYP21A2 variants; and 85.7% for ARX polyalanine repeats. CONCLUSIONS: NGS assays can detect technically challenging variants when chemistries and bioinformatics are jointly refined. The adaptations described support a scalable, cost-effective path to identifying all clinically relevant variants within a single sample.
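For readers less familiar with the metrics reported in RESULTS, sensitivity and specificity follow directly from validation counts against samples of known genotype. The sketch below uses invented counts purely for illustration; they are not taken from the study, and the function names are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of truly variant-positive samples that the assay calls positive."""
    total = true_pos + false_neg
    if total == 0:
        raise ValueError("no positive samples to evaluate")
    return true_pos / total


def specificity(true_neg, false_pos):
    """Fraction of truly variant-negative samples that the assay calls negative."""
    total = true_neg + false_pos
    if total == 0:
        raise ValueError("no negative samples to evaluate")
    return true_neg / total


# Invented validation counts (illustrative only, not from the study).
sens = sensitivity(true_pos=20, false_neg=0)
spec = specificity(true_neg=18, false_pos=2)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```

A primary assay with perfect sensitivity but imperfect specificity, as in this toy case, misses no true variants yet produces some false positives, which is why the abstract pairs it with orthogonal disambiguation.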