ABSTRACT
Recent advances in the field of immuno-oncology have brought transformative changes in the management of cancer patients. The immune profile of tumours has been found to have key value in predicting disease prognosis and treatment response in various cancers. Multiplex immunohistochemistry and immunofluorescence have emerged as potent tools for the simultaneous detection of multiple protein biomarkers in a single tissue section, thereby expanding opportunities for molecular and immune profiling while preserving tissue samples. By establishing the phenotype of individual tumour cells when distributed within a mixed cell population, the identification of clinically relevant biomarkers with high-throughput multiplex immunophenotyping of tumour samples has great potential to guide appropriate treatment choices. Moreover, the emergence of novel multi-marker imaging approaches can now provide unprecedented insights into the tumour microenvironment, including the potential interplay between various cell types. However, there are significant challenges to widespread integration of these technologies in daily research and clinical practice. This review addresses the challenges and potential solutions within a structured framework of action from a regulatory and clinical trial perspective. New developments within the field of immunophenotyping using multiplexed tissue imaging platforms and associated digital pathology are also described, with a specific focus on translational implications across different subtypes of cancer. © 2024 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.
Subject(s)
Breast Neoplasms , Humans , Female , Biomarkers, Tumor/genetics , Prognosis , Phenotype , United Kingdom , Tumor Microenvironment
ABSTRACT
Modern histologic imaging platforms coupled with machine learning methods have provided new opportunities to map the spatial distribution of immune cells in the tumor microenvironment. However, there exists no standardized method for describing or analyzing spatial immune cell data, and most reported spatial analyses are rudimentary. In this review, we provide an overview of two approaches for reporting and analyzing spatial data (raster versus vector-based). We then provide a compendium of spatial immune cell metrics that have been reported in the literature, summarizing prognostic associations in the context of a variety of cancers. We conclude by discussing two well-described clinical biomarkers, the breast cancer stromal tumor infiltrating lymphocytes score and the colon cancer Immunoscore, and describe investigative opportunities to improve clinical utility of these spatial biomarkers. © 2023 The Pathological Society of Great Britain and Ireland.
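The raster versus vector distinction mentioned above can be illustrated with a minimal sketch. The coordinates, tile size, and function names below are hypothetical and are not taken from the review; the two metrics shown (per-tile counts and nearest-neighbour distance) are only simple stand-ins for the families of metrics the review surveys.

```python
import math

# Hypothetical cell centroids (x, y) in micrometres; illustrative values only.
lymphocytes = [(10.0, 12.0), (14.0, 15.0), (80.0, 82.0), (85.0, 90.0)]
tumor_cells = [(12.0, 12.0), (50.0, 50.0), (84.0, 86.0)]


def raster_counts(points, tile_size, n_tiles):
    """Raster view: count the points that fall in each square tile of a grid."""
    grid = [[0] * n_tiles for _ in range(n_tiles)]
    for x, y in points:
        # Clamp to the last tile so points on the far edge stay inside the grid.
        col = min(int(x // tile_size), n_tiles - 1)
        row = min(int(y // tile_size), n_tiles - 1)
        grid[row][col] += 1
    return grid


def nearest_distances(sources, targets):
    """Vector view: distance from each source point to its nearest target point."""
    return [min(math.dist(s, t) for t in targets) for s in sources]


grid = raster_counts(lymphocytes, tile_size=50.0, n_tiles=2)
dists = nearest_distances(lymphocytes, tumor_cells)
```

The raster form discards exact positions in exchange for a fixed-size summary, whereas the vector form keeps every coordinate and supports distance-based metrics directly.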
Subject(s)
Colonic Neoplasms , Humans , Biomarkers , Benchmarking , Lymphocytes, Tumor-Infiltrating , Spatial Analysis , Tumor Microenvironment
ABSTRACT
The clinical significance of the tumor-immune interaction in breast cancer is now established, and tumor-infiltrating lymphocytes (TILs) have emerged as predictive and prognostic biomarkers for patients with triple-negative (estrogen receptor, progesterone receptor, and HER2-negative) breast cancer and HER2-positive breast cancer. How computational assessments of TILs might complement manual TIL assessment in trial and daily practices is currently debated. Recent efforts to use machine learning (ML) to automatically evaluate TILs have shown promising results. We review state-of-the-art approaches and identify pitfalls and challenges of automated TIL evaluation by studying the root cause of ML discordances in comparison to manual TIL quantification. We categorize our findings into four main topics: (1) technical slide issues, (2) ML and image analysis aspects, (3) data challenges, and (4) validation issues. The main reason for discordant assessments is the inclusion of false-positive areas or cells identified by performance on certain tissue patterns or design choices in the computational implementation. To aid the adoption of ML for TIL assessment, we provide an in-depth discussion of ML and image analysis, including validation issues that need to be considered before reliable computational reporting of TILs can be incorporated into the trial and routine clinical management of patients with triple-negative breast cancer. © 2023 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.
Subject(s)
Mammary Neoplasms, Animal , Triple Negative Breast Neoplasms , Humans , Animals , Lymphocytes, Tumor-Infiltrating , Biomarkers , Machine Learning
ABSTRACT
Data sharing is essential for reproducibility of epidemiologic research, replication of findings, pooled analyses in consortia efforts, and maximizing study value to address multiple research questions. However, barriers related to confidentiality, costs, and incentives often limit the extent and speed of data sharing. Epidemiological practices that follow Findable, Accessible, Interoperable, Reusable (FAIR) principles can address these barriers by making data resources findable with the necessary metadata, accessible to authorized users, and interoperable with other data, to optimize the reuse of resources with appropriate credit to its creators. We provide an overview of these principles and describe approaches for implementation in epidemiology. Increasing degrees of FAIRness can be achieved by moving data and code from on-site locations to remote, accessible ("Cloud") data servers, using machine-readable and nonproprietary files, and developing open-source code. Adoption of these practices will improve daily work and collaborative analyses and facilitate compliance with data sharing policies from funders and scientific journals. Achieving a high degree of FAIRness will require funding, training, organizational support, recognition, and incentives for sharing research resources, both data and code. However, these costs are outweighed by the benefits of making research more reproducible, impactful, and equitable by facilitating the reuse of precious research resources by the scientific community.
Subject(s)
Confidentiality , Information Dissemination , Humans , Reproducibility of Results , Software , Epidemiologic Studies
ABSTRACT
MOTIVATION: The Division of Cancer Epidemiology and Genetics (DCEG) and the Division of Cancer Prevention (DCP) at the National Cancer Institute (NCI) have recently generated genome-wide association study (GWAS) data for multiple traits in the Prostate, Lung, Colorectal, and Ovarian (PLCO) Genomic Atlas project. The GWAS included 110 000 participants. The dissemination of the genetic association data through a data portal called GWAS Explorer, in a manner that addresses the modern expectations of FAIR reusability by data scientists and engineers, is the main motivation for the development of the open-source JavaScript software development kit (SDK) reported here. RESULTS: The PLCO GWAS Explorer resource relies on a public stateless HTTP application programming interface (API) deployed as the sole backend service for both the landing page's web application and third-party analytical workflows. The core PLCOjs SDK is mapped to each of the API methods, and also to each of the reference graphic visualizations in the GWAS Explorer. A few additional visualization methods extend it. As is the norm with web SDKs, no download or installation is needed and modularization supports targeted code injection for web applications, reactive notebooks (Observable) and node-based web services. AVAILABILITY AND IMPLEMENTATION: code at https://github.com/episphere/plco; project page at https://episphere.github.io/plco.
Subject(s)
Colorectal Neoplasms , Ovarian Neoplasms , United States , Male , Humans , Female , Genome-Wide Association Study , National Cancer Institute (U.S.) , Prostate , Software , Ovarian Neoplasms/genetics , Lung
ABSTRACT
BACKGROUND: Online questionnaires are commonly used to collect information from participants in epidemiological studies. This requires building questionnaires using machine-readable formats that can be delivered to study participants using web-based technologies such as progressive web applications. However, the paucity of open-source markup standards with support for complex logic makes collaborative development of web-based questionnaire modules difficult. This often prevents interoperability and reusability of questionnaire modules across epidemiological studies. RESULTS: We developed an open-source markup language for presentation of questionnaire content and logic, Quest, within a real-time renderer that enables the user to test logic (e.g., skip patterns) and view the structure of data collection. We provide the Quest markup language, an in-browser markup rendering tool, a questionnaire development tool, and an example web application that embeds the renderer, developed for The Connect for Cancer Prevention Study. CONCLUSION: A markup language can specify both the content and logic of a questionnaire as plain text. Questionnaire markup, such as Quest, can become a standard format for storing questionnaires or sharing questionnaires across the web. Quest is a step towards generation of FAIR data in epidemiological studies by facilitating reusability of questionnaires and data interoperability using open-source tools.
Subject(s)
Software , Humans , Surveys and Questionnaires , Epidemiologic Studies
ABSTRACT
Cancer heterogeneities hold the key to a deeper understanding of cancer etiology and progression and the discovery of more precise cancer therapy. Modern pathological and molecular technologies offer a powerful set of tools to profile tumor heterogeneities at multiple levels in large patient populations, from DNA to RNA, protein and epigenetics, and from tumor tissues to tumor microenvironment and liquid biopsy. When coupled with well-validated epidemiologic methodology and well-characterized epidemiologic resources, the rich tumor pathological and molecular tumor information provide new research opportunities at an unprecedented breadth and depth. This is the research space where Molecular Pathological Epidemiology (MPE) emerged over a decade ago and has been thriving since then. As a truly multidisciplinary field, MPE embraces collaborations from diverse fields including epidemiology, pathology, immunology, genetics, biostatistics, bioinformatics, and data science. Since first convened in 2013, the International MPE Meeting series has grown into a dynamic and dedicated platform for experts from these disciplines to communicate novel findings, discuss new research opportunities and challenges, build professional networks, and educate the next-generation scientists. Herein, we share the proceedings of the Fifth International MPE meeting, held virtually online, on May 24 and 25, 2021. The meeting consisted of 21 presentations organized into the three main themes, which were recent integrative MPE studies, novel cancer profiling technologies, and new statistical and data science approaches. Looking forward to the near future, the meeting attendees anticipated continuous expansion and fruition of MPE research in many research fronts, particularly immune-epidemiology, mutational signatures, liquid biopsy, and health disparities.
Subject(s)
Neoplasms , Pathology, Molecular , Humans , Mutation , Neoplasms/epidemiology , Neoplasms/genetics , Neoplasms/therapy , Pathology, Molecular/methods , Tumor Microenvironment
ABSTRACT
MOTIVATION: Mortality Tracker is an in-browser application for data wrangling, analysis, dissemination and visualization of public time series of mortality in the United States. It was developed in response to requests by epidemiologists for portable real time assessment of the effect of COVID-19 on other causes of death and all-cause mortality. This is performed by comparing 2020 real time values with observations from the same week in the previous 5 years, and by enabling the extraction of temporal snapshots of mortality series that facilitate modeling the interdependence between its causes. RESULTS: Our solution employs a scalable 'Data Commons at Web Scale' approach that abstracts all stages of the data cycle as in-browser components. Specifically, the data wrangling computation, not just the orchestration of data retrieval, takes place in the browser, without any requirement to download or install software. This approach, where operations that would normally be computed server-side are mapped to in-browser SDKs, is sometimes loosely described as Web APIs, a designation adopted here. AVAILABILITY AND IMPLEMENTATION: https://episphere.github.io/mortalitytracker; webcast demo: youtu.be/ZsvCe7cZzLo. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
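The same-week comparison described above can be sketched as follows. This is not the Mortality Tracker code: the weekly counts are invented, the function name is hypothetical, and using the plain mean of the five preceding years as the baseline is an assumption made for illustration.

```python
from statistics import mean

# Hypothetical death counts for one epidemiological week, by year (invented values).
weekly_deaths = {
    2015: 52_000,
    2016: 53_000,
    2017: 54_000,
    2018: 55_000,
    2019: 56_000,
    2020: 66_000,
}


def excess_vs_baseline(counts, target_year, n_baseline_years=5):
    """Compare one year's weekly count with the mean of the preceding years.

    Returns (excess, baseline). The mean baseline is an assumed choice.
    """
    baseline_years = range(target_year - n_baseline_years, target_year)
    baseline = mean(counts[year] for year in baseline_years)
    return counts[target_year] - baseline, baseline


excess, baseline = excess_vs_baseline(weekly_deaths, target_year=2020)
```

Repeating this calculation for every week and cause of death would give the kind of series the abstract describes comparing, under the stated assumptions.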
Subject(s)
COVID-19 , Computers , Humans , Information Storage and Retrieval , SARS-CoV-2 , Software
ABSTRACT
BACKGROUND: Excess death estimates quantify the full impact of the coronavirus disease 2019 (COVID-19) pandemic. Widely reported U.S. excess death estimates have not accounted for recent population changes, especially increases in the population older than 65 years. OBJECTIVE: To estimate excess deaths in the United States in 2020, after accounting for population changes. DESIGN: Surveillance study. SETTING: United States, March to August 2020. PARTICIPANTS: All decedents. MEASUREMENTS: Age-specific excess deaths in the United States from 1 March to 31 August 2020 compared with 2015 to 2019 were estimated, after changes in population size and age were taken into account, by using Centers for Disease Control and Prevention provisional death data and U.S. Census Bureau population estimates. Cause-specific excess deaths were estimated by month and age. RESULTS: From March through August 2020, 1 671 400 deaths were registered in the United States, including 173 300 COVID-19 deaths. An average of 1 370 000 deaths were reported over the same months during 2015 to 2019, for a crude excess of 301 400 deaths (128 100 non-COVID-19 deaths). However, the 2020 U.S. population includes 5.04 million more persons aged 65 years and older than the average population in 2015 to 2019 (a 10% increase). After population changes were taken into account, an estimated 217 900 excess deaths occurred from March through August 2020 (173 300 COVID-19 and 44 600 non-COVID-19 deaths). Most excess non-COVID-19 deaths occurred in April, July, and August, and 34 900 (78%) were in persons aged 25 to 64 years. Diabetes, Alzheimer disease, and heart disease caused the most non-COVID-19 excess deaths. LIMITATION: Provisional death data are underestimated because of reporting delays. CONCLUSION: The COVID-19 pandemic resulted in an estimated 218 000 excess deaths in the United States between March and August 2020, and 80% of those deaths had COVID-19 as the underlying cause. 
Accounting for population changes substantially reduced the excess non-COVID-19 death estimates, providing important information for guiding future clinical and public health interventions. PRIMARY FUNDING SOURCE: National Cancer Institute.
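The figures reported in this abstract are mutually consistent, which a short calculation can confirm. The sketch below uses only numbers quoted in the abstract; the variable names are illustrative.

```python
# Figures quoted in the abstract (March through August 2020, United States).
deaths_2020 = 1_671_400          # registered deaths in 2020
baseline_2015_2019 = 1_370_000   # average deaths over the same months, 2015 to 2019
covid_deaths = 173_300           # COVID-19 deaths in 2020
adjusted_excess = 217_900        # excess deaths after accounting for population changes
non_covid_25_64 = 34_900         # adjusted non-COVID-19 excess deaths, ages 25 to 64

# Crude excess: observed deaths minus the unadjusted historical average.
crude_excess = deaths_2020 - baseline_2015_2019
crude_non_covid = crude_excess - covid_deaths

# Adjusted non-COVID-19 excess: adjusted total minus COVID-19 deaths.
adjusted_non_covid = adjusted_excess - covid_deaths

# Shares reported as percentages in the abstract.
covid_share = covid_deaths / adjusted_excess
share_25_64 = non_covid_25_64 / adjusted_non_covid
```

The adjustment lowers the non-COVID-19 excess from 128 100 to 44 600 deaths, which is the reduction the conclusion refers to.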
Subject(s)
Aging , COVID-19/mortality , Mortality/trends , Pneumonia, Viral/mortality , Population Growth , Adult , Aged , Aged, 80 and over , Female , Humans , Male , Middle Aged , Pandemics , Pneumonia, Viral/virology , Population Surveillance , Risk Factors , SARS-CoV-2 , United States/epidemiology
ABSTRACT
BACKGROUND: Although racial/ethnic disparities in U.S. COVID-19 death rates are striking, focusing on COVID-19 deaths alone may underestimate the true effect of the pandemic on disparities. Excess death estimates capture deaths both directly and indirectly caused by COVID-19. OBJECTIVE: To estimate U.S. excess deaths by racial/ethnic group. DESIGN: Surveillance study. SETTING: United States. PARTICIPANTS: All decedents. MEASUREMENTS: Excess deaths and excess deaths per 100 000 persons from March to December 2020 were estimated by race/ethnicity, sex, age group, and cause of death, using provisional death certificate data from the Centers for Disease Control and Prevention (CDC) and U.S. Census Bureau population estimates. RESULTS: An estimated 2.88 million deaths occurred between March and December 2020. Compared with the number of expected deaths based on 2019 data, 477 200 excess deaths occurred during this period, with 74% attributed to COVID-19. Age-standardized excess deaths per 100 000 persons among Black, American Indian/Alaska Native (AI/AN), and Latino males and females were more than double those in White and Asian males and females. Non-COVID-19 excess deaths also disproportionately affected Black, AI/AN, and Latino persons. Compared with White males and females, non-COVID-19 excess deaths per 100 000 persons were 2 to 4 times higher in Black, AI/AN, and Latino males and females, including deaths due to diabetes, heart disease, cerebrovascular disease, and Alzheimer disease. Excess deaths in 2020 resulted in substantial widening of racial/ethnic disparities in all-cause mortality from 2019 to 2020. LIMITATIONS: Completeness and availability of provisional CDC data; no estimates of precision around results. CONCLUSION: There were profound racial/ethnic disparities in excess deaths in the United States in 2020 during the COVID-19 pandemic, resulting in rapid increases in racial/ethnic disparities in all-cause mortality between 2019 and 2020.
PRIMARY FUNDING SOURCE: National Institutes of Health Intramural Research Program.
Subject(s)
COVID-19/ethnology , COVID-19/mortality , Ethnic and Racial Minorities/statistics & numerical data , Health Status Disparities , Pandemics , Adolescent , Adult , Age Distribution , Aged , Aged, 80 and over , Cause of Death , Child , Child, Preschool , Female , Humans , Infant , Infant, Newborn , Male , Middle Aged , Population Surveillance , SARS-CoV-2 , Sex Distribution , United States/epidemiology , Young Adult
ABSTRACT
Quantitative assessment of spatial relations between tumor and tumor-infiltrating lymphocytes (TIL) is increasingly important in both basic science and clinical aspects of breast cancer research. We have developed and evaluated convolutional neural network analysis pipelines to generate combined maps of cancer regions and TILs in routine diagnostic breast cancer whole slide tissue images. The combined maps provide insight about the structural patterns and spatial distribution of lymphocytic infiltrates and facilitate improved quantification of TILs. Both tumor and TIL analyses were evaluated by using three convolutional neural networks (34-layer ResNet, 16-layer VGG, and Inception v4); the results compared favorably with those obtained by using the best published methods. We have produced open-source tools and a public data set consisting of tumor/TIL maps for 1090 invasive breast cancer images from The Cancer Genome Atlas. The maps can be downloaded for further downstream analyses.
Subject(s)
Breast Neoplasms/pathology , Deep Learning , Lymphocytes, Tumor-Infiltrating/pathology , Breast Neoplasms/immunology , Female , Humans , Lymphocytes, Tumor-Infiltrating/immunology , SEER Program
ABSTRACT
A personalized approach based on a patient's or pathogen's unique genomic sequence is the foundation of precision medicine. Genomic findings must be robust and reproducible, and experimental data capture should adhere to findable, accessible, interoperable, and reusable (FAIR) guiding principles. Moreover, effective precision medicine requires standardized reporting that extends beyond wet-lab procedures to computational methods. The BioCompute framework (https://w3id.org/biocompute/1.3.0) enables standardized reporting of genomic sequence data provenance, including provenance domain, usability domain, execution domain, verification kit, and error domain. This framework facilitates communication and promotes interoperability. Bioinformatics computation instances that employ the BioCompute framework are easily relayed, repeated if needed, and compared by scientists, regulators, test developers, and clinicians. Easing the burden of performing the aforementioned tasks greatly extends the range of practical application. Large clinical trials, precision medicine, and regulatory submissions require a set of agreed upon standards that ensures efficient communication and documentation of genomic analyses. The BioCompute paradigm and the resulting BioCompute Objects (BCOs) offer that standard and are freely accessible as a GitHub organization (https://github.com/biocompute-objects) following the "Open-Stand.org principles for collaborative open standards development." With high-throughput sequencing (HTS) studies communicated using a BCO, regulatory agencies (e.g., Food and Drug Administration [FDA]), diagnostic test developers, researchers, and clinicians can expand collaboration to drive innovation in precision medicine, potentially decreasing the time and cost associated with next-generation sequencing workflow exchange, reporting, and regulatory reviews.
Subject(s)
Computational Biology/methods , Sequence Analysis, DNA/methods , Animals , Communication , Computational Biology/standards , Genome , Genomics/methods , High-Throughput Nucleotide Sequencing , Humans , Precision Medicine/trends , Reproducibility of Results , Sequence Analysis, DNA/standards , Software , Workflow
ABSTRACT
Summary: The move of computational genomics workflows to Cloud Computing platforms is associated with a new level of integration and interoperability that challenges existing data representation formats. The Variant Calling Format (VCF) is in a particularly sensitive position in that regard, with both clinical and consumer-facing analysis tools relying on this self-contained description of genomic variation in Next Generation Sequencing (NGS) results. In this report we identify an isomorphic map between VCF and the reference Resource Description Framework. RDF is advanced by the World Wide Web Consortium (W3C) to enable representations of linked data that are both distributed and discoverable. The resulting ability to decompose VCF reports of genomic variation without loss of context addresses the need to modularize and govern NGS pipelines for Precision Medicine. Specifically, it provides the flexibility (i.e. the indexing) needed to support the wide variety of clinical scenarios and patient-facing governance where only part of the VCF data is fitting. Availability and Implementation: Software libraries with a claim to be both domain-facing and consumer-facing have to pass the test of portability across the variety of devices that those consumers in fact adopt. That is, ideally the implementation should itself take place within the space defined by web technologies. Consequently, the isomorphic mapping function was implemented in JavaScript, and was tested in a variety of environments and devices, client and server side alike. These range from web browsers in mobile phones to the most popular micro service platform, NodeJS. The code is publicly available at https://github.com/ibl/VCFr , with a live deployment at: http://ibl.github.io/VCFr/ . Contact: jonas.almeida@stonybrookmedicine.edu.
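The idea of decomposing a VCF record into linked-data statements can be sketched as below. This is not the VCFr mapping: the base URI, the subject naming scheme, and the predicate names are hypothetical, and the sketch covers only the eight fixed VCF columns of a single data line, with triples held as plain Python tuples rather than serialized RDF.

```python
# The eight fixed columns of a VCF data line, in order.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

# A single illustrative, tab-separated data line.
line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14"


def vcf_line_to_triples(vcf_line, base="http://example.org/vcf#"):
    """Turn one VCF data line into (subject, predicate, object) tuples.

    The base URI and predicate names are placeholders, not a published vocabulary.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields))
    # Hypothetical subject identifier built from chromosome and position.
    subject = f"{base}{record['CHROM']}_{record['POS']}"
    triples = [(subject, base + column, record[column]) for column in VCF_COLUMNS[:7]]
    # INFO holds semicolon-separated key=value pairs; flags have no value.
    for item in record["INFO"].split(";"):
        key, _, value = item.partition("=")
        triples.append((subject, base + "INFO_" + key, value if value else "true"))
    return triples


triples = vcf_line_to_triples(line)
```

Because every statement carries its own subject, any subset of the tuples can be shared or indexed on its own, which is the kind of modular use the abstract argues for.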
Subject(s)
Genetic Variation , Genome , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Genomics/methods , Humans , Information Storage and Retrieval , Semantics
ABSTRACT
BACKGROUND: The molecular signature of ductal carcinoma in situ (DCIS) in the breast is not well understood. Erb-b2 receptor tyrosine kinase 2 (ERBB2 [formerly known as HER2/neu]) positivity in DCIS is predictive of coexistent early invasive breast carcinoma. The aim of this study is to identify the gene-expression signature profiles of estrogen receptor (ER)/progesterone receptor (PR)-positive, ERBB2, and triple-negative subtypes of DCIS. METHODS: Based on ER, PR, and ERBB2 status, a total of 18 high nuclear grade DCIS cases with no evidence of invasive breast carcinoma were selected along with 6 non-neoplastic controls. The 3 study groups were defined as ER/PR-positive, ERBB2, and triple-negative subtypes. RESULTS: A total of 49 genes were differentially expressed in the ERBB2 subtype compared with the ER/PR-positive and triple-negative groups. PROM1 was overexpressed in the ERBB2 subtype compared with ER/PR-positive and triple-negative subtypes. Other genes differentially expressed included TAOK1, AREG, AGR3, PEG10, and MMP9. CONCLUSIONS: Our study identified unique gene signatures in ERBB2-positive DCIS, which may be associated with the development of invasive breast carcinoma. The results may enhance our understanding of the progression of breast cancer and become the basis for developing new predictive biomarkers and therapeutic targets for DCIS.
Subject(s)
Biomarkers, Tumor/metabolism , Breast Neoplasms/genetics , Carcinoma, Ductal, Breast/genetics , Carcinoma, Intraductal, Noninfiltrating/genetics , Gene Expression Profiling , Receptor, ErbB-2/metabolism , Adult , Aged , Breast Neoplasms/metabolism , Breast Neoplasms/pathology , Carcinoma, Ductal, Breast/metabolism , Carcinoma, Ductal, Breast/pathology , Carcinoma, Intraductal, Noninfiltrating/metabolism , Carcinoma, Intraductal, Noninfiltrating/pathology , Female , Humans , Middle Aged , Neoplasm Invasiveness , Neoplasm Staging , Prognosis , Receptors, Estrogen/metabolism , Receptors, Progesterone/metabolism , Survival Rate
ABSTRACT
Among alignment-free methods, Iterated Maps (IMs) are on a particular extreme: they are also scale free (order free). The use of IMs for sequence analysis is also distinct from other alignment-free methodologies in being rooted in statistical mechanics instead of computational linguistics. Both of these roots go back over two decades to the use of fractal geometry in the characterization of phase-space representations. The time series analysis origin of the field is betrayed by the title of the manuscript that started this alignment-free subdomain in 1990, 'Chaos Game Representation'. The clash between the analysis of sequences as continuous series and the better established use of Markovian approaches to discrete series was almost immediate, with a defining critique published in the same journal 2 years later. The rest of that decade would go by before the scale-free nature of the IM space was uncovered. The ensuing decade saw this scalability generalized for non-genomic alphabets as well as an interest in its use for graphic representation of biological sequences. Finally, in the past couple of years, in step with the emergence of BigData and MapReduce as a new computational paradigm, there is a surprising third act in the IM story. Multiple reports have described gains in computational efficiency of multiple orders of magnitude over more conventional sequence analysis methodologies. The stage appears to be now set for a recasting of IMs with a central role in processing nextgen sequencing results.
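A minimal sketch of the Chaos Game Representation iteration may help make the idea concrete. The corner assignment below is one commonly used convention and is an assumption here, as are the function name and the starting point at the centre of the unit square.

```python
# One common corner assignment for the four nucleotides (an assumed convention).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def chaos_game(sequence, start=(0.5, 0.5)):
    """Map a DNA sequence to a list of points in the unit square.

    Each step moves halfway from the current point towards the corner
    assigned to the next nucleotide.
    """
    x, y = start
    points = []
    for base in sequence.upper():
        corner_x, corner_y = CORNERS[base]
        x, y = (x + corner_x) / 2, (y + corner_y) / 2
        points.append((x, y))
    return points


points = chaos_game("AC")
```

Since each step halves the distance to a corner, sequences that end in the same k symbols fall in the same sub-square of side 2^-k, which is one way to see why the representation can be read at any scale.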
Subject(s)
Computational Biology/methods , Sequence Analysis/methods , Computational Biology/trends , Fractals , Models, Statistical , Nonlinear Dynamics , Sequence Alignment , Sequence Analysis/statistics & numerical data
ABSTRACT
This editorial discusses the rise of computational pathology as a major driver of experimental pathology research.
Subject(s)
Pathology , Computational Biology , Humans , Research
ABSTRACT
BACKGROUND: The scarcity of tissues from racial and ethnic minorities at biobanks poses a scientific constraint to research addressing health disparities in minority populations. METHODS: To address this gap, the Minority Biospecimen/Biobanking Geographic Management Program for region 3 (BMaP-3) established a working infrastructure for a "biobanking" hub in the southeastern United States and Puerto Rico. Herein we describe the steps taken to build this infrastructure, evaluate the feasibility of collecting formalin-fixed, paraffin-embedded tissue blocks and associated data from a single cancer type (breast), and create a web-based database and tissue microarrays (TMAs). RESULTS: Cancer registry data from 6 partner institutions were collected, representing 12,408 entries from 8,279 unique patients with breast cancer (years 2001-2011). Data were harmonized and merged, and deidentified information was made available online. A TMA was constructed from formalin-fixed, paraffin-embedded samples of invasive ductal carcinoma (IDC) representing 427 patients with breast cancer (147 African Americans, 168 Hispanics, and 112 non-Hispanic whites) and was annotated according to biomarker status and race/ethnicity. Biomarker analysis of the TMA was consistent with the literature. CONCLUSIONS: Contributions from participating institutions have facilitated a robust research tool. TMAs of IDC have now been released for 5 projects at 5 different institutions.
Subject(s)
Carcinoma, Ductal, Breast/epidemiology , Adult , Aged , Aged, 80 and over , Ethnicity , Female , Humans , Immunohistochemistry , Middle Aged , Tissue Array Analysis
ABSTRACT
BACKGROUND: Ongoing advancements in cloud computing provide novel opportunities in scientific computing, especially for distributed workflows. Modern web browsers can now be used as high-performance workstations for querying, processing, and visualizing genomics' "Big Data" from sources like The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) without local software installation or configuration. The design of QMachine (QM) was driven by the opportunity to use this pervasive computing model in the context of the Web of Linked Data in Biomedicine. RESULTS: QM is an open-sourced, publicly available web service that acts as a messaging system for posting tasks and retrieving results over HTTP. The illustrative application described here distributes the analyses of 20 Streptococcus pneumoniae genomes for shared suffixes. Because all analytical and data retrieval tasks are executed by volunteer machines, few server resources are required. Any modern web browser can submit those tasks and/or volunteer to execute them without installing any extra plugins or programs. A client library provides high-level distribution templates including MapReduce. This stark departure from the current reliance on expensive server hardware running "download and install" software has already gathered substantial community interest, as QM received more than 2.2 million API calls from 87 countries in 12 months. CONCLUSIONS: QM was found adequate to deliver the sort of scalable bioinformatics solutions that computation- and data-intensive workflows require. Paradoxically, the sandboxed execution of code by web browsers was also found to enable them, as compute nodes, to address critical privacy concerns that characterize biomedical environments.
Subject(s)
Software Design , Web Browser , Computational Biology/methods , Genome , Genomics/methods , Humans , Streptococcus pneumoniae/genetics
ABSTRACT
BACKGROUND: Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard-disks which are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search and compute on it. RESULTS: To address the above challenge, we have implemented an NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of an integrated High-performance Integrated Virtual Environment (HIVE) that encapsulates Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof-of-concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program - The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR available clinical information on patients, mapping of the reads to the reference followed by identification of non-synonymous Single Nucleotide Variations (nsSNVs), and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have also developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome.
Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr). CONCLUSIONS: Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual level SNVs and their effect on the human proteome beyond what the dbSNP database provides.