Search | VHL Regional Portal

Distinct signatures of codon and codon pair usage in 32 primary tumor types in the novel database CancerCoCoPUTs for cancer-specific codon usage.

Meyer, Douglas; Kames, Jacob; Bar, Haim; Komar, Anton A; Alexaki, Aikaterini; Ibla, Juan; Hunt, Ryan C; Santana-Quintero, Luis V; Golikov, Anton; DiCuccio, Michael; Kimchi-Sarfaty, Chava.

Genome Med ; 13(1): 122, 2021 07 28.

Article in English | MEDLINE | ID: mdl-34321100

ABSTRACT

BACKGROUND: Gene expression is highly variable across tissues of multi-cellular organisms, influencing the codon usage of the tissue-specific transcriptome. Cancer disrupts the gene expression pattern of healthy tissue, resulting in altered codon usage preferences. The topic of codon usage changes as they relate to codon demand and tRNA supply in cancer is of growing interest. METHODS: We analyzed transcriptome-weighted codon and codon pair usage based on The Cancer Genome Atlas (TCGA) RNA-seq data from 6427 solid tumor samples and 632 normal tissue samples. This dataset represents 32 cancer types affecting 11 distinct tissues. Our analysis focused on tissues that give rise to multiple solid tumor types and cancer types that are present in multiple tissues. RESULTS: We identified distinct patterns of synonymous codon usage changes for different cancer types affecting the same tissue. For example, a substantial increase in GGT-glycine was observed in invasive ductal carcinoma (IDC), invasive lobular carcinoma (ILC), and mixed invasive ductal and lobular carcinoma (IDLC) of the breast. Change in synonymous codon preference favoring GGT correlated with change in synonymous codon preference against GGC in IDC and IDLC, but not in ILC. Furthermore, we examined the codon usage changes between paired healthy/tumor tissue from the same patient. Using clinical data from TCGA, we conducted a survival analysis of patients based on the degree of change between healthy and tumor-specific codon usage, revealing an association between larger changes and increased mortality. We have also created a database that contains cancer-specific codon and codon pair usage data for cancer types derived from TCGA, which represents a comprehensive tool for codon-usage-oriented cancer research. CONCLUSIONS: Based on data from TCGA, we have highlighted tumor type-specific signatures of codon and codon pair usage. Paired data revealed variable changes to codon usage patterns, which must be considered when designing personalized cancer treatments. The associated database, CancerCoCoPUTs, represents a comprehensive resource for codon and codon pair usage in cancer and is available at https://dnahive.fda.gov/review/cancercocoputs/. These findings are important to understand the relationship between tRNA supply and codon demand in cancer states and could help guide the development of new cancer therapeutics.

Subject(s)

Codon Usage , Codon , Computational Biology/methods , Databases, Genetic , Neoplasms/diagnosis , Neoplasms/genetics , Biomarkers, Tumor , Gene Expression Profiling , Gene Expression Regulation, Neoplastic , Genome-Wide Association Study , Genomics/methods , Humans , Kaplan-Meier Estimate , Neoplasms/mortality , Prognosis , Transcriptome

Bioinformatics tools developed to support BioCompute Objects.

Patel, Janisha A; Dean, Dennis A; King, Charles Hadley; Xiao, Nan; Koc, Soner; Minina, Ekaterina; Golikov, Anton; Brooks, Phillip; Kahsay, Robel; Navelkar, Rahi; Ray, Manisha; Roberson, Dave; Armstrong, Chris; Mazumder, Raja; Keeney, Jonathon.

Database (Oxford) ; 2021, 2021 03 30.

Article in English | MEDLINE | ID: mdl-33784373

ABSTRACT

Developments in high-throughput sequencing (HTS) result in an exponential increase in the amount of data generated by sequencing experiments, an increase in the complexity of bioinformatics analysis reporting and an increase in the types of data generated. These increases in volume, diversity and complexity of the data generated and their analysis expose the necessity of a structured and standardized reporting template. BioCompute Objects (BCOs) provide the requisite support for communication of HTS data analysis that includes support for workflow, as well as data, curation, accessibility and reproducibility of communication. BCOs standardize how researchers report provenance and the established verification and validation protocols used in workflows while also being robust enough to convey content integration or curation in knowledge bases. BCOs that encapsulate tools, platforms, datasets and workflows are FAIR (findable, accessible, interoperable and reusable) compliant. Providing operational workflow and data information facilitates interoperability between platforms and incorporation of future datasets within an HTS analysis for use within industrial, academic and regulatory settings. Cloud-based platforms, including High-performance Integrated Virtual Environment (HIVE), Cancer Genomics Cloud (CGC) and Galaxy, support BCO generation for users. Given the 100K+ userbase across these platforms, BioCompute can be leveraged for workflow documentation. In this paper, we report the availability of platform-dependent and platform-independent BCO tools: HIVE BCO App, CGC BCO App, Galaxy BCO API Extension and BCO Portal. Community engagement was utilized to evaluate tool efficacy. We demonstrate that these tools further advance BCO creation from text editing approaches used in earlier releases of the standard. Moreover, we demonstrate that integrating BCO generation within existing analysis platforms greatly streamlines BCO creation while capturing granular workflow details. We also demonstrate that the BCO tools described in the paper provide an approach to solve the long-standing challenge of standardizing workflow descriptions that are both human and machine readable while accommodating manual and automated curation with evidence tagging. Database URL: https://www.biocomputeobject.org/resources.

Subject(s)

Computational Biology , Genomics , High-Throughput Nucleotide Sequencing , Humans , Reproducibility of Results , Software , Workflow

TissueCoCoPUTs: Novel Human Tissue-Specific Codon and Codon-Pair Usage Tables Based on Differential Tissue Gene Expression.

Kames, Jacob; Alexaki, Aikaterini; Holcomb, David D; Santana-Quintero, Luis V; Athey, John C; Hamasaki-Katagiri, Nobuko; Katneni, Upendra; Golikov, Anton; Ibla, Juan C; Bar, Haim; Kimchi-Sarfaty, Chava.

J Mol Biol ; 432(11): 3369-3378, 2020 05 15.

Article in English | MEDLINE | ID: mdl-31982380

ABSTRACT

Protein expression in multicellular organisms varies widely across tissues. Codon usage in the transcriptome of each tissue is derived from genomic codon usage and the relative expression level of each gene. We created a comprehensive computational resource that houses tissue-specific codon, codon-pair, and dinucleotide usage data for 51 Homo sapiens tissues (TissueCoCoPUTs: https://hive.biochemistry.gwu.edu/review/tissue_codon), using transcriptome data from the Broad Institute Genotype-Tissue Expression (GTEx) portal. Distances between tissue-specific codon and codon-pair frequencies were used to generate a dendrogram based on the unique patterns of codon and codon-pair usage in each tissue that are clearly distinct from the genomic distribution. This novel resource may be useful in unraveling the relationship between codon usage and tRNA abundance, which could be critical in determining translation kinetics and efficiency across tissues. Areas of investigation such as biotherapeutic development, tissue-specific genetic engineering, and genetic disease prediction will greatly benefit from this resource.
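The abstract above states that tissue codon usage is derived from genomic codon usage and the relative expression level of each gene. A minimal Python sketch of that idea follows. It is an illustration only, not the authors' pipeline: the function names, the toy sequences, and the expression values are hypothetical, and the abstract does not specify expression units, isoform handling, or normalization.

```python
from collections import Counter


def codon_counts(cds):
    """Count in-frame codons in a coding sequence, ignoring any trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def weighted_codon_usage(cds_by_gene, expression):
    """Weight each gene's codon counts by its expression level and
    return codon frequencies per 1000 codons (assumed normalization)."""
    totals = Counter()
    for gene, cds in cds_by_gene.items():
        weight = expression.get(gene, 0.0)  # genes without expression data contribute nothing
        for codon, n in codon_counts(cds).items():
            totals[codon] += weight * n
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {codon: 1000.0 * value / grand_total for codon, value in totals.items()}


# Hypothetical toy input: two short coding sequences and made-up expression levels.
cds_by_gene = {
    "geneA": "ATGGGTGGTTAA",     # uses GGT for glycine
    "geneB": "ATGGGCGGCGGCTAA",  # uses GGC for glycine
}
expression = {"geneA": 10.0, "geneB": 1.0}

usage = weighted_codon_usage(cds_by_gene, expression)
for codon in sorted(usage):
    print(f"{codon}\t{usage[codon]:.2f}")
```

In this toy case the highly expressed geneA dominates the weighted table, so GGT ends up more frequent than GGC even though geneB contains more glycine codons. Codon-pair tables could be built the same way by counting adjacent codon pairs instead of single codons.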

Subject(s)

Codon/genetics , Databases, Genetic , Gene Expression Regulation/genetics , Organ Specificity/genetics , Codon Usage/genetics , Genome, Human/genetics , Genotype , Humans , Internet

High-performance integrated virtual environment (HIVE): a robust infrastructure for next-generation sequence data analysis.

Simonyan, Vahan; Chumakov, Konstantin; Dingerdissen, Hayley; Faison, William; Goldweber, Scott; Golikov, Anton; Gulzar, Naila; Karagiannis, Konstantinos; Vinh Nguyen Lam, Phuc; Maudru, Thomas; Muravitskaja, Olesja; Osipova, Ekaterina; Pan, Yang; Pschenichnov, Alexey; Rostovtsev, Alexandre; Santana-Quintero, Luis; Smith, Krista; Thompson, Elaine E; Tkachenko, Valery; Torcivia-Rodriguez, John; Voskanian, Alin; Wan, Quan; Wang, Jing; Wu, Tsung-Jung; Wilson, Carolyn; Mazumder, Raja.

Database (Oxford) ; 2016, 2016.

Article in English | MEDLINE | ID: mdl-26989153

ABSTRACT

The High-performance Integrated Virtual Environment (HIVE) is a distributed storage and compute environment designed primarily to handle next-generation sequencing (NGS) data. This multicomponent cloud infrastructure provides secure web access for authorized users to deposit, retrieve, annotate and compute on NGS data, and to analyse the outcomes using web interface visual environments appropriately built in collaboration with research and regulatory scientists and other end users. Unlike many massively parallel computing environments, HIVE uses a cloud control server which virtualizes services, not processes. It is both very robust and flexible due to the abstraction layer introduced between computational requests and operating system processes. The novel paradigm of moving computations to the data, instead of moving data to computational nodes, has proven to be significantly less taxing for both hardware and network infrastructure. The honeycomb data model developed for HIVE integrates metadata into an object-oriented model. Its distinction from other object-oriented databases is in the additional implementation of a unified application program interface to search, view and manipulate data of all types. This model simplifies the introduction of new data types, thereby minimizing the need for database restructuring and streamlining the development of new integrated information systems. The honeycomb model employs a highly secure hierarchical access control and permission system, allowing determination of data access privileges in a finely granular manner without flooding the security subsystem with a multiplicity of rules. HIVE infrastructure will allow engineers and scientists to perform NGS analysis in a manner that is both efficient and secure. HIVE is actively supported in public and private domains, and project collaborations are welcomed. Database URL: https://hive.biochemistry.gwu.edu.

Subject(s)

High-Throughput Nucleotide Sequencing/methods , User-Computer Interface , Computational Biology , Mutation/genetics , Poliovirus/genetics , Poliovirus Vaccines/immunology , Proteomics , Recombination, Genetic , Sequence Alignment , Statistics as Topic

Non-synonymous variations in cancer and their effects on the human proteome: workflow for NGS data biocuration and proteome-wide analysis of TCGA data.

Cole, Charles; Krampis, Konstantinos; Karagiannis, Konstantinos; Almeida, Jonas S; Faison, William J; Motwani, Mona; Wan, Quan; Golikov, Anton; Pan, Yang; Simonyan, Vahan; Mazumder, Raja.

BMC Bioinformatics ; 15: 28, 2014 Jan 27.

Article in English | MEDLINE | ID: mdl-24467687

ABSTRACT

BACKGROUND: Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard-disks which are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search and compute on it. RESULTS: To address the above challenge, we have implemented an NGS biocuration workflow and are analyzing short read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux Virtual Machine which is used upstream of an integrated High-performance Integrated Virtual Environment (HIVE) that encapsulates Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof-of-concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program - The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR available clinical information on patients, mapping of the reads to the reference followed by identification of non-synonymous Single Nucleotide Variations (nsSNVs) and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have also developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr). CONCLUSIONS: Availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual level SNVs and their effect on the human proteome beyond what the dbSNP database provides.

Subject(s)

High-Throughput Nucleotide Sequencing/methods , Neoplasms/genetics , Proteome/genetics , Proteomics/methods , Algorithms , Biomedical Research , Database Management Systems , Databases, Genetic , Humans , Neoplasms/metabolism , Phylogeny , Polymorphism, Single Nucleotide , Proteome/classification , Proteome/metabolism , User-Computer Interface
