ABSTRACT
BACKGROUND: Due to the degeneracy of the genetic code, most amino acids can be encoded by multiple synonymous codons. Synonymous codons naturally occur with different frequencies in different organisms. The choice of codons may affect protein expression, structure, and function. Recombinant gene technologies commonly take advantage of the effect on expression through a technique termed codon optimization, in which codons are replaced with synonymous ones in order to increase protein expression. This technique relies on accurate knowledge of codon usage frequencies. Accurately quantifying codon usage bias for different organisms is useful not only for codon optimization but also for evolutionary and translation studies: phylogenetic relationships between organisms and host-pathogen co-evolution may be explored through codon usage similarities. Furthermore, codon usage has been shown to affect protein structure and function by interfering with translation kinetics and cotranslational protein folding. RESULTS: Despite the obvious need for accurate codon usage tables, currently available resources are either limited in scope, encompassing only organisms from specific domains of life, or greatly outdated. Taking advantage of the exponential growth of GenBank and the creation of NCBI's RefSeq database, we have developed a new database, the High-performance Integrated Virtual Environment-Codon Usage Tables (HIVE-CUTs), to present and analyse codon usage tables for every organism with publicly available sequencing data. Compared to existing databases, this new database is more comprehensive, addresses concerns that limited the accuracy of earlier databases, and provides several new functionalities, such as the ability to view and compare codon usage between individual organisms and across taxonomic clades, through graphical representation or through commonly used indices. In addition, it is routinely updated to keep up with the continuous flow of new data in GenBank and RefSeq. CONCLUSION: Given the impact of codon usage bias on recombinant gene technologies, this database will facilitate effective development and review of recombinant drug products and will be instrumental in a wide range of biological research. The database is available at hive.biochemistry.gwu.edu/review/codon.
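As an illustration of what a codon usage table contains, the following is a minimal sketch that counts in-frame codons across coding sequences and scales the counts to frequencies per thousand codons. The function name `codon_usage_per_thousand` and the toy sequences are assumptions made for this example; this is not the HIVE-CUTs pipeline, whose implementation the abstract does not describe.

```python
from collections import Counter


def codon_usage_per_thousand(cds_sequences):
    """Count in-frame codons over coding sequences; return frequency per 1000 codons."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper()
        # Walk the sequence in frame, ignoring a trailing partial codon.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            # Skip codons that contain ambiguous bases such as N.
            if set(codon) <= set("ACGT"):
                counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {codon: 1000.0 * n / total for codon, n in counts.items()}


# Toy coding sequences (made up for illustration only).
table = codon_usage_per_thousand(["ATGGGTGGCTAA", "ATGGGTTAA"])
for codon, freq in sorted(table.items()):
    print(codon, round(freq, 1))
```

Frequencies per thousand are a common convention for such tables; raw counts can be kept alongside them when tables from several sequences need to be merged.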
Subjects
Codon, Nucleic Acid Databases, Animals, Humans
ABSTRACT
BACKGROUND: Gene expression is highly variable across tissues of multi-cellular organisms, influencing the codon usage of the tissue-specific transcriptome. Cancer disrupts the gene expression pattern of healthy tissue, resulting in altered codon usage preferences. How codon usage changes relate to codon demand and tRNA supply in cancer is a topic of growing interest. METHODS: We analyzed transcriptome-weighted codon and codon pair usage based on The Cancer Genome Atlas (TCGA) RNA-seq data from 6427 solid tumor samples and 632 normal tissue samples. This dataset represents 32 cancer types affecting 11 distinct tissues. Our analysis focused on tissues that give rise to multiple solid tumor types and cancer types that are present in multiple tissues. RESULTS: We identified distinct patterns of synonymous codon usage changes for different cancer types affecting the same tissue. For example, a substantial increase in GGT-glycine was observed in invasive ductal carcinoma (IDC), invasive lobular carcinoma (ILC), and mixed invasive ductal and lobular carcinoma (IDLC) of the breast. Change in synonymous codon preference favoring GGT correlated with change in synonymous codon preference against GGC in IDC and IDLC, but not in ILC. Furthermore, we examined the codon usage changes between paired healthy/tumor tissue from the same patient. Using clinical data from TCGA, we conducted a survival analysis of patients based on the degree of change between healthy and tumor-specific codon usage, revealing an association between larger changes and increased mortality. We have also created a database that contains cancer-specific codon and codon pair usage data for cancer types derived from TCGA, which represents a comprehensive tool for codon-usage-oriented cancer research. CONCLUSIONS: Based on data from TCGA, we have highlighted tumor type-specific signatures of codon and codon pair usage. Paired data revealed variable changes to codon usage patterns, which must be considered when designing personalized cancer treatments. The associated database, CancerCoCoPUTs, represents a comprehensive resource for codon and codon pair usage in cancer and is available at https://dnahive.fda.gov/review/cancercocoputs/. These findings are important for understanding the relationship between tRNA supply and codon demand in cancer states and could help guide the development of new cancer therapeutics.
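The notion of a change in synonymous codon preference, such as the GGT versus GGC shift among glycine codons mentioned above, can be sketched as follows. The helper names and all counts are hypothetical and chosen only for illustration; the abstract does not state how the authors quantified preference, so taking the fraction of each codon among its synonymous set is an assumption.

```python
GLYCINE_CODONS = ("GGT", "GGC", "GGA", "GGG")


def synonymous_fractions(codon_counts, synonymous_codons=GLYCINE_CODONS):
    """Fraction of each codon among all codons encoding the same amino acid."""
    total = sum(codon_counts.get(c, 0) for c in synonymous_codons)
    if total == 0:
        return {c: 0.0 for c in synonymous_codons}
    return {c: codon_counts.get(c, 0) / total for c in synonymous_codons}


def preference_change(normal_counts, tumor_counts, synonymous_codons=GLYCINE_CODONS):
    """Tumor minus normal synonymous fraction for each codon."""
    normal = synonymous_fractions(normal_counts, synonymous_codons)
    tumor = synonymous_fractions(tumor_counts, synonymous_codons)
    return {c: tumor[c] - normal[c] for c in synonymous_codons}


# Made-up transcriptome-weighted counts for one paired normal/tumor sample.
normal = {"GGT": 20, "GGC": 40, "GGA": 20, "GGG": 20}
tumor = {"GGT": 30, "GGC": 30, "GGA": 20, "GGG": 20}
change = preference_change(normal, tumor)
print(change)
```

In this toy sample the gain in GGT is mirrored by a loss in GGC, which is the kind of coupled shift the abstract reports for IDC and IDLC.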
Subjects
Codon Usage, Codon, Computational Biology/methods, Genetic Databases, Neoplasms/diagnosis, Neoplasms/genetics, Tumor Biomarkers, Gene Expression Profiling, Neoplastic Gene Expression Regulation, Genome-Wide Association Study, Genomics/methods, Humans, Kaplan-Meier Estimate, Neoplasms/mortality, Prognosis, Transcriptome
ABSTRACT
Protein expression in multicellular organisms varies widely across tissues. Codon usage in the transcriptome of each tissue is derived from genomic codon usage and the relative expression level of each gene. We created a comprehensive computational resource that houses tissue-specific codon, codon-pair, and dinucleotide usage data for 51 Homo sapiens tissues (TissueCoCoPUTs: https://hive.biochemistry.gwu.edu/review/tissue_codon), using transcriptome data from the Broad Institute Genotype-Tissue Expression (GTEx) portal. Distances between tissue-specific codon and codon-pair frequencies were used to generate a dendrogram based on the unique patterns of codon and codon-pair usage in each tissue that are clearly distinct from the genomic distribution. This novel resource may be useful in unraveling the relationship between codon usage and tRNA abundance, which could be critical in determining translation kinetics and efficiency across tissues. Areas of investigation such as biotherapeutic development, tissue-specific genetic engineering, and genetic disease prediction will greatly benefit from this resource.
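The statement that tissue codon usage is derived from genomic codon usage and relative gene expression can be made concrete with a small sketch: each gene's codon counts are weighted by that gene's expression level in the tissue and then normalized. The function names, gene identifiers, and expression values below are hypothetical, and the use of a TPM-like expression value as the weight is an assumption rather than the documented TissueCoCoPUTs procedure.

```python
from collections import Counter


def count_codons(cds):
    """Count in-frame codons in one coding sequence, ignoring a trailing partial codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))


def transcriptome_weighted_usage(gene_cds, expression):
    """Weight each gene's codon counts by its expression level, then normalize to fractions.

    gene_cds: mapping of gene name to coding sequence.
    expression: mapping of gene name to expression level (assumed TPM-like).
    """
    weighted = Counter()
    for gene, cds in gene_cds.items():
        weight = expression.get(gene, 0.0)
        for codon, n in count_codons(cds).items():
            weighted[codon] += weight * n
    total = sum(weighted.values())
    if total == 0:
        return {}
    return {codon: value / total for codon, value in weighted.items()}


# Hypothetical genes and expression levels in one tissue.
genes = {"geneA": "ATGGGTTAA", "geneB": "ATGGGCGGCTAA"}
tissue_expression = {"geneA": 3.0, "geneB": 1.0}
usage = transcriptome_weighted_usage(genes, tissue_expression)
print(usage)
```

Although geneB contains more glycine codons, the highly expressed geneA dominates the weighted profile, which is why tissue-level usage can differ from the genomic distribution.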
Subjects
Codon/genetics, Genetic Databases, Gene Expression Regulation/genetics, Organ Specificity/genetics, Codon Usage/genetics, Human Genome/genetics, Genotype, Humans, Internet
ABSTRACT
Whole genome sequencing of bacterial isolates has become a daily task in many laboratories, generating very large amounts of data. However, data acquisition is not an end in itself; the goal is to acquire high-quality data useful for understanding genetic relationships. A method that could rapidly determine which of the many available run metrics are the most important indicators of overall run quality, together with a way to monitor these during a given sequencing run, would be extremely helpful to this end. Therefore, we compared various run metrics across 486 MiSeq runs from five different machines. By performing a statistical analysis of the metrics using principal components analysis and a K-means clustering algorithm, we were able to validate metric comparisons among instruments, allowing for the development of a predictive algorithm that indicates whether a given MiSeq run has performed adequately. This algorithm is available as an Excel spreadsheet, the MiSeq Instrument & Run (In-Run) Forecast. Our tool can help verify that the quantity and quality of the generated sequencing data consistently meet or exceed recommended manufacturer expectations. Patterns of deviation from those expectations can be used to assess potential run problems and plan preventative maintenance, which can save valuable time and funding resources.
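The clustering step can be sketched in a few lines of plain Python. This is a minimal illustration of K-means on standardized run metrics, not the authors' analysis: the principal components step is omitted for brevity, the two metrics (cluster density and percentage of bases at or above Q30) and all values are invented, and the initial centroids are picked by hand.

```python
import math


def standardize(rows):
    """Scale each metric (column) to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def kmeans(points, initial_centroids, iterations=50):
    """Plain K-means: alternate nearest-centroid assignment and centroid update."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(len(centroids)),
                      key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[k])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Invented run metrics: [cluster density (K/mm^2), percent of bases >= Q30].
runs = [[1000, 92], [1050, 90], [980, 93], [1500, 70], [1550, 68], [1480, 72]]
scaled = standardize(runs)
labels, centroids = kmeans(scaled, initial_centroids=[scaled[0], scaled[3]])
print(labels)
```

Standardizing first keeps a metric with large raw values, such as cluster density, from dominating the distance calculation.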
Subjects
Bacteria/genetics, Bacterial Genome, High-Throughput Nucleotide Sequencing/methods, High-Throughput Nucleotide Sequencing/standards, Quality Control, Whole Genome Sequencing/methods, Whole Genome Sequencing/standards, Algorithms, Statistical Models
ABSTRACT
Usage of sequential codon-pairs is non-random and unique to each species. Codon-pair bias is related to but clearly distinct from individual codon usage bias. Codon-pair bias is thought to affect translational fidelity and efficiency and is presumed to be under selective pressure. It has been suggested that changes in codon-pair utilization may affect human disease more significantly than changes in single codons. Although recombinant gene technologies often take codon-pair usage bias into account, codon-pair usage tables are not readily available, potentially impeding research efforts. The present computational resource (https://hive.biochemistry.gwu.edu/review/codon2) systematically addresses this issue. Building on our recent HIVE-Codon Usage Tables, we constructed a new database that includes genomic codon-pair and dinucleotide statistics for all organisms with a sequenced genome available in GenBank. We believe that the growing understanding of the importance of codon-pair usage will make this resource an invaluable tool for many researchers in academia and the pharmaceutical industry.
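To make the two statistics concrete, the sketch below counts adjacent in-frame codon pairs and overlapping dinucleotides in a single sequence. The function names and the toy sequence are assumptions for illustration, and counting dinucleotides at every position regardless of reading frame is one possible convention, not necessarily the one used by the database.

```python
from collections import Counter


def codon_pair_counts(cds):
    """Count adjacent in-frame codon pairs in one coding sequence."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    return Counter(zip(codons, codons[1:]))


def dinucleotide_counts(seq):
    """Count overlapping dinucleotides at every position, regardless of frame."""
    seq = seq.upper()
    return Counter(seq[i:i + 2] for i in range(len(seq) - 1))


# Toy coding sequence (made up for illustration only).
pairs = codon_pair_counts("ATGGGTGGTTAA")
dinucs = dinucleotide_counts("ATGGGTGGTTAA")
print(pairs)
print(dinucs)
```

A sequence of n codons yields n - 1 codon pairs, so pair tables have up to 64 x 64 entries and need far more sequence data than single-codon tables to be well populated.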
Subjects
Codon Usage, Computational Biology/methods, Genetic Variation, Algorithms, Base Sequence, Genetic Databases, Humans
ABSTRACT
Efficiency has become one of the main concerns in evolutionary multiobjective optimization in recent years. One possible way to achieve faster convergence is to use a relaxed form of Pareto dominance that allows us to regulate the granularity of the approximation of the Pareto front that we wish to achieve. One such relaxed form of Pareto dominance that has become popular in the last few years is epsilon-dominance, which has been mainly used as an archiving strategy in some multiobjective evolutionary algorithms. Despite its advantages, epsilon-dominance has some limitations. In this paper, we propose a mechanism that can be seen as a variant of epsilon-dominance, which we call Pareto-adaptive epsilon-dominance (paepsilon-dominance). Our proposed approach tries to overcome the main limitation of epsilon-dominance: the loss of several nondominated solutions from the hypergrid adopted in the archive because of the way in which solutions are selected within each box.
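For readers unfamiliar with the baseline concept, the sketch below shows the standard additive form of epsilon-dominance for minimization problems and the hypergrid box index commonly used in epsilon-archives. It does not implement the Pareto-adaptive variant proposed in the paper, whose details the abstract does not give; the function names, the single shared epsilon, and the sample points are assumptions for illustration.

```python
import math


def epsilon_dominates(a, b, eps):
    """Additive epsilon-dominance (minimization): a_i - eps <= b_i for every objective."""
    return all(x - eps <= y for x, y in zip(a, b))


def box_index(point, eps):
    """Hypergrid box of a point when every objective axis is cut into cells of width eps."""
    return tuple(math.floor(x / eps) for x in point)


# Two nondominated points (neither Pareto-dominates the other) in the same box.
p = (0.25, 0.71)
q = (0.21, 0.74)
print(box_index(p, 0.1), box_index(q, 0.1))

# With eps = 0.1, a slightly worse point still epsilon-dominates a nearby one.
print(epsilon_dominates((1.0, 2.05), (1.0, 2.0), 0.1))
```

Because an epsilon-archive typically keeps a single solution per box, one of the two nondominated points above would be discarded, which is the kind of loss the abstract identifies as the main limitation.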