Search | VHL Regional Portal

1.

NCI Cancer Research Data Commons: Resources to Share Key Cancer Data.

Wang, Zhining; Davidsen, Tanja M; Kuffel, Gina R; Addepalli, KanakaDurga; Bell, Amanda; Casas-Silva, Esmeralda; Dingerdissen, Hayley; Farahani, Keyvan; Fedorov, Andrey; Gaheen, Sharon; Grossman, Robert L; Kikinis, Ron; Kim, Erika; Otridge, John; Pihl, Todd; Porter, Melissa; Rodriguez, Henry; Staudt, Louis M; Thangudu, Ratna R; Venkatachari, Sudha; Zenklusen, Jean Claude; Zhang, Xu; Barnholtz-Sloan, Jill S; Kerlavage, Anthony R.

Cancer Res ; 84(9): 1388-1395, 2024 May 02.

Article in English | MEDLINE | ID: mdl-38488507

ABSTRACT

Since 2014, the NCI has launched a series of data commons as part of the Cancer Research Data Commons (CRDC) ecosystem housing genomic, proteomic, imaging, and clinical data to support cancer research and promote data sharing of NCI-funded studies. This review describes each data commons (Genomic Data Commons, Proteomic Data Commons, Integrated Canine Data Commons, Cancer Data Service, Imaging Data Commons, and Clinical and Translational Data Commons), including their unique and shared features, accomplishments, and challenges. Also discussed is how the CRDC data commons implement Findable, Accessible, Interoperable, Reusable (FAIR) principles and promote data sharing in support of the new NIH Data Management and Sharing Policy. See related articles by Brady et al., p. 1384, Pot et al., p. 1396, and Kim et al., p. 1404.

Subject(s)

Information Dissemination , National Cancer Institute (U.S.) , Neoplasms , Humans , United States , Neoplasms/metabolism , Information Dissemination/methods , Biomedical Research , Genomics/methods , Animals , Proteomics/methods

2.

Modeling and integration of N-glycan biomarkers in a comprehensive biomarker data model.

Lyman, Daniel F; Bell, Amanda; Black, Alyson; Dingerdissen, Hayley; Cauley, Edmund; Gogate, Nikhita; Liu, David; Joseph, Ashia; Kahsay, Robel; Crichton, Daniel J; Mehta, Anand; Mazumder, Raja.

Glycobiology ; 32(10): 855-870, 2022 09 19.

Article in English | MEDLINE | ID: mdl-35925813

ABSTRACT

Molecular biomarkers measure discrete components of biological processes that can contribute to disorders when impaired. Great interest exists in discovering early cancer biomarkers to improve outcomes. Biomarkers represented in a standardized data model, integrated with multi-omics data, may improve the understanding and use of novel biomarkers such as glycans and glycoconjugates. Among altered components in tumorigenesis, N-glycans exhibit substantial biomarker potential, when analyzed with their protein carriers. However, such data are distributed across publications and databases of diverse formats, which hamper their use in research and clinical application. Mass spectrometry measures of 50 N-glycans on 7 serum proteins in liver disease were integrated (as a panel) into a cancer biomarker data model, providing a unique identifier, standard nomenclature, links to glycan resources, and accession and ontology annotations to standard protein, gene, disease, and biomarker information. Data provenance was documented with a standardized United States Food and Drug Administration-supported BioCompute Object. Using the biomarker data model allows the capture of granular information, such as glycans with different levels of abundance in cirrhosis, hepatocellular carcinoma, and transplant groups. Such representation in a standardized data model harmonizes glycomics data in a unified framework, making glycan-protein biomarker data exploration more available to investigators and to other data resources. The biomarker data model we describe can be used by researchers to describe their novel glycan and glycoconjugate biomarkers; it can integrate N-glycan biomarker data with multi-source biomedical data and can foster discovery and insight within a unified data framework for glycan biomarker representation, thereby making the data FAIR (Findable, Accessible, Interoperable, Reusable) (https://www.go-fair.org/fair-principles/).

Subject(s)

Carcinoma, Hepatocellular , Liver Neoplasms , Biomarkers , Biomarkers, Tumor , Carcinoma, Hepatocellular/diagnosis , Glycomics/methods , Humans , Liver Neoplasms/diagnosis , Polysaccharides/chemistry

3.

OncoMX: A Knowledgebase for Exploring Cancer Biomarkers in the Context of Related Cancer and Healthy Data.

Dingerdissen, Hayley M; Bastian, Frederic; Vijay-Shanker, K; Robinson-Rechavi, Marc; Bell, Amanda; Gogate, Nikhita; Gupta, Samir; Holmes, Evan; Kahsay, Robel; Keeney, Jonathon; Kincaid, Heather; King, Charles Hadley; Liu, David; Crichton, Daniel J; Mazumder, Raja.

JCO Clin Cancer Inform ; 4: 210-220, 2020 03.

Article in English | MEDLINE | ID: mdl-32142370

ABSTRACT

PURPOSE: The purpose of OncoMX knowledgebase development was to integrate cancer biomarker and relevant data types into a meta-portal, enabling the research of cancer biomarkers side by side with other pertinent multidimensional data types. METHODS: Cancer mutation, cancer differential expression, cancer expression specificity, healthy gene expression from human and mouse, literature mining for cancer mutation and cancer expression, and biomarker data were integrated, unified by relevant biomedical ontologies, and subjected to rule-based automated quality control before ingestion into the database. RESULTS: OncoMX provides integrated data encompassing more than 1,000 unique biomarker entries (939 from the Early Detection Research Network [EDRN] and 96 from the US Food and Drug Administration) mapped to 20,576 genes that have either mutation or differential expression in cancer. Sentences reporting mutation or differential expression in cancer were extracted from more than 40,000 publications, and healthy gene expression data with samples mapped to organs are available for both human genes and their mouse orthologs. CONCLUSION: OncoMX has prioritized user feedback as a means of guiding development priorities. By mapping to and integrating data from several cancer genomics resources, it is hoped that OncoMX will foster a dynamic engagement between bioinformaticians and cancer biomarker researchers. This engagement should culminate in a community resource that substantially improves the ability and efficiency of exploring cancer biomarker data and related multidimensional data.

Subject(s)

Biomarkers, Tumor/analysis , Computational Biology/methods , Data Mining/methods , Databases, Genetic/standards , Knowledge Bases , Neoplasms/diagnosis , Software , Animals , Biological Ontologies , Humans , Mice , Neoplasms/therapy , User-Computer Interface

4.

A Primer for Access to Repositories of Cancer-Related Genomic Big Data.

Torcivia-Rodriguez, John; Dingerdissen, Hayley; Chang, Ting-Chia; Mazumder, Raja.

Methods Mol Biol ; 1878: 1-37, 2019.

Article in English | MEDLINE | ID: mdl-30378067

ABSTRACT

The use of large datasets has become ubiquitous in biomedical sciences. Researchers in the field of cancer genomics have, in recent years, generated large volumes of data from their experiments. Those responsible for production of this data often analyze a narrow subset of this data based on the research question they are trying to address: this is the case whether or not they are acting independently or in conjunction with a large-scale cancer genomics project. The reality of this situation creates the opportunity for other researchers to repurpose this data for different hypotheses if the data is made easily and freely available. New insights in biology resulting from more researchers having access to data they otherwise would be unable to generate on their own are a boon for the field. The following chapter reviews several cancer genomics-related databases and outlines the type of data they contain, as well as the methods required to access each database. While this list is not comprehensive, it should provide a basis for cancer researchers to begin exploring some of the many large datasets that are available to them.

Subject(s)

Neoplasms/genetics , Databases, Genetic , Genomics/methods , Humans , Research

5.

Identification of key differentially expressed MicroRNAs in cancer patients through pan-cancer analysis.

Hu, Yu; Dingerdissen, Hayley; Gupta, Samir; Kahsay, Robel; Shanker, Vijay; Wan, Quan; Yan, Cheng; Mazumder, Raja.

Comput Biol Med ; 103: 183-197, 2018 12 01.

Article in English | MEDLINE | ID: mdl-30384176

ABSTRACT

microRNAs (miRNAs) functioning in gene silencing have been associated with cancer progression. However, common abnormal miRNA expression patterns and their potential roles in cancer have not yet been evaluated. To account for individual differences between patients, we retrieved miRNA sequencing data for 575 patients with both tumor and adjacent non-tumorous tissues from 14 cancer types from The Cancer Genome Atlas (TCGA). We then performed differential expression analysis using DESeq2 and edgeR. Results showed that cancer types can be grouped based on the distribution of miRNAs with different expression patterns between tumor and non-tumor samples. We found 81 significantly differentially expressed miRNAs (SDEmiRNAs) in a single cancer. We also found 21 key SDEmiRNAs (nine over-expressed and 12 under-expressed) associated with at least eight cancers each and enriched in more than 60% of patients per cancer, including four newly identified SDEmiRNAs (hsa-mir-4746, hsa-mir-3648, hsa-mir-3687, and hsa-mir-1269a). The downstream effects of these 21 SDEmiRNAs on cellular function were evaluated through enrichment and pathway analysis of 7186 protein-coding gene targets mined from literature reports of differential expression of miRNAs in cancer. This analysis enables identification of SDEmiRNA functional similarity in cell proliferation control across a wide range of cancers, and assembly of common regulatory networks over cancer-related pathways. These findings were validated by construction of a regulatory network in the PI3K pathway. This study provides evidence for the value of further analysis of SDEmiRNAs as potential biomarkers and therapeutic targets for cancer diagnosis and treatment.

Subject(s)

Gene Expression Profiling/methods , Genomics/methods , MicroRNAs/genetics , Neoplasms/genetics , Gene Expression Regulation, Neoplastic/genetics , Gene Regulatory Networks/genetics , Humans , MicroRNAs/analysis , MicroRNAs/metabolism , MicroRNAs/physiology , Neoplasms/metabolism , Neoplasms/mortality , Neoplasms/physiopathology

6.

BioMuta and BioXpress: mutation and expression knowledgebases for cancer biomarker discovery.

Dingerdissen, Hayley M; Torcivia-Rodriguez, John; Hu, Yu; Chang, Ting-Chia; Mazumder, Raja; Kahsay, Robel.

Nucleic Acids Res ; 46(D1): D1128-D1136, 2018 01 04.

Article in English | MEDLINE | ID: mdl-30053270

ABSTRACT

Single-nucleotide variation and gene expression of disease samples represent important resources for biomarker discovery. Many databases have been built to host and make available such data to the community, but these databases are frequently limited in scope and/or content. BioMuta, a database of cancer-associated single-nucleotide variations, and BioXpress, a database of cancer-associated differentially expressed genes and microRNAs, differ from other disease-associated variation and expression databases primarily through the aggregation of data across many studies into a single source with a unified representation and annotation of functional attributes. Early versions of these resources were initiated by pilot funding for specific research applications, but newly awarded funds have enabled hardening of these databases to production-level quality and will allow for sustained development of these resources for the next few years. Because both resources were developed using a similar methodology of integration, curation, unification, and annotation, we present BioMuta and BioXpress as allied databases that will facilitate a more comprehensive view of gene associations in cancer. BioMuta and BioXpress are hosted on the High-performance Integrated Virtual Environment (HIVE) server at the George Washington University at https://hive.biochemistry.gwu.edu/biomuta and https://hive.biochemistry.gwu.edu/bioxpress, respectively.

Subject(s)

Biomarkers, Tumor/genetics , Databases, Genetic , Knowledge Bases , Mutation , Neoplasms/genetics , Gene Expression Regulation, Neoplastic , Humans , MicroRNAs , User-Computer Interface

7.

DEXTER: Disease-Expression Relation Extraction from Text.

Gupta, Samir; Dingerdissen, Hayley; Ross, Karen E; Hu, Yu; Wu, Cathy H; Mazumder, Raja; Vijay-Shanker, K.

Database (Oxford) ; 2018, 2018 01 01.

Article in English | MEDLINE | ID: mdl-29860481

ABSTRACT

Gene expression levels affect biological processes and play a key role in many diseases. Characterizing expression profiles is useful for clinical research, and diagnostics and prognostics of diseases. There are currently several high-quality databases that capture gene expression information, obtained mostly from large-scale studies, such as microarray and next-generation sequencing technologies, in the context of disease. The scientific literature is another rich source of information on gene expression-disease relationships that not only have been captured from large-scale studies but have also been observed in thousands of small-scale studies. Expression information obtained from literature through manual curation can extend expression databases. While many of the existing databases include information from literature, they are limited by the time-consuming nature of manual curation and have difficulty keeping up with the explosion of publications in the biomedical field. In this work, we describe an automated text-mining tool, Disease-Expression Relation Extraction from Text (DEXTER) to extract information from literature on gene and microRNA expression in the context of disease. One of the motivations in developing DEXTER was to extend the BioXpress database, a cancer-focused gene expression database that includes data derived from large-scale experiments and manual curation of publications. The literature-based portion of BioXpress lags behind significantly compared to expression information obtained from large-scale studies and can benefit from our text-mined results. We have conducted two different evaluations to measure the accuracy of our text-mining tool and achieved average F-scores of 88.51 and 81.81% for the two evaluations, respectively. 
Also, to demonstrate the ability to extract rich expression information in different disease-related scenarios, we used DEXTER to extract information on differential expression information for 2024 genes in lung cancer, 115 glycosyltransferases in 62 cancers and 826 microRNA in 171 cancers. All extractions using DEXTER are integrated in the literature-based portion of BioXpress. Database URL: http://biotm.cis.udel.edu/DEXTER.

Subject(s)

Data Mining , Databases, Bibliographic , Databases, Genetic , Gene Expression Regulation, Neoplastic , Glycosyltransferases , Lung Neoplasms , MicroRNAs , Neoplasm Proteins , RNA, Neoplasm , Glycosyltransferases/genetics , Glycosyltransferases/metabolism , Humans , Lung Neoplasms/genetics , Lung Neoplasms/metabolism , MicroRNAs/biosynthesis , MicroRNAs/genetics , Neoplasm Proteins/genetics , Neoplasm Proteins/metabolism , RNA, Neoplasm/biosynthesis , RNA, Neoplasm/genetics

8.

Loss and gain of N-linked glycosylation sequons due to single-nucleotide variation in cancer.

Fan, Yu; Hu, Yu; Yan, Cheng; Goldman, Radoslav; Pan, Yang; Mazumder, Raja; Dingerdissen, Hayley M.

Sci Rep ; 8(1): 4322, 2018 03 12.

Article in English | MEDLINE | ID: mdl-29531238

ABSTRACT

Despite availability of sequence site-specific information resulting from years of sequencing and sequence feature curation, there have been few efforts to integrate and annotate this information. In this study, we update the number of human N-linked glycosylation sequons (NLGs), and we investigate cancer-relatedness of glycosylation-impacting somatic nonsynonymous single-nucleotide variation (nsSNV) by mapping human NLGs to cancer variation data and reporting the expected loss or gain of glycosylation sequon. We find 75.8% of all human proteins have at least one NLG for a total of 59,341 unique NLGs (includes predicted and experimentally validated). Only 27.4% of all NLGs are experimentally validated sites on 4,412 glycoproteins. With respect to cancer, 8,895 somatic-only nsSNVs abolish NLGs in 5,204 proteins and 12,939 somatic-only nsSNVs create NLGs in 7,356 proteins in cancer samples. nsSNVs causing loss of 24 NLGs on 23 glycoproteins and nsSNVs creating 41 NLGs on 40 glycoproteins are identified in three or more cancers. Of all identified cancer somatic variants causing potential loss or gain of glycosylation, only 36 have previously known disease associations. Although this work is computational, it builds on existing genomics and glycobiology research to promote identification and rank potential cancer nsSNV biomarkers for experimental validation.

Subject(s)

Neoplasms/genetics , Polymorphism, Single Nucleotide , Proteome/genetics , Genome, Human , Genomics/methods , Glycoproteins/genetics , Glycosylation , Humans

9.

HIVE-heptagon: A sensible variant-calling algorithm with post-alignment quality controls.

Simonyan, Vahan; Chumakov, Konstantin; Donaldson, Eric; Karagiannis, Konstantinos; Lam, Phuc VinhNguyen; Dingerdissen, Hayley; Voskanian, Alin.

Genomics ; 109(3-4): 131-140, 2017 07.

Article in English | MEDLINE | ID: mdl-28188908

ABSTRACT

Advances in high-throughput sequencing (HTS) technologies have greatly increased the availability of genomic data and potential discovery of clinically significant genomic variants. However, numerous issues still exist with the analysis of these data, including data complexity, the absence of formally agreed upon best practices, and inconsistent reproducibility. Toward a more robust and reproducible variant-calling paradigm, we propose a series of selective noise filtrations and post-alignment quality control (QC) techniques that may reduce the rate of false variant calls. We have implemented both novel and refined post-alignment QC mechanisms to augment existing pre-alignment QC measures. These techniques can be used independently or in combination to identify and correct issues caused during data generation or early analysis stages. The adoption of these procedures by the broader scientific community is expected to improve the identification of clinically significant variants both in terms of computational efficiency and in the confidence of the results. AVAILABILITY: https://hive.biochemistry.gwu.edu/.

Subject(s)

Algorithms , Genome, Human , High-Throughput Nucleotide Sequencing/methods , Polymorphism, Genetic , Quality Control , Genomics/methods , Humans , Reproducibility of Results , Sequence Analysis, DNA/methods

10.

Impact of Nonsynonymous Single-Nucleotide Variations on Post-Translational Modification Sites in Human Proteins.

Gulzar, Naila; Dingerdissen, Hayley; Yan, Cheng; Mazumder, Raja.

Methods Mol Biol ; 1558: 159-190, 2017.

Article in English | MEDLINE | ID: mdl-28150238

ABSTRACT

Post-translational modifications (PTMs) are covalent modifications that proteins might undergo following or sometimes during the process of translation. Together with gene diversity, PTMs contribute to the overall variety of possible protein function for a given organism. Single-nucleotide polymorphisms (SNPs) are the most common form of variations found in the human genome, and have been found to be associated with diseases like Alzheimer's disease (AD) and Parkinson's disease (PD), among many others. Studies have also shown that non-synonymous single-nucleotide variation (nsSNV) at the PTM site, which alters the corresponding encoded amino acid in the translated protein sequence, can lead to abnormal activity of a protein and can contribute to a disease phenotype. Significant advances in next-generation sequencing (NGS) technologies and high-throughput proteomics have resulted in the generation of a huge amount of data for both SNPs and PTMs. However, these data are unsystematically distributed across a number of diverse databases. Thus, there is a need for efforts toward data standardization and validation of bioinformatics algorithms that can fully leverage SNP and PTM information for biomedical research. In this book chapter, we will present some of the commonly used databases for both SNVs and PTMs and describe a broad approach that can be applied to many scenarios for studying the impact of nsSNVs on PTM sites of human proteins.

Subject(s)

Amino Acids , Computational Biology/methods , Databases, Genetic , Polymorphism, Single Nucleotide , Protein Processing, Post-Translational , Proteins , Proteomics/methods , Amino Acids/chemistry , Amino Acids/metabolism , Genetic Variation , Humans , Molecular Sequence Annotation , Mutation , Proteins/chemistry , Proteins/genetics , Proteins/metabolism , Quality Control , Software , Structure-Activity Relationship , Web Browser

11.

High-performance integrated virtual environment (HIVE): a robust infrastructure for next-generation sequence data analysis.

Simonyan, Vahan; Chumakov, Konstantin; Dingerdissen, Hayley; Faison, William; Goldweber, Scott; Golikov, Anton; Gulzar, Naila; Karagiannis, Konstantinos; Vinh Nguyen Lam, Phuc; Maudru, Thomas; Muravitskaja, Olesja; Osipova, Ekaterina; Pan, Yang; Pschenichnov, Alexey; Rostovtsev, Alexandre; Santana-Quintero, Luis; Smith, Krista; Thompson, Elaine E; Tkachenko, Valery; Torcivia-Rodriguez, John; Voskanian, Alin; Wan, Quan; Wang, Jing; Wu, Tsung-Jung; Wilson, Carolyn; Mazumder, Raja.

Database (Oxford) ; 2016, 2016.

Article in English | MEDLINE | ID: mdl-26989153

ABSTRACT

The High-performance Integrated Virtual Environment (HIVE) is a distributed storage and compute environment designed primarily to handle next-generation sequencing (NGS) data. This multicomponent cloud infrastructure provides secure web access for authorized users to deposit, retrieve, annotate and compute on NGS data, and to analyse the outcomes using web interface visual environments appropriately built in collaboration with research and regulatory scientists and other end users. Unlike many massively parallel computing environments, HIVE uses a cloud control server which virtualizes services, not processes. It is both very robust and flexible due to the abstraction layer introduced between computational requests and operating system processes. The novel paradigm of moving computations to the data, instead of moving data to computational nodes, has proven to be significantly less taxing for both hardware and network infrastructure. The honeycomb data model developed for HIVE integrates metadata into an object-oriented model. Its distinction from other object-oriented databases is in the additional implementation of a unified application program interface to search, view and manipulate data of all types. This model simplifies the introduction of new data types, thereby minimizing the need for database restructuring and streamlining the development of new integrated information systems. The honeycomb model employs a highly secure hierarchical access control and permission system, allowing determination of data access privileges in a finely granular manner without flooding the security subsystem with a multiplicity of rules. HIVE infrastructure will allow engineers and scientists to perform NGS analysis in a manner that is both efficient and secure. HIVE is actively supported in public and private domains, and project collaborations are welcomed. Database URL: https://hive.biochemistry.gwu.edu.

Subject(s)

High-Throughput Nucleotide Sequencing/methods , User-Computer Interface , Computational Biology , Mutation/genetics , Poliovirus/genetics , Poliovirus Vaccines/immunology , Proteomics , Recombination, Genetic , Sequence Alignment , Statistics as Topic

12.

BioXpress: an integrated RNA-seq-derived gene expression database for pan-cancer analysis.

Wan, Quan; Dingerdissen, Hayley; Fan, Yu; Gulzar, Naila; Pan, Yang; Wu, Tsung-Jung; Yan, Cheng; Zhang, Haichen; Mazumder, Raja.

Database (Oxford) ; 2015, 2015.

Article in English | MEDLINE | ID: mdl-25819073

ABSTRACT

BioXpress is a gene expression and cancer association database in which the expression levels are mapped to genes using RNA-seq data obtained from The Cancer Genome Atlas, International Cancer Genome Consortium, Expression Atlas and publications. The BioXpress database includes expression data from 64 cancer types, 6361 patients and 17 469 genes with 9513 of the genes displaying differential expression between tumor and normal samples. In addition to data directly retrieved from RNA-seq data repositories, manual biocuration of publications supplements the available cancer association annotations in the database. All cancer types are mapped to Disease Ontology terms to facilitate a uniform pan-cancer analysis. The BioXpress database is easily searched using HUGO Gene Nomenclature Committee gene symbol, UniProtKB/RefSeq accession or, alternatively, can be queried by cancer type with specified significance filters. This interface along with availability of pre-computed downloadable files containing differentially expressed genes in multiple cancers enables straightforward retrieval and display of a broad set of cancer-related genes.

Subject(s)

Databases, Genetic , Gene Expression Regulation, Neoplastic , Neoplasms , RNA, Neoplasm , Humans , Neoplasms/genetics , Neoplasms/metabolism , RNA, Neoplasm/biosynthesis , RNA, Neoplasm/genetics

13.

Human germline and pan-cancer variomes and their distinct functional profiles.

Pan, Yang; Karagiannis, Konstantinos; Zhang, Haichen; Dingerdissen, Hayley; Shamsaddini, Amirhossein; Wan, Quan; Simonyan, Vahan; Mazumder, Raja.

Nucleic Acids Res ; 42(18): 11570-88, 2014 Oct.

Article in English | MEDLINE | ID: mdl-25232094

ABSTRACT

Identification of non-synonymous single nucleotide variations (nsSNVs) has exponentially increased due to advances in Next-Generation Sequencing technologies. The functional impacts of these variations have been difficult to ascertain because the corresponding knowledge about sequence functional sites is quite fragmented. It is clear that mapping of variations to sequence functional features can help us better understand the pathophysiological role of variations. In this study, we investigated the effect of nsSNVs on more than 17 common types of post-translational modification (PTM) sites, active sites and binding sites. Out of 1 705 285 distinct nsSNVs on 259 216 functional sites we identified 38 549 variations that significantly affect 10 major functional sites. Furthermore, we found distinct patterns of site disruptions due to germline and somatic nsSNVs. Pan-cancer analysis across 12 different cancer types led to the identification of 51 genes with 106 nsSNV affected functional sites found in 3 or more cancer types. 13 of the 51 genes overlap with previously identified Significantly Mutated Genes (Nature. 2013 Oct 17;502(7471)). 62 mutations in these 13 genes affecting functional sites such as DNA, ATP binding and various PTM sites occur across several cancers and can be prioritized for additional validation and investigations.

Subject(s)

Genes, Neoplasm , Genetic Variation , Acetylation , Binding Sites/genetics , Catalytic Domain/genetics , Disease/genetics , Gene Ontology , Genomics , Glycosylation , Humans , Methylation , Mutation , Neoplasm Proteins/genetics , Phosphorylation/genetics , Phylogeny , Protein Processing, Post-Translational/genetics , Proteome/genetics , Ubiquitination/genetics

14.

HIVE-hexagon: high-performance, parallelized sequence alignment for next-generation sequencing data analysis.

Santana-Quintero, Luis; Dingerdissen, Hayley; Thierry-Mieg, Jean; Mazumder, Raja; Simonyan, Vahan.

PLoS One ; 9(6): e99033, 2014.

Article in English | MEDLINE | ID: mdl-24918764

ABSTRACT

Due to the size of Next-Generation Sequencing data, the computational challenge of sequence alignment has been vast. Inexact alignments can take up to 90% of total CPU time in bioinformatics pipelines. High-performance Integrated Virtual Environment (HIVE), a cloud-based environment optimized for storage and analysis of extra-large data, presents an algorithmic solution: the HIVE-hexagon DNA sequence aligner. HIVE-hexagon implements novel approaches to exploit both characteristics of sequence space and CPU, RAM and Input/Output (I/O) architecture to quickly compute accurate alignments. Key components of HIVE-hexagon include non-redundification and sorting of sequences; floating diagonals of linearized dynamic programming matrices; and consideration of cross-similarity to minimize computations. AVAILABILITY: https://hive.biochemistry.gwu.edu/hive/

Subject(s)

Sequence Alignment , Sequence Analysis, DNA/methods , Genome

15.

A framework for application of metabolic modeling in yeast to predict the effects of nsSNV in human orthologs.

Dingerdissen, Hayley; Weaver, Daniel S; Karp, Peter D; Pan, Yang; Simonyan, Vahan; Mazumder, Raja.

Biol Direct ; 9: 9, 2014 Jun 03.

Article in English | MEDLINE | ID: mdl-24894379

ABSTRACT

BACKGROUND: We have previously suggested a method for proteome wide analysis of variation at functional residues wherein we identified the set of all human genes with nonsynonymous single nucleotide variation (nsSNV) in the active site residue of the corresponding proteins. 34 of these proteins were shown to have a 1:1:1 enzyme:pathway:reaction relationship, making these proteins ideal candidates for laboratory validation through creation and observation of specific yeast active site knock-outs and downstream targeted metabolomics experiments. Here we present the next step in the workflow toward using yeast metabolic modeling to predict human metabolic behavior resulting from nsSNV. RESULTS: For the previously identified candidate proteins, we used the reciprocal best BLAST hits method followed by manual alignment and pathway comparison to identify 6 human proteins with yeast orthologs which were suitable for flux balance analysis (FBA). 5 of these proteins are known to be associated with diseases, including ribose 5-phosphate isomerase deficiency, myopathy with lactic acidosis and sideroblastic anaemia, anemia due to disorders of glutathione metabolism, and two porphyrias, and we suspect the sixth enzyme to have disease associations which are not yet classified or understood based on the work described herein. CONCLUSIONS: Preliminary findings using the Yeast 7.0 FBA model show lack of growth for only one enzyme, but augmentation of the Yeast 7.0 biomass function to better simulate knockout of certain genes suggested physiological relevance of variations in three additional proteins. Thus, we suggest the following four proteins for laboratory validation: delta-aminolevulinic acid dehydratase, ferrochelatase, ribose-5 phosphate isomerase and mitochondrial tyrosyl-tRNA synthetase. This study indicates that the predictive ability of this method will improve as more advanced, comprehensive models are developed. 
Moreover, these findings will be useful in the development of simple downstream biochemical or mass-spectrometric assays to corroborate these predictions and detect presence of certain known nsSNVs with deleterious outcomes. Results may also be useful in predicting as yet unknown outcomes of active site nsSNVs for enzymes that are not yet well classified or annotated.

Subject(s)

Fungal Proteins/genetics , Polymorphism, Single Nucleotide , Proteome/genetics , Saccharomyces cerevisiae/genetics , Fungal Proteins/chemistry , Fungal Proteins/metabolism , Humans , Proteome/chemistry , Proteome/metabolism , Saccharomyces cerevisiae/metabolism

16.

Proteome-wide analysis of nonsynonymous single-nucleotide variations in active sites of human proteins.

Dingerdissen, Hayley; Motwani, Mona; Karagiannis, Konstantinos; Simonyan, Vahan; Mazumder, Raja.

FEBS J ; 280(6): 1542-62, 2013 Mar.

Article in English | MEDLINE | ID: mdl-23350563

ABSTRACT

An enzyme's active site is essential to normal protein activity such that any disruptions at this site may lead to dysfunction and disease. Nonsynonymous single-nucleotide variations (nsSNVs), which alter the amino acid sequence, are one type of disruption that can alter the active site. When this occurs, it is assumed that enzyme activity will vary because of the criticality of the site to normal protein function. We integrate nsSNV data and active site annotations from curated resources to identify all active-site-impacting nsSNVs in the human genome and search for all pathways observed to be associated with this data set to assess the likely consequences. We find that there are 934 unique nsSNVs that occur at the active sites of 559 proteins. Analysis of the nsSNV data shows an over-representation of arginine and an under-representation of cysteine, phenylalanine and tyrosine when comparing the list of nsSNV-impacted active site residues with the list of all possible proteomic active site residues, implying a potential bias for or against variation of these residues at the active site. Clustering analysis shows an abundance of hydrolases and transferases. Pathway and functional analysis shows several pathways over- or under-represented in the data set, with the most significantly affected pathways involved in carbohydrate metabolism. We provide a table of 32 variation-substrate/product pairs that can be used in targeted metabolomics experiments to assay the effects of specific variations. In addition, we report the significant prevalence of aspartic acid to histidine variation in eight proteins associated with nine diseases including glycogen storage diseases, lacrimo-auriculo-dento-digital syndrome, Parkinson's disease and several cancers.

Subject(s)

Amino Acid Substitution , Catalytic Domain , Genome, Human , Polymorphism, Single Nucleotide , Proteome/analysis , Abnormalities, Multiple/genetics , Abnormalities, Multiple/pathology , Arginine/chemistry , Arginine/genetics , Aspartic Acid/chemistry , Aspartic Acid/genetics , Carbohydrate Metabolism , Cluster Analysis , Enzyme Activation , Genetic Variation , Glycogen Storage Disease/genetics , Glycogen Storage Disease/pathology , Hearing Loss/genetics , Hearing Loss/pathology , Histidine/chemistry , Histidine/genetics , Humans , Lacrimal Apparatus Diseases/genetics , Lacrimal Apparatus Diseases/pathology , Metabolomics/methods , Molecular Sequence Annotation , Phenylalanine/chemistry , Phenylalanine/genetics , Proteome/chemistry , Proteome/genetics , Proteomics/methods , Structure-Activity Relationship , Syndactyly/genetics , Syndactyly/pathology , Tooth Abnormalities/genetics , Tooth Abnormalities/pathology , Tyrosine/chemistry , Tyrosine/genetics
