RESUMO
Understanding how genetic variants influence molecular phenotypes in different cellular contexts is crucial for elucidating the molecular and cellular mechanisms behind complex traits, which in turn has spurred significant advances in research into molecular quantitative trait locus (xQTL) at the cellular level. With the rapid proliferation of data, there is a critical need for a comprehensive and accessible platform to integrate this information. To meet this need, we developed xQTLatlas (http://www.hitxqtl.org.cn/), a database that provides a multi-omics genetic regulatory landscape at cellular resolution. xQTLatlas compiles xQTL summary statistics from 151 cell types and 339 cell states across 55 human tissues. It organizes these data into 20 xQTL types, based on four distinct discovery strategies, and spans 13 molecular phenotypes. Each entry in xQTLatlas is meticulously annotated with comprehensive metadata, including the origin of the tissue, cell type, cell state and the QTL discovery strategies utilized. Additionally, xQTLatlas features multiscale data exploration tools and a suite of interactive visualizations, facilitating in-depth analysis of cell-level xQTL. xQTLatlas provides a valuable resource for deepening our understanding of the impact of functional variants on molecular phenotypes in different cellular environments, thereby facilitating extensive research efforts.
RESUMO
Single-cell RNA sequencing (scRNA-seq) detects whole transcriptome signals for large amounts of individual cells and is powerful for determining cell-to-cell differences and investigating the functional characteristics of various cell types. scRNA-seq datasets are usually sparse and highly noisy. Many steps in the scRNA-seq analysis workflow, including reasonable gene selection, cell clustering and annotation, as well as discovering the underlying biological mechanisms from such datasets, are difficult. In this study, we proposed an scRNA-seq analysis method based on the latent Dirichlet allocation (LDA) model. The LDA model estimates a series of latent variables, i.e. putative functions (PFs), from the input raw cell-gene data. Thus, we incorporated the 'cell-function-gene' three-layer framework into scRNA-seq analysis, as this framework is capable of discovering latent and complex gene expression patterns via a built-in model approach and obtaining biologically meaningful results through a data-driven functional interpretation process. We compared our method with four classic methods on seven benchmark scRNA-seq datasets. The LDA-based method performed best in the cell clustering test in terms of both accuracy and purity. By analysing three complex public datasets, we demonstrated that our method could distinguish cell types with multiple levels of functional specialization, and precisely reconstruct cell development trajectories. Moreover, the LDA-based method accurately identified the representative PFs and the representative genes for the cell types/cell stages, enabling data-driven cell cluster annotation and functional interpretation. According to the literature, most of the previously reported marker/functionally relevant genes were recognized.
Assuntos
Perfilação da Expressão Gênica , Análise de Célula Única , Perfilação da Expressão Gênica/métodos , Análise de Sequência de RNA/métodos , Análise de Célula Única/métodos , Transcriptoma , Análise por Conglomerados , AlgoritmosRESUMO
SC2disease (http://easybioai.com/sc2disease/) is a manually curated database that aims to provide a comprehensive and accurate resource of gene expression profiles in various cell types for different diseases. With the development of single-cell RNA sequencing (scRNA-seq) technologies, uncovering cellular heterogeneity of different tissues for different diseases has become feasible by profiling transcriptomes across cell types at the cellular level. In particular, comparing gene expression profiles between different cell types and identifying cell-type-specific genes in various diseases offers new possibilities to address biological and medical questions. However, systematic, hierarchical and vast databases of gene expression profiles in human diseases at the cellular level are lacking. Thus, we reviewed the literature prior to March 2020 for studies which used scRNA-seq to study diseases with human samples, and developed the SC2disease database to summarize all the data by different diseases, tissues and cell types. SC2disease documents 946 481 entries, corresponding to 341 cell types, 29 tissues and 25 diseases. Each entry in the SC2disease database contains comparisons of differentially expressed genes between different cell types, tissues and disease-related health status. Furthermore, we reanalyzed gene expression matrix by unified pipeline to improve the comparability between different studies. For each disease, we also compare cell-type-specific genes with the corresponding genes of lead single nucleotide polymorphisms (SNPs) identified in genome-wide association studies (GWAS) to implicate cell type specificity of the traits.
Assuntos
Transtorno do Espectro Autista/genética , Doenças Autoimunes/genética , Doenças Cardiovasculares/genética , Bases de Dados Factuais , Gastroenteropatias/genética , Neoplasias/genética , Doenças Neurodegenerativas/genética , Viroses/genética , Algoritmos , Transtorno do Espectro Autista/metabolismo , Transtorno do Espectro Autista/patologia , Doenças Autoimunes/metabolismo , Doenças Autoimunes/patologia , Doenças Cardiovasculares/metabolismo , Doenças Cardiovasculares/patologia , Gastroenteropatias/metabolismo , Gastroenteropatias/patologia , Perfilação da Expressão Gênica , Heterogeneidade Genética , Estudo de Associação Genômica Ampla , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Internet , Neoplasias/metabolismo , Neoplasias/patologia , Doenças Neurodegenerativas/metabolismo , Doenças Neurodegenerativas/patologia , Especificidade de Órgãos , Polimorfismo de Nucleotídeo Único , Análise de Célula Única/métodos , Software , Transcriptoma , Viroses/metabolismo , Viroses/patologiaRESUMO
BACKGROUND: In biological systems, metabolomics can not only contribute to the discovery of metabolic signatures for disease diagnosis, but is very helpful to illustrate the underlying molecular disease-causing mechanism. Therefore, identification of disease-related metabolites is of great significance for comprehensively understanding the pathogenesis of diseases and improving clinical medicine. RESULTS: In the paper, we propose a disease and literature driven metabolism prediction model (DLMPM) to identify the potential associations between metabolites and diseases based on latent factor model. We build the disease glossary with disease terms from different databases and an association matrix based on the mapping between diseases and metabolites. The similarity of diseases and metabolites is used to complete the association matrix. Finally, we predict potential associations between metabolites and diseases based on the matrix decomposition method. In total, 1,406 direct associations between diseases and metabolites are found. There are 119,206 unknown associations between diseases and metabolites predicted with a coverage rate of 80.88%. Subsequently, we extract training sets and testing sets based on data increment from the database of disease-related metabolites and assess the performance of DLMPM on 19 diseases. As a result, DLMPM is proven to be successful in predicting potential metabolic signatures for human diseases with an average AUC value of 82.33%. CONCLUSION: In this paper, a computational model is proposed for exploring metabolite-disease pairs and has good performance in predicting potential metabolites related to diseases through adequate validation. The results show that DLMPM has a better performance in prioritizing candidate diseases-related metabolites compared with the previous methods and would be helpful for researchers to reveal more information about human diseases.
Assuntos
Metabolômica , Publicações , Biologia Computacional/métodos , Bases de Dados Factuais , Humanos , Metabolômica/métodosRESUMO
MOTIVATION: Evaluating genome similarity among individuals is an essential step in data analysis. Advanced sequencing technology detects more and rarer variants for massive individual genomes, thus enabling individual-level genome similarity evaluation. However, the current methodologies, such as the principal component analysis (PCA), lack the capability to fully leverage rare variants and are also difficult to interpret in terms of population genetics. RESULTS: Here, we introduce a probabilistic topic model, latent Dirichlet allocation, to evaluate individual genome similarity. A total of 2535 individuals from the 1000 Genomes Project (KGP) were used to demonstrate our method. Various aspects of variant choice and model parameter selection were studied. We found that relatively rare (0.001
Assuntos
Genoma , Software , Genética Populacional , Humanos , Modelos Estatísticos , Análise de Componente PrincipalRESUMO
BACKGROUND: As the terminal products of cellular regulatory process, functional related metabolites have a close relationship with complex diseases, and are often associated with the same or similar diseases. Therefore, identification of disease related metabolites play a critical role in understanding comprehensively pathogenesis of disease, aiming at improving the clinical medicine. Considering that a large number of metabolic markers of diseases need to be explored, we propose a computational model to identify potential disease-related metabolites based on functional relationships and scores of referred literatures between metabolites. First, obtaining associations between metabolites and diseases from the Human Metabolome database, we calculate the similarities of metabolites based on modified recommendation strategy of collaborative filtering utilizing the similarities between diseases. Next, a disease-associated metabolite network (DMN) is built with similarities between metabolites as weight. To improve the ability of identifying disease-related metabolites, we introduce scores of text mining from the existing database of chemicals and proteins into DMN and build a new disease-associated metabolite network (FLDMN) by fusing functional associations and scores of literatures. Finally, we utilize random walking with restart (RWR) in this network to predict candidate metabolites related to diseases. RESULTS: We construct the disease-associated metabolite network and its improved network (FLDMN) with 245 diseases, 587 metabolites and 28,715 disease-metabolite associations. Subsequently, we extract training sets and testing sets from two different versions of the Human Metabolome database and assess the performance of DMN and FLDMN on 19 diseases, respectively. As a result, the average AUC (area under the receiver operating characteristic curve) of DMN is 64.35%. As a further improved network, FLDMN is proven to be successful in predicting potential metabolic signatures for 19 diseases with an average AUC value of 76.03%. CONCLUSION: In this paper, a computational model is proposed for exploring metabolite-disease pairs and has good performance in predicting potential metabolites related to diseases through adequate validation. This result suggests that integrating literature and functional associations can be an effective way to construct disease associated metabolite network for prioritizing candidate diseases-related metabolites.
Assuntos
Biologia Computacional/métodos , Metaboloma , Algoritmos , Simulação por Computador , Mineração de Dados , Bases de Dados Factuais , Humanos , Publicações/estatística & dados numéricos , Curva ROCRESUMO
BACKGROUND: Over the past decades, a large number of long non-coding RNAs (lncRNAs) have been identified. Growing evidence has indicated that the mutation and dysregulation of lncRNAs play a critical role in the development of many complex human diseases. Consequently, identifying potential disease-related lncRNAs is an effective means to improve the quality of disease diagnostics and treatment, which is the motivation of this work. Here, we propose a computational model (LncDisAP) for potential disease-related lncRNA identification based on multiple biological datasets. First, the associations between lncRNA and different data sources are collected from different databases. With these data sources as dimensions, we calculate the functional associations between lncRNAs by the recommendation strategy of collaborative filtering. Subsequently, a disease-associated lncRNA functional network is built with functional similarities between lncRNAs as the weight. Ultimately, potential disease-related lncRNAs can be identified based on ranked scores derived by random walking with restart (RWR). Then, training sets and testing sets are extracted from two different versions of a disease-lncRNA dataset to assess the performance of LncDisAP on 54 diseases. RESULTS: A lncRNA functional network is built based on the proposed computational model, and it contains 66,060 associations among 364 lncRNAs associated with 182 diseases in total. We extract 218 known disease-lncRNA pairs associated with 54 diseases to assess the network. As a result, the average AUC (area under the receiver operating characteristic curve) of LncDisAP is 78.08%. CONCLUSION: In this article, a computational model integrating multiple lncRNA-related biological datasets is proposed for identifying potential disease-related lncRNAs. The result shows that LncDisAP is successful in predicting novel disease-related lncRNA signatures. In addition, with several common cancers taken as case studies, we found some unknown lncRNAs that could be associated with these diseases through our network. These results suggest that this method can be helpful in improving the quality for disease diagnostics and treatment.
Assuntos
Algoritmos , Biologia Computacional/métodos , Simulação por Computador , Bases de Dados Genéticas , Doença/genética , Regulação da Expressão Gênica , RNA Longo não Codificante/genética , Área Sob a Curva , Redes Reguladoras de Genes , Humanos , Curva ROCRESUMO
MOTIVATION: Whole genome sequencing (WGS) of parent-offspring trios is a powerful approach for identifying disease-associated genes via detecting copy number variations (CNVs). Existing approaches, which detect CNVs for each individual in a trio independently, usually yield low-detection accuracy. Joint modeling approaches leveraging Mendelian transmission within the parent-offspring trio can be an efficient strategy to improve CNV detection accuracy. RESULTS: In this study, we developed TrioCNV, a novel approach for jointly detecting CNVs in parent-offspring trios from WGS data. Using negative binomial regression, we modeled the read depth signal while considering both GC content bias and mappability bias. Moreover, we incorporated the family relationship and used a hidden Markov model to jointly infer CNVs for three samples of a parent-offspring trio. Through application to both simulated data and a trio from 1000 Genomes Project, we showed that TrioCNV achieved superior performance than existing approaches. AVAILABILITY AND IMPLEMENTATION: The software TrioCNV implemented using a combination of Java and R is freely available from the website at https://github.com/yongzhuang/TrioCNV CONTACT: ydwang@hit.edu.cn SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Assuntos
Variações do Número de Cópias de DNA , Software , Algoritmos , Genoma , Humanos , PaisRESUMO
BACKGROUND: The Gene Ontology (GO) has been used in high-throughput omics research as a major bioinformatics resource. The hierarchical structure of GO provides users a convenient platform for biological information abstraction and hypothesis testing. Computational methods have been developed to identify functionally similar genes. However, none of the existing measurements take into account all the rich information in GO. Similarly, using these existing methods, web-based applications have been constructed to compute gene functional similarities, and to provide pure text-based outputs. Without a graphical visualization interface, it is difficult for result interpretation. RESULTS: We present InteGO2, a web tool that allows researchers to calculate the GO-based gene semantic similarities using seven widely used GO-based similarity measurements. Also, we provide an integrative measurement that synergistically integrates all the individual measurements to improve the overall performance. Using HTML5 and cytoscape.js, we provide a graphical interface in InteGO2 to visualize the resulting gene functional association networks. CONCLUSIONS: InteGO2 is an easy-to-use HTML5 based web tool. With it, researchers can measure gene or gene product functional similarity conveniently, and visualize the network of functional interactions in a graphical interface. InteGO2 can be accessed via http://mlg.hit.edu.cn:8089/ .
Assuntos
Ontologia Genética , Software , Algoritmos , Biologia Computacional , Redes Reguladoras de Genes , Genes BacterianosRESUMO
MOTIVATION: Families with inherited diseases are widely used in Mendelian/complex disease studies. Owing to the advances in high-throughput sequencing technologies, family genome sequencing becomes more and more prevalent. Visualizing family genomes can greatly facilitate human genetics studies and personalized medicine. However, due to the complex genetic relationships and high similarities among genomes of consanguineous family members, family genomes are difficult to be visualized in traditional genome visualization framework. How to visualize the family genome variants and their functions with integrated pedigree information remains a critical challenge. RESULTS: We developed the Family Genome Browser (FGB) to provide comprehensive analysis and visualization for family genomes. The FGB can visualize family genomes in both individual level and variant level effectively, through integrating genome data with pedigree information. Family genome analysis, including determination of parental origin of the variants, detection of de novo mutations, identification of potential recombination events and identical-by-decent segments, etc., can be performed flexibly. Diverse annotations for the family genome variants, such as dbSNP memberships, linkage disequilibriums, genes, variant effects, potential phenotypes, etc., are illustrated as well. Moreover, the FGB can automatically search de novo mutations and compound heterozygous variants for a selected individual, and guide investigators to find high-risk genes with flexible navigation options. These features enable users to investigate and understand family genomes intuitively and systematically. AVAILABILITY AND IMPLEMENTATION: The FGB is available at http://mlg.hit.edu.cn/FGB/.
Assuntos
Genoma Humano , Linhagem , Software , Gráficos por Computador , Variação Genética , Genômica , Sequenciamento de Nucleotídeos em Larga Escala , Humanos , Anotação de Sequência MolecularRESUMO
Advances in high-throughput sequencing technologies have brought us into the individual genome era. Projects such as the 1000 Genomes Project have led the individual genome sequencing to become more and more popular. How to visualize, analyse and annotate individual genomes with knowledge bases to support genome studies and personalized healthcare is still a big challenge. The Personal Genome Browser (PGB) is developed to provide comprehensive functional annotation and visualization for individual genomes based on the genetic-molecular-phenotypic model. Investigators can easily view individual genetic variants, such as single nucleotide variants (SNVs), INDELs and structural variations (SVs), as well as genomic features and phenotypes associated to the individual genetic variants. The PGB especially highlights potential functional variants using the PGB built-in method or SIFT/PolyPhen2 scores. Moreover, the functional risks of genes could be evaluated by scanning individual genetic variants on the whole genome, a chromosome, or a cytoband based on functional implications of the variants. Investigators can then navigate to high risk genes on the scanned individual genome. The PGB accepts Variant Call Format (VCF) and Genetic Variation Format (GVF) files as the input. The functional annotation of input individual genome variants can be visualized in real time by well-defined symbols and shapes. The PGB is available at http://www.pgbrowser.org/.
Assuntos
Variação Genética , Genoma Humano , Software , Gráficos por Computador , Genômica , Humanos , InternetRESUMO
Conventional COVID-19 testing methods have some flaws: they are expensive and time-consuming. Chest X-ray (CXR) diagnostic approaches can alleviate these flaws to some extent. However, there is no accurate and practical automatic diagnostic framework with good interpretability. The application of artificial intelligence (AI) technology to medical radiography can help to accurately detect the disease, reduce the burden on healthcare organizations, and provide good interpretability. Therefore, this study proposes a new deep neural network (CNN) based on CXR for COVID-19 diagnosis - CodeNet. This method uses contrastive learning to make full use of latent image data to enhance the model's ability to extract features and generalize across different data domains. On the evaluation dataset, the proposed method achieves an accuracy as high as 94.20%, outperforming several other existing methods used for comparison. Ablation studies validate the efficacy of the proposed method, while interpretability analysis shows that the method can effectively guide clinical professionals. This work demonstrates the superior detection performance of a CNN using contrastive learning techniques on CXR images, paving the way for computer vision and artificial intelligence technologies to leverage massive medical data for disease diagnosis.
Assuntos
COVID-19 , Aprendizado Profundo , Humanos , COVID-19/diagnóstico por imagem , Teste para COVID-19 , Inteligência Artificial , Redes Neurais de ComputaçãoRESUMO
The detection of drug-target interactions (DTIs) plays an important role in drug discovery and development, making DTI prediction urgent to be solved. Existing computational methods usually utilize drug similarity, target similarity and DTI information to make prediction, providing the convenience of fast time and low cost. However, they usually learn features for drugs and targets separately, lacking of a global consideration. In this study, we proposed a novel neighborhood-based global network model, named as NGN, to accurately predict DTIs from the global perspective. We designed a distance constraint for features of all entities (drugs and targets) in the latent space to ensure the close distance between adjacent entities, and defined a global probability matrix to compute the predicted DTI scores on our constructed neighborhood-based global network. Results showed that NGN obtained advantageous performance compared with other state-of-the-art methods, especially surpassing them by 4.2-9.1 percent on AUPR values in the biggest dataset. Furthermore, several novel high-ranked DTIs were successfully predicted with confirmations by public sources, demonstrating the effectiveness of our method.
Assuntos
Descoberta de Drogas , Descoberta de Drogas/métodos , Interações MedicamentosasRESUMO
BACKGROUND: RNA-binding proteins (RBPs) play diverse roles in eukaryotic RNA processing. Despite their pervasive functions in coding and noncoding RNA biogenesis and regulation, elucidating the sequence specificities that define protein-RNA interactions remains a major challenge. Recently, CLIP-seq (Cross-linking immunoprecipitation followed by high-throughput sequencing) has been successfully implemented to study the transcriptome-wide binding patterns of SRSF1, PTBP1, NOVA and fox2 proteins. These studies either adopted traditional methods like Multiple EM for Motif Elicitation (MEME) to discover the sequence consensus of RBP's binding sites or used Z-score statistics to search for the overrepresented nucleotides of a certain size. We argue that most of these methods are not well-suited for RNA motif identification, as they are unable to incorporate the RNA structural context of protein-RNA interactions, which may affect to binding specificity. Here, we describe a novel model-based approach--RNAMotifModeler to identify the consensus of protein-RNA binding regions by integrating sequence features and RNA secondary structures. RESULTS: As an example, we implemented RNAMotifModeler on SRSF1 (SF2/ASF) CLIP-seq data. The sequence-structural consensus we identified is a purine-rich octamer 'AGAAGAAG' in a highly single-stranded RNA context. The unpaired probabilities, the probabilities of not forming pairs, are significantly higher than negative controls and the flanking sequence surrounding the binding site, indicating that SRSF1 proteins tend to bind on single-stranded RNA. Further statistical evaluations revealed that the second and fifth bases of SRSF1octamer motif have much stronger sequence specificities, but weaker single-strandedness, while the third, fourth, sixth and seventh bases are far more likely to be single-stranded, but have more degenerate sequence specificities. Therefore, we hypothesize that nucleotide specificity and secondary structure play complementary roles during binding site recognition by SRSF1. CONCLUSION: In this study, we presented a computational model to predict the sequence consensus and optimal RNA secondary structure for protein-RNA binding regions. The successful implementation on SRSF1 CLIP-seq data demonstrates great potential to improve our understanding on the binding specificity of RNA binding proteins.
Assuntos
Proteínas Nucleares/metabolismo , Proteínas de Ligação a RNA/metabolismo , RNA/metabolismo , Algoritmos , Pareamento de Bases , Sítios de Ligação , Modelos Estatísticos , Proteínas Nucleares/genética , Conformação de Ácido Nucleico , Proteínas de Ligação a RNA/genética , Curva ROC , Fatores de Processamento de Serina-ArgininaRESUMO
'miR2Disease', a manually curated database, aims at providing a comprehensive resource of microRNA deregulation in various human diseases. The current version of miR2Disease documents 1939 curated relationships between 299 human microRNAs and 94 human diseases by reviewing more than 600 published papers. Around one-seventh of the microRNA-disease relationships represent the pathogenic roles of deregulated microRNA in human disease. Each entry in the miR2Disease contains detailed information on a microRNA-disease relationship, including a microRNA ID, the disease name, a brief description of the microRNA-disease relationship, an expression pattern of the microRNA, the detection method for microRNA expression, experimentally verified target gene(s) of the microRNA and a literature reference. miR2Disease provides a user-friendly interface for a convenient retrieval of each entry by microRNA ID, disease name, or target gene. In addition, miR2Disease offers a submission page that allows researchers to submit established microRNA-disease relationships that are not documented. Once approved by the submission review committee, the submitted records will be included in the database. miR2Disease is freely available at http://www.miR2Disease.org.
Assuntos
Bases de Dados de Ácidos Nucleicos , Doença/genética , MicroRNAs/metabolismo , Regulação da Expressão Gênica , Humanos , Interface Usuário-ComputadorRESUMO
Modeling-based anti-cancer drug sensitivity prediction has been extensively studied in recent years. While most drug sensitivity prediction models only use gene expression data, the remarkable impacts of gene mutation, methylation, and copy number variation on drug sensitivity are neglected. Drug sensitivity prediction can both help protect patients from some adverse drug reactions and improve the efficacy of treatment. Genomics data are extremely useful for drug sensitivity prediction task. This article reviews the role of drug sensitivity prediction, describes a variety of methods for predicting drug sensitivity. Moreover, the research significance of drug sensitivity prediction, as well as existing problems are well discussed.
RESUMO
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), causing an outbreak of coronavirus disease 2019 (COVID-19), has been undergoing various mutations. The analysis of the structural and energetic effects of mutations on protein-protein interactions between the receptor binding domain (RBD) of SARS-CoV-2 and angiotensin converting enzyme 2 (ACE2) or neutralizing monoclonal antibodies will be beneficial for epidemic surveillance, diagnosis, and optimization of neutralizing agents. According to the molecular dynamics simulation, a key mutation N439K in the SARS-CoV-2 RBD region created a new salt bridge with Glu329 of hACE2, which resulted in greater electrostatic complementarity, and created a weak salt bridge with Asp442 of RBD. Furthermore, the N439K-mutated RBD bound hACE2 with a higher affinity than wild-type, which may lead to more infectious. In addition, the N439K-mutated RBD was markedly resistant to the SARS-CoV-2 neutralizing antibody REGN10987, which may lead to the failure of neutralization. The results show consistent with the previous experimental conclusion and clarify the structural mechanism under affinity changes. Our methods will offer guidance on the assessment of the infection efficiency and antigenicity effect of continuing mutations in SARS-CoV-2.
RESUMO
Gastric cancer (GC) is the fourth most common malignant tumor. The mechanism underlying GC occurrence and development remains unclear. Previous studies have indicated that long non-coding RNAs (lncRNAs) are significantly associated with gastric cancer, but a systematic understanding of the role of lncRNAs in gastric cancer is lacking. In recent years, with the development of next-generation sequencing technology, tens of thousands of lncRNAs have been discovered. However, a large number of unannotated lncRNAs remain unidentified in different tissues, including potential gastric cancer-related lncRNAs. In this study, RNA sequencing (RNA-seq) data from 16 samples of eight gastric cancer patients were obtained and analyzed. A total of 1,854 previously unannotated lncRNAs were identified by ab initio assembly, and 520 differentially expressed lncRNAs were validated in the TCGA expression dataset. Methylation and copy number variation (CNV) array data from the same sample were integrated in the analysis. Changes in DNA methylation levels and CNVs may be responsible for the differential expression of 91 lncRNAs. Differentially expressed lncRNAs were enriched in coexpressed clusters of genes related to functions such as cell signaling, cell cycle, immune response, metabolic processes, angiogenesis, and regulation of retinoic acid (RA) receptors. Finally, a differentially expressed lncRNA, AC004510.3, was identified as a potential biomarker for the prediction of the overall survival of gastric cancer patients.
RESUMO
Enzymes are proteins that can efficiently catalyze specific biochemical reactions, and they are widely present in the human body. Developing an efficient method to identify human enzymes is vital to select enzymes from the vast number of human proteins and to investigate their functions. Nevertheless, only a limited amount of research has been conducted on the classification of human enzymes and nonenzymes. In this work, we developed a support vector machine- (SVM-) based predictor to classify human enzymes using the amino acid composition (AAC), the composition of k-spaced amino acid pairs (CKSAAP), and selected informative amino acid pairs through the use of a feature selection technique. A training dataset including 1117 human enzymes and 2099 nonenzymes and a test dataset including 684 human enzymes and 1270 nonenzymes were constructed to train and test the proposed model. The results of jackknife cross-validation showed that the overall accuracy was 76.46% for the training set and 76.21% for the test set, which are higher than the 72.6% achieved in previous research. Furthermore, various feature extraction methods and mainstream classifiers were compared in this task, and informative feature parameters of k-spaced amino acid pairs were selected and compared. The results suggest that our classifier can be used in human enzyme identification effectively and efficiently and can help to understand their functions and develop new drugs.