Results 1 - 20 of 1,249
1.
Nat Commun ; 10(1): 4613, 2019 10 10.
Article in English | MEDLINE | ID: mdl-31601804

ABSTRACT

Characterizing and interpreting heterogeneous mixtures at the cellular level is a critical problem in genomics. Single-cell assays offer an opportunity to resolve cellular level heterogeneity, e.g., scRNA-seq enables single-cell expression profiling, and scATAC-seq identifies active regulatory elements. Furthermore, while scHi-C can measure the chromatin contacts (i.e., loops) between active regulatory elements to target genes in single cells, bulk HiChIP can measure such contacts in a higher resolution. In this work, we introduce DC3 (De-Convolution and Coupled-Clustering) as a method for the joint analysis of various bulk and single-cell data such as HiChIP, RNA-seq and ATAC-seq from the same heterogeneous cell population. DC3 can simultaneously identify distinct subpopulations, assign single cells to the subpopulations (i.e., clustering) and de-convolve the bulk data into subpopulation-specific data. The subpopulation-specific profiles of gene expression, chromatin accessibility and enhancer-promoter contact obtained by DC3 provide a comprehensive characterization of the gene regulatory system in each subpopulation.


Subjects
Algorithms , Cluster Analysis , Gene Expression Profiling/statistics & numerical data , Genomics/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Animals , Cell Line , Chromatin , Chromatin Immunoprecipitation/statistics & numerical data , Computer Simulation , Gene Expression Profiling/methods , Gene Regulatory Networks , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Mice , Promoter Regions, Genetic , Single-Cell Analysis/methods
2.
Medicine (Baltimore) ; 98(37): e17100, 2019 Sep.
Article in English | MEDLINE | ID: mdl-31517839

ABSTRACT

BACKGROUND: Tongue squamous cell carcinoma (TSCC) is one of the most common malignant tumors of the head and neck, but its molecular mechanism is not clear. METHODS: Weighted gene co-expression network analysis (WGCNA) was combined with differential gene expression analysis and survival analysis to screen key modules and hub genes related to the progression of TSCC. Gene Set Enrichment Analysis (GSEA) was used to identify biological pathways that might be involved. RESULTS: A weighted gene co-expression network was constructed based on dataset GSE34105. The blue and turquoise modules, those most related to the progression of TSCC, were identified from the network. Gene Ontology (GO) enrichment analysis showed that the 2 key modules were significantly enriched in apoptosis- and immunity-related biological processes and pathways. Network topology analysis, differential gene expression analysis and survival analysis were used to screen 9 hub genes (NOC2L, AIMP2, ANXA2, DIABLO, H2AFZ, MANBAL, PRDX6, SNX14, TIMM23). The expression of hub genes was significantly correlated with the prognosis of TSCC. GSEA showed that the high expression group of hub genes was mainly enriched in olfactory transduction, neuroactive ligand receptor interaction, and nicotinate and nicotinamide metabolism, and the low expression group was mainly enriched in base excision repair, cysteine and methionine metabolism, and oxidative phosphorylation. CONCLUSION: Two key modules and 9 hub genes screened by WGCNA were closely related to the occurrence and prognosis of TSCC. Hub genes can be used as biomarkers and potential therapeutic targets for the accurate diagnosis and treatment of TSCC in the future.


Subjects
Gene Expression Profiling/methods , Gene Regulatory Networks , Squamous Cell Carcinoma of Head and Neck/genetics , Tongue Neoplasms/genetics , Gene Expression Profiling/statistics & numerical data , Humans , Linear Models , Prognosis , Tongue Neoplasms/classification
3.
BMC Res Notes ; 12(1): 631, 2019 Sep 24.
Article in English | MEDLINE | ID: mdl-31551084

ABSTRACT

OBJECTIVE: Basal stem rot (BSR) disease causes severe economic losses to oil palm production in South-east Asia, and little is known about the pathogenicity of the pathogen, the basidiomycetous Ganoderma boninense. The data presented here aim to identify both the house-keeping and pathogenicity genes of G. boninense using Illumina sequencing reads. DESCRIPTION: The hemibiotroph G. boninense establishes via root contact during the early stage of colonization and subsequently kills the host tissue as the disease progresses. The pathogenicity factors/genes that cause BSR remain poorly understood. In addition, the molecular expression patterns corresponding to G. boninense growth and pathogenicity have not been reported. Here, six transcriptome datasets of G. boninense from two contrasting conditions (three biological replicates per condition) are presented. The first datasets, collected from a 7-day-old axenic condition, provide insight into genes responsible for sustenance, growth and development of G. boninense, while datasets of the infecting G. boninense collected from the oil palm-G. boninense pathosystem (in planta condition) at 1 month post-inoculation offer a comprehensive avenue to understand G. boninense pathogenesis and infection, especially with regard to molecular mechanisms and pathways. Raw sequences deposited in the Sequence Read Archive (SRA) are available at the NCBI SRA portal under BioProject ID PRJNA514399.


Subjects
Axenic Culture/methods , Ganoderma/genetics , Gene Expression Profiling/methods , Gene Expression Regulation, Bacterial , /methods , Arecaceae/microbiology , Ganoderma/pathogenicity , Gene Expression Profiling/statistics & numerical data , Host-Pathogen Interactions , Plant Diseases/microbiology , Plant Roots/microbiology , Signal Transduction/genetics , Virulence/genetics
4.
Nat Commun ; 10(1): 3512, 2019 08 05.
Article in English | MEDLINE | ID: mdl-31383865

ABSTRACT

The amount of omics data in the public domain is increasing every year. Modern science has become a data-intensive discipline. Innovative solutions for data management, data sharing, and for discovering novel datasets are therefore increasingly required. In 2016, we released the first version of the Omics Discovery Index (OmicsDI) as a light-weight system to aggregate datasets across multiple public omics data resources. OmicsDI aggregates genomics, transcriptomics, proteomics, metabolomics and multiomics datasets, as well as computational models of biological processes. Here, we propose a set of novel metrics to quantify the attention and impact of biomedical datasets. A complete framework (now integrated into OmicsDI) has been implemented in order to provide and evaluate those metrics. Finally, we propose a set of recommendations for authors, journals and data resources to promote an optimal quantification of the impact of datasets.


Subjects
Access to Information , Datasets as Topic , Information Dissemination , Computational Biology/statistics & numerical data , Gene Expression Profiling/statistics & numerical data , Genomics/statistics & numerical data , Humans , Metabolomics/statistics & numerical data , Proteomics/statistics & numerical data
5.
Clin Lab ; 65(8)2019 Aug 01.
Article in English | MEDLINE | ID: mdl-31414766

ABSTRACT

BACKGROUND: Colorectal cancer (CRC) involves the abnormal expression of a set of genetic and epigenetic genes, which may be useful for predicting prognosis. The transcription factor homeobox C9 (HOXC9) is a member of the homeobox family and participates in diverse cellular metabolic processes. In the current study, the prognostic value of HOXC9 in CRC was evaluated by analyzing public data from The Cancer Genome Atlas. METHODS: The correlation between clinical features and HOXC9 expression levels was evaluated by logistic regression. Kaplan-Meier and Cox regression was performed to determine the association between HOXC9 expression and patient prognosis. Gene set enrichment analysis was conducted to explore the function of HOXC9 in CRC. RESULTS: HOXC9 showed higher expression in tumor tissue than in normal tissue. An increased level of HOXC9 in CRC was notably associated with an advanced tumor stage (OR = 1.58, for stage I/II vs. stage III/IV, p = 0.037), increased risk of distant metastasis (odds ratio = 1.84, for T1/T2 vs. T3/T4, p = 0.025), and tendency for venous invasion (OR = 2.25, p = 0.003). Kaplan-Meier analysis revealed that higher HOXC9 levels were predictive of poor overall (p = 0.0083) and progression-free survival (p = 0.0014). Multivariate COX regression model analysis proved that HOXC9 was independently associated with overall survival (hazard ratio = 2.88, 95% confidence interval: 1.14 - 7.29, p = 0.025). Gene set enrichment analysis showed that several biological function symbols were particularly enriched in the increased HOXC9 phenotype. CONCLUSIONS: HOXC9 may play a critical role in CRC progression and serve as a novel potential marker of poor prognosis in CRC.
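As a rough consistency check on the multivariate Cox result above, a Wald-type p-value can be recovered from the reported hazard ratio and its 95% confidence interval. The sketch below assumes the interval is symmetric on the log scale and was built with a normal critical value of about 1.96; neither assumption is stated in the abstract, and the function name is purely illustrative.

```python
import math


def wald_p_from_ratio_ci(ratio, lower, upper, z_crit=1.959964):
    """Two-sided Wald p-value implied by a ratio estimate and its 95% CI.

    Assumes the CI is exp(log(ratio) +/- z_crit * SE), the usual construction
    for hazard and odds ratios (an assumption, not stated in the abstract).
    """
    # Standard error of log(ratio), recovered from the CI width on the log scale.
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = math.log(ratio) / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for a standard normal.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


# Values reported in the abstract: HR = 2.88, 95% CI 1.14 - 7.29.
z, p = wald_p_from_ratio_ci(2.88, 1.14, 7.29)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Under these assumptions the implied p-value comes out at about 0.025, in line with the value reported for overall survival.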


Subjects
Biomarkers, Tumor/genetics , Colorectal Neoplasms/genetics , Gene Expression Profiling/methods , Gene Expression Regulation, Neoplastic , Homeodomain Proteins/genetics , Biomarkers, Tumor/metabolism , Colorectal Neoplasms/metabolism , Colorectal Neoplasms/pathology , Disease Progression , Female , Gene Expression Profiling/statistics & numerical data , Homeodomain Proteins/metabolism , Humans , Kaplan-Meier Estimate , Male , Middle Aged , Neoplasm Staging , Prognosis , Proportional Hazards Models
6.
PLoS Comput Biol ; 15(8): e1007264, 2019 08.
Article in English | MEDLINE | ID: mdl-31404060

ABSTRACT

Accurately predicting and testing the types of Pulmonary arterial hypertension (PAH) of each patient using cost-effective microarray-based expression data and machine learning algorithms could greatly help either identifying the most targeting medicine or adopting other therapeutic measures that could correct/restore defective genetic signaling at the early stage. Furthermore, the prediction model construction processes can also help identifying highly informative genes controlling PAH, leading to enhanced understanding of the disease etiology and molecular pathways. In this study, we used several different gene filtering methods based on microarray expression data obtained from a high-quality patient PAH dataset. Following that, we proposed a novel feature selection and refinement algorithm in conjunction with well-known machine learning methods to identify a small set of highly informative genes. Results indicated that clusters of small-expression genes could be extremely informative at predicting and differentiating different forms of PAH. Additionally, our proposed novel feature refinement algorithm could lead to significant enhancement in model performance. To summarize, integrated with state-of-the-art machine learning and novel feature refining algorithms, the most accurate models could provide near-perfect classification accuracies using very few (close to ten) low-expression genes.


Subjects
/genetics , Algorithms , Case-Control Studies , Computational Biology , Databases, Genetic , Gene Expression , Gene Expression Profiling/statistics & numerical data , Humans , Models, Genetic , Mutation , Oligonucleotide Array Sequence Analysis/statistics & numerical data , /etiology , Supervised Machine Learning
7.
BMC Genomics ; 20(1): 540, 2019 Jul 02.
Article in English | MEDLINE | ID: mdl-31266443

ABSTRACT

BACKGROUND: Transcriptomic profiles can improve our understanding of the phenotypic molecular basis of biological research, and many statistical methods have been proposed to identify differentially expressed genes (DEGs) under two or more conditions with RNA-seq data. However, statistical analyses with RNA-seq data are often limited by small sample sizes, and global variance estimates of RNA expression levels have been utilized as prior distributions for gene-specific variance estimates, making it difficult to generalize the methods to more complicated settings. We herein proposed a Bartlett-Adjusted Likelihood-based LInear mixed model approach (BALLI) to analyze more complicated RNA-seq data. The proposed method estimates the technical and biological variances with a linear mixed-effects model, with and without adjusting for small-sample bias using Bartlett's corrections. RESULTS: We conducted extensive simulations to compare the performance of BALLI with those of existing approaches (edgeR, DESeq2, and voom). Results from the simulation studies showed that BALLI correctly controlled the type-1 error rates at various nominal significance levels and produced better statistical power and precision estimates than those of other competing methods in various scenarios. Furthermore, BALLI was robust to variation of library size. It was also successfully applied to Holstein milk yield data, illustrating its practical value. CONCLUSIONS: BALLI is statistically more efficient and valid than existing methods, and we conclude that it is useful for identifying DEGs in RNA-seq analysis.


Subjects
Cattle/genetics , Computational Biology/statistics & numerical data , Gene Expression Profiling/statistics & numerical data , Linear Models , Sequence Analysis, RNA/statistics & numerical data , Animals , Computational Biology/methods , Female , Gene Expression Profiling/methods , Likelihood Functions , Milk , Models, Genetic , Random Allocation , Sample Size , Sequence Analysis, RNA/methods , Software , Transcriptome
8.
BMC Res Notes ; 12(1): 441, 2019 Jul 19.
Article in English | MEDLINE | ID: mdl-31324268

ABSTRACT

OBJECTIVE: Visualization of sequencing data is an integral part of genomic data analysis. Although there are several tools to visualize sequencing data on genomic regions, they do not offer user-friendly ways to view simultaneously different groups of replicates. To address this need, we developed a tool that allows efficient viewing of both intra- and intergroup variation of sequencing counts on a genomic region, as well as their comparison to the output of user selected analysis methods, such as peak calling. RESULTS: We present an R package RepViz for replicate-driven visualization of genomic regions. With ChIP-seq and ATAC-seq data we demonstrate its potential to aid visual inspection involved in the evaluation of normalization, outlier behavior, detected features from differential peak calling analysis, and combined analysis of multiple data types. RepViz is readily available on Bioconductor ( https://www.bioconductor.org/packages/devel/bioc/html/RepViz.html ) and on Github ( https://github.com/elolab/RepViz ).


Subjects
Computational Biology/methods , Gene Expression Profiling/methods , Genomics/methods , Sequence Analysis, DNA/methods , Software , Animals , Gene Expression Profiling/statistics & numerical data , Genomics/statistics & numerical data , Internet , Mice , Sequence Analysis, DNA/statistics & numerical data
9.
PLoS Comput Biol ; 15(7): e1007185, 2019 07.
Article in English | MEDLINE | ID: mdl-31323017

ABSTRACT

To gain insights into complex biological processes, genome-scale data (e.g., RNA-Seq) are often overlaid on biochemical networks. However, many networks do not have a one-to-one relationship between genes and network edges, due to the existence of isozymes and protein complexes. Therefore, decisions must be made on how to overlay data onto networks. For example, for metabolic networks, these decisions include (1) how to integrate gene expression levels using gene-protein-reaction rules, (2) the approach used for selection of thresholds on expression data to consider the associated gene as "active", and (3) the order in which these steps are imposed. However, the influence of these decisions has not been systematically tested. We compared 20 decision combinations using a transcriptomic dataset across 32 tissues and showed that definition of which reaction may be considered as active (i.e., reactions of the genome-scale metabolic network with a non-zero expression level after overlaying the data) is mainly influenced by thresholding approach used. To determine the most appropriate decisions, we evaluated how these decisions impact the acquisition of tissue-specific active reaction lists that recapitulate organ-system tissue groups. These results will provide guidelines to improve data analyses with biochemical networks and facilitate the construction of context-specific metabolic models.
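The three decisions listed in this abstract can be illustrated with a small sketch. It assumes a common convention that the abstract itself does not confirm: AND relations (protein complexes) take the minimum expression of their members, and OR relations (isozymes) take either the maximum or the sum. The rule encoding, function names and expression values are all hypothetical.

```python
def reaction_score(rule, expr, or_op=max):
    """Evaluate a gene-protein-reaction rule on gene-level values.

    rule is a gene id (str) or a tuple ("and" | "or", sub_rule, ...).
    AND uses min (complex limited by its scarcest subunit); OR uses or_op,
    e.g. max or sum (both are assumed conventions, not taken from the paper).
    """
    if isinstance(rule, str):
        return expr[rule]
    op, *parts = rule
    scores = [reaction_score(part, expr, or_op) for part in parts]
    if op == "and":
        return min(scores)
    if op == "or":
        return or_op(scores)
    raise ValueError(f"unknown operator: {op!r}")


def map_then_threshold(rule, expr, cutoff, or_op=max):
    """Combine gene expression through the rule first, then apply the cutoff."""
    return reaction_score(rule, expr, or_op) >= cutoff


def threshold_then_map(rule, expr, cutoff, or_op=max):
    """Binarize each gene at the cutoff first, then combine through the rule."""
    binary = {gene: 1.0 if value >= cutoff else 0.0 for gene, value in expr.items()}
    return reaction_score(rule, binary, or_op) >= 1.0


# Two hypothetical isozymes, each expressed just below the cutoff.
expr = {"g1": 0.6, "g2": 0.6}
rule = ("or", "g1", "g2")

sum_map_first = map_then_threshold(rule, expr, cutoff=1.0, or_op=sum)
sum_threshold_first = threshold_then_map(rule, expr, cutoff=1.0, or_op=sum)
max_map_first = map_then_threshold(rule, expr, cutoff=1.0, or_op=max)
print(sum_map_first, sum_threshold_first, max_map_first)
```

With a single global cutoff, min and max commute with thresholding, so the order of steps only changes the outcome when a non-monotone-compatible combination such as summing isozymes is used. This is one plausible reason the combinations need to be compared empirically.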


Subjects
Gene Expression Profiling/methods , Metabolic Networks and Pathways/genetics , Biochemical Phenomena , Computational Biology , Data Interpretation, Statistical , Decision Support Techniques , Gene Expression Profiling/statistics & numerical data , Gene Regulatory Networks , Humans , Systems Biology
10.
BMC Med Genet ; 20(1): 104, 2019 06 11.
Article in English | MEDLINE | ID: mdl-31185929

ABSTRACT

BACKGROUND: A multidirectional relationship has been demonstrated between myocardial infarction (MI) and depression. However, the causal genetic factors and molecular mechanisms underlying this interaction remain unclear. The main purpose of this study was to identify potential candidate genes for the interaction between the two diseases. METHODS: Using a bioinformatics approach and existing gene expression data in the biomedical discovery support system (BITOLA), we defined the starting concept X as "Myocardial Infarction" and end concept Z as "Major Depressive Disorder" or "Depressive disorder". All intermediate concepts relevant to the "Gene or Gene Product" for MI and depression were searched. Gene expression data and tissue-specific expression of potential candidate genes were evaluated using the Human eFP (electronic Fluorescent Pictograph) Browser, and intermediate concepts were filtered by manual inspection. RESULTS: Our analysis identified 128 genes common to both the "MI" and "depression" text mining concepts. Twenty-three of the 128 genes were selected as intermediates for this study, 9 of which passed the manual filtering step. Among the 9 genes, LCAT, CD4, SERPINA1, IL6, and PPBP failed to pass the follow-up filter in the Human eFP Browser, due to their low levels in the heart tissue. Finally, four genes (GNB3, CNR1, MTHFR, and NCAM1) remained. CONCLUSIONS: GNB3, CNR1, MTHFR, and NCAM1 are putative new candidate genes that may influence the interactions between MI and depression, and may represent potential targets for therapeutic intervention.


Subjects
Computational Biology/methods , Data Mining/methods , Depressive Disorder, Major/genetics , Genetic Predisposition to Disease/genetics , Myocardial Infarction/genetics , Gene Expression Profiling/methods , Gene Expression Profiling/statistics & numerical data , Humans , Reproducibility of Results
11.
Stat Appl Genet Mol Biol ; 18(3)2019 05 01.
Article in English | MEDLINE | ID: mdl-31042646

ABSTRACT

Gene Regulatory Networks (GRNs) are known as the most adequate instrument to provide a clear insight and understanding of the cellular systems. One of the most successful techniques to reconstruct GRNs using gene expression data is Bayesian networks (BN) which have proven to be an ideal approach for heterogeneous data integration in the learning process. Nevertheless, the incorporation of prior knowledge has been achieved by using prior beliefs or by using networks as a starting point in the search process. In this work, the utilization of different kinds of structural restrictions within algorithms for learning BNs from gene expression data is considered. These restrictions will codify prior knowledge, in such a way that a BN should satisfy them. Therefore, one aim of this work is to make a detailed review on the use of prior knowledge and gene expression data to inferring GRNs from BNs, but the major purpose in this paper is to research whether the structural learning algorithms for BNs from expression data can achieve better outcomes exploiting this prior knowledge with the use of structural restrictions. In the experimental study, it is shown that this new way to incorporate prior knowledge leads us to achieve better reverse-engineered networks.


Subjects
Computational Biology/statistics & numerical data , Gene Expression Profiling/statistics & numerical data , Gene Regulatory Networks/genetics , Algorithms , Bayes Theorem , Humans , Models, Genetic
12.
PLoS Comput Biol ; 15(4): e1006899, 2019 04.
Article in English | MEDLINE | ID: mdl-30939133

ABSTRACT

Small sample sizes combined with high person-to-person variability can make it difficult to detect significant gene expression changes from transcriptional profiling studies. Subtle, but coordinated, gene expression changes may be detected using gene set analysis approaches. Meta-analysis is another approach to increase the power to detect biologically relevant changes by integrating information from multiple studies. Here, we present a framework that combines both approaches and allows for meta-analysis of gene sets. QuSAGE meta-analysis extends our previously published QuSAGE framework, which offers several advantages for gene set analysis, including fully accounting for gene-gene correlations and quantifying gene set activity as a full probability density function. Application of QuSAGE meta-analysis to influenza vaccination response shows it can detect significant activity that is not apparent in individual studies.


Subjects
Gene Expression Profiling/statistics & numerical data , Gene Expression , Software , Computational Biology , Humans , Influenza, Human/genetics , Influenza, Human/immunology , Influenza, Human/prevention & control , Probability , Vaccination
13.
Pac Symp Biocomput ; 24: 350-361, 2019.
Article in English | MEDLINE | ID: mdl-30963074

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) techniques have been very powerful in analyzing heterogeneous cell population and identifying cell types. Visualizing scRNA-seq data can help researchers effectively extract meaningful biological information and make new discoveries. While commonly used scRNA-seq visualization methods, such as t-SNE, are useful in detecting cell clusters, they often tear apart the intrinsic continuous structure in gene expression profiles. Topological Data Analysis (TDA) approaches like Mapper capture the shape of data by representing data as topological networks. TDA approaches are robust to noise and different platforms, while preserving the locality and data continuity. Moreover, instead of analyzing the whole dataset, Mapper allows researchers to explore biological meanings of specific pathways and genes by using different filter functions. In this paper, we applied Mapper to visualize scRNA-seq data. Our method can not only capture the clustering structure of cells, but also preserve the continuous gene expression topologies of cells. We demonstrated that by combining with gene co-expression network analysis, our method can reveal differential expression patterns of gene co-expression modules along the Mapper visualization.


Subjects
RNA/genetics , Sequence Analysis, RNA/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Algorithms , Computational Biology , Data Interpretation, Statistical , Databases, Genetic/statistics & numerical data , Gene Expression Profiling/statistics & numerical data , Gene Regulatory Networks , Humans , Melanoma/genetics , Pancreas/cytology , Pancreas/metabolism
14.
Pac Symp Biocomput ; 24: 362-373, 2019.
Article in English | MEDLINE | ID: mdl-30963075

ABSTRACT

Single-cell RNA sequencing (scRNA-seq) is a powerful tool to profile the transcriptomes of a large number of individual cells at a high resolution. These data usually contain measurements of gene expression for many genes in thousands or tens of thousands of cells, though some datasets now reach the million-cell mark. Projecting high-dimensional scRNA-seq data into a low dimensional space aids downstream analysis and data visualization. Many recent preprints accomplish this using variational autoencoders (VAE), generative models that learn the underlying structure of data by compressing it into a constrained, low dimensional space. The low dimensional spaces generated by VAEs have revealed complex patterns and novel biological signals from large-scale gene expression data and drug response predictions. Here, we evaluate a simple VAE approach for gene expression data, Tybalt, by training and measuring its performance on sets of simulated scRNA-seq data. We find a number of counter-intuitive performance features: e.g., deeper neural networks can struggle when datasets contain more observations under some parameter configurations. We show that these methods are highly sensitive to parameter tuning: when tuned, the Tybalt model, which was not optimized for scRNA-seq data, outperforms other popular dimension reduction approaches - PCA, ZIFA, UMAP and t-SNE. On the other hand, without tuning, performance can also be remarkably poor on the same data. Our results should discourage authors and reviewers from relying on self-reported performance comparisons to evaluate the relative value of contributions in this area at this time. Instead, we recommend that attempts to compare or benchmark autoencoder methods for scRNA-seq data be performed by disinterested third parties or by methods developers only on unseen benchmark data that are provided to all participants simultaneously, because the potential for performance differences due to unequal parameter tuning is so high.


Subjects
Gene Expression Profiling/statistics & numerical data , Sequence Analysis, RNA/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Cluster Analysis , Computational Biology , Computer Simulation , Humans , Transcriptome
15.
Pac Symp Biocomput ; 24: 374-385, 2019.
Article in English | MEDLINE | ID: mdl-30963076

ABSTRACT

When analyzing biological data, it can be helpful to consider gene sets, or predefined groups of biologically related genes. Methods exist for identifying gene sets that are differential between conditions, but large public datasets from consortium projects and single-cell RNA-Sequencing have opened the door for gene set analysis using more sophisticated machine learning techniques, such as autoencoders and variational autoencoders. We present shallow sparsely-connected autoencoders (SSCAs) and variational autoencoders (SSCVAs) as tools for projecting gene-level data onto gene sets. We tested these approaches on single-cell RNA-Sequencing data from blood cells and on RNA-Sequencing data from breast cancer patients. Both SSCA and SSCVA can recover known biological features from these datasets and the SSCVA method often outperforms SSCA (and six existing gene set scoring algorithms) on classification and prediction tasks.


Subjects
Gene Expression Profiling/statistics & numerical data , Gene Regulatory Networks , Sequence Analysis, RNA/statistics & numerical data , Blood Cells/metabolism , Breast Neoplasms/genetics , Breast Neoplasms/mortality , Computational Biology , Female , Humans , Single-Cell Analysis/statistics & numerical data , Supervised Machine Learning , Survival Analysis
16.
PLoS Comput Biol ; 15(4): e1006937, 2019 04.
Article in English | MEDLINE | ID: mdl-30973878

ABSTRACT

Gestational alcohol exposure causes fetal alcohol spectrum disorder (FASD) and is a prominent cause of neurodevelopmental disability. Whole transcriptome sequencing (RNA-Seq) offers insights into mechanisms underlying FASD, but gene-level analysis provides limited information regarding complex transcriptional processes such as alternative splicing and non-coding RNAs. Moreover, traditional analytical approaches that use multiple hypothesis testing with a false discovery rate adjustment prioritize genes based on an adjusted p-value, which is not always biologically relevant. We addressed these limitations with a novel approach, implementing an unsupervised machine learning model that we applied to an exon-level analysis to reduce data complexity to the most likely functionally relevant exons, without loss of novel information. This was performed on an RNA-Seq paired-end dataset derived from alcohol-exposed neural fold-stage chick crania, wherein alcohol causes facial deficits recapitulating those of FASD. A principal component analysis along with k-means clustering was utilized to extract exons that deviated from baseline expression. This identified 6857 differentially expressed exons representing 1251 geneIDs; 391 of these genes were identified in a prior gene-level analysis of this dataset. It also identified exons encoding 23 microRNAs (miRNAs) having significantly differential expression profiles in response to alcohol. We developed an RDAVID pipeline to identify KEGG pathways represented by these exons, and separately identified predicted KEGG pathways targeted by these miRNAs. Several of these (ribosome biogenesis, oxidative phosphorylation) were identified in our prior gene-level analysis. Other pathways are crucial to facial morphogenesis and represent both novel (focal adhesion, FoxO signaling, insulin signaling) and known (Wnt signaling) alcohol targets. Importantly, there was substantial overlap between the exomes themselves and the predicted miRNA targets, suggesting these miRNAs contribute to the gene-level expression changes. Our novel application of unsupervised machine learning in conjunction with statistical analyses facilitated the discovery of signaling pathways and miRNAs that inform mechanisms underlying FASD.
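The PCA plus k-means step described in this abstract can be sketched on synthetic data. Everything below is an assumption made for illustration: the matrix layout (rows are exons, columns are three control and three exposed samples), the use of two components and two clusters, and the rule that the smaller cluster holds the deviating exons. The abstract does not give these details, and the helper names are hypothetical.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """Project the rows of X onto the leading principal components (via SVD)."""
    Xc = X - X.mean(axis=0)  # center each column (sample)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Distance from every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        updated = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(updated, centers):
            break
        centers = updated
    return labels


# Synthetic data: 200 exons, the first 10 shifted upward in the exposed samples.
rng = np.random.default_rng(0)
n_exons, n_deviating = 200, 10
X = rng.normal(0.0, 0.1, size=(n_exons, 6))
X[:n_deviating, 3:] += 3.0

scores = pca_scores(X, n_components=2)
labels = kmeans(scores, k=2)
minority = np.argmin(np.bincount(labels, minlength=2))
flagged = np.flatnonzero(labels == minority)  # exons deviating from baseline
print(flagged)
```

On real data the separation is rarely this clean, so the number of components, the number of clusters and the rule for picking the "deviating" cluster would all need to be justified against the dataset.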


Subjects
Exons/genetics , Fetal Alcohol Spectrum Disorders/genetics , MicroRNAs/genetics , Unsupervised Machine Learning , Animals , Big Data , Chick Embryo , Cluster Analysis , Computational Biology , Databases, Nucleic Acid/statistics & numerical data , Disease Models, Animal , Ethanol/toxicity , Female , Gene Expression Profiling/statistics & numerical data , Humans , Pregnancy , Principal Component Analysis , Unsupervised Machine Learning/statistics & numerical data
17.
BMC Genomics ; 20(1): 167, 2019 Mar 04.
Article in English | MEDLINE | ID: mdl-30832569

ABSTRACT

BACKGROUND: Deep learning has achieved tremendous success in numerous artificial intelligence applications and is, unsurprisingly, penetrating various biomedical domains. High-throughput omics data in the form of molecular profile matrices, such as transcriptomes and metabolomes, have long existed as a valuable resource for facilitating diagnosis of patient statuses/stages. It is timely and imperative to compare deep learning neural networks against classical machine learning methods on matrix-formed omics data in terms of classification accuracy and robustness. RESULTS: Using 37 high-throughput omics datasets, covering transcriptomes and metabolomes, we evaluated the classification power of deep learning compared to traditional machine learning methods. Representative deep learning methods, Multi-Layer Perceptrons (MLP) and Convolutional Neural Networks (CNN), were deployed and explored to find the optimal architectures for the best classification performance. Together with five classical supervised classification methods (Linear Discriminant Analysis, Multinomial Logistic Regression, Naïve Bayes, Random Forest, Support Vector Machine), MLP and CNN were comparatively tested on the 37 datasets to predict disease stages or to discriminate diseased samples from normal samples. MLPs achieved the highest overall accuracy among all methods tested. More thorough analyses revealed that single hidden layer MLPs with ample hidden units outperformed deeper MLPs. Furthermore, MLP was one of the most robust methods against imbalanced class composition and inaccurate class labels. CONCLUSION: Our results indicate that shallow MLPs (of one or two hidden layers) with ample hidden neurons are sufficient to achieve superior and robust classification performance in exploiting numerical matrix-formed omics data for diagnostic purposes. Specific observations regarding optimal network width, class imbalance tolerance, and inaccurate labeling tolerance will inform future improvement of neural network applications on functional genomics data.


Subjects
Deep Learning/trends , Gene Expression Profiling/statistics & numerical data , Machine Learning/trends , Algorithms , Artificial Intelligence/statistics & numerical data , Bayes Theorem , Deep Learning/statistics & numerical data , Gene Expression Profiling/methods , Humans , Logistic Models , Machine Learning/statistics & numerical data , Metabolome/genetics , Support Vector Machine/statistics & numerical data , Support Vector Machine/trends
18.
Pac Symp Biocomput ; 24: 160-171, 2019.
Article in English | MEDLINE | ID: mdl-30864319

ABSTRACT

BACKGROUND: MicroRNAs (miRNAs) are small non-coding RNAs that regulate gene expression through post-transcriptional silencing. Differential expression observed in miRNAs, combined with advancements in deep learning (DL), has the potential to improve cancer classification by modelling non-linear miRNA-phenotype associations. We propose a novel miRNA-based deep cancer classifier (DCC) incorporating genomic and hierarchical tissue annotation, capable of accurately predicting the presence of cancer in a wide range of human tissues. METHODS: miRNA expression profiles were analyzed for 1746 neoplastic and 3871 normal samples, across 26 types of cancer involving six organ sub-structures and 68 cell types. miRNAs were ranked and filtered using a specificity score representing their information content in relation to neoplasticity, incorporating 3 levels of hierarchical biological annotation. A DL architecture composed of stacked autoencoders (AE) and a multi-layer perceptron (MLP) was trained to predict neoplasticity using 497 abundant and informative miRNAs. Additional DCCs were trained using expression of miRNA cistrons and sequence families, and combined as a diagnostic ensemble. Important miRNAs were identified using backpropagation, and analyzed in Cytoscape using iCTNet and BiNGO. RESULTS: Nested four-fold cross-validation was used to assess the performance of the DL model. The model achieved an accuracy, AUC/ROC, sensitivity, and specificity of 94.73%, 98.6%, 95.1%, and 94.3%, respectively. CONCLUSION: Deep autoencoder networks are a powerful tool for modelling complex miRNA-phenotype associations in cancer. The proposed DCC improves classification accuracy by learning from the biological context of both samples and miRNAs, using anatomical and genomic annotation. Analyzing the deep structure of DCCs with backpropagation can also facilitate biological discovery, by performing gene ontology searches on the most highly significant features.
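
The nested four-fold cross-validation mentioned in the abstract above can be illustrated with a minimal index-splitting sketch. The abstract does not say how the folds were built, so the contiguous, unshuffled folds, the choice of four inner folds, and the helper names `k_fold` and `nested_k_fold` are all assumptions made for illustration; this is not the authors' pipeline.

```python
from typing import Iterator, List, Tuple


def k_fold(indices: List[int], k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, held_out) index lists for k contiguous folds."""
    n = len(indices)
    # Spread the remainder over the first n % k folds so sizes differ by at most 1.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


def nested_k_fold(
    n_samples: int, outer_k: int = 4, inner_k: int = 4
) -> Iterator[Tuple[List[int], List[int], List[int]]]:
    """Yield (inner_train, validation, test) for every outer/inner fold pair.

    The outer loop holds out a test fold for the final performance estimate;
    the inner loop splits the remaining samples for model selection.
    Both fold counts are assumed values, not taken from the paper.
    """
    for outer_train, test in k_fold(list(range(n_samples)), outer_k):
        for inner_train, validation in k_fold(outer_train, inner_k):
            yield inner_train, validation, test


# Hypothetical toy size; the study itself used 1746 + 3871 samples.
splits = list(nested_k_fold(16))
```

In this layout the test fold never overlaps the samples used for tuning, which is what keeps the outer-loop performance estimate from being optimistically biased.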


Subjects
Deep Learning , MicroRNAs/genetics , Neoplasms/genetics , Computational Biology , Databases, Nucleic Acid/statistics & numerical data , Diagnosis, Computer-Assisted/methods , Female , Gene Expression Profiling/statistics & numerical data , Gene Expression Regulation, Neoplastic , Gene Ontology , High-Throughput Nucleotide Sequencing , Humans , Male , MicroRNAs/classification , Molecular Sequence Annotation , Neoplasms/classification , Neoplasms/diagnosis , Sequence Analysis, RNA
19.
Pac Symp Biocomput ; 24: 208-219, 2019.
Article in English | MEDLINE | ID: mdl-30864323

ABSTRACT

Benchmark challenges, such as the Critical Assessment of Structure Prediction (CASP) and Dialogue for Reverse Engineering Assessments and Methods (DREAM), have been instrumental in driving the development of bioinformatics methods. Typically, challenges are posted, and then competitors perform a prediction based upon blinded test data. Challengers then submit their answers to a central server where they are scored. Recent efforts to automate these challenges have been enabled by systems in which challengers submit Docker containers, units of software that package up code and all of its dependencies, to be run on the cloud. Despite their incredible value for providing an unbiased test-bed for the bioinformatics community, there remain opportunities to further enhance the potential impact of benchmark challenges. Specifically, current approaches only evaluate end-to-end performance; it is nearly impossible to directly compare methodologies or parameters. Furthermore, the scientific community cannot easily reuse challengers' approaches, due to lack of specifics, ambiguity in tools and parameters, as well as problems in sharing and maintenance. Lastly, the intuition behind why particular steps are used is not captured, as the proposed workflows are not explicitly defined, making it cumbersome to understand the flow and utilization of data. Here we introduce an approach to overcome these limitations based upon the WINGS semantic workflow system. Specifically, WINGS enables researchers to submit complete semantic workflows as challenge submissions. By submitting entries as workflows, it then becomes possible to compare not just the results and performance of a challenger, but also the methodology employed. This is particularly important when dozens of challenge entries may use nearly identical tools, but with only subtle changes in parameters (and radical differences in results). WINGS uses a component-driven workflow design and offers intelligent parameter and data selection by reasoning about data characteristics. This proves to be especially critical in bioinformatics workflows, where using default or incorrect parameter values is prone to drastically altering results. Different challenge entries may be readily compared through the use of abstract workflows, which also facilitate reuse. WINGS is housed on a cloud-based setup, which stores data, dependencies and workflows for easy sharing and utility. It also has the ability to scale workflow executions using distributed computing through the Pegasus workflow execution system. We demonstrate the application of this architecture to the DREAM proteogenomic challenge.


Subjects
Benchmarking/methods , Semantics , Workflow , Algorithms , Computational Biology/methods , Gene Expression Profiling/statistics & numerical data , Genomics , Metadata , Proteins/genetics , Proteins/metabolism , Reproducibility of Results , Sequence Analysis, RNA/statistics & numerical data
20.
Pac Symp Biocomput ; 24: 296-307, 2019.
Article in English | MEDLINE | ID: mdl-30864331

ABSTRACT

Transcriptome-wide association studies (TWAS) have recently gained great attention due to their ability to prioritize complex trait-associated genes and promote potential therapeutics development for complex human diseases. TWAS integrates genotypic data with expression quantitative trait loci (eQTLs) to predict genetically regulated gene expression components and associates predictions with a trait of interest. As such, TWAS can prioritize genes whose differential expression contributes to the trait of interest and provide a mechanistic explanation of complex trait(s). Tissue-specific eQTL information grants TWAS the ability to perform association analysis on tissues whose gene expression profiles are otherwise hard to obtain, such as liver and heart. However, as eQTLs are tissue context-dependent, whether and how the tissue-specificity of eQTLs influences TWAS gene prioritization has not been fully investigated. In this study, we addressed this question by adopting two distinct TWAS methods, PrediXcan and UTMOST, which assume single tissue and integrative tissue effects of eQTLs, respectively. Thirty-eight baseline laboratory traits in 4,360 antiretroviral treatment-naïve individuals from the AIDS Clinical Trials Group (ACTG) studies comprised the input dataset for TWAS. We performed TWAS in a tissue-specific manner and obtained a total of 430 significant gene-trait associations (q-value < 0.05) across multiple tissues. Single tissue-based analysis by PrediXcan contributed 116 of the 430 associations, including 64 unique gene-trait pairs in 28 tissues. Integrative tissue-based analysis by UTMOST found the other 314 significant associations, which include 50 unique gene-trait pairs across all 44 tissues. Both analyses were able to replicate some associations identified in past variant-based genome-wide association studies (GWAS), such as high-density lipoprotein (HDL) and CETP (PrediXcan, q-value = 3.2e-16). Both analyses also identified novel associations. Moreover, single tissue-based and integrative tissue-based analysis shared 11 of 103 unique gene-trait pairs, for example, PSRC1-low-density lipoprotein (PrediXcan's lowest q-value = 8.5e-06; UTMOST's lowest q-value = 1.8e-05). This study suggests that single tissue-based analysis may have performed better at discovering gene-trait associations when combining results from all tissues. Integrative tissue-based analysis was better at prioritizing genes in multiple tissues and in trait-related tissue. Additional exploration is needed to confirm this conclusion. Finally, although single tissue-based and integrative tissue-based analysis shared significant novel discoveries, tissue context-dependency of eQTLs impacted TWAS gene prioritization. This study provides preliminary data to support continued work on tissue context-dependency of eQTL studies and TWAS.
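
The counts reported in the abstract above can be cross-checked with simple inclusion-exclusion arithmetic. The sketch below uses only the figures quoted in the abstract; the helper name `union_size` is hypothetical and introduced purely for illustration.

```python
def union_size(size_a: int, size_b: int, shared: int) -> int:
    """Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|."""
    return size_a + size_b - shared


# Significant gene-trait associations reported for each method.
predixcan_associations = 116
utmost_associations = 314
total_associations = predixcan_associations + utmost_associations

# Unique gene-trait pairs per method, and the pairs found by both methods.
predixcan_pairs = 64
utmost_pairs = 50
shared_pairs = 11
unique_pairs = union_size(predixcan_pairs, utmost_pairs, shared_pairs)

print(total_associations, unique_pairs)
```

The two results agree with the totals stated in the abstract: 430 significant associations and 103 unique gene-trait pairs.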


Subjects
Gene Expression Profiling/statistics & numerical data , Genome-Wide Association Study/statistics & numerical data , Organ Specificity/genetics , Quantitative Trait Loci , Transcriptome , Anti-HIV Agents/therapeutic use , Computational Biology , Gene Expression Profiling/methods , Genetic Predisposition to Disease , Genome-Wide Association Study/methods , Genotype , HIV Infections/drug therapy , HIV Infections/genetics , Humans , Pharmacogenomic Variants , Polymorphism, Single Nucleotide