Your browser doesn't support javascript.
loading
Mostrar: 20 | 50 | 100
Resultados 1 - 6 de 6
Filtrar
1.
Artigo em Inglês | MEDLINE | ID: mdl-32750845

RESUMO

The study of genetic variants (GVs) can help find correlating population groups and to identify cohorts that are predisposed to common diseases and explain differences in disease susceptibility and how patients react to drugs. Machine learning techniques are increasingly being applied to identify interacting GVs to understand their complex phenotypic traits. Since the performance of a learning algorithm not only depends on the size and nature of the data but also on the quality of underlying representation, deep neural networks (DNNs) can learn non-linear mappings that allow transforming GVs data into more clustering and classification friendly representations than manual feature selection. In this paper, we propose convolutional embedded networks (CEN) in which we combine two DNN architectures called convolutional embedded clustering (CEC) and convolutional autoencoder (CAE) classifier for clustering individuals and predicting geographic ethnicity based on GVs, respectively. We employed CAE-based representation learning to 95 million GVs from the '1000 genomes' (covering 2,504 individuals from 26 ethnic origins) and 'Simons genome diversity' (covering 279 individuals from 130 ethnic origins) projects. Quantitative and qualitative analyses with a focus on accuracy and scalability show that our approach outperforms state-of-the-art approaches such as VariantSpark and ADMIXTURE. In particular, CEC can cluster targeted population groups in 22 hours with an adjusted rand index (ARI) of 0.915, the normalized mutual information (NMI) of 0.92, and the clustering accuracy (ACC) of 89 percent. Contrarily, the CAE classifier can predict the geographic ethnicity of unknown samples with an F1 and Mathews correlation coefficient (MCC) score of 0.9004 and 0.8245, respectively. Further, to provide interpretations of the predictions, we identify significant biomarkers using gradient boosted trees (GBT) and SHapley Additive exPlanations (SHAP). Overall, our approach is transparent and faster than the baseline methods, and scalable for 5 to 100 percent of the full human genome.


Assuntos
Aprendizado de Máquina , Redes Neurais de Computação , Algoritmos , Análise por Conglomerados , Humanos
2.
Brief Bioinform ; 22(1): 393-415, 2021 01 18.
Artigo em Inglês | MEDLINE | ID: mdl-32008043

RESUMO

Clustering is central to many data-driven bioinformatics research and serves a powerful computational method. In particular, clustering helps at analyzing unstructured and high-dimensional data in the form of sequences, expressions, texts and images. Further, clustering is used to gain insights into biological processes in the genomics level, e.g. clustering of gene expressions provides insights on the natural structure inherent in the data, understanding gene functions, cellular processes, subtypes of cells and understanding gene regulations. Subsequently, clustering approaches, including hierarchical, centroid-based, distribution-based, density-based and self-organizing maps, have long been studied and used in classical machine learning settings. In contrast, deep learning (DL)-based representation and feature learning for clustering have not been reviewed and employed extensively. Since the quality of clustering is not only dependent on the distribution of data points but also on the learned representation, deep neural networks can be effective means to transform mappings from a high-dimensional data space into a lower-dimensional feature space, leading to improved clustering results. In this paper, we review state-of-the-art DL-based approaches for cluster analysis that are based on representation learning, which we hope to be useful, particularly for bioinformatics research. Further, we explore in detail the training procedures of DL-based clustering algorithms, point out different clustering quality metrics and evaluate several DL-based approaches on three bioinformatics use cases, including bioimaging, cancer genomics and biomedical text mining. We believe this review and the evaluation results will provide valuable insights and serve a starting point for researchers wanting to apply DL-based unsupervised methods to solve emerging bioinformatics research problems.


Assuntos
Biologia Computacional/métodos , Aprendizado Profundo , Análise por Conglomerados
3.
Brief Bioinform ; 19(5): 1035-1050, 2018 09 28.
Artigo em Inglês | MEDLINE | ID: mdl-28419324

RESUMO

Data workflow systems (DWFSs) enable bioinformatics researchers to combine components for data access and data analytics, and to share the final data analytics approach with their collaborators. Increasingly, such systems have to cope with large-scale data, such as full genomes (about 200 GB each), public fact repositories (about 100 TB of data) and 3D imaging data at even larger scales. As moving the data becomes cumbersome, the DWFS needs to embed its processes into a cloud infrastructure, where the data are already hosted. As the standardized public data play an increasingly important role, the DWFS needs to comply with Semantic Web technologies. This advancement to DWFS would reduce overhead costs and accelerate the progress in bioinformatics research based on large-scale data and public resources, as researchers would require less specialized IT knowledge for the implementation. Furthermore, the high data growth rates in bioinformatics research drive the demand for parallel and distributed computing, which then imposes a need for scalability and high-throughput capabilities onto the DWFS. As a result, requirements for data sharing and access to public knowledge bases suggest that compliance of the DWFS with Semantic Web standards is necessary. In this article, we will analyze the existing DWFS with regard to their capabilities toward public open data use as well as large-scale computational and human interface requirements. We untangle the parameters for selecting a preferable solution for bioinformatics research with particular consideration to using cloud services and Semantic Web technologies. Our analysis leads to research guidelines and recommendations toward the development of future DWFS for the bioinformatics research community.


Assuntos
Computação em Nuvem , Biologia Computacional/métodos , Fluxo de Trabalho , Big Data , Interpretação Estatística de Dados , Sistemas de Gerenciamento de Base de Dados , Descoberta de Drogas/estatística & dados numéricos , Genômica/estatística & dados numéricos , Humanos , Disseminação de Informação , Bases de Conhecimento , Web Semântica/estatística & dados numéricos , Interface Usuário-Computador
4.
J Biomed Semantics ; 8(1): 40, 2017 Sep 19.
Artigo em Inglês | MEDLINE | ID: mdl-28927463

RESUMO

BACKGROUND: Next Generation Sequencing (NGS) is playing a key role in therapeutic decision making for the cancer prognosis and treatment. The NGS technologies are producing a massive amount of sequencing datasets. Often, these datasets are published from the isolated and different sequencing facilities. Consequently, the process of sharing and aggregating multisite sequencing datasets are thwarted by issues such as the need to discover relevant data from different sources, built scalable repositories, the automation of data linkage, the volume of the data, efficient querying mechanism, and information rich intuitive visualisation. RESULTS: We present an approach to link and query different sequencing datasets (TCGA, COSMIC, REACTOME, KEGG and GO) to indicate risks for four cancer types - Ovarian Serous Cystadenocarcinoma (OV), Uterine Corpus Endometrial Carcinoma (UCEC), Uterine Carcinosarcoma (UCS), Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (CESC) - covering the 16 healthy tissue-specific genes from Illumina Human Body Map 2.0. The differentially expressed genes from Illumina Human Body Map 2.0 are analysed together with the gene expressions reported in COSMIC and TCGA repositories leading to the discover of potential biomarkers for a tissue-specific cancer. CONCLUSION: We analyse the tissue expression of genes, copy number variation (CNV), somatic mutation, and promoter methylation to identify associated pathways and find novel biomarkers. We discovered twenty (20) mutated genes and three (3) potential pathways causing promoter changes in different gynaecological cancer types. We propose a data-interlinked platform called BIOOPENER that glues together heterogeneous cancer and biomedical repositories. The key approach is to find correspondences (or data links) among genetic, cellular and molecular features across isolated cancer datasets giving insight into cancer progression from normal to diseased tissues. The proposed BIOOPENER platform enriches mutations by filling in missing links from TCGA, COSMIC, REACTOME, KEGG and GO datasets and provides an interlinking mechanism to understand cancer progression from normal to diseased tissues with pathway components, which in turn helped to map mutations, associated phenotypes, pathways, and mechanism.


Assuntos
Biomarcadores Tumorais/metabolismo , Doenças dos Genitais Femininos/metabolismo , Informática Médica/métodos , Medicina de Precisão/métodos , Mineração de Dados , Bases de Dados Factuais , Feminino , Humanos
5.
BMC Bioinformatics ; 13 Suppl 4: S7, 2012 Mar 28.
Artigo em Inglês | MEDLINE | ID: mdl-22536974

RESUMO

BACKGROUND: With the advent of high-throughput technologies, a great wealth of variation data is being produced. Such information may constitute the basis for correlation analyses between genotypes and phenotypes and, in the future, for personalized medicine. Several databases on gene variation exist, but this kind of information is still scarce in the Semantic Web framework. In this paper, we discuss issues related to the integration of mutation data in the Linked Open Data infrastructure, part of the Semantic Web framework. We present the development of a mapping from the IARC TP53 Mutation database to RDF and the implementation of servers publishing this data. METHODS: A version of the IARC TP53 Mutation database implemented in a relational database was used as first test set. Automatic mappings to RDF were first created by using D2RQ and later manually refined by introducing concepts and properties from domain vocabularies and ontologies, as well as links to Linked Open Data implementations of various systems of biomedical interest. Since D2RQ query performances are lower than those that can be achieved by using an RDF archive, generated data was also loaded into a dedicated system based on tools from the Jena software suite. RESULTS: We have implemented a D2RQ Server for TP53 mutation data, providing data on a subset of the IARC database, including gene variations, somatic mutations, and bibliographic references. The server allows to browse the RDF graph by using links both between classes and to external systems. An alternative interface offers improved performances for SPARQL queries. The resulting data can be explored by using any Semantic Web browser or application. CONCLUSIONS: This has been the first case of a mutation database exposed as Linked Data. A revised version of our prototype, including further concepts and IARC TP53 Mutation database data sets, is under development.The publication of variation information as Linked Data opens new perspectives: the exploitation of SPARQL searches on mutation data and other biological databases may support data retrieval which is presently not possible. Moreover, reasoning on integrated variation data may support discoveries towards personalized medicine.


Assuntos
Bases de Dados Genéticas , Genes p53 , Mutação , Variação Genética , Humanos , Internet , Semântica , Software
6.
Nucleic Acids Res ; 38(Web Server issue): W712-8, 2010 Jul.
Artigo em Inglês | MEDLINE | ID: mdl-20484374

RESUMO

The Bioinformatics group at the National Cancer Research Institute (IST) of Genoa has been involved since many years in the development and maintenance of biomedical information systems. Among them, the Common Access to Biological Resources and Information network services offer access to more than 130,000 biological resources, like strains of micro-organisms and human and animal cell lines, included in 29 collections from some of the most known European Biological Resource Centers. An Sequence Retrieval System (SRS) implementation of the TP53 Mutation Database of the International Agency for Research on Cancer (Lyon) was made available in order to improve interoperability of this data with other molecular biology databases. 'SRS by WS (SWS)', a system for retrieving information on public SRS sites and for directly querying them, was also implemented. In order to make this information available through application programming interfaces, we implemented a suite of free web services (WS), called the 'IST Bioinformatics Web Services (IBWS)'. A support web site, including a description of the system, a list of available WS together with help pages, links to corresponding WSDLs and forms for testing services, is available at http://bioinformatics.istge.it/ibws/. WSDL definitions can also be retrieved directly at http://bioinformatics.istge.it:8080/axis/services.


Assuntos
Bancos de Espécimes Biológicos , Biologia Computacional , Bases de Dados Factuais , Genes p53 , Software , Catálogos como Assunto , Bases de Dados de Ácidos Nucleicos , Internet , Mutação , Interface Usuário-Computador
SELEÇÃO DE REFERÊNCIAS
DETALHE DA PESQUISA