1.
Nature ; 620(7972): 47-60, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37532811

ABSTRACT

Artificial intelligence (AI) is being increasingly integrated into scientific discovery to augment and accelerate research, helping scientists to generate hypotheses, design experiments, collect and interpret large datasets, and gain insights that might not have been possible using traditional scientific methods alone. Here we examine breakthroughs over the past decade that include self-supervised learning, which allows models to be trained on vast amounts of unlabelled data, and geometric deep learning, which leverages knowledge about the structure of scientific data to enhance model accuracy and efficiency. Generative AI methods can create designs, such as small-molecule drugs and proteins, by analysing diverse data modalities, including images and sequences. We discuss how these methods can help scientists throughout the scientific process and the central issues that remain despite such advances. Both developers and users of AI tools need a better understanding of when such approaches need improvement, and challenges posed by poor data quality and stewardship remain. These issues cut across scientific disciplines and require developing foundational algorithmic approaches that can contribute to scientific understanding or acquire it autonomously, making them critical areas of focus for AI innovation.


Subjects
Artificial Intelligence ; Research Design ; Artificial Intelligence/standards ; Artificial Intelligence/trends ; Datasets as Topic ; Deep Learning ; Research Design/standards ; Research Design/trends ; Unsupervised Machine Learning
2.
Bioinformatics ; 39(5)2023 05 04.
Article in English | MEDLINE | ID: mdl-37140542

ABSTRACT

SUMMARY: Heterogeneous knowledge graphs (KGs) have enabled the modeling of complex systems, from genetic interaction graphs and protein-protein interaction networks to networks representing drugs, diseases, proteins, and side effects. Analytical methods for KGs rely on quantifying similarities between entities, such as nodes, in the graph. However, such methods must consider the diversity of node and edge types contained within the KG via, for example, defined sequences of entity types known as meta-paths. We present metapaths, the first R software package to implement meta-paths and perform meta-path-based similarity search in heterogeneous KGs. The metapaths package offers various built-in similarity metrics for node pair comparison by querying KGs represented as either edge or adjacency lists, as well as auxiliary aggregation methods to measure set-level relationships. Indeed, evaluation of these methods on an open-source biomedical KG recovered meaningful drug and disease-associated relationships, including those in Alzheimer's disease. The metapaths framework facilitates the scalable and flexible modeling of network similarities in KGs with applications across KG learning. AVAILABILITY AND IMPLEMENTATION: The metapaths R package is available via GitHub at https://github.com/ayushnoori/metapaths and is released under MPL 2.0 (Zenodo DOI: 10.5281/zenodo.7047209). Package documentation and usage examples are available at https://www.ayushnoori.com/metapaths.


Subjects
Alzheimer Disease ; Pattern Recognition, Automated ; Humans ; Software ; Protein Interaction Maps
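
The meta-path idea in the metapaths abstract above can be illustrated with a small sketch. This is not the API of the metapaths R package; it is a minimal Python illustration that assumes a toy graph with hypothetical node names and uses PathSim, one commonly cited meta-path similarity metric, as the example metric. The functions `count_paths` and `pathsim` are names invented for this sketch.

```python
from collections import defaultdict

# Toy heterogeneous graph; every node name and edge here is hypothetical.
node_type = {
    "drugA": "Drug", "drugB": "Drug",
    "gene1": "Gene", "gene2": "Gene",
    "diseaseX": "Disease",
}
edges = [
    ("drugA", "gene1"), ("drugA", "gene2"),
    ("drugB", "gene2"),
    ("gene1", "diseaseX"), ("gene2", "diseaseX"),
]

# Undirected adjacency list built from the edge list.
neighbors = defaultdict(set)
for u, v in edges:
    neighbors[u].add(v)
    neighbors[v].add(u)


def count_paths(start, end, meta_path):
    """Count walks from start to end whose node types follow meta_path."""
    if node_type[start] != meta_path[0]:
        return 0
    frontier = {start: 1}  # node -> number of type-consistent walks reaching it
    for required_type in meta_path[1:]:
        next_frontier = defaultdict(int)
        for node, n_walks in frontier.items():
            for nb in neighbors[node]:
                if node_type[nb] == required_type:
                    next_frontier[nb] += n_walks
        frontier = next_frontier
    return frontier.get(end, 0)


def pathsim(x, y, meta_path):
    """PathSim-style similarity; meta_path is assumed to be symmetric."""
    denom = count_paths(x, x, meta_path) + count_paths(y, y, meta_path)
    return 2 * count_paths(x, y, meta_path) / denom if denom else 0.0


drug_gene_drug = ["Drug", "Gene", "Drug"]
print(pathsim("drugA", "drugB", drug_gene_drug))
```

Here the meta-path Drug-Gene-Drug restricts the comparison to drugs that share gene neighbours; changing the type sequence changes which relationships contribute to the similarity.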
3.
Bioinformatics ; 39(2)2023 02 03.
Article in English | MEDLINE | ID: mdl-36805623

ABSTRACT

MOTIVATION: Predicting molecule-disease indications and side effects is important for drug development and pharmacovigilance. Comprehensively mining molecule-molecule, molecule-disease and disease-disease semantic dependencies can potentially improve prediction performance. METHODS: We introduce a Multi-Modal REpresentation Mapping Approach to Predicting molecular-disease relations (M2REMAP) by incorporating clinical semantics learned from electronic health records (EHR) of 12.6 million patients. Specifically, M2REMAP first learns a multimodal molecule representation that synthesizes chemical property and clinical semantic information by mapping molecule chemicals via a deep neural network onto the clinical semantic embedding space shared by drugs, diseases and other common clinical concepts. To infer molecule-disease relations, M2REMAP combines multimodal molecule representation and disease semantic embedding to jointly infer indications and side effects. RESULTS: We extensively evaluate M2REMAP on molecule indications, side effects and interactions. Results show that incorporating EHR embeddings improves performance significantly, for example, attaining an improvement over the baseline models by 23.6% in PRC-AUC on indications and 23.9% on side effects. Further, M2REMAP overcomes the limitation of existing methods and effectively predicts drugs for novel diseases and emerging pathogens. AVAILABILITY AND IMPLEMENTATION: The code is available at https://github.com/celehs/M2REMAP, and prediction results are provided at https://shiny.parse-health.org/drugs-diseases-dev/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Drug-Related Side Effects and Adverse Reactions ; Humans ; Drug Development ; Electronic Health Records ; Neural Networks, Computer ; Pharmacovigilance
5.
Proc Natl Acad Sci U S A ; 118(19)2021 05 11.
Article in English | MEDLINE | ID: mdl-33906951

ABSTRACT

The COVID-19 pandemic has highlighted the need to quickly and reliably prioritize clinically approved compounds for their potential effectiveness for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections. Here, we deployed algorithms relying on artificial intelligence, network diffusion, and network proximity, tasking each of them to rank 6,340 drugs for their expected efficacy against SARS-CoV-2. To test the predictions, we used as ground truth 918 drugs experimentally screened in VeroE6 cells, as well as the list of drugs in clinical trials that capture the medical community's assessment of drugs with potential COVID-19 efficacy. We find that no single predictive algorithm offers consistently reliable outcomes across all datasets and metrics. This outcome prompted us to develop a multimodal technology that fuses the predictions of all algorithms, finding that a consensus among the different predictive methods consistently exceeds the performance of the best individual pipelines. We screened in human cells the top-ranked drugs, obtaining a 62% success rate, in contrast to the 0.8% hit rate of nonguided screenings. Of the six drugs that reduced viral infection, four could be directly repurposed to treat COVID-19, proposing novel treatments for COVID-19. We also found that 76 of the 77 drugs that successfully reduced viral infection do not bind the proteins targeted by SARS-CoV-2, indicating that these network drugs rely on network-based mechanisms that cannot be identified using docking-based strategies. These advances offer a methodological pathway to identify repurposable drugs for future pathogens and neglected diseases underserved by the costs and extended timeline of de novo drug development.


Subjects
COVID-19 Drug Treatment ; Drug Repositioning/methods ; Systems Biology/methods ; Animals ; Antiviral Agents/administration & dosage ; Antiviral Agents/pharmacology ; Antiviral Agents/therapeutic use ; Chlorocebus aethiops ; Databases, Pharmaceutical ; Humans ; Neural Networks, Computer ; Protein Binding ; Vero Cells ; Viral Proteins/metabolism
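
The abstract above reports that a consensus across predictive methods outperformed the best individual pipeline, but it does not say how the predictions were fused. As a rough illustration only, the sketch below assumes a simple mean-rank aggregation; the method names, drug names, and scores are all hypothetical, and this should not be read as the authors' fusion procedure.

```python
from statistics import mean

# Hypothetical scores from three pipelines (higher = more promising).
scores = {
    "ai":        {"drug1": 0.9, "drug2": 0.2, "drug3": 0.5, "drug4": 0.1},
    "diffusion": {"drug1": 0.4, "drug2": 0.8, "drug3": 0.7, "drug4": 0.1},
    "proximity": {"drug1": 0.6, "drug2": 0.3, "drug3": 0.9, "drug4": 0.2},
}


def to_ranks(method_scores):
    """Map each drug to its rank under one method (1 = best score)."""
    ordered = sorted(method_scores, key=method_scores.get, reverse=True)
    return {drug: rank for rank, drug in enumerate(ordered, start=1)}


def consensus_ranking(all_scores):
    """Order drugs by mean rank across methods (lowest mean rank first).

    Assumes every method scored the same set of drugs.
    """
    per_method = [to_ranks(s) for s in all_scores.values()]
    drugs = list(per_method[0])
    mean_rank = {d: mean(r[d] for r in per_method) for d in drugs}
    # Ties on mean rank are broken alphabetically to keep the output stable.
    return sorted(drugs, key=lambda d: (mean_rank[d], d))


print(consensus_ranking(scores))
```

In this toy input no single method ranks drug3 first in every list, yet it comes out on top of the consensus because it is never ranked below second.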
6.
Nat Methods ; 17(12): 1200-1206, 2020 12.
Article in English | MEDLINE | ID: mdl-33077966

ABSTRACT

Although tremendous effort has been put into cell-type annotation, identification of previously uncharacterized cell types in heterogeneous single-cell RNA-seq data remains a challenge. Here we present MARS, a meta-learning approach for identifying and annotating known as well as new cell types. MARS overcomes the heterogeneity of cell types by transferring latent cell representations across multiple datasets. MARS uses deep learning to learn a cell embedding function as well as a set of landmarks in the cell embedding space. The method has a unique ability to discover cell types that have never been seen before and annotate experiments that are as yet unannotated. We apply MARS to a large mouse cell atlas and show its ability to accurately identify cell types, even when it has never seen them before. Further, MARS automatically generates interpretable names for new cell types by probabilistically defining a cell type in the embedding space.


Subjects
Cells/classification ; Single-Cell Analysis/methods ; Transcriptome/genetics ; Algorithms ; Animals ; Databases, Factual ; Gene Expression Profiling ; Mice ; RNA/genetics ; Sequence Analysis, RNA ; Software
7.
J Biomed Inform ; 143: 104415, 2023 07.
Article in English | MEDLINE | ID: mdl-37276949

ABSTRACT

Disease knowledge graphs have emerged as a powerful tool for artificial intelligence to connect, organize, and access diverse information about diseases. Relations between disease concepts are often distributed across multiple datasets, including unstructured plain text datasets and incomplete disease knowledge graphs. Extracting disease relations from multimodal data sources is thus crucial for constructing accurate and comprehensive disease knowledge graphs. We introduce REMAP, a multimodal approach for disease relation extraction. The REMAP machine learning approach jointly embeds a partial, incomplete knowledge graph and a medical language dataset into a compact latent vector space, aligning the multimodal embeddings for optimal disease relation extraction. Additionally, REMAP utilizes a decoupled model structure to enable inference in single-modal data, which can be applied under missing modality scenarios. We apply the REMAP approach to a disease knowledge graph with 96,913 relations and a text dataset of 1.24 million sentences. On a dataset annotated by human experts, REMAP improves language-based disease relation extraction by 10.0% (accuracy) and 17.2% (F1-score) by fusing disease knowledge graphs with language information. Furthermore, REMAP leverages text information to recommend new relationships in the knowledge graph, outperforming graph-based methods by 8.4% (accuracy) and 10.4% (F1-score). REMAP is a flexible multimodal approach for extracting disease relations by fusing structured knowledge and language information. This approach provides a powerful model to easily find, access, and evaluate relations between disease concepts.


Subjects
Artificial Intelligence ; Machine Learning ; Humans ; Unified Medical Language System ; Language ; Natural Language Processing
8.
Bioinformatics ; 36(22-23): 5545-5547, 2021 04 01.
Article in English | MEDLINE | ID: mdl-33275143

ABSTRACT

SUMMARY: Accurate prediction of drug-target interactions (DTI) is crucial for drug discovery. Recently, deep learning (DL) models show promising performance for DTI prediction. However, these models can be difficult to use for both computer scientists entering the biomedical field and bioinformaticians with limited DL experience. We present DeepPurpose, a comprehensive and easy-to-use DL library for DTI prediction. DeepPurpose supports training of customized DTI prediction models by implementing 15 compound and protein encoders and over 50 neural architectures, along with providing many other useful features. We demonstrate state-of-the-art performance of DeepPurpose on several benchmark datasets. AVAILABILITY AND IMPLEMENTATION: https://github.com/kexinhuang12345/DeepPurpose. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Deep Learning ; Pharmaceutical Preparations ; Drug Development ; Drug Discovery ; Proteins
9.
Proc Natl Acad Sci U S A ; 116(10): 4426-4433, 2019 03 05.
Article in English | MEDLINE | ID: mdl-30765515

ABSTRACT

Phenotype robustness to environmental fluctuations is a common biological phenomenon. Although most phenotypes involve multiple proteins that interact with each other, the basic principles of how such interactome networks respond to environmental unpredictability and change during evolution are largely unknown. Here we study interactomes of 1,840 species across the tree of life involving a total of 8,762,166 protein-protein interactions. Our study focuses on the resilience of interactomes to network failures and finds that interactomes become more resilient during evolution, meaning that interactomes become more robust to network failures over time. In bacteria, we find that a more resilient interactome is in turn associated with the greater ability of the organism to survive in a more complex, variable, and competitive environment. We find that at the protein family level proteins exhibit a coordinated rewiring of interactions over time and that a resilient interactome arises through gradual change of the network topology. Our findings have implications for understanding molecular network structure in the context of both evolution and environment.


Subjects
Biological Evolution ; Protein Interaction Mapping ; Proteome/metabolism ; Species Specificity
10.
Bioinformatics ; 35(10): 1745-1752, 2019 05 15.
Article in English | MEDLINE | ID: mdl-30307536

ABSTRACT

MOTIVATION: State-of-the-art biomedical named entity recognition (BioNER) systems often require handcrafted features specific to each entity type, such as genes, chemicals and diseases. Although recent studies explored using neural network models for BioNER to free experts from manual feature engineering, the performance remains limited by the available training data for each entity type. RESULTS: We propose a multi-task learning framework for BioNER to collectively use the training data of different types of entities and improve the performance on each of them. In experiments on 15 benchmark BioNER datasets, our multi-task model achieves substantially better performance compared with state-of-the-art BioNER systems and baseline neural sequence labeling models. Further analysis shows that the large performance gains come from sharing character- and word-level information among relevant biomedical entities across differently labeled corpora. AVAILABILITY AND IMPLEMENTATION: Our source code is available at https://github.com/yuzhimanhua/lm-lstm-crf. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Deep Learning ; Neural Networks, Computer ; Benchmarking ; Software
11.
Bioinformatics ; 34(13): i457-i466, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949996

ABSTRACT

Motivation: The use of drug combinations, termed polypharmacy, is common to treat patients with complex diseases or co-existing conditions. However, a major consequence of polypharmacy is a much higher risk of adverse side effects for the patient. Polypharmacy side effects emerge because of drug-drug interactions, in which activity of one drug may change, favorably or unfavorably, if taken with another drug. The knowledge of drug interactions is often limited because these complex relationships are rare, and are usually not observed in relatively small clinical testing. Discovering polypharmacy side effects thus remains an important challenge with significant implications for patient mortality and morbidity. Results: Here, we present Decagon, an approach for modeling polypharmacy side effects. The approach constructs a multimodal graph of protein-protein interactions, drug-protein target interactions and the polypharmacy side effects, which are represented as drug-drug interactions, where each side effect is an edge of a different type. Decagon is developed specifically to handle such multimodal graphs with a large number of edge types. Our approach develops a new graph convolutional neural network for multirelational link prediction in multimodal networks. Unlike approaches limited to predicting simple drug-drug interaction values, Decagon can predict the exact side effect, if any, through which a given drug combination manifests clinically. Decagon accurately predicts polypharmacy side effects, outperforming baselines by up to 69%. We find that it automatically learns representations of side effects indicative of co-occurrence of polypharmacy in patients. Furthermore, Decagon models particularly well polypharmacy side effects that have a strong molecular basis, while on predominantly non-molecular side effects, it achieves good performance because of effective sharing of model parameters across edge types. 
Decagon opens up opportunities to use large pharmacogenomic and patient population data to flag and prioritize polypharmacy side effects for follow-up analysis via formal pharmacological studies. Availability and implementation: Source code and preprocessed datasets are at: http://snap.stanford.edu/decagon.


Subjects
Computational Biology/methods ; Drug Interactions ; Drug-Related Side Effects and Adverse Reactions ; Neural Networks, Computer ; Polypharmacy ; Data Visualization ; Female ; Humans ; Models, Biological ; Protein Interaction Maps ; Software
13.
Inf Fusion ; 50: 71-91, 2019 Oct.
Article in English | MEDLINE | ID: mdl-30467459

ABSTRACT

New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include myriad properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.

14.
Bioinformatics ; 33(14): i190-i198, 2017 Jul 15.
Article in English | MEDLINE | ID: mdl-28881986

ABSTRACT

MOTIVATION: Understanding functions of proteins in specific human tissues is essential for insights into disease diagnostics and therapeutics, yet prediction of tissue-specific cellular function remains a critical challenge for biomedicine. RESULTS: Here, we present OhmNet, a hierarchy-aware unsupervised node feature learning approach for multi-layer networks. We build a multi-layer network, where each layer represents molecular interactions in a different human tissue. OhmNet then automatically learns a mapping of proteins, represented as nodes, to a neural embedding-based low-dimensional space of features. OhmNet encourages sharing of similar features among proteins with similar network neighborhoods and among proteins activated in similar tissues. The algorithm generalizes prior work, which generally ignores relationships between tissues, by modeling tissue organization with a rich multiscale tissue hierarchy. We use OhmNet to study multicellular function in a multi-layer protein interaction network of 107 human tissues. In 48 tissues with known tissue-specific cellular functions, OhmNet provides more accurate predictions of cellular function than alternative approaches, and also generates more accurate hypotheses about tissue-specific protein actions. We show that taking into account the tissue hierarchy leads to improved predictive power. Remarkably, we also demonstrate that it is possible to leverage the tissue hierarchy in order to effectively transfer cellular functions to a functionally uncharacterized tissue. Overall, OhmNet moves from flat networks to multiscale models able to predict a range of phenotypes spanning cellular subsystems. AVAILABILITY AND IMPLEMENTATION: Source code and datasets are available at http://snap.stanford.edu/ohmnet. CONTACT: jure@cs.stanford.edu.


Subjects
Computational Biology/methods ; Protein Interaction Maps ; Software ; Algorithms ; Humans ; Organ Specificity
15.
Bioinformatics ; 32(12): i90-i100, 2016 06 15.
Article in English | MEDLINE | ID: mdl-27307649

ABSTRACT

MOTIVATION: The rapid growth of diverse biological data allows us to consider interactions between a variety of objects, such as genes, chemicals, molecular signatures, diseases, pathways and environmental exposures. Often, any pair of objects, such as a gene and a disease, can be related in different ways, for example, directly via gene-disease associations or indirectly via functional annotations, chemicals and pathways. Different ways of relating these objects carry different semantic meanings. However, traditional methods disregard these semantics and thus cannot fully exploit their value in data modeling. RESULTS: We present Medusa, an approach to detect size-k modules of objects that, taken together, appear most significant to another set of objects. Medusa operates on large-scale collections of heterogeneous datasets and explicitly distinguishes between diverse data semantics. It advances research along two dimensions: it builds on collective matrix factorization to derive different semantics, and it formulates the growing of the modules as a submodular optimization program. Medusa is flexible in choosing or combining semantic meanings and provides theoretical guarantees about detection quality. In a systematic study on 310 complex diseases, we show the effectiveness of Medusa in associating genes with diseases and detecting disease modules. We demonstrate that in predicting gene-disease associations Medusa compares favorably to methods that ignore diverse semantic meanings. We find that the utility of different semantics depends on disease categories and that, overall, Medusa recovers disease modules more accurately when combining different semantics. AVAILABILITY AND IMPLEMENTATION: Source code is at http://github.com/marinkaz/medusa. CONTACT: marinka@cs.stanford.edu, blaz.zupan@fri.uni-lj.si.


Subjects
Computational Biology/methods ; Data Compression ; Disease/genetics ; Semantics ; Algorithms ; Gene Ontology ; Humans
16.
Bioinformatics ; 32(10): 1527-35, 2016 05 15.
Article in English | MEDLINE | ID: mdl-26787667

ABSTRACT

MOTIVATION: RNA binding proteins (RBPs) play important roles in post-transcriptional control of gene expression, including splicing, transport, polyadenylation and RNA stability. To model protein-RNA interactions by considering all available sources of information, it is necessary to integrate the rapidly growing RBP experimental data with the latest genome annotation, gene function, RNA sequence and structure. Such integration is possible by matrix factorization, where current approaches have an undesired tendency to identify only a small number of the strongest patterns with overlapping features. Because protein-RNA interactions are orchestrated by multiple factors, methods that identify discriminative patterns of varying strengths are needed. RESULTS: We have developed an integrative orthogonality-regularized nonnegative matrix factorization (iONMF) to integrate multiple data sources and discover non-overlapping, class-specific RNA binding patterns of varying strengths. The orthogonality constraint halves the effective size of the factor model and outperforms other NMF models in predicting RBP interaction sites on RNA. We have integrated the largest data compendium to date, which includes 31 CLIP experiments on 19 RBPs involved in splicing (such as hnRNPs, U2AF2, ELAVL1, TDP-43 and FUS) and processing of 3'UTR (Ago, IGF2BP). We show that the integration of multiple data sources improves the predictive accuracy of retrieval of RNA binding sites. In our study the key predictive factors of protein-RNA interactions were the position of RNA structure and sequence motifs, RBP co-binding and gene region type. We report on a number of protein-specific patterns, many of which are consistent with experimentally determined properties of RBPs. 
AVAILABILITY AND IMPLEMENTATION: The iONMF implementation and example datasets are available at https://github.com/mstrazar/ionmf. CONTACT: tomaz.curk@fri.uni-lj.si. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Models, Molecular ; RNA-Binding Proteins ; Binding Sites ; Data Collection ; Datasets as Topic ; RNA
17.
Bioinformatics ; 31(12): i230-9, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-26072487

ABSTRACT

MOTIVATION: Markov networks are undirected graphical models that are widely used to infer relations between genes from experimental data. Their state-of-the-art inference procedures assume the data arise from a Gaussian distribution. High-throughput omics data, such as that from next generation sequencing, often violates this assumption. Furthermore, when collected data arise from multiple related but otherwise nonidentical distributions, their underlying networks are likely to have common features. New principled statistical approaches are needed that can deal with different data distributions and jointly consider collections of datasets. RESULTS: We present FuseNet, a Markov network formulation that infers networks from a collection of nonidentically distributed datasets. Our approach is computationally efficient and general: given any number of distributions from an exponential family, FuseNet represents model parameters through shared latent factors that define neighborhoods of network nodes. In a simulation study, we demonstrate good predictive performance of FuseNet in comparison to several popular graphical models. We show its effectiveness in an application to breast cancer RNA-sequencing and somatic mutation data, a novel application of graphical models. Fusion of datasets offers substantial gains relative to inference of separate networks for each dataset. Our results demonstrate that network inference methods for non-Gaussian data can help in accurate modeling of the data generated by emergent high-throughput technologies. AVAILABILITY AND IMPLEMENTATION: Source code is at https://github.com/marinkaz/fusenet.


Subjects
Gene Expression Profiling/methods ; Gene Regulatory Networks ; Algorithms ; Breast Neoplasms/genetics ; Female ; High-Throughput Nucleotide Sequencing ; Humans ; Markov Chains ; Poisson Distribution ; Sequence Analysis, RNA
18.
PLoS Comput Biol ; 11(10): e1004552, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26465776

ABSTRACT

Data integration procedures combine heterogeneous data sets into predictive models, but they are limited to data explicitly related to the target object type, such as genes. Collage is a new data fusion approach to gene prioritization. It considers data sets of various association levels with the prediction task, utilizes collective matrix factorization to compress the data, and chaining to relate different object types contained in a data compendium. Collage prioritizes genes based on their similarity to several seed genes. We tested Collage by prioritizing bacterial response genes in Dictyostelium as a novel model system for prokaryote-eukaryote interactions. Using 4 seed genes and 14 data sets, only one of which was directly related to the bacterial response, Collage proposed 8 candidate genes that were readily validated as necessary for the response of Dictyostelium to Gram-negative bacteria. These findings establish Collage as a method for inferring biological knowledge from the integration of heterogeneous and coarsely related data sets.


Subjects
Data Compression/methods ; Databases, Genetic ; Dictyostelium/metabolism ; Dictyostelium/microbiology ; Gram-Negative Bacteria/physiology ; Protozoan Proteins/metabolism ; Cell Proliferation/physiology ; Data Mining/methods ; Protozoan Proteins/genetics
19.
BMC Bioinformatics ; 16 Suppl 16: S1, 2015.
Article in English | MEDLINE | ID: mdl-26551454

ABSTRACT

BACKGROUND: Relation extraction is an essential procedure in literature mining. It focuses on extracting semantic relations between parts of text, called mentions. Biomedical literature includes an enormous amount of textual descriptions of biological entities, their interactions and results of related experiments. To extract them in an explicit, computer readable format, these relations were at first extracted manually from databases. Manual curation was later replaced with automatic or semi-automatic tools with natural language processing capabilities. The current challenge is the development of information extraction procedures that can directly infer more complex relational structures, such as gene regulatory networks. RESULTS: We develop a computational approach for extraction of gene regulatory networks from textual data. Our method is designed as a sieve-based system and uses linear-chain conditional random fields and rules for relation extraction. With this method we successfully extracted the sporulation gene regulation network in the bacterium Bacillus subtilis for the information extraction challenge at the BioNLP 2013 conference. To enable extraction of distant relations using first-order models, we transform the data into skip-mention sequences. We infer multiple models, each of which is able to extract different relationship types. Following the shared task, we conducted additional analysis using different system settings that resulted in reducing the reconstruction error of bacterial sporulation network from 0.73 to 0.68, measured as the slot error rate between the predicted and the reference network. We observe that all relation extraction sieves contribute to the predictive performance of the proposed approach. Also, features constructed by considering mention words and their prefixes and suffixes are the most important features for higher accuracy of extraction. 
Analysis of distances between different mention types in the text shows that our choice of transforming data into skip-mention sequences is appropriate for detecting relations between distant mentions. CONCLUSIONS: Linear-chain conditional random fields, along with appropriate data transformations, can be efficiently used to extract relations. The sieve-based architecture simplifies the system as new sieves can be easily added or removed and each sieve can utilize the results of previous ones. Furthermore, sieves with conditional random fields can be trained on arbitrary text data and hence are applicable to a broad range of relation extraction tasks and data domains.


Subjects
Gene Regulatory Networks ; Information Storage and Retrieval ; Publications ; Algorithms ; Models, Theoretical
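
The slot error rate quoted in the abstract above (0.73 reduced to 0.68) is commonly defined as the number of substitutions, deletions, and insertions divided by the number of slots in the reference. The sketch below assumes that definition and assumes that a "slot" is a typed edge keyed by its (source, target) pair; the gene names and relation labels are hypothetical, and the shared task's official scorer may match edges differently.

```python
def slot_error_rate(predicted, reference):
    """Return (substitutions + deletions + insertions) / number of reference edges.

    Both arguments map a (source, target) pair to a relation label.
    """
    if not reference:
        raise ValueError("reference network must contain at least one edge")
    # Substitution: the pair is predicted, but with the wrong relation label.
    substitutions = sum(
        1 for pair, rel in predicted.items()
        if pair in reference and reference[pair] != rel
    )
    # Deletion: a reference edge that was not predicted at all.
    deletions = sum(1 for pair in reference if pair not in predicted)
    # Insertion: a predicted edge that does not exist in the reference.
    insertions = sum(1 for pair in predicted if pair not in reference)
    return (substitutions + deletions + insertions) / len(reference)


# Hypothetical reference and predicted networks.
reference = {
    ("geneA", "geneB"): "activation",
    ("geneB", "geneC"): "inhibition",
    ("geneC", "geneD"): "activation",
    ("geneD", "geneE"): "activation",
}
predicted = {
    ("geneA", "geneB"): "activation",   # correct
    ("geneB", "geneC"): "activation",   # substitution
    ("geneC", "geneD"): "activation",   # correct
    ("geneX", "geneY"): "activation",   # insertion
}

print(slot_error_rate(predicted, reference))
```

Because insertions are counted against the size of the reference, this score can exceed 1 when a system predicts many spurious edges, so lower values indicate a better reconstruction.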
20.
Bioinformatics ; 30(12): i246-i254, 2014 Jun 15.
Article in English | MEDLINE | ID: mdl-24931990

ABSTRACT

MOTIVATION: Epistasis analysis is an essential tool of classical genetics for inferring the order of function of genes in a common pathway. Typically, it considers single and double mutant phenotypes and for a pair of genes observes whether a change in the first gene masks the effects of the mutation in the second gene. Despite the recent emergence of biotechnology techniques that can provide gene interaction data on a large, possibly genomic scale, few methods are available for quantitative epistasis analysis and epistasis-based network reconstruction. RESULTS: We here propose a conceptually new probabilistic approach to gene network inference from quantitative interaction data. The approach is founded on epistasis analysis. Its features are joint treatment of the mutant phenotype data with a factorized model and probabilistic scoring of pairwise gene relationships that are inferred from the latent gene representation. The resulting gene network is assembled from scored pairwise relationships. In an experimental study, we show that the proposed approach can accurately reconstruct several known pathways and that it surpasses the accuracy of current approaches. AVAILABILITY AND IMPLEMENTATION: Source code is available at http://github.com/biolab/red.


Subjects
Epistasis, Genetic ; Gene Regulatory Networks ; Models, Statistical ; Algorithms ; Endoplasmic Reticulum-Associated Degradation/genetics ; Glycosylation ; Mutation ; Phenotype ; Phosphatidylserines/metabolism