1.
Genome Res ; 31(8): 1498-1511, 2021 08.
Article in English | MEDLINE | ID: mdl-34183452

ABSTRACT

Dictyostelium development begins with single-cell starvation and ends with multicellular fruiting bodies. Developmental morphogenesis is accompanied by sweeping transcriptional changes, encompassing nearly half of the 13,000 genes in the genome. We performed time-series RNA-sequencing analyses of the wild type and 20 mutants to explore the relationships between transcription and morphogenesis. These strains show developmental arrest at different stages, accelerated development, or atypical morphologies. Considering eight major morphological transitions, we identified 1371 milestone genes whose expression changes sharply between consecutive transitions. We also identified 1099 genes as members of 21 regulons, which are groups of genes that remain coordinately regulated despite the genetic, temporal, and developmental perturbations. The gene annotations in these groups validate known transitions and reveal new developmental events. For example, DNA replication genes are tightly coregulated with cell division genes, so they are expressed in mid-development although chromosomal DNA is not replicated. Our data set includes 486 transcriptional profiles that can help identify new relationships between transcription and development and improve gene annotations. We show its utility by showing that cycles of aggregation and disaggregation in allorecognition-defective mutants involve dedifferentiation. We also show sensitivity to genetic and developmental conditions in two commonly used actin genes, act6 and act15, and robustness of the coaA gene. Finally, we propose that gpdA is a better mRNA quantitation standard because it is less sensitive to external conditions than commonly used standards. The data set is available for democratized exploration through the web application dictyExpress and the data mining environment Orange.


Subjects
Dictyostelium, Dictyostelium/genetics, Morphogenesis, Messenger RNA/metabolism, Regulon, Software
2.
Clin Microbiol Infect ; 27(7): 1039.e1-1039.e7, 2021 Jul.
Article in English | MEDLINE | ID: mdl-33838303

ABSTRACT

OBJECTIVES: Seroprevalence surveys provide crucial information on cumulative severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) exposure. This Slovenian nationwide population study is the first longitudinal 6-month serosurvey using probability-based samples across all age categories. METHODS: Each participant supplied two blood samples: 1316 samples in April 2020 (first round) and 1211 in October/November 2020 (second round). The first-round sera were tested using Euroimmun Anti-SARS-CoV-2 ELISA IgG (ELISA) and, because of uncertain estimates, were retested using Elecsys Anti-SARS-CoV-2 (Elecsys-N) and Elecsys Anti-SARS-CoV-2 S (Elecsys-S). The second-round sera were concomitantly tested using Elecsys-N/Elecsys-S. RESULTS: The populations of both rounds matched the overall population (n = 3000), with minor settlement type and age differences. The first-round seroprevalence corrected for the ELISA manufacturer's specificity was 2.78% (95% highest density interval [HDI] 1.81%-3.80%), corrected using pooled ELISA specificity calculated from published data 0.93% (95% CI 0.00%-2.65%), and based on Elecsys-N/Elecsys-S results 0.87% (95% HDI 0.40%-1.38%). The second-round unadjusted lower limit of seroprevalence on 11 November 2020 was 4.06% (95% HDI 2.97%-5.16%) and on 3 October 2020, unadjusted upper limit was 4.29% (95% HDI 3.18%-5.47%). CONCLUSIONS: SARS-CoV-2 seroprevalence in Slovenia increased four-fold from late April to October/November 2020, mainly due to a devastating second wave. Significant logistic/methodological challenges accompanied both rounds. 
The main lessons learned were a need for caution when relying on manufacturer-generated assay evaluation data, the importance of multiple manufacturer-independent assay performance assessments, the need for concomitant use of highly-specific serological assays targeting different SARS-CoV-2 proteins in serosurveys conducted in low-prevalence settings or during epidemic exponential growth and the usefulness of a Bayesian approach for overcoming complex methodological challenges.
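
The sensitivity of a seroprevalence estimate to the assumed assay specificity can be illustrated with the classical Rogan-Gladen correction. This is a minimal sketch and not the Bayesian method the study used; the function name and all numbers below are hypothetical and chosen only for illustration.

```python
def rogan_gladen(apparent_prevalence, sensitivity, specificity):
    """Correct a raw (apparent) prevalence for imperfect test accuracy.

    Classical Rogan-Gladen estimator; shown as an illustration only,
    it is NOT the Bayesian approach described in the abstract.
    """
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("test must be better than chance (Se + Sp > 1)")
    corrected = (apparent_prevalence + specificity - 1.0) / denominator
    # The raw estimator can leave [0, 1] at low prevalence, so clamp it.
    return min(1.0, max(0.0, corrected))


# Hypothetical inputs: 4% of sera test positive, sensitivity 90%.
raw_positive_fraction = 0.04
for assumed_specificity in (0.990, 0.975):
    estimate = rogan_gladen(raw_positive_fraction, 0.90, assumed_specificity)
    print(f"specificity {assumed_specificity:.3f} -> corrected {estimate:.4f}")
```

In a low-prevalence setting, a change of about one percentage point in the assumed specificity shifts the corrected estimate substantially, which is consistent with the caution about manufacturer-reported assay data.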


Subjects
COVID-19 Serological Testing/statistics & numerical data, COVID-19/epidemiology, COVID-19/immunology, Adolescent, Adult, Age Distribution, Aged, Aged 80 and over, Antiviral Antibodies/blood, Bayes Theorem, Child, Preschool Child, Enzyme-Linked Immunosorbent Assay, Female, Humans, Immunoglobulin G/blood, Infant, Newborn Infant, Male, Middle Aged, Pandemics, Population Surveillance, Prevalence, Sensitivity and Specificity, Seroepidemiologic Studies, Sex Distribution, Slovenia/epidemiology, Young Adult
3.
PLoS Comput Biol ; 17(3): e1008671, 2021 03.
Article in English | MEDLINE | ID: mdl-33661899

ABSTRACT

Overfitting is one of the critical problems in developing models by machine learning. With machine learning becoming an essential technology in computational biology, we must include training about overfitting in all courses that introduce this technology to students and practitioners. We here propose hands-on training on overfitting that is suitable for introductory level courses and can be carried out on its own or embedded within any data science course. We use workflow-based design of machine learning pipelines, experimentation-based teaching, and a hands-on approach that focuses on concepts rather than underlying mathematics. We here detail the data analysis workflows we use in training and motivate them from the viewpoint of teaching goals. Our proposed approach relies on Orange, an open-source data science toolbox that combines data visualization and machine learning, and that is tailored for education in machine learning and explorative data analysis.
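
The core idea of overfitting can be shown in a few lines of plain Python. The sketch below is not an Orange workflow and is not taken from the paper; it is an assumed, minimal experiment in which a 1-nearest-neighbor model is trained on purely random labels, so any accuracy it reports on its own training data is an artifact of memorization.

```python
import random


def nearest_neighbor_predict(train_X, train_y, x):
    """Return the label of the training point closest to x (1-NN)."""
    closest = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[closest]


def accuracy(train_X, train_y, X, y):
    """Fraction of points in (X, y) that the 1-NN model labels correctly."""
    hits = sum(
        nearest_neighbor_predict(train_X, train_y, x) == label
        for x, label in zip(X, y)
    )
    return hits / len(y)


rng = random.Random(0)


def random_data(n, n_features=5):
    """Random features with random binary labels: there is nothing to learn."""
    X = [[rng.random() for _ in range(n_features)] for _ in range(n)]
    y = [rng.randint(0, 1) for _ in range(n)]
    return X, y


train_X, train_y = random_data(200)
test_X, test_y = random_data(200)

train_acc = accuracy(train_X, train_y, train_X, train_y)  # scored on seen data
test_acc = accuracy(train_X, train_y, test_X, test_y)     # scored on unseen data
print(f"training accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Scoring on the training data rewards memorization, while the held-out data shows performance near chance, which is the contrast that a hands-on overfitting exercise aims to make visible.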


Subjects
Computational Biology, Data Science, Machine Learning, Statistical Models, Computational Biology/education, Computational Biology/methods, Data Science/education, Data Science/methods, Humans, Biological Models, Software
4.
Nat Commun ; 10(1): 4551, 2019 10 07.
Article in English | MEDLINE | ID: mdl-31591416

ABSTRACT

Analysis of biomedical images requires computational expertise that is uncommon among biomedical scientists. Deep learning approaches for image analysis provide an opportunity to develop user-friendly tools for exploratory data analysis. Here, we use the visual programming toolbox Orange ( http://orange.biolab.si ) to simplify image analysis by integrating deep-learning embedding, machine learning procedures, and data visualization. Orange supports the construction of data analysis workflows by assembling components for data preprocessing, visualization, and modeling. We equipped Orange with components that use pre-trained deep convolutional networks to profile images with vectors of features. These vectors are used in image clustering and classification in a framework that enables mining of image sets for both novice and experienced users. We demonstrate the utility of the tool in image analysis of progenitor cells in mouse bone healing, identification of developmental competence in mouse oocytes, subcellular protein localization in yeast, and developmental morphology of social amoebae.


Subjects
Computational Biology/methods, Computer-Assisted Image Processing/methods, Machine Learning, Computer Neural Networks, Animals, Dictyostelium/cytology, Dictyostelium/growth & development, Dictyostelium/metabolism, Green Fluorescent Proteins/genetics, Green Fluorescent Proteins/metabolism, Internet, Life Cycle Stages, Transgenic Mice, Oocytes/metabolism, Reproducibility of Results, Saccharomyces cerevisiae/metabolism, Saccharomyces cerevisiae Proteins/metabolism
5.
Bioinformatics ; 35(14): i4-i12, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510695

ABSTRACT

MOTIVATION: Single-cell RNA sequencing allows us to simultaneously profile the transcriptomes of thousands of cells and to indulge in exploring cell diversity, development and discovery of new molecular mechanisms. Analysis of scRNA data involves a combination of non-trivial steps from statistics, data visualization, bioinformatics and machine learning. Training molecular biologists in single-cell data analysis and empowering them to review and analyze their data can be challenging, both because of the complexity of the methods and the steep learning curve. RESULTS: We propose a workshop-style training in single-cell data analytics that relies on an explorative data analysis toolbox and a hands-on teaching style. The training relies on scOrange, a newly developed extension of a data mining framework that features workflow design through visual programming and interactive visualizations. Workshops with scOrange can proceed much faster than similar training methods that rely on computer programming and analysis through scripting in R or Python, allowing the trainer to cover more ground in the same time-frame. We here review the design principles of the scOrange toolbox that support such workshops and propose a syllabus for the course. We also provide examples of data analysis workflows that instructors can use during the training. AVAILABILITY AND IMPLEMENTATION: scOrange is an open-source software. The software, documentation and an emerging set of educational videos are available at http://singlecell.biolab.si.


Subjects
Computational Biology, Data Science, Software, RNA Sequence Analysis, Workflow
6.
PLoS One ; 14(6): e0217994, 2019.
Article in English | MEDLINE | ID: mdl-31185054

ABSTRACT

Non-negative matrix tri-factorization (NMTF) is a popular technique for learning low-dimensional feature representation of relational data. Currently, NMTF learns a representation of a dataset through an optimization procedure that typically uses multiplicative update rules. This procedure has had limited success, and its failure cases have not been well understood. We here perform an empirical study involving six large datasets comparing multiplicative update rules with three alternative optimization methods: alternating least squares, projected gradients, and coordinate descent. We find that methods based on projected gradients and coordinate descent converge up to twenty-four times faster than multiplicative update rules. Furthermore, the alternating least squares method can quickly train NMTF models on sparse datasets but often fails on dense datasets. Coordinate descent-based NMTF converges up to sixteen times faster compared to well-established methods.
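
For readers unfamiliar with the baseline, the multiplicative update rules for NMTF can be sketched as follows. This is a minimal pure-Python illustration of the standard unconstrained rules for minimizing the Frobenius error of X ≈ U S Vᵀ; it is not the paper's code, and the function names, matrix sizes, and iteration count are assumptions.

```python
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def multiplicative_step(F, numerator, denominator, eps=1e-9):
    """Element-wise update F * numerator / denominator; keeps F non-negative."""
    return [
        [f * n / (d + eps) for f, n, d in zip(f_row, n_row, d_row)]
        for f_row, n_row, d_row in zip(F, numerator, denominator)
    ]


def frobenius_error(X, U, S, V):
    approx = matmul(matmul(U, S), transpose(V))
    return sum(
        (x - a) ** 2 for x_row, a_row in zip(X, approx) for x, a in zip(x_row, a_row)
    ) ** 0.5


def nmtf(X, k1, k2, n_iter=200, seed=0):
    """Factorize non-negative X (n x m) as U (n x k1) S (k1 x k2) V^T (k2 x m)."""
    rng = random.Random(seed)

    def random_matrix(rows, cols):
        return [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rows)]

    n, m = len(X), len(X[0])
    U, S, V = random_matrix(n, k1), random_matrix(k1, k2), random_matrix(m, k2)
    Xt = transpose(X)
    errors = [frobenius_error(X, U, S, V)]
    for _ in range(n_iter):
        B = matmul(S, transpose(V))  # k1 x m, so X ~ U B
        U = multiplicative_step(
            U, matmul(X, transpose(B)), matmul(U, matmul(B, transpose(B)))
        )
        A = matmul(U, S)  # n x k2, so X ~ A V^T
        V = multiplicative_step(V, matmul(Xt, A), matmul(V, matmul(transpose(A), A)))
        Ut = transpose(U)
        S = multiplicative_step(
            S,
            matmul(matmul(Ut, X), V),
            matmul(matmul(matmul(Ut, U), S), matmul(transpose(V), V)),
        )
        errors.append(frobenius_error(X, U, S, V))
    return U, S, V, errors


# Small random non-negative matrix, for illustration only.
data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(5)] for _ in range(6)]
U, S, V, errors = nmtf(X, k1=2, k2=2)
print(f"error before: {errors[0]:.3f}, after: {errors[-1]:.3f}")
```

Each step only rescales the current factors, which keeps them non-negative without projection but can make progress slow. The alternative optimizers compared in the study replace this inner update.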


Subjects
Algorithms, Theoretical Models, Factual Databases
7.
BMC Med ; 16(1): 150, 2018 08 27.
Article in English | MEDLINE | ID: mdl-30145981

ABSTRACT

BACKGROUND: Personalized, precision, P4, or stratified medicine is understood as a medical approach in which patients are stratified based on their disease subtype, risk, prognosis, or treatment response using specialized diagnostic tests. The key idea is to base medical decisions on individual patient characteristics, including molecular and behavioral biomarkers, rather than on population averages. Personalized medicine is deeply connected to and dependent on data science, specifically machine learning (often named Artificial Intelligence in the mainstream media). While in recent years there has been a lot of enthusiasm about the potential of 'big data' and machine learning-based solutions, there exist only a few examples that impact current clinical practice. The lack of impact on clinical practice can largely be attributed to insufficient performance of predictive models, difficulties in interpreting complex model predictions, and lack of validation via prospective clinical trials that demonstrate a clear benefit compared to the standard of care. In this paper, we review the potential of state-of-the-art data science approaches for personalized medicine, discuss open challenges, and highlight directions that may help to overcome them in the future. CONCLUSIONS: There is a need for an interdisciplinary effort, including data scientists, physicians, patient advocates, regulatory agencies, and health insurance organizations. Partially unrealistic expectations and concerns about data science-based solutions need to be better managed. In parallel, computational methods must advance more to provide direct benefit to clinical practice.


Subjects
Precision Medicine/methods, Humans, Prospective Studies
8.
Genome Announc ; 6(2), 2018 Jan 11.
Article in English | MEDLINE | ID: mdl-29326223

ABSTRACT

Verticillium nonalfalfae, a soilborne vascular phytopathogenic fungus, causes wilt disease in several crop species. Of great concern are outbreaks of highly aggressive V. nonalfalfae strains, which cause a devastating wilt disease in European hops. We report here the genome sequence and annotation of V. nonalfalfae strain T2, providing genomic information that will allow better understanding of the molecular mechanisms underlying the development of highly aggressive strains.

9.
Nat Commun ; 8(1): 1541, 2017 11 16.
Article in English | MEDLINE | ID: mdl-29142246

ABSTRACT

The NUDIX enzymes are involved in cellular metabolism and homeostasis, as well as mRNA processing. Although highly conserved throughout all organisms, their biological roles and biochemical redundancies remain largely unclear. To address this, we globally resolve their individual properties and inter-relationships. We purify 18 of the human NUDIX proteins and screen 52 substrates, providing a substrate redundancy map. Using crystal structures, we generate sequence alignment analyses revealing four major structural classes. To a certain extent, their substrate preference redundancies correlate with structural classes, thus linking structure and activity relationships. To elucidate interdependence among the NUDIX hydrolases, we pairwise deplete them generating an epistatic interaction map, evaluate cell cycle perturbations upon knockdown in normal and cancer cells, and analyse their protein and mRNA expression in normal and cancer tissues. Using a novel FUSION algorithm, we integrate all data creating a comprehensive NUDIX enzyme profile map, which will prove fundamental to understanding their biological functionality.


Subjects
Gene Expression Profiling/methods, Gene Regulatory Networks, Multigene Family, Pyrophosphatases/genetics, A549 Cells, Cell Line, Tumor Cell Line, Enzymologic Gene Expression Regulation, Neoplastic Gene Expression Regulation, Humans, MCF-7 Cells, Phylogeny, Pyrophosphatases/classification, Pyrophosphatases/metabolism, RNA Interference, Substrate Specificity, Nudix Hydrolases
10.
BMC Bioinformatics ; 18(1): 291, 2017 Jun 02.
Article in English | MEDLINE | ID: mdl-28578698

ABSTRACT

BACKGROUND: Dictyostelium discoideum, a soil-dwelling social amoeba, is a model for the study of numerous biological processes. Research in the field has benefited mightily from the adoption of next-generation sequencing for genomics and transcriptomics. Dictyostelium biologists now face the widespread challenges of analyzing and exploring high-dimensional data sets to generate hypotheses and discover novel insights. RESULTS: We present dictyExpress (2.0), a web application designed for exploratory analysis of gene expression data, as well as data from related experiments such as Chromatin Immunoprecipitation sequencing (ChIP-Seq). The application features visualization modules that include time course expression profiles, clustering, gene ontology enrichment analysis, differential expression analysis and comparison of experiments. All visualizations are interactive and interconnected, such that the selection of genes in one module propagates instantly to visualizations in other modules. dictyExpress currently stores the data from over 800 Dictyostelium experiments and is embedded within a general-purpose software framework for management of next-generation sequencing data. dictyExpress allows users to explore their data in a broader context by reciprocal linking with dictyBase, a repository of Dictyostelium genomic data. In addition, we introduce a companion application called GenBoard, an intuitive graphic user interface for data management and bioinformatics analysis. CONCLUSIONS: dictyExpress and GenBoard enable broad adoption of next-generation sequencing-based inquiries by the Dictyostelium research community. Labs without the means to undertake deep sequencing projects can mine the data available to the public. The entire information flow, from raw sequence data to hypothesis testing, can be accomplished in an efficient workspace. The software framework is generalizable and represents a useful approach for any research community. To encourage wider use, the backend is open-source, available for extension and further development by bioinformaticians and data scientists.


Subjects
Dictyostelium/metabolism, User-Computer Interface, Chromatin Immunoprecipitation, Cluster Analysis, Dictyostelium/genetics, High-Throughput Nucleotide Sequencing, Internet, RNA Sequence Analysis, Transcriptome
11.
BioData Min ; 10: 41, 2017.
Article in English | MEDLINE | ID: mdl-29299064

ABSTRACT

BACKGROUND: Matrix factorization is a well-established pattern discovery tool that has seen numerous applications in biomedical data analytics, such as gene expression co-clustering, patient stratification, and gene-disease association mining. Matrix factorization learns a latent data model that takes a data matrix and transforms it into a latent feature space enabling generalization, noise removal and feature discovery. However, factorization algorithms are numerically intensive, and hence there is a pressing challenge to scale current algorithms to work with large datasets. Our focus in this paper is matrix tri-factorization, a popular method that is not limited by the assumption of standard matrix factorization about data residing in one latent space. Matrix tri-factorization solves this by inferring a separate latent space for each dimension in a data matrix, and a latent mapping of interactions between the inferred spaces, making the approach particularly suitable for biomedical data mining. RESULTS: We developed a block-wise approach for latent factor learning in matrix tri-factorization. The approach partitions a data matrix into disjoint submatrices that are treated independently and fed into a parallel factorization system. An appealing property of the proposed approach is its mathematical equivalence with serial matrix tri-factorization. In a study on large biomedical datasets we show that our approach scales well on multi-processor and multi-GPU architectures. On a four-GPU system we demonstrate that our approach can be more than 100 times faster than its single-processor counterpart. CONCLUSIONS: A general approach for scaling non-negative matrix tri-factorization is proposed. The approach is especially useful for parallel matrix factorization implemented in a multi-GPU environment. We expect the new approach will be useful in emerging procedures for latent factor analysis, notably for data integration, where many large data matrices need to be collectively factorized.

12.
Genome Res ; 26(9): 1268-76, 2016 09.
Article in English | MEDLINE | ID: mdl-27307293

ABSTRACT

Whole-genome sequencing is a useful approach for identification of chemical-induced lesions, but previous applications involved tedious genetic mapping to pinpoint the causative mutations. We propose that saturation mutagenesis under low mutagenic loads, followed by whole-genome sequencing, should allow direct implication of genes by identifying multiple independent alleles of each relevant gene. We tested the hypothesis by performing three genetic screens with chemical mutagenesis in the social soil amoeba Dictyostelium discoideum. Through genome sequencing, we successfully identified mutant genes with multiple alleles in near-saturation screens, including resistance to intense illumination and strong suppressors of defects in an allorecognition pathway. We tested the causality of the mutations by comparison to published data and by direct complementation tests, finding both dominant and recessive causative mutations. Therefore, our strategy provides a cost- and time-efficient approach to gene discovery by integrating chemical mutagenesis and whole-genome sequencing. The method should be applicable to many microbial systems, and it is expected to revolutionize the field of functional genomics in Dictyostelium by greatly expanding the mutation spectrum relative to other common mutagenesis methods.


Subjects
Dictyostelium/genetics, Mutagenesis/genetics, Whole Genome Sequencing/methods, Chromosome Mapping, Dictyostelium/drug effects, Genetic Association Studies, High-Throughput Nucleotide Sequencing, Mutagenesis/drug effects, Mutagens/toxicity
13.
Bioinformatics ; 32(12): i90-i100, 2016 06 15.
Article in English | MEDLINE | ID: mdl-27307649

ABSTRACT

MOTIVATION: The rapid growth of diverse biological data allows us to consider interactions between a variety of objects, such as genes, chemicals, molecular signatures, diseases, pathways and environmental exposures. Often, any pair of objects, such as a gene and a disease, can be related in different ways, for example, directly via gene-disease associations or indirectly via functional annotations, chemicals and pathways. Different ways of relating these objects carry different semantic meanings. However, traditional methods disregard these semantics and thus cannot fully exploit their value in data modeling. RESULTS: We present Medusa, an approach to detect size-k modules of objects that, taken together, appear most significant to another set of objects. Medusa operates on large-scale collections of heterogeneous datasets and explicitly distinguishes between diverse data semantics. It advances research along two dimensions: it builds on collective matrix factorization to derive different semantics, and it formulates the growing of the modules as a submodular optimization program. Medusa is flexible in choosing or combining semantic meanings and provides theoretical guarantees about detection quality. In a systematic study on 310 complex diseases, we show the effectiveness of Medusa in associating genes with diseases and detecting disease modules. We demonstrate that in predicting gene-disease associations Medusa compares favorably to methods that ignore diverse semantic meanings. We find that the utility of different semantics depends on disease categories and that, overall, Medusa recovers disease modules more accurately when combining different semantics. AVAILABILITY AND IMPLEMENTATION: Source code is at http://github.com/marinkaz/medusa CONTACT: marinka@cs.stanford.edu, blaz.zupan@fri.uni-lj.si.
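
Growing a size-k module under a submodular objective is typically done greedily, which is where guarantees of this kind usually come from. The sketch below is a generic greedy maximizer of a simple coverage function; it is not Medusa's objective or code, and the candidate names and covered items are hypothetical.

```python
def greedy_module(candidates, k):
    """Greedily pick up to k candidates that maximize the number of covered items.

    `candidates` maps a candidate name to the set of items it covers.
    Coverage is a standard submodular function, used here as a stand-in
    objective; the real scoring function in the paper is different.
    """
    chosen, covered = [], set()
    for _ in range(min(k, len(candidates))):
        best = max(
            (name for name in candidates if name not in chosen),
            key=lambda name: len(candidates[name] - covered),
        )
        gain = candidates[best] - covered
        if not gain:
            break  # no remaining candidate adds anything new
        chosen.append(best)
        covered |= gain
    return chosen, covered


# Hypothetical candidates and the items each one is associated with.
candidates = {
    "g1": {1, 2, 3},
    "g2": {3, 4},
    "g3": {4, 5, 6, 7},
    "g4": {1},
}
module, covered = greedy_module(candidates, k=2)
print(module, sorted(covered))
```

Because each added candidate is scored by its marginal gain, a candidate that overlaps heavily with the current module is passed over even if it looks strong on its own.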


Subjects
Computational Biology/methods, Data Compression, Disease/genetics, Semantics, Algorithms, Gene Ontology, Humans
14.
Bioinformatics ; 32(10): 1527-35, 2016 05 15.
Article in English | MEDLINE | ID: mdl-26787667

ABSTRACT

MOTIVATION: RNA binding proteins (RBPs) play important roles in post-transcriptional control of gene expression, including splicing, transport, polyadenylation and RNA stability. To model protein-RNA interactions by considering all available sources of information, it is necessary to integrate the rapidly growing RBP experimental data with the latest genome annotation, gene function, RNA sequence and structure. Such integration is possible by matrix factorization, where current approaches have an undesired tendency to identify only a small number of the strongest patterns with overlapping features. Because protein-RNA interactions are orchestrated by multiple factors, methods that identify discriminative patterns of varying strengths are needed. RESULTS: We have developed an integrative orthogonality-regularized nonnegative matrix factorization (iONMF) to integrate multiple data sources and discover non-overlapping, class-specific RNA binding patterns of varying strengths. The orthogonality constraint halves the effective size of the factor model and outperforms other NMF models in predicting RBP interaction sites on RNA. We have integrated the largest data compendium to date, which includes 31 CLIP experiments on 19 RBPs involved in splicing (such as hnRNPs, U2AF2, ELAVL1, TDP-43 and FUS) and processing of 3'UTR (Ago, IGF2BP). We show that the integration of multiple data sources improves the predictive accuracy of retrieval of RNA binding sites. In our study the key predictive factors of protein-RNA interactions were the position of RNA structure and sequence motifs, RBP co-binding and gene region type. We report on a number of protein-specific patterns, many of which are consistent with experimentally determined properties of RBPs. 
AVAILABILITY AND IMPLEMENTATION: The iONMF implementation and example datasets are available at https://github.com/mstrazar/ionmf CONTACT: tomaz.curk@fri.uni-lj.si SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Molecular Models, RNA-Binding Proteins, Binding Sites, Data Collection, Datasets as Topic, RNA
15.
Pac Symp Biocomput ; 21: 81-92, 2016.
Article in English | MEDLINE | ID: mdl-26776175

ABSTRACT

Interactions between drugs, drug targets or diseases can be predicted on the basis of molecular, clinical and genomic features by, for example, exploiting similarity of disease pathways, chemical structures, activities across cell lines or clinical manifestations of diseases. A successful way to better understand complex interactions in biomedical systems is to employ collective relational learning approaches that can jointly model diverse relationships present in multiplex data. We propose a novel collective pairwise classification approach for multi-way data analysis. Our model leverages the superiority of latent factor models and classifies relationships in a large relational data domain using a pairwise ranking loss. In contrast to current approaches, our method estimates probabilities, such that probabilities for existing relationships are higher than for assumed-to-be-negative relationships. Although our method bears correspondence with the maximization of non-differentiable area under the ROC curve, we were able to design a learning algorithm that scales well on multi-relational data encoding interactions between thousands of entities. We use the new method to infer relationships from multiplex drug data and to predict connections between clinical manifestations of diseases and their underlying molecular signatures. Our method achieves promising predictive performance when compared to state-of-the-art alternative approaches and can make "category-jumping" predictions about diseases from genomic and clinical data generated far outside the molecular context.
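
The link between a pairwise ranking loss and the area under the ROC curve can be made concrete with a short sketch. The logistic surrogate below is a common choice and is assumed here for illustration; the abstract does not specify the exact loss, and the function names and scores are hypothetical.

```python
import math


def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def pairwise_logistic_loss(pos_scores, neg_scores):
    """Smooth surrogate: penalize every pair where the positive does not
    clearly outscore the negative. Differentiable, unlike the AUC itself."""
    total = sum(
        math.log1p(math.exp(-(p - n))) for p in pos_scores for n in neg_scores
    )
    return total / (len(pos_scores) * len(neg_scores))


# Hypothetical model scores for known and assumed-negative relationships.
known = [2.0, 1.0]
assumed_negative = [0.5, 1.5]
print(auc(known, assumed_negative))
print(pairwise_logistic_loss(known, assumed_negative))
```

The AUC counts correctly ordered pairs and is a step function of the scores, whereas the surrogate decreases smoothly as the margin between known and assumed-negative relationships grows, which makes gradient-based learning possible.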


Subjects
Computational Biology/methods, Drug Interactions, Algorithms, Bayes Theorem, Classification/methods, Computational Biology/statistics & numerical data, Pharmaceutical Databases/statistics & numerical data, Drug Therapy/statistics & numerical data, Humans, Likelihood Functions, Machine Learning, Biological Models, Precision Medicine, Probability, ROC Curve, Systems Integration
16.
BMC Bioinformatics ; 16 Suppl 16: S1, 2015.
Article in English | MEDLINE | ID: mdl-26551454

ABSTRACT

BACKGROUND: Relation extraction is an essential procedure in literature mining. It focuses on extracting semantic relations between parts of text, called mentions. Biomedical literature includes an enormous amount of textual descriptions of biological entities, their interactions and results of related experiments. To extract them in an explicit, computer readable format, these relations were at first extracted manually from databases. Manual curation was later replaced with automatic or semi-automatic tools with natural language processing capabilities. The current challenge is the development of information extraction procedures that can directly infer more complex relational structures, such as gene regulatory networks. RESULTS: We develop a computational approach for extraction of gene regulatory networks from textual data. Our method is designed as a sieve-based system and uses linear-chain conditional random fields and rules for relation extraction. With this method we successfully extracted the sporulation gene regulation network in the bacterium Bacillus subtilis for the information extraction challenge at the BioNLP 2013 conference. To enable extraction of distant relations using first-order models, we transform the data into skip-mention sequences. We infer multiple models, each of which is able to extract different relationship types. Following the shared task, we conducted additional analysis using different system settings that resulted in reducing the reconstruction error of bacterial sporulation network from 0.73 to 0.68, measured as the slot error rate between the predicted and the reference network. We observe that all relation extraction sieves contribute to the predictive performance of the proposed approach. Also, features constructed by considering mention words and their prefixes and suffixes are the most important features for higher accuracy of extraction. 
Analysis of distances between different mention types in the text shows that our choice of transforming data into skip-mention sequences is appropriate for detecting relations between distant mentions. CONCLUSIONS: Linear-chain conditional random fields, along with appropriate data transformations, can be efficiently used to extract relations. The sieve-based architecture simplifies the system as new sieves can be easily added or removed and each sieve can utilize the results of previous ones. Furthermore, sieves with conditional random fields can be trained on arbitrary text data and hence are applicable to a broad range of relation extraction tasks and data domains.
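
Two ideas from this abstract lend themselves to a short sketch: the skip-mention transformation, which makes distant mentions adjacent so that a first-order model can relate them, and the slot error rate used to score the reconstructed network. Both functions below are assumptions based on the usual definitions, not the authors' code; in particular, treating a wrong relation type on a correct pair as a substitution is an interpretation, and the mention and gene names are placeholders.

```python
def skip_mention_sequences(mentions, skip):
    """Split a mention sequence so that mentions `skip` positions apart
    become neighbors. skip=0 returns the original sequence."""
    step = skip + 1
    sequences = [mentions[offset::step] for offset in range(step)]
    return [seq for seq in sequences if seq]


def slot_error_rate(reference, predicted):
    """SER = (substitutions + deletions + insertions) / reference slots.

    Both arguments map a (source, target) pair to a relation type.
    """
    substitutions = sum(
        1 for pair, rel in reference.items()
        if pair in predicted and predicted[pair] != rel
    )
    deletions = sum(1 for pair in reference if pair not in predicted)
    insertions = sum(1 for pair in predicted if pair not in reference)
    return (substitutions + deletions + insertions) / len(reference)


# Placeholder mentions in document order.
mentions = ["m1", "m2", "m3", "m4", "m5"]
print(skip_mention_sequences(mentions, skip=1))

# Placeholder networks: one wrong type, two missed edges, one spurious edge.
reference = {
    ("geneA", "geneB"): "activation",
    ("geneB", "geneC"): "inhibition",
    ("geneC", "geneD"): "activation",
    ("geneD", "geneE"): "binding",
}
predicted = {
    ("geneA", "geneB"): "activation",
    ("geneB", "geneC"): "activation",
    ("geneX", "geneY"): "binding",
}
print(slot_error_rate(reference, predicted))
```

Because insertions count as errors, the slot error rate can exceed 1, so lower values such as the reported 0.68 indicate a better reconstruction.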


Subjects
Gene Regulatory Networks, Information Storage and Retrieval, Publications, Algorithms, Theoretical Models
17.
PLoS Comput Biol ; 11(10): e1004552, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26465776

ABSTRACT

Data integration procedures combine heterogeneous data sets into predictive models, but they are limited to data explicitly related to the target object type, such as genes. Collage is a new data fusion approach to gene prioritization. It considers data sets of various association levels with the prediction task, utilizes collective matrix factorization to compress the data, and chaining to relate different object types contained in a data compendium. Collage prioritizes genes based on their similarity to several seed genes. We tested Collage by prioritizing bacterial response genes in Dictyostelium as a novel model system for prokaryote-eukaryote interactions. Using 4 seed genes and 14 data sets, only one of which was directly related to the bacterial response, Collage proposed 8 candidate genes that were readily validated as necessary for the response of Dictyostelium to Gram-negative bacteria. These findings establish Collage as a method for inferring biological knowledge from the integration of heterogeneous and coarsely related data sets.


Assuntos
Compressão de Dados/métodos , Bases de Dados Genéticas , Dictyostelium/metabolismo , Dictyostelium/microbiologia , Bactérias Gram-Negativas/fisiologia , Proteínas de Protozoários/metabolismo , Proliferação de Células/fisiologia , Mineração de Dados/métodos , Proteínas de Protozoários/genética
18.
IEEE Trans Pattern Anal Mach Intell ; 37(1): 41-53, 2015 Jan.
Article in English | MEDLINE | ID: mdl-26353207

ABSTRACT

For most problems in science and engineering we can obtain data sets that describe the observed system from various perspectives and record the behavior of its individual components. Heterogeneous data sets can be collectively mined by data fusion. Fusion can focus on a specific target relation and exploit directly associated data together with contextual data and data about the system's constraints. In the paper we describe a data fusion approach with penalized matrix tri-factorization (DFMF) that simultaneously factorizes data matrices to reveal hidden associations. The approach can directly consider any data that can be expressed in a matrix, including those from feature-based representations, ontologies, associations and networks. We demonstrate the utility of DFMF for a gene function prediction task with eleven different data sources and for prediction of pharmacologic actions by fusing six data sources. Our data fusion algorithm compares favorably to alternative data integration approaches and achieves higher accuracy than can be obtained from any single data source alone.
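To make the idea of tri-factorization concrete, the sketch below approximates a single relation matrix R by a product G1 S G2^T using alternating least squares. This is a deliberately simplified illustration, not the DFMF algorithm: it handles one matrix rather than a collection, and it omits the penalty terms and any constraints on the factors. The function name, the ranks k1 and k2, and the toy data are assumptions made for the example.

```python
import numpy as np


def tri_factorize(R, k1, k2, n_iter=50, seed=0):
    """Approximate R (n x m) as G1 @ S @ G2.T by alternating least squares.

    Simplified sketch: one relation matrix, no penalties, no constraints.
    G1 (n x k1) and G2 (m x k2) are latent factors for the two object
    types; S (k1 x k2) links the two latent spaces.
    """
    rng = np.random.default_rng(seed)
    n, m = R.shape
    G1 = rng.random((n, k1))
    G2 = rng.random((m, k2))
    errors = []
    for _ in range(n_iter):
        # Each update is the least-squares solution with the other two fixed.
        S = np.linalg.pinv(G1) @ R @ np.linalg.pinv(G2.T)
        G1 = R @ np.linalg.pinv(S @ G2.T)
        G2 = (np.linalg.pinv(G1 @ S) @ R).T
        errors.append(float(np.linalg.norm(R - G1 @ S @ G2.T)))
    return G1, S, G2, errors


if __name__ == "__main__":
    # Toy relation matrix (hypothetical data): 6 objects of one type
    # related to 5 objects of another type.
    rng = np.random.default_rng(1)
    R = rng.random((6, 2)) @ rng.random((2, 5))
    G1, S, G2, errors = tri_factorize(R, k1=2, k2=2)
    print("first and last reconstruction error:", errors[0], errors[-1])
```

In a fusion setting, several such relation matrices would share factors for the object types they have in common, which is what lets information flow between data sources.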


Subjects
Algorithms; Informatics/methods; Models, Theoretical
19.
Bioinformatics ; 31(12): i230-9, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-26072487

ABSTRACT

MOTIVATION: Markov networks are undirected graphical models that are widely used to infer relations between genes from experimental data. Their state-of-the-art inference procedures assume the data arise from a Gaussian distribution. High-throughput omics data, such as those from next-generation sequencing, often violate this assumption. Furthermore, when collected data arise from multiple related but otherwise nonidentical distributions, their underlying networks are likely to have common features. New principled statistical approaches are needed that can deal with different data distributions and jointly consider collections of datasets. RESULTS: We present FuseNet, a Markov network formulation that infers networks from a collection of nonidentically distributed datasets. Our approach is computationally efficient and general: given any number of distributions from an exponential family, FuseNet represents model parameters through shared latent factors that define neighborhoods of network nodes. In a simulation study, we demonstrate good predictive performance of FuseNet in comparison to several popular graphical models. We show its effectiveness in an application to breast cancer RNA-sequencing and somatic mutation data, a novel application of graphical models. Fusion of datasets offers substantial gains relative to inference of separate networks for each dataset. Our results demonstrate that network inference methods for non-Gaussian data can help in accurate modeling of the data generated by emergent high-throughput technologies. AVAILABILITY AND IMPLEMENTATION: Source code is at https://github.com/marinkaz/fusenet.


Subjects
Gene Expression Profiling/methods; Gene Regulatory Networks; Algorithms; Breast Neoplasms/genetics; Female; High-Throughput Nucleotide Sequencing; Humans; Markov Chains; Poisson Distribution; Sequence Analysis, RNA
20.
BMC Genomics ; 16: 294, 2015 Apr 13.
Article in English | MEDLINE | ID: mdl-25887420

ABSTRACT

BACKGROUND: Development of the soil amoeba Dictyostelium discoideum is triggered by starvation. When placed on a solid substrate, the starving solitary amoebae cease growth, communicate via extracellular cAMP, aggregate by tens of thousands and develop into multicellular organisms. Early phases of the developmental program are often studied in cells starved in suspension while cAMP is provided exogenously. Previous studies revealed massive shifts in the transcriptome under both developmental conditions and a close relationship between gene expression and morphogenesis, but were limited by the sampling frequency and the resolution of the methods. RESULTS: Here, we combine the superior depth and specificity of RNA-seq-based analysis of mRNA abundance with high-frequency sampling during filter development and cAMP pulsing in suspension. We found that the developmental transcriptome exhibits mostly gradual changes interspersed by a few instances of large shifts. For each time point we treated the entire transcriptome as a single phenotype, and were able to characterize development as groups of similar time points separated by gaps. The grouped time points represented gradual changes in mRNA abundance, or molecular phenotype, and the gaps represented times during which many genes are differentially expressed rapidly, and thus the phenotype changes dramatically. Comparing developmental experiments revealed that gene expression in filter-developed cells lagged behind that in cells treated with exogenous cAMP in suspension. The high sampling frequency revealed many genes whose regulation is reproducibly more complex than indicated by previous studies. Gene Ontology enrichment analysis suggested that the transition to multicellularity coincided with rapid accumulation of transcripts associated with DNA processes and mitosis. Later development included the up-regulation of organic signaling molecules and co-factor biosynthesis. Our analysis also demonstrated a high level of synchrony among the developing structures throughout development. CONCLUSIONS: Our data describe D. discoideum development as a series of coordinated cellular and multicellular activities. Coordination occurred within fields of aggregating cells and among multicellular bodies, such as mounds or migratory slugs that experience both cell-cell contact and various soluble signaling regimes. These time courses, sampled at the highest temporal resolution to date in this system, provide a comprehensive resource for studies of developmental gene expression.


Subjects
Dictyostelium/growth & development; Dictyostelium/genetics; RNA, Messenger/metabolism; Transcriptome; Cyclic AMP/metabolism; Dictyostelium/metabolism; Morphogenesis
...