1.
Bioinformatics ; 40(Supplement_1): i20-i29, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940150

ABSTRACT

MOTIVATION: We learn more effectively through experience and reflection than through passive reception of information. Bioinformatics offers an excellent opportunity for project-based learning. Molecular data are abundant and accessible in open repositories, and important concepts in biology can be rediscovered by reanalyzing the data. RESULTS: In the manuscript, we report on five hands-on assignments we designed for master's computer science students to train them in bioinformatics for genomics. These assignments are the cornerstones of our introductory bioinformatics course and are centered around the study of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). They assume no prior knowledge of molecular biology but do require programming skills. Through these assignments, students learn about genomes and genes, discover their composition and function, relate SARS-CoV-2 to other viruses, and learn about the body's response to infection. Student evaluation of the assignments confirms their usefulness and value, their appropriate mastery-level difficulty, and their interesting and motivating storyline. AVAILABILITY AND IMPLEMENTATION: The course materials are freely available on GitHub at https://github.com/IB-ULFRI.


Subjects
COVID-19, Computational Biology, SARS-CoV-2, Computational Biology/methods, SARS-CoV-2/genetics, Humans, COVID-19/virology, Genomics/methods, Students
2.
Genome Res ; 31(8): 1498-1511, 2021 08.
Article in English | MEDLINE | ID: mdl-34183452

ABSTRACT

Dictyostelium development begins with single-cell starvation and ends with multicellular fruiting bodies. Developmental morphogenesis is accompanied by sweeping transcriptional changes, encompassing nearly half of the 13,000 genes in the genome. We performed time-series RNA-sequencing analyses of the wild type and 20 mutants to explore the relationships between transcription and morphogenesis. These strains show developmental arrest at different stages, accelerated development, or atypical morphologies. Considering eight major morphological transitions, we identified 1371 milestone genes whose expression changes sharply between consecutive transitions. We also identified 1099 genes as members of 21 regulons, which are groups of genes that remain coordinately regulated despite the genetic, temporal, and developmental perturbations. The gene annotations in these groups validate known transitions and reveal new developmental events. For example, DNA replication genes are tightly coregulated with cell division genes, so they are expressed in mid-development although chromosomal DNA is not replicated. Our data set includes 486 transcriptional profiles that can help identify new relationships between transcription and development and improve gene annotations. We show its utility by showing that cycles of aggregation and disaggregation in allorecognition-defective mutants involve dedifferentiation. We also show sensitivity to genetic and developmental conditions in two commonly used actin genes, act6 and act15, and robustness of the coaA gene. Finally, we propose that gpdA is a better mRNA quantitation standard because it is less sensitive to external conditions than commonly used standards. The data set is available for democratized exploration through the web application dictyExpress and the data mining environment Orange.


Subjects
Dictyostelium, Dictyostelium/genetics, Morphogenesis, Messenger RNA/metabolism, Regulon, Software
3.
PLoS Comput Biol ; 17(3): e1008671, 2021 03.
Article in English | MEDLINE | ID: mdl-33661899

ABSTRACT

Overfitting is one of the critical problems in developing models by machine learning. With machine learning becoming an essential technology in computational biology, we must include training about overfitting in all courses that introduce this technology to students and practitioners. We here propose hands-on training for overfitting that is suitable for introductory level courses and can be carried out on its own or embedded within any data science course. We use workflow-based design of machine learning pipelines, experimentation-based teaching, and a hands-on approach that focuses on concepts rather than underlying mathematics. We here detail the data analysis workflows we use in training and motivate them from the viewpoint of teaching goals. Our proposed approach relies on Orange, an open-source data science toolbox that combines data visualization and machine learning, and that is tailored for education in machine learning and explorative data analysis.


Subjects
Computational Biology, Data Science, Machine Learning, Statistical Models, Computational Biology/education, Computational Biology/methods, Data Science/education, Data Science/methods, Humans, Biological Models, Software
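
The overfitting concept taught in entry 3 can be illustrated with a minimal sketch in plain Python. This is not the Orange workflow the authors describe; it assumes a toy data set whose labels are random, so there is nothing genuine to learn. The function names (`make_random_data`, `nearest_label`, `accuracy`) and the sample sizes are hypothetical choices made for this example. A 1-nearest-neighbour model memorizes its training set, so it scores perfectly on that set and close to chance on fresh data.

```python
import random


def make_random_data(rng, n):
    """Return n (feature, label) pairs; the label is independent of the feature."""
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]


def nearest_label(train, x):
    """Predict with 1-nearest neighbour: copy the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]


def accuracy(train, data):
    """Fraction of points in `data` whose label the model built on `train` predicts."""
    hits = sum(nearest_label(train, x) == y for x, y in data)
    return hits / len(data)


rng = random.Random(0)
train = make_random_data(rng, 200)
test = make_random_data(rng, 200)

train_acc = accuracy(train, train)  # scored on the data the model memorized
test_acc = accuracy(train, test)    # scored on data the model has never seen
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

The gap between the two printed numbers is the symptom of overfitting: reporting only the training score would suggest a perfect model where none exists, which is why evaluation on held-out data is needed.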
4.
Bioinformatics ; 35(14): i4-i12, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510695

ABSTRACT

MOTIVATION: Single-cell RNA sequencing allows us to simultaneously profile the transcriptomes of thousands of cells and to indulge in exploring cell diversity, development and discovery of new molecular mechanisms. Analysis of scRNA data involves a combination of non-trivial steps from statistics, data visualization, bioinformatics and machine learning. Training molecular biologists in single-cell data analysis and empowering them to review and analyze their data can be challenging, both because of the complexity of the methods and the steep learning curve. RESULTS: We propose a workshop-style training in single-cell data analytics that relies on an explorative data analysis toolbox and a hands-on teaching style. The training relies on scOrange, a newly developed extension of a data mining framework that features workflow design through visual programming and interactive visualizations. Workshops with scOrange can proceed much faster than similar training methods that rely on computer programming and analysis through scripting in R or Python, allowing the trainer to cover more ground in the same time-frame. We here review the design principles of the scOrange toolbox that support such workshops and propose a syllabus for the course. We also provide examples of data analysis workflows that instructors can use during the training. AVAILABILITY AND IMPLEMENTATION: scOrange is an open-source software. The software, documentation and an emerging set of educational videos are available at http://singlecell.biolab.si.


Subjects
Computational Biology, Data Science, Software, RNA Sequence Analysis, Workflow
5.
Genome Res ; 26(9): 1268-76, 2016 09.
Article in English | MEDLINE | ID: mdl-27307293

ABSTRACT

Whole-genome sequencing is a useful approach for identification of chemical-induced lesions, but previous applications involved tedious genetic mapping to pinpoint the causative mutations. We propose that saturation mutagenesis under low mutagenic loads, followed by whole-genome sequencing, should allow direct implication of genes by identifying multiple independent alleles of each relevant gene. We tested the hypothesis by performing three genetic screens with chemical mutagenesis in the social soil amoeba Dictyostelium discoideum. Through genome sequencing, we successfully identified mutant genes with multiple alleles in near-saturation screens, including resistance to intense illumination and strong suppressors of defects in an allorecognition pathway. We tested the causality of the mutations by comparison to published data and by direct complementation tests, finding both dominant and recessive causative mutations. Therefore, our strategy provides a cost- and time-efficient approach to gene discovery by integrating chemical mutagenesis and whole-genome sequencing. The method should be applicable to many microbial systems, and it is expected to revolutionize the field of functional genomics in Dictyostelium by greatly expanding the mutation spectrum relative to other common mutagenesis methods.


Subjects
Dictyostelium/genetics, Mutagenesis/genetics, Whole Genome Sequencing/methods, Chromosome Mapping, Dictyostelium/drug effects, Genetic Association Studies, High-Throughput Nucleotide Sequencing, Mutagenesis/drug effects, Mutagens/toxicity
6.
BMC Med ; 16(1): 150, 2018 08 27.
Article in English | MEDLINE | ID: mdl-30145981

ABSTRACT

BACKGROUND: Personalized, precision, P4, or stratified medicine is understood as a medical approach in which patients are stratified based on their disease subtype, risk, prognosis, or treatment response using specialized diagnostic tests. The key idea is to base medical decisions on individual patient characteristics, including molecular and behavioral biomarkers, rather than on population averages. Personalized medicine is deeply connected to and dependent on data science, specifically machine learning (often named Artificial Intelligence in the mainstream media). While during recent years there has been a lot of enthusiasm about the potential of 'big data' and machine learning-based solutions, only a few examples impact current clinical practice. The lack of impact on clinical practice can largely be attributed to insufficient performance of predictive models, difficulties in interpreting complex model predictions, and lack of validation via prospective clinical trials that demonstrate a clear benefit compared to the standard of care. In this paper, we review the potential of state-of-the-art data science approaches for personalized medicine, discuss open challenges, and highlight directions that may help to overcome them in the future. CONCLUSIONS: There is a need for an interdisciplinary effort, including data scientists, physicians, patient advocates, regulatory agencies, and health insurance organizations. Partially unrealistic expectations and concerns about data science-based solutions need to be better managed. In parallel, computational methods must advance further to provide direct benefit to clinical practice.


Subjects
Precision Medicine/methods, Humans, Prospective Studies
7.
BMC Bioinformatics ; 18(1): 291, 2017 Jun 02.
Article in English | MEDLINE | ID: mdl-28578698

ABSTRACT

BACKGROUND: Dictyostelium discoideum, a soil-dwelling social amoeba, is a model for the study of numerous biological processes. Research in the field has benefited mightily from the adoption of next-generation sequencing for genomics and transcriptomics. Dictyostelium biologists now face the widespread challenges of analyzing and exploring high dimensional data sets to generate hypotheses and discover novel insights. RESULTS: We present dictyExpress (2.0), a web application designed for exploratory analysis of gene expression data, as well as data from related experiments such as Chromatin Immunoprecipitation sequencing (ChIP-Seq). The application features visualization modules that include time course expression profiles, clustering, gene ontology enrichment analysis, differential expression analysis and comparison of experiments. All visualizations are interactive and interconnected, such that the selection of genes in one module propagates instantly to visualizations in other modules. dictyExpress currently stores the data from over 800 Dictyostelium experiments and is embedded within a general-purpose software framework for management of next-generation sequencing data. dictyExpress allows users to explore their data in a broader context by reciprocal linking with dictyBase, a repository of Dictyostelium genomic data. In addition, we introduce a companion application called GenBoard, an intuitive graphic user interface for data management and bioinformatics analysis. CONCLUSIONS: dictyExpress and GenBoard enable broad adoption of next generation sequencing based inquiries by the Dictyostelium research community. Labs without the means to undertake deep sequencing projects can mine the data available to the public. The entire information flow, from raw sequence data to hypothesis testing, can be accomplished in an efficient workspace. The software framework is generalizable and represents a useful approach for any research community. To encourage wider usage, the backend is open-source, available for extension and further development by bioinformaticians and data scientists.


Subjects
Dictyostelium/metabolism, User-Computer Interface, Chromatin Immunoprecipitation, Cluster Analysis, Dictyostelium/genetics, High-Throughput Nucleotide Sequencing, Internet, RNA Sequence Analysis, Transcriptome
8.
Bioinformatics ; 32(12): i90-i100, 2016 06 15.
Article in English | MEDLINE | ID: mdl-27307649

ABSTRACT

MOTIVATION: The rapid growth of diverse biological data allows us to consider interactions between a variety of objects, such as genes, chemicals, molecular signatures, diseases, pathways and environmental exposures. Often, any pair of objects, such as a gene and a disease, can be related in different ways, for example, directly via gene-disease associations or indirectly via functional annotations, chemicals and pathways. Different ways of relating these objects carry different semantic meanings. However, traditional methods disregard these semantics and thus cannot fully exploit their value in data modeling. RESULTS: We present Medusa, an approach to detect size-k modules of objects that, taken together, appear most significant to another set of objects. Medusa operates on large-scale collections of heterogeneous datasets and explicitly distinguishes between diverse data semantics. It advances research along two dimensions: it builds on collective matrix factorization to derive different semantics, and it formulates the growing of the modules as a submodular optimization program. Medusa is flexible in choosing or combining semantic meanings and provides theoretical guarantees about detection quality. In a systematic study on 310 complex diseases, we show the effectiveness of Medusa in associating genes with diseases and detecting disease modules. We demonstrate that in predicting gene-disease associations Medusa compares favorably to methods that ignore diverse semantic meanings. We find that the utility of different semantics depends on disease categories and that, overall, Medusa recovers disease modules more accurately when combining different semantics. AVAILABILITY AND IMPLEMENTATION: Source code is at http://github.com/marinkaz/medusa. CONTACT: marinka@cs.stanford.edu, blaz.zupan@fri.uni-lj.si.


Subjects
Computational Biology/methods, Data Compression, Disease/genetics, Semantics, Algorithms, Gene Ontology, Humans
9.
Bioinformatics ; 32(10): 1527-35, 2016 05 15.
Article in English | MEDLINE | ID: mdl-26787667

ABSTRACT

MOTIVATION: RNA binding proteins (RBPs) play important roles in post-transcriptional control of gene expression, including splicing, transport, polyadenylation and RNA stability. To model protein-RNA interactions by considering all available sources of information, it is necessary to integrate the rapidly growing RBP experimental data with the latest genome annotation, gene function, RNA sequence and structure. Such integration is possible by matrix factorization, where current approaches have an undesired tendency to identify only a small number of the strongest patterns with overlapping features. Because protein-RNA interactions are orchestrated by multiple factors, methods that identify discriminative patterns of varying strengths are needed. RESULTS: We have developed an integrative orthogonality-regularized nonnegative matrix factorization (iONMF) to integrate multiple data sources and discover non-overlapping, class-specific RNA binding patterns of varying strengths. The orthogonality constraint halves the effective size of the factor model and outperforms other NMF models in predicting RBP interaction sites on RNA. We have integrated the largest data compendium to date, which includes 31 CLIP experiments on 19 RBPs involved in splicing (such as hnRNPs, U2AF2, ELAVL1, TDP-43 and FUS) and processing of 3'UTR (Ago, IGF2BP). We show that the integration of multiple data sources improves the predictive accuracy of retrieval of RNA binding sites. In our study the key predictive factors of protein-RNA interactions were the position of RNA structure and sequence motifs, RBP co-binding and gene region type. We report on a number of protein-specific patterns, many of which are consistent with experimentally determined properties of RBPs. 
AVAILABILITY AND IMPLEMENTATION: The iONMF implementation and example datasets are available at https://github.com/mstrazar/ionmf. CONTACT: tomaz.curk@fri.uni-lj.si. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Molecular Models, RNA-Binding Proteins, Binding Sites, Data Collection, Datasets as Topic, RNA
10.
Bioinformatics ; 31(12): i230-9, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-26072487

ABSTRACT

MOTIVATION: Markov networks are undirected graphical models that are widely used to infer relations between genes from experimental data. Their state-of-the-art inference procedures assume the data arise from a Gaussian distribution. High-throughput omics data, such as that from next generation sequencing, often violates this assumption. Furthermore, when collected data arise from multiple related but otherwise nonidentical distributions, their underlying networks are likely to have common features. New principled statistical approaches are needed that can deal with different data distributions and jointly consider collections of datasets. RESULTS: We present FuseNet, a Markov network formulation that infers networks from a collection of nonidentically distributed datasets. Our approach is computationally efficient and general: given any number of distributions from an exponential family, FuseNet represents model parameters through shared latent factors that define neighborhoods of network nodes. In a simulation study, we demonstrate good predictive performance of FuseNet in comparison to several popular graphical models. We show its effectiveness in an application to breast cancer RNA-sequencing and somatic mutation data, a novel application of graphical models. Fusion of datasets offers substantial gains relative to inference of separate networks for each dataset. Our results demonstrate that network inference methods for non-Gaussian data can help in accurate modeling of the data generated by emergent high-throughput technologies. AVAILABILITY AND IMPLEMENTATION: Source code is at https://github.com/marinkaz/fusenet.


Subjects
Gene Expression Profiling/methods, Gene Regulatory Networks, Algorithms, Breast Neoplasms/genetics, Female, High-Throughput Nucleotide Sequencing, Humans, Markov Chains, Poisson Distribution, RNA Sequence Analysis
11.
PLoS Comput Biol ; 11(10): e1004552, 2015 Oct.
Article in English | MEDLINE | ID: mdl-26465776

ABSTRACT

Data integration procedures combine heterogeneous data sets into predictive models, but they are limited to data explicitly related to the target object type, such as genes. Collage is a new data fusion approach to gene prioritization. It considers data sets of various association levels with the prediction task, utilizes collective matrix factorization to compress the data, and chaining to relate different object types contained in a data compendium. Collage prioritizes genes based on their similarity to several seed genes. We tested Collage by prioritizing bacterial response genes in Dictyostelium as a novel model system for prokaryote-eukaryote interactions. Using 4 seed genes and 14 data sets, only one of which was directly related to the bacterial response, Collage proposed 8 candidate genes that were readily validated as necessary for the response of Dictyostelium to Gram-negative bacteria. These findings establish Collage as a method for inferring biological knowledge from the integration of heterogeneous and coarsely related data sets.


Subjects
Data Compression/methods, Genetic Databases, Dictyostelium/metabolism, Dictyostelium/microbiology, Gram-Negative Bacteria/physiology, Protozoan Proteins/metabolism, Cell Proliferation/physiology, Data Mining/methods, Protozoan Proteins/genetics
12.
BMC Bioinformatics ; 16 Suppl 16: S1, 2015.
Article in English | MEDLINE | ID: mdl-26551454

ABSTRACT

BACKGROUND: Relation extraction is an essential procedure in literature mining. It focuses on extracting semantic relations between parts of text, called mentions. Biomedical literature includes an enormous amount of textual descriptions of biological entities, their interactions and results of related experiments. To extract them in an explicit, computer readable format, these relations were at first extracted manually from databases. Manual curation was later replaced with automatic or semi-automatic tools with natural language processing capabilities. The current challenge is the development of information extraction procedures that can directly infer more complex relational structures, such as gene regulatory networks. RESULTS: We develop a computational approach for extraction of gene regulatory networks from textual data. Our method is designed as a sieve-based system and uses linear-chain conditional random fields and rules for relation extraction. With this method we successfully extracted the sporulation gene regulation network in the bacterium Bacillus subtilis for the information extraction challenge at the BioNLP 2013 conference. To enable extraction of distant relations using first-order models, we transform the data into skip-mention sequences. We infer multiple models, each of which is able to extract different relationship types. Following the shared task, we conducted additional analysis using different system settings that resulted in reducing the reconstruction error of bacterial sporulation network from 0.73 to 0.68, measured as the slot error rate between the predicted and the reference network. We observe that all relation extraction sieves contribute to the predictive performance of the proposed approach. Also, features constructed by considering mention words and their prefixes and suffixes are the most important features for higher accuracy of extraction. 
Analysis of distances between different mention types in the text shows that our choice of transforming data into skip-mention sequences is appropriate for detecting relations between distant mentions. CONCLUSIONS: Linear-chain conditional random fields, along with appropriate data transformations, can be efficiently used to extract relations. The sieve-based architecture simplifies the system, as new sieves can be easily added or removed and each sieve can utilize the results of previous ones. Furthermore, sieves with conditional random fields can be trained on arbitrary text data and hence are applicable to a broad range of relation extraction tasks and data domains.


Subjects
Gene Regulatory Networks, Information Storage and Retrieval, Publications, Algorithms, Theoretical Models
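
Entry 12 reports reconstruction error as the slot error rate between a predicted and a reference network. A minimal sketch of that metric follows, assuming the common definition SER = (substitutions + deletions + insertions) / number of reference relations. Here a substitution is taken to be an edge with the right endpoints but the wrong relation type; the exact matching rules of the shared task may differ. The function name `slot_error_rate` and the toy networks with nodes A to F are hypothetical and carry no biological meaning.

```python
def slot_error_rate(reference, predicted):
    """Slot error rate between two networks.

    Both arguments map a (source, target) pair to a relation type.
    Assumed definition: (substitutions + deletions + insertions) / len(reference).
    """
    ref_pairs, pred_pairs = set(reference), set(predicted)
    shared = ref_pairs & pred_pairs
    # Same endpoints, different relation type.
    substitutions = sum(reference[pair] != predicted[pair] for pair in shared)
    # Reference edges the prediction missed.
    deletions = len(ref_pairs - pred_pairs)
    # Predicted edges absent from the reference.
    insertions = len(pred_pairs - ref_pairs)
    return (substitutions + deletions + insertions) / len(reference)


# Toy networks (hypothetical nodes and relation types).
reference = {
    ("A", "B"): "activation",
    ("B", "C"): "inhibition",
    ("C", "D"): "activation",
    ("D", "E"): "activation",
}
predicted = {
    ("A", "B"): "activation",  # correct
    ("B", "C"): "activation",  # substitution: wrong relation type
    ("C", "D"): "activation",  # correct
    ("E", "F"): "activation",  # insertion; ("D", "E") is a deletion
}

ser = slot_error_rate(reference, predicted)
print(ser)
```

Because insertions are counted, the rate can exceed 1.0 when a system predicts many spurious relations, so lower values are better and 0 means a perfect reconstruction.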
13.
BMC Genomics ; 16: 294, 2015 Apr 13.
Article in English | MEDLINE | ID: mdl-25887420

ABSTRACT

BACKGROUND: Development of the soil amoeba Dictyostelium discoideum is triggered by starvation. When placed on a solid substrate, the starving solitary amoebae cease growth, communicate via extracellular cAMP, aggregate by tens of thousands and develop into multicellular organisms. Early phases of the developmental program are often studied in cells starved in suspension while cAMP is provided exogenously. Previous studies revealed massive shifts in the transcriptome under both developmental conditions and a close relationship between gene expression and morphogenesis, but were limited by the sampling frequency and the resolution of the methods. RESULTS: Here, we combine the superior depth and specificity of RNA-seq-based analysis of mRNA abundance with high frequency sampling during filter development and cAMP pulsing in suspension. We found that the developmental transcriptome exhibits mostly gradual changes interspersed by a few instances of large shifts. For each time point we treated the entire transcriptome as a single phenotype, and were able to characterize development as groups of similar time points separated by gaps. The grouped time points represented gradual changes in mRNA abundance, or molecular phenotype, and the gaps represented times during which many genes are differentially expressed rapidly, and thus the phenotype changes dramatically. Comparing developmental experiments revealed that gene expression in filter-developed cells lagged behind that in cells treated with exogenous cAMP in suspension. The high sampling frequency revealed many genes whose regulation is reproducibly more complex than indicated by previous studies. Gene Ontology enrichment analysis suggested that the transition to multicellularity coincided with rapid accumulation of transcripts associated with DNA processes and mitosis. Later development included the up-regulation of organic signaling molecules and co-factor biosynthesis. 
Our analysis also demonstrated a high level of synchrony among the developing structures throughout development. CONCLUSIONS: Our data describe D. discoideum development as a series of coordinated cellular and multicellular activities. Coordination occurred within fields of aggregating cells and among multicellular bodies, such as mounds or migratory slugs that experience both cell-cell contact and various soluble signaling regimes. These time courses, sampled at the highest temporal resolution to date in this system, provide a comprehensive resource for studies of developmental gene expression.


Subjects
Dictyostelium/growth & development, Dictyostelium/genetics, Messenger RNA/metabolism, Transcriptome, Cyclic AMP/metabolism, Dictyostelium/metabolism, Morphogenesis
14.
Bioinformatics ; 30(12): i246-i254, 2014 Jun 15.
Article in English | MEDLINE | ID: mdl-24931990

ABSTRACT

MOTIVATION: Epistasis analysis is an essential tool of classical genetics for inferring the order of function of genes in a common pathway. Typically, it considers single and double mutant phenotypes and for a pair of genes observes whether a change in the first gene masks the effects of the mutation in the second gene. Despite the recent emergence of biotechnology techniques that can provide gene interaction data on a large, possibly genomic scale, few methods are available for quantitative epistasis analysis and epistasis-based network reconstruction. RESULTS: We here propose a conceptually new probabilistic approach to gene network inference from quantitative interaction data. The approach is founded on epistasis analysis. Its features are joint treatment of the mutant phenotype data with a factorized model and probabilistic scoring of pairwise gene relationships that are inferred from the latent gene representation. The resulting gene network is assembled from scored pairwise relationships. In an experimental study, we show that the proposed approach can accurately reconstruct several known pathways and that it surpasses the accuracy of current approaches. AVAILABILITY AND IMPLEMENTATION: Source code is available at http://github.com/biolab/red.


Subjects
Genetic Epistasis, Gene Regulatory Networks, Statistical Models, Algorithms, Endoplasmic Reticulum-Associated Degradation/genetics, Glycosylation, Mutation, Phenotype, Phosphatidylserines/metabolism
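
The classical rule that entry 14 builds on (a change in one gene masks the effect of a mutation in the other) can be sketched as a simple comparison of single and double mutant phenotypes. This is only the textbook rule, not the probabilistic factorized model the authors propose. The function name `epistatic_gene`, the tolerance `tol`, and the phenotype scores are hypothetical values chosen for illustration.

```python
def epistatic_gene(single_a, single_b, double_ab, tol=0.1):
    """Apply the classical epistasis rule to quantitative phenotype scores.

    If the double mutant looks like the single mutant of one gene (within
    `tol`) and unlike the other, that gene masks its partner and is
    reported as epistatic. Returns "a", "b", or None when the rule is
    inconclusive.
    """
    matches_a = abs(double_ab - single_a) <= tol
    matches_b = abs(double_ab - single_b) <= tol
    if matches_a and not matches_b:
        return "a"
    if matches_b and not matches_a:
        return "b"
    return None  # both or neither match: no clear masking relationship


# The double mutant resembles mutant a, so gene a masks gene b.
masking = epistatic_gene(single_a=0.2, single_b=0.9, double_ab=0.25)

# The double mutant resembles neither single mutant.
unclear = epistatic_gene(single_a=0.2, single_b=0.9, double_ab=0.55)

print(masking, unclear)
```

A hard threshold like `tol` is brittle with noisy measurements, which is one reason a probabilistic scoring of pairwise relationships, as described in the abstract, is attractive for genome-scale data.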
15.
BMC Bioinformatics ; 15: 216, 2014 Jun 25.
Article in English | MEDLINE | ID: mdl-24964802

ABSTRACT

BACKGROUND: The extent of data in a typical genome-wide association study (GWAS) poses considerable computational challenges to software tools for gene-gene interaction discovery. Exhaustive evaluation of all interactions among hundreds of thousands to millions of single nucleotide polymorphisms (SNPs) may require weeks or even months of computation. Massively parallel hardware within a modern Graphics Processing Unit (GPU) and Many Integrated Core (MIC) coprocessors can shorten the run time considerably. While the utility of GPU-based implementations in bioinformatics has been well studied, MIC architecture has been introduced only recently and may provide a number of comparative advantages that have yet to be explored and tested. RESULTS: We have developed a heterogeneous, GPU and Intel MIC-accelerated software module for SNP-SNP interaction discovery to replace the previously single-threaded computational core in the interactive web-based data exploration program SNPsyn. We report on differences between these two modern massively parallel architectures and their software environments. Their utility resulted in an order of magnitude shorter execution times when compared to the single-threaded CPU implementation. GPU implementation on a single Nvidia Tesla K20 runs twice as fast as that for the MIC architecture-based Xeon Phi P5110 coprocessor, but also requires considerably more programming effort. CONCLUSIONS: General purpose GPUs are a mature platform with large amounts of computing power capable of tackling inherently parallel problems, but can prove demanding for the programmer. On the other hand, the new MIC architecture, albeit lacking in performance, reduces the programming effort and compensates with a more general architecture suitable for a wider range of problems.


Subjects
Algorithms, Computational Biology/methods, Single Nucleotide Polymorphism, Software, Computer Graphics, Genome-Wide Association Study, High-Throughput Nucleotide Sequencing, Internet, Time Factors
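
The claim in entry 15 that exhaustive SNP-SNP evaluation can take weeks follows from the quadratic growth in the number of pairs. The sketch below works through that arithmetic. The function names are hypothetical, and the throughput of 100,000 pairs per second is an assumed figure for a single-threaded scorer, not a number taken from the paper.

```python
import math

SECONDS_PER_DAY = 86_400


def pair_count(n_snps):
    """Number of unordered SNP pairs: n * (n - 1) / 2."""
    return math.comb(n_snps, 2)


def estimated_days(n_snps, pairs_per_second):
    """Days needed to score every pair at a given (assumed) throughput."""
    return pair_count(n_snps) / pairs_per_second / SECONDS_PER_DAY


n_snps = 500_000                   # within "hundreds of thousands to millions"
assumed_throughput = 100_000       # pairs per second; hypothetical value

pairs = pair_count(n_snps)
days = estimated_days(n_snps, assumed_throughput)
print(f"{pairs:,} pairs, about {days:.1f} days at the assumed throughput")
```

Since each pair can be scored independently of every other pair, the workload splits cleanly across many cores, which is what makes GPU and MIC coprocessors a natural fit for this problem.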
16.
Genome Res ; 21(10): 1572-82, 2011 Oct.
Article in English | MEDLINE | ID: mdl-21846794

ABSTRACT

Age is the most important risk factor for neurodegeneration; however, the effects of aging and neurodegeneration on gene expression in the human brain have most often been studied separately. Here, we analyzed changes in transcript levels and alternative splicing in the temporal cortex of individuals of different ages who were cognitively normal, affected by frontotemporal lobar degeneration (FTLD), or affected by Alzheimer's disease (AD). We identified age-related splicing changes in cognitively normal individuals and found that these were present also in 95% of individuals with FTLD or AD, independent of their age. These changes were consistent with increased polypyrimidine tract binding protein (PTB)-dependent splicing activity. We also identified disease-specific splicing changes that were present in individuals with FTLD or AD, but not in cognitively normal individuals. These changes were consistent with the decreased neuro-oncological ventral antigen (NOVA)-dependent splicing regulation, and the decreased nuclear abundance of NOVA proteins. As expected, a dramatic down-regulation of neuronal genes was associated with disease, whereas a modest down-regulation of glial and neuronal genes was associated with aging. Whereas our data indicated that the age-related splicing changes are regulated independently of transcript-level changes, these two regulatory mechanisms affected expression of genes with similar functions, including metabolism and DNA repair. In conclusion, the alternative splicing changes identified in this study provide a new link between aging and neurodegeneration.


Subjects
Aging, Alternative Splicing, Alzheimer Disease/genetics, Frontotemporal Lobar Degeneration/genetics, Adolescent, Adult, Age Factors, Aged, Aged 80 and over, Neoplasm Antigens/genetics, Neoplasm Antigens/metabolism, Cell Adhesion Molecules/genetics, Down-Regulation, Exons, Gene Expression Profiling, Humans, Ion Channels/genetics, Middle Aged, Nerve Tissue Proteins/genetics, Nerve Tissue Proteins/metabolism, Neuro-Oncological Ventral Antigen, Oligonucleotide Array Sequence Analysis, Polypyrimidine Tract-Binding Protein/metabolism, Principal Component Analysis, Protein Isoforms/metabolism, RNA-Binding Proteins/genetics, RNA-Binding Proteins/metabolism, Synaptic Transmission/genetics, Temporal Lobe/metabolism, Genetic Transcription, Young Adult
17.
Yeast ; 31(7): 265-77, 2014 Jul.
Article in English | MEDLINE | ID: mdl-24752995

ABSTRACT

Genome sequencing is essential to understand individual variation and to study the mechanisms that explain relations between genotype and phenotype. The accumulated knowledge from large-scale genome sequencing projects of Saccharomyces cerevisiae isolates is being used to study the mechanisms that explain such relations. Our objective was to undertake genetic characterization of 172 S. cerevisiae strains from different geographical origins and technological groups, using 11 polymorphic microsatellites, and computationally relate these data with the results of 30 phenotypic tests. Genetic characterization revealed 280 alleles, with the microsatellite ScAAT1 contributing most to intrastrain variability, together with alleles 20, 9 and 16 from the microsatellites ScAAT4, ScAAT5 and ScAAT6. These microsatellite allelic profiles are characteristic of both the phenotype and origin of yeast strains. We confirm the strength of these associations by construction and cross-validation of computational models that can predict the technological application and origin of a strain from the microsatellite allelic profile. Associations between microsatellites and specific phenotypes were scored using information gain ratios, and significant findings were confirmed by permutation tests and estimation of false discovery rates. The phenotypes associated with a higher number of alleles were the capacity to resist sulphur dioxide (tested by the capacity to grow in the presence of potassium bisulphite) and the presence of galactosidase activity. Our study demonstrates the utility of computational modelling to estimate a strain's technological group and phenotype from microsatellite allelic combinations as tools for preliminary yeast strain selection.
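
The scoring step mentioned in the abstract (information gain ratio followed by a permutation test) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy allele and phenotype values, and the choice of 1000 permutations are all assumptions made for illustration, and the false discovery rate step is omitted.

```python
import math
import random
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, target):
    """Information gain of `feature` about `target`, divided by the
    entropy of the feature itself (the split information)."""
    n = len(target)
    split_info = entropy(feature)
    if split_info == 0:
        # A constant feature cannot split the data at all.
        return 0.0
    conditional = 0.0
    for value, count in Counter(feature).items():
        subset = [t for f, t in zip(feature, target) if f == value]
        conditional += (count / n) * entropy(subset)
    return (entropy(target) - conditional) / split_info


def permutation_p_value(feature, target, n_permutations=1000, seed=0):
    """Fraction of label shufflings whose gain ratio is at least as large
    as the observed one (with the usual +1 correction)."""
    rng = random.Random(seed)
    observed = gain_ratio(feature, target)
    shuffled = list(target)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if gain_ratio(feature, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Hypothetical toy data: presence (1) or absence (0) of one allele
# and a binary phenotype for eight strains.
allele = [1, 1, 1, 1, 0, 0, 0, 0]
phenotype = ["resistant"] * 4 + ["sensitive"] * 4
score = gain_ratio(allele, phenotype)
p_value = permutation_p_value(allele, phenotype)
```

In a real analysis the p-values from many allele and phenotype pairs would then be corrected for multiple testing, which is what the false discovery rate estimate in the abstract refers to.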


Subjects
DNA, Fungal/genetics , Genetic Variation , Microsatellite Repeats/genetics , Models, Genetic , Saccharomyces cerevisiae/genetics , Alleles , Computer Simulation , Genotype , Phenotype , Principal Component Analysis
18.
J Chem Inf Model ; 54(2): 431-41, 2014 Feb 24.
Article in English | MEDLINE | ID: mdl-24490838

ABSTRACT

The vastness of chemical space and the relatively small coverage by experimental data recording molecular properties require us to identify subspaces, or domains, for which we can confidently apply QSAR models. The prediction of QSAR models in these domains is reliable, and potential subsequent investigations of such compounds would find that the predictions closely match the experimental values. Standard approaches in QSAR assume that predictions are more reliable for compounds that are "similar" to those in subspaces with denser experimental data. Here, we report on a study of an alternative set of techniques recently proposed in the machine learning community. These methods quantify prediction confidence through estimation of the prediction error at the point of interest. Our study includes 20 public QSAR data sets with continuous response and assesses the quality of 10 reliability scoring methods by observing their correlation with prediction error. We show that these new alternative approaches can outperform standard reliability scores that rely only on similarity to compounds in the training set. The results also indicate that the quality of reliability scoring methods is sensitive to data set characteristics and to the regression method used in QSAR. We demonstrate that at the cost of increased computational complexity these dependencies can be leveraged by integration of scores from various reliability estimation approaches. The reliability estimation techniques described in this paper have been implemented in an open source add-on package ( https://bitbucket.org/biolab/orange-reliability ) to the Orange data mining suite.
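
As a rough illustration of how a reliability score can be assessed by its correlation with prediction error, the sketch below computes a Pearson correlation between a score and the absolute errors of a model. The abstract does not state which correlation measure was used, so Pearson correlation, the function names, and the toy numbers here are assumptions, not the paper's protocol or the Orange add-on's API.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        # Correlation is undefined for a constant sequence; report 0.
        return 0.0
    return cov / (sd_x * sd_y)


def score_quality(reliability_scores, predictions, observed):
    """Correlate a per-compound reliability score with the absolute
    prediction error; a useful error estimate correlates strongly."""
    abs_errors = [abs(p - o) for p, o in zip(predictions, observed)]
    return pearson(reliability_scores, abs_errors)


# Hypothetical values: the score is meant to estimate the expected error.
scores = [0.1, 0.2, 0.3, 0.4]
predicted = [1.1, 2.2, 3.3, 4.4]
measured = [1.0, 2.0, 3.0, 4.0]
quality = score_quality(scores, predicted, measured)
```

Whether a high or a low score should indicate a reliable prediction depends on the scoring method, so the sign of the correlation has to be interpreted per method.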


Subjects
Artificial Intelligence , Drug Discovery/methods , Quantitative Structure-Activity Relationship , Algorithms , Regression Analysis , Time Factors
19.
RNA Biol ; 11(2): 146-55, 2014.
Article in English | MEDLINE | ID: mdl-24526010

ABSTRACT

Heterogeneous nuclear ribonucleoprotein L (hnRNP L) is a multifunctional RNA-binding protein that is involved in many different processes, such as regulation of transcription, translation, and RNA stability. We have previously characterized hnRNP L as a global regulator of alternative splicing that binds to CA-repeat and CA-rich RNA elements. Interestingly, hnRNP L can both activate and repress splicing of alternative exons, but the precise mechanism of hnRNP L-mediated splicing regulation remained unclear. To analyze activities of hnRNP L on a genome-wide level, we performed individual-nucleotide resolution crosslinking-immunoprecipitation in combination with deep-sequencing (iCLIP-Seq). Sequence analysis of the iCLIP crosslink sites showed significant enrichment of C/A motifs, which agrees with the in vitro binding consensus obtained earlier by a SELEX approach, indicating that in vivo hnRNP L binding targets are mainly determined by the RNA-binding activity of the protein. Genome-wide mapping of hnRNP L binding revealed that the protein preferentially binds to introns and 3' UTRs. Additionally, position-dependent splicing regulation by hnRNP L was demonstrated: the protein represses splicing when bound to intronic regions upstream of alternative exons and, in contrast, activates splicing when bound to the downstream intron. These findings shed light on the longstanding question of differential hnRNP L-mediated splicing regulation. Finally, regarding 3' UTR binding, hnRNP L binding preferentially overlaps with predicted microRNA target sites, indicating global competition between hnRNP L and microRNA binding. Translational regulation by hnRNP L was validated for a subset of predicted target 3' UTRs.


Subjects
3' Untranslated Regions , Alternative Splicing , Heterogeneous-Nuclear Ribonucleoprotein L/metabolism , Introns , MicroRNAs/metabolism , RNA, Messenger/metabolism , Gene Expression Regulation , Gene Knockdown Techniques , Gene Regulatory Networks , Genome, Human , HeLa Cells , Heterogeneous-Nuclear Ribonucleoprotein L/genetics , High-Throughput Nucleotide Sequencing , Humans , Immunoprecipitation
20.
Nat Genet ; 37(5): 471-7, 2005 May.
Article in English | MEDLINE | ID: mdl-15821735

ABSTRACT

Classical epistasis analysis can determine the order of function of genes in pathways using morphological, biochemical and other phenotypes. It requires knowledge of the pathway's phenotypic output and a variety of experimental expertise and so is unsuitable for genome-scale analysis. Here we used microarray profiles of mutants as phenotypes for epistasis analysis. Considering genes that regulate activity of protein kinase A in Dictyostelium, we identified known and unknown epistatic relationships and reconstructed a genetic network with microarray phenotypes alone. This work shows that microarray data can provide a uniform, quantitative tool for large-scale genetic network analysis.


Subjects
Dictyostelium/genetics , Epistasis, Genetic , Transcription, Genetic , Animals , Cyclic AMP-Dependent Protein Kinases/genetics , Cyclic AMP-Dependent Protein Kinases/metabolism , Dictyostelium/enzymology , Mutation , Protein Kinase C/genetics , Protein Kinase C/metabolism