ABSTRACT
Aging is associated with progressive phenotypic changes. Virtually all cellular phenotypes are produced by proteins, and their structural alterations can lead to age-related diseases. However, we still lack comprehensive knowledge of proteins undergoing structural-functional changes during cellular aging and their contributions to age-related phenotypes. Here, we conducted a proteome-wide analysis of early age-related protein structural changes in budding yeast using limited proteolysis-mass spectrometry (LiP-MS). The results, compiled in the online ProtAge catalog, revealed age-related functional changes in regulators of translation, protein folding, and amino acid metabolism. Mechanistically, we found that folded glutamate synthase Glt1 polymerizes into supramolecular self-assemblies during aging, causing breakdown of cellular amino acid homeostasis. Inhibiting Glt1 polymerization by mutating the polymerization interface restored amino acid levels in aged cells, attenuated mitochondrial dysfunction, and led to lifespan extension. Altogether, this comprehensive map of protein structural changes enables identifying mechanisms of age-related phenotypes and offers opportunities for their reversal.
Subjects
Cellular Senescence, Longevity, Longevity/genetics, Polymerization, Amino Acids
ABSTRACT
Increasing the proportion of locally produced plant protein in currently meat-rich diets could substantially reduce greenhouse gas emissions and loss of biodiversity. However, plant protein production is hampered by the lack of a cool-season legume equivalent to soybean in agronomic value. Faba bean (Vicia faba L.) has a high yield potential and is well suited for cultivation in temperate regions, but genomic resources are scarce. Here, we report a high-quality chromosome-scale assembly of the faba bean genome and show that it has expanded to a massive 13 Gb in size through an imbalance between the rates of amplification and elimination of retrotransposons and satellite repeats. Genes and recombination events are evenly dispersed across chromosomes and the gene space is remarkably compact considering the genome size, although with substantial copy number variation driven by tandem duplication. Demonstrating practical application of the genome sequence, we develop a targeted genotyping assay and use high-resolution genome-wide association analysis to dissect the genetic basis of seed size and hilum colour. The resources presented constitute a genomics-based breeding platform for faba bean, enabling breeders and geneticists to accelerate the improvement of sustainable protein production across the Mediterranean, subtropical and northern temperate agroecological zones.
Subjects
Agricultural Crops, Diploidy, Genetic Variation, Plant Genome, Genomics, Plant Breeding, Plant Proteins, Vicia faba, Plant Chromosomes/genetics, Agricultural Crops/genetics, Agricultural Crops/metabolism, DNA Copy Number Variations/genetics, Satellite DNA/genetics, Gene Amplification/genetics, Plant Genes/genetics, Genetic Variation/genetics, Plant Genome/genetics, Genome-Wide Association Study, Geography, Plant Breeding/methods, Plant Proteins/genetics, Plant Proteins/metabolism, Genetic Recombination, Retroelements/genetics, Seeds/anatomy & histology, Seeds/genetics, Vicia faba/anatomy & histology, Vicia faba/genetics, Vicia faba/metabolism
ABSTRACT
Protein structure is key to understanding biological function. Structure comparison deciphers deep phylogenies, providing insight into functional conservation and functional shifts during evolution. Until recently, structural coverage of the protein universe was limited by the cost and labour involved in experimental structure determination. Recent breakthroughs in deep learning revolutionized structural bioinformatics by providing accurate structural models of numerous protein families for which no structural information existed. The Dali server for 3D protein structure comparison is widely used by crystallographers to relate new structures to pre-existing ones. Here, we report the two most recent upgrades to the web server: (i) the foldomes of key organisms in the AlphaFold Database (version 1) are searchable by Dali, and (ii) structural alignments are annotated with protein families. Using these new features, we discovered a novel functionally diverse subgroup within the WRKY/GCM1 clan. This was accomplished by linking the structurally characterized SWI/SNF and NAM families as well as the structural models of the CG-1 family and uncharacterized proteins to the structure of Gti1/Pac2, a previously known member of the WRKY/GCM1 clan. The Dali server is available at http://ekhidna2.biocenter.helsinki.fi/dali. This website is free and open to all users and there is no login requirement.
Subjects
Protein Databases, Proteins, Software, Computers, Internet, Proteins/chemistry, Protein Conformation
ABSTRACT
[This corrects the article DOI: 10.1371/journal.pcbi.1007419.]
ABSTRACT
MOTIVATION: Protein structure comparison plays a fundamental role in understanding the evolutionary relationships between proteins. Here, we release a new version of the DaliLite standalone software. The novelties are hierarchical search of the structure database organized into sequence based clusters, and remote access to our knowledge base of structural neighbors. The detection of fold, superfamily and family level similarities by DaliLite and state-of-the-art competitors was benchmarked against a manually curated structural classification. RESULTS: Database search strategies were evaluated using Fmax with query-specific thresholds. DaliLite and DeepAlign outperformed TM-score based methods at all levels of the benchmark, and DaliLite outperformed DeepAlign at fold level. Hierarchical and knowledge-based searches got close to the performance of systematic pairwise comparison. The knowledge-based search was four times as efficient as the hierarchical search. The knowledge-based search dynamically adjusts the depth of the search, enabling a trade-off between speed and recall. AVAILABILITY AND IMPLEMENTATION: http://ekhidna2.biocenter.helsinki.fi/dali/README.v5.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
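The Fmax evaluation with query-specific thresholds mentioned above can be illustrated with a minimal sketch. It assumes that each query yields a list of scored hits labelled correct or incorrect against the reference classification, and that Fmax is the largest F1 value reachable by any score cut-off for that query. The function name `fmax` and the input layout are hypothetical and are not taken from DaliLite or from the benchmark code.

```python
from typing import Iterable, Tuple


def fmax(hits: Iterable[Tuple[float, bool]], n_positives: int) -> float:
    """Return the maximum F1 over all score cut-offs for a single query.

    hits        -- (score, is_correct) pairs; higher score means more similar
    n_positives -- number of true relatives of the query in the database
    (Hypothetical interface, for illustration only.)
    """
    if n_positives <= 0:
        raise ValueError("n_positives must be positive")
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    best = 0.0
    true_pos = 0
    for i, (score, is_correct) in enumerate(ranked):
        true_pos += int(is_correct)
        # A cut-off cannot separate hits with equal scores, so F1 is only
        # evaluated at the last hit of each group of tied scores.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if not last_of_tie or true_pos == 0:
            continue
        precision = true_pos / (i + 1)
        recall = true_pos / n_positives
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


example_hits = [(9.0, True), (8.0, False), (7.0, True), (6.0, False)]
print(fmax(example_hits, n_positives=3))
```

Because the best cut-off is chosen separately for every query, the per-query values would typically be averaged to compare search strategies; that averaging step is likewise an assumption here.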
Subjects
Benchmarking, Algorithms, Factual Databases, Proteins, Protein Sequence Analysis, Software
ABSTRACT
Automated protein annotation using the Gene Ontology (GO) plays an important role in the biosciences. Evaluation has always been considered central to developing novel annotation methods, but little attention has been paid to the evaluation metrics themselves. Evaluation metrics define how well an annotation method performs and allow methods to be ranked against one another. Unfortunately, most of these metrics were adopted from the machine learning literature without establishing whether they were appropriate for GO annotations. We propose a novel approach for comparing GO evaluation metrics called Artificial Dilution Series (ADS). Our approach uses existing annotation data to generate a series of annotation sets with different levels of correctness (referred to as their signal level). We calculate the evaluation metric being tested for each annotation set in the series, allowing us to identify whether it can separate different signal levels. Finally, we contrast these results with several false positive annotation sets, which are designed to expose systematic weaknesses in GO assessment. We compared 37 evaluation metrics for GO annotation using ADS and identified drastic differences between metrics. We show that some metrics struggle to differentiate between different signal levels, while others give erroneously high scores to the false positive data sets. Based on our findings, we provide guidelines on which evaluation metrics perform well with the Gene Ontology and propose improvements to several well-known evaluation metrics. In general, we argue that evaluation metrics should be tested for their performance and we provide software for this purpose (https://bitbucket.org/plyusnin/ads/). ADS is applicable to other areas of science where the evaluation of prediction results is non-trivial.
Subjects
Computational Biology/methods, Molecular Sequence Annotation/classification, Molecular Sequence Annotation/methods, Algorithms, Benchmarking/methods, Genetic Databases, Protein Databases, Gene Ontology/trends, Reproducibility of Results, Software
ABSTRACT
We present AAI-profiler, a web server for exploratory analysis and quality control in comparative genomics. AAI-profiler summarizes proteome-wide sequence search results to identify novel species, assess the need for taxonomic reclassification and detect multi-isolate and contaminated samples. AAI-profiler visualises results using a scatterplot that shows the Average Amino-acid Identity (AAI) from the query proteome to all similar species in the sequence database. Taxonomic groups are indicated by colour and marker styles, making outliers easy to spot. AAI-profiler uses SANSparallel to perform high-performance homology searches, making proteome-wide analysis possible. We demonstrate the efficacy of AAI-profiler in the discovery of a close relationship between two bacterial symbionts of an omnivorous pirate bug (Orius) and a thrips (Frankliniella occidentalis), an important pest in agriculture. The symbionts represent novel species within the genus Rosenbergiella, so far described only in floral nectar. AAI-profiler is easy to use: the analysis presented required only two mouse clicks and was completed in a few minutes. AAI-profiler is available at http://ekhidna2.biocenter.helsinki.fi/AAI.
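The idea of summarizing a proteome-wide search as one Average Amino-acid Identity value per species can be sketched as follows. This is an illustration under stated assumptions, not the server's implementation: it assumes the search results are available as (query protein, species, percent identity) triples, keeps the best hit per query protein and species, and averages over the query proteins that have a hit in that species. The function name `aai_by_species` is hypothetical, and the server may treat proteins without hits differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def aai_by_species(
    hits: Iterable[Tuple[str, str, float]]
) -> Dict[str, float]:
    """Average the best percent identity per query protein for each species.

    hits -- (query_protein, species, percent_identity) triples
    (Hypothetical input layout, for illustration only.)
    """
    # species -> {query protein -> best identity seen so far}
    best: Dict[str, Dict[str, float]] = defaultdict(dict)
    for query, species, identity in hits:
        previous = best[species].get(query)
        if previous is None or identity > previous:
            best[species][query] = identity
    return {
        species: sum(per_query.values()) / len(per_query)
        for species, per_query in best.items()
    }


example = [
    ("p1", "species A", 90.0),
    ("p1", "species A", 80.0),  # weaker second hit, ignored
    ("p2", "species A", 70.0),
    ("p1", "species B", 50.0),
]
print(aai_by_species(example))
```

Plotting these averages against the fraction of query proteins with a hit in each species would give a scatterplot of the kind the abstract describes; the choice of the second axis is an assumption.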
Subjects
Bacterial Proteins/genetics, Chlamydia trachomatis/classification, Erwinia/classification, Phylogeny, Proteome/genetics, Software, Amino Acid Sequence, Animals, Bacterial Proteins/classification, Bacterial Proteins/metabolism, Chlamydia trachomatis/genetics, Chlamydia trachomatis/isolation & purification, Erwinia/genetics, Erwinia/isolation & purification, Gene Expression, Genomics/methods, Heteroptera/microbiology, Internet, Molecular Sequence Annotation, Proteome/classification, Proteome/metabolism, Amino Acid Sequence Homology, Symbiosis/physiology, Thysanoptera/microbiology
ABSTRACT
The unprecedented growth of high-throughput sequencing has led to an ever-widening annotation gap in protein databases. While computational prediction methods are available to make up the shortfall, a majority of public web servers are hindered by practical limitations and poor performance. Here, we introduce PANNZER2 (Protein ANNotation with Z-scoRE), a fast functional annotation web server that provides both Gene Ontology (GO) annotations and free text description predictions. PANNZER2 uses SANSparallel to perform high-performance homology searches, making bulk annotation based on sequence similarity practical. PANNZER2 can output GO annotations from multiple scoring functions, enabling users to see which predictions are robust across predictors. Finally, PANNZER2 predictions scored within the top 10 methods for molecular function and biological process in the CAFA2 NK-full benchmark. The PANNZER2 web server is updated on a monthly schedule and is accessible at http://ekhidna2.biocenter.helsinki.fi/sanspanz/. The source code is available under the GNU Public Licence v3.
Subjects
Computational Biology/trends, Gene Ontology/trends, Internet, Software, Algorithms, Protein Databases/trends, High-Throughput Nucleotide Sequencing, Molecular Sequence Annotation
ABSTRACT
BACKGROUND: Protein homology search is an important, yet time-consuming, step in everything from protein annotation to metagenomics. Its application, however, has become increasingly challenging, due to the exponential growth of protein databases. In order to perform homology search at the required scale, many methods have been proposed as alternatives to BLAST that make an explicit trade-off between sensitivity and speed. One such method, SANSparallel, uses a parallel implementation of the suffix array neighbourhood search (SANS) technique to achieve high speed and provides several modes to allow for greater sensitivity at the expense of performance. RESULTS: We present a new approach called asymmetric SANS together with scored seeds and an alternative suffix array ordering scheme called optimal substitution ordering. These techniques dramatically improve both the sensitivity and speed of the SANS approach. Our implementation, TOPAZ, is one of the top performing methods in terms of speed, sensitivity and scalability. In our benchmark, searching UniProtKB for homologous proteins to the Dickeya solani proteome, TOPAZ took less than 3 minutes to achieve a sensitivity of 0.84 compared to BLAST. CONCLUSIONS: Despite the trade-off homology search methods have to make between sensitivity and speed, TOPAZ stands out as one of the most sensitive and highest performance methods currently available.
Subjects
Protein Databases, Software, Algorithms, Amino Acid Sequence, Bacterial Proteins/chemistry, Enterobacteriaceae/metabolism, Sequence Alignment
ABSTRACT
BACKGROUND: Current high-throughput sequencing platforms provide the capacity to sequence multiple samples in parallel. Different samples are labeled by attaching a short sample-specific nucleotide sequence, a barcode, to each DNA molecule prior to pooling them into a mix containing a number of libraries to be sequenced simultaneously. After sequencing, the samples are binned by identifying the barcode sequence within each sequence read. In order to tolerate sequencing errors, barcodes should be sufficiently far apart from each other in sequence space. An additional constraint, due to both nucleotide usage and basecalling accuracy, is that the proportions of the different nucleotides should be balanced at each barcode position. The number of samples to be mixed in each sequencing run may vary, which raises the problem of how to select the best subset of the barcodes available at a sequencing core facility for each sequencing run. There are plenty of tools available for de novo barcode design, but they are not suitable for subset selection. RESULTS: We have developed a tool which can be used for three different tasks: 1) selecting an optimal barcode set from a larger set of candidates, 2) checking the compatibility of a user-defined set of barcodes, e.g. whether two or more libraries with existing barcodes can be combined in a single sequencing pool, and 3) augmenting an existing set of barcodes. In our approach the selection process is formulated as a minimization problem. We define the cost function and a set of constraints and use integer programming to solve the resulting combinatorial problem. Based on the desired number of barcodes to be selected and the set of candidate sequences given by the user, the necessary constraints are automatically generated and the optimal solution can be found. The method is implemented in the C programming language and a web interface is available at http://ekhidna2.biocenter.helsinki.fi/barcosel.
CONCLUSIONS: The increasing capacity of sequencing platforms raises the challenge of mixing barcodes. Our method allows the user to select a given number of barcodes from a larger existing barcode set so that sequencing errors are tolerated and the nucleotide balance is optimized. The tool is easy to access via a web browser.
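The two constraints named in the abstract, separation in sequence space and per-position nucleotide balance, can be illustrated with a small compatibility check. This sketch covers only the checking task, not the integer-programming optimizer, and it makes several assumptions: separation is measured as Hamming distance, imbalance as the difference between the most and least frequent nucleotide at a position, and the thresholds `min_distance` and `max_imbalance` are hypothetical parameters that do not come from the tool.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes: Sequence[str]) -> int:
    """Smallest Hamming distance between any two barcodes in the set."""
    if len(barcodes) < 2:
        raise ValueError("need at least two barcodes")
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def position_imbalance(barcodes: Sequence[str]) -> List[int]:
    """Per position: count of the commonest base minus count of the rarest."""
    imbalance = []
    for column in zip(*barcodes):
        counts = Counter(column)
        per_base = [counts.get(base, 0) for base in "ACGT"]
        imbalance.append(max(per_base) - min(per_base))
    return imbalance


def is_compatible(barcodes: Sequence[str],
                  min_distance: int = 3,
                  max_imbalance: int = 1) -> bool:
    """Check both constraints; the default thresholds are illustrative."""
    return (min_pairwise_distance(barcodes) >= min_distance
            and max(position_imbalance(barcodes)) <= max_imbalance)


pool = ["ACGT", "CGTA", "GTAC", "TACG"]
print(min_pairwise_distance(pool), position_imbalance(pool), is_compatible(pool))
```

A subset selector could in principle enumerate candidate subsets and keep those passing such a check, but that brute-force approach scales poorly, which is presumably why the authors formulate the selection as an integer program instead.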
Subjects
Taxonomic DNA Barcoding/methods, High-Throughput Screening Assays/methods, Humans
ABSTRACT
The colorectal cancer (CRC) genome is unstable, and different types of instability, such as chromosomal instability (CIN) and microsatellite instability (MSI), are thought to reflect distinct cancer-initiating mechanisms. Although 85% of sporadic CRCs show CIN, 15% show mismatch repair (MMR) malfunction and MSI, the hallmarks of Lynch syndrome with inherited heterozygous germline mutations in MMR genes. Our study was designed to comprehensively follow genome-wide expression changes and their implications during colon tumorigenesis. We conducted a long-term feeding experiment in the mouse to address expression changes arising in histologically normal colonic mucosa as putative cancer-preceding events, and the effect of inherited predisposition (Mlh1+/-) and Western-style diet (WD) on those. During the 21-month experiment, carcinomas developed mainly in WD-fed mice and were evenly distributed between genotypes. Unexpectedly, the heterozygote (B6.129-Mlh1tm1Rak) mice did not show MSI in their CRCs. Instead, both wildtype and heterozygote CRC mice showed a distinct mRNA expression profile and a shortage of several chromosomal segregation gene-specific transcripts (Mlh1, Bub1, Mis18a, Tpx2, Rad9a, Pms2, Cenpe, Ncapd3, Odf2 and Dclre1b) in their colon mucosa, as well as increased mitotic activity and abundant unbalanced/atypical mitoses in tumours. Our genome-wide expression profiling experiment demonstrates that cancer-preceding changes are already seen in histologically normal colon mucosa and that decreased expression of Mlh1 and other chromosomal segregation genes may form a field defect in the mucosa, which triggers MMR-proficient, chromosomally unstable CRC.
Subjects
Colon/metabolism, Colonic Neoplasms/genetics, Intestinal Mucosa/metabolism, MutL Protein Homolog 1/deficiency, Animals, Colonic Neoplasms/metabolism, Hereditary Nonpolyposis Colorectal Neoplasms/genetics, DNA Mismatch Repair/genetics, Female, Genetic Predisposition to Disease/genetics, Germ-Line Mutation/genetics, Heterozygote, Male, Mice, Inbred C57BL Mice, Microsatellite Instability, Mitosis/genetics
ABSTRACT
The Dali server (http://ekhidna2.biocenter.helsinki.fi/dali) is a network service for comparing protein structures in 3D. In favourable cases, comparing 3D structures may reveal biologically interesting similarities that are not detectable by comparing sequences. The Dali server has been running in various places for over 20 years and is used routinely by crystallographers on newly solved structures. The latest update of the server provides enhanced analytics for the study of sequence and structure conservation. The server performs three types of structure comparisons: (i) Protein Data Bank (PDB) search compares one query structure against those in the PDB and returns a list of similar structures; (ii) pairwise comparison compares one query structure against a list of structures specified by the user; and (iii) all against all structure comparison returns a structural similarity matrix, a dendrogram and a multidimensional scaling projection of a set of structures specified by the user. Structural superimpositions are visualized using the Java-free WebGL viewer PV. The structural alignment view is enhanced by sequence similarity searches against Uniprot. The combined structure-sequence alignment information is compressed to a stack of aligned sequence logos. In the stack, each structure is structurally aligned to the query protein and represented by a sequence logo.
Subjects
Algorithms, Amidohydrolases/chemistry, Phylogeny, User-Computer Interface, Amidohydrolases/classification, Amidohydrolases/genetics, Amino Acid Sequence, Computer Graphics, Genetic Databases, Humans, Three-Dimensional Imaging, Internet, Molecular Models, Protein Domains, Protein Secondary Structure, Sequence Alignment, Protein Sequence Analysis, Protein Structural Homology
ABSTRACT
Proteins evolve by mutations and natural selection. The network of sequence similarities is a rich source for mining homologous relationships that inform on protein structure and function. There are many servers available to browse the network of homology relationships but one has to wait up to a minute for results. The SANSparallel webserver provides protein sequence database searches with immediate response and professional alignment visualization by third-party software. The output is a list, pairwise alignment or stacked alignment of sequence-similar proteins from Uniprot, UniRef90/50, Swissprot or Protein Data Bank. The stacked alignments are viewed in Jalview or as sequence logos. The database search uses the suffix array neighborhood search (SANS) method, which has been re-implemented as a client-server, improved and parallelized. The method is extremely fast and as sensitive as BLAST above 50% sequence identity. Benchmarks show that the method is highly competitive compared to previously published fast database search programs: UBLAST, DIAMOND, LAST, LAMBDA, RAPSEARCH2 and BLAT. The web server can be accessed interactively or programmatically at http://ekhidna2.biocenter.helsinki.fi/cgi-bin/sans/sans.cgi. It can be used to make protein functional annotation pipelines more efficient, and it is useful in interactive exploration of the detailed evidence supporting the annotation of particular proteins of interest.
Subjects
Sequence Alignment/methods, Amino Acid Sequence Homology, Software, Algorithms, Protein Databases, Internet, Protein Sequence Analysis
ABSTRACT
Nonribosomal peptides and polyketides are a diverse group of natural products with complex chemical structures and enormous pharmaceutical potential. They are synthesized on modular nonribosomal peptide synthetase (NRPS) and polyketide synthase (PKS) enzyme complexes by a conserved thiotemplate mechanism. Here, we report the widespread occurrence of NRPS and PKS genetic machinery across the three domains of life with the discovery of 3,339 gene clusters from 991 organisms, by examining a total of 2,699 genomes. These gene clusters display extraordinarily diverse organizations, and a total of 1,147 hybrid NRPS/PKS clusters were found. Surprisingly, 10% of bacterial gene clusters lacked modular organization, and instead catalytic domains were mostly encoded as separate proteins. The finding of common occurrence of nonmodular NRPS differs substantially from the current classification. Sequence analysis indicates that the evolution of NRPS machineries was driven by a combination of common descent and horizontal gene transfer. We identified related siderophore NRPS gene clusters that encoded modular and nonmodular NRPS enzymes organized in a gradient. A higher frequency of the NRPS and PKS gene clusters was detected from bacteria compared with archaea or eukarya. They commonly occurred in the phyla of Proteobacteria, Actinobacteria, Firmicutes, and Cyanobacteria in bacteria and the phylum of Ascomycota in fungi. The majority of these NRPS and PKS gene clusters have unknown end products highlighting the power of genome mining in identifying novel genetic machinery for the biosynthesis of secondary metabolites.
Subjects
Bacteria/genetics, Molecular Evolution, Bacterial Genome, Polyketide Synthases/genetics, Polyketides, Siderophores/genetics, Multigene Family/physiology, Protein Tertiary Structure, Protein Sequence Analysis/methods
ABSTRACT
BACKGROUND: Competitive gene set analysis is a standard exploratory tool for gene expression data. Permutation-based competitive gene set analysis methods are preferable to parametric ones because the latter make strong statistical assumptions which are not always met. For permutation-based methods, we permute samples, as opposed to genes, as doing so preserves the inter-gene correlation structure. Unfortunately, up until now, sample permutation-based methods have required a minimum of six replicates per sample group. RESULTS: We propose a new permutation-based competitive gene set analysis method for multi-group gene expression data with as few as three replicates per group. The method is based on an advanced sample permutation technique that utilizes all groups within a data set for pairwise comparisons. We present a comprehensive evaluation of different permutation techniques, using multiple data sets, and contrast the performance of our method, mGSZm, with other state-of-the-art methods. We show that mGSZm is robust, and that, despite using fewer than six replicates, we are able to consistently identify a high proportion of the top-ranked gene sets from the analysis of a substantially larger data set. Further, we highlight other methods where performance is highly variable and appears dependent on the underlying data set being analyzed. CONCLUSIONS: Our results demonstrate that robust gene set analysis of multi-group gene expression data is possible with as few as three replicates. In doing so, we have extended the applicability of such approaches to resource-constrained experiments where additional data generation is prohibitively difficult or expensive. An R package implementing the proposed method and supplementary materials are available from the website http://ekhidna.biocenter.helsinki.fi/downloads/pashupati/mGSZm.html.
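Plain sample permutation for a competitive gene set test, the baseline that the abstract contrasts with, can be sketched as below. This is a generic two-group illustration and not the mGSZm method: the gene-level statistic (difference of group means), the set score (mean over set genes minus mean over all genes) and the names `set_score` and `permutation_p_value` are all assumptions made for the example.

```python
import random
from statistics import mean
from typing import Dict, List, Sequence


def set_score(expr: Dict[str, List[float]],
              labels: Sequence[int],
              gene_set: Sequence[str]) -> float:
    """Competitive score: mean group difference in the set minus background.

    expr   -- gene -> expression values, one per sample
    labels -- 0/1 group label per sample, in the same order as the values
    """
    def group_difference(values: List[float]) -> float:
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        return mean(group1) - mean(group0)

    diffs = {gene: group_difference(values) for gene, values in expr.items()}
    return mean(diffs[g] for g in gene_set) - mean(diffs.values())


def permutation_p_value(expr: Dict[str, List[float]],
                        labels: Sequence[int],
                        gene_set: Sequence[str],
                        n_perm: int = 1000,
                        seed: int = 0) -> float:
    """Empirical two-sided P-value from shuffling the sample labels.

    Shuffling labels rather than genes keeps every gene's values attached
    to the same samples, so inter-gene correlation is preserved.
    """
    rng = random.Random(seed)
    observed = abs(set_score(expr, labels, gene_set))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(set_score(expr, shuffled, gene_set)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


expr = {
    "g1": [1, 1, 1, 5, 5, 5],
    "g2": [1, 2, 1, 5, 6, 5],
    "g3": [3, 3, 3, 3, 3, 3],
    "g4": [2, 3, 2, 3, 2, 3],
}
labels = [0, 0, 0, 1, 1, 1]
print(permutation_p_value(expr, labels, ["g1", "g2"]))
```

With three replicates in each of two groups there are only 20 distinct ways to assign the labels, so an empirical P-value from this simple scheme is necessarily coarse no matter how many shuffles are drawn.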
Subjects
Gene Expression Profiling/methods, Oligonucleotide Array Sequence Analysis/methods, Animals, Statistical Data Interpretation, Humans, Mice
ABSTRACT
Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.
Subjects
Computational Biology/methods, Molecular Biology/methods, Molecular Sequence Annotation, Proteins/physiology, Algorithms, Animals, Protein Databases, Exoribonucleases/classification, Exoribonucleases/genetics, Exoribonucleases/physiology, Forecasting, Humans, Proteins/chemistry, Proteins/classification, Proteins/genetics, Species Specificity
ABSTRACT
MOTIVATION: The last decade has seen a remarkable growth in protein databases. This growth comes at a price: a growing number of submitted protein sequences lack functional annotation. Approximately 32% of sequences submitted to the most comprehensive protein database, UniProtKB, are labelled as 'Unknown protein' or the like. In addition, the functionally annotated parts are reported to contain 30-40% errors. Here, we introduce a high-throughput tool for more reliable functional annotation called Protein ANNotation with Z-score (PANNZER). PANNZER predicts Gene Ontology (GO) classes and free text descriptions of protein function. PANNZER uses weighted k-nearest neighbour methods with statistical testing to maximize the reliability of a functional annotation. RESULTS: Our results in free text description line prediction show that we outperformed all competing methods by a clear margin. In GO prediction we show a clear improvement over our older method, which performed well in the CAFA 2011 challenge.
Subjects
Data Mining, Protein Databases, Molecular Sequence Annotation, Proteins/metabolism, Controlled Vocabulary, Cluster Analysis, Computational Biology/methods, Statistical Data Interpretation, Genetic Databases, Gene Ontology, Humans, Proteins/genetics
ABSTRACT
Pfam, available via servers in the UK (http://pfam.sanger.ac.uk/) and the USA (http://pfam.janelia.org/), is a widely used database of protein families, containing 14 831 manually curated entries in the current release, version 27.0. Since the last update article 2 years ago, we have generated 1182 new families and maintained sequence coverage of the UniProt Knowledgebase (UniProtKB) at nearly 80%, despite a 50% increase in the size of the underlying sequence database. Since our 2012 article describing Pfam, we have also undertaken a comprehensive review of the features that are provided by Pfam over and above the basic family data. For each feature, we determined the relevance, computational burden, usage statistics and the functionality of the feature in a website context. As a consequence of this review, we have removed some features, enhanced others and developed new ones to meet the changing demands of computational biology. Here, we describe the changes to Pfam content. Notably, we now provide family alignments based on four different representative proteome sequence data sets and a new interactive DNA search interface. We also discuss the mapping between Pfam and known 3D structures.
Subjects
Protein Databases, Sequence Alignment, Protein Sequence Analysis, Internet, Intrinsically Disordered Proteins/chemistry, Protein Conformation, Proteins/chemistry, Proteins/classification, Proteins/genetics, Proteome/chemistry, DNA Sequence Analysis
ABSTRACT
MOTIVATION: Gene set analysis is the analysis of a set of genes that collectively contribute to a biological process. Most popular gene set analysis methods are based on empirical P-values, which require a large number of permutations. Despite numerous gene set analysis methods developed in the past decade, the most popular methods still suffer from serious limitations. RESULTS: We present a gene set analysis method (mGSZ) based on the Gene Set Z-scoring function (GSZ) and asymptotic P-values. Asymptotic P-value calculation requires fewer permutations, and thus speeds up the gene set analysis process. We compare the GSZ-scoring function with seven popular gene set scoring functions and show that GSZ stands out as the best scoring function. In addition, we show improved performance of the GSA method when the max-mean statistic is replaced by the GSZ scoring function. We demonstrate the importance of both gene and sample permutations by showing the consequences in the absence of one or the other. A comparison of asymptotic and empirical methods of P-value estimation demonstrates a clear advantage of asymptotic over empirical P-values. We show that mGSZ outperforms the state-of-the-art methods based on two different evaluations. We compared mGSZ results with permutation and rotation tests and show that rotation does not improve our asymptotic P-values. We also propose well-known asymptotic distribution models for three of the compared methods. AVAILABILITY AND IMPLEMENTATION: mGSZ is available as an R package from cran.r-project.org.
Subjects
Computational Biology/methods, Gene Expression Profiling/methods, Algorithms, Statistical Data Interpretation, Escherichia coli/genetics, Female, Leukemic Gene Expression Regulation, Humans, Leukemia/genetics, Male, Statistical Models, Sex Factors, Software, Tumor Suppressor Protein p53/genetics
ABSTRACT
Insect flight is one of the most energetically demanding activities in the animal kingdom, yet for many insects flight is necessary for reproduction and foraging. Moreover, dispersal by flight is essential for the viability of species living in fragmented landscapes. Here, working on the Glanville fritillary butterfly (Melitaea cinxia), we use transcriptome sequencing to investigate gene expression changes caused by 15 min of flight in two contrasting populations and the two sexes. Male butterflies and individuals from a large metapopulation had significantly higher peak flight metabolic rate (FMR) than female butterflies and those from a small inbred population. In the pooled data, FMR was significantly positively correlated with genome-wide heterozygosity, a surrogate of individual inbreeding. The flight experiment changed the expression level of 1513 genes, including genes related to major energy metabolism pathways, ribosome biogenesis and RNA processing, and stress and immune responses. Males and butterflies from the population with high FMR had higher basal expression of genes related to energy metabolism, whereas females and butterflies from the small population with low FMR had higher expression of genes related to ribosome/RNA processing and immune response. Following the flight treatment, genes related to energy metabolism were generally down-regulated, while genes related to ribosome/RNA processing and immune response were up-regulated. These results suggest that common molecular mechanisms respond to flight and can influence differences in flight metabolic capacity between populations and sexes.