ABSTRACT
Heracleum sosnowskyi, a member of the giant hogweed group, is a plant with large effects on ecosystems and human health. It is an invasive species that contributes to the deterioration of grassland ecosystems. The ability of H. sosnowskyi to produce linear furanocoumarins (FCs), which are photosensitizing compounds, makes it very dangerous. At the same time, linear FCs have high pharmaceutical value and are used in skin disease therapies. Despite this importance, the species has not been the focus of genetic and genomic studies. Here, we report a chromosome-scale assembly of the Sosnowsky's hogweed genome. Genomic analysis revealed an unusually high number of genes (55,106) in the hogweed genome, in contrast to the 25-35 thousand found in most plants. However, we did not find any traces of recent whole-genome duplications not shared with Daucus carota (carrot), a member of the same family that has approximately thirty thousand genes. The analysis of the genomic proximity of duplicated genes points to tandem duplications as the main reason for this increase. We performed a genome-wide search for the genes of the FC biosynthesis pathway and surveyed their expression in aboveground plant parts. Using a combination of expression data and phylogenetic analysis, we found candidate genes for psoralen synthase and experimentally showed the activity of one of them using a heterologous yeast expression system. These findings expand our knowledge of the evolution of gene space in plants and lay a foundation for further analysis of hogweed as an invasive plant and as a source of FCs.
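The proximity argument for tandem duplication can be illustrated with a minimal sketch. It is not the authors' pipeline: the gene records, the list of paralog pairs, and the `max_intervening` cutoff are hypothetical inputs chosen for illustration. A paralog pair is called tandem here when both genes sit on the same chromosome with at most `max_intervening` other genes between them.

```python
from collections import defaultdict

def tandem_pairs(genes, paralog_pairs, max_intervening=1):
    """Return paralog pairs that are close neighbours on one chromosome.

    genes: iterable of (gene_id, chromosome, start) tuples (hypothetical format).
    paralog_pairs: iterable of (gene_id_a, gene_id_b) tuples.
    max_intervening: assumed cutoff on the number of genes between the pair.
    """
    # Group genes by chromosome so they can be ranked by start position.
    by_chrom = defaultdict(list)
    for gene_id, chrom, start in genes:
        by_chrom[chrom].append((start, gene_id))

    # Record each gene's chromosome and its rank along that chromosome.
    rank = {}
    for chrom, items in by_chrom.items():
        for idx, (_, gene_id) in enumerate(sorted(items)):
            rank[gene_id] = (chrom, idx)

    result = []
    for a, b in paralog_pairs:
        chrom_a, idx_a = rank[a]
        chrom_b, idx_b = rank[b]
        # Number of genes strictly between the two paralogs.
        intervening = abs(idx_a - idx_b) - 1
        if chrom_a == chrom_b and intervening <= max_intervening:
            result.append((a, b))
    return result
```

The fraction of duplicated pairs flagged this way gives a rough indication of how much of a gene-count increase could be attributed to local duplication rather than whole-genome events.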
Subjects
Daucus carota , Heracleum , Humans , Heracleum/genetics , Introduced Species , Ecosystem , Phylogeny , Gene Duplication
ABSTRACT
BACKGROUND: Capsella bursa-pastoris, a cosmopolitan weed of hybrid origin, is an emerging model for the study of the early consequences of polyploidy, being a fast-growing annual and a close relative of Arabidopsis thaliana. The development of this model is hampered by the absence of a reference genome sequence. RESULTS: We present here a subgenome-resolved chromosome-scale assembly and a genetic map of the genome of Capsella bursa-pastoris. It shows that the subgenomes are mostly colinear, with no massive deletions, insertions, or rearrangements in any of them. A subgenome-aware annotation reveals the lack of genome dominance: both subgenomes carry a similar number of genes. While most chromosomes can be unambiguously recognized as derived from either the paternal or the maternal parent, we also found a homeologous exchange between two chromosomes. It led to the emergence of two hybrid chromosomes; this event is shared between distant populations of C. bursa-pastoris. The whole-genome analysis of 119 samples belonging to C. bursa-pastoris and its parental species C. grandiflora/rubella and C. orientalis reveals introgression from C. orientalis but not from C. grandiflora/rubella. CONCLUSIONS: C. bursa-pastoris does not show genome dominance. In the earliest stages of the evolution of this species, a homeologous exchange occurred; its presence in all present-day populations of C. bursa-pastoris indicates a single origin of this species. The evidence coming from whole-genome analysis challenges the current view that C. grandiflora/rubella was a direct progenitor of C. bursa-pastoris; we hypothesize that the progenitor was an extinct (or undiscovered) species sister to C. grandiflora/rubella.
Subjects
Arabidopsis , Capsella , Rubella (German Measles) , Capsella/genetics , Genomics , Polyploidy
ABSTRACT
Interspecific gene comparisons are the keystones for many areas of biological research and are especially important for the translation of knowledge from model organisms to economically important species. Currently they are hampered by the low resolution of methods based on sequence analysis and by the complex evolutionary history of eukaryotic genes. This is especially critical for plants, whose genomes are shaped by multiple whole-genome duplications and subsequent gene loss. This requires the development of new methods for comparing the functions of genes in different species. Here, we report ISEEML (Interspecific Similarity of Expression Evaluated using Machine Learning), a novel machine learning-based algorithm for interspecific gene classification. In contrast to previous studies focused on sequence similarity, our algorithm focuses on functional similarity inferred from the comparison of gene expression profiles. We propose a novel metric for expression pattern similarity, the expression score (ES), that is suitable for species with differing morphologies. As a proof of concept, we compare detailed transcriptome maps of Arabidopsis thaliana, the model species, Zea mays (maize) and Fagopyrum esculentum (common buckwheat), which are species that represent distant clades within flowering plants. The classifier achieved an AUC of 0.91; at an ES threshold of 0.5, the specificity was 94% and the sensitivity was 72%.
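The way sensitivity and specificity follow from a score threshold can be shown with a short sketch. This is a generic illustration, not ISEEML itself: the `scores` and `labels` inputs are hypothetical, and treating a score greater than or equal to the threshold as a positive call is an assumption.

```python
def sensitivity_specificity(scores, labels, threshold=0.5):
    """Compute (sensitivity, specificity) for a score threshold.

    scores: similarity scores in [0, 1] (hypothetical inputs).
    labels: 1 for a true positive pair, 0 for a true negative pair.
    A pair is called positive when score >= threshold (an assumption).
    """
    tp = fn = tn = fp = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if label == 1:
            if predicted_positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted_positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    sensitivity = tp / (tp + fn)  # fraction of true pairs recovered
    specificity = tn / (tn + fp)  # fraction of false pairs rejected
    return sensitivity, specificity
```

Raising the threshold generally trades sensitivity for specificity, which is the trade-off that a single AUC value summarizes across all thresholds.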
Subjects
Arabidopsis , Transcriptome , Transcriptome/genetics , Arabidopsis/genetics , Biological Evolution , Gene Expression Regulation, Plant/genetics , Zea mays/genetics
ABSTRACT
BACKGROUND: A transcriptome map is a powerful tool for a variety of biological studies; transcriptome maps that include different organs, tissues, cells and stages of development are currently available for at least 30 plants. Some of them include samples subjected to environmental or biotic stresses. However, most studies explore only a limited set of organs and developmental stages (leaves or seedlings). To provide a broader view of organ-specific strategies of cold stress response, we used RNA-seq to study the expression changes that follow exposure to cold (+4 °C) in different aerial parts of the plant: cotyledons, hypocotyl, leaves, young flowers, mature flowers and seeds. RESULTS: The results on differential expression in leaves are congruent with current knowledge of stress response pathways, in particular the role of CBF genes. In other organs, both the nature and the dynamics of gene expression changes are different. We show that genes confined to narrow expression patterns under non-stress conditions are involved in the stress response. In particular, the genes that control cell wall modification in pollen are activated in leaves. In seeds, the predominant pattern is a change in lipid metabolism. CONCLUSIONS: Stress response is highly organ-specific; different pathways are involved in this process in each type of organ. The results were integrated with the previously published transcriptome map of Arabidopsis thaliana and used to update the public database TraVA: http://travadb.org/browse/Species=AthStress .
Subjects
Arabidopsis Proteins/metabolism , Arabidopsis/metabolism , Arabidopsis/genetics , Arabidopsis Proteins/genetics , Cold-Shock Response/genetics , Cold-Shock Response/physiology , Gene Expression Profiling , Gene Expression Regulation, Plant , Transcriptome/genetics
ABSTRACT
Polyploidization and subsequent sub- and neofunctionalization of duplicated genes represent a major mechanism of plant genome evolution. Capsella bursa-pastoris, a widespread ruderal plant, is a recent allotetraploid and is thus an ideal model organism for studying early changes following polyploidization. We constructed a high-quality assembly of the C. bursa-pastoris genome and a transcriptome atlas covering a broad sample of organs and developmental stages (available online at http://travadb.org/browse/Species=Cbp). We demonstrate that the expression of homeologs is mostly symmetric between subgenomes, and identify a set of homeolog pairs with discordant expression. Comparison of promoters within such pairs revealed an emerging asymmetry of regulatory elements. Among them are multiple binding sites for transcription factors controlling the regulation of photosynthesis and plant development by light (PIF3, HY5) and the cold stress response (CBF). These results suggest that polyploidization in C. bursa-pastoris enhanced the plasticity of its response to light and temperature, and allowed a substantial expansion of its distribution range.
Subjects
Capsella/genetics , Gene Expression Regulation, Plant , Genome, Plant , Polyploidy , Regulatory Sequences, Nucleic Acid , Molecular Sequence Annotation
ABSTRACT
Arabidopsis thaliana is a long-established model species for plant molecular biology, genetics and genomics, and studies of A. thaliana gene function provide the basis for formulating hypotheses and designing experiments involving other plants, including economically important species. A comprehensive understanding of the A. thaliana genome and a detailed and accurate understanding of the expression of its genes is therefore of great importance for both fundamental research and practical applications. This goal relies on the development of new genetic and genomic resources, involving new methods of data acquisition and analysis. We present here a genome-wide analysis of A. thaliana gene expression profiles across different organs and developmental stages using high-throughput transcriptome sequencing. The expression of 25 706 protein-coding genes, as well as their stability and their spatiotemporal specificity, was assessed in 79 organs and developmental stages. A search for alternative splicing events identified 37 873 previously unreported splice junctions, approximately 30% of which occurred in intergenic regions. These potentially represent novel spliced genes that are not included in the TAIR10 database. These data are housed in an open-access web-based database, TraVA (Transcriptome Variation Analysis, http://travadb.org/), which allows visualization and analysis of gene expression profiles and differential gene expression between organs and developmental stages.
Subjects
Arabidopsis Proteins/metabolism , Arabidopsis/metabolism , Transcriptome/genetics , Alternative Splicing/genetics , Arabidopsis/genetics , Arabidopsis Proteins/genetics , Computational Biology , Gene Expression Profiling , Gene Expression Regulation, Plant/genetics , Transcription Factors/genetics , Transcription Factors/metabolism
ABSTRACT
Signaling lymphocytic activation molecule family member 1 (SLAMF1)/CD150 is a co-stimulatory receptor expressed on a variety of hematopoietic cells, in particular on mature lymphocytes activated by specific antigen, costimulation and cytokines. Changes in CD150 expression level have been reported in association with autoimmunity and with B-cell chronic lymphocytic leukemia. We characterized the core promoter of the SLAMF1 gene in human B-cell lines and explored binding sites for a number of transcription factors involved in B-cell differentiation and activation. Mutations of the SP1, STAT6, IRF4, NF-kB, ELF1, TCF3, and SPI1/PU.1 sites resulted in significantly decreased promoter activity of varying magnitude, depending on the cell line tested. The most profound effect on promoter strength was observed upon mutation of the binding site for Early B-cell factor 1 (EBF1). This mutation produced a 10- to 20-fold drop in promoter activity and pinpointed EBF1 as the master regulator of the human SLAMF1 gene in B cells. We also identified three potent transcriptional enhancers in the human SLAMF1 locus, each containing functional EBF1 binding sites. Thus, EBF1 interacts with specific binding sites located both in the promoter and in the enhancer regions of the SLAMF1 gene and is critical for its expression in human B cells.
Subjects
Gene Expression Regulation , Signaling Lymphocytic Activation Molecule Family Member 1/genetics , Trans-Activators/genetics , Transcription, Genetic , B-Lymphocytes/cytology , B-Lymphocytes/metabolism , Basic Helix-Loop-Helix Transcription Factors/genetics , Basic Helix-Loop-Helix Transcription Factors/metabolism , Binding Sites , Cell Line, Tumor , Enhancer Elements, Genetic , Genes, Reporter , HEK293 Cells , Humans , Interferon Regulatory Factors/genetics , Interferon Regulatory Factors/metabolism , Luciferases/genetics , Mutation , NF-kappa B/genetics , NF-kappa B/metabolism , Nuclear Proteins/genetics , Nuclear Proteins/metabolism , Primary Cell Culture , Promoter Regions, Genetic , Protein Binding , Proto-Oncogene Proteins/genetics , Proto-Oncogene Proteins/metabolism , STAT6 Transcription Factor/genetics , STAT6 Transcription Factor/metabolism , Signal Transduction , Signaling Lymphocytic Activation Molecule Family Member 1/metabolism , Sp1 Transcription Factor/genetics , Sp1 Transcription Factor/metabolism , Trans-Activators/metabolism , Transcription Factors/genetics , Transcription Factors/metabolism
ABSTRACT
Endemic species flocks inhabiting ancient lakes, oceanic islands and other long-lived isolated habitats are often interpreted as adaptive radiations. Yet molecular evidence for directional selection during species flock radiation is scarce. Using partial transcriptomes of 64 species of Lake Baikal (Siberia, Russia) endemic amphipods and two nonendemic outgroups, we report a revised phylogeny of this species flock and analyse evidence for positive selection within the endemic lineages. We confirm two independent invasions of amphipods into Baikal and demonstrate that several morphological features of Baikal amphipods, such as body armour and reduction in appendages and sensory organs, evolved in several lineages in parallel. The radiation of Baikal amphipods has been characterized by short phylogenetic branches and frequent episodes of positive selection, which tended to be more frequent in the early phase of the second invasion of amphipods into Baikal, when the most intensive diversification occurred. Notably, signatures of positive selection are frequent in genes encoding mitochondrial membrane proteins with electron transfer chain and ATP synthesis functionality. In particular, subunits of both the membrane and substrate-level ATP synthases show evidence of positive selection in the planktonic species Macrohectopus branickii, possibly indicating adaptation to an active planktonic lifestyle and to survival under conditions of low temperature and high hydrostatic pressure, which are known to affect membrane functioning. Other functional categories represented among genes likely to be under positive selection include Ca-binding muscle-related proteins, possibly indicating adaptation to the Ca-deficient, low-mineralization waters of Baikal.
Subjects
Amphipoda/classification , Genetic Speciation , Phylogeny , Selection, Genetic , Transcriptome , Adaptation, Biological/genetics , Animals , Lakes , Siberia
ABSTRACT
Populations of different species vary in the amounts of genetic diversity they possess. Nucleotide diversity π, the fraction of nucleotides that are different between two randomly chosen genotypes, has been known to range in eukaryotes between 0.0001 in Lynx lynx and 0.16 in Caenorhabditis brenneri. Here, we report the results of a comparative analysis of 24 haploid genotypes (12 from the United States and 12 from European Russia) of the split-gill fungus Schizophyllum commune. The diversity at synonymous sites is 0.20 in the American population of S. commune and 0.13 in the Russian population. This exceptionally high level of nucleotide diversity also leads to extreme amino acid diversity of protein-coding genes. Using whole-genome resequencing of 2 parental and 17 offspring haploid genotypes, we estimate that the mutation rate in S. commune is high, at 2.0 × 10⁻⁸ (95% CI: 1.1 × 10⁻⁸ to 4.1 × 10⁻⁸) per nucleotide per generation. Therefore, the high diversity of S. commune is primarily determined by its elevated mutation rate, although a high effective population size likely also plays a role. Small genome size, ease of cultivation and completion of the life cycle in the laboratory, free-living haploid life stages and exceptionally high variability make S. commune a promising model organism for population, quantitative, and evolutionary genetics.
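The definition of nucleotide diversity π given above can be made concrete with a minimal sketch. It assumes a gap-free alignment of equal-length haploid sequences and ignores missing data and site classes (for example, synonymous sites), so it is a simplified illustration rather than the estimator used in the study; the function name and input format are hypothetical.

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    """Average fraction of differing sites over all pairs of sequences.

    seqs: list of aligned, equal-length haploid sequences (hypothetical input).
    """
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")

    total_differences = 0
    n_pairs = 0
    for a, b in combinations(seqs, 2):
        # Count sites where the two genotypes carry different nucleotides.
        total_differences += sum(x != y for x, y in zip(a, b))
        n_pairs += 1
    return total_differences / (n_pairs * length)
```

For the three toy sequences `AAAA`, `AAAT` and `AATT`, the pairwise differences are 1, 2 and 1 over 4 sites, so the function returns 4 / 12, about 0.33.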
Subjects
Agaricales/genetics , Genetic Variation , Wood/microbiology , Nucleotides/genetics , Polymorphism, Genetic
ABSTRACT
BACKGROUND: Floral transition is a critical event in the life cycle of a flowering plant as it determines its reproductive success. Despite extensive studies of specific genes that regulate this process, the global changes in transcript expression profiles at the point when a vegetative meristem transitions into an inflorescence have not been reported. We analyzed gene expression during Arabidopsis thaliana meristem development under long-day conditions from day 7 to day 16 after germination in one-day increments. RESULTS: The dynamics of the expression of the main flowering regulators was consistent with previous reports: notably, the expression of FLOWERING LOCUS C (FLC) decreased over the course of the time series while the expression of LEAFY (LFY) increased. This analysis revealed a developmental time point between 10 and 12 days after germination at which FLC expression had decreased but LFY expression had not yet increased, and which was characterized by a peak in the number of differentially expressed genes. Gene Ontology (GO) enrichment analysis of these genes identified an overrepresentation of genes related to the cell cycle. CONCLUSIONS: We discovered an unprecedented burst of differential expression of cell cycle-related genes at one particular point during the transition to flowering. We suggest that an acceleration of the division rate and a partial synchronization of the cell cycle take place at this point.
Subjects
Arabidopsis/genetics , Flowers/genetics , Gene Expression Regulation, Plant/genetics , Meristem/genetics , RNA, Plant/genetics , Arabidopsis Proteins/genetics , Cell Cycle/genetics , Gene Expression Regulation, Developmental/genetics , Genes, Plant/genetics , Germination/genetics , Inflorescence/genetics , Plant Leaves/genetics , Sequence Analysis, RNA
ABSTRACT
Recombination between double-stranded DNA molecules is a key genetic process which occurs in a wide variety of organisms. Usually, crossing-over (CO) occurs during meiosis between genotypes with 98.0-99.9% sequence identity, because within-population nucleotide diversity only rarely exceeds 2%. However, some species are hypervariable, and it is unclear how CO can occur between genotypes with less than 90% sequence identity. Here, we study CO in Schizophyllum commune, a hypervariable cosmopolitan basidiomycete mushroom and a frequently encountered decayer of woody substrates. We crossed two haploid individuals, from the United States and from Russia, and obtained genome sequences for their 17 offspring. The average genetic distance between the parents was 14%, making it possible to study CO at very high resolution. We found reduced levels of linkage disequilibrium between loci flanking the CO sites, indicating that they are mostly confined to hotspots of recombination. Furthermore, CO events preferentially occurred in regions under stronger negative selection, in particular within exons that showed reduced levels of nucleotide diversity. Apparently, in hypervariable species CO must avoid regions of higher divergence between the recombining genomes due to limitations imposed by the mismatch repair system, with regions under strong negative selection providing the opportunity for recombination. These patterns are opposite to those observed in a number of less variable species, indicating that population genomics of hypervariable species may reveal novel biological phenomena.
Subjects
Crossing Over, Genetic , DNA/genetics , Genetic Variation , Schizophyllum/genetics , Base Composition , Base Pairing , Crosses, Genetic , DNA/chemistry , Genetic Loci , Haploidy , Linkage Disequilibrium , Selection, Genetic
ABSTRACT
BACKGROUND: As the genomes of many eukaryotic species, especially plants, are large and complex, their de novo sequencing and assembly is still a difficult task despite progress in sequencing technologies. An alternative to genome assembly is the assembly of the transcriptome, the set of RNA products of the expressed genes. While a number of de novo transcriptome assemblers exist, the challenges of transcriptomes (the existence of isoforms, the uneven expression levels across genes) complicate the generation of high-quality assemblies suitable for downstream analyses. RESULTS: We developed Trans2express, a web-based tool and a pipeline for de novo hybrid transcriptome assembly and postprocessing based on rnaSPAdes with a set of subsequent filtering steps. The pipeline was tested on Arabidopsis thaliana cDNA sequencing data obtained using Illumina and Oxford Nanopore Technologies platforms and on three non-model plant species. The comparison of the structural characteristics of the transcriptome assembly with the reference Arabidopsis genome revealed the high quality of the assembled transcriptome, with 86.1% of Arabidopsis expressed genes assembled as a single contig. We tested the applicability of the transcriptome assembly for gene expression analysis. For both Arabidopsis and the non-model species, the results showed high congruence of gene expression levels and sets of differentially expressed genes between analyses based on the genome and those based on the transcriptome assembly. CONCLUSIONS: We present Trans2express, a protocol for de novo hybrid transcriptome assembly aimed at recovering a single transcript per gene. We expect this protocol to promote the characterization of transcriptomes and gene expression analysis in non-model plants, and the web-based tool to be of use to a wide range of plant biologists.
ABSTRACT
The vast diversity of Orchidaceae together with sophisticated adaptations to pollinators and other unique features make this family an attractive model for evolutionary and functional studies. The sequenced genome of Phalaenopsis equestris facilitates Orchidaceae research. Here, we present an RNA-seq-based transcriptome map of P. equestris that covers 19 organs of the plant, including leaves, roots, floral organs and the shoot apical meristem. We demonstrated the high quality of the data and showed the similarity of the P. equestris transcriptome map with the gene expression atlases of other plants. The transcriptome map can be easily accessed through our database Transcriptome Variation Analysis (TraVA) for visualizing gene expression profiles. As an example of the application, we analyzed the expression of Phalaenopsis "orphan" genes, those that do not have recognizable similarity with the genes of other plants. We found that approximately half of these genes were not expressed; the ones that were expressed were predominantly expressed in reproductive structures.
ABSTRACT
Common buckwheat (Fagopyrum esculentum) is an important non-cereal grain crop and a prospective component of functional food. Despite this, the genomic resources for this species and for the whole family Polygonaceae, to which it belongs, are scarce. Here, we report the assembly of the buckwheat genome using long-read technology and a high-resolution expression atlas including 46 organs and developmental stages. We found that the buckwheat genome has an extremely high content of transposable elements, including several classes of recently (0.5-1 Mya) multiplied TEs ("transposon burst") and gradually accumulated TEs. The difference in TE content is a major factor contributing to the three-fold increase in the genome size of F. esculentum compared with its sister species F. tataricum. Moreover, we detected the differences in TE content between the wild ancestral subspecies F. esculentum ssp. ancestrale and buckwheat cultivars, suggesting that TE activity accompanied buckwheat domestication. Expression profiling allowed us to test a hypothesis about the genetic control of petaloidy of tepals in buckwheat. We showed that it is not mediated by B-class gene activity, in contrast to the prediction from the ABC model. Based on a survey of expression profiles and phylogenetic analysis, we identified the MYB family transcription factor gene tr_18111 as a potential candidate for the determination of conical cells in buckwheat petaloid tepals. The information on expression patterns has been integrated into the publicly available database TraVA: http://travadb.org/browse/Species=Fesc/. The improved genome assembly and transcriptomic resources will enable research on buckwheat, including practical applications.
ABSTRACT
Naturally occurring mutants whose phenotype recapitulates the changes that distinguish closely related species are of special interest from the evolutionary point of view. They can provide a key to the genetic control of the changes that led to speciation. In this study, we described lepidium-like (lel), a naturally occurring variety of the allotetraploid species Capsella bursa-pastoris that is characterized by the typical loss of all four petals. In some cases, one or two basal flowers in the raceme had one or two small petals. The number and structure of other floral organs are not affected. Our study of flower development in the mutant showed that once initiated, petals either cease further development and cannot be traced in anthetic flowers or sometimes develop to various degrees. lel plants showed an earlier beginning of floral organ initiation and delayed petal initiation compared to the wild-type plants. The lel phenotype has a wide geographical distribution, being found at the northern extremity of the species range as well as in the central part. The genetic analysis of inheritance demonstrated that the lel phenotype is controlled by two independent loci. While the flower in the family Cruciferae generally has a very stable structure (i.e., four sepals, four petals, six stamens, and two carpels), several deviations from this ground plan are known, in particular in the genus Lepidium. C. bursa-pastoris is an emerging model for the study of polyploidy (which is also very widespread in Cruciferae); the identification and characterization of the apetalous mutant lays a foundation for further research on morphological evolution in polyploids.
ABSTRACT
For many years, progress in the identification of gene functions has been based on classical genetic approaches. However, considerable recent omics developments have brought to the fore indirect but high-resolution methods of gene function identification such as transcriptomics, proteomics, and metabolomics. A transcriptome map is a powerful source of functional information and the result of the genome-wide expression analysis of a broad sampling of tissues and/or organs from different developmental stages and/or environmental conditions. In plant science, the application of transcriptome maps extends from the inference of gene regulatory networks to evolutionary studies. However, only some of these data have been integrated into databases, thus enabling analyses to be conducted without raw data; without this integration, extensive data preprocessing is required, which limits data usability. In this review, we summarize the state of plant transcriptome maps, analyze the problems associated with the combined analysis of large-scale data from various studies, and outline possible solutions to these problems.
ABSTRACT
The knowledge of gene functions in model organisms is the starting point for the analysis of gene function in non-model species, including economically important ones. Usually, the assignment of gene functions is based on sequence similarity. In plants, due to a highly intricate gene landscape, this approach has some limitations. It is often impossible to directly match gene sets from one plant species to another species based only on their sequences. Thus, it is necessary to use additional information to identify functionally similar genes. Expression patterns have great potential to serve as a source of such information. An important prerequisite for the comparative analysis of transcriptomes is the existence of high-resolution expression maps consisting of comparable samples. Here, we present a transcriptome atlas of tomato (Solanum lycopersicum) consisting of 30 samples of different organs and developmental stages. The samples were selected in a way that allowed for side-by-side comparison with the Arabidopsis thaliana transcriptome map. Newly obtained data are integrated in the TraVA database and are available online, together with tools for their analysis. In this paper, we demonstrate the potential of comparing transcriptome maps for inferring shifts in the expression of paralogous genes.
Subjects
Arabidopsis/genetics , Gene Expression Regulation, Developmental , Solanum lycopersicum/genetics , Transcriptome , Arabidopsis/growth & development , Gene Expression Regulation, Plant , Solanum lycopersicum/growth & development , Sequence Homology
ABSTRACT
Recently developed high-throughput analytical techniques (e.g., protein mass spectrometry and nucleic acid sequencing) allow unprecedentedly sensitive, in-depth studies in molecular biology of cell proliferation, differentiation, aging, and death. However, the initial population of asynchronous cultured cells is highly heterogeneous by cell cycle stage, which complicates immediate analysis of some biological processes. Widely used cell synchronization protocols are time-consuming and can affect the finely tuned biochemical pathways leading to biased results. Besides, certain cell lines cannot be effectively synchronized. The current methodological challenge is thus to provide an effective tool for cell cycle phase-based population enrichment compatible with other required experimental procedures. Here, we describe an optimized approach to live cell FACS based on Hoechst 33342 cell-permeable DNA-binding fluorochrome staining. The proposed protocol is fast compared to traditional synchronization methods and yields reasonably pure fractions of viable cells for further experimental studies including high-throughput RNA-seq analysis.
Subjects
Biological Variation, Population , Cell Cycle/genetics , Flow Cytometry , Sequence Analysis, RNA , Single-Cell Analysis , Computational Biology , DNA Replication , Flow Cytometry/methods , Humans , K562 Cells , Microscopy , Single-Cell Analysis/methods , Staining and Labeling
ABSTRACT
BACKGROUND: RNA-seq is a useful tool for the analysis of gene expression. However, its robustness is greatly affected by a number of artifacts. One of them is the presence of duplicated reads. RESULTS: To infer the influence of different methods of removing duplicated reads on the estimation of gene expression in cancer genomics, we analyzed paired samples of hepatocellular carcinoma (HCC) and non-tumor liver tissue. Four protocols of data analysis were applied to each sample: processing without deduplication, deduplication using a method implemented in SAMtools, and deduplication based on one or two molecular indices (MI). We also analyzed the influence of sequencing layout (single read or paired end) and read length. We found that deduplication without MI greatly affects estimated expression values; this effect is the most pronounced for highly expressed genes. CONCLUSION: The use of unique molecular identifiers greatly improves the accuracy of RNA-seq analysis, especially for highly expressed genes. We developed a set of scripts that enable the handling of MI and their incorporation into RNA-seq analysis pipelines. Deduplication without MI affects the results of differential gene expression analysis, producing a high proportion of false negative results. Omitting duplicate read removal, in turn, biases the results towards false positives. In those cases where using MI is not possible, we recommend using a paired-end sequencing layout.
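Why deduplication without molecular indices distorts counts for highly expressed genes can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' scripts and not the SAMtools algorithm: reads are represented as hypothetical `(chromosome, position, strand, molecular_index)` tuples, coordinate-only deduplication is modeled as collapsing reads with identical mapping coordinates, and sequencing errors in the molecular index are ignored.

```python
def count_after_dedup(reads, use_molecular_index):
    """Count reads remaining after duplicate removal.

    reads: iterable of (chromosome, position, strand, molecular_index) tuples
           (hypothetical representation of aligned reads).
    use_molecular_index: if True, two reads are duplicates only when both the
           mapping coordinates and the molecular index match; if False, matching
           coordinates alone are enough (a simplified coordinate-based model).
    """
    seen = set()
    for chrom, pos, strand, index in reads:
        if use_molecular_index:
            key = (chrom, pos, strand, index)
        else:
            key = (chrom, pos, strand)
        seen.add(key)
    return len(seen)
```

In a highly expressed gene, distinct RNA molecules often yield fragments with identical coordinates; coordinate-only deduplication collapses them into one read and underestimates expression, whereas the molecular index keeps them apart.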