ABSTRACT
The sensor RIG-I detects double-stranded RNA derived from RNA viruses. Although RIG-I is also known to have a role in the antiviral response to DNA viruses, physiological RNA species recognized by RIG-I during infection with a DNA virus are largely unknown. Using next-generation RNA sequencing (RNAseq), we found that host-derived RNAs, most prominently 5S ribosomal RNA pseudogene 141 (RNA5SP141), bound to RIG-I during infection with herpes simplex virus 1 (HSV-1). Infection with HSV-1 induced relocalization of RNA5SP141 from the nucleus to the cytoplasm, and virus-induced shutoff of host protein synthesis downregulated the abundance of RNA5SP141-interacting proteins, which allowed RNA5SP141 to bind RIG-I and induce the expression of type I interferons. Silencing of RNA5SP141 strongly dampened the antiviral response to HSV-1 and the related virus Epstein-Barr virus (EBV), as well as influenza A virus (IAV). Our findings reveal that antiviral immunity can be triggered by host RNAs that are unshielded following depletion of their respective binding proteins by the virus.
Subject(s)
DEAD Box Protein 58/immunology , Human Herpesvirus 1/immunology , Immunity/immunology , 5S Ribosomal RNA/immunology , Animals , Cultured Cells , Chlorocebus aethiops , DEAD Box Protein 58/metabolism , Gene Expression/immunology , HEK293 Cells , Human Herpesvirus 1/physiology , Host-Pathogen Interactions/immunology , Humans , Interferon Type I/genetics , Interferon Type I/immunology , Interferon Type I/metabolism , Knockout Mice , Pseudogenes/genetics , RNA Transport/immunology , 5S Ribosomal RNA/genetics , 5S Ribosomal RNA/metabolism , Immunologic Receptors , Vero Cells
ABSTRACT
Research over the past decade has suggested important roles for pseudogenes in physiology and disease. In vitro experiments demonstrated that pseudogenes contribute to cell transformation through several mechanisms. However, in vivo evidence for a causal role of pseudogenes in cancer development is lacking. Here, we report that mice engineered to overexpress either the full-length murine B-Raf pseudogene Braf-rs1 or its pseudo "CDS" or "3' UTR" develop an aggressive malignancy resembling human diffuse large B cell lymphoma. We show that Braf-rs1 and its human ortholog, BRAFP1, elicit their oncogenic activity, at least in part, as competitive endogenous RNAs (ceRNAs) that elevate BRAF expression and MAPK activation in vitro and in vivo. Notably, we find that transcriptional or genomic aberrations of BRAFP1 occur frequently in multiple human cancers, including B cell lymphomas. Our engineered mouse models demonstrate the oncogenic potential of pseudogenes and indicate that ceRNA-mediated microRNA sequestration may contribute to the development of cancer.
Subject(s)
Diffuse Large B-Cell Lymphoma/genetics , Proto-Oncogene Proteins B-raf/genetics , Pseudogenes , RNA/metabolism , Animals , Base Sequence , Humans , Diffuse Large B-Cell Lymphoma/metabolism , Mice , Molecular Sequence Data , Proto-Oncogene Proteins B-raf/metabolism
ABSTRACT
Mutualisms that become evolutionarily stable give rise to organismal interdependencies. Some insects have developed intracellular associations with communities of bacteria, where the interdependencies are manifest in patterns of complementary gene loss and retention among members of the symbiosis. Here, using comparative genomics and microscopy, we show that a three-member symbiotic community has become a four-way assemblage through a novel bacterial lineage-splitting event. In some but not all cicada species of the genus Tettigades, the endosymbiont Candidatus Hodgkinia cicadicola has split into two new cytologically distinct but metabolically interdependent species. Although these new bacterial genomes are partitioned into discrete cell types, the intergenome patterns of gene loss and retention are almost perfectly complementary. These results defy easy classification: they show genomic patterns consistent with those observed after both speciation and whole-genome duplication. We suggest that our results highlight the potential power of nonadaptive forces in shaping organismal complexity.
Subject(s)
Alphaproteobacteria/classification , Alphaproteobacteria/genetics , Bacterial Genome , Hemiptera/microbiology , Alphaproteobacteria/isolation & purification , Alphaproteobacteria/physiology , Animals , Molecular Evolution , Hemiptera/cytology , Hemiptera/physiology , Molecular Sequence Data , Pseudogenes , Symbiosis
ABSTRACT
The prevalence of highly repetitive sequences within the human Y chromosome has prevented its complete assembly to date [1] and led to its systematic omission from genomic analyses. Here we present de novo assemblies of 43 Y chromosomes spanning 182,900 years of human evolution and report considerable diversity in size and structure. Half of the male-specific euchromatic region is subject to large inversions with a greater than twofold higher recurrence rate compared with all other chromosomes [2]. Ampliconic sequences associated with these inversions show differing mutation rates that are sequence context dependent, and some ampliconic genes exhibit evidence for concerted evolution with the acquisition and purging of lineage-specific pseudogenes. The largest heterochromatic region in the human genome, Yq12, is composed of alternating repeat arrays that show extensive variation in the number, size and distribution, but retain a 1:1 copy-number ratio. Finally, our data suggest that the boundary between the recombining pseudoautosomal region 1 and the non-recombining portions of the X and Y chromosomes lies 500 kb away from the currently established [1] boundary. The availability of fully sequence-resolved Y chromosomes from multiple individuals provides a unique opportunity for identifying new associations of traits with specific Y-chromosomal variants and garnering insights into the evolution and function of complex regions of the human genome.
Subject(s)
Human Y Chromosomes , Molecular Evolution , Humans , Male , Human Y Chromosomes/genetics , Human Genome/genetics , Genomics , Mutation Rate , Phenotype , Euchromatin/genetics , Pseudogenes , Genetic Variation/genetics , Human X Chromosomes/genetics , Pseudoautosomal Regions/genetics
ABSTRACT
Scientists have been trying to identify every gene in the human genome since the initial draft was published in 2001. In the years since, much progress has been made in identifying protein-coding genes, currently estimated to number fewer than 20,000, with an ever-expanding number of distinct protein-coding isoforms. Here we review the status of the human gene catalogue and the efforts to complete it in recent years. Beside the ongoing annotation of protein-coding genes, their isoforms and pseudogenes, the invention of high-throughput RNA sequencing and other technological breakthroughs have led to a rapid growth in the number of reported non-coding RNA genes. For most of these non-coding RNAs, the functional relevance is currently unclear; we look at recent advances that offer paths forward to identifying their functions and towards eventually completing the human gene catalogue. Finally, we examine the need for a universal annotation standard that includes all medically significant genes and maintains their relationships with different reference genomes for the use of the human gene catalogue in clinical settings.
Subject(s)
Genes , Human Genome , Molecular Sequence Annotation , Protein Isoforms , Humans , Human Genome/genetics , Molecular Sequence Annotation/standards , Molecular Sequence Annotation/trends , Protein Isoforms/genetics , Human Genome Project , Pseudogenes , RNA/genetics
ABSTRACT
Pseudogene transcripts can provide a novel tier of gene regulation through generation of endogenous siRNAs or miRNA-binding sites. Characterization of pseudogene expression, however, has remained confined to anecdotal observations due to analytical challenges posed by the extremely close sequence similarity with their counterpart coding genes. Here, we describe a systematic analysis of pseudogene "transcription" from an RNA-Seq resource of 293 samples, representing 13 cancer and normal tissue types, and observe a surprisingly prevalent, genome-wide expression of pseudogenes that could be categorized as ubiquitously expressed or lineage and/or cancer specific. Further, we explore disease subtype specificity and functions of selected expressed pseudogenes. Taken together, we provide evidence that transcribed pseudogenes are a significant contributor to the transcriptional landscape of cells and are positioned to play significant roles in cellular differentiation and cancer progression, especially in light of the recently described ceRNA networks. Our work provides a transcriptome resource that enables high-throughput analyses of pseudogene expression.
Subject(s)
Genome-Wide Association Study , Neoplasms/genetics , Pseudogenes/genetics , Transcriptome , Amino Acid Sequence , Base Sequence , Breast Neoplasms/genetics , Female , Humans , Male , Molecular Sequence Data , Prostatic Neoplasms/genetics , RNA Sequence Analysis
ABSTRACT
Protein evolution is guided by structural, functional, and dynamical constraints ensuring organismal viability. Pseudogenes are genomic sequences identified in many eukaryotes that lack translational activity due to sequence degradation and thus over time have undergone "devolution." Previously pseudogenized genes sometimes regain their protein-coding function, suggesting they may still encode robust folding energy landscapes despite multiple mutations. We study both the physical folding landscapes of protein sequences corresponding to human pseudogenes using the Associative Memory, Water Mediated, Structure and Energy Model, and the evolutionary energy landscapes obtained using direct coupling analysis (DCA) on their parent protein families. We found that generally mutations that have occurred in pseudogene sequences have disrupted their native global network of stabilizing residue interactions, making it harder for them to fold if they were translated. In some cases, however, energetic frustration has apparently decreased when the functional constraints were removed. We analyzed this unexpected situation for Cyclophilin A, Profilin-1, and Small Ubiquitin-like Modifier 2 Protein. Our analysis reveals that when such mutations in the pseudogene ultimately stabilize folding, at the same time, they likely alter the pseudogenes' former biological activity, as estimated by DCA. We localize most of these stabilizing mutations generally to normally frustrated regions required for binding to other partners.
Subject(s)
Molecular Evolution , Proteins , Pseudogenes , Cyclophilin A/genetics , Multigene Family , Protein Folding , Proteins/chemistry , Proteins/genetics , Proteins/metabolism , Small Ubiquitin-Related Modifier Proteins , Humans , Genetic Models
ABSTRACT
Pseudogenes are defined as regions of the genome that contain defective copies of genes. They exist across almost all forms of life, and in mammalian genomes are annotated in similar numbers to recognized protein-coding genes. Although often presumed to lack function, growing numbers of pseudogenes are being found to play important biological roles. In consideration of their evolutionary origins and inherent limitations in genome annotation practices, we posit that pseudogenes have been classified on a scientifically unsubstantiated basis. We reflect that a broad misunderstanding of pseudogenes, perpetuated in part by the pejorative inference of the 'pseudogene' label, has led to their frequent dismissal from functional assessment and exclusion from genomic analyses. With the advent of technologies that simplify the study of pseudogenes, we propose that an objective reassessment of these genomic elements will reveal valuable insights into genome function and evolution.
Subject(s)
Pseudogenes , Animals , Molecular Evolution , Genomics , Humans
ABSTRACT
Here, we present a unifying hypothesis about how messenger RNAs, transcribed pseudogenes, and long noncoding RNAs "talk" to each other using microRNA response elements (MREs) as letters of a new language. We propose that this "competing endogenous RNA" (ceRNA) activity forms a large-scale regulatory network across the transcriptome, greatly expanding the functional genetic information in the human genome and playing important roles in pathological conditions, such as cancer.
Subject(s)
Gene Expression Profiling , RNA/genetics , RNA/metabolism , Animals , Gene Expression Regulation , Humans , MicroRNAs/genetics , Neoplasms/genetics , Neoplasms/metabolism , Pseudogenes , Messenger RNA/genetics , Untranslated RNA/genetics
ABSTRACT
Evidence for gene non-functionalization due to mutational processes is found in genomes in the form of pseudogenes. Pseudogenes are known to be rare in prokaryote chromosomes, with the exception of lineages that underwent an extreme genome reduction (e.g. obligatory symbionts). Much less is known about the frequency of pseudogenes in prokaryotic plasmids; those are genetic elements that can transfer between cells and may encode beneficial traits for their host. Non-functionalization of plasmid-encoded genes may alter the plasmid characteristics, e.g. mobility, or their effect on the host. Analyzing 10 832 prokaryotic genomes, we find that plasmid genomes are characterized by threefold-higher pseudogene density compared to chromosomes. The majority of plasmid pseudogenes correspond to deteriorated transposable elements. A detailed analysis of enterobacterial plasmids furthermore reveals frequent gene non-functionalization events associated with the loss of plasmid self-transmissibility. Reconstructing the evolution of closely related plasmids reveals that non-functionalization of the conjugation machinery led to the emergence of non-mobilizable plasmid types. Examples are virulence plasmids in Escherichia and Salmonella. Our study highlights non-functionalization of core plasmid mobility functions as one route for the evolution of domesticated plasmids. Pseudogenes in plasmids supply insights into past transitions in plasmid mobility that are akin to transitions in bacterial lifestyle.
Subject(s)
Molecular Evolution , Bacterial Genome , Plasmids , Pseudogenes , Pseudogenes/genetics , Plasmids/genetics , Bacterial Genome/genetics , DNA Transposable Elements/genetics , Phylogeny
ABSTRACT
Substantial recent evidence implicates pseudogene-derived long noncoding RNAs (lncRNAs), acting as regulatory RNAs, in basic physiological processes and disease development through multiple modes of functional interaction with DNA, RNA, and proteins. Here, we report an important role for GBP1P1, the pseudogene of guanylate-binding protein 1, in regulating influenza A virus (IAV) replication in A549 cells. GBP1P1 was strongly upregulated after IAV infection, an induction controlled by JAK/STAT signaling. Functionally, ectopic expression of GBP1P1 in A549 cells resulted in significant suppression of IAV replication. Conversely, silencing GBP1P1 facilitated IAV replication and virus production, suggesting that GBP1P1 is one of the interferon-inducible antiviral effectors. Mechanistically, GBP1P1 is localized in the cytoplasm and functions as a sponge to trap DHX9 (DExH-box helicase 9), which subsequently restricts IAV replication. Together, these studies demonstrate that GBP1P1 plays an important role in antagonizing IAV replication. IMPORTANCE: Long noncoding RNAs (lncRNAs) are extensively expressed in mammalian cells and play a crucial role as regulators in various biological processes. A growing body of evidence suggests that host-encoded lncRNAs are important regulators involved in host-virus interactions. Here, we define a novel function of GBP1P1 as a decoy to compete with viral mRNAs for DHX9 binding. We demonstrate that GBP1P1 induction by IAV is mediated by JAK/STAT activation. In addition, GBP1P1 has the ability to inhibit IAV replication. Importantly, we reveal that GBP1P1 acts as a decoy to bind and titrate DHX9 away from viral mRNAs, thereby attenuating virus production. This study provides new insight into the role of a previously uncharacterized GBP1P1, a pseudogene-derived lncRNA, in the host antiviral process and a further understanding of the complex GBP network.
Subject(s)
DEAD-box RNA Helicases , Influenza A Virus , Pseudogenes , Virus Replication , Humans , A549 Cells , DEAD-box RNA Helicases/metabolism , DEAD-box RNA Helicases/genetics , Influenza A Virus/genetics , GTP-Binding Proteins/genetics , GTP-Binding Proteins/metabolism , Signal Transduction , Human Influenza/virology , Human Influenza/genetics , Human Influenza/metabolism , Animals , Long Noncoding RNA/genetics , Long Noncoding RNA/metabolism , HEK293 Cells , Host-Pathogen Interactions , Dogs , Neoplasm Proteins
ABSTRACT
Recent advances in long read technologies not only enable large consortia to aim to sequence all eukaryotes on Earth, but they also allow individual laboratories to sequence their species of interest with relatively low investment. Long read technologies embody the promise of overcoming scaffolding problems associated with repeats and low complexity sequences, but the number of contigs often far exceeds the number of chromosomes and they may contain many insertion and deletion errors around homopolymer tracts. To overcome these issues, we have implemented the ILRA pipeline to correct long read-based assemblies. Contigs are first reordered, renamed, merged, circularized, or filtered if erroneous or contaminated. Illumina short reads are used subsequently to correct homopolymer errors. We successfully tested our approach by improving the genome sequences of Homo sapiens, Trypanosoma brucei, and Leptosphaeria spp., and by generating four novel Plasmodium falciparum assemblies from field samples. We found that correcting homopolymer tracts reduced the number of genes incorrectly annotated as pseudogenes, but an iterative approach seems to be required to correct more sequencing errors. In summary, we describe and benchmark the performance of our new tool, which improved the quality of novel long read assemblies up to 1 Gbp. The pipeline is available at GitHub: https://github.com/ThomasDOtto/ILRA.
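As a rough illustration of the homopolymer problem described above, the following sketch locates homopolymer tracts in a nucleotide sequence, the positions where long-read insertion and deletion errors tend to cluster. It is not taken from ILRA; the function name `find_homopolymers` and the `min_len` threshold are assumptions chosen for illustration.

```python
import re
from typing import List, Tuple


def find_homopolymers(seq: str, min_len: int = 5) -> List[Tuple[int, int, str]]:
    """Return (start, end, base) for every run of a single base of at least min_len.

    Coordinates are 0-based and end-exclusive. Hypothetical helper, not part of ILRA.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # One captured base followed by at least (min_len - 1) repeats of the same base.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]


if __name__ == "__main__":
    for start, end, base in find_homopolymers("ACGTAAAAAACGTTTTTG"):
        print(f"{base} x {end - start} at {start}-{end}")
```

Tracts reported this way could then be compared against short-read pileups, which is broadly the kind of correction the abstract attributes to the Illumina step.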
Subject(s)
Genome , High-Throughput Nucleotide Sequencing , Humans , DNA Sequence Analysis , Pseudogenes , Chromosomes
ABSTRACT
Ferroptosis is a mode of regulated cell death characterized by iron-dependent accumulation of lipid peroxidation. It is closely linked to the pathophysiological processes in many diseases. Since our publication of the first ferroptosis database in 2020 (FerrDb V1), many new findings have been published. To keep up with the rapid progress in ferroptosis research and to provide timely and high-quality data, here we present the successor, FerrDb V2. It contains 1001 ferroptosis regulators and 143 ferroptosis-disease associations manually curated from 3288 articles. Specifically, there are 621 gene regulators, of which 264 are drivers, 238 are suppressors, 9 are markers, and 110 are unclassified genes; and there are 380 substance regulators, with 201 inducers and 179 inhibitors. Compared to FerrDb V1, curated articles increase by >300%, ferroptosis regulators increase by 175%, and ferroptosis-disease associations increase by 50.5%. Circular RNA and pseudogene are novel regulators in FerrDb V2, and the percentage of non-coding RNA increases from 7.3% to 13.6%. External gene-related data were integrated, enabling thought-provoking and gene-oriented analysis in FerrDb V2. In conclusion, FerrDb V2 will help to acquire deeper insights into ferroptosis. FerrDb V2 is freely accessible at http://www.zhounan.org/ferrdb/.
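The counts reported above can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the figures given in the abstract; the helper `implied_baseline` is a hypothetical name, and the FerrDb V1 figure it yields is an inference from the stated 175% increase, not a number reported in the abstract.

```python
# Counts copied from the abstract above.
gene_regulators = {"drivers": 264, "suppressors": 238, "markers": 9, "unclassified": 110}
substance_regulators = {"inducers": 201, "inhibitors": 179}

gene_total = sum(gene_regulators.values())            # abstract states 621
substance_total = sum(substance_regulators.values())  # abstract states 380
regulator_total = gene_total + substance_total        # abstract states 1001


def implied_baseline(current: float, percent_increase: float) -> float:
    """Back out the earlier value from a current value and a percent increase.

    Illustrative helper (hypothetical name); the result is an inference,
    not a figure reported in the abstract.
    """
    return current / (1 + percent_increase / 100)


# A 175% increase to the current total implies roughly this many regulators in V1.
implied_v1_regulators = implied_baseline(regulator_total, 175)

if __name__ == "__main__":
    print(gene_total, substance_total, regulator_total, round(implied_v1_regulators))
```

The category subtotals agree with the stated totals of 621 gene regulators, 380 substance regulators, and 1001 regulators overall.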
Subject(s)
Ferroptosis , Ferroptosis/genetics , Data Accuracy , Factual Databases , Lipid Peroxidation , Pseudogenes
ABSTRACT
Several atlasing efforts aim to profile human gene and protein expression across tissues, cell types and cell lines in normal physiology, development and disease. One utility of these resources is to examine the expression of a single gene across all cell types, tissues and cell lines in each atlas. However, there is currently no centralized place that integrates data from several atlases to provide this type of data in a uniform format for visualization, analysis and download, and via an application programming interface. To address this need, GeneRanger is a web server that provides access to processed data about gene and protein expression across normal human cell types, tissues and cell lines from several atlases. At the same time, TargetRanger is a related web server that takes as input RNA-seq data from profiled human cells and tissues, and then compares the uploaded input data to expression levels across the atlases to identify genes that are highly expressed in the input and lowly expressed across normal human cell types and tissues. Identified targets can be filtered by transmembrane or secreted proteins. The results from GeneRanger and TargetRanger are visualized as box and scatter plots, and as interactive tables. GeneRanger and TargetRanger are available from https://generanger.maayanlab.cloud and https://targetranger.maayanlab.cloud, respectively.
Subject(s)
Proteomics , Pseudogenes , Software , Humans , Cell Line , RNA-Seq , Internet
ABSTRACT
The HUGO Gene Nomenclature Committee (HGNC) assigns unique symbols and names to human genes. The HGNC database (www.genenames.org) currently contains over 43 000 approved gene symbols, over 19 200 of which are assigned to protein-coding genes, 14 000 to pseudogenes and nearly 9000 to non-coding RNA genes. The public website, www.genenames.org, displays all approved nomenclature within Symbol Reports that contain data curated by HGNC nomenclature advisors and links to related genomic, clinical, and proteomic information. Here, we describe updates to our resource, including improvements to our search facility and new download features.
Subject(s)
Genetic Databases , Humans , Genome , Genomics , Proteomics , Pseudogenes , Terminology as Topic
ABSTRACT
Chromatin structure is tightly intertwined with transcription regulation. Here we compared the chromosomal architectures of fetal and adult human erythroblasts and found that, globally, chromatin structures and compartments A/B are highly similar at both developmental stages. At a finer scale, we detected distinct folding patterns at the developmentally controlled β-globin locus. Specifically, new fetal stage-specific contacts were uncovered between a region separating the fetal (γ) and adult (δ and β) globin genes (encompassing the HBBP1 and BGLT3 noncoding genes) and two distal chromosomal sites (HS5 and 3'HS1) that flank the locus. In contrast, in adult cells, the HBBP1-BGLT3 region contacts the embryonic ε-globin gene, physically separating the fetal globin genes from the enhancer (locus control region [LCR]). Deletion of the HBBP1 region in adult cells alters contact landscapes in ways more closely resembling those of fetal cells, including increased LCR-γ-globin contacts. These changes are accompanied by strong increases in γ-globin transcription. Notably, the effects of HBBP1 removal on chromatin architecture and gene expression closely mimic those of deleting the fetal globin repressor BCL11A, implicating BCL11A in the function of the HBBP1 region. Our results uncover a new critical regulatory region as a potential target for therapeutic genome editing for hemoglobinopathies and highlight the power of chromosome conformation analysis in discovering new cis control elements.
Subject(s)
Chromatin/chemistry , Erythroblasts/metabolism , Developmental Gene Expression Regulation , Transcriptional Regulatory Elements , beta-Globins/genetics , Adult , Carrier Proteins/genetics , Fetus , Gene Silencing , Humans , Locus Control Region , Nuclear Proteins/genetics , Pseudogenes , Repressor Proteins , Transcriptome , gamma-Globins/genetics
ABSTRACT
PMS2 germline pathogenic variants are one of the major causes of Lynch syndrome and constitutional mismatch repair deficiency. Variant identification in the 3' region of this gene is complicated by the pseudogene PMS2CL, which shares high sequence homology with PMS2. Consequently, short-fragment screening strategies (NGS, Sanger) may fail to determine whether a variant lies in the gene or in the pseudogene. Using a comprehensive analysis strategy, we assessed 42 NGS-detected variants in 76 patients and found 32 located in PMS2 and 6 in PMS2CL. Interestingly, four variants were detected in PMS2 in some patients and in PMS2CL in others. Clinical phenotype correlated well with genotype, making it very helpful in variant assessment. Our findings emphasize the necessity of more specific complementary analyses to confirm the gene of origin of each variant detected in different individuals in order to avoid variant misinterpretation. In addition, we characterized two PMS2 genomic alterations involving Alu-mediated tandem duplication and gene conversion. These mechanisms appear to be particularly favored in PMS2 and contribute to frequent genomic rearrangements in the 3' region of the gene.
Subject(s)
Hereditary Nonpolyposis Colorectal Neoplasms , Colorectal Neoplasms , Humans , Mismatch Repair Endonuclease PMS2/genetics , Colorectal Neoplasms/genetics , Pseudogenes , Hereditary Nonpolyposis Colorectal Neoplasms/genetics , Germ-Line Mutation
ABSTRACT
Pseudoalteromonas viridis strain BBR56 was isolated from seawater at Dutungan Island, South Sulawesi, Indonesia. Bacterial DNA was isolated using Promega Genomic DNA TM050. DNA purity and quantity were assessed using NanoDrop spectrophotometers and Qubit fluorometers. The DNA library and sequencing were prepared using Oxford Nanopore Technology GridION MinKNOW 20.06.9 with long read, direct, and comprehensive analysis. High accuracy base calling was assessed with Guppy version 4.0.11. Filtlong and NanoPlot were used for filtering and visualizing the FASTQ data. Flye (2.8.1) was used for de novo assembly analysis. Variant calls and consensus sequences were created using Medaka. The annotation of the genome was elaborated by DFAST. The assembled genome and annotation were tested using BUSCO and CheckM. Herein, we found that the highest similarity of the BBR56 isolate was 98.37% with the 16S rRNA gene sequence of P. viridis G-1387. The genome size was 5.5 Mb and included chromosome 1 (4.2 Mbp) and chromosome 2 (1.3 Mbp), which encoded 61 pseudogenes, 4 noncoding RNAs, 113 tRNAs, 31 rRNAs, 4,505 coding DNA sequences, 4 clustered regularly interspaced short palindromic repeats, 4,444 coding genes, and a GC content of 49.5%. The sequence of the whole genome of P. viridis BBR56 was uploaded to GenBank under the accession numbers CP072425-CP072426, biosample number SAMN18435505, and bioproject number PRJNA716373. The sequence read archive (SRR14179986) was successfully obtained from NCBI for BBR56 raw sequencing reads. Digital DNA-DNA hybridization results showed that the genome of BBR56 had the potential to be a new species because no other bacterial genomes were similar to the sample. Biosynthetic gene clusters (BGCs) were assessed using BAGEL4 and the antiSMASH bacterial version.
The genome harbored diverse BGCs, including genes that encoded polyketide synthase, nonribosomal peptide synthase, RiPP-like, NRP-metallophore, hydrogen cyanide, betalactone, thioamide-NRP, Lant class I, sactipeptide, and prodigiosin. Thus, BBR56 has considerable potential for further exploration regarding the use of its secondary metabolite products in the human and fisheries sectors.
Subject(s)
Pseudoalteromonas , Humans , Pseudoalteromonas/genetics , Pseudogenes , Gene Library , Bacterial DNA
ABSTRACT
BACKGROUND: Microbial genomes are largely comprised of protein coding sequences, yet some genomes contain many pseudogenes caused by frameshifts or internal stop codons. These pseudogenes are believed to result from gene degradation during evolution but could also be technical artifacts of genome sequencing or assembly. RESULTS: Using a combination of observational and experimental data, we show that many putative pseudogenes are attributable to errors that are incorporated into genomes during assembly. Within 126,564 publicly available genomes, we observed that nearly identical genomes often substantially differed in pseudogene counts. Causal inference implicated assembler, sequencing platform, and coverage as likely causative factors. Reassembly of genomes from raw reads confirmed that each variable affects the number of putative pseudogenes in an assembly. Furthermore, simulated sequencing reads corroborated our observations that the quality and quantity of raw data can significantly impact the number of pseudogenes in an assembler-dependent fashion. The number of unexpected pseudogenes due to internal stops was highly correlated (R² = 0.96) with average nucleotide identity to the ground truth genome, implying relative pseudogene counts can be used as a proxy for overall assembly correctness. Applying our method to assemblies in RefSeq resulted in rejection of 3.6% of assemblies due to significantly elevated pseudogene counts. Reassembly from real reads obtained from high coverage genomes showed considerable variability in spurious pseudogenes beyond that observed with simulated reads, reinforcing the finding that high coverage is necessary to mitigate assembly errors. CONCLUSIONS: Collectively, these results demonstrate that many pseudogenes in microbial genome assemblies are actually genes.
Our results suggest that high read coverage is required for correct assembly and indicate an inflated number of pseudogenes due to internal stops is indicative of poor overall assembly quality.
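To make the two pseudogene signatures named in the abstract concrete, here is a minimal sketch that flags a coding sequence carrying an in-frame internal stop codon or a length that is not a multiple of three (a crude stand-in for a frameshift). It is not the authors' method. The function names are hypothetical, and the stop codon set assumes the standard and bacterial (translation table 11) codes; organisms that reassign TGA would need a different set.

```python
from typing import List

# Assumption: stop codons of the standard and bacterial (table 11) genetic codes.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stop_positions(cds: str) -> List[int]:
    """Return 0-based codon indices of in-frame stops that precede the final codon."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    return [i for i in range(n_codons - 1) if cds[3 * i:3 * i + 3] in STOP_CODONS]


def looks_like_pseudogene(cds: str) -> bool:
    """Crude flag: length not divisible by three, or at least one internal stop."""
    return len(cds) % 3 != 0 or bool(internal_stop_positions(cds))


if __name__ == "__main__":
    for seq in ("ATGAAATAA", "ATGTAGAAATAA", "ATGAAATA"):
        print(seq, looks_like_pseudogene(seq))
```

Counting such flags per assembly and comparing the count across closely related genomes is one simple way to approximate the relative pseudogene count that the abstract proposes as a proxy for assembly quality.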