ABSTRACT
Virtually all genome sequencing efforts in national biobanks, complex and Mendelian disease programs, and medical genetic initiatives are reliant upon short-read whole-genome sequencing (srWGS), which presents challenges for the detection of structural variants (SVs) relative to emerging long-read WGS (lrWGS) technologies. Given this ubiquity of srWGS in large-scale genomics initiatives, we sought to establish expectations for routine SV detection from this data type by comparison with lrWGS assembly, as well as to quantify the genomic properties and added value of SVs uniquely accessible to each technology. Analyses from the Human Genome Structural Variation Consortium (HGSVC) of three families captured ~11,000 SVs per genome from srWGS and ~25,000 SVs per genome from lrWGS assembly. Detection power and precision for SV discovery varied dramatically by genomic context and variant class: 9.7% of the current GRCh38 reference is defined by segmental duplication (SD) and simple repeat (SR), yet 91.4% of deletions that were specifically discovered by lrWGS localized to these regions. Across the remaining 90.3% of reference sequence, we observed extremely high (93.8%) concordance between technologies for deletions in these datasets. In contrast, lrWGS was superior for detection of insertions across all genomic contexts. Given that non-SD/SR sequences encompass 95.9% of currently annotated disease-associated exons, improved sensitivity from lrWGS to discover novel pathogenic deletions in these currently interpretable genomic regions is likely to be incremental. However, these analyses highlight the considerable added value of assembly-based lrWGS to create new catalogs of insertions and transposable elements, as well as disease-associated repeat expansions in genomic sequences that were previously recalcitrant to routine assessment.
Subject(s)
Human Genome/genetics , Genomic Structural Variation , Genomics/methods , Goals , Whole Genome Sequencing/methods , Whole Genome Sequencing/standards , DNA Copy Number Variations , Exons/genetics , Humans , Research Design , Genomic Segmental Duplications , Sequence Alignment
ABSTRACT
The classical genetic code maps nucleotide triplets to amino acids. The associated sequence composition is complex, representing many elaborations during evolution of form and function. Other genomic elements code for the expression and processing of RNA transcripts. However, over 50% of the human genome consists of widely dispersed repetitive sequences. Among these are simple sequence repeats (SSRs), representing a class of flipons, that, under physiological conditions, form alternative nucleic acid conformations such as Z-DNA, G4 quartets, I-motifs, and triplexes. Proteins that bind in a structure-specific manner enable the seeding of condensates with the potential to regulate a wide range of biological processes. SSRs also encode the low-complexity peptide repeats that patch condensates together, increasing the number of combinations possible. In situations where SSRs are transcribed, SSR-specific, single-stranded binding proteins may further impact condensate formation. Jointly, flipons and patches speed evolution by enhancing the functionality of condensates. Here, the focus is on the selection of SSR flipons and peptide patches that solve for survival under a wide range of environmental contexts, generating complexity with simple parts.
Subject(s)
Z-Form DNA/chemistry , Z-Form DNA/genetics , Molecular Evolution , Nucleic Acid Conformation , Proteins/chemistry , Proteins/genetics , Animals , Codon , Z-Form DNA/metabolism , Genetics , Humans , Microsatellite Repeats/genetics , Proteins/metabolism
ABSTRACT
Pre-mRNA molecules can form a variety of structures, and both secondary and tertiary structures have important effects on the processing, function and stability of these molecules. The prediction of RNA secondary structure is a challenging problem, and various algorithms based on minimum free energy, maximum expected accuracy and comparative evolutionary methods have been developed to predict secondary structures. However, these tools are not perfect, and this remains an active area of research. The secondary structure of pre-mRNA molecules can have an enhancing or inhibitory effect on pre-mRNA splicing. An example of an enhancing structure can be found in a novel class of introns in zebrafish. About 10% of zebrafish genes contain a structured intron that forms a bridging hairpin that enforces correct splice site pairing. Examples of inhibitory structures include local structures around splice sites that decrease splicing efficiency and potentially cause mis-splicing leading to disease. Splicing mutations are a frequent cause of hereditary disease. The transcripts of disease genes are significantly more structured around the splice sites, and point mutations that increase the local structure often cause splicing disruptions. Post-splicing, RNA secondary structure can also affect the stability of the spliced intron and of regulatory RNA interference pathway intermediates, such as pre-microRNAs. Additionally, RNA secondary structure has important roles in the innate immune defense against viruses. Finally, tertiary structure can also play a large role in pre-mRNA splicing. One example is the G-quadruplex structure, which, similar to secondary structure, can either enhance or inhibit splicing through mechanisms such as creating or obscuring RNA binding protein sites.
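As a rough illustration of the dynamic-programming idea behind secondary structure prediction, the sketch below counts the maximum number of nested base pairs in an RNA sequence (Nussinov-style base-pair maximization). This is a deliberately simplified stand-in and is not one of the minimum free energy or maximum expected accuracy methods mentioned above. The function name `nussinov`, the `min_loop` parameter and the set of allowed pairs (Watson-Crick plus G-U wobble) are assumptions made for this example.

```python
def nussinov(seq: str, min_loop: int = 3) -> int:
    """Return the maximum number of nested base pairs in seq.

    Assumption for illustration: a hairpin loop must contain at least
    min_loop unpaired bases, and only A-U, G-C and G-U pairs are allowed.
    """
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: position j stays unpaired
            # case 2: j pairs with some k, leaving >= min_loop bases between
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    # A short hairpin: three stem pairs closing an AAA loop
    print(nussinov("GGGAAAUCC"))
```

Counting pairs ignores stacking energies and loop penalties, which is one reason energy-based predictors usually give more realistic structures than this sketch.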
Subject(s)
Innate Immunity/genetics , Introns/genetics , RNA Folding/genetics , RNA Precursors/chemistry , RNA Splicing , Double-Stranded RNA/chemistry , Animals , Exons/genetics , G-Quadruplexes , Humans , Mutation , RNA Folding/immunology , RNA Precursors/genetics , RNA Precursors/metabolism , Double-Stranded RNA/genetics , Double-Stranded RNA/immunology , Double-Stranded RNA/metabolism , Zebrafish/genetics
ABSTRACT
A set X of 20 trinucleotides was identified in genes of bacteria, eukaryotes, plasmids and viruses, which has on average the highest occurrence in the reading frame compared with its two shifted frames (Michel, 2015; Arquès and Michel, 1996). This set X has an interesting mathematical property, as X is a circular code (Arquès and Michel, 1996). Thus, the motifs from this circular code X, called X motifs, always retrieve, synchronize and maintain the reading frame in genes. The origin of this circular code X in genes has been an open problem since its discovery in 1996. Here, we first show that the unitary circular codes (UCC), i.e. sets of one word, can generate unitary circular code motifs (UCC motifs), i.e. concatenations of the same motif (simple repeats) leading to low complexity DNA. Three classes of UCC motifs are studied here: repeated dinucleotides (D+ motifs), repeated trinucleotides (T+ motifs) and repeated tetranucleotides (T+ motifs). Thus, the dinucleotide D+, trinucleotide T+ and tetranucleotide T+ motifs make it possible to retrieve, synchronize and maintain a frame modulo 2, modulo 3 and modulo 4, respectively, and their shifted frames (1 modulo 2; 1 and 2 modulo 3; 1, 2 and 3 modulo 4 according to the C2, C3 and C4 properties, respectively) in the DNA sequences. The statistical distribution of the D+, T+ and T+ motifs is analyzed in the genomes of eukaryotes. A UCC motif and its complementary UCC motif have the same distribution in the eukaryotic genomes. Furthermore, the occurrences of a UCC motif and its complementary UCC motif increase as their number of hydrogen bonds decreases, a trend that is very significant with the T+ motifs. The longest D+, T+ and T+ motifs in the studied eukaryotic genomes are also given. Surprisingly, a scarcity of repeated trinucleotides (T+ motifs) in the large eukaryotic genomes is observed compared to the repeated dinucleotides (D+ motifs) and repeated tetranucleotides (T+ motifs). This result has been investigated and may be explained by two findings.
Repeated trinucleotides (T+ motifs) are identified in the X motifs of low composition (cardinality less than 10) in the genomes of eukaryotes. Furthermore, identical trinucleotide pairs of the circular code X are preferentially used in the gene sequences of eukaryotes. These two results suggest that the unitary circular codes of trinucleotides may have been involved in the formation of the trinucleotide circular code X. Indeed, repeated trinucleotides in the X motifs in the genomes of eukaryotes may represent an evolutionary intermediate between repeated trinucleotides of cardinality 1 (T+ motifs) in the genomes of eukaryotes and the X motifs of cardinality 20 in the gene sequences of eukaryotes.
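To make the notion of a repeated dinucleotide, trinucleotide or tetranucleotide motif concrete, the following sketch scans a DNA string for tandem runs of a single k-letter word. It is only an illustration under stated assumptions, not the procedure used in the study: the names `is_primitive` and `find_repeats` and the `min_copies` threshold are hypothetical, and the sketch assumes that a word which is itself a repetition of a shorter word (for example AA or ACAC) is excluded because it cannot fix a frame.

```python
import re
from typing import Iterator, Tuple


def is_primitive(word: str) -> bool:
    """True if word is not a repetition of a shorter word ('ACAC' is not)."""
    n = len(word)
    return not any(
        n % d == 0 and word[:d] * (n // d) == word for d in range(1, n)
    )


def find_repeats(
    seq: str, k: int, min_copies: int = 3
) -> Iterator[Tuple[int, str, int]]:
    """Yield (start, unit, copies) for tandem runs of a primitive k-letter word.

    Runs are found left to right without overlap; min_copies is an
    arbitrary threshold chosen for this example.
    """
    # ([ACGT]{k}) captures the unit; \1{min_copies-1,} demands further copies
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
    for match in pattern.finditer(seq):
        unit = match.group(1)
        if is_primitive(unit):
            yield match.start(), unit, len(match.group(0)) // k


if __name__ == "__main__":
    dna = "GGACACACTTCAGCAGCAGCAGTTAAAAAAAA"
    for k in (2, 3, 4):
        print(k, list(find_repeats(dna, k)))
```

Because the scan is left to right, a run is reported with whichever word of its cyclic class appears first; grouping words by cyclic class, or pairing each word with its reverse complement, would be natural next steps when comparing distributions.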