ABSTRACT
The 3D structure of chromatin plays a key role in genome function, including gene expression, DNA replication, chromosome segregation, and DNA repair. Furthermore, the location of genomic loci within the nucleus, especially relative to each other and to nuclear structures such as the nuclear envelope and nuclear bodies, strongly correlates with aspects of function such as gene expression. Therefore, determining the 3D position of the 6 billion DNA base pairs of the 23 chromosome pairs inside the nucleus of a human cell is a central challenge of biology. Recent advances in super-resolution microscopy in principle enable the mapping of specific molecular features with nanometer precision inside cells. Combined with highly specific, sensitive and multiplexed fluorescence labeling of DNA sequences, this opens up the possibility of mapping the 3D path of the genome sequence in situ. Here we develop computational methodologies to reconstruct the sequence configuration of all human chromosomes in the nucleus from a super-resolution image of a set of fluorescent in situ probes hybridized to the genome in a cell. To test our approach, we develop a method for the simulation of DNA in an idealized human nucleus. Our reconstruction method, ChromoTrace, uses suffix trees to assign a known linear ordering of in situ probes on the genome to an unknown set of 3D in situ probe positions in the nucleus from super-resolved images, using the known genomic probe spacing as a set of physical distance constraints between probes. We find that ChromoTrace can assign the 3D positions of the majority of loci with high accuracy and reasonable sensitivity to specific genome sequences. By simulating appropriate spatial resolution, label multiplexing and noise scenarios, we assess our algorithm's performance. Our study shows that it is feasible to achieve genome-wide reconstruction of the 3D DNA path based on super-resolution microscopy images.
Subject(s)
Chromatin/ultrastructure , Image Processing, Computer-Assisted/methods , Microscopy, Fluorescence/methods , Algorithms , Cell Nucleus/genetics , Chromatin/metabolism , Chromosomes/metabolism , Chromosomes/ultrastructure , Computational Biology/methods , DNA/metabolism , DNA Replication/physiology , Fluorescent Dyes/chemistry , Genome , Humans , Imaging, Three-Dimensional/methods , In Situ Hybridization, Fluorescence , Nucleic Acid Conformation
ABSTRACT
BACKGROUND: An absent word of a word y of length n is a word that does not occur in y. It is a minimal absent word if all its proper factors occur in y. Minimal absent words have been computed in genomes of organisms from all domains of life; their computation also provides a fast alternative for measuring approximation in sequence comparison. There exists an O(n)-time and O(n)-space algorithm for computing all minimal absent words on a fixed-sized alphabet based on the construction of suffix automata (Crochemore et al., 1998). No implementation of this algorithm is publicly available. There also exists an O(n^2)-time and O(n)-space algorithm for the same problem based on the construction of suffix arrays (Pinho et al., 2009). An implementation of this algorithm was provided by the authors and is currently the fastest available. RESULTS: Our contribution in this article is twofold: first, we bridge this unpleasant gap by presenting an O(n)-time and O(n)-space algorithm for computing all minimal absent words based on the construction of suffix arrays; and second, we provide the respective implementation of this algorithm. Experimental results, using real and synthetic data, show that this implementation outperforms the one by Pinho et al. The open-source code of our implementation is freely available at http://github.com/solonas13/maw . CONCLUSIONS: Classical notions for sequence comparison are increasingly being replaced by other similarity measures that refer to the composition of sequences in terms of their constituent patterns. One such measure is based on minimal absent words. In this article, we present a new linear-time and linear-space algorithm for the computation of minimal absent words based on the suffix array.
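The definition of a minimal absent word given above can be illustrated with a brute-force sketch. This is not the suffix-array algorithm of the article: it materialises every factor of y and therefore needs quadratic space, so it is suitable for short words only. The function name `minimal_absent_words` and the optional `alphabet` parameter are hypothetical and chosen for illustration. The sketch relies on the fact that all proper factors of a word w occur in y exactly when its longest proper prefix and its longest proper suffix both occur in y.

```python
def minimal_absent_words(y, alphabet=None):
    """Brute-force minimal absent words of y (illustrative only, not linear-time).

    `alphabet` defaults to the letters that occur in y (an assumption of this sketch).
    """
    sigma = set(alphabet) if alphabet is not None else set(y)
    n = len(y)
    # All factors (substrings) of y, including the empty word.
    factors = {y[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    maws = set()
    for prefix in factors:              # candidate longest proper prefix, occurs in y
        for letter in sigma:
            w = prefix + letter
            # w is absent, and its longest proper suffix occurs in y.
            if w not in factors and w[1:] in factors:
                maws.add(w)
    return maws


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab", "ab")))
```

For the word abaab over the alphabet {a, b}, the sketch returns aaa, aaba, bab and bb.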
Subject(s)
Algorithms , Computational Biology/methods , DNA/genetics , Genome , Genomics/methods , Sequence Analysis, DNA/methods , Animals , Bacteria/genetics , Eukaryota/genetics , Humans , Programming Languages
ABSTRACT
T cell receptor (TCR) repertoire diversity enables the orchestration of antigen-specific immune responses against the vast space of possible pathogens. Identifying TCR/antigen binding pairs from the large TCR repertoire and antigen space is crucial for biomedical research. Here, we introduce copepodTCR, an open-access tool for the design and interpretation of high-throughput experimental assays to determine TCR specificity. copepodTCR implements a combinatorial peptide pooling scheme for efficient experimental testing of T cell responses against large overlapping peptide libraries, useful for "deorphaning" TCRs of unknown specificity. The scheme detects experimental errors and, coupled with a hierarchical Bayesian model for unbiased results interpretation, identifies the response-eliciting peptide for a TCR of interest out of hundreds of peptides tested using a simple experimental set-up. We experimentally validated our approach on a library of 253 overlapping peptides covering the SARS-CoV-2 spike protein. We provide experimental guides for efficient design of larger screens covering thousands of peptides which will be crucial for the identification of antigen-specific T cells and their targets from limited clinical material.
ABSTRACT
Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), the causative agent of COVID-19, is constantly evolving, requiring continuous genomic surveillance. In this study, we used whole-genome sequencing to investigate the genetic epidemiology of SARS-CoV-2 in Bangladesh, with particular emphasis on identifying dominant variants and associated mutations. We used high-throughput next-generation sequencing (NGS) to obtain DNA sequences from COVID-19 patient samples and compared these sequences to the Wuhan SARS-CoV-2 reference genome using the Global Initiative for Sharing All Influenza Data (GISAID). Our phylogenetic and mutational analyses revealed that the majority (88%) of the samples belonged to the pangolin lineage B.1.1.25, whereas the remaining 11% were assigned to the parental lineage B.1.1. Two main mutations, D614G and P681R, were identified in the spike protein sequences of the samples. The D614G mutation, which is the most common, decreases S1 domain flexibility, whereas the P681R mutation may increase the severity of viral infections by increasing the binding affinity between the spike protein and the ACE2 receptor. We employed molecular modeling techniques, including protein modeling, molecular docking, and quantum mechanics/molecular mechanics (QM/MM) geometry optimization, to build and validate three-dimensional models of the S_D614G-ACE2 and S_P681R-ACE2 complexes from the predominant strains. The description of the binding mode and intermolecular contacts of these systems suggests that the P681R mutation may be associated with increased viral pathogenicity in Bangladeshi patients due to enhanced electrostatic interactions between the mutant spike protein and the human ACE2 receptor, underscoring the importance of continuous genomic surveillance in the fight against COVID-19. Finally, the binding profile of the S_D614G-ACE2 and S_P681R-ACE2 complexes offers valuable insights into the binding-site characteristics, which could help in developing antiviral therapeutics that inhibit protein-protein interactions between the SARS-CoV-2 spike protein and the human ACE2 receptor.
Subject(s)
COVID-19 , Animals , Humans , Angiotensin-Converting Enzyme 2/genetics , Angiotensin-Converting Enzyme 2/metabolism , Molecular Docking Simulation , Molecular Dynamics Simulation , Mutation , Pangolins/metabolism , Phylogeny , Protein Binding , SARS-CoV-2/genetics , SARS-CoV-2/metabolism , Spike Glycoprotein, Coronavirus/metabolism , Virulence
ABSTRACT
BACKGROUND: Unraveling the relationship between genetic variation and phenotypic traits remains a fundamental challenge in biology. Mapping variants underlying complex traits while controlling for confounding environmental factors is often problematic. To address this, we establish a vertebrate genetic resource specifically to allow for robust genotype-to-phenotype investigations. The teleost medaka (Oryzias latipes) is an established genetic model system with a long history of genetic research and a high tolerance to inbreeding from the wild. RESULTS: Here we present the Medaka Inbred Kiyosu-Karlsruhe (MIKK) panel: the first near-isogenic panel of 80 inbred lines in a vertebrate model derived from a wild founder population. Inbred lines provide fixed genomes that are a prerequisite for the replication of studies, studies which vary both the genetics and environment in a controlled manner, and functional testing. The MIKK panel will therefore enable phenotype-to-genotype association studies of complex genetic traits while allowing for careful control of interacting factors, with numerous applications in genetic research, human health, drug development, and fundamental biology. CONCLUSIONS: Here we present a detailed characterization of the genetic variation across the MIKK panel, which provides a rich and unique genetic resource to the community by enabling large-scale experiments for mapping complex traits.
Subject(s)
Oryzias , Animals , Genome , Inbreeding , Oryzias/genetics , Phenotype
ABSTRACT
BACKGROUND: The teleost medaka (Oryzias latipes) is a well-established vertebrate model system, with a long history of genetic research and multiple high-quality reference genomes available for several inbred strains. Medaka has a high tolerance to inbreeding from the wild, thus allowing one to establish inbred lines from wild founder individuals. RESULTS: We exploit this feature to create an inbred panel resource: the Medaka Inbred Kiyosu-Karlsruhe (MIKK) panel. This panel of 80 near-isogenic inbred lines contains a large amount of genetic variation inherited from the original wild population. We use Oxford Nanopore Technologies (ONT) long-read data to further investigate the genomic and epigenomic landscapes of a subset of the MIKK panel. Nanopore sequencing allows us to identify a large variety of high-quality structural variants, and we present results and methods using a pan-genome graph representation of 12 individual medaka lines. This graph-based reference MIKK panel genome reveals novel differences between the MIKK panel lines and standard linear reference genomes. We find additional MIKK panel-specific genomic content that would be missing from linear reference alignment approaches. We are also able to identify and quantify the presence of repeat elements in each of the lines. Finally, we investigate line-specific CpG methylation and perform differential DNA methylation analysis across these 12 lines. CONCLUSIONS: We present a detailed analysis of the MIKK panel genomes using long- and short-read sequencing technologies, creating a MIKK panel-specific pan-genome reference dataset that allows investigation of novel variation types that would be elusive using standard approaches.
Subject(s)
Oryzias , Animals , Epigenomics , Genome , Genomics/methods , Humans , Oryzias/genetics
ABSTRACT
OBJECTIVE: The major objective of the study was to sequence the whole genome of four Bangladeshi individuals and identify variants that are known to be associated with functional changes or disease states. We also carried out an ontology analysis to identify the functions and pathways most likely to be affected by these variants. RESULTS: We identified around 900,000 common variants and close to 5 million unique ones in all four of the individuals. These included over 11,500 variants that caused nonsynonymous changes in proteins. Heart function-associated pathways were heavily implicated by the ontology analysis, corroborating previous studies that described the Bangladeshi population as highly susceptible to heart disorders. Two variants were found that have been previously identified as pathogenic factors in familial hypercholesteremia and structural disorders of the heart. Other pathogenic variants we found were associated with pseudoxanthoma elasticum, cancer progression, polyagglutinable erythrocyte syndrome, preeclampsia, and others.
Subject(s)
Genome , Polymorphism, Single Nucleotide , Chromosome Mapping , Ethnicity , Humans
ABSTRACT
BACKGROUND: Mammalian species exhibit a wide range of lifespans. To date, a robust and dynamic molecular readout of these lifespan differences has not yet been identified. Recent studies have established the existence of ageing-associated differentially methylated positions (aDMPs) in human and mouse. These are CpG sites at which DNA methylation dynamics show significant correlations with age. We hypothesise that aDMPs are pan-mammalian and are a dynamic molecular readout of lifespan variation among different mammalian species. RESULTS: A large-scale integrated analysis of aDMPs in six different mammals reveals a strong negative relationship between rate of change of methylation levels at aDMPs and lifespan. This relationship also holds when comparing two different dog breeds with known differences in lifespans. In an ageing cohort of aneuploid mice carrying a complete copy of human chromosome 21, aDMPs accumulate far more rapidly than is seen in human tissues, revealing that DNA methylation at aDMP sites is largely shaped by the nuclear trans-environment and represents a robust molecular readout of the ageing cellular milieu. CONCLUSIONS: Overall, we define the first dynamic molecular readout of lifespan differences among mammalian species and propose that aDMPs will be an invaluable molecular tool for future evolutionary and mechanistic studies aimed at understanding the biological factors that determine lifespan in mammals.
Subject(s)
DNA Methylation , Longevity/genetics , Mammals/genetics , Aging/genetics , Animals , Dogs , Humans , Mice
ABSTRACT
This work investigates the role of isochores during the preimplantation process. Using RNA-seq data from human and mouse preimplantation stages, we created the spatio-temporal transcriptional profiles of the isochores during preimplantation. We found that from early to late stages, GC-rich isochores increase their expression while GC-poor ones decrease it. Network analysis revealed that modules with few coexpressed isochores are GC-poorer than medium-large ones, and that the two groups show opposite expression trends as preimplantation advances, decreasing and increasing respectively. Our results reveal a functional contribution of the isochores, supporting the presence of structural-functional interactions during maturation and early-embryonic development.
Subject(s)
Blastocyst/metabolism , Gene Expression Regulation, Developmental/physiology , Isochores/metabolism , Transcriptome/physiology , Animals , Humans , Mice , Species Specificity
ABSTRACT
BACKGROUND: Tandem duplication, in the context of molecular biology, occurs as a result of mutational events in which an original segment of DNA is converted into a sequence of individual copies. More formally, a repetition or tandem repeat in a string of letters consists of exact concatenations of identical factors of the string. Biologists are interested in approximate tandem repeats and not necessarily only in exact tandem repeats. A weighted sequence is a string in which a set of letters may occur at each position with respective probabilities of occurrence. It naturally arises in many biological contexts and provides a method to realise the approximation among distinct adjacent occurrences of the same DNA segment. RESULTS: Crochemore's repetitions algorithm, also referred to as Crochemore's partitioning algorithm, was introduced in 1981, and was the first optimal O(n log n)-time algorithm to compute all repetitions in a string of length n. In this article, we present a novel variant of Crochemore's partitioning algorithm for weighted sequences, which requires optimal O(n log n) time, thus improving on the best known O(n^2)-time algorithm (Zhang et al., 2013) for computing all repetitions in a weighted sequence of length n.
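The two notions defined above, an exact tandem repeat and a weighted sequence, can be made concrete with a naive sketch. It is not Crochemore's partitioning algorithm: the enumeration of squares below takes cubic time in the worst case. The names `squares` and `occurrence_probability` are hypothetical, the weighted sequence is represented as a list of letter-to-probability dictionaries, and positions are assumed to be independent, so that the probability of a word occurring at a position is the product of the per-position probabilities.

```python
from math import prod


def squares(y):
    """Return (start, period) for every square uu = y[start:start + 2*period].

    Naive enumeration for illustration; periods are listed in increasing order.
    """
    n = len(y)
    return [(i, p)
            for p in range(1, n // 2 + 1)
            for i in range(n - 2 * p + 1)
            if y[i:i + p] == y[i + p:i + 2 * p]]


def occurrence_probability(weighted, start, word):
    """Probability that `word` occurs at `start` in a weighted sequence.

    `weighted` is a list of dicts mapping letters to probabilities;
    independence between positions is an assumption of this sketch.
    """
    if start + len(word) > len(weighted):
        return 0.0
    return prod(weighted[start + k].get(c, 0.0) for k, c in enumerate(word))


if __name__ == "__main__":
    print(squares("aabab"))
    ws = [{"A": 1.0}, {"C": 0.5, "G": 0.5}, {"A": 1.0}, {"C": 0.8, "G": 0.2}]
    # Probability that the tandem repeat ACAC (period AC) occurs at position 0.
    print(occurrence_probability(ws, 0, "ACAC"))
```

In the example, the string aabab contains the squares aa at position 0 and abab at position 1, and the tandem repeat ACAC occurs in the weighted sequence with probability 0.5 x 0.8 = 0.4.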
ABSTRACT
BACKGROUND: Circular string matching is a problem which naturally arises in many biological contexts. It consists of finding all occurrences of the rotations of a pattern of length m in a text of length n. There exist optimal average-case algorithms for exact circular string matching. Approximate circular string matching is a rather undeveloped area. RESULTS: In this article, we present a suboptimal average-case algorithm for exact circular string matching requiring time O(n). Based on our solution for the exact case, we present two fast average-case algorithms for approximate circular string matching with k-mismatches, under the Hamming distance model, requiring time O(n) for moderate values of k, that is, k = O(m / log m). We show how the same results can be easily obtained under the edit distance model. The presented algorithms are also implemented as library functions. Experimental results demonstrate that the functions provided in this library accelerate the computations by more than three orders of magnitude compared to a naïve approach. CONCLUSIONS: We present two fast average-case algorithms for approximate circular string matching with k-mismatches, and show that they also perform very well in practice. The importance of our contribution is underlined by the fact that the provided functions may be seamlessly integrated into any biological pipeline. The source code of the library is freely available at http://www.inf.kcl.ac.uk/research/projects/asmf/.
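The problem statement above can be illustrated with a sketch of the naive baseline, not of the average-case O(n) algorithms presented in the article. Every rotation of the pattern is compared with every text window under the Hamming distance, which costs on the order of n·m² character comparisons. The function names are hypothetical.

```python
def rotations(pattern):
    """Set of all rotations of `pattern`."""
    return {pattern[i:] + pattern[:i] for i in range(len(pattern))}


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def circular_matches(pattern, text, k=0):
    """Start positions where some rotation of `pattern` matches `text`
    with at most k mismatches (naive baseline, for illustration only)."""
    m = len(pattern)
    rots = rotations(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        if any(hamming(r, window) <= k for r in rots):
            hits.append(i)
    return hits


if __name__ == "__main__":
    print(circular_matches("abc", "xbcaxcab"))        # exact circular matching
    print(circular_matches("abc", "xbcaxcab", k=1))   # at most one mismatch
```

With the pattern abc and the text xbcaxcab, exact matching reports positions 1 (bca) and 5 (cab); allowing one mismatch reports every window of the text.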
ABSTRACT
In this paper, we present a solution to the extreme similarity sequencing problem. The extreme similarity sequencing problem consists of finding occurrences of a pattern p in a set S(0), S(1), ..., S(k) of sequences of equal length, where S(i), for all 1≤i≤k, differs from S(0) by a constant number of errors - around 10 in practice. We present an asymptotically fast O(n + occ log occ) time algorithm, as well as a practical O(nk/w) time algorithm for solving this problem, where n is the length of a sequence, occ is the number of candidate occurrences reported by our technique, w is the size of the machine word, and the total number of errors is bounded by k - the number of sequences.
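The setting of the problem can be sketched as follows. This is an illustration only, not either of the two algorithms of the paper. It assumes that the errors are substitutions, so that each S(i) can be stored as a dictionary of differences against S(0); the abstract does not specify the error model. The pattern is searched once in S(0), and for each S(i) only the windows that overlap a difference are re-examined. All names are hypothetical.

```python
def find_all(text, pattern):
    """All start positions of `pattern` in `text` (overlaps allowed)."""
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def search_similar(s0, diffs, pattern):
    """Occurrences of `pattern` in S(0) and in each S(i) = S(0) + substitutions.

    `diffs[i - 1]` maps position -> substituted letter for S(i)
    (substitution-only errors are an assumption of this sketch).
    Returns a dict: sequence index -> sorted list of start positions.
    """
    n, m = len(s0), len(pattern)
    base = set(find_all(s0, pattern))
    result = {0: sorted(base)}
    for idx, d in enumerate(diffs, start=1):
        # Window starts whose window covers at least one substituted position.
        touched = set()
        for pos in d:
            touched.update(range(max(0, pos - m + 1), min(pos, n - m) + 1))
        hits = base - touched           # untouched windows behave as in S(0)
        for start in touched:           # re-check only the touched windows
            if all(d.get(start + j, s0[start + j]) == pattern[j] for j in range(m)):
                hits.add(start)
        result[idx] = sorted(hits)
    return result


if __name__ == "__main__":
    # S(1) = ACGTACGT (substitution at position 6), S(2) = ATGTACTT (position 1).
    print(search_similar("ACGTACTT", [{6: "G"}, {1: "T"}], "ACG"))
```

In the example, the substitution in S(1) creates a new occurrence at position 4, while the substitution in S(2) destroys the occurrence at position 0.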