RESUMO
Alternative splicing (AS) is an essential post-transcriptional mechanism that regulates many biological processes. However, identifying comprehensive types of AS events without guidance from a reference genome is still a challenge. Here, we proposed a novel method, MkcDBGAS, to identify all seven types of AS events using transcriptome alone, without a reference genome. MkcDBGAS, modeled by full-length transcripts of human and Arabidopsis thaliana, consists of three modules. In the first module, MkcDBGAS, for the first time, uses a colored de Bruijn graph with dynamic- and mixed- kmers to identify bubbles generated by AS with precision higher than 98.17% and detect AS types overlooked by other tools. In the second module, to further classify types of AS, MkcDBGAS added the motifs of exons to construct the feature matrix followed by the XGBoost-based classifier with the accuracy of classification greater than 93.40%, which outperformed other widely used machine learning models and the state-of-the-art methods. Highly scalable, MkcDBGAS performed well when applied to Iso-Seq data of Amborella and transcriptome of mouse. In the third module, MkcDBGAS provides the analysis of differential splicing across multiple biological conditions when RNA-sequencing data is available. MkcDBGAS is the first accurate and scalable method for detecting all seven types of AS events using the transcriptome alone, which will greatly empower the studies of AS in a wider field.
Assuntos
Processamento Alternativo , Arabidopsis , Animais , Humanos , Camundongos , Transcriptoma , Splicing de RNA , Análise de Sequência de RNA/métodos , RNA , Arabidopsis/genética , Perfilação da Expressão Gênica/métodosRESUMO
For identification of marker-trait associations (MTAs) for complex traits in animals and plants, thousands of genome-wide association studies (GWAS) were conducted during the past two decades. This involved regular improvement in methodology. Initially, a reference genome and SNPs were used; more recently pan-genomes and the markers structural variations (SVs)/k-mers are also being used.
Assuntos
Estudo de Associação Genômica Ampla/métodos , Estudo de Associação Genômica Ampla/normas , Animais , Genoma/genética , Humanos , Fenótipo , Plantas/genética , Polimorfismo de Nucleotídeo Único/genéticaRESUMO
Knowledge of plant recognition of insects is largely limited to a few resistance (R) genes against sap-sucking insects. Hypersensitive response (HR) characterizes monogenic plant traits relying on R genes in several pathosystems. HR-like cell death can be triggered by eggs of cabbage white butterflies (Pieris spp.), pests of cabbage crops (Brassica spp.), reducing egg survival and representing an effective plant resistance trait before feeding damage occurs. Here, we performed genetic mapping of HR-like cell death induced by Pieris brassicae eggs in the black mustard Brassica nigra (B. nigra). We show that HR-like cell death segregates as a Mendelian trait and identified a single dominant locus on chromosome B3, named PEK (Pieris egg- killing). Eleven genes are located in an approximately 50 kb region, including a cluster of genes encoding intracellular TIR-NBS-LRR (TNL) receptor proteins. The PEK locus is highly polymorphic between the parental accessions of our mapping populations and among B. nigra reference genomes. Our study is the first one to identify a single locus potentially involved in HR-like cell death induced by insect eggs in B. nigra. Further fine-mapping, comparative genomics and validation of the PEK locus will shed light on the role of these TNL receptors in egg-killing HR.
Assuntos
Borboletas , Mostardeira , Animais , Mostardeira/genética , Borboletas/genética , Plantas , Mapeamento CromossômicoRESUMO
Antimicrobial peptides (AMPs) are promising candidates for new antibiotics due to their broad-spectrum activity against pathogens and reduced susceptibility to resistance development. Deep-learning techniques, such as deep generative models, offer a promising avenue to expedite the discovery and optimization of AMPs. A remarkable example is the Feedback Generative Adversarial Network (FBGAN), a deep generative model that incorporates a classifier during its training phase. Our study aims to explore the impact of enhanced classifiers on the generative capabilities of FBGAN. To this end, we introduce two alternative classifiers for the FBGAN framework, both surpassing the accuracy of the original classifier. The first classifier utilizes the k-mers technique, while the second applies transfer learning from the large protein language model Evolutionary Scale Modeling 2 (ESM2). Integrating these classifiers into FBGAN not only yields notable performance enhancements compared to the original FBGAN but also enables the proposed generative models to achieve comparable or even superior performance to established methods such as AMPGAN and HydrAMP. This achievement underscores the effectiveness of leveraging advanced classifiers within the FBGAN framework, enhancing its computational robustness for AMP de novo design and making it comparable to existing literature.
Assuntos
Peptídeos Antimicrobianos , Peptídeos Antimicrobianos/química , Peptídeos Antimicrobianos/farmacologia , Desenho de Fármacos/métodos , Redes Neurais de Computação , Aprendizado Profundo , AlgoritmosRESUMO
Early detection of human disease is associated with improved clinical outcomes. However, many diseases are often detected at an advanced, symptomatic stage where patients are past efficacious treatment periods and can result in less favorable outcomes. Therefore, methods that can accurately detect human disease at a presymptomatic stage are urgently needed. Here, we introduce "frequentmers"; short sequences that are specific and recurrently observed in either patient or healthy control samples, but not in both. We showcase the utility of frequentmers for the detection of liver cirrhosis using metagenomic Next Generation Sequencing data from stool samples of patients and controls. We develop classification models for the detection of liver cirrhosis and achieve an AUC score of 0.91 using ten-fold cross-validation. A small subset of 200 frequentmers can achieve comparable results in detecting liver cirrhosis. Finally, we identify the microbial organisms in liver cirrhosis samples, which are associated with the most predictive frequentmer biomarkers.
Assuntos
Sequenciamento de Nucleotídeos em Larga Escala , Cirrose Hepática , Humanos , Cirrose Hepática/diagnóstico , Cirrose Hepática/genética , Nível de Saúde , Metagenoma , Metagenômica , Sensibilidade e EspecificidadeRESUMO
Most living organisms rely on double-stranded DNA (dsDNA) to store their genetic information and perpetuate themselves. This biological information has been considered as the main target of evolution. However, here we show that symmetries and patterns in the dsDNA sequence can emerge from the physical peculiarities of the dsDNA molecule itself and the maximum entropy principle alone, rather than from biological or environmental evolutionary pressure. The randomness justifies the human codon biases and context-dependent mutation patterns in human populations. Thus, the DNA 'exceptional symmetries,' emerged from the randomness, have to be taken into account when looking for the DNA encoded information. Our results suggest that the double helix energy constraints and, more generally, the physical properties of the dsDNA are the hard drivers of the overall DNA sequence architecture, whereas the selective biological processes act as soft drivers, which only under extraordinary circumstances overtake the overall entropy content of the genome.
Assuntos
DNA/genética , Evolução Molecular , Análise de Sequência de DNA/métodos , HumanosRESUMO
The development of improved methods for genome-wide association studies (GWAS) for genetics of quantitative traits has been an active area of research during the last 25 years. This activity initially started with the use of mixed linear model (MLM), which was variously modified. During the last decade, however, with the availability of high throughput next generation sequencing (NGS) technology, development and use of pangenomes and novel markers including structural variations (SVs) and k-mers for GWAS has taken over as a new thrust area of research. Pangenomes and SVs are now available in humans, livestock, and a number of plant species, so that these resources along with k-mers are being used in GWAS for exploring additional genetic variation that was hitherto not available for analysis. These developments have resulted in significant improvement in GWAS methodology for detection of marker-trait associations (MTAs) that are relevant to human healthcare and crop improvement.
Assuntos
Estudo de Associação Genômica Ampla , Polimorfismo de Nucleotídeo Único , Genoma de Planta/genética , Humanos , Fenótipo , Polimorfismo de Nucleotídeo Único/genéticaRESUMO
Complex functioning of the genome in the cell nucleus is controlled at different levels: (a) the DNA base sequence containing all relevant inherited information; (b) epigenetic pathways consisting of protein interactions and feedback loops; (c) the genome architecture and organization activating or suppressing genetic interactions between different parts of the genome. Most research so far has shed light on the puzzle pieces at these levels. This article, however, attempts an integrative approach to genome expression regulation incorporating these different layers. Under environmental stress or during cell development, differentiation towards specialized cell types, or to dysfunctional tumor, the cell nucleus seems to react as a whole through coordinated changes at all levels of control. This implies the need for a framework in which biological, chemical, and physical manifestations can serve as a basis for a coherent theory of gene self-organization. An international symposium held at the Biomedical Research and Study Center in Riga, Latvia, on 25 July 2022 addressed novel aspects of the abovementioned topic. The present article reviews the most recent results and conclusions of the state-of-the-art research in this multidisciplinary field of science, which were delivered and discussed by scholars at the Riga symposium.
Assuntos
Núcleo Celular , Genoma , Núcleo Celular/metabolismo , Diferenciação Celular/genéticaRESUMO
Enhancers are small regions of DNA that bind to proteins, which enhance the transcription of genes. The enhancer may be located upstream or downstream of the gene. It is not necessarily close to the gene to be acted on, because the entanglement structure of chromatin allows the positions far apart in the sequence to have the opportunity to contact each other. Therefore, identifying enhancers and their strength is a complex and challenging task. In this article, a new prediction method based on deep learning is proposed to identify enhancers and enhancer strength, called iEnhancer-DCLA. Firstly, we use word2vec to convert k-mers into number vectors to construct an input matrix. Secondly, we use convolutional neural network and bidirectional long short-term memory network to extract sequence features, and finally use the attention mechanism to extract relatively important features. In the task of predicting enhancers and their strengths, this method has improved to a certain extent in most evaluation indexes. In summary, we believe that this method provides new ideas in the analysis of enhancers.
Assuntos
Aprendizado Profundo , Elementos Facilitadores Genéticos , Redes Neurais de Computação , Cromatina/genética , DNA/genética , DNA/químicaRESUMO
In this study, six candidate female-specific DNA sequences of octaploid Amur sturgeon (Acipenser schrenckii) were identified using comparative genomic approaches with high-throughput sequencing data. Their specificity was confirmed by traditional PCR. Two of these sex-specific sequences were also validated as female-specific in other eight sturgeon species and two hybrid sturgeons. The identified female-specific DNA fragments suggest that the family Acipenseridae has a ZZ/ZW sex-determining system. However, one of the two DNA sequences has been deleted in some sturgeons such as Sterlet sturgeon (Acipenser ruthenus), Beluga (Huso huso) and Kaluga (H. dauricus). The difference of sex-specific sequences among sturgeons indicates that there are different sex-specific regions among species of sturgeon. This study not only provided the sex-specific DNA sequences for management, conservation and studies of sex-determination mechanisms in sturgeons, but also confirmed the capability of the workflow to identify sex-specific DNA sequences in the polyploid species with complex genomes.
Assuntos
Peixes , Genoma , Animais , Sequência de Bases , Feminino , Peixes/genética , Genômica , Sequenciamento de Nucleotídeos em Larga EscalaRESUMO
BACKGROUND: The detection of genome variants, including point mutations, indels and structural variants, is a fundamental and challenging computational problem. We address here the problem of variant detection between two deep-sequencing (DNA-seq) samples, such as two human samples from an individual patient, or two samples from distinct bacterial strains. The preferred strategy in such a case is to align each sample to a common reference genome, collect all variants and compare these variants between samples. Such mapping-based protocols have several limitations. DNA sequences with large indels, aggregated mutations and structural variants are hard to map to the reference. Furthermore, DNA sequences cannot be mapped reliably to genomic low complexity regions and repeats. RESULTS: We introduce 2-kupl, a k-mer based, mapping-free protocol to detect variants between two DNA-seq samples. On simulated and actual data, 2-kupl achieves higher accuracy than other mapping-free protocols. Applying 2-kupl to prostate cancer whole exome sequencing data, we identify a number of candidate variants in hard-to-map regions and propose potential novel recurrent variants in this disease. CONCLUSIONS: We developed a mapping-free protocol for variant calling between matched DNA-seq samples. Our protocol is suitable for variant detection in unmappable genome regions or in the absence of a reference genome.
Assuntos
Genômica , Sequenciamento de Nucleotídeos em Larga Escala , DNA , Genoma Humano , Humanos , Análise de Sequência de DNARESUMO
BACKGROUND: Specialized data structures are required for online algorithms to efficiently handle large sequencing datasets. The counting quotient filter (CQF), a compact hashtable, can efficiently store k-mers with a skewed distribution. RESULT: Here, we present the mixed-counters quotient filter (MQF) as a new variant of the CQF with novel counting and labeling systems. The new counting system adapts to a wider range of data distributions for increased space efficiency and is faster than the CQF for insertions and queries in most of the tested scenarios. A buffered version of the MQF can offload storage to disk, trading speed of insertions and queries for a significant memory reduction. The labeling system provides a flexible framework for assigning labels to member items while maintaining good data locality and a concise memory representation. These labels serve as a minimal perfect hash function but are ~ tenfold faster than BBhash, with no need to re-analyze the original data for further insertions or deletions. CONCLUSIONS: The MQF is a flexible and efficient data structure that extends our ability to work with high throughput sequencing data.
Assuntos
Metadados , Software , Algoritmos , Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNARESUMO
KATK is a fast and accurate software tool for calling variants directly from raw next-generation sequencing reads. It uses predefined k-mers to retrieve only the reads of interest from the FASTQ file and calls genotypes by aligning retrieved reads locally. KATK does not use data about known polymorphisms and has NC (no call) as the default genotype. The reference or variant allele is called only if there is sufficient evidence for their presence in data. Thus it is not biased against rare variants or de-novo mutations. With simulated datasets, we achieved a false-negative rate of 0.23% (sensitivity 99.77%) and a false discovery rate of 0.19%. Calling all human exonic regions with KATK requires 1-2 h, depending on sequencing coverage.
Assuntos
Análise Mutacional de DNA/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Software , Algoritmos , Alelos , Mapeamento Cromossômico/métodos , Conjuntos de Dados como Assunto , Feminino , Genoma Humano , Genótipo , Humanos , Masculino , Polimorfismo de Nucleotídeo Único , Reprodutibilidade dos Testes , Análise de Sequência de DNA/métodosRESUMO
BACKGROUND: Repetitive sequences account for a large proportion of eukaryotes genomes. Identification of repetitive sequences plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines or tools make use of assembly of the high-frequency k-mers to obtain repeats. However, a certain degree of sequence coverage is required for assemblers to get the desired assemblies. On the other hand, assemblers cut the reads into shorter k-mers for assembly, which may destroy the structure of the repetitive regions. For the above reasons, it is difficult to obtain complete and accurate repetitive regions in the genome by using existing tools. RESULTS: In this study, we present a new method called RepAHR for de novo repeat identification by assembly of the high-frequency reads. Firstly, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Secondly, RepAHR filters the high-frequency reads from whole NGS reads according to certain rules based on the high-frequency k-mer. Finally, the high-frequency reads are assembled to generate repeats by using SPAdes, which is considered as an outstanding genome assembler with NGS sequences. CONLUSIONS: We test RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo for detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase and some other metrics.
Assuntos
Sequenciamento de Nucleotídeos em Larga Escala/métodos , Sequências Repetitivas de Ácido Nucleico/genética , Software , Animais , Sequência de Bases , Bases de Dados Genéticas , Drosophila melanogaster/genética , Biblioteca Gênica , Genoma Humano , Humanos , Camundongos , Padrões de Referência , Reprodutibilidade dos Testes , Saccharomyces cerevisiae/genética , Alinhamento de Sequência , Análise de Sequência de DNA/métodos , Estatística como Assunto , Fatores de TempoRESUMO
BACKGROUND: RNA viruses mutate at extremely high rates, forming an intra-host viral population of closely related variants, which allows them to evade the host's immune system and makes them particularly dangerous. Viral outbreaks pose a significant threat for public health, and, in order to deal with it, it is critical to infer transmission clusters, i.e., decide whether two viral samples belong to the same outbreak. Next-generation sequencing (NGS) can significantly help in tackling outbreak-related problems. While NGS data is first obtained as short reads, existing methods rely on assembled sequences. This requires reconstruction of the entire viral population, which is complicated, error-prone and time-consuming. RESULTS: The experimental validation using sequencing data from HCV outbreaks shows that the proposed algorithm can successfully identify genetic relatedness between viral populations, infer transmission direction, transmission clusters and outbreak sources, as well as decide whether the source is present in the sequenced outbreak sample and identify it. CONCLUSIONS: Introduced algorithm allows to cluster genetically related samples, infer transmission directions and predict sources of outbreaks. Validation on experimental data demonstrated that algorithm is able to reconstruct various transmission characteristics. Advantage of the method is the ability to bypass cumbersome read assembly, thus eliminating the chance to introduce new errors, and saving processing time by allowing to use raw NGS reads.
Assuntos
Hepacivirus , Vírus de RNA , Algoritmos , Surtos de Doenças , Hepacivirus/genética , Sequenciamento de Nucleotídeos em Larga EscalaRESUMO
Genetic tools are increasingly used to identify and discriminate between species. One key transition in this process was the recognition of the potential of the ca 658bp fragment of the organelle cytochrome c oxidase I (COI) as a barcode region, which revolutionized animal bioidentification and lead, among others, to the instigation of the Barcode of Life Database (BOLD), containing currently barcodes from >7.9 million specimens. Following this discovery, suggestions for other organellar regions and markers, and the primers with which to amplify them, have been continuously proposed. Most recently, the field has taken the leap from PCR-based generation of DNA references into shotgun sequencing-based "genome skimming" alternatives, with the ultimate goal of assembling organellar reference genomes. Unfortunately, in genome skimming approaches, much of the nuclear genome (as much as 99% of the sequence data) is discarded, which is not only wasteful, but can also limit the power of discrimination at, or below, the species level. Here, we advocate that the full shotgun sequence data can be used to assign an identity (that we term for convenience its "DNA-mark") for both voucher and query samples, without requiring any computationally intensive pretreatment (e.g. assembly) of reads. We argue that if reference databases are populated with such "DNA-marks," it will enable future DNA-based taxonomic identification to complement, or even replace PCR of barcodes with genome skimming, and we discuss how such methodology ultimately could enable identification to population, or even individual, level.
Assuntos
Código de Barras de DNA Taxonômico , DNA , Genômica/métodos , Animais , Primers do DNA , Bases de Dados Genéticas , Reação em Cadeia da PolimeraseRESUMO
The need for a comparative analysis of natural metagenomes stimulated the development of new methods for their taxonomic profiling. Alignment-free approaches based on the search for marker k-mers turned out to be capable of identifying not only species, but also strains of microorganisms with known genomes. Here, we evaluated the ability of genus-specific k-mers to distinguish eight phylogroups of Escherichia coli (A, B1, C, E, D, F, G, B2) and assessed the presence of their unique 22-mers in clinical samples from microbiomes of four healthy people and four patients with Crohn's disease. We found that a phylogenetic tree inferred from the pairwise distance matrix for unique 18-mers and 22-mers of 124 genomes was fully consistent with the topology of the tree, obtained with concatenated aligned sequences of orthologous genes. Therefore, we propose strain-specific "barcodes" for rapid phylotyping. Using unique 22-mers for taxonomic analysis, we detected microbes of all groups in human microbiomes; however, their presence in the five samples was significantly different. Pointing to the intraspecies heterogeneity of E. coli in the natural microflora, this also indicates the feasibility of further studies of the role of this heterogeneity in maintaining population homeostasis.
Assuntos
Doença de Crohn/genética , Código de Barras de DNA Taxonômico/métodos , Infecções por Escherichia coli/genética , Escherichia coli/genética , Genes Bacterianos , Genoma Bacteriano , Microbiota , Algoritmos , Estudos de Casos e Controles , Biologia Computacional , Doença de Crohn/microbiologia , Escherichia coli/classificação , Escherichia coli/isolamento & purificação , Infecções por Escherichia coli/microbiologia , Humanos , MetagenomaRESUMO
HIV-1 viruses, which are predominant in the family of HIV viruses, have strong pathogenicity and infectivity. They can evolve into many different variants in a very short time. In this study, we propose a new and effective alignment-free method for the phylogenetic analysis of HIV-1 viruses using complete genome sequences. Our method combines the position distribution information and the counts of the k-mers together. We also propose a metric to determine the optimal k value. We name our method the Position-Weighted k-mers (PWkmer) method. Validation and comparison with the Robinson-Foulds distance method and the modified bootstrap method on a benchmark dataset show that our method is reliable for the phylogenetic analysis of HIV-1 viruses. PWkmer can resolve within-group variations for different known subtypes of Group M of HIV-1 viruses. This method is simple and computationally fast for whole genome phylogenetic analysis.
RESUMO
BACKGROUND: Human papillomavirus (HPV) is a common sexually transmitted infection associated with cervical cancer that frequently occurs as a coinfection of types and subtypes. Highly similar sublineages that show over 100-fold differences in cancer risk are not distinguishable in coinfections with current typing methods. RESULTS: We describe an efficient set of computational tools, rkmh, for analyzing complex mixed infections of related viruses based on sequence data. rkmh makes extensive use of MinHash similarity measures, and includes utilities for removing host DNA and classifying reads by type, lineage, and sublineage. We show that rkmh is capable of assigning reads to their HPV type as well as HPV16 lineage and sublineages. CONCLUSIONS: Accurate read classification enables estimates of percent composition when there are multiple infecting lineages or sublineages. While we demonstrate rkmh for HPV with multiple sequencing technologies, it is also applicable to other mixtures of related sequences.
Assuntos
Coinfecção/diagnóstico , Coinfecção/virologia , Biologia Computacional/métodos , Papillomavirus Humano 16/fisiologia , Software , DNA Viral/genética , Papillomavirus Humano 16/classificação , Humanos , Infecções por Papillomavirus/virologia , Filogenia , Análise de Sequência de DNA , Fatores de TempoRESUMO
MOTIVATION: Sequencing technologies allow the sequencing of microbial communities directly from the environment without prior culturing. Because assembly typically produces only genome fragments, also known as contigs, it is crucial to group them into putative species for further taxonomic profiling and down-streaming functional analysis. Taxonomic analysis of microbial communities requires contig clustering, a process referred to as binning, that is still one of the most challenging tasks when analyzing metagenomic data. The major problems are the lack of taxonomically related genomes in existing reference databases, the uneven abundance ratio of species, sequencing errors, and the limitations due to binning contig of different lengths. RESULTS: In this context we present MetaCon a novel tool for unsupervised metagenomic contig binning based on probabilistic k-mers statistics and coverage. MetaCon uses a signature based on k-mers statistics that accounts for the different probability of appearance of a k-mer in different species, also contigs of different length are clustered in two separate phases. The effectiveness of MetaCon is demonstrated in both simulated and real datasets in comparison with state-of-art binning approaches such as CONCOCT, MaxBin and MetaBAT.