Results 1 - 20 of 255
1.
Trends Genet ; 40(7): 621-631, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38637269

ABSTRACT

Whole-genome duplications (WGDs) are widespread genomic events in eukaryotes that are hypothesized to contribute to the evolutionary success of many lineages, including flowering plants, Saccharomyces yeast, and vertebrates. WGDs generally can be classified into autopolyploids (ploidy increase descended from one species) or allopolyploids (ploidy increase descended from multiple species). Assignment of allopolyploid progenitor species (called subgenomes in the polyploid) is important to understanding the biology and evolution of polyploids, including the asymmetric subgenome evolution following hybridization (biased fractionation). Here, I review the different methodologies used to identify the ancestors of allopolyploid subgenomes, discuss the advantages and disadvantages of these methods, and outline the implications of how these methods affect the subsequent evolutionary analysis of these genomes.


Subject(s)
Molecular Evolution, Polyploidy, Phylogeny, Animals, Genome/genetics, Genomics/methods, Gene Duplication/genetics
2.
RNA ; 2024 Aug 26.
Article in English | MEDLINE | ID: mdl-39187382

ABSTRACT

SEquence Evaluation through k-mer Representation (SEEKR) is a method of sequence comparison that utilizes sequence substrings called k-mers to quantify non-linear similarity between nucleic acid species. We describe the development of new functions within SEEKR that enable end-users to estimate p values that ascribe statistical significance to SEEKR-derived similarities as well as visualize different aspects of k-mer similarity. We apply the new functions to identify chromatin-enriched lncRNAs that contain XIST-like sequence features and demonstrate the utility of applying SEEKR on lncRNA fragments to identify potential RNA-protein interaction domains. We also highlight ways in which SEEKR can be applied to augment studies of lncRNA conservation, and outline the best practice of visualizing RNA-Seq read density to evaluate support for lncRNA annotations prior to their in-depth study in cell types of interest.
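
The k-mer similarity idea in the SEEKR abstract above can be illustrated with a short sketch. It is a minimal illustration only, assuming a DNA-style alphabet, length-normalized counts, and Pearson correlation as the similarity measure. The function names `kmer_profile` and `pearson` are hypothetical and are not taken from the SEEKR package. The sketch does not attempt background normalization or the p-value estimation described in the abstract.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq: str, k: int, alphabet: str = "ACGT") -> list[float]:
    """Length-normalized counts of every possible k-mer, in a fixed order."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # windows with other symbols (e.g. N) are skipped
            counts[word] += 1
    windows = max(len(seq) - k + 1, 1)
    return [counts[w] / windows for w in kmers]


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)
```

Because the profiles ignore where each k-mer occurs, two sequences can score as similar without sharing any alignable region, which is the sense in which the abstract calls the similarity non-linear.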

3.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38344864

ABSTRACT

Bacteriophages can aid the treatment of bacterial infections, yet in-silico models are required to deal with the great genetic diversity of phages and bacteria. Although their prediction performance is tolerable, current approaches are limited to prediction at the species level and cannot accurately predict phage relationships across strain mutants. This has hindered the development of phage therapeutics based on the prediction of phage-bacteria relationships. In this paper, we present PB-LKS, which predicts phage-bacteria interactions based on a local K-mer strategy with higher performance and wider applicability. The utility of PB-LKS is rigorously validated through (i) large-scale historical screening, (ii) a case study at the class level and (iii) in vitro simulation of bacterial antiphage resistance at the strain mutant level. PB-LKS could outperform the current state-of-the-art methods and illustrates potential clinical utility in pre-optimized phage therapy design.


Subject(s)
Bacterial Infections, Bacteriophages, Humans, Bacteriophages/genetics, Bacteria/genetics
4.
Plant J ; 2024 Sep 11.
Article in English | MEDLINE | ID: mdl-39259496

ABSTRACT

Genome-wide association study (GWAS) with single nucleotide polymorphisms (SNPs) has been widely used to explore genetic controls of phenotypic traits. Alternatively, GWAS can use counts of substrings of length k from longer sequencing reads, k-mers, as genotyping data. Using maize cob and kernel color traits, we demonstrated that k-mer GWAS can effectively identify associated k-mers. Co-expression analysis of kernel color k-mers and genes directly found k-mers from known causal genes. Analyzing complex traits of kernel oil and leaf angle resulted in k-mers from both known and candidate genes. A gene encoding a MADS transcription factor was functionally validated by showing that ectopic expression of the gene led to less upright leaves. Evolution analysis revealed most k-mers positively correlated with kernel oil were strongly selected against in maize populations, while most k-mers for upright leaf angle were positively selected. In addition, genomic prediction of kernel oil, leaf angle, and flowering time using k-mer data resulted in a similarly high prediction accuracy to the standard SNP-based method. Collectively, we showed k-mer GWAS is a powerful approach for identifying trait-associated genetic elements. Further, our results demonstrated the bridging role of k-mers for data integration and functional gene discovery.

5.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37249547

ABSTRACT

Pathogen detection from biological and environmental samples is important for global disease control. Despite advances in pathogen detection using deep learning, current algorithms have limitations in processing long genomic sequences. Through the deep cross-fusion of cross, residual and deep neural networks, we developed DCiPatho for accurate pathogen detection based on the integrated frequency features of 3-to-7 k-mers. Compared with the existing state-of-the-art algorithms, DCiPatho can be used to accurately identify distinct pathogenic bacteria infecting humans, animals and plants. We evaluated DCiPatho on both learned and unlearned pathogen species using both genomics and metagenomics datasets. DCiPatho is an effective tool for the genomic-scale identification of pathogens by integrating the frequency of k-mers into deep cross-fusion networks. The source code is publicly available at https://github.com/LorMeBioAI/DCiPatho.


Subject(s)
Algorithms, Software, Humans, Computer Neural Networks, Genome, Genomics
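
The "integrated frequency features of 3-to-7 k-mers" mentioned in the DCiPatho abstract can be pictured as one concatenated vector of k-mer frequencies. The sketch below is an assumption about what such a feature vector might look like, not the DCiPatho implementation. The names `kmer_frequencies` and `multi_k_features` are hypothetical, and details such as reverse-complement handling are not addressed.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq: str, k: int) -> list[float]:
    """Relative frequency of each of the 4**k DNA k-mers, in lexicographic order."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)  # only unambiguous k-mers are counted
    return [counts[w] / total if total else 0.0 for w in words]


def multi_k_features(seq: str, k_values=(3, 4, 5, 6, 7)) -> list[float]:
    """Concatenate the frequency vectors for several k into one feature vector."""
    seq = seq.upper()
    features: list[float] = []
    for k in k_values:
        features.extend(kmer_frequencies(seq, k))
    return features
```

With k from 3 to 7 the vector has 4^3 + 4^4 + 4^5 + 4^6 + 4^7 = 21,824 entries regardless of genome length, which is one reason fixed-size k-mer features suit long sequences.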
6.
Genomics ; 116(5): 110906, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39084477

ABSTRACT

Enhancers are crucial in gene expression regulation, dictating the specificity and timing of transcriptional activity, which highlights the importance of their identification for unravelling the intricacies of genetic regulation. Therefore, it is critical to identify enhancers and their strengths. Repeated sequences in the genome are repeats of the same or symmetrical fragments, and there is a great deal of evidence that repetitive sequences contain enormous amounts of genetic information. We therefore introduce the W2V-Repeated Index, which is designed to identify enhancer sequence fragments and evaluate their strength through the analysis of repeated K-mer sequences in enhancer regions. Utilizing the word2vector algorithm for numerical conversion and Manta Ray Foraging Optimization for feature selection, this method effectively captures the frequency and distribution of K-mer sequences. By concentrating on repeated K-mer sequences, it minimizes computational complexity and facilitates the analysis of larger K values. Experiments indicate that our method performs better than all other advanced methods on almost all indicators.


Subject(s)
Algorithms, Genetic Enhancer Elements, Nucleic Acid Repetitive Sequences, Humans
7.
BMC Bioinformatics ; 25(1): 241, 2024 Jul 16.
Article in English | MEDLINE | ID: mdl-39014300

ABSTRACT

BACKGROUND: Using next-generation sequencing technologies, scientists can sequence complex microbial communities directly from the environment. Significant insights into the structure, diversity, and ecology of microbial communities have resulted from the study of metagenomics. The assembly of reads into longer contigs, which are then binned into groups of contigs that correspond to different species in the metagenomic sample, is a crucial step in the analysis of metagenomics. It is necessary to organize these contigs into operational taxonomic units (OTUs) for further taxonomic profiling and functional analysis. For binning, which is synonymous with the clustering of OTUs, the tetra-nucleotide frequency (TNF) is typically utilized as a compositional feature for each OTU. RESULTS: In this paper, we present AFIT, a new l-mer statistic vector for each contig, and AFITBin, a novel method for metagenomic binning based on AFIT and a matrix factorization method. To evaluate the performance of the AFIT vector, the t-SNE algorithm is used to compare species clustering based on AFIT and TNF information. In addition, the efficacy of AFITBin is demonstrated on both simulated and real datasets in comparison to state-of-the-art binning methods such as MetaBAT 2, MaxBin 2.0, CONCOCT, MetaCon, SolidBin, BusyBee Web, and MetaBinner. To further analyze the performance of the proposed AFIT vector, we compare the barcodes of the AFIT vector and the TNF vector. CONCLUSION: The results demonstrate that AFITBin shows superior performance in taxonomic identification compared to existing methods, leveraging the AFIT vector for improved results in metagenomic binning. This approach holds promise for advancing the analysis of metagenomic data, providing more reliable insights into microbial community composition and function. AVAILABILITY: A Python package is available at: https://github.com/SayehSobhani/AFITBin .


Subject(s)
Algorithms, Metagenomics, Metagenomics/methods, Nucleotides/genetics, High-Throughput Nucleotide Sequencing/methods, Software, Microbiota/genetics, DNA Sequence Analysis/methods, Cluster Analysis, Contig Mapping/methods, Metagenome/genetics
8.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34849572

ABSTRACT

Lactic acid bacteria consortia are commonly present in food, and some of these bacteria possess probiotic properties. However, discovery and experimental validation of probiotics require extensive time and effort. Therefore, it is of great interest to develop effective screening methods for identifying probiotics. Advances in sequencing technology have generated massive genomic data, enabling us to create a machine learning-based platform for such purpose in this work. This study first selected a comprehensive probiotics genome dataset from the probiotic database (PROBIO) and literature surveys. Then, k-mer (from 2 to 8) compositional analysis was performed, revealing diverse oligonucleotide composition in strain genomes and apparently more probiotic (P-) features in probiotic genomes than non-probiotic genomes. To reduce noise and improve computational efficiency, 87 376 k-mers were refined by an incremental feature selection (IFS) method, and the model achieved the maximum accuracy level at 184 core features, with a high prediction accuracy (97.77%) and area under the curve (98.00%). Functional genomic analysis using annotations from gene ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and Rapid Annotation using Subsystem Technology (RAST) databases, as well as analysis of genes associated with host gastrointestinal survival/settlement, carbohydrate utilization, drug resistance and virulence factors, revealed that the distribution of P-features was biased toward genes/pathways related to probiotic function. Our results suggest that the role of probiotics is not determined by a single gene, but by a combination of k-mer genomic components, providing new insights into the identification and underlying mechanisms of probiotics. This work created a novel and free online bioinformatic tool, iProbiotics, which would facilitate rapid screening for probiotics.


Subject(s)
Probiotics, Gastrointestinal Tract, Genome, Genomics/methods, Machine Learning, Probiotics/analysis
9.
BMC Cancer ; 24(1): 607, 2024 May 20.
Article in English | MEDLINE | ID: mdl-38769480

ABSTRACT

BACKGROUND: Cancerous cells' identity is determined via a mixture of multiple factors such as genomic variations, epigenetics, and the regulatory variations that are involved in transcription. The differences in transcriptome expression as well as abnormal structures in peptides determine phenotypical differences. Thus, bulk RNA-seq and more recent single-cell RNA-seq data (scRNA-seq) are important to identify pathogenic differences. In this case, we rely on k-mer decomposition of sequences to identify pathogenic variations in detail which does not need a reference, so it outperforms more traditional Next-Generation Sequencing (NGS) analysis techniques depending on the alignment of the sequences to a reference. RESULTS: Via our alignment-free analysis, over esophageal and glioblastoma cancer patients, high-frequency variations over multiple different locations (repeats, intergenic regions, exons, introns) as well as multiple different forms (fusion, polyadenylation, splicing, etc.) could be discovered. Additionally, we have analyzed the importance of less-focused events systematically in a classic transcriptome analysis pipeline where these events are considered as indicators for tumor prognosis, tumor prediction, tumor neoantigen inference, as well as their connection with respect to the immune microenvironment. CONCLUSIONS: Our results suggest that esophageal cancer (ESCA) and glioblastoma processes can be explained via pathogenic microbial RNA, repeated sequences, novel splicing variants, and long intergenic non-coding RNAs (lincRNAs). We expect our application of reference-free process and analysis to be helpful in tumor and normal samples differential scRNA-seq analysis, which in turn offers a more comprehensive scheme for major cancer-associated events.


Subject(s)
Glioblastoma, Single-Cell Analysis, Transcriptome, Humans, Single-Cell Analysis/methods, Glioblastoma/genetics, Glioblastoma/pathology, Gene Expression Profiling/methods, Esophageal Neoplasms/genetics, Esophageal Neoplasms/pathology, High-Throughput Nucleotide Sequencing, RNA-Seq/methods, RNA Sequence Analysis/methods, Neoplastic Gene Expression Regulation, Neoplasms/genetics, Neoplasms/pathology
10.
Syst Biol ; 72(5): 1101-1118, 2023 11 01.
Article in English | MEDLINE | ID: mdl-37314057

ABSTRACT

In the age of genome sequencing, whole-genome data is readily and frequently generated, leading to a wealth of new information that can be used to advance various fields of research. New approaches, such as alignment-free phylogenetic methods that utilize k-mer-based distance scoring, are becoming increasingly popular given their ability to rapidly generate phylogenetic information from whole-genome data. However, these methods have not yet been tested using environmental data, which often tends to be highly fragmented and incomplete. Here, we compare the results of one alignment-free approach (which utilizes the D2 statistic) to traditional multi-gene maximum likelihood trees in 3 algal groups that have high-quality genome data available. In addition, we simulate lower-quality, fragmented genome data using these algae to test method robustness to genome quality and completeness. Finally, we apply the alignment-free approach to environmental metagenome assembled genome data of unclassified Saccharibacteria and Trebouxiophyte algae, and single-cell amplified data from uncultured marine stramenopiles to demonstrate its utility with real datasets. We find that in all instances, the alignment-free method produces phylogenies that are comparable, and often more informative, than those created using the traditional multi-gene approach. The k-mer-based method performs well even when there are significant missing data that include marker genes traditionally used for tree reconstruction. Our results demonstrate the value of alignment-free approaches for classifying novel, often cryptic or rare, species, that may not be culturable or are difficult to access using single-cell methods, but fill important gaps in the tree of life.


Subject(s)
Genome, Metagenomics, Metagenomics/methods, Phylogeny, Base Sequence
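
The D2 statistic named in the abstract above is, in its basic form, the sum over all k-mers of the product of their counts in the two sequences. The sketch below shows only that basic form. The function names are hypothetical, the study may well use a normalized variant of D2, and the conversion of scores into distances for tree building is not attempted here.

```python
from collections import Counter


def kmer_counter(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a: str, seq_b: str, k: int) -> int:
    """Basic D2 statistic: sum over k-mers w of count_a(w) * count_b(w)."""
    counts_a, counts_b = kmer_counter(seq_a, k), kmer_counter(seq_b, k)
    if len(counts_a) > len(counts_b):  # iterate over the smaller table
        counts_a, counts_b = counts_b, counts_a
    # A Counter returns 0 for missing keys, so unshared k-mers contribute nothing.
    return sum(n * counts_b[w] for w, n in counts_a.items())
```

For example, `d2("AAAA", "AAAA", 2)` is 9, because the only 2-mer, AA, occurs three times in each sequence. No alignment is needed, so the score can still be computed from fragmented assemblies.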
11.
J Theor Biol ; 595: 111943, 2024 Sep 12.
Article in English | MEDLINE | ID: mdl-39277166

ABSTRACT

Of Chargaff's four rules on DNA base quantity, his second parity rule (PR-2) is the most contentious. Various biometricians (e.g., Sueoka, Lobry) regarded PR-2 compliance as a non-adaptive feature of modern genomes that could be modeled through interrelations among mutation rates. However, PR-2 compliance with stem-loop potential was considered adaptively relevant by biochemists familiar with analyses of nucleic acid structure (e.g., of Crick) and of meiotic recombination (e.g., of Kleckner). Meanwhile, other biometricians had shown that PR-2 complementarity extended beyond individual bases (1-mers) to oligonucleotides (k-mers), possibly reflecting "advantageous DNA structure" (Nussinov). An "introns early" hypothesis (Reanney, Forsdyke) had suggested a primordial nucleic acid world with recombination-mediated error-correction requiring genome-wide stem-loop potential to have evolved prior to localized intrusions of protein-encoding potential (exons). Thus, a primordial genome was equivalent to one long intron. Indeed, when assessed as the base order-dependent component (correcting for local influences of GC%), modern genes, especially when evolving rapidly under positive Darwinian selection, display high intronic stem-loop potential. This suggests forced migration from neighboring exons by competing protein-encoding potential. PR-2 compliance may have first arisen non-adaptively. Primary prototypic structures were later strengthened by their adaptive contribution to recombination. Thus, contentious views may actually be in harmony.

12.
Methods ; 212: 21-30, 2023 04.
Article in English | MEDLINE | ID: mdl-36813016

ABSTRACT

Long non-coding RNAs (lncRNAs) are a class of essential non-coding RNAs with a length of more than 200 nt. Recent studies have indicated that lncRNAs have various complex regulatory functions, which have a great impact on many fundamental biological processes. However, measuring the functional similarity between lncRNAs by traditional wet-lab experiments is time-consuming and labor-intensive, so computational approaches have become an effective choice to tackle this problem. Meanwhile, most sequence-based computational methods measure the functional similarity of lncRNAs with fixed-length vector representations, which cannot capture the features of larger k-mers. Therefore, it is urgent to improve the prediction of the potential regulatory functions of lncRNAs. In this study, we propose a novel approach called MFSLNC to comprehensively measure the functional similarity of lncRNAs based on variable k-mer profiles of nucleotide sequences. MFSLNC employs dictionary tree storage, which can comprehensively represent lncRNAs with long k-mers. The functional similarity between lncRNAs is evaluated by the Jaccard similarity. MFSLNC verified the similarity between two lncRNAs with the same mechanism, detecting homologous sequence pairs between human and mouse. MFSLNC is also applied to lncRNA-disease associations, combined with the association prediction model WKNKN. Moreover, by comparison with classical methods based on lncRNA-mRNA association data, we show that our method calculates the similarity of lncRNAs more effectively. The AUC of the prediction is 0.867, which is good performance compared with similar models.


Subject(s)
Long Noncoding RNA, Humans, Animals, Mice, Long Noncoding RNA/genetics, Base Sequence, Computational Biology/methods, Algorithms
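
The Jaccard similarity over variable-length k-mer profiles described in the MFSLNC abstract can be sketched as follows. This is a simplified illustration that stores k-mers in plain Python sets rather than the dictionary tree the abstract mentions. The names `kmer_set` and `jaccard` and the `k_min`/`k_max` parameters are hypothetical.

```python
def kmer_set(seq: str, k_min: int, k_max: int) -> set[str]:
    """All distinct k-mers of a sequence for every k in [k_min, k_max]."""
    return {
        seq[i:i + k]
        for k in range(k_min, k_max + 1)
        for i in range(len(seq) - k + 1)
    }


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)
```

A trie would hold the same k-mers while sharing common prefixes, which matters for memory when k is large. The set version is used here only because it keeps the similarity computation easy to read.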
13.
Int J Mol Sci ; 25(15)2024 Jul 26.
Article in English | MEDLINE | ID: mdl-39125755

ABSTRACT

The recent increase in Group A Streptococcus (GAS) incidence in several countries across Europe and some areas of the United States (U.S.) has raised concerns. To understand GAS diversity and prevalence, we conducted a local genomic surveillance in Eastern North Carolina (ENC) in 2022-2023 with 95 isolates and compared its results to those of the existing national genomic surveillance in the U.S. in 2015-2021 with 13,064 isolates. We observed their epidemiological changes before and during the COVID-19 pandemic and detected a unique sub-lineage in ENC among the most common invasive GAS strain, ST28/emm1. We further discovered a multiple-copy insertion sequence, ISLgar5, in ST399/emm77 and its single-copy variants in some other GAS strains. We discovered ISLgar5 was linked to a Tn5801-like tetM-carrying integrative and conjugative element, and its copy number was associated with an ermT-carrying pRW35-like plasmid. The dynamic insertions of ISLgar5 may play a vital role in genome fitness and adaptation, driving GAS evolution relevant to antimicrobial resistance and potentially GAS virulence.


Subject(s)
Streptococcal Infections, Streptococcus pyogenes, Streptococcus pyogenes/genetics, Streptococcus pyogenes/pathogenicity, North Carolina/epidemiology, Streptococcal Infections/epidemiology, Streptococcal Infections/microbiology, Humans, Bacterial Genome, COVID-19/epidemiology, COVID-19/virology, Genomics/methods, Phylogeny, DNA Transposable Elements/genetics, SARS-CoV-2/genetics
14.
BMC Bioinformatics ; 24(1): 261, 2023 Jun 22.
Article in English | MEDLINE | ID: mdl-37349705

ABSTRACT

BACKGROUND: Autism spectrum disorders (ASD) are a group of neurodevelopmental disorders characterized by difficulty communicating with society and others, behavioral difficulties, and a brain that processes information differently than normal. Genetics has a strong impact on ASD associated with early onset and distinctive signs. Currently, all known ASD risk genes are able to encode proteins, and some de novo mutations disrupting protein-coding genes have been demonstrated to cause ASD. Next-generation sequencing technology enables high-throughput identification of ASD risk RNAs. However, these efforts are time-consuming and expensive, so an efficient computational model for ASD risk gene prediction is necessary. RESULTS: In this study, we propose DeepASDPred, a predictor for ASD risk RNA based on deep learning. First, we use K-mers to encode the RNA transcript sequences as features, and then fuse them with the corresponding gene expression values to construct a feature matrix. After combining the chi-square test and logistic regression to select the best feature subset, we input the selected features into a binary classification model built from a convolutional neural network and long short-term memory for training and classification. The results of tenfold cross-validation showed that our method outperformed the state-of-the-art methods. The dataset and source code are freely available at https://github.com/Onebear-X/DeepASDPred. CONCLUSIONS: Our experimental results show that DeepASDPred has outstanding performance in identifying ASD risk RNA genes.


Subject(s)
Autism Spectrum Disorder, Deep Learning, Humans, Autism Spectrum Disorder/genetics, RNA/genetics, Computer Neural Networks, Software
15.
BMC Bioinformatics ; 24(1): 485, 2023 Dec 18.
Article in English | MEDLINE | ID: mdl-38110863

ABSTRACT

BACKGROUND: Numerous tools exist for biological sequence comparisons and search. One case of particular interest for immunologists is finding matches for linear peptide T cell epitopes, typically between 8 and 15 residues in length, in a large set of protein sequences, either as exact matches or as matches that account for residue substitutions. Such tools are critical in applications ranging from identifying conservation across viral epitopes and identifying putative epitope targets for allergens to finding matches for cancer-associated neoepitopes to examine the role of tolerance in tumor recognition. RESULTS: We defined a set of benchmarks that reflect the different practical applications of short peptide sequence matching. We evaluated a suite of existing methods for speed and recall and developed a new tool, PEPMatch. The tool uses a deterministic k-mer mapping algorithm that preprocesses proteomes before searching, achieving a 50-fold increase in speed over methods such as the Basic Local Alignment Search Tool (BLAST) without compromising recall. PEPMatch's code and benchmark datasets are publicly available. CONCLUSIONS: PEPMatch offers significant speed and recall advantages for peptide sequence matching. While it is of immediate utility for immunologists, the developed benchmarking framework also provides a standard against which future tools can be evaluated for improvements. The tool is available at https://nextgen-tools.iedb.org , and the source code can be found at https://github.com/IEDB/PEPMatch .


Subject(s)
Neoplasms, Software, Humans, Amino Acid Sequence, Peptides/chemistry, Algorithms, T-Lymphocyte Epitopes, Proteome
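
The general idea of preprocessing a proteome into a k-mer map and then searching it, as described in the PEPMatch abstract, can be sketched as below. This is not the PEPMatch code or API. The functions `build_index` and `find_exact` are hypothetical, and the sketch covers only exact matching, not matches with residue substitutions.

```python
from collections import defaultdict


def build_index(proteins: dict[str, str], k: int) -> dict[str, list[tuple[str, int]]]:
    """Preprocessing step: map every k-mer to the (protein, position) pairs where it occurs."""
    index: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for name, seq in proteins.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def find_exact(peptide: str, proteins: dict[str, str], index, k: int) -> list[tuple[str, int]]:
    """Look up the peptide's first k-mer, then verify the full peptide at each candidate site."""
    if len(peptide) < k:
        return []
    hits = []
    for name, pos in index.get(peptide[:k], []):
        if proteins[name].startswith(peptide, pos):
            hits.append((name, pos))
    return hits
```

The index is built once per proteome, so each query costs one dictionary lookup plus a few string comparisons instead of a scan over every protein. That is the kind of trade-off that makes preprocessing pay off when many short peptides are searched.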
16.
BMC Genomics ; 24(1): 266, 2023 May 18.
Article in English | MEDLINE | ID: mdl-37202721

ABSTRACT

BACKGROUND: The prevalence of the COVID-19 disease in recent years and its widespread impact on mortality, as well as various aspects of life around the world, has made it important to study this disease and its viral cause. However, very long sequences of this virus increase the processing time, complexity of calculation, and memory consumption required by the available tools to compare and analyze the sequences. RESULTS: We present a new encoding method, named PC-mer, based on k-mers and the physicochemical properties of nucleotides. This method minimizes the size of encoded data by around 2 k times compared to the classical k-mer based profiling method. Moreover, using PC-mer, we designed two tools: 1) a machine-learning-based classification tool for coronavirus family members with the ability to receive input sequences from the NCBI database, and 2) an alignment-free computational comparison tool for calculating dissimilarity scores between coronaviruses at the genus and species levels. CONCLUSIONS: PC-mer achieves 100% accuracy despite the use of very simple classification algorithms based on Machine Learning. Assuming dynamic programming-based pairwise alignment as the ground truth approach, we achieved a degree of convergence of more than 98% for coronavirus genus-level sequences and 93% for SARS-CoV-2 sequences using PC-mer in the alignment-free classification method. This outperformance of PC-mer suggests that it can serve as a replacement for alignment-based approaches in certain sequence analysis applications that rely on similarity/dissimilarity scores, such as searching sequences, comparing sequences, and certain types of phylogenetic analysis methods that are based on sequence comparison.


Subject(s)
COVID-19, SARS-CoV-2, Humans, SARS-CoV-2/genetics, Phylogeny, DNA Sequence Analysis, Nucleotides/genetics, Base Sequence, Algorithms
17.
BMC Genomics ; 24(1): 597, 2023 Oct 07.
Article in English | MEDLINE | ID: mdl-37805453

ABSTRACT

BACKGROUND: Transcription factors (TFs) exhibit heterogeneous DNA-binding specificities in individual cells and whole organisms under natural conditions, and de novo motif discovery usually provides multiple motifs, even from a single chromatin immunoprecipitation-sequencing (ChIP-seq) sample. Despite the accumulation of ChIP-seq data and ChIP-seq-derived motifs, the diversity of DNA-binding specificities across different TFs and cell types remains largely unexplored. RESULTS: Here, we applied MOCCS2, our k-mer-based motif discovery method, to a collection of human TF ChIP-seq samples across diverse TFs and cell types, and systematically computed profiles of TF-binding specificity scores for all k-mers. After quality control, we compiled a set of TF-binding specificity score profiles for 2,976 high-quality ChIP-seq samples, comprising 473 TFs and 398 cell types. Using these high-quality samples, we confirmed that the k-mer-based TF-binding specificity profiles reflected TF- or TF-family dependent DNA-binding specificities. We then compared the binding specificity scores of ChIP-seq samples with the same TFs but with different cell type classes and found that half of the analyzed TFs exhibited differences in DNA-binding specificities across cell type classes. Additionally, we devised a method to detect differentially bound k-mers between two ChIP-seq samples and detected k-mers exhibiting statistically significant differences in binding specificity scores. Moreover, we demonstrated that differences in the binding specificity scores between k-mers on the reference and alternative alleles could be used to predict the effect of variants on TF binding, as validated by in vitro and in vivo assay datasets. Finally, we demonstrated that binding specificity score differences can be used to interpret disease-associated non-coding single-nucleotide polymorphisms (SNPs) as TF-affecting SNPs and provide candidates responsible for TFs and cell types. 
CONCLUSIONS: Our study provides a basis for investigating the regulation of gene expression in a TF-, TF family-, or cell-type-dependent manner. Furthermore, our differential analysis of binding-specificity scores highlights noncoding disease-associated variants in humans.


Subject(s)
Single Nucleotide Polymorphism, Transcription Factors, Humans, Binding Sites/genetics, Protein Binding/genetics, Transcription Factors/genetics, Transcription Factors/metabolism, DNA/metabolism
18.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34017982

ABSTRACT

Understanding post-transcriptional gene regulation is a key challenge in today's biology. The new technologies of RNAcompete and RNA Bind-n-Seq enable the measurement of the binding intensities of one RNA-binding protein (RBP) to numerous synthetic RNA sequences in a single experiment. Recently, Van Nostrand et al. reported the results of RNA Bind-n-Seq experiments measuring binding of 78 human RBPs. Because 31 of these RBPs were also covered by RNAcompete technology, a large-scale comparison between implementations of these two in vitro technologies is now possible. Here, we assessed the similarities and differences between binding models, represented as a list of k-mer scores, inferred from RNAcompete and RNA Bind-n-Seq, and also measured how well these models predict in vivo binding. Our results show that RNA Bind-n-Seq- and RNAcompete-derived models agree (Pearson correlation > 0.5) for most RBPs (23 out of 31). RNA Bind-n-Seq-derived k-mer scores predict RNAcompete binding measurements quite well (average Pearson correlation 0.26), and both technologies produce k-mer scores that achieve comparable results in predicting in vivo binding (average AUC 0.7). When inspecting RNA structural preferences inferred from the data of RNA Bind-n-Seq and RNAcompete, we observed high concordance in binding preferences. Through our study, we developed a new k-mer score for RNA Bind-n-Seq and extended it to include RNA structural preferences.


Subject(s)
Computational Biology, Genetic Databases, Gene Expression Regulation, RNA-Binding Proteins, RNA, Binding Sites, RNA/genetics, RNA/metabolism, RNA-Binding Proteins/genetics, RNA-Binding Proteins/metabolism
19.
Brief Bioinform ; 22(2): 924-935, 2021 03 22.
Article in English | MEDLINE | ID: mdl-33003197

ABSTRACT

In this paper, we present a toolset and related resources for rapid identification of viruses and microorganisms from short-read or long-read sequencing data. We present fastv as an ultra-fast tool to detect microbial sequences present in sequencing data, identify target microorganisms and visualize coverage of microbial genomes. This tool is based on the k-mer mapping and extension method. K-mer sets are generated by UniqueKMER, another tool provided in this toolset. UniqueKMER can generate complete sets of unique k-mers for each genome within a large set of viral or microbial genomes. For convenience, unique k-mers for microorganisms and common viruses that afflict humans have been generated and are provided with the tools. As a lightweight tool, fastv accepts FASTQ data as input and directly outputs the results in both HTML and JSON formats. Prior to the k-mer analysis, fastv automatically performs adapter trimming, quality pruning, base correction and other preprocessing to ensure the accuracy of k-mer analysis. Specifically, fastv provides built-in support for rapid severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) identification and typing. Experimental results showed that fastv achieved 100% sensitivity and 100% specificity for detecting SARS-CoV-2 from sequencing data; and can distinguish SARS-CoV-2 from SARS, Middle East respiratory syndrome and other coronaviruses. This toolset is available at: https://github.com/OpenGene/fastv.


Subject(s)
SARS-CoV-2/isolation & purification, Sequence Analysis/methods, Viruses/isolation & purification, Algorithms, Viral Genes, SARS-CoV-2/genetics, Viruses/genetics
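
The notion of "unique k-mers for each genome within a large set" from the abstract above can be illustrated with a small sketch. It is a simplified illustration and not the UniqueKMER algorithm. The function name `unique_kmers` is hypothetical, and reverse complements and memory efficiency are ignored.

```python
from collections import defaultdict


def unique_kmers(genomes: dict[str, str], k: int) -> dict[str, set[str]]:
    """For each genome, return the k-mers that occur in no other genome of the set."""
    owners: dict[str, set[str]] = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(name)

    unique: dict[str, set[str]] = {name: set() for name in genomes}
    for kmer, names in owners.items():
        if len(names) == 1:  # seen in exactly one genome
            unique[next(iter(names))].add(kmer)
    return unique
```

A read that contains a k-mer from one genome's unique set is evidence for that genome and for no other in the collection. Such sets can therefore distinguish closely related viruses without a full alignment.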
20.
Brief Bioinform ; 22(3)2021 05 20.
Article in English | MEDLINE | ID: mdl-32591772

ABSTRACT

DNA repeats are abundant in eukaryotic genomes and have been proved to play a vital role in genome evolution and regulation. A large number of approaches have been proposed to identify various repeats in the genome. Some de novo repeat identification tools can efficiently generate sequence repetitive scores based on k-mer counting for repeat detection. However, we noticed that these tools can still be improved in terms of repetitive score calculation, sensitivity to segmental duplications and detection specificity. Therefore, here, we present a new computational approach named Repeat Locator (RepLoc), which is based on weighted k-mer coverage to quantify the genome sequence repetitiveness and locate the repetitive sequences. According to the repetitiveness map of the human genome generated by RepLoc, we found that there may be relationships between sequence repetitiveness and genome structures. A comprehensive benchmark shows that RepLoc is a more efficient k-mer counting based tool for de novo repeat detection. The RepLoc software is freely available at http://bis.zju.edu.cn/reploc.


Subject(s)
DNA/genetics, Nucleic Acid Repetitive Sequences, Algorithms, Human Genome, Humans, DNA Sequence Analysis/methods
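
A repetitiveness score derived from k-mer counting, as discussed in the RepLoc abstract, can be pictured with the sketch below. The abstract does not specify the weighting scheme, so this sketch simply averages, for each base, the genome-wide counts of the k-mers that cover it. That unweighted choice is an assumption made for illustration and should not be read as RepLoc's method. The function name `repetitiveness` is hypothetical.

```python
from collections import Counter


def repetitiveness(seq: str, k: int) -> list[float]:
    """Per-base score: mean occurrence count of the k-mers overlapping each position."""
    n = len(seq)
    if n < k:
        return [0.0] * n
    counts = Counter(seq[i:i + k] for i in range(n - k + 1))
    total = [0] * n  # summed counts of the k-mers covering each base
    cover = [0] * n  # number of k-mers covering each base
    for i in range(n - k + 1):
        c = counts[seq[i:i + k]]
        for j in range(i, i + k):
            total[j] += c
            cover[j] += 1
    return [t / c for t, c in zip(total, cover)]
```

Bases inside repeated regions get scores above 1 because their covering k-mers occur more than once, while bases in unique sequence score exactly 1. Thresholding such a profile is one simple way to locate candidate repeats.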