Results 1 - 20 of 259
1.
Trends Genet ; 40(7): 621-631, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38637269

ABSTRACT

Whole-genome duplications (WGDs) are widespread genomic events in eukaryotes that are hypothesized to contribute to the evolutionary success of many lineages, including flowering plants, Saccharomyces yeast, and vertebrates. WGDs generally can be classified into autopolyploids (ploidy increase descended from one species) or allopolyploids (ploidy increase descended from multiple species). Assignment of allopolyploid progenitor species (called subgenomes in the polyploid) is important to understanding the biology and evolution of polyploids, including the asymmetric subgenome evolution following hybridization (biased fractionation). Here, I review the different methodologies used to identify the ancestors of allopolyploid subgenomes, discuss the advantages and disadvantages of these methods, and outline the implications of how these methods affect the subsequent evolutionary analysis of these genomes.


Subjects
Evolution, Molecular , Polyploidy , Phylogeny , Animals , Genome/genetics , Genomics/methods , Gene Duplication/genetics
2.
RNA ; 30(11): 1408-1421, 2024 Oct 16.
Article in English | MEDLINE | ID: mdl-39187382

ABSTRACT

SEquence Evaluation through k-mer Representation (SEEKR) is a method of sequence comparison that uses sequence substrings called k-mers to quantify the nonlinear similarity between nucleic acid species. We describe the development of new functions within SEEKR that enable end-users to estimate P-values that ascribe statistical significance to SEEKR-derived similarities, as well as visualize different aspects of k-mer similarity. We apply the new functions to identify chromatin-enriched lncRNAs that contain XIST-like sequence features, and we demonstrate the utility of applying SEEKR on lncRNA fragments to identify potential RNA-protein interaction domains. We also highlight ways in which SEEKR can be applied to augment studies of lncRNA conservation, and we outline the best practice of visualizing RNA-seq read density to evaluate support for lncRNA annotations before their in-depth study in cell types of interest.
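
As a rough illustration of the k-mer comparison idea described in this abstract, the sketch below counts k-mers in two sequences and compares the resulting profiles with a Pearson correlation. It is not the SEEKR package or its API: the function names are hypothetical, and the standardization against a background transcript set and the P-value estimation mentioned in the abstract are left out.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k):
    """Return length-normalized counts of every possible k-mer over ACGT.

    Hypothetical helper for illustration; k-mers containing other
    characters (for example N) are skipped.
    """
    seq = seq.upper()
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


# Toy sequences, not real lncRNAs.
similarity = pearson(kmer_profile("ACGTACGTTTGA", 2), kmer_profile("TTGACGTACGAA", 2))
```

Because the profiles ignore where each k-mer occurs, two sequences can score as similar without being alignable, which is the kind of nonlinear similarity the abstract refers to.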


Subjects
RNA, Long Noncoding , RNA, Long Noncoding/genetics , Humans , Animals , Software , Sequence Analysis, RNA/methods , Algorithms , Computational Biology/methods , Chromatin/genetics , Chromatin/metabolism , Chromatin/chemistry , Mice
3.
Brief Bioinform ; 25(6)2024 Sep 23.
Article in English | MEDLINE | ID: mdl-39428128

ABSTRACT

We introduce a groundbreaking approach: the minimum free energy-based Gaussian Self-Benchmarking (MFE-GSB) framework, designed to combat the myriad biases inherent in RNA-seq data. Central to our methodology is the MFE concept, facilitating the adoption of a Gaussian distribution model tailored to effectively mitigate all co-existing biases within a k-mer counting scheme. The MFE-GSB framework operates on a sophisticated dual-model system, juxtaposing modeling data of uniform k-mer distribution against the real, observed sequencing data characterized by nonuniform k-mer distributions. The framework applies a Gaussian function, guided by the predetermined parameters (mean and SD) derived from modeling data, to fit unknown sequencing data. This dual comparison allows for the accurate prediction of k-mer abundances across MFE categories, enabling simultaneous correction of biases at the single k-mer level. Through validation with both engineered RNA constructs and human tissue RNA samples, its wide-ranging efficacy and applicability are demonstrated.


Subjects
RNA-Seq , Humans , RNA-Seq/methods , Benchmarking , Sequence Analysis, RNA/methods , RNA/chemistry , RNA/genetics , Algorithms , Normal Distribution , Computational Biology/methods , Bias
4.
Brief Bioinform ; 25(6)2024 Sep 23.
Article in English | MEDLINE | ID: mdl-39441245

ABSTRACT

Sequences derived from organisms sharing common evolutionary origins exhibit similarity, while unique sequences, absent in related organisms, act as good diagnostic marker candidates. However, the approach focused on identifying dissimilar regions among closely related organisms poses challenges as it requires complex multiple sequence alignments, making computation and parsing difficult. To address this, we have developed a biologically inspired universal NAUniSeq algorithm to find the unique sequences for microorganism diagnosis by traveling through the phylogeny of life. Mapping through a phylogenetic tree keeps cross-contamination and false positives low. We have downloaded complete taxonomy data from the Taxadb database and sequence data from the National Center for Biotechnology Information Reference Sequence Database (NCBI-Refseq) and, with the help of NetworkX, created a phylogenetic tree. Sequences were assigned over the graph nodes, k-mers were created for target and non-target nodes and a search was performed over the graph using the depth-first search algorithm. In a memory-efficient alternative NoSQL approach, we created a collection of Refseq sequences in a MongoDB database using tax-id and path of FASTA files. We queried the MongoDB collection for the target and non-target sequences. In both approaches, we used an alignment-free sliding window k-mer-based procedure that quickly compares k-mers of target and non-target sequences and returns unique sequences that are not present in the non-target. We have validated our algorithm with target nodes Mycobacterium tuberculosis, Neisseria gonorrhoeae, and Monkeypox and generated unique sequences. This universal algorithm is a powerful tool for generating diagnostic sequences, enabling the accurate identification of microbial strains with high phylogenetic precision.
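
The sliding-window comparison described in this abstract can be sketched as a set difference between target and non-target k-mers. This is a minimal illustration under stated assumptions, not the NAUniSeq code: the function names are hypothetical, the sequences are toy strings, and the phylogenetic tree traversal and MongoDB storage are omitted.

```python
def kmers(seq, k):
    """All k-mers obtained by sliding a window of width k along seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(targets, non_targets, k):
    """k-mers found in any target sequence but in no non-target sequence.

    Hypothetical helper: a real pipeline would pick the non-target
    sequences from related nodes of a taxonomy tree.
    """
    target_set = set().union(*(kmers(s, k) for s in targets))
    non_target_set = set().union(*(kmers(s, k) for s in non_targets))
    return target_set - non_target_set


# Toy example: only "ACG" is absent from the non-target sequence.
candidates = unique_kmers(["ACGTAC"], ["CGTACC"], 3)
```

Holding every non-target k-mer in memory is what makes this approach expensive at genome scale, which may be one reason the abstract also describes a database-backed alternative.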


Subjects
Algorithms , Phylogeny , Computational Biology/methods , Humans , Bacteria/genetics , Bacteria/classification , Software , Sequence Alignment , Sequence Analysis, DNA/methods
5.
Brief Bioinform ; 25(2)2024 Jan 22.
Article in English | MEDLINE | ID: mdl-38344864

ABSTRACT

Bacteriophages can help treat bacterial infections, yet in-silico models are required to deal with the great genetic diversity between phages and bacteria. Despite tolerable prediction performance, the application scope of current approaches is limited to prediction at the species level, so they cannot accurately predict the relationships of phages across strain mutants. This has hindered the development of phage therapeutics based on the prediction of phage-bacteria relationships. In this paper, we present PB-LKS, which predicts phage-bacteria interactions based on a local K-mer strategy with higher performance and wider applicability. The utility of PB-LKS is rigorously validated through (i) large-scale historical screening, (ii) a case study at the class level and (iii) in vitro simulation of bacterial antiphage resistance at the strain mutant level. The PB-LKS approach could outperform the current state-of-the-art methods and illustrate potential clinical utility in pre-optimized phage therapy design.


Subjects
Bacterial Infections , Bacteriophages , Humans , Bacteriophages/genetics , Bacteria/genetics
6.
Plant J ; 120(2): 833-850, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39259496

ABSTRACT

Genome-wide association study (GWAS) with single nucleotide polymorphisms (SNPs) has been widely used to explore genetic controls of phenotypic traits. Alternatively, GWAS can use counts of substrings of length k from longer sequencing reads, k-mers, as genotyping data. Using maize cob and kernel color traits, we demonstrated that k-mer GWAS can effectively identify associated k-mers. Co-expression analysis of kernel color k-mers and genes directly found k-mers from known causal genes. Analyzing complex traits of kernel oil and leaf angle resulted in k-mers from both known and candidate genes. A gene encoding a MADS transcription factor was functionally validated by showing that ectopic expression of the gene led to less upright leaves. Evolution analysis revealed most k-mers positively correlated with kernel oil were strongly selected against in maize populations, while most k-mers for upright leaf angle were positively selected. In addition, genomic prediction of kernel oil, leaf angle, and flowering time using k-mer data resulted in a similarly high prediction accuracy to the standard SNP-based method. Collectively, we showed k-mer GWAS is a powerful approach for identifying trait-associated genetic elements. Further, our results demonstrated the bridging role of k-mers for data integration and functional gene discovery.


Subjects
Genome-Wide Association Study , Phenotype , Polymorphism, Single Nucleotide , Zea mays , Zea mays/genetics , Quantitative Trait Loci/genetics , Plant Leaves/genetics , Genotype , Genome, Plant/genetics
7.
Brief Bioinform ; 24(4)2023 07 20.
Article in English | MEDLINE | ID: mdl-37249547

ABSTRACT

Pathogen detection from biological and environmental samples is important for global disease control. Despite advances in pathogen detection using deep learning, current algorithms have limitations in processing long genomic sequences. Through the deep cross-fusion of cross, residual and deep neural networks, we developed DCiPatho for accurate pathogen detection based on the integrated frequency features of 3-to-7 k-mers. Compared with the existing state-of-the-art algorithms, DCiPatho can be used to accurately identify distinct pathogenic bacteria infecting humans, animals and plants. We evaluated DCiPatho on both learned and unlearned pathogen species using both genomics and metagenomics datasets. DCiPatho is an effective tool for the genomic-scale identification of pathogens by integrating the frequency of k-mers into deep cross-fusion networks. The source code is publicly available at https://github.com/LorMeBioAI/DCiPatho.


Subjects
Algorithms , Software , Humans , Neural Networks, Computer , Genome , Genomics
8.
Bioinformatics ; 2024 Oct 21.
Article in English | MEDLINE | ID: mdl-39432565

ABSTRACT

MOTIVATION: Sequences equivalent to their reverse complements (i.e., double-stranded DNA) have no analogue in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g., sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: the canonical representation (k-nonical space). RESULTS: The effect of using the canonical representation with sketching methods is understudied and not understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate them. In particular, we show that large stretches of the genome ("sketching deserts") are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate these effects: (1) a new procedure that adapts existing sketching methods to k-nonical space and (2) an optimization procedure to directly design new sketching methods for k-nonical space. AVAILABILITY: The code used in this analysis is available under a permissive license at https://github.com/Kingsford-Group/mdsscope. SUPPLEMENTARY INFORMATION: Supplementary data are available at Oxford Bioinformatics.
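
The canonical representation mentioned in this abstract is commonly defined as the lexicographically smaller of a k-mer and its reverse complement. The sketch below assumes that convention; the function names are hypothetical, and the code is not taken from the mdsscope repository.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string over ACGT."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Fold a k-mer and its reverse complement into one representative.

    Assumes the common convention of taking the lexicographic minimum.
    """
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(seq, k):
    """Canonical form of each k-mer in seq, in order of occurrence."""
    return [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def canonical_counts(seq, k):
    """Strand-independent k-mer counts."""
    return Counter(canonical_kmers(seq, k))
```

With this folding, a sequence and its reverse complement give identical counts, so downstream results do not depend on which strand was sequenced. The abstract's point is that sketching schemes designed without this folding in mind can behave poorly once it is applied.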

9.
Genomics ; 116(5): 110906, 2024 Sep.
Article in English | MEDLINE | ID: mdl-39084477

ABSTRACT

Enhancers are crucial in gene expression regulation, dictating the specificity and timing of transcriptional activity, which highlights the importance of their identification for unravelling the intricacies of genetic regulation. Therefore, it is critical to identify enhancers and their strengths. Repeated sequences in the genome are repeats of the same or symmetrical fragments. There has been a great deal of evidence that repetitive sequences contain enormous amounts of genetic information. Thus, we introduce the W2V-Repeated Index, designed to identify enhancer sequence fragments and evaluate their strength through the analysis of repeated K-mer sequences in enhancer regions. Utilizing the word2vector algorithm for numerical conversion and Manta Ray Foraging Optimization for feature selection, this method effectively captures the frequency and distribution of K-mer sequences. By concentrating on repeated K-mer sequences, it minimizes computational complexity and facilitates the analysis of larger K values. Experiments indicate that our method performs better than all other advanced methods on almost all indicators.


Subjects
Algorithms , Enhancer Elements, Genetic , Repetitive Sequences, Nucleic Acid , Humans
10.
BMC Bioinformatics ; 25(1): 241, 2024 Jul 16.
Article in English | MEDLINE | ID: mdl-39014300

ABSTRACT

BACKGROUND: Using next-generation sequencing technologies, scientists can sequence complex microbial communities directly from the environment. Significant insights into the structure, diversity, and ecology of microbial communities have resulted from the study of metagenomics. The assembly of reads into longer contigs, which are then binned into groups of contigs that correspond to different species in the metagenomic sample, is a crucial step in the analysis of metagenomics. It is necessary to organize these contigs into operational taxonomic units (OTUs) for further taxonomic profiling and functional analysis. For binning, which is synonymous with the clustering of OTUs, the tetra-nucleotide frequency (TNF) is typically utilized as a compositional feature for each OTU. RESULTS: In this paper, we present AFIT, a new l-mer statistic vector for each contig, and AFITBin, a novel method for metagenomic binning based on AFIT and a matrix factorization method. To evaluate the performance of the AFIT vector, the t-SNE algorithm is used to compare species clustering based on AFIT and TNF information. In addition, the efficacy of AFITBin is demonstrated on both simulated and real datasets in comparison to state-of-the-art binning methods such as MetaBAT 2, MaxBin 2.0, CONCOCT, MetaCon, SolidBin, BusyBee Web, and MetaBinner. To further analyze the performance of the proposed AFIT vector, we compare the barcodes of the AFIT vector and the TNF vector. CONCLUSION: The results demonstrate that AFITBin shows superior performance in taxonomic identification compared to existing methods, leveraging the AFIT vector for improved results in metagenomic binning. This approach holds promise for advancing the analysis of metagenomic data, providing more reliable insights into microbial community composition and function. AVAILABILITY: A Python package is available at: https://github.com/SayehSobhani/AFITBin .


Subjects
Algorithms , Metagenomics , Metagenomics/methods , Nucleotides/genetics , High-Throughput Nucleotide Sequencing/methods , Software , Microbiota/genetics , Sequence Analysis, DNA/methods , Cluster Analysis , Contig Mapping/methods , Metagenome/genetics
11.
BMC Genomics ; 25(1): 993, 2024 Oct 23.
Article in English | MEDLINE | ID: mdl-39443845

ABSTRACT

BACKGROUND: Garuga Roxb. is a genus endemic to southwest China and other tropical regions in Southeast Asia that faces a risk of extinction due to the loss of tropical forests and changes in land use. Conducting a genome survey of G. forrestii contributes to a deeper understanding and conservation of the genus. RESULTS: A genome survey of G. forrestii generated approximately 54.56 GB of sequence data, with approximately 112× coverage. K-mer analysis indicated a genome size of approximately 0.48 GB, smaller than the 0.52 GB estimated by flow cytometry. The heterozygosity is about 0.54%, and the repeat rate is around 51.54%. All the shotgun data were assembled into 339,729 scaffolds, with an N50 of 17,344 bp. The average content of guanine and cytosine was approximately 35.16%. A total of 330,999 SSRs were detected, with mononucleotide repeats being the most abundant at 70.16%, followed by dinucleotide repeats at 20.40%. We conducted a preliminary ploidy assessment using Smudgeplot and observed a clear bimodal distribution in G. forrestii at 1/2 relative coverage depth and total coverage depth (2n), suggesting a potential diploid genome structure. A pseudo chromosome of G. forrestii and a genome of Boswellia sacra were used as reference genomes to perform a primer population resequencing analysis within three Garuga species. Principal component analysis (PCA) indicated three distinct groups, but genome-wide phylogenetic results conflicted both between the datasets from different reference genomes and between the maternal and nuclear genomes. CONCLUSION: In summary, the genome of G. forrestii is small, and the phylogenetic relationships within the Garuga genus are complex. The genetic data presented in this study hold significant value for comprehensive whole-genome analyses, the evaluation of population genetic diversity, investigations into adaptive evolution, the advancement of artificial breeding efforts, and the support of species conservation and restoration initiatives. Ultimately, this research contributes to reinforcing the conservation and management of natural ecosystems, promoting biodiversity conservation, and advancing sustainable development.
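
K-mer-based genome size estimates of the kind reported in this abstract are usually derived from a k-mer depth histogram: the total number of k-mer occurrences divided by the depth of the main peak. The sketch below shows that arithmetic on a toy histogram. The function name, the error cutoff, and the data are assumptions for illustration, and the abstract does not state which tool the authors used.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram maps depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (an assumed, simplistic cutoff).
    """
    kept = {d: n for d, n in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(kept, key=kept.get)            # most populated depth
    total_kmers = sum(d * n for d, n in kept.items())  # k-mer occurrences
    return total_kmers / peak_depth


# Toy histogram: low-depth error k-mers plus a main peak at depth 20.
toy = {1: 500, 2: 100, 19: 10, 20: 30, 21: 10}
size = estimate_genome_size(toy)  # (190 + 600 + 210) / 20
```

The same ratio applied to the figures quoted above gives a quick sanity check: 54.56 GB of data at about 112× coverage corresponds to roughly 0.49 GB, in line with the reported estimates of 0.48 GB and 0.52 GB.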


Subjects
Evolution, Molecular , Genome, Plant , Phylogeny , Microsatellite Repeats , Genome Size , Genomics/methods
12.
Brief Bioinform ; 23(1)2022 01 17.
Article in English | MEDLINE | ID: mdl-34849572

ABSTRACT

Lactic acid bacteria consortia are commonly present in food, and some of these bacteria possess probiotic properties. However, discovery and experimental validation of probiotics require extensive time and effort. Therefore, it is of great interest to develop effective screening methods for identifying probiotics. Advances in sequencing technology have generated massive genomic data, enabling us to create a machine learning-based platform for such purpose in this work. This study first selected a comprehensive probiotics genome dataset from the probiotic database (PROBIO) and literature surveys. Then, k-mer (from 2 to 8) compositional analysis was performed, revealing diverse oligonucleotide composition in strain genomes and apparently more probiotic (P-) features in probiotic genomes than non-probiotic genomes. To reduce noise and improve computational efficiency, 87 376 k-mers were refined by an incremental feature selection (IFS) method, and the model achieved the maximum accuracy level at 184 core features, with a high prediction accuracy (97.77%) and area under the curve (98.00%). Functional genomic analysis using annotations from gene ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) and Rapid Annotation using Subsystem Technology (RAST) databases, as well as analysis of genes associated with host gastrointestinal survival/settlement, carbohydrate utilization, drug resistance and virulence factors, revealed that the distribution of P-features was biased toward genes/pathways related to probiotic function. Our results suggest that the role of probiotics is not determined by a single gene, but by a combination of k-mer genomic components, providing new insights into the identification and underlying mechanisms of probiotics. This work created a novel and free online bioinformatic tool, iProbiotics, which would facilitate rapid screening for probiotics.
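
The figure of 87 376 k-mers in this abstract appears to correspond to the number of possible DNA k-mers for k from 2 to 8, that is, the sum of 4^k over that range. The short check below reproduces the arithmetic; the function name is hypothetical.

```python
def kmer_feature_count(k_min, k_max, alphabet_size=4):
    """Number of distinct k-mers over an alphabet for k_min <= k <= k_max."""
    return sum(alphabet_size ** k for k in range(k_min, k_max + 1))


# 16 + 64 + 256 + 1024 + 4096 + 16384 + 65536
n_features = kmer_feature_count(2, 8)
```

Selecting 184 core features from this space means keeping about 0.2% of the candidate k-mers.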


Subjects
Probiotics , Gastrointestinal Tract , Genome , Genomics/methods , Machine Learning , Probiotics/analysis
13.
BMC Cancer ; 24(1): 607, 2024 May 20.
Article in English | MEDLINE | ID: mdl-38769480

ABSTRACT

BACKGROUND: Cancerous cells' identity is determined via a mixture of multiple factors such as genomic variations, epigenetics, and the regulatory variations that are involved in transcription. The differences in transcriptome expression as well as abnormal structures in peptides determine phenotypical differences. Thus, bulk RNA-seq and more recent single-cell RNA-seq data (scRNA-seq) are important to identify pathogenic differences. In this case, we rely on k-mer decomposition of sequences to identify pathogenic variations in detail which does not need a reference, so it outperforms more traditional Next-Generation Sequencing (NGS) analysis techniques depending on the alignment of the sequences to a reference. RESULTS: Via our alignment-free analysis, over esophageal and glioblastoma cancer patients, high-frequency variations over multiple different locations (repeats, intergenic regions, exons, introns) as well as multiple different forms (fusion, polyadenylation, splicing, etc.) could be discovered. Additionally, we have analyzed the importance of less-focused events systematically in a classic transcriptome analysis pipeline where these events are considered as indicators for tumor prognosis, tumor prediction, tumor neoantigen inference, as well as their connection with respect to the immune microenvironment. CONCLUSIONS: Our results suggest that esophageal cancer (ESCA) and glioblastoma processes can be explained via pathogenic microbial RNA, repeated sequences, novel splicing variants, and long intergenic non-coding RNAs (lincRNAs). We expect our application of reference-free process and analysis to be helpful in tumor and normal samples differential scRNA-seq analysis, which in turn offers a more comprehensive scheme for major cancer-associated events.


Subjects
Glioblastoma , Single-Cell Analysis , Transcriptome , Humans , Single-Cell Analysis/methods , Glioblastoma/genetics , Glioblastoma/pathology , Gene Expression Profiling/methods , Esophageal Neoplasms/genetics , Esophageal Neoplasms/pathology , High-Throughput Nucleotide Sequencing , RNA-Seq/methods , Sequence Analysis, RNA/methods , Gene Expression Regulation, Neoplastic , Neoplasms/genetics , Neoplasms/pathology
14.
Syst Biol ; 72(5): 1101-1118, 2023 11 01.
Article in English | MEDLINE | ID: mdl-37314057

ABSTRACT

In the age of genome sequencing, whole-genome data is readily and frequently generated, leading to a wealth of new information that can be used to advance various fields of research. New approaches, such as alignment-free phylogenetic methods that utilize k-mer-based distance scoring, are becoming increasingly popular given their ability to rapidly generate phylogenetic information from whole-genome data. However, these methods have not yet been tested using environmental data, which often tends to be highly fragmented and incomplete. Here, we compare the results of one alignment-free approach (which utilizes the D2 statistic) to traditional multi-gene maximum likelihood trees in 3 algal groups that have high-quality genome data available. In addition, we simulate lower-quality, fragmented genome data using these algae to test method robustness to genome quality and completeness. Finally, we apply the alignment-free approach to environmental metagenome assembled genome data of unclassified Saccharibacteria and Trebouxiophyte algae, and single-cell amplified data from uncultured marine stramenopiles to demonstrate its utility with real datasets. We find that in all instances, the alignment-free method produces phylogenies that are comparable, and often more informative, than those created using the traditional multi-gene approach. The k-mer-based method performs well even when there are significant missing data that include marker genes traditionally used for tree reconstruction. Our results demonstrate the value of alignment-free approaches for classifying novel, often cryptic or rare, species, that may not be culturable or are difficult to access using single-cell methods, but fill important gaps in the tree of life.
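
In its basic form, the D2 statistic named in this abstract is the inner product of two k-mer count vectors. The sketch below implements only that plain form, under the assumption that it conveys the idea. The abstract does not say which D2 variant or normalization the authors used, and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every k-mer occurrence in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_x, seq_y, k):
    """Plain D2: sum over shared k-mers of count_in_x * count_in_y."""
    cx, cy = kmer_counts(seq_x, k), kmer_counts(seq_y, k)
    return sum(n * cy[w] for w, n in cx.items() if w in cy)


# Toy sequences sharing the 2-mers AC, CG, GT and TA once each.
score = d2("ACGTA", "CGTAC", 2)
```

Because the score needs only k-mer counts, it can still be computed when genomes are fragmented or missing marker genes, the situation the abstract evaluates. Raw D2 grows with sequence length, so practical pipelines normalize it before converting scores into distances.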


Subjects
Genome , Metagenomics , Metagenomics/methods , Phylogeny , Base Sequence
15.
J Theor Biol ; 595: 111943, 2024 Sep 12.
Article in English | MEDLINE | ID: mdl-39277166

ABSTRACT

Of Chargaff's four rules on DNA base quantity, his second parity rule (PR-2) is the most contentious. Various biometricians (e.g., Sueoka, Lobry) regarded PR-2 compliance as a non-adaptive feature of modern genomes that could be modeled through interrelations among mutation rates. However, PR-2 compliance with stem-loop potential was considered adaptively relevant by biochemists familiar with analyses of nucleic acid structure (e.g., of Crick) and of meiotic recombination (e.g., of Kleckner). Meanwhile, other biometricians had shown that PR-2 complementarity extended beyond individual bases (1-mers) to oligonucleotides (k-mers), possibly reflecting "advantageous DNA structure" (Nussinov). An "introns early" hypothesis (Reanney, Forsdyke) had suggested a primordial nucleic acid world with recombination-mediated error-correction requiring genome-wide stem-loop potential to have evolved prior to localized intrusions of protein-encoding potential (exons). Thus, a primordial genome was equivalent to one long intron. Indeed, when assessed as the base order-dependent component (correcting for local influences of GC%), modern genes, especially when evolving rapidly under positive Darwinian selection, display high intronic stem-loop potential. This suggests forced migration from neighboring exons by competing protein-encoding potential. PR-2 compliance may have first arisen non-adaptively. Primary prototypic structures were later strengthened by their adaptive contribution to recombination. Thus, contentious views may actually be in harmony.

16.
Methods ; 212: 21-30, 2023 04.
Article in English | MEDLINE | ID: mdl-36813016

ABSTRACT

Long non-coding RNAs are a class of essential non-coding RNAs with a length of more than 200 nts. Recent studies have indicated that lncRNAs have various complex regulatory functions, which greatly affect many fundamental biological processes. However, measuring the functional similarity between lncRNAs by traditional wet experiments is time-consuming and labor-intensive, so computational approaches have become an effective choice to tackle this problem. Meanwhile, most sequence-based computational methods measure the functional similarity of lncRNAs with their fixed-length vector representations, which cannot capture the features of larger k-mers. Therefore, it is urgent to improve the prediction performance for the potential regulatory functions of lncRNAs. In this study, we propose a novel approach called MFSLNC to comprehensively measure functional similarity of lncRNAs based on variable k-mer profiles of nucleotide sequences. MFSLNC employs dictionary tree storage, which can comprehensively represent lncRNAs with long k-mers. The functional similarity between lncRNAs is evaluated by the Jaccard similarity. MFSLNC verified the similarity between two lncRNAs with the same mechanism, detecting homologous sequence pairs between human and mouse. Besides, MFSLNC is also applied to lncRNA-disease associations, combined with the association prediction model WKNKN. Moreover, we also showed that our method can more effectively calculate the similarity of lncRNAs by comparing it with the classical methods based on the lncRNA-mRNA association data. The detected AUC value of prediction is 0.867, which achieves good performance in the comparison of similar models.
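
The Jaccard similarity over variable-length k-mer profiles mentioned in this abstract can be sketched with plain Python sets. This is an illustrative simplification, not MFSLNC itself: the function names are hypothetical, and a set stands in for the dictionary tree (trie) storage that the abstract describes.

```python
def variable_kmers(seq, k_min, k_max):
    """Set of all k-mers in seq for every k in [k_min, k_max]."""
    return {
        seq[i:i + k]
        for k in range(k_min, k_max + 1)
        for i in range(len(seq) - k + 1)
    }


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| (0.0 when both sets are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Toy sequences: 3 shared k-mers (AC, CG, ACG) out of 7 distinct ones.
sim = jaccard(variable_kmers("ACGT", 2, 3), variable_kmers("ACGA", 2, 3))
```

A trie would store shared prefixes once, which matters for long k-mers. The set version trades that memory saving for brevity.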


Subjects
RNA, Long Noncoding , Humans , Animals , Mice , RNA, Long Noncoding/genetics , Base Sequence , Computational Biology/methods , Algorithms
17.
Int J Mol Sci ; 25(15)2024 Jul 26.
Article in English | MEDLINE | ID: mdl-39125755

ABSTRACT

The recent increase in Group A Streptococcus (GAS) incidence in several countries across Europe and some areas of the United States (U.S.) has raised concerns. To understand GAS diversity and prevalence, we conducted a local genomic surveillance in Eastern North Carolina (ENC) in 2022-2023 with 95 isolates and compared its results to those of the existing national genomic surveillance in the U.S. in 2015-2021 with 13,064 isolates. We observed their epidemiological changes before and during the COVID-19 pandemic and detected a unique sub-lineage in ENC among the most common invasive GAS strain, ST28/emm1. We further discovered a multiple-copy insertion sequence, ISLgar5, in ST399/emm77 and its single-copy variants in some other GAS strains. We discovered ISLgar5 was linked to a Tn5801-like tetM-carrying integrative and conjugative element, and its copy number was associated with an ermT-carrying pRW35-like plasmid. The dynamic insertions of ISLgar5 may play a vital role in genome fitness and adaptation, driving GAS evolution relevant to antimicrobial resistance and potentially GAS virulence.


Subjects
Streptococcal Infections , Streptococcus pyogenes , Streptococcus pyogenes/genetics , Streptococcus pyogenes/pathogenicity , North Carolina/epidemiology , Streptococcal Infections/epidemiology , Streptococcal Infections/microbiology , Humans , Genome, Bacterial , COVID-19/epidemiology , COVID-19/virology , Genomics/methods , Phylogeny , DNA Transposable Elements/genetics , SARS-CoV-2/genetics
18.
BMC Bioinformatics ; 24(1): 261, 2023 Jun 22.
Article in English | MEDLINE | ID: mdl-37349705

ABSTRACT

BACKGROUND: Autism spectrum disorders (ASD) are a group of neurodevelopmental disorders characterized by difficulty communicating with society and others, behavioral difficulties, and a brain that processes information differently than normal. Genetics has a strong impact on ASD associated with early onset and distinctive signs. Currently, all known ASD risk genes are able to encode proteins, and some de novo mutations disrupting protein-coding genes have been demonstrated to cause ASD. Next-generation sequencing technology enables high-throughput identification of ASD risk RNAs. However, these efforts are time-consuming and expensive, so an efficient computational model for ASD risk gene prediction is necessary. RESULTS: In this study, we propose DeepASDPred, a predictor for ASD risk RNA based on deep learning. Firstly, we use k-mers to encode features of the RNA transcript sequences, and then fuse them with the corresponding gene expression values to construct a feature matrix. After combining a chi-square test and logistic regression to select the best feature subset, we input the selected features into a binary classification prediction model constructed from a convolutional neural network and long short-term memory for training and classification. The results of the tenfold cross-validation showed that our method outperformed the state-of-the-art methods. The dataset and source code are freely available at https://github.com/Onebear-X/DeepASDPred. CONCLUSIONS: Our experimental results show that DeepASDPred has outstanding performance in identifying ASD risk RNA genes.


Subjects
Autism Spectrum Disorder , Deep Learning , Humans , Autism Spectrum Disorder/genetics , RNA/genetics , Neural Networks, Computer , Software
19.
BMC Bioinformatics ; 24(1): 485, 2023 Dec 18.
Article in English | MEDLINE | ID: mdl-38110863

ABSTRACT

BACKGROUND: Numerous tools exist for biological sequence comparisons and search. One case of particular interest for immunologists is finding matches for linear peptide T cell epitopes, typically between 8 and 15 residues in length, in a large set of protein sequences, whether as exact matches or as matches that account for residue substitutions. The utility of such tools is critical in applications ranging from identifying conservation across viral epitopes and identifying putative epitope targets for allergens to finding matches for cancer-associated neoepitopes to examine the role of tolerance in tumor recognition. RESULTS: We defined a set of benchmarks that reflect the different practical applications of short peptide sequence matching. We evaluated a suite of existing methods for speed and recall and developed a new tool, PEPMatch. The tool uses a deterministic k-mer mapping algorithm that preprocesses proteomes before searching, achieving a 50-fold increase in speed over methods such as the Basic Local Alignment Search Tool (BLAST) without compromising recall. PEPMatch's code and benchmark datasets are publicly available. CONCLUSIONS: PEPMatch offers significant speed and recall advantages for peptide sequence matching. While it is of immediate utility for immunologists, the developed benchmarking framework also provides a standard against which future tools can be evaluated for improvements. The tool is available at https://nextgen-tools.iedb.org , and the source code can be found at https://github.com/IEDB/PEPMatch .
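
The idea of preprocessing a proteome into a k-mer map before searching can be sketched as follows. This is a minimal illustration under stated assumptions, not PEPMatch's implementation or API: the function names and the toy proteome are hypothetical, only exact matches are handled, and the index is an in-memory dictionary.

```python
from collections import defaultdict


def build_index(proteome, k):
    """Map each k-mer to the (protein_id, position) pairs where it starts.

    proteome is a dict of protein_id -> amino acid sequence.
    """
    index = defaultdict(list)
    for protein_id, seq in proteome.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((protein_id, pos))
    return index


def find_exact(peptide, proteome, index, k):
    """Exact occurrences of peptide, seeded by its first k-mer."""
    if len(peptide) < k:
        raise ValueError("peptide is shorter than k")
    hits = []
    for protein_id, pos in index.get(peptide[:k], []):
        # Verify the full peptide at each seeded position.
        if proteome[protein_id].startswith(peptide, pos):
            hits.append((protein_id, pos))
    return sorted(hits)


# Toy proteome with made-up sequences.
toy = {"P1": "MKTAYIAKQR", "P2": "GGTAYIAKAA"}
idx = build_index(toy, 5)
matches = find_exact("TAYIAK", toy, idx, 5)
```

The index is built once and reused for every query, so each lookup touches only the positions sharing the query's first k-mer instead of scanning all proteins. Matching with residue substitutions would need extra seeding logic that this sketch does not attempt.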


Subjects
Neoplasms , Software , Humans , Amino Acid Sequence , Peptides/chemistry , Algorithms , Epitopes, T-Lymphocyte , Proteome
20.
BMC Genomics ; 24(1): 266, 2023 May 18.
Article in English | MEDLINE | ID: mdl-37202721

ABSTRACT

BACKGROUND: The prevalence of the COVID-19 disease in recent years and its widespread impact on mortality, as well as various aspects of life around the world, has made it important to study this disease and its viral cause. However, very long sequences of this virus increase the processing time, complexity of calculation, and memory consumption required by the available tools to compare and analyze the sequences. RESULTS: We present a new encoding method, named PC-mer, based on the k-mer and physicochemical properties of nucleotides. This method minimizes the size of encoded data by around 2 k times compared to the classical k-mer based profiling method. Moreover, using PC-mer, we designed two tools: 1) a machine-learning-based classification tool for coronavirus family members with the ability to receive input sequences from the NCBI database, and 2) an alignment-free computational comparison tool for calculating dissimilarity scores between coronaviruses at the genus and species levels. CONCLUSIONS: PC-mer achieves 100% accuracy despite the use of very simple classification algorithms based on Machine Learning. Assuming dynamic programming-based pairwise alignment as the ground truth approach, we achieved a degree of convergence of more than 98% for coronavirus genus-level sequences and 93% for SARS-CoV-2 sequences using PC-mer in the alignment-free classification method. This outperformance of PC-mer suggests that it can serve as a replacement for alignment-based approaches in certain sequence analysis applications that rely on similarity/dissimilarity scores, such as searching sequences, comparing sequences, and certain types of phylogenetic analysis methods that are based on sequence comparison.


Subjects
COVID-19 , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , Phylogeny , Sequence Analysis, DNA , Nucleotides/genetics , Base Sequence , Algorithms