ABSTRACT
MOTIVATION: Structural variants (SVs) play a causal role in numerous diseases but can be difficult to detect and accurately genotype (determine zygosity) with short-read genome sequencing data (SRS). Improving SV genotyping accuracy in SRS data, particularly for the many SVs first detected with long-read sequencing, will improve our understanding of genetic variation. RESULTS: NPSV-deep is a deep learning-based approach for genotyping previously reported insertion and deletion SVs that recasts this task as an image similarity problem. NPSV-deep predicts the SV genotype based on the similarity between pileup images generated from the actual SRS data and matching SRS simulations. We show that NPSV-deep consistently matches or improves upon the state-of-the-art for SV genotyping accuracy across different SV call sets, samples and variant types, including a 25% reduction in genotyping errors for the Genome-in-a-Bottle (GIAB) high-confidence SVs. NPSV-deep is not limited to the SVs as described; it improves deletion genotyping concordance a further 1.5 percentage points for GIAB SVs (92%) by automatically correcting imprecise/incorrectly described SVs. AVAILABILITY AND IMPLEMENTATION: Python/C++ source code and pre-trained models freely available at https://github.com/mlinderm/npsv2.
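As a rough illustration of the image-similarity framing described above, the sketch below picks the genotype whose simulated pileup is closest to the real one. It assumes each pileup image has already been reduced to a numeric feature vector and uses plain Euclidean distance as a stand-in; NPSV-deep itself relies on a trained deep learning model, and the function name and toy vectors here are hypothetical.

```python
import math


def call_genotype(real_vector, simulated_vectors):
    """Return the genotype whose simulated vector is nearest to the real one."""
    return min(
        simulated_vectors,
        key=lambda genotype: math.dist(real_vector, simulated_vectors[genotype]),
    )


# Hypothetical feature vectors for one deletion SV (not real data).
simulated = {
    "0/0": (1.0, 1.0, 1.0),  # simulated homozygous reference
    "0/1": (1.0, 0.5, 1.0),  # simulated heterozygous
    "1/1": (1.0, 0.0, 1.0),  # simulated homozygous alternate
}
real = (0.95, 0.55, 1.0)
genotype = call_genotype(real, simulated)
```

In this toy case the real vector sits closest to the heterozygous simulation, so that genotype is selected.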
Subjects
Deep Learning , Humans , Genotype , Genome, Human , Software , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing , Genomic Structural Variation
ABSTRACT
Immunoglobulins (IGs), crucial components of the adaptive immune system, are encoded by three genomic loci. However, the complexity of the IG loci severely limits the effective use of short read sequencing, limiting our knowledge of population diversity in these loci. We leveraged existing long read whole-genome sequencing (WGS) data, fosmid technology, and IG targeted single-molecule, real-time (SMRT) long-read sequencing (IG-Cap) to create haplotype-resolved assemblies of the IG Lambda (IGL) locus from 6 ethnically diverse individuals. In addition, we generated 10 diploid assemblies of IGL from a diverse cohort of individuals utilizing IG-Cap. From these 16 individuals, we identified significant allelic diversity, including 36 novel IGLV alleles. In addition, we observed highly elevated single nucleotide variation (SNV) in IGLV genes relative to IGL intergenic and genomic background SNV density. By comparing SNV calls between our high quality assemblies and existing short read datasets from the same individuals, we show a high propensity for false-positives in the short read datasets. Finally, for the first time, we nucleotide-resolved common 5-10 Kb duplications in the IGLC region that contain functional IGLJ and IGLC genes. Together these data represent a significant advancement in our understanding of genetic variation and population diversity in the IGL locus.
Subjects
Genes, Immunoglobulin , Immunoglobulin lambda-Chains , Humans , Immunoglobulin lambda-Chains/genetics , Genomics , Genetic Variation , Nucleotides
ABSTRACT
The emergence of Internet of Things (IoT) technology has brought about tremendous possibilities, but at the same time, it has opened up new vulnerabilities and attack vectors that could compromise the confidentiality, integrity, and availability of connected systems. Developing a secure IoT ecosystem is a daunting challenge that requires a systematic and holistic approach to identify and mitigate potential security threats. Cybersecurity research considerations play a critical role in this regard, as they provide the foundation for designing and implementing security measures that can address emerging risks. To achieve a secure IoT ecosystem, scientists and engineers must first define rigorous security specifications that serve as the foundation for developing secure devices, chipsets, and networks. Developing such specifications requires an interdisciplinary approach that involves multiple stakeholders, including cybersecurity experts, network architects, system designers, and domain experts. The primary challenge in IoT security is ensuring the system can defend against both known and unknown attacks. To date, the IoT research community has identified several key security concerns related to the architecture of IoT systems. These concerns include issues related to connectivity, communication, and management protocols. This research paper provides a comprehensive review of the current state of anomalies and security concepts related to the IoT. We classify and analyze prevalent security concerns regarding IoT's layered architecture, including connectivity, communication, and management protocols. We establish the foundation of IoT security by examining the current attacks, threats, and cutting-edge solutions. Furthermore, we set security goals that will serve as the benchmark for assessing whether a solution satisfies the specific IoT use cases.
ABSTRACT
Predicting malware attacks on Android devices in recommender-system-based IoT using machine learning is a challenging task, but various machine-learning techniques can be applied to it. An internet-based framework is used to predict and recommend Android malware on IoT devices. As the prevalence of Android devices grows, new malware appears on a regular basis, posing a threat to the central system's security and the privacy of users. The suggested system uses static analysis to predict malware in Android apps used by consumer devices. The trained system predicts and flags malicious devices so that they can be blocked from transmitting data to the cloud server. After feature selection and a comparison of several machine-learning methods, the K-Nearest Neighbor (KNN) machine-learning model is proposed. Testing was carried out on more than 10,000 Android applications to detect malicious nodes and recommend that the cloud server block them. The developed model evaluated four machine-learning algorithms in parallel, i.e., naive Bayes, decision tree, support vector machine, and K-Nearest Neighbor, with static analysis as the feature subset selection algorithm, and it achieved the highest prediction rate of 93% for malware in real-world applications of consumer devices while minimizing energy use. The experimental results show that KNN achieves 93%, 95%, 90%, and 92% accuracy, precision, recall, and F1 score, respectively.
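To make the KNN classification and the four reported measures concrete, here is a minimal sketch. The binary feature vectors (imagined as static-analysis flags such as requested permissions), the labels, and the function names are hypothetical and are not taken from the paper; it only shows majority voting among the k nearest neighbors and how accuracy, precision, recall, and F1 are computed.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label a query by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def binary_metrics(y_true, y_pred, positive="malware"):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical training set: (feature flags, label).
train = [
    ((1, 1, 0), "malware"), ((1, 0, 1), "malware"), ((1, 1, 1), "malware"),
    ((0, 0, 0), "benign"), ((0, 1, 0), "benign"), ((0, 0, 1), "benign"),
]
prediction = knn_predict(train, (1, 1, 0))
metrics = binary_metrics(
    ["malware", "malware", "benign", "benign"],
    ["malware", "benign", "benign", "benign"],
)
```

An odd k avoids tied votes in the two-class case; a real evaluation would compute the metrics on held-out applications rather than on toy labels.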
ABSTRACT
BACKGROUND: Uncomplicated type B aortic dissection (un-TBAD) has been managed conservatively with medical therapy to control the heart rate and blood pressure to limit disease progression, in addition to radiological follow-up. However, several trials and observational studies have investigated the use of thoracic endovascular aortic repair (TEVAR) in un-TBAD and suggested that TEVAR provides a survival benefit over medical therapy. Outcomes of TEVAR have also been linked with the timing of intervention. AIMS: The scope of this review is to collate and summarize all the evidence in the literature on the mid- and long-term outcomes of TEVAR in un-TBAD, confirming its superiority. We also aimed to investigate the relationship between the timing of TEVAR intervention and results. METHODS: We carried out a comprehensive literature search on multiple electronic databases including PubMed, Scopus, and EMBASE to collate and summarize all research evidence on the mid- and long-term outcomes of TEVAR in un-TBAD, as well as its relationship with intervention timing. RESULTS: TEVAR has proven to be a safe and effective tool in un-TBAD, offering superior mid- and long-term outcomes including all-cause and aorta-related mortality, aortic-specific adverse events, aortic remodeling, and need for reintervention. Additionally, performing TEVAR during the subacute phase of dissection seems to yield optimal results. CONCLUSION: The evidence demonstrating a survival advantage in favor of TEVAR over medical therapy in un-TBAD means that with further research, particularly trials and observational studies, TEVAR could become the gold-standard treatment option for un-TBAD patients.
Subjects
Aortic Aneurysm, Thoracic , Aortic Dissection , Blood Vessel Prosthesis Implantation , Endovascular Procedures , Aortic Dissection/etiology , Aortic Aneurysm, Thoracic/etiology , Blood Vessel Prosthesis Implantation/adverse effects , Endovascular Procedures/methods , Humans , Retrospective Studies , Risk Factors , Time Factors , Treatment Outcome
ABSTRACT
The world has witnessed the terrible effects of the COVID-19 pandemic on the health and work lives of populations across the globe. It is hard to diagnose all infected people in real time, since conventional medical diagnosis of COVID-19 patients takes a couple of days to produce accurate results. In this paper, a novel learning framework is proposed for the early diagnosis of COVID-19 patients using hybrid deep fusion learning models. The proposed framework performs early classification of patients based on collected samples of chest X-ray images and Coswara cough (sound) samples of possibly infected people. The captured cough samples are pre-processed using speech signal processing techniques, and Mel frequency cepstral coefficient features are extracted using deep convolutional neural networks. Finally, the proposed system fuses the extracted features using the weighted sum-rule fusion method to provide 98.70% and 82.7% based on chest X-ray images and cough (audio) samples, respectively, for early diagnosis.
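The weighted sum rule mentioned above can be sketched as follows. This assumes score-level fusion of per-class probabilities from the two modalities; the abstract does not give the weights or the exact quantities being fused, so the weights, scores, and names below are purely illustrative.

```python
def weighted_sum_fusion(scores_by_modality, weights):
    """Fuse per-class scores from several modalities with normalized weights."""
    total_weight = sum(weights[name] for name in scores_by_modality)
    n_classes = len(next(iter(scores_by_modality.values())))
    fused = [0.0] * n_classes
    for name, scores in scores_by_modality.items():
        share = weights[name] / total_weight
        for index, score in enumerate(scores):
            fused[index] += share * score
    return fused


# Hypothetical class scores: [negative, positive].
scores = {"xray": [0.2, 0.8], "cough": [0.6, 0.4]}
weights = {"xray": 0.7, "cough": 0.3}  # assumed weights, not from the paper
fused = weighted_sum_fusion(scores, weights)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

Giving the more reliable modality a larger weight lets it dominate when the two classifiers disagree, as happens in this toy example.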
ABSTRACT
SUMMARY: While next-generation sequencing (NGS) has dramatically increased the availability of genomic data, phased genome assembly and structural variant (SV) analyses are limited by NGS read lengths. Long-read sequencing from Pacific Biosciences and NGS barcoding from 10x Genomics hold the potential for far more comprehensive views of individual genomes. Here, we present MsPAC, a tool that combines both technologies to partition reads, assemble haplotypes (via existing software) and convert assemblies into high-quality, phased SV predictions. MsPAC represents a framework for haplotype-resolved SV calls that moves one step closer to fully resolved, diploid genomes. AVAILABILITY AND IMPLEMENTATION: https://github.com/oscarlr/MsPAC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Genomics , High-Throughput Nucleotide Sequencing , Genome , Haplotypes , Sequence Analysis, DNA , Software
ABSTRACT
With the increasing penetration of ubiquitous connectivity, the amount of data describing the actions of end users has been increasing dramatically, both within the domain of the Internet of Things (IoT) and other smart devices. This has made users more aware of the need to protect personal data. Within the IoT, there is a growing number of peer-to-peer (P2P) transactions, increasing the exposure to security vulnerabilities and the risk of cyberattacks. Blockchain technology has been explored as middleware in P2P transactions, but existing solutions have mainly focused on providing a safe environment for data trade without considering potential changes in interaction topologies. We present EdgeBoT, a proof-of-concept smart-contract-based platform for the IoT built on top of the Ethereum blockchain. With the Blockchain of Things (BoT) at the edge of the network, EdgeBoT enables a wider variety of interaction topologies between nodes in the network and external services while guaranteeing ownership of data and end users' privacy. In EdgeBoT, edge devices trade their data directly with third parties without the need for intermediaries. This opens the door to new interaction modalities, in which data producers at the edge grant access to batches of their data to different third parties. Leveraging the immutability properties of blockchains, together with the distributed nature of smart contracts, data owners can audit and are aware of all transactions that have occurred with their data. We report initial results demonstrating the potential of EdgeBoT within the IoT. We show that integrating our solutions on top of existing IoT systems has a relatively small footprint in terms of computational resource usage, but a significant impact on the protection of data ownership and management of data trade.
ABSTRACT
Whole-genome sequencing (WGS) of Staphylococcus aureus is increasingly used as part of infection prevention practices. In this study, we established a long-read technology-based WGS screening program of all first-episode methicillin-resistant Staphylococcus aureus (MRSA) blood infections at a major urban hospital. A survey of 132 MRSA genomes assembled from long reads enabled detailed characterization of an outbreak lasting several months of a CC5/ST105/USA100 clone among 18 infants in a neonatal intensive care unit (NICU). Available hospital-wide genome surveillance data traced the origins of the outbreak to three patients admitted to adult wards during a 4-month period preceding the NICU outbreak. The pattern of changes among complete outbreak genomes provided full spatiotemporal resolution of its progression, which was characterized by multiple subtransmissions and likely precipitated by equipment sharing between adults and infants. Compared to other hospital strains, the outbreak strain carried distinct mutations and accessory genetic elements that impacted genes with roles in metabolism, resistance, and persistence. This included a DNA recognition domain recombination in the hsdS gene of a type I restriction modification system that altered DNA methylation. Transcriptome sequencing (RNA-Seq) profiling showed that the (epi)genetic changes in the outbreak clone attenuated agr gene expression and upregulated genes involved in stress response and biofilm formation. Overall, our findings demonstrate the utility of long-read sequencing for hospital surveillance and for characterizing accessory genomic elements that may impact MRSA virulence and persistence.
Subjects
Bacteremia/epidemiology , Cross Infection/epidemiology , Disease Outbreaks , Methicillin-Resistant Staphylococcus aureus/isolation & purification , Molecular Epidemiology/methods , Staphylococcal Infections/epidemiology , Whole Genome Sequencing/methods , Adult , Bacteremia/microbiology , Bacteremia/transmission , Cross Infection/microbiology , Cross Infection/transmission , Disease Transmission, Infectious , Genotype , Hospitals , Humans , Infant , Infant, Newborn , Intensive Care Units, Neonatal , Mass Screening/methods , Methicillin-Resistant Staphylococcus aureus/classification , Methicillin-Resistant Staphylococcus aureus/genetics , Staphylococcal Infections/microbiology , Staphylococcal Infections/transmission
ABSTRACT
The prevalence of smart devices in our day-to-day activities increases the potential threat to our secret information. To counter threats such as unauthorized access and misuse of phones, only authorized users should be able to access the device. Authentication mechanisms provide a secure way to safeguard physical resources as well as the information that is processed. Text-based passwords are the most common technique used for the authentication of devices; however, they are vulnerable to certain types of attacks such as brute-force, smudge, and shoulder-surfing attacks. Graphical Passwords (GPs) were introduced as an alternative to conventional text-based authentication to overcome these threats. GPs use pictures and have been implemented in smart devices and workstations. Psychological studies reveal that humans can recognize images much more easily and quickly than numeric and alphanumeric passwords, which became the basis for creating GPs. In this paper, a novel Fractal-Based Authentication Technique (FBAT) is proposed, implemented using a Sierpinski triangle. In the FBAT scheme, the probability of password guessing is low, making the system resilient against the abovementioned threats. Increasing the fractal level makes the system stronger and provides security against attacks like shoulder surfing.
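A small calculation shows why a higher fractal level lowers the chance of a correct guess. A Sierpinski triangle at level n contains 3^n smallest filled triangles. The sketch below assumes, hypothetically, that a password is an ordered sequence of selected triangles chosen uniformly with repetition allowed; the abstract does not describe the actual FBAT selection scheme, so this is only an illustration of the scaling.

```python
from fractions import Fraction


def sierpinski_cells(level):
    """Number of smallest filled triangles in a Sierpinski triangle at a level."""
    return 3 ** level


def guess_probability(level, clicks):
    """Chance of guessing an ordered sequence of `clicks` cells in one attempt."""
    return Fraction(1, sierpinski_cells(level) ** clicks)


# Assumed password length of 3 selections (hypothetical).
for level in (2, 3, 4):
    print(level, sierpinski_cells(level), guess_probability(level, 3))
```

Under these assumptions each extra level multiplies the number of selectable cells by three, so the single-attempt guess probability for a three-selection password falls by a factor of 27 per level.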
ABSTRACT
Therapy for bacteremia caused by Staphylococcus aureus is often ineffective, even when treatment conditions are optimal according to experimental protocols. Adapted subclones, such as those bearing mutations that attenuate agr-mediated virulence activation, are associated with persistent infection and patient mortality. To identify additional alterations in agr-defective mutants, we sequenced and assembled the complete genomes of clone pairs from colonizing and infected sites of several patients in whom S. aureus demonstrated a within-host loss of agr function. We report that events associated with agr inactivation result in agr-defective blood and nares strain pairs that are enriched in mutations compared to pairs from wild-type controls. The random distribution of mutations between colonizing and infecting strains from the same patient, and between strains from different patients, suggests that much of the genetic complexity of agr-defective strains results from prolonged infection or therapy-induced stress. However, in one of the agr-defective infecting strains, multiple genetic changes resulted in increased virulence in a murine model of bloodstream infection, bypassing the mutation of agr and raising the possibility that some changes were selected. Expression profiling correlated the elevated virulence of this agr-defective mutant to restored expression of the agr-regulated ESAT6-like type VII secretion system, a known virulence factor. Thus, additional mutations outside the agr locus can contribute to diversification and adaptation during infection by S. aureus agr mutants associated with poor patient outcomes.
Subjects
Bacterial Proteins/genetics , Genome, Bacterial , Staphylococcal Infections/microbiology , Staphylococcus aureus/genetics , Staphylococcus aureus/metabolism , Trans-Activators/genetics , Animals , Bacteremia/microbiology , Bacterial Proteins/metabolism , Female , Gene Expression Regulation, Bacterial , Humans , Mice , Mutation , Phylogeny , Staphylococcus aureus/classification , Staphylococcus aureus/pathogenicity , Trans-Activators/metabolism , Virulence
ABSTRACT
We present the first comprehensive analysis of a diploid human genome that combines single-molecule sequencing with single-molecule genome maps. Our hybrid assembly markedly improves upon the contiguity observed from traditional shotgun sequencing approaches, with scaffold N50 values approaching 30 Mb, and we identified complex structural variants (SVs) missed by other high-throughput approaches. Furthermore, by combining Illumina short-read data with long reads, we phased both single-nucleotide variants and SVs, generating haplotypes with over 99% consistency with previous trio-based studies. Our work shows that it is now possible to integrate single-molecule and high-throughput sequence data to generate de novo assembled genomes that approach reference quality.
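For readers unfamiliar with the N50 statistic quoted above: it is the length L such that scaffolds of length at least L together cover at least half of the total assembly. A minimal sketch with made-up scaffold lengths:

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths."""
    if not lengths:
        raise ValueError("need at least one scaffold length")
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


toy_scaffolds = [40, 30, 20, 5, 5]  # hypothetical lengths, arbitrary units
toy_n50 = n50(toy_scaffolds)
```

Here the two longest scaffolds (40 and 30) cover 70 of 100 units, so the N50 is 30.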
Subjects
Computational Biology/methods , Genome, Human , High-Throughput Nucleotide Sequencing/methods , Polymorphism, Single Nucleotide , Algorithms , Chromosome Mapping , Diploidy , Gene Library , Genetic Variation , Genome , Haplotypes , Humans , Nucleotides/genetics , Reproducibility of Results , Sequence Analysis, DNA , Tandem Repeat Sequences
ABSTRACT
MOTIVATION: Long arrays of near-identical tandem repeats are a common feature of centromeric and subtelomeric regions in complex genomes. These sequences present a source of repeat structure diversity that is commonly ignored by standard genomic tools. Whereas reads shorter than the underlying repeat structure require indirect inference methods, e.g. assembly, long reads allow direct inference of satellite higher order repeat structure. To automate characterization of local centromeric tandem repeat sequence variation we have designed Alpha-CENTAURI (ALPHA satellite CENTromeric AUtomated Repeat Identification), which takes advantage of Pacific Biosciences long reads from whole-genome sequencing datasets. By operating on reads prior to assembly, our approach provides a more comprehensive set of repeat-structure variants and is not impacted by rearrangements or sequence underrepresentation due to misassembly. RESULTS: We demonstrate the utility of Alpha-CENTAURI in characterizing repeat structure for alpha satellite containing reads in the hydatidiform mole (CHM1, haploid-like) genome. The pipeline is designed to report local repeat organization summaries for each read, thereby monitoring rearrangements in repeat units, shifts in repeat orientation and sites of array transition into non-satellite DNA, typically defined by transposable element insertion. We validate the method by showing consistency with existing centromere higher order repeat references. Alpha-CENTAURI can, in principle, run on any sequence data, offering a method to generate a sequence repeat resolution that could be readily performed using consensus sequences available for other satellite families in genomes without high-quality reference assemblies.
AVAILABILITY AND IMPLEMENTATION: Documentation and source code for Alpha-CENTAURI are freely available at http://github.com/volkansevim/alpha-CENTAURI CONTACT: ali.bashir@mssm.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Subjects
Centromere/genetics , Computational Biology/methods , Genomics , Sequence Analysis, DNA/methods , Tandem Repeat Sequences , Algorithms , Consensus Sequence , Female , Humans , Hydatidiform Mole/genetics , Pregnancy
ABSTRACT
OBJECTIVE: To assess the awareness of medical apps and academic use of smartphones among medical students. METHODS: The questionnaire-based descriptive cross-sectional study was conducted in January 2015 and comprised medical students of the Rawal Institute of Health Sciences, Islamabad, Pakistan. The self-designed questionnaire was reviewed by a panel of experts for content reliability and validity. Questionnaires were distributed in the classrooms and were filled in by the students anonymously. SPSS 16 was used for statistical analysis. RESULTS: Among the 569 medical students in the study, 545 (95.8%) had smartphones and 24 (4.2%) were using simple cell phones. Overall, 226 (41.46%) of the smartphone users were using some medical apps. Besides, 137 (24.08%) were aware of the medical apps but were not using them. Also, 391 (71.7%) students were not using any type of medical text eBooks through their phone, and only 154 (28.3%) had relevant text eBooks in their phones. CONCLUSIONS: Medical college students were using smartphones mostly as a means of telecommunication rather than a gadget for improving medical knowledge.
Subjects
Mobile Applications/statistics & numerical data , Smartphone/statistics & numerical data , Students, Medical , Teaching Materials , Adult , Cross-Sectional Studies , Education, Medical/methods , Education, Medical/trends , Educational Technology/methods , Educational Technology/trends , Female , Humans , Male , Pakistan , Students, Medical/psychology , Students, Medical/statistics & numerical data , Surveys and Questionnaires
ABSTRACT
Whole-genome sequences for Stenotrophomonas maltophilia serial isolates from a bacteremic patient before and after development of levofloxacin resistance were assembled de novo and differed by one single-nucleotide variant in smeT, a repressor for multidrug efflux operon smeDEF. Along with sequenced isolates from five contemporaneous cases, they displayed considerable diversity compared against all published complete genomes. Whole-genome sequencing and complete assembly can conclusively identify resistance mechanisms emerging in S. maltophilia strains during clinical therapy.
Subjects
Genome, Bacterial/genetics , Gram-Negative Bacterial Infections/microbiology , Quinolones/pharmacology , Stenotrophomonas maltophilia/immunology , DNA, Bacterial/genetics , Drug Resistance, Multiple, Bacterial/genetics , Microbial Sensitivity Tests , Mutation
ABSTRACT
MOTIVATION: Resolving tandemly repeated genomic sequences is a necessary step in improving our understanding of the human genome. Short tandem repeats (TRs), or microsatellites, are often used as molecular markers in genetics, and clinically, variation in microsatellites can lead to genetic disorders like Huntington's disease. Accurately resolving repeats, and in particular TRs, remains a challenging task in genome alignment, assembly and variation calling. Though tools have been developed for detecting microsatellites in short-read sequencing data, these are limited in the size and types of events they can resolve. Single-molecule sequencing technologies may resolve a broader spectrum of TRs given their increased read length, but their significantly higher raw error rates make accurately identifying TRs and estimating their copy number a unique challenge that requires new approaches. RESULTS: Here we present PacmonSTR, a reference-based probabilistic approach to identify the TR region and estimate the number of TR elements in long DNA reads. We present a multistep approach that requires as input a reference region and the reference TR element. Initially, the TR region is identified from the long DNA reads via a 3-stage modified Smith-Waterman approach, and then the expected number of TR elements is calculated using a pair-Hidden Markov Model-based method. Finally, TR-based genotype selection (or clustering: homozygous/heterozygous) is performed with Gaussian mixture models, using the Akaike information criterion and coverage expectations.
Subjects
Microsatellite Repeats , Sequence Analysis, DNA/methods , Genome, Human , Humans , Markov Chains , Sequence Alignment , Software
ABSTRACT
MOTIVATION: Structural variation is common in human and cancer genomes. High-throughput DNA sequencing has enabled genome-scale surveys of structural variation. However, the short reads produced by these technologies limit the study of complex variants, particularly those involving repetitive regions. Recent 'third-generation' sequencing technologies provide single-molecule templates and longer sequencing reads, but at the cost of higher per-nucleotide error rates. RESULTS: We present MultiBreak-SV, an algorithm to detect structural variants (SVs) from single molecule sequencing data, paired read sequencing data, or a combination of sequencing data from different platforms. We demonstrate that combining low-coverage third-generation data from Pacific Biosciences (PacBio) with high-coverage paired read data is advantageous on simulated chromosomes. We apply MultiBreak-SV to PacBio data from four human fosmids and show that it detects known SVs with high sensitivity and specificity. Finally, we perform a whole-genome analysis on PacBio data from a complete hydatidiform mole cell line and predict 1002 high-probability SVs, over half of which are confirmed by an Illumina-based assembly.
Subjects
Algorithms , Genomic Structural Variation , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Genomics/methods , Humans , Repetitive Sequences, Nucleic Acid , Sequence Deletion
ABSTRACT
PURPOSE: Health-care professionals need to be trained to work with whole-genome sequencing (WGS) in their practice. Our aim was to explore how students responded to a novel genome analysis course that included the option to analyze their own genomes. METHODS: This was an observational cohort study. Questionnaires were administered before (T3) and after the genome analysis course (T4), as well as 6 months later (T5). In-depth interviews were conducted at T5. RESULTS: All students (n = 19) opted to analyze their own genomes. At T5, 12 of 15 students stated that analyzing their own genomes had been useful. Ten reported they had applied their knowledge in the workplace. Technical WGS knowledge increased (mean of 63.8% at T3, mean of 72.5% at T4; P = 0.005). In-depth interviews suggested that analyzing their own genomes may increase students' motivation to learn and their understanding of the patient experience. Most (but not all) of the students reported low levels of WGS results-related distress and low levels of regret about their decision to analyze their own genomes. CONCLUSION: Giving students the option of analyzing their own genomes may increase motivation to learn, but some students may experience personal WGS results-related distress and regret. Additional evidence is required before considering incorporating optional personal genome analysis into medical education on a large scale.
Subjects
Genome, Human , Genomics , High-Throughput Nucleotide Sequencing , Students/psychology , Attitude of Health Personnel , Cohort Studies , Decision Making , Female , Genomics/methods , Humans , Longitudinal Studies , Male , Students, Medical/psychology , Surveys and Questionnaires
ABSTRACT
BACKGROUND: It has recently become possible to rapidly and accurately detect epigenetic signatures in bacterial genomes using third generation sequencing data. Monitoring the speed at which a single polymerase inserts a base in the read strand enables one to infer whether a modification is present at that specific site on the template strand. These sites can be challenging to detect in the absence of high coverage and reliable reference genomes. METHODS: Here we provide a new method for detecting epigenetic motifs in bacteria on datasets with low-coverage, with incomplete references, and with mixed samples (i.e. metagenomic data). Our approach treats motif inference as a kmer comparison problem. First, genomes (or contigs) are deconstructed into kmers. Then, native genome-wide distributions of interpulse durations (IPDs) for kmers are compared with corresponding whole genome amplified (WGA, modification free) IPD distributions using log likelihood ratios. Finally, kmers are ranked and greedily selected by iteratively correcting for sequences within a particular kmer's neighborhood. CONCLUSIONS: Our method can detect multiple types of modifications, even at very low-coverage and in the presence of mixed genomes. Additionally, we are able to predict modified motifs when genomes with "neighbor" modified motifs exist within the sample. Lastly, we show that these motifs can provide an alternative source of information by which to cluster metagenomics contigs and that iterative refinement on these clustered contigs can further improve both sensitivity and specificity of motif detection. AVAILABILITY: https://github.com/alibashir/EMMCKmer.
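The first two steps of the method (kmer decomposition and native-versus-WGA comparison by log likelihood ratio) can be sketched as below. This is an illustration under stated assumptions, not the EMMCKmer implementation: it models the IPD values of each kmer as normally distributed, uses invented toy values, and omits the greedy neighborhood-correction step entirely. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def kmers(sequence, k):
    """Deconstruct a sequence into its overlapping kmers."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def llr_score(native_ipds, wga_ipds, min_sigma=1e-3):
    """Log likelihood ratio of native IPDs: native-fit model vs. WGA-fit model."""
    mu_n, sd_n = mean(native_ipds), max(pstdev(native_ipds), min_sigma)
    mu_w, sd_w = mean(wga_ipds), max(pstdev(wga_ipds), min_sigma)
    return sum(
        normal_logpdf(x, mu_n, sd_n) - normal_logpdf(x, mu_w, sd_w)
        for x in native_ipds
    )


def rank_kmers(native, wga):
    """Rank kmers seen in both datasets from most to least modified-looking."""
    shared = native.keys() & wga.keys()
    scores = {kmer: llr_score(native[kmer], wga[kmer]) for kmer in shared}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical IPD values per kmer (toy numbers, not real data).
native = {"GATC": [2.1, 2.4, 1.9, 2.2], "ACGT": [0.1, -0.1, 0.0, 0.2]}
wga = {"GATC": [0.0, 0.1, -0.1, 0.05], "ACGT": [0.05, -0.05, 0.1, 0.0]}
ranking = rank_kmers(native, wga)
```

In the toy data the native IPDs for GATC are shifted well away from the modification-free control, so it ranks first; ACGT looks similar in both datasets and scores near zero.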