ABSTRACT
BACKGROUND: Tandem repeat sequences are common in the genomes of many organisms and are known to cause important phenomena such as gene silencing and rapid morphological changes. Due to the presence of multiple copies of the same pattern in tandem repeats and their high variability, they contain a wealth of information about the mutations that have led to their formation. The ability to extract this information can enhance our understanding of evolutionary mechanisms. RESULTS: We present a stochastic model for the formation of tandem repeats via tandem duplication and substitution mutations. Based on the analysis of this model, we develop a method for estimating the relative mutation rates of duplications and substitutions, as well as the total number of mutations, in the history of a tandem repeat sequence. We validate our estimation method via Monte Carlo simulation and show that it outperforms the state-of-the-art algorithm for discovering the duplication history. We also apply our method to tandem repeat sequences in the human genome, where it demonstrates the different behaviors of micro- and mini-satellites and can be used to compare mutation rates across chromosomes. It is observed that chromosomes that exhibit the highest mutation activity in tandem repeat regions are the same as those thought to have the highest overall mutation rates. However, unlike previous works that rely on comparing human and chimpanzee genomes to measure mutation rates, the proposed method allows us to find chromosomes with the highest mutation activity based on a single genome, in essence by comparing (approximate) copies of the pattern in tandem repeats. CONCLUSION: The prevalence of tandem repeats in most organisms and the efficiency of the proposed method enable studying various aspects of the formation of tandem repeats and the surrounding sequences in a wide range of settings. AVAILABILITY: The implementation of the estimation method is available at http://ips.lab.virginia.edu/smtr .
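The duplication-and-substitution model lends itself to a quick Monte Carlo illustration. The sketch below is not the authors' SMTR estimator; the seed pattern, the relative duplication probability, and the number of mutation events are assumptions chosen only to show how the two mutation types jointly shape a tandem repeat sequence.

```python
# Minimal sketch (not the SMTR implementation): a toy Monte Carlo simulation of
# tandem repeat growth under tandem duplication and substitution mutations.
# The seed, rates, and event count below are illustrative assumptions.
import random

def simulate_tandem_repeat(seed="ACGT", n_events=200, p_dup=0.3, max_dup_len=8):
    """Grow a sequence by repeatedly applying either a tandem duplication
    (with probability p_dup) or a point substitution (otherwise)."""
    seq = list(seed)
    n_dup = n_sub = 0
    for _ in range(n_events):
        if random.random() < p_dup:
            # Tandem duplication: copy a short segment and insert it next to itself.
            length = random.randint(1, min(max_dup_len, len(seq)))
            start = random.randrange(0, len(seq) - length + 1)
            seq[start + length:start + length] = seq[start:start + length]
            n_dup += 1
        else:
            # Substitution: replace one position with a different nucleotide.
            pos = random.randrange(len(seq))
            seq[pos] = random.choice([b for b in "ACGT" if b != seq[pos]])
            n_sub += 1
    return "".join(seq), n_dup, n_sub

if __name__ == "__main__":
    random.seed(0)
    s, d, m = simulate_tandem_repeat()
    print(f"length={len(s)}  duplications={d}  substitutions={m}")
```

Repeating such a simulation over a grid of relative rates is one way to validate an estimator of the kind described above against sequences with a known mutation history.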
Subjects
Gene Duplication, Tandem Repeat Sequences/genetics, Algorithms, Chromosomes, Human, X/genetics, Computer Simulation, Genome, Human, Humans, Monte Carlo Method, Mutation/genetics, Mutation Rate, Stochastic Processes
ABSTRACT
BACKGROUND: Metagenomics is a genomics research discipline devoted to the study of microbial communities in environmental samples and human and animal organs and tissues. Sequenced metagenomic samples usually comprise reads from a large number of different bacterial communities and hence tend to result in large file sizes, typically ranging from 1 to 10 GB. This leads to challenges in analyzing, transferring and storing metagenomic data. In order to overcome these data processing issues, we introduce MetaCRAM, the first de novo, parallelized software suite specialized for FASTA and FASTQ format metagenomic read processing and lossless compression. RESULTS: MetaCRAM integrates algorithms for taxonomy identification and assembly, and introduces parallel execution methods; furthermore, it enables genome reference selection and CRAM-based compression. MetaCRAM also uses novel reference-based compression methods designed through extensive studies of integer compression techniques and through fitting of empirical distributions of metagenomic read-reference positions. MetaCRAM is a lossless method compatible with standard CRAM formats, and it allows for fast selection of relevant files in the compressed domain via maintenance of taxonomy information. The performance of MetaCRAM as a stand-alone compression platform was evaluated on various metagenomic samples from the NCBI Sequence Read Archive, suggesting 2- to 4-fold compression ratio improvements compared to gzip. On average, the compressed file sizes were 2-13 percent of the original raw metagenomic file sizes. CONCLUSIONS: We described the first architecture for reference-based, lossless compression of metagenomic data. The compression scheme proposed offers significantly improved compression ratios as compared to off-the-shelf methods such as zip programs. Furthermore, it enables running different components in parallel and it provides the user with taxonomic and assembly information generated during execution of the compression pipeline. AVAILABILITY: The MetaCRAM software is freely available at http://web.engr.illinois.edu/~mkim158/metacram.html. The website also contains a README file and other relevant instructions for running the code. Note that to run the code one needs a minimum of 16 GB of RAM. In addition, a VirtualBox setup on a 4 GB RAM machine is provided for users to run a simple demonstration.
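To illustrate why reference-based position encoding compresses aligned reads well, here is a minimal sketch; it is not MetaCRAM's pipeline, and the random mapping positions and simple varint code are assumptions chosen only to show how delta-encoding positions against a reference turns large coordinates into small, highly compressible integers.

```python
# Minimal sketch, not MetaCRAM itself: sort assumed read-to-reference mapping
# positions, delta-encode them, and pack the small deltas with a 7-bits-per-byte
# variable-length (varint) integer code. All numbers below are illustrative.
import random

def varint_encode(values):
    """Encode non-negative integers with a varint code (7 payload bits per byte)."""
    out = bytearray()
    for v in values:
        while True:
            byte = v & 0x7F
            v >>= 7
            if v:
                out.append(byte | 0x80)  # continuation bit: more bytes follow
            else:
                out.append(byte)
                break
    return bytes(out)

if __name__ == "__main__":
    random.seed(0)
    # Pretend these are alignment start positions of reads on a reference genome.
    positions = sorted(random.randrange(0, 5_000_000) for _ in range(10_000))
    deltas = [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]
    raw_bytes = 4 * len(positions)  # naive fixed 32-bit encoding of positions
    packed = varint_encode(deltas)
    print(f"fixed 32-bit: {raw_bytes} bytes, delta+varint: {len(packed)} bytes")
```

Fitting the code to the empirical distribution of position gaps, as the abstract describes, is what distinguishes a tuned scheme from the generic varint used here.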
Subjects
Classification/methods, Data Compression/methods, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Metagenomics/methods, Humans
ABSTRACT
UNLABELLED: Gene prioritization refers to a family of computational techniques for inferring disease genes through a set of training genes and carefully chosen similarity criteria. Test genes are scored based on their average similarity to the training set, and the rankings of genes under various similarity criteria are aggregated via statistical methods. The contributions of our work are threefold: (i) first, based on the realization that there is no unique way to define an optimal aggregate for rankings, we investigate the predictive quality of a number of new aggregation methods and known fusion techniques from machine learning and social choice theory. Within this context, we quantify the influence of the number of training genes and similarity criteria on the diagnostic quality of the aggregate and perform in-depth cross-validation studies; (ii) second, we propose a new approach to genomic data aggregation, termed HyDRA (Hybrid Distance-score Rank Aggregation), which combines the advantages of score-based and combinatorial aggregation techniques. We also propose incorporating a new top-versus-bottom (TvB) weighting feature into the hybrid schemes. The TvB feature ensures that aggregates are more reliable at the top of the list, rather than at the bottom, since only top candidates are tested experimentally; (iii) third, we propose an iterative procedure for gene discovery that operates via successive augmentation of the set of training genes by genes discovered in previous rounds, checked for consistency. MOTIVATION: Fundamental results from social choice theory, political and computer sciences, and statistics have shown that there exists no consistent, fair and unique way to aggregate rankings. Instead, one has to decide on an aggregation approach using a predefined set of desirable properties for the aggregate. The aggregation methods fall into two categories, score- and distance-based approaches, each of which has its own drawbacks and advantages. This work is motivated by the observation that by merging these two techniques in a computationally efficient manner, and by incorporating additional constraints, one can ensure that the predictive quality of the resulting aggregation algorithm is very high. RESULTS: We tested HyDRA on a number of gene sets, including autism, breast cancer, colorectal cancer, endometriosis, ischaemic stroke, leukemia, lymphoma and osteoarthritis. Furthermore, we performed iterative gene discovery for glioblastoma, meningioma and breast cancer, using a sequentially augmented list of training genes related to the Turcot syndrome, Li-Fraumeni condition and other diseases. The methods outperform state-of-the-art software tools such as ToppGene and Endeavour. Despite this finding, we recommend as best practice to take the union of top-ranked items produced by different methods for the final aggregated list. AVAILABILITY AND IMPLEMENTATION: The HyDRA software may be downloaded from: http://web.engr.illinois.edu/~mkim158/HyDRA.zip. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
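As a rough illustration of rank aggregation with a top-versus-bottom emphasis, the following toy sketch (not the HyDRA algorithm) aggregates a few hypothetical gene rankings by letting positions near the top of each list contribute more to the final score; the example rankings and the 1/position weighting are assumptions for demonstration only.

```python
# Minimal sketch, not HyDRA: a score-based rank aggregation in which each gene
# receives 1/(rank position) from every input ranking, so agreement near the
# top of the lists dominates -- a simple stand-in for a TvB-style weighting.
from collections import defaultdict

def tvb_weighted_aggregate(rankings):
    """Aggregate several rankings (lists of genes, best first) into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for position, gene in enumerate(ranking, start=1):
            scores[gene] += 1.0 / position  # top positions carry larger weight
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    # Three hypothetical similarity criteria ranking the same candidate genes.
    rankings = [
        ["BRCA1", "TP53", "ATM", "CHEK2", "PTEN"],
        ["TP53", "BRCA1", "PTEN", "ATM", "CHEK2"],
        ["BRCA1", "ATM", "TP53", "PTEN", "CHEK2"],
    ]
    print(tvb_weighted_aggregate(rankings))
```

A distance-based aggregator would instead search for a ranking minimizing a (weighted) distance to all input rankings; the hybrid idea described above combines both views.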
Subjects
Algorithms, Artificial Intelligence, Disease/genetics, Genes, Genetic Predisposition to Disease, Genomics/methods, Software, Databases, Genetic, Humans
ABSTRACT
Antimicrobial susceptibility in Pseudomonas aeruginosa is dependent on a complex combination of host and pathogen-specific factors. Through the profiling of 971 clinical P. aeruginosa isolates from 590 patients and collection of paired patient metadata, we show that antimicrobial resistance is associated with not only patient-centric factors (e.g., cystic fibrosis and antipseudomonal prescription history) but also microbe-specific phenotypes (e.g., mucoid colony morphology). Additionally, isolates from different sources (e.g., respiratory tract, urinary tract) displayed rates of antimicrobial resistance that were correlated with source-specific antimicrobial prescription strategies. Furthermore, isolates from the same patient often displayed a high degree of heterogeneity, highlighting a key challenge facing personalized treatment of infectious diseases. Our findings support novel relationships between isolate and patient-level data sets, providing a potential guide for future antimicrobial treatment strategies. IMPORTANCE P. aeruginosa is a leading cause of nosocomial infection and infection in patients with cystic fibrosis. While P. aeruginosa infection and treatment can be complicated by a variety of antimicrobial resistance and virulence mechanisms, pathogen virulence is rarely recorded in a clinical setting. In this study, we discovered novel relationships between antimicrobial resistance, virulence-linked morphologies, and isolate source in a large and variable collection of clinical P. aeruginosa isolates. Our work motivates the clinical surveillance of virulence-linked P. aeruginosa morphologies as well as the tracking of source-specific antimicrobial prescription and resistance patterns.