ABSTRACT
Electroencephalogram (EEG) interpretation plays a critical role in the clinical assessment of neurological conditions, most notably epilepsy. However, EEG recordings are typically analyzed manually by highly specialized and heavily trained personnel. Moreover, the low rate of capturing abnormal events during the procedure makes interpretation time-consuming, resource-hungry, and, overall, expensive. Automatic detection offers the potential to improve the quality of patient care by shortening the time to diagnosis, managing big data and optimizing the allocation of human resources towards precision medicine. Here, we present MindReader, a novel unsupervised machine-learning method based on the interplay between an autoencoder network, a hidden Markov model (HMM), and a generative component: after dividing the signal into overlapping frames and performing a fast Fourier transform, MindReader trains an autoencoder neural network for dimensionality reduction and compact representation of different frequency patterns for each frame. Next, an HMM processes the temporal patterns, while a third, generative component hypothesizes and characterizes the different phases, which are then fed back to the HMM. MindReader then automatically generates labels that the physician can interpret as pathological and non-pathological phases, thus effectively reducing the search space for trained personnel. We evaluated MindReader's predictive performance on 686 recordings, encompassing more than 980 h from the publicly available Physionet database. Compared to manual annotations, MindReader identified 197 of 198 epileptic events (99.45%), and is, as such, a highly sensitive method, which is a prerequisite for clinical use.
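The first step described above, splitting a signal into overlapping frames and taking a Fourier transform of each, can be sketched as follows. This is only an illustration, not MindReader's code: the function names, frame length, hop size, and the synthetic test tone are all assumptions, and a naive discrete Fourier transform stands in for the fast Fourier transform so that the example needs nothing beyond the standard library.

```python
import cmath
import math


def frame_signal(samples, frame_len, hop):
    """Split a 1-D signal into frames; hop < frame_len makes them overlap."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def magnitude_spectrum(frame):
    """Magnitudes of the discrete Fourier transform of one frame.

    A naive O(n^2) DFT keeps the sketch dependency-free; an FFT yields
    the same values faster.
    """
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2 + 1)]


# Hypothetical input: a pure tone with 4 cycles per 32-sample frame.
signal = [math.sin(2 * math.pi * 4 * t / 32) for t in range(128)]
frames = frame_signal(signal, frame_len=32, hop=16)  # 50% overlap
spectra = [magnitude_spectrum(f) for f in frames]
# Index of the strongest frequency bin in each frame.
peak_bins = [max(range(len(s)), key=s.__getitem__) for s in spectra]
```

In a pipeline like the one described, each per-frame spectrum would then be the input vector for the dimensionality-reduction step.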
Subject(s)
Electroencephalography , Epilepsy , Humans , Electroencephalography/methods , Epilepsy/diagnosis , Neural Networks, Computer , Fourier Analysis , Unsupervised Machine Learning
ABSTRACT
BACKGROUND: Generating polygenic risk scores for diseases and complex traits requires high-quality GWAS summary statistics files. Often, these files are difficult to acquire because the data are either unshared or incomplete. To date, few bioinformatics tools focus on restoring missing columns containing identification and association data, even though such restoration has the potential to increase the number of usable GWAS summary statistics files. RESULTS: SumStatsRehab was able to restore rsID, effect/other alleles, chromosome, base pair position, effect allele frequencies, beta, standard error, and p-values to a better extent than any other currently available tool, with minimal loss. CONCLUSIONS: SumStatsRehab offers a unique tool utilizing both functional programming and pipeline-like architecture, allowing users to generate accurate data restorations for incomplete summary statistics files. This, in turn, increases the number of usable GWAS summary statistics files, which may be invaluable for less researched health traits.
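As one concrete case of restoring an association column, a missing p-value can be recomputed from beta and standard error under the usual Wald-test assumption of a standard normal null. The sketch below is not SumStatsRehab's implementation; the function names and the dictionary keys ("beta", "se", "p") are hypothetical.

```python
import math


def two_sided_p(beta, se):
    """Two-sided p-value of a Wald test: P(|Z| > |beta / se|), Z ~ N(0, 1)."""
    z = abs(beta / se)
    return math.erfc(z / math.sqrt(2))


def restore_row(row):
    """Return a copy of row with 'p' filled in when beta and se are known."""
    fixed = dict(row)
    has_inputs = fixed.get("beta") is not None and fixed.get("se")
    if fixed.get("p") is None and has_inputs:
        fixed["p"] = two_sided_p(fixed["beta"], fixed["se"])
    return fixed


# Hypothetical rows: the first lacks a p-value, the second lacks inputs.
restored = restore_row({"beta": 0.196, "se": 0.1, "p": None})
untouched = restore_row({"beta": None, "se": None, "p": None})
```

Rows that lack the required inputs are left unchanged rather than guessed, which keeps the restoration lossless for the data that were present.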
Subject(s)
Genome-Wide Association Study , Polymorphism, Single Nucleotide , Multifactorial Inheritance , Phenotype , Algorithms
ABSTRACT
BACKGROUND: Studies that aim at explaining phenotypes or disease susceptibility by genetic or epigenetic variants often rely on clustering methods to stratify individuals or samples. While statistical associations may point at increased risk for certain parts of the population, the ultimate goal is to make precise predictions for each individual. This necessitates tools that allow for the rapid inspection of each data point, in particular to find explanations for outliers. RESULTS: ACES is an integrative cluster- and phenotype-browser, which implements standard clustering methods, as well as multiple visualization methods in which all sample information can be displayed quickly. In addition, ACES can automatically mine a list of phenotypes for cluster enrichment, whereby the number of clusters and their boundaries are estimated by a novel method. For visual data browsing, ACES provides a 2D or 3D PCA or Heat Map view. ACES is implemented in Java, with a focus on a user-friendly, interactive, graphical interface. CONCLUSIONS: ACES has proven to be an invaluable tool for analyzing large, pre-filtered DNA methylation data sets and RNA-Sequencing data, owing to the ease with which it links molecular markers to complex phenotypes. The source code is available from https://github.com/GrabherrGroup/ACES.
Subject(s)
User-Computer Interface , Cluster Analysis , DNA Methylation , Diabetes Mellitus, Type 1/genetics , Diabetes Mellitus, Type 1/pathology , Humans , Internet Access , Principal Component Analysis , RNA/chemistry , RNA/metabolism
ABSTRACT
Marine stickleback fish have colonized and adapted to thousands of streams and lakes formed since the last ice age, providing an exceptional opportunity to characterize genomic mechanisms underlying repeated ecological adaptation in nature. Here we develop a high-quality reference genome assembly for threespine sticklebacks. By sequencing the genomes of twenty additional individuals from a global set of marine and freshwater populations, we identify a genome-wide set of loci that are consistently associated with marine-freshwater divergence. Our results indicate that reuse of globally shared standing genetic variation, including chromosomal inversions, has an important role in repeated evolution of distinct marine and freshwater sticklebacks, and in the maintenance of divergent ecotypes during early stages of reproductive isolation. Both coding and regulatory changes occur in the set of loci underlying marine-freshwater evolution, but regulatory changes appear to predominate in this well known example of repeated adaptive evolution in nature.
Subject(s)
Adaptation, Physiological/genetics , Biological Evolution , Genome/genetics , Smegmamorpha/genetics , Alaska , Animals , Aquatic Organisms/genetics , Chromosome Inversion/genetics , Chromosomes/genetics , Conserved Sequence/genetics , Ecotype , Female , Fresh Water , Genetic Variation/genetics , Genomics , Molecular Sequence Data , Seawater , Sequence Analysis, DNA
ABSTRACT
BACKGROUND: DNA methylation plays a key role in developmental processes, which is reflected in changing methylation patterns at specific CpG sites over the lifetime of an individual. The underlying mechanisms are complex and possibly affect multiple genes or entire pathways. RESULTS: We applied a multivariate approach to identify combinations of CpG sites that undergo modifications when transitioning between developmental stages. Monte Carlo feature selection produced a list of ranked and statistically significant CpG sites, while rule-based models allowed for identifying particular methylation changes in these sites. Our rule-based classifier reports combinations of CpG sites, together with changes in their methylation status in the form of easy-to-read IF-THEN rules, which allows for identification of the genes associated with the underlying sites. CONCLUSION: We utilized machine learning and statistical methods to discretize decision class (age) values to get a general pattern of methylation changes over the lifespan. The CpG sites present in the significant rules were annotated to genes involved in brain formation, general development, as well as genes linked to cancer and Alzheimer's disease.
ABSTRACT
Whiteboard is a class library implemented in C++ that enables visualization to be tightly coupled with computation when analyzing large and complex datasets. AVAILABILITY AND IMPLEMENTATION: The C++ source code, coding samples and documentation are freely available under the Lesser General Public License from http://whiteboard-class.sourceforge.net/.
Subject(s)
Computational Biology/methods , Computer Graphics , Programming Languages , Software , Databases, Factual , Humans , Information Storage and Retrieval
ABSTRACT
BACKGROUND: Genomic duplications constitute major events in the evolution of species, allowing paralogous copies of genes to take on fine-tuned biological roles. Unambiguously identifying the orthology relationship between copies across multiple genomes can be resolved by synteny, i.e. the conserved order of genomic sequences. However, a comprehensive analysis of duplication events and their contributions to evolution would require all-to-all genome alignments, whose number grows as N² with the number of available genomes, N. RESULTS: Here, we introduce Kraken, software that omits the all-to-all requirement by recursively traversing a graph of pairwise alignments and dynamically re-computing orthology. Kraken scales linearly with the number of targeted genomes, N, which allows for including large numbers of genomes in analyses. We first evaluated the method on the set of 12 Drosophila genomes, finding that orthologous correspondence computed indirectly through a graph of multiple synteny maps comes at minimal cost in terms of sensitivity, but reduces overall computational runtime by an order of magnitude. We then used the method on three well-annotated mammalian genomes, human, mouse, and rat, and show that up to 93% of protein coding transcripts have unambiguous pairwise orthologous relationships across the genomes. On a nucleotide level, 70 to 83% of exons match exactly at both splice junctions, and up to 97% on at least one junction. We last applied Kraken to an RNA-sequencing dataset from multiple vertebrates and diverse tissues, where we confirmed that brain-specific gene family members, i.e. one-to-many or many-to-many homologs, are more highly correlated across species than single-copy (i.e. one-to-one homologous) genes. Not limited to protein coding genes, Kraken also identifies thousands of newly identified transcribed loci, likely non-coding RNAs that are consistently transcribed in human, chimpanzee and gorilla, and maintain significant correlation of expression levels across species. CONCLUSIONS: Kraken is a computational genome coordinate translator that facilitates cross-species comparisons, distinguishes orthologs from paralogs, and does not require costly all-to-all whole genome mappings. Kraken is freely available under LGPL from http://github.com/nedaz/kraken.
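The idea of translating coordinates through a graph of pairwise alignments, rather than aligning every genome to every other, can be illustrated with a toy example. The sketch below is an assumption-laden simplification and not Kraken's algorithm: the data layout, function names, and coordinates are invented, blocks are ungapped and on one strand, and a breadth-first search stands in for the recursive traversal.

```python
from collections import deque

# Hypothetical pairwise synteny maps: (source, target) -> list of
# (source_start, source_end, target_start) blocks on a single sequence.
PAIRWISE = {
    ("human", "mouse"): [(100, 200, 5000)],
    ("mouse", "rat"): [(5000, 5200, 900)],
}


def translate(pos, blocks):
    """Map a position through one pairwise map, or None if unaligned."""
    for s_start, s_end, t_start in blocks:
        if s_start <= pos < s_end:
            return t_start + (pos - s_start)
    return None


def find_path(source, target, pairwise):
    """Breadth-first search for a chain of genomes linking source to target."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for a, b in pairwise:
            if a == path[-1] and b not in seen:
                seen.add(b)
                queue.append(path + [b])
    return None


def translate_via_graph(pos, source, target, pairwise):
    """Translate pos from source to target by chaining pairwise maps."""
    path = find_path(source, target, pairwise)
    if path is None:
        return None
    for a, b in zip(path, path[1:]):
        pos = translate(pos, pairwise[(a, b)])
        if pos is None:
            return None
    return pos
```

Here a human position reaches rat through mouse without a direct human-rat map, which is why a connected graph needs only on the order of N pairwise maps instead of N².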
Subject(s)
Genomics/methods , Software , Animals , Chromosome Mapping , Drosophila melanogaster/genetics , Evolution, Molecular , Genome/genetics , Humans , Mice , Molecular Sequence Annotation , Rats , Synteny/genetics , Transcription, Genetic
ABSTRACT
High-throughput RNA sequencing (RNA-seq) promises a comprehensive picture of the transcriptome, allowing for the complete annotation and quantification of all genes and their isoforms across samples. Realizing this promise requires increasingly complex computational methods. These computational challenges fall into three main categories: (i) read mapping, (ii) transcriptome reconstruction and (iii) expression quantification. Here we explain the major conceptual and practical challenges, and the general classes of solutions for each category. Finally, we highlight the interdependence between these categories and discuss the benefits for different biological applications.
Subject(s)
Gene Expression Profiling/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Sequence Analysis, RNA/statistics & numerical data , Animals , Computational Biology/methods , Genomics/statistics & numerical data , Humans , Sequence Alignment/statistics & numerical data
ABSTRACT
Rust fungi are some of the most devastating pathogens of crop plants. They are obligate biotrophs, which extract nutrients only from living plant tissues and cannot grow apart from their hosts. Their lifestyle has slowed the dissection of molecular mechanisms underlying host invasion and avoidance or suppression of plant innate immunity. We sequenced the 101-Mb genome of Melampsora larici-populina, the causal agent of poplar leaf rust, and the 89-Mb genome of Puccinia graminis f. sp. tritici, the causal agent of wheat and barley stem rust. We then compared the 16,399 predicted proteins of M. larici-populina with the 17,773 predicted proteins of P. graminis f. sp. tritici. Genomic features related to their obligate biotrophic lifestyle include expanded lineage-specific gene families, a large repertoire of effector-like small secreted proteins, impaired nitrogen and sulfur assimilation pathways, and expanded families of amino acid and oligopeptide membrane transporters. The dramatic up-regulation of transcripts coding for small secreted proteins, secreted hydrolytic enzymes, and transporters in planta suggests that they play a role in host infection and nutrient acquisition. Some of these genomic hallmarks are mirrored in the genomes of other microbial eukaryotes that have independently evolved to infect plants, indicating convergent adaptation to a biotrophic existence inside plant cells.
Subject(s)
Basidiomycota/genetics , Fungi/genetics , Triticum/microbiology , Gene Expression Profiling , Genes, Fungal , Genome , Genome, Fungal , Models, Genetic , Nitrates/chemistry , Oligonucleotide Array Sequence Analysis , Phylogeny , Plant Diseases/microbiology , Plant Leaves/microbiology , Sequence Analysis, DNA , Sulfates/chemistry
ABSTRACT
BACKGROUND: Phenomena such as incomplete lineage sorting, horizontal gene transfer, gene duplication and subsequent sub- and neo-functionalisation can result in distinct local phylogenetic relationships that are discordant with species phylogeny. In order to assess the possible biological roles for these subdivisions, they must first be identified and characterised, preferably on a large scale and in an automated fashion. RESULTS: We developed Saguaro, a combination of a Hidden Markov Model (HMM) and a Self Organising Map (SOM), to characterise local phylogenetic relationships among aligned sequences using cacti, matrices of pair-wise distance measures. While the HMM determines the genomic boundaries from aligned sequences, the SOM hypothesises new cacti in an unsupervised and iterative fashion based on the regions that were modelled least well by existing cacti. After testing the software on simulated data, we demonstrate the utility of Saguaro by testing two different data sets: (i) 181 Dengue virus strains, and (ii) 5 primate genomes. Saguaro identifies regions under lineage-specific constraint for the first set, and genomic segments that we attribute to incomplete lineage sorting in the second dataset. Intriguingly for the primate data, Saguaro also classified an additional ~3% of the genome as most incompatible with the expected species phylogeny. A substantial fraction of these regions was found to overlap genes associated with both the innate and adaptive immune systems. CONCLUSIONS: Saguaro detects distinct cacti describing local phylogenetic relationships without requiring any a priori hypotheses. 
We have successfully demonstrated Saguaro's utility with two contrasting data sets, one containing many members with short sequences (Dengue viral strains: n = 181, genome size = 10,700 nt), and the other with few members but complex genomes (related primate species: n = 5, genome size = 3 Gb), suggesting that the software is applicable to a wide variety of experimental populations. Saguaro is written in C++, runs on the Linux operating system, and can be downloaded from http://saguarogw.sourceforge.net/.
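To make the notion of a matrix of pair-wise distances concrete, the sketch below computes one such matrix per window of a toy alignment, using the fraction of mismatching sites as the distance. This is an assumption for illustration only: the abstract does not specify Saguaro's distance measure, and the HMM and SOM components are not reproduced. The function name and sequences are invented.

```python
def window_distances(alignment, start, end):
    """Pairwise mismatch fractions among aligned sequences in [start, end)."""
    names = sorted(alignment)
    width = end - start
    matrix = []
    for a in names:
        row = []
        for b in names:
            seq_a = alignment[a][start:end]
            seq_b = alignment[b][start:end]
            diffs = sum(x != y for x, y in zip(seq_a, seq_b))
            row.append(diffs / width)
        matrix.append(row)
    return names, matrix


# Hypothetical alignment in which the closest pair changes along the sequence.
alignment = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGTACGA"}
names, left = window_distances(alignment, 0, 4)   # A and B are identical here
_, right = window_distances(alignment, 4, 8)      # B and C are identical here
```

The two windows yield different matrices, which is the kind of local discordance that a segmentation method would need to place boundaries between.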
Subject(s)
Genomics/methods , Algorithms , Animals , Dengue Virus/genetics , Disease Outbreaks , Humans , Immunity/genetics , Markov Chains , Models, Genetic , Phylogeny , Primates/genetics , Primates/immunology , Software , Species Specificity
ABSTRACT
Rhizopus oryzae is the primary cause of mucormycosis, an emerging, life-threatening infection characterized by rapid angioinvasive growth with an overall mortality rate that exceeds 50%. As a representative of the paraphyletic basal group of the fungal kingdom called "zygomycetes," R. oryzae is also used as a model to study fungal evolution. Here we report the genome sequence of R. oryzae strain 99-880, isolated from a fatal case of mucormycosis. The highly repetitive 45.3 Mb genome assembly contains abundant transposable elements (TEs), comprising approximately 20% of the genome. We predicted 13,895 protein-coding genes not overlapping TEs, many of which are paralogous gene pairs. The order and genomic arrangement of the duplicated gene pairs and their common phylogenetic origin provide evidence for an ancestral whole-genome duplication (WGD) event. The WGD resulted in the duplication of nearly all subunits of the protein complexes associated with respiratory electron transport chains, the V-ATPase, and the ubiquitin-proteasome systems. The WGD, together with recent gene duplications, resulted in the expansion of multiple gene families related to cell growth and signal transduction, as well as secreted aspartic protease and subtilase protein families, which are known fungal virulence factors. The duplication of the ergosterol biosynthetic pathway, especially the major azole target, lanosterol 14alpha-demethylase (ERG11), could contribute to the variable responses of R. oryzae to different azole drugs, including voriconazole and posaconazole. Expanded families of cell-wall synthesis enzymes, essential for fungal cell integrity but absent in mammalian hosts, reveal potential targets for novel and R. oryzae-specific diagnostic and therapeutic treatments.
Subject(s)
Gene Duplication , Genome, Fungal , Genomics , Mucormycosis/microbiology , Rhizopus/genetics , DNA Transposable Elements , Fungal Proteins/genetics , Fungi/classification , Fungi/genetics , Humans , Phylogeny , Rhizopus/chemistry , Rhizopus/classification , Rhizopus/isolation & purification
ABSTRACT
The vast majority of human traits, including many disease phenotypes, are affected by alleles at numerous genomic loci. With a continually increasing set of variants with published clinical disease or biomarker associations, an easy-to-use tool for non-programmers to rapidly screen VCF files for risk alleles is needed. We have developed EZTraits as a tool to quickly evaluate genotype data against a set of rules defined by the user. These rules can be defined directly in the scripting language Lua, for genotype calls using variant ID (RS number) or chromosomal position. Alternatively, EZTraits can parse simple and intuitive text including concepts like 'any' or 'all'. Thus, EZTraits is designed to support rapid genetic analysis and hypothesis-testing by researchers, regardless of programming experience or technical background. The software is implemented in C++ and compiles and runs on Linux and MacOS. The source code is available under the MIT license from https://github.com/selfdecode/rd-eztraits.
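The 'any' and 'all' concepts mentioned above can be illustrated with a small rule evaluator. EZTraits itself is written in C++ with rules in Lua or plain text; the Python sketch below only mirrors the logic, and its rule format, function names, and variant IDs are made up for the example.

```python
def carries(genotypes, variant_id, allele):
    """True if the genotype call for variant_id contains the given allele."""
    return allele in genotypes.get(variant_id, "")


def evaluate(rule, genotypes):
    """Evaluate a rule of the form (mode, [(variant_id, allele), ...])."""
    mode, conditions = rule
    if mode not in ("any", "all"):
        raise ValueError(f"unknown mode: {mode}")
    combine = any if mode == "any" else all
    return combine(carries(genotypes, v, a) for v, a in conditions)


# Hypothetical genotype calls keyed by made-up variant IDs.
genotypes = {"rs0000001": "AG", "rs0000002": "CC"}
risk_any = ("any", [("rs0000001", "G"), ("rs0000002", "T")])
risk_all = ("all", [("rs0000001", "G"), ("rs0000002", "T")])
```

With these calls, the 'any' rule is satisfied by the G allele at the first variant, while the 'all' rule fails because no T allele is present at the second.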
Subject(s)
Genomics , Software , Alleles , Genotype , Phenotype
ABSTRACT
MOTIVATION: Comparative genomics heavily relies on alignments of large and often complex DNA sequences. From an engineering perspective, the problem here is to provide maximum sensitivity (to find all there is to find), specificity (to only find real homology) and speed (to accommodate the billions of base pairs of vertebrate genomes). RESULTS: Satsuma addresses all three issues through novel strategies: (i) cross-correlation, implemented via fast Fourier transform; (ii) a match scoring scheme that eliminates almost all false hits; and (iii) an asynchronous 'battleship'-like search that allows for aligning two entire fish genomes (470 and 217 Mb) in 120 CPU hours using 15 processors on a single machine. AVAILABILITY: Satsuma is part of the Spines software package, implemented in C++ on Linux. The latest version of Spines can be freely downloaded under the LGPL license from http://www.broadinstitute.org/science/programs/genome-biology/spines/.
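Cross-correlation via the Fourier transform, strategy (i) above, can be sketched for DNA by correlating one indicator signal per base and summing the results, which gives the number of matching positions at every shift. This is a generic textbook construction, not Satsuma's code: the encoding, function names, and example sequences are assumptions, and a small recursive radix-2 FFT is included so the example runs with the standard library alone.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(values) must be a power of 2."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def match_counts(query, target):
    """counts[s] = number of i with query[i] == target[i + s], for all shifts."""
    n = 1
    while n < len(query) + len(target):  # zero-pad to avoid wrap-around
        n *= 2
    totals = [0.0] * n
    for base in "ACGT":
        a = [1.0 if c == base else 0.0 for c in query] + [0.0] * (n - len(query))
        b = [1.0 if c == base else 0.0 for c in target] + [0.0] * (n - len(target))
        # Correlation theorem: FFT(corr) = conj(FFT(a)) * FFT(b) for real a.
        product = [x.conjugate() * y for x, y in zip(fft(a), fft(b))]
        corr = fft(product, inverse=True)
        for s in range(n):
            totals[s] += corr[s].real / n
    return [round(v) for v in totals]


# Hypothetical sequences: the query occurs at offset 2 of the target.
query, target = "GATTACA", "CCGATTACAGG"
counts = match_counts(query, target)
best_shift = max(range(len(target) - len(query) + 1), key=counts.__getitem__)
```

All shifts are scored in one pass at O(n log n) cost, compared with O(n·m) for comparing the sequences shift by shift.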
Subject(s)
Computational Biology/methods , Software , Algorithms , Animals , Fourier Analysis , Genome , Genomics/methods , Humans , Models, Statistical , Oryza/genetics , Probability , Programming Languages , Sequence Alignment , Sorghum/genetics , Tetraodontiformes
ABSTRACT
The massive increase in computational power over recent years and wider applications of machine learning methods, coincidental or not, were paralleled by remarkable advances in high-throughput DNA sequencing technologies. [...].
ABSTRACT
BACKGROUND: Measuring how gene expression changes in the course of an experiment assesses how an organism responds on a molecular level. Sequencing of RNA molecules, and their subsequent quantification, aims to assess global gene expression changes on the RNA level (transcriptome). While advances in high-throughput RNA-sequencing (RNA-seq) technologies allow for inexpensive data generation, accurate post-processing and normalization across samples is required to eliminate any systematic noise introduced by the biochemical and/or technical processes. Existing methods thus either normalize on selected known reference genes that are invariant in expression across the experiment, assume that the majority of genes are invariant, or assume that the effects of up- and down-regulated genes cancel each other out during the normalization. RESULTS: Here, we present a novel method, moose2, which predicts invariant genes in silico through a dynamic programming (DP) scheme and applies a quadratic normalization based on this subset. The method allows for specifying a set of known or experimentally validated invariant genes, which guides the DP. We experimentally verified the predictions of this method in the bacterium Escherichia coli, and show how moose2 is able to (i) estimate the expression value distances between RNA-seq samples, (ii) reduce the variation of expression values across all samples, and (iii) subsequently reveal new functional groups of genes during the late stages of DNA damage. We further applied the method to three eukaryotic data sets, on which its performance compares favourably to other methods. The software is implemented in C++ and is publicly available from http://grabherr.github.io/moose2/. CONCLUSIONS: The proposed RNA-seq normalization method, moose2, is a valuable alternative to existing methods, with two major advantages: (i) in silico prediction of invariant genes provides a list of potential reference genes for downstream analyses, and (ii) non-linear artefacts in RNA-seq data are handled adequately to minimize variations between replicates.
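A quadratic normalization based on a set of invariant genes can be sketched as a least-squares fit of a second-degree polynomial that maps a sample's expression values onto a reference scale. The sketch assumes the invariant set is already given; the dynamic-programming prediction is not reproduced, and the function names and numbers are invented rather than taken from moose2.

```python
def fit_quadratic(xs, ys):
    """Least-squares coefficients (a, b, c) of y = a*x**2 + b*x + c."""
    # Normal equations: power sums of x on the left, moments of y on the right.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def normalize(values, coeffs):
    """Map sample expression values onto the reference scale."""
    a, b, c = coeffs
    return [a * v * v + b * v + c for v in values]


# Hypothetical invariant genes: expression in the sample vs. the reference.
sample_invariant = [1.0, 2.0, 4.0, 8.0, 16.0]
reference_invariant = [0.01 * x * x + 2.0 * x + 1.0 for x in sample_invariant]
coeffs = fit_quadratic(sample_invariant, reference_invariant)
normalized = normalize([3.0, 10.0], coeffs)  # applied to all other genes
```

Fitting on the invariant subset only means that genuinely regulated genes do not distort the curve used to rescale the sample.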
ABSTRACT
After performing de novo transcript assembly of >1 billion RNA-Sequencing reads obtained from 22 samples of different Norway spruce (Picea abies) tissues that were not surface sterilized, we found that assembled sequences captured a mix of plant, lichen, and fungal transcripts. The latter were likely expressed by endophytic and epiphytic symbionts, indicating that these organisms were present, alive, and metabolically active. Here, we show that these serendipitously sequenced transcripts need not be considered merely as contamination, as is common, but that they provide insight into the plant's phyllosphere. Notably, we could classify these transcripts as originating predominantly from Dothideomycetes and Leotiomycetes species, with functional annotation of gene families indicating active growth and metabolism, with particular regard to glucose intake and processing, as well as gene regulation.
Subject(s)
Fungi/genetics , Picea/genetics , Picea/microbiology , Transcriptome/genetics , Base Composition/genetics , Gene Expression Regulation, Fungal , Gene Expression Regulation, Plant , RNA, Messenger/genetics , RNA, Messenger/metabolism
ABSTRACT
BACKGROUND: The fundamental challenge in optimally aligning homologous sequences is to define a scoring scheme that best reflects the underlying biological processes. Maximising the overall number of matches in the alignment does not always reflect the patterns by which nucleotides mutate. Efficiently implemented algorithms that can be parameterised to accommodate more complex non-linear scoring schemes are thus desirable. RESULTS: We present Cola, alignment software that implements different optimal alignment algorithms, also allowing for scoring contiguous matches of nucleotides in a nonlinear manner. The latter places more emphasis on short, highly conserved motifs, and less on the surrounding nucleotides, which can be more diverged. To illustrate the differences, we report results from aligning 14,100 sequences from 3' untranslated regions of human genes to 25 of their mammalian counterparts, where we found that a nonlinear scoring scheme is more consistent than a linear scheme in detecting short, conserved motifs. CONCLUSIONS: Cola is freely available under LGPL from https://github.com/nedaz/cola.
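The difference between linear and nonlinear scoring of contiguous matches can be shown on two already-aligned sequence pairs. The abstract does not state Cola's scoring function, so the power-law reward per run used below is purely an assumption, as are the function names and sequences; gaps and mismatch penalties are left out to keep the contrast visible.

```python
def run_lengths(a, b):
    """Lengths of maximal runs of contiguous matches in an aligned pair."""
    runs, current = [], 0
    for x, y in zip(a, b):
        if x == y and x != "-":
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def score(a, b, exponent=1):
    """Sum of run_length ** exponent; exponent=1 is plain match counting."""
    return sum(k ** exponent for k in run_lengths(a, b))


# Hypothetical aligned pairs with six matches each.
contiguous = ("ACGTACTT", "ACGTACGG")   # one run of 6
scattered = ("ACGTACGT", "ACTTAAGT")    # three runs of 2
linear = (score(*contiguous), score(*scattered))
nonlinear = (score(*contiguous, exponent=2), score(*scattered, exponent=2))
```

Under linear scoring the two alignments tie, whereas the nonlinear variant favours the single conserved stretch, which is the behaviour the abstract attributes to nonlinear schemes.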
ABSTRACT
The domestic dog, Canis familiaris, is a well-established model system for mapping trait and disease loci. While the original draft sequence was of good quality, gaps were abundant, particularly in promoter regions of the genome, negatively impacting the annotation and study of candidate genes. Here, we present an improved genome build, canFam3.1, which includes 85 Mb of novel sequence and now covers 99.8% of the euchromatic portion of the genome. We also present multiple RNA-Sequencing data sets from 10 different canine tissues to catalog ~175,000 expressed loci. While about 90% of the coding genes previously annotated by EnsEMBL have measurable expression in at least one sample, the number of transcript isoforms detected by our data expands the EnsEMBL annotations by a factor of four. Syntenic comparison with the human genome revealed an additional ~3,000 loci that are characterized as protein coding in human and were also expressed in the dog, suggesting that those were previously not annotated in the EnsEMBL canine gene set. In addition to ~20,700 high-confidence protein coding loci, we found ~4,600 antisense transcripts overlapping exons of protein coding genes, ~7,200 intergenic multi-exon transcripts without coding potential, likely candidates for long intergenic non-coding RNAs (lincRNAs), and ~11,000 transcripts that were reported by two different library construction methods but did not fit any of the above categories. Of the lincRNAs, about 6,000 have no annotated orthologs in human or mouse. Functional analysis of two novel transcripts with shRNA in a mouse kidney cell line altered cell morphology and motility. All in all, we provide a much-improved annotation of the canine genome and suggest regulatory functions for several of the novel non-coding transcripts.
Subject(s)
Dogs/genetics , Genome , Polymorphism, Single Nucleotide , Animals , Cell Line , Exons , Gene Expression Profiling , Humans , Mice , Nerve Tissue Proteins/metabolism , Oligonucleotides, Antisense/chemistry , Podocytes/cytology , RNA, Messenger/metabolism , RNA, Small Interfering/metabolism , RNA, Untranslated , Sequence Analysis, RNA
ABSTRACT
The promoter is a key element in gene transcription and regulation. We previously reported that artificial sequences rich in the dinucleotide CpG are sufficient to drive expression in vitro in mammalian cell lines, without requiring canonical binding sites for transcription factor proteins. Here, we report that introducing a promoter organization that alternates in CpGs and regions rich in A and T further increases expression strength, as well as how insertion of specific binding sites makes such sequences respond to induced levels of the transcription factor NFκB. Our findings further contribute to the mechanistic understanding of promoters, as well as how these sequences might be shaped by evolutionary pressure in living organisms.
Subject(s)
CpG Islands , NF-kappa B/metabolism , Promoter Regions, Genetic , Base Sequence , Binding Sites/genetics , Cell Line , Dinucleoside Phosphates/genetics , Dinucleoside Phosphates/metabolism , Gene Expression Regulation , HEK293 Cells , Humans , Molecular Sequence Data , NF-kappa B/genetics
ABSTRACT
Decoding the genome sequence is becoming a fundamental tool for molecular, genetic, and genomic studies. This chapter reviews the history of DNA sequencing and the technical principles of different sequencing platforms, and compares the strengths and weaknesses of different techniques for high-throughput genome sequencing applications. It also briefly describes genome assembly and its validation.