ABSTRACT
Genome Rearrangement distance problems are used in Computational Biology to estimate the evolutionary distance between genomes. These problems consist of minimizing the number of rearrangement events necessary to transform one genome into another. Two commonly used rearrangement events are reversal and transposition. The first studied problems ignored nucleotides outside genes (called intergenic regions), or assumed that genomes have a single copy of each gene. Recent works made advancements in more general problems considering the number of nucleotides in intergenic regions, and replicated genes. Nevertheless, genomes tend to have wildly different quantities of nucleotides on their intergenic regions, which poses a problem when comparing these regions exactly. To overcome this limitation, our work considers some flexibility when matching intergenic regions that do not have the same number of nucleotides. We propose new problems seeking the minimum number of reversals, or reversals and transpositions, necessary to transform one genome into another, while considering flexible intergenic region information. We show approximations for these problems by exploring their relationship with the Signed Minimum Common Flexible Intergenic String Partition problem. We also present different heuristics for the partition problem, and conduct experimental tests on simulated genomes to assess the performance of our algorithms.
ABSTRACT
BACKGROUND: In proteomics, the interpretation of mass spectra representing peptides carrying multiple complex modifications remains challenging, as it is difficult to strike a balance between reasonable execution time, a limited number of false positives, and a huge search space allowing any number of modifications without a priori knowledge. The scientific community needs new developments in this area to aid in the discovery of novel post-translational modifications that may play important roles in disease. RESULTS: To make progress on this issue, we implemented SpecGlobX (SpecGlob eXTended to eXperimental spectra), a standalone Java application that quickly determines the best spectral alignments of a (possibly very large) list of Peptide-to-Spectrum Matches (PSMs) provided by any open modification search method, or generated by the user. As input, SpecGlobX reads a file containing spectra in MGF or mzML format and a semicolon-delimited spreadsheet describing the PSMs. SpecGlobX returns the best alignment for each PSM as output, splitting the mass difference between the spectrum and the peptide into one or more shifts while considering the possibility of non-aligned masses (a phenomenon resulting from many situations, including neutral losses). SpecGlobX is fast, able to align one million PSMs in about 1.5 min on a standard desktop. First, we recall the foundations of the algorithm and detail how we adapted SpecGlob (the method we previously developed with the same aim, but limited to the interpretation of perfect simulated spectra) to the interpretation of imperfect experimental spectra. Then, we highlight the value of SpecGlobX as a complementary tool downstream of three open modification search methods on a large simulated spectra dataset. Finally, we ran SpecGlobX on a proteome-wide dataset downloaded from PRIDE to demonstrate that SpecGlobX functions just as well on simulated and experimental spectra. We then carefully analyzed a limited set of interpretations. CONCLUSIONS: SpecGlobX is helpful as a decision support tool, providing keys to interpret peptides carrying complex modifications still poorly considered by current open modification search software. Better alignment of PSMs enhances confidence in the identification of spectra provided by open modification search methods and should improve the interpretation rate of spectra.
Subjects
Peptides, Proteomics, Proteomics/methods, Protein Databases, Mass Spectrometry/methods, Software, Algorithms
ABSTRACT
The most common way to calculate the rearrangement distance between two genomes is to use the size of a minimum-length sequence of rearrangements that transforms one of the two given genomes into the other, where the genomes are represented as permutations using only their gene order, based on the assumption that genomes have the same gene content. With the advance of research in genome rearrangements, new works extended the classical models by either considering genomes with different gene content (unbalanced genomes) or including more genomic characteristics in the mathematical representation of the genomes, such as the distribution of intergenic region sizes. In this work, we study the Reversal, Transposition, and Indel (Insertion and Deletion) Distance using intergenic information, which allows comparing unbalanced genomes, because indels are included in the rearrangement model (i.e., the set of possible rearrangements allowed when we compute the distance). For the particular case of transpositions and indels on unbalanced genomes, we present a 4-approximation algorithm, improving a previous 4.5-approximation. This algorithm is extended to deal with gene orientation while maintaining the 4-approximation factor for the Reversal, Transposition, and Indel Distance on unbalanced genomes. Furthermore, we evaluate the proposed algorithms using experiments on simulated data.
Subjects
Gene Rearrangement, Genetic Models, Genome/genetics, Genomics, INDEL Mutation, Algorithms
ABSTRACT
Genome Rearrangements are events that affect large stretches of genomes during evolution. Many mathematical models have been used to estimate the evolutionary distance between two genomes based on genome rearrangements. However, most of them focused on the (order of the) genes of a genome, disregarding other important elements in it. Recently, researchers have shown that considering regions between each pair of genes, called intergenic regions, can enhance distance estimation in realistic data. Two of the most studied genome rearrangements are the reversal, which inverts a sequence of genes, and the transposition, which occurs when two adjacent gene sequences swap their positions inside the genome. In this work, we study the transposition distance between two genomes, but we also consider intergenic regions, a problem we name Sorting by Intergenic Transpositions. We show that this problem is NP-hard and propose two approximation algorithms, with factors 3.5 and 2.5, considering two distinct definitions for the problem. We also investigate the signed reversal and transposition distance between two genomes considering their intergenic regions. This second problem is called Sorting by Signed Intergenic Reversals and Intergenic Transpositions. We show that this problem is NP-hard and develop two approximation algorithms, with factors 3 and 2.5. We check how these algorithms behave when assigning weights for genome rearrangements. Finally, we implemented all these algorithms and tested them on real and simulated data.
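The two rearrangements defined in this abstract can be sketched on a plain gene order. This is only an illustration under simplifying assumptions: genes are unsigned integers, intergenic regions are left out, and the function names and index convention (0-based, half-open blocks) are hypothetical rather than taken from the paper.

```python
def reversal(perm, i, j):
    """Return a copy of perm with the block perm[i:j] in inverted order."""
    if not 0 <= i < j <= len(perm):
        raise ValueError("need 0 <= i < j <= len(perm)")
    return perm[:i] + perm[i:j][::-1] + perm[j:]


def transposition(perm, i, j, k):
    """Return a copy of perm with the adjacent blocks perm[i:j] and perm[j:k] swapped."""
    if not 0 <= i < j < k <= len(perm):
        raise ValueError("need 0 <= i < j < k <= len(perm)")
    return perm[:i] + perm[j:k] + perm[i:j] + perm[k:]


# Illustrative gene order (not data from the paper).
genome = [1, 4, 5, 2, 3, 6]
after_transposition = transposition(genome, 1, 3, 5)  # swaps [4, 5] with [2, 3]
after_reversal = reversal([1, 3, 2, 4], 1, 3)         # inverts [3, 2]
```

In this toy case a single transposition, or a single reversal, restores the identity order; the distance problems in the abstract ask for the minimum number of such events in general.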
Subjects
Algorithms, Gene Rearrangement/genetics, Genome/genetics, Genomics/methods, DNA Transposable Elements/genetics, Intergenic DNA/genetics, DNA Sequence Analysis
ABSTRACT
BACKGROUND: Mass spectrometry remains the method of choice for characterizing proteins. Nevertheless, most of the spectra generated by an experiment remain unidentified after analysis, mostly because of the modifications they carry. Open Modification Search (OMS) methods offer a promising answer to this problem. However, assessing the quality of OMS identifications remains a difficult task. METHODS: To better understand the relationship between (1) the similarity of pairs of spectra provided by OMS methods and (2) the relevance of their corresponding peptide sequences, we used a dataset composed of theoretical spectra only, to which we applied two OMS strategies. We also introduced two measures for evaluating the above-mentioned spectra/sequence relevance in this context: one is a color classification representing the level of difficulty of retrieving the proper sequence of the peptide that generated the identified spectrum; the other, called LIPR, is the proportion of common masses, in a given Peptide Spectrum Match (PSM), that represent dissimilar sequences. These two measures were also considered in conjunction with the False Discovery Rate (FDR). RESULTS: According to our measures, the strategy that selects the best candidate by taking the mass difference between two spectra into account yields better-quality results. Moreover, although the FDR remains an interesting indicator in OMS methods (as shown by LIPR), it is questionable: our color classification shows that a non-negligible proportion of relevant spectra/sequence interpretations corresponds to PSMs coming from the decoy database. CONCLUSIONS: The three above-mentioned measures allowed us to clearly determine which of the two studied OMS strategies outperformed the other, both in number of identifications and in accuracy of these identifications. Even though quality evaluation of PSMs in OMS methods remains challenging, the study of theoretical spectra is a favorable framework for going further in this direction.
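For readers unfamiliar with the FDR mentioned above, a common target-decoy estimate can be sketched as follows. This is an assumption about the general technique, not the exact procedure used in the study: the record layout (`score`, `is_decoy`), the threshold, and the sample values are all hypothetical.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR as (#accepted decoy PSMs) / (#accepted target PSMs)."""
    accepted = [p for p in psms if p["score"] >= threshold]
    decoys = sum(1 for p in accepted if p["is_decoy"])
    targets = len(accepted) - decoys
    # Guard against division by zero when no target PSM passes the threshold.
    return decoys / max(targets, 1)


# Hypothetical PSMs, for illustration only.
psms = [
    {"score": 0.9, "is_decoy": False},
    {"score": 0.8, "is_decoy": False},
    {"score": 0.7, "is_decoy": True},
    {"score": 0.6, "is_decoy": False},
    {"score": 0.3, "is_decoy": True},
]
fdr = estimate_fdr(psms, threshold=0.5)  # 1 decoy over 3 targets
```

The estimate treats every decoy hit as wrong, which is the assumption the abstract questions when it reports relevant interpretations among decoy PSMs.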
Subjects
Proteomics, Tandem Mass Spectrometry, Algorithms, Protein Databases, Peptides, Software
ABSTRACT
Genome rearrangements are mutations affecting large portions of a genome, and the reversal is one of the most studied genome rearrangements in the literature, through the Sorting by Reversals (SbR) problem. SbR is solvable in polynomial time on signed permutations (i.e., the gene orientation is known), and it is NP-hard on unsigned permutations. This problem (and many others considering genome rearrangements) models a genome as a list of its genes in the order they appear, ignoring all other information present in the genome. Recent works claimed that incorporating the sizes of intergenic regions, i.e., sequences of nucleotides between genes, may result in better estimators for the real distance between genomes. Here we introduce the Sorting Signed Permutations by Intergenic Reversals problem, which sorts a signed permutation using reversals that act on both gene order and intergenic sizes. We show that this problem is NP-hard by a reduction from the 3-partition problem. Then, we propose a 2-approximation algorithm for it. Finally, we also incorporate intergenic indels (i.e., insertions or deletions of intergenic regions) to overcome a limitation of sorting by conservative events (such as reversals) and propose two approximation algorithms.
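A reversal that acts on both gene order and intergenic sizes can be sketched as below. The representation is an assumption made for illustration: `genes` is a signed permutation of length n, `regions` holds n + 1 intergenic sizes (one before each gene and one after the last), and the reversal cuts the region left of gene i after x nucleotides and the region right of gene j after y nucleotides. The function name and parameters are hypothetical.

```python
def intergenic_reversal(genes, regions, i, j, x, y):
    """Reverse genes[i..j] (inclusive), flipping signs and redistributing cut regions."""
    if len(regions) != len(genes) + 1:
        raise ValueError("regions must have one more entry than genes")
    if not 0 <= i <= j < len(genes):
        raise ValueError("need 0 <= i <= j < len(genes)")
    if not (0 <= x <= regions[i] and 0 <= y <= regions[j + 1]):
        raise ValueError("cut points must lie inside the cut regions")
    # Gene order is inverted and every gene in the segment changes orientation.
    new_genes = genes[:i] + [-g for g in reversed(genes[i:j + 1])] + genes[j + 1:]
    # Regions strictly inside the segment are carried along in reverse order.
    inner = regions[i + 1:j + 1][::-1]
    left = x + y                                   # kept left part + moved left part of right cut
    right = (regions[i] - x) + (regions[j + 1] - y)
    new_regions = regions[:i] + [left] + inner + [right] + regions[j + 2:]
    return new_genes, new_regions


# Illustrative input (not data from the paper).
genes, regions = [1, -3, -2, 4], [5, 2, 7, 1, 3]
new_genes, new_regions = intergenic_reversal(genes, regions, 1, 2, x=1, y=0)
```

The total number of intergenic nucleotides is unchanged by the operation, which is what makes a reversal a conservative event in the sense used in the abstract.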
Subjects
Intergenic DNA/genetics, Gene Rearrangement/genetics, Genomics/legislation & jurisprudence, Algorithms, INDEL Mutation/genetics, Genetic Models, Mutation/genetics
ABSTRACT
During the evolutionary process, genomes are affected by various genome rearrangements, that is, events that modify large stretches of the genetic material. In the literature, a large number of models have been proposed to estimate the number of events that occurred during evolution; most of them represent a genome as an ordered sequence of genes, and, in particular, disregard the genetic material between consecutive genes. However, recent studies showed that taking into account the genetic material between consecutive genes can enhance evolutionary distance estimations. Reversal and transposition are genome rearrangements that have been widely studied in the literature. A reversal inverts a (contiguous) segment of the genome, while a transposition swaps the positions of two consecutive segments. Genomes also undergo nonconservative events (events that alter the amount of genetic material) such as insertions and deletions, in which genetic material from intergenic regions of the genome is inserted or deleted, respectively. In this article, we study a genome rearrangement model that considers both gene order and sizes of intergenic regions. We investigate the reversal distance, and also the reversal and transposition distance between two genomes in two scenarios: with and without nonconservative events. We show that these problems are NP-hard and we present constant ratio approximation algorithms for all of them. More precisely, we provide a 4-approximation algorithm for the reversal distance, both in the conservative and nonconservative versions. For the reversal and transposition distance, we provide a 4.5-approximation algorithm, both in the conservative and nonconservative versions. We also perform experimental tests to verify the behavior of our algorithms, as well as to compare the practical and theoretical results. We finally extend our study to scenarios in which events have different costs, and we present constant ratio approximation algorithms for each scenario.
ABSTRACT
BACKGROUND: The evolutionary distance between two genomes can be estimated by computing a minimum-length sequence of operations, called genome rearrangements, that transforms one genome into another. Usually, a genome is modeled as an ordered sequence of genes, and most of the studies in the genome rearrangement literature consist in shaping biological scenarios into mathematical models. Examples include allowing different genome rearrangement operations at the same time, adding constraints to these rearrangements (e.g., each rearrangement can affect at most a given number of genes), and considering that a rearrangement has a cost depending on its length rather than a unit cost. Most of the works, however, have overlooked some important features inside genomes, such as the presence of sequences of nucleotides between genes, called intergenic regions. RESULTS AND CONCLUSIONS: In this work, we investigate the problem of computing the distance between two genomes, taking into account both gene order and intergenic sizes. The genome rearrangement operations we consider here are constrained types of reversals and transpositions, called super short reversals (SSRs) and super short transpositions (SSTs), which affect up to two (consecutive) genes. We denote by super short operations (SSOs) any SSR or SST. We present 3-approximation algorithms for SSRs, SSTs, or SSOs when the orientation of the genes is not considered, and 5-approximation algorithms for SSRs or SSOs when the orientation is considered. We also show that the approximation factors of these algorithms improve when the input permutation has a higher number of inversions: the factor decreases from 3 to either 2 or 1.5, and from 5 to either 3 or 2.
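The notion of inversions used above can be made concrete with a short sketch. It assumes an unsigned permutation and ignores intergenic sizes; `count_inversions` and `swap_adjacent` are hypothetical helper names, and the adjacent swap stands in for an operation on two consecutive genes.

```python
def count_inversions(perm):
    """Count pairs of positions (a, b), a < b, whose genes are out of order."""
    n = len(perm)
    return sum(1 for a in range(n) for b in range(a + 1, n) if perm[a] > perm[b])


def swap_adjacent(perm, i):
    """Return a copy of perm with the consecutive genes at positions i and i + 1 exchanged."""
    if not 0 <= i < len(perm) - 1:
        raise ValueError("need 0 <= i < len(perm) - 1")
    out = list(perm)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


# Illustrative permutation (not data from the paper).
perm = [3, 1, 2]
before = count_inversions(perm)                    # pairs (3, 1) and (3, 2)
after = count_inversions(swap_adjacent(perm, 0))   # [1, 3, 2] keeps only (3, 2)
```

Exchanging two adjacent distinct genes changes the inversion count by exactly one, so the number of inversions bounds from below the number of such swaps needed when gene order alone is considered.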
ABSTRACT
Understanding the factors that modulate bacterial community assembly in natural soils is a longstanding challenge in microbial community ecology. In this work, we compared two microbial co-occurrence networks representing bacterial soil communities from two different sections of a pH, temperature and humidity gradient occurring along a western slope of the Andes in the Atacama Desert. In doing so, a topological graph alignment of co-occurrence networks was used to determine the impact of a shift in environmental variables on OTU taxonomic composition and relationships. We observed that a fraction of association patterns identified in the co-occurrence networks are persistent despite large environmental variation. This apparent resilience seems to be due to: (1) a proportion of OTUs that persist across the gradient and maintain similar association patterns within the community and (2) bacterial community ecological rearrangements, where an important fraction of the OTUs come to fill the ecological roles of other OTUs in the other network. In fact, potential functional features suggest a fundamental role of persistent OTUs along the soil gradient involving nitrogen fixation. Our results allow identifying factors that induce changes in microbial assemblage configuration, altering specific bacterial soil functions and interactions within the microbial communities in natural environments.
Subjects
Archaea/physiology, Bacterial Physiological Phenomena/genetics, Ecology, Microbiota/physiology, Archaea/growth & development, Microbiota/genetics, 16S Ribosomal RNA, Soil Microbiology, Physiological Stress/genetics, Physiological Stress/physiology
ABSTRACT
BACKGROUND: Combinatorial works on genome rearrangements have so far ignored the influence of intergene sizes, i.e. the number of nucleotides between consecutive genes, although it was recently shown to be decisive for the accuracy of inference methods (Biller et al. in Genome Biol Evol 8:1427-39, 2016; Biller et al. in Beckmann A, Bienvenu L, Jonoska N, editors. Proceedings of Pursuit of the Universal-12th conference on computability in Europe, CiE 2016, Lecture notes in computer science, vol 9709, Paris, France, June 27-July 1, 2016. Berlin: Springer, p. 35-44, 2016). In this line, we define a new genome rearrangement model called wDCJ, a generalization of the well-known double cut and join (or DCJ) operation that modifies both the gene order and the intergene size distribution of a genome. RESULTS: We first provide a generic formula for the wDCJ distance between two genomes, and show that computing this distance is strongly NP-complete. We then propose an approximation algorithm of ratio 4/3, and two exact ones: a fixed-parameter tractable (FPT) algorithm and an integer linear programming (ILP) formulation. CONCLUSIONS: We provide theoretical and empirical bounds on the expected growth of the parameter at the center of our FPT and ILP algorithms, assuming a probabilistic model of evolution under wDCJ, which shows that both these algorithms should run reasonably fast in practice.
ABSTRACT
BACKGROUND: As one of the most studied genome rearrangements, tandem repeats have a considerable impact on the genetic background of inherited diseases. Many methods designed for tandem repeat detection on reference sequences obtain high-quality results. However, in a de novo context, where no reference sequence is available, tandem repeat detection remains a difficult problem. The short reads obtained with second-generation sequencing methods are not long enough to span regions that contain long repeats. This length limitation was tackled by the long reads obtained with third-generation sequencing platforms such as Pacific Biosciences technologies. Nevertheless, the gain in read length came with a significant increase in the error rate. The main objective of current studies on long reads is to handle this high error rate of up to 16%. METHODS: In this paper we present MixTaR, the first de novo method for tandem repeat detection that combines the high quality of short reads and the large length of long reads. Our hybrid algorithm uses the set of short reads for tandem repeat pattern detection based on a de Bruijn graph. These patterns are then validated using the long reads, and the tandem repeat sequences are constructed using local greedy assemblies. RESULTS: MixTaR is tested with both simulated and real reads from complex organisms. For a complete analysis of its robustness to errors, we use short and long reads with different error rates. The results are then analysed in terms of the number of tandem repeats detected and the length of their patterns. CONCLUSIONS: Our method shows high precision and sensitivity. With low false positive rates even for highly erroneous reads, MixTaR is able to detect accurate tandem repeats with pattern lengths varying within a significant interval.
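The de Bruijn graph mentioned in the METHODS can be illustrated with a minimal construction from reads. This is a generic textbook sketch, not MixTaR's implementation: the function name, the choice of k, and the toy reads are assumptions, and no error handling for sequencing errors is attempted.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer of a read."""
    graph = defaultdict(set)
    for read in reads:
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return dict(graph)


# Toy reads containing the repeated pattern "ACG" (illustrative only).
reads = ["ACGACGAC", "GACGT"]
graph = de_bruijn_graph(reads, k=3)
```

In this toy graph the nodes AC, CG and GA form a cycle, which is the kind of structure a repeated pattern leaves behind; how MixTaR actually extracts and validates patterns is described only at the level given in the abstract.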
Subjects
Algorithms, Tandem Repeat Sequences/genetics, Animals, Caenorhabditis elegans/genetics, Chromosomes, Bacterial Genome, Legionella pneumophila/genetics
ABSTRACT
Epigenome modulation potentially provides a mechanism for organisms to adapt, within and between generations. However, neither the extent to which this occurs, nor the mechanisms involved are known. Here we investigate DNA methylation variation in Swedish Arabidopsis thaliana accessions grown at two different temperatures. Environmental effects were limited to transposons, where CHH methylation was found to increase with temperature. Genome-wide association studies (GWAS) revealed that the extensive CHH methylation variation was strongly associated with genetic variants in both cis and trans, including a major trans-association close to the DNA methyltransferase CMT2. Unlike CHH methylation, CpG gene body methylation (GBM) was not affected by growth temperature, but was instead correlated with the latitude of origin. Accessions from colder regions had higher levels of GBM for a significant fraction of the genome, and this was associated with increased transcription for the genes affected. GWAS revealed that this effect was largely due to trans-acting loci, many of which showed evidence of local adaptation.
Subjects
Physiological Adaptation/genetics, Arabidopsis Proteins/genetics, Arabidopsis/genetics, DNA (Cytosine-5-)-Methyltransferases/genetics, Plant Gene Expression Regulation, Plant Genome, Arabidopsis/metabolism, Arabidopsis Proteins/metabolism, CpG Islands, DNA (Cytosine-5-)-Methyltransferases/metabolism, DNA Methylation, DNA Transposable Elements, Genetic Epigenesis, Gene Expression Profiling, Genetic Variation, Genome-Wide Association Study, Temperature, Genetic Transcription
ABSTRACT
We present Oqtans, an open-source workbench for quantitative transcriptome analysis, that is integrated in Galaxy. Its distinguishing features include customizable computational workflows and a modular pipeline architecture that facilitates comparative assessment of tool and data quality. Oqtans integrates an assortment of machine learning-powered tools into Galaxy, which show superior or equal performance to state-of-the-art tools. Implemented tools comprise a complete transcriptome analysis workflow: short-read alignment, transcript identification/quantification and differential expression analysis. Oqtans and Galaxy facilitate persistent storage, data exchange and documentation of intermediate results and analysis workflows. We illustrate how Oqtans aids the interpretation of data from different experiments in easy to understand use cases. Users can easily create their own workflows and extend Oqtans by integrating specific tools. Oqtans is available as (i) a cloud machine image with a demo instance at cloud.oqtans.org, (ii) a public Galaxy instance at galaxy.cbio.mskcc.org, (iii) a git repository containing all installed software (oqtans.org/git); most of which is also available from (iv) the Galaxy Toolshed and (v) a share string to use along with Galaxy CloudMan.
Subjects
RNA/genetics, RNA Sequence Analysis/methods, Transcriptome, Base Sequence, Internet, Software
ABSTRACT
CONTEXT: Prophylactic intraoperative ureteral stent placement is performed to decrease operative ureteric injury, though few data are available on the effectiveness of this procedure, and no data are available on its cost. OBJECTIVE: To analyze the cost of prophylactic intraoperative cystoscopic ureteral stents in gynecologic surgery. METHODS: All cases of prophylactic ureteral stent placement performed in gynecologic surgery during a 1-year period were identified and retrospectively reviewed through the electronic medical records database of Summa Health System. Costs were obtained through the Healthcare Cost Accounting System. The principles of cost-effectiveness analysis were used (ie, explicit and detailed descriptions of costs and cost-effectiveness statistics). Importantly, we evaluated cost and not charges or financial model estimates. In addition, we obtained the contribution margins (ie, the hospital's net profit or loss) for prophylactic ureteral stent placement. Other gynecologic procedures were also analyzed. RESULTS: Among 792 major inpatient gynecologic procedures, 18 cases of prophylactic intraoperative ureteral stents were identified. Median costs were as follows: additional cost of prophylactic intraoperative ureteral stenting, $1580; additional cost of surgical resources, $770; cost of ureteral catheters, $427; cost of surgeons, $383. The contribution margins per case for various gynecologic surgical procedures were as follows: oophorectomy, $2804 profit; abdominal hysterectomy, $2649 profit; laparoscopically assisted vaginal hysterectomy (LAVH), $1760 profit. When intraoperative ureteral stenting was added, the contribution margins changed to the following: oophorectomy, $782 profit; abdominal hysterectomy, $627 profit; LAVH, $262 loss. Overall, the contribution margin profit was decreased by about 85%, from $2400 to $380. CONCLUSION: Prophylactic intraoperative ureteral stenting in gynecologic surgery decreases a hospital's contribution margin. Because of the expense of this procedure, as well as scientific data suggesting a lack of effectiveness, the authors argue that prophylactic intraoperative ureteral stenting should not be used in gynecologic surgery to decrease operative ureteric injury.
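The margin figures quoted in the RESULTS can be checked with a few lines of arithmetic. The numbers below are copied from the abstract; the variable names are illustrative, and the calculation is only a reader's check, not part of the study's method.

```python
# Contribution margins per case in US dollars, as quoted in the abstract.
margins_without_stent = {"oophorectomy": 2804, "abdominal hysterectomy": 2649, "LAVH": 1760}
margins_with_stent = {"oophorectomy": 782, "abdominal hysterectomy": 627, "LAVH": -262}

# Per-procedure drop in contribution margin when stenting is added.
margin_drop = {
    name: margins_without_stent[name] - margins_with_stent[name]
    for name in margins_without_stent
}

# Relative decrease of the overall margin, from $2400 to $380.
overall_decrease = (2400 - 380) / 2400
```

Each procedure's margin falls by the same amount, $2022, and the overall relative decrease works out to roughly 84%, consistent with the "about 85%" stated in the abstract.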
Subjects
Cystoscopy/economics, Gynecologic Surgical Procedures/economics, Intraoperative Care/economics, Stents/economics, Ureter/surgery, Adult, Aged, Cost-Benefit Analysis, Cystoscopy/instrumentation, Female, Gynecologic Surgical Procedures/instrumentation, Humans, Inpatients, Intraoperative Care/instrumentation, Middle Aged, Pennsylvania, Retrospective Studies
ABSTRACT
Genetic differences between Arabidopsis thaliana accessions underlie the plant's extensive phenotypic variation, and until now these have been interpreted largely in the context of the annotated reference accession Col-0. Here we report the sequencing, assembly and annotation of the genomes of 18 natural A. thaliana accessions, and their transcriptomes. When assessed on the basis of the reference annotation, one-third of protein-coding genes are predicted to be disrupted in at least one accession. However, re-annotation of each genome revealed that alternative gene models often restore coding potential. Gene expression in seedlings differed for nearly half of expressed genes and was frequently associated with cis variants within 5 kilobases, as were intron retention alternative splicing events. Sequence and expression variation is most pronounced in genes that respond to the biotic environment. Our data further promote evolutionary and functional studies in A. thaliana, especially the MAGIC genetic reference population descended from these accessions.
Subjects
Arabidopsis/genetics, Gene Expression Profiling, Plant Gene Expression Regulation/genetics, Plant Genome/genetics, Genetic Transcription/genetics, Arabidopsis/classification, Arabidopsis Proteins/genetics, Base Sequence, Plant Genes/genetics, Genomics, Haplotypes/genetics, INDEL Mutation/genetics, Molecular Sequence Annotation, Phylogeny, Single Nucleotide Polymorphism/genetics, Proteome/genetics, Seedlings/genetics, DNA Sequence Analysis
ABSTRACT
Current methods for detecting synteny work well for genomes with high degrees of inter- and intra-species chromosomal homology, such as mammals. This paper presents a new algorithm for synteny computation that is well suited to genomes covering a large evolutionary span. It is based on a three-step process: identification of initial microsyntenic homologous regions, extension of homologous boundaries and reconstruction of syntenic blocks by identification of groups of homologous genomic segments that are conserved in every subject genome. Our method performs as well as GRIMM-Synteny on mammalian genomes, and outperforms it for clades with much greater evolutionary distances such as the Hemiascomycetous yeasts.
Subjects
Algorithms, Synteny, Animals, Computational Biology, Concatenated DNA/genetics, Molecular Evolution, Gene Duplication, Genomics/statistics & numerical data, Humans, Mammals/genetics, Yeasts/genetics
ABSTRACT
Next-generation sequencing technologies have revolutionized genome and transcriptome sequencing. RNA-Seq experiments are able to generate huge amounts of transcriptome sequence reads at a fraction of the cost of Sanger sequencing. Reads produced by these technologies are relatively short and error prone. To utilize such reads for transcriptome reconstruction and gene-structure identification, one needs to be able to accurately align the sequence reads over intron boundaries. In this unit, we describe PALMapper, a fast and easy-to-use tool that is designed to accurately compute both unspliced and spliced alignments for millions of RNA-Seq reads. It combines the efficient read mapper GenomeMapper with the spliced aligner QPALMA, which exploits read-quality information and predictions of splice sites to improve the alignment accuracy. The PALMapper package is available as a command-line tool running on Unix or Mac OS X systems or through a Web interface based on Galaxy tools.
Subjects
Genomics/methods, RNA/chemistry, Sequence Alignment/methods, RNA Sequence Analysis/methods, Software, Base Sequence, Gene Expression Profiling, Genome, RNA Splicing
ABSTRACT
The study of evolutionary mechanisms is made more and more accurate by the increase in the number of fully sequenced genomes. One of the main problems is to reconstruct plausible ancestral genome architectures based on the comparison of contemporary genomes. Current methods have largely focused on finding complete architectures for ancestral genomes, and, due to the computational difficulty of the problem, stop after a small number of equivalent minimal solutions have been found. Recent results suggest, however, that the set of minimum complete architectures is very large and heterogeneous. In fact these solutions are collections of conserved blocks, freely rearranged. In this paper, we identify these conserved super-blocks, using a new method of analysis of ancestral architectures that reconciles both breakpoint and rearrangement analyses, as well as respects biological constraints. The resulting algorithms permit the first reliable reconstruction of plausible ancestral architectures for several non-WGD yeasts simultaneously, a problem hitherto intractable due to the extensive map reshuffling of these species. See online Supplementary Material at www.liebertonline.com.
Subjects
Algorithms, Molecular Evolution, Genome, Animals, Cats, Computer Simulation, Fungal Genome, Humans, Mice, Genetic Models, Phylogeny
ABSTRACT
Our knowledge of yeast genomes remains largely dominated by the extensive studies on Saccharomyces cerevisiae and the consequences of its ancestral duplication, leaving the evolution of the entire class of hemiascomycetes only partly explored. We concentrate here on five species of Saccharomycetaceae, a large subdivision of hemiascomycetes, that we call "protoploid" because they diverged from the S. cerevisiae lineage prior to its genome duplication. We determined the complete genome sequences of three of these species: Kluyveromyces (Lachancea) thermotolerans and Saccharomyces (Lachancea) kluyveri (two members of the newly described Lachancea clade), and Zygosaccharomyces rouxii. We included in our comparisons the previously available sequences of Kluyveromyces lactis and Ashbya (Eremothecium) gossypii. Despite their broad evolutionary range and significant individual variations in each lineage, the five protoploid Saccharomycetaceae share a core repertoire of approximately 3300 protein families and a high degree of conserved synteny. Synteny blocks were used to define gene orthology and to infer ancestors. Far from representing minimal genomes without redundancy, the five protoploid yeasts contain numerous copies of paralogous genes, either dispersed or in tandem arrays, that, altogether, constitute a third of each genome. Ancient, conserved paralogs as well as novel, lineage-specific paralogs were identified.