ABSTRACT
Many questions in biology benefit greatly from the use of a variety of model systems. High-throughput sequencing methods have been a triumph in the democratization of diverse model systems. They allow for the economical sequencing of an entire genome or transcriptome of interest, and with technical variations can even provide insight into genome organization and the expression and regulation of genes. The analysis and biological interpretation of such large datasets can present significant challenges that depend on the 'scientific status' of the model system. While high-quality genome and transcriptome references are readily available for well-established model systems, the establishment of such references for an emerging model system often requires extensive resources such as finances, expertise and computational capability. The de novo assembly of a transcriptome represents an excellent entry point for genetic and molecular studies in emerging model systems, as it can efficiently assess gene content while also serving as a reference for differential gene expression studies. However, the process of de novo transcriptome assembly is non-trivial and, as a rule, must be empirically optimized for every dataset. For the researcher working with an emerging model system, with little to no experience in assembling and quantifying short-read data from the Illumina platform, these processes can be daunting. In this guide, we outline the major challenges faced when establishing a reference transcriptome de novo and provide advice on how to approach such an endeavor. We describe the major experimental and bioinformatic steps and provide broad recommendations and cautions for the newcomer to de novo transcriptome assembly and differential gene expression analyses. Moreover, we provide an initial selection of tools that can assist in the journey from raw short-read data to an assembled transcriptome and lists of differentially expressed genes.
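As a concrete illustration of the first bioinformatic step on that journey, the sketch below filters raw Illumina reads by mean base quality before assembly. It is a minimal Python example under assumed inputs (the file names and the Q20 mean-quality cutoff are illustrative), not a substitute for dedicated trimming tools.

    # Minimal read-quality filter of the kind typically run before de novo
    # transcriptome assembly. File names and the Q20 cutoff are illustrative.
    import gzip

    def mean_phred(qual, offset=33):
        """Mean Phred quality of one FASTQ quality string (Sanger encoding)."""
        return sum(ord(c) - offset for c in qual) / len(qual)

    def filter_fastq(path_in, path_out, min_q=20.0):
        kept = total = 0
        with gzip.open(path_in, "rt") as fin, gzip.open(path_out, "wt") as fout:
            while True:
                record = [fin.readline() for _ in range(4)]  # header, seq, '+', qual
                if not record[0]:
                    break
                total += 1
                if mean_phred(record[3].rstrip("\n")) >= min_q:
                    fout.writelines(record)
                    kept += 1
        return kept, total

    # kept, total = filter_fastq("reads_R1.fastq.gz", "reads_R1.filtered.fastq.gz")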
ABSTRACT
BACKGROUND: Genomic sequencing read compressors are essential for balancing the generation speed of high-throughput short reads, large-scale genomic data sharing, and storage infrastructure expenditure. However, most existing short-read compressors rarely exploit big-memory systems or the duplicative information shared between diverse sequencing files to achieve a higher compression ratio and conserve read-data storage space. RESULTS: We take compression ratio as the optimization objective and propose PMFFRC, a large-scale genomic sequencing short-read data compression optimizer built on novel memory modeling and redundant-reads clustering technologies. When cascaded with PMFFRC on 982 GB of FASTQ-format sequencing data, comprising 274 GB and 3.3 billion short reads, the state-of-the-art reference-free compressors HARC, SPRING, Mstcom, and FastqCLS achieve average maximum compression-ratio gains of 77.89%, 77.56%, 73.51%, and 29.36%, respectively. PMFFRC saves 39.41%, 41.62%, 40.99%, and 20.19% of storage space compared with the four unoptimized compressors. CONCLUSIONS: PMFFRC makes rational use of the compression server's large memory, effectively reducing the storage footprint of sequencing read data and thereby relieving basic storage facility costs and the transmission overhead of community sharing. Our work furnishes a novel solution for improving sequencing read compression and saving storage space. The proposed PMFFRC algorithm is packaged in a same-name Linux toolkit, freely available at https://github.com/fahaihi/PMFFRC .
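To make the clustering idea tangible, here is a toy Python sketch of redundancy-aware grouping: files with similar k-mer content are placed in the same group so that a downstream compressor can exploit their shared redundancy. This is not the PMFFRC algorithm itself; the k-mer size, similarity measure and threshold are all illustrative assumptions.

    # Toy grouping of FASTQ files by k-mer content similarity.
    def kmer_profile(seqs, k=8, max_kmers=100_000):
        """Set of k-mers seen in an iterable of read sequences (capped)."""
        prof = set()
        for s in seqs:
            for i in range(len(s) - k + 1):
                prof.add(s[i:i + k])
                if len(prof) >= max_kmers:
                    return prof
        return prof

    def jaccard(a, b):
        return len(a & b) / max(1, len(a | b))

    def group_files(profiles, threshold=0.5):
        """Greedy clustering: put a file into the first group it resembles."""
        groups = []  # list of (representative_profile, [file_names])
        for name, prof in profiles.items():
            for rep, members in groups:
                if jaccard(prof, rep) >= threshold:
                    members.append(name)
                    break
            else:
                groups.append((prof, [name]))
        return [members for _, members in groups]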
Subjects
Data Compression, Software, Algorithms, Genomics, High-Throughput Nucleotide Sequencing, Cluster Analysis, Sequence Analysis, DNA
ABSTRACT
Ciborinia camelliae Kohn is a camellia pathogen belonging to the family Sclerotiniaceae, infecting only the flowers of camellias. To better understand the virulence mechanism of this species, the draft genome sequence of the Italian strain of C. camelliae was obtained with a hybrid approach, combining Illumina HiSeq paired reads and MinION Nanopore long-read sequencing. This combination significantly improved the existing National Center for Biotechnology Information reference genome. Assembly contiguity was improved, with the contig number decreasing from 2,604 to 49. The N50 contig size increased from 31,803 to 2,726,972 bp and the completeness of the assembly increased from 94.5 to 97.3% according to BUSCO analysis. This work is foundational for functional analysis of the infection process in this scarcely known floral pathogen.
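For readers new to the contiguity statistic quoted above: N50 is the length L such that contigs of length at least L together cover at least half of the total assembly size. A short worked example in Python:

    def n50(contig_lengths):
        """N50: length L such that contigs >= L cover half the assembly."""
        total = sum(contig_lengths)
        running = 0
        for length in sorted(contig_lengths, reverse=True):
            running += length
            if running * 2 >= total:
                return length

    # n50([80, 70, 50, 40, 30, 20]) == 70  (80 + 70 = 150 >= 290 / 2)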
Subjects
Ascomycota, Camellia, Camellia/genetics, Genome, Ascomycota/genetics, Flowers
ABSTRACT
With the rapid progress of sequencing technologies, various types of sequencing reads and assembly algorithms have been designed to construct genome assemblies. Although recent studies have attempted to evaluate the appropriate types of sequencing reads and algorithms for assembling high-quality genomes, it remains a challenge to choose the correct combination for constructing animal genomes. Here, we present a comparative performance assessment of 14 assembly combinations (nine software programs paired with different short and long reads of the Duroc pig). Based on the results of this optimization process, we designed an integrated hybrid de novo assembly pipeline, HSCG, and constructed a draft genome for the Duroc pig. Comparison between the new genome and Sus scrofa 11.1 revealed important breakpoints in two S. scrofa 11.1 genes. Our findings may provide new insights into pan-genome analysis studies of agricultural animals, and the integrated assembly pipeline may serve as a guide for the assembly of other animal genomes.
Subjects
Algorithms, Chromosome Mapping/methods, Contig Mapping/methods, Genome, Swine/genetics, Animals, Gene Library, High-Throughput Nucleotide Sequencing, Male, Sequence Analysis, DNA, Software
ABSTRACT
Computational methods based on whole-genome linked reads and short reads have been successful in genome assembly and detection of structural variants (SVs). Numerous variant callers that rely on linked reads and short reads can detect genetic variations, including SVs. A shortcoming of existing tools is a propensity for overestimating SVs, especially deletions. Exploiting the full advantages of linked-read and short-read sequencing technologies would thus benefit from an additional step to effectively identify and eliminate false positive large deletions. Here, we introduce a novel tool, AquilaDeepFilter, aiming to automatically filter genome-wide false positive large deletions. Our approach transforms sequencing data into an image and then relies on convolutional neural networks to improve the classification of candidate deletions. Input data take into account multiple alignment signals, including read depth, split reads and discordant read pairs. We tested the performance of AquilaDeepFilter on five linked-read and short-read libraries sequenced from the well-studied NA24385 sample, validated against the Genome in a Bottle benchmark. To demonstrate the filtering ability of AquilaDeepFilter, we used the SV calls from three upstream SV detection tools, Aquila, Aquila_stLFR and Delly, as the baseline. We showed that AquilaDeepFilter increased precision while preserving the recall of all three tools. The overall F1-score improved by an average of 20% on linked-read data and by an average of 15% on short-read data. AquilaDeepFilter also compared favorably to existing deep-learning-based methods for SV filtering, such as DeepSVFilter. AquilaDeepFilter is thus an effective SV refinement framework that can improve SV calling for both linked-read and short-read data.
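The encoding step can be pictured with a short pysam sketch: alignment signals over a candidate deletion are summarized as a small image-like array, one channel each for read depth, clipped (split) reads and discordant pairs. The bin count and signal definitions are illustrative assumptions, not AquilaDeepFilter's actual featurization.

    import numpy as np
    import pysam

    def encode_region(bam_path, chrom, start, end, bins=128):
        """3 x bins array over [start, end) from an indexed BAM."""
        img = np.zeros((3, bins), dtype=np.float32)  # depth, split, discordant
        scale = bins / max(1, end - start)
        with pysam.AlignmentFile(bam_path, "rb") as bam:
            for read in bam.fetch(chrom, start, end):
                if read.is_unmapped:
                    continue
                b = int((read.reference_start - start) * scale)
                if not 0 <= b < bins:
                    continue
                img[0, b] += 1                          # depth proxy
                cig = read.cigartuples or []
                if any(op in (4, 5) for op, _ in cig):  # soft/hard clip
                    img[1, b] += 1
                if read.is_paired and not read.is_proper_pair:
                    img[2, b] += 1                      # discordant pair
        return img  # e.g. normalized per channel, then fed to a CNN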
Subjects
Deep Learning, Genome, Human, Base Sequence, High-Throughput Nucleotide Sequencing/methods, Humans, Sequence Analysis, Sequence Analysis, DNA/methods
ABSTRACT
BACKGROUND: De novo genome assembly is essential to modern genomics studies. As it is not biased by a reference, it is also a useful method for studying genomes with high variation, such as cancer genomes. De novo short-read assemblers commonly use de Bruijn graphs, where nodes are sequences of equal length k, also known as k-mers. Edges in this graph are established between nodes that overlap by k-1 bases, and nodes along unambiguous walks in the graph are subsequently merged. The selection of k is influenced by multiple factors, and optimizing this value results in a trade-off between graph connectivity and sequence contiguity. Ideally, multiple k sizes should be used, so that lower values can provide good connectivity in poorly covered regions and higher values can increase contiguity in well-covered regions. However, current approaches that use multiple k values do not address the scalability issues inherent to the assembly of large genomes. RESULTS: Here we present RResolver, a scalable algorithm that takes a short-read de Bruijn graph assembly with a starting k as input and uses a k value closer to the read length to resolve repeats. RResolver builds a Bloom filter of sequencing reads, which is used to evaluate the assembly graph path support at branching points, and removes paths with insufficient support. RResolver runs efficiently, taking only 26 min on average for an ABySS human assembly with 48 threads and 60 GiB memory. Across all experiments, compared to a baseline assembly, RResolver improves scaffold contiguity (NGA50) by up to 15% and reduces misassemblies by up to 12%. CONCLUSIONS: RResolver adds a missing component to scalable de Bruijn graph genome assembly. By improving the initial and fundamental graph traversal outcome, all downstream ABySS algorithms greatly benefit from working with a more accurate and less complex representation of the genome. The RResolver code is integrated into ABySS and is available at https://github.com/bcgsc/abyss/tree/master/RResolver .
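Since the Bloom filter is the core data structure here, a minimal Python version may help: reads (or their fixed-length subsequences) are inserted, and a candidate path through a branching point is kept only if its subsequences are found in the filter. The sizing and hashing choices below are illustrative, not those of RResolver.

    import hashlib

    class BloomFilter:
        def __init__(self, size_bits=1 << 20, num_hashes=3):
            self.size = size_bits
            self.k = num_hashes
            self.bits = bytearray(size_bits // 8 + 1)

        def _positions(self, item):
            for i in range(self.k):
                h = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.size

        def add(self, item):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item))

    # bf = BloomFilter(); bf.add("ACGTACGT"); "ACGTACGT" in bf  -> True
    # (Bloom filters can yield false positives but never false negatives.)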
Subjects
Genomics, Software, Algorithms, Genome, Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Humans, Sequence Analysis, DNA/methods
ABSTRACT
BACKGROUND: To understand the dynamics of infectious diseases, genomic epidemiology is increasingly advocated, with a need for the rapid generation of genetic sequences during outbreaks for public health decision making. Here, we explore the use of metagenomic sequencing compared with specific amplicon- and capture-based sequencing, on both the Nanopore and the Illumina platform, for generating whole genomes of Usutu virus, Zika virus, West Nile virus, and Yellow Fever virus. RESULTS: We show that amplicon-based Nanopore sequencing can be used to rapidly obtain whole genome sequences from samples with viral loads up to Ct 33, and that capture-based Illumina sequencing is the most sensitive method for initial virus determination. CONCLUSIONS: The choice of sequencing approach and platform is important for laboratories wishing to start whole genome sequencing, and the best choice can differ depending on the purpose of the sequencing. The insights presented in this work, and the demonstrated differences in data characteristics, can guide laboratories in making a well-informed choice.
Subjects
Nanopore Sequencing, Zika Virus Infection, Zika virus, Disease Outbreaks, Genome, Viral, High-Throughput Nucleotide Sequencing/methods, Humans, Metagenomics/methods, Whole Genome Sequencing/methods, Zika virus/genetics
ABSTRACT
BACKGROUND: Full-length isoform quantification from RNA-Seq is a key goal of transcriptomics analyses and has been an area of active development since the field's beginning. The fundamental difficulty stems from the fact that RNA transcripts are long, while RNA-Seq reads are short. RESULTS: Here we use simulated benchmarking data that reflect many properties of real data, including polymorphisms, intron signal, and non-uniform coverage, allowing systematic comparative analyses of isoform quantification accuracy and its impact on differential expression analysis. Genome-, transcriptome-, and pseudo-alignment-based methods are included, with a simple approach serving as a baseline control. CONCLUSIONS: Salmon, kallisto, RSEM, and Cufflinks exhibit the highest accuracy on idealized data, while on more realistic data they do not perform dramatically better than the simple approach. We determine that the structural parameters with the greatest impact on quantification accuracy are transcript length and sequence compression complexity, rather than the number of isoforms. The effect of incomplete annotation on performance is also investigated. Overall, the tested methods show sufficient divergence from the truth to suggest that full-length isoform quantification and isoform-level differential expression analysis should still be employed selectively.
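Most of the compared tools share one core idea: reads compatible with several isoforms are apportioned by expectation-maximization. The toy EM below works under strong simplifying assumptions (uniform coverage, equal effective transcript lengths) and sketches the principle rather than any specific tool.

    def em_quantify(compat, n_transcripts, iters=100):
        """compat: list of sets; compat[r] = transcripts read r could come from."""
        theta = [1.0 / n_transcripts] * n_transcripts
        for _ in range(iters):
            counts = [0.0] * n_transcripts
            for ts in compat:                      # E-step: split each read
                z = sum(theta[t] for t in ts)
                for t in ts:
                    counts[t] += theta[t] / z      # fractional assignment
            total = sum(counts)
            theta = [c / total for c in counts]    # M-step: new abundances
        return theta

    # em_quantify([{0}, {0, 1}, {1}], 2) -> approximately [0.5, 0.5]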
Subjects
Gene Expression Profiling, Transcriptome, Protein Isoforms/genetics, RNA-Seq, Sequence Analysis, RNA
ABSTRACT
BACKGROUND: SNP arrays and short- and long-read genome sequencing are genome-wide high-throughput technologies that may be used to assay copy number variants (CNVs) in a personal genome. Each of these technologies comes with its own limitations and biases, many of which are well known, but not all of which have been thoroughly quantified. RESULTS: We assembled an ensemble of public datasets of published CNV calls and raw data for the well-studied Genome in a Bottle individual NA12878. This assembly represents a variety of methods and pipelines used for CNV calling from array, short-read and long-read technologies. We then performed cross-technology comparisons of their ability to call CNVs. Unlike other studies, we refrained from using a gold standard; instead, we attempted to validate the CNV calls against the raw data of each technology. CONCLUSIONS: Our study confirms that long-read platforms enable calling CNVs in genomic regions inaccessible to arrays or short reads. We also found that the reproducibility of a CNV across different pipelines within each technology is strongly linked to other measures of CNV evidence. Importantly, the three technologies show distinct public database frequency profiles, which differ depending on the technology the database was built on.
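Cross-technology comparisons of CNV calls typically match calls by reciprocal overlap; a commonly used criterion (assumed here, the study's exact matching rule may differ) is 50% reciprocal overlap, meaning each call must cover at least half of the other:

    def reciprocal_overlap(a, b, frac=0.5):
        """a, b: (chrom, start, end) CNV calls; True if they match."""
        if a[0] != b[0]:
            return False
        ov = min(a[2], b[2]) - max(a[1], b[1])
        if ov <= 0:
            return False
        return ov >= frac * (a[2] - a[1]) and ov >= frac * (b[2] - b[1])

    # reciprocal_overlap(("chr1", 100, 200), ("chr1", 150, 260)) -> False
    # (50 bp shared: >= half of the 100 bp call, but < half of the 110 bp call)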
Subjects
DNA Copy Number Variations, Polymorphism, Single Nucleotide, Genome, Genomics, Reproducibility of Results
ABSTRACT
The collection of all transcripts in a cell, a tissue, or an organism is called the transcriptome, or the meta-transcriptome when dealing with the transcripts of a community of different organisms. Nowadays, a vast array of technologies allows us to assess the (meta-)transcriptome in terms of its composition (which transcripts are produced) and the abundance of its components (the expression level of each transcript), and we can do this across several samples, conditions, and time points, at costs that decrease year after year, allowing experimental designs of ever-increasing complexity. Here we present the current state of the art in technologies for studying plant transcriptomes and their applications, including differential gene expression and coexpression analyses, identification of sequence polymorphisms, the application of machine learning to the identification of alternative splicing and ncRNAs, and the ranking of candidate genes for downstream studies. We continue with a collection of examples of these approaches across a diverse array of plant species, covering the generation of gene/transcript catalogs and atlases, population mapping, identification of genes related to stress phenotypes, and phylogenomics. We close the chapter with some of our ideas about the future of this dynamic field in plant physiology.
Subjects
Gene Expression Profiling, Plants/genetics, Transcriptome, Alternative Splicing, High-Throughput Nucleotide Sequencing, Sequence Analysis, RNA
ABSTRACT
BACKGROUND: Since 2009, numerous tools have been developed to detect structural variants using short-read technologies. Insertions >50 bp are among the hardest types to discover and are drastically underrepresented in gold-standard variant callsets. The advent of long-read technologies has completely changed the situation. In 2019, two independent cross-technology studies published the most complete variant callsets with sequence-resolved insertions in human individuals. Among the reported insertions, only 17 to 28% could be discovered with short-read-based tools. RESULTS: In this work, we performed an in-depth analysis of these unprecedented insertion callsets to investigate the causes of such failures. We first established a precise classification of insertion variants according to four layers of characterization: the nature of the inserted sequence, its size, the genomic context of the insertion site, and the breakpoint junction complexity. Because these layers are intertwined, we then used simulations to characterize the impact of each complexity factor on the recall of several structural variant callers. We showed that most reported insertions exhibited characteristics that may interfere with their discovery: 63% were tandem repeat expansions, 38% contained homology larger than 10 bp within their breakpoint junctions, and 70% were located in simple repeats. Consequently, the recall of short-read-based variant callers was significantly lower for such insertions (6% for tandem repeats vs 56% for mobile element insertions). Simulations showed that the most impactful factor was the insertion type rather than the genomic context, with the tested structural variant callers handling the various difficulties differently, and they highlighted the lack of sequence resolution in most insertion calls. CONCLUSIONS: Our results explain the low recall by pointing out several difficulty factors among the observed insertion features and provide avenues for improving SV caller algorithms and their combinations.
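One of the difficulty factors above, breakpoint junction homology, is identity between the ends of the inserted sequence and the reference flanks, which makes the exact insertion point ambiguous. A simple way to measure it (the definitions below are a plausible reading, not the paper's code):

    def junction_homology(insert, left_flank, right_flank):
        """Longest match between insert ends and the adjacent reference flanks."""
        right = 0  # prefix of insert vs reference just right of the breakpoint
        while (right < min(len(insert), len(right_flank))
               and insert[right] == right_flank[right]):
            right += 1
        left = 0   # suffix of insert vs reference just left of the breakpoint
        while (left < min(len(insert), len(left_flank))
               and insert[-1 - left] == left_flank[-1 - left]):
            left += 1
        return max(left, right)

    # junction_homology("ACGTTT", left_flank="GGACG", right_flank="ACGAA") -> 3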
Subjects
Genome, Genomics, Algorithms, Base Sequence, Humans, Sequence Analysis, Sequence Analysis, DNA
ABSTRACT
BACKGROUND: Patient-Derived Tumour Xenografts (PDTXs) have emerged as the pre-clinical models that best represent clinical tumour diversity and intra-tumour heterogeneity. The molecular characterization of PDTXs using High-Throughput Sequencing (HTS) is essential; however, the presence of mouse stroma is challenging for HTS data analysis. Indeed, the high homology between the two genomes results in a proportion of mouse reads being mapped as human. RESULTS: In this study we generated Whole Exome Sequencing (WES), Reduced Representation Bisulfite Sequencing (RRBS) and RNA sequencing (RNA-seq) data from samples with known mixtures of mouse and human DNA or RNA and from a cohort of human breast cancers and their derived PDTXs. We show that using an In silico Combined human-mouse Reference Genome (ICRG) for alignment discriminates between human and mouse reads with up to 99.9% accuracy and decreases the number of false positive somatic mutations caused by misalignment by >99.9%. We also derived a model to estimate the human DNA content in independent PDTX samples. For RNA-seq and RRBS data analysis, the use of the ICRG allows dissecting computationally the transcriptome and methylome of human tumour cells and mouse stroma. In a direct comparison with previously reported approaches, our method showed similar or higher accuracy while requiring significantly less computing time. CONCLUSIONS: The computational pipeline we describe here is a valuable tool for the molecular analysis of PDTXs as well as any other mixture of DNA or RNA species.
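The read-disambiguation step enabled by a combined reference can be sketched in a few lines of pysam: after aligning PDTX reads to a FASTA whose contig names carry a species prefix (an assumption of this sketch, e.g. "hg38_chr1" and "mm10_chr1"), each read is routed to the genome it aligned to.

    import pysam

    def split_by_species(bam_in, bam_human, bam_mouse, human_prefix="hg38_"):
        """Route primary alignments by the species prefix of their contig."""
        with pysam.AlignmentFile(bam_in, "rb") as src:
            out_h = pysam.AlignmentFile(bam_human, "wb", template=src)
            out_m = pysam.AlignmentFile(bam_mouse, "wb", template=src)
            for read in src:
                if read.is_unmapped or read.is_secondary or read.is_supplementary:
                    continue
                ref = src.get_reference_name(read.reference_id)
                (out_h if ref.startswith(human_prefix) else out_m).write(read)
            out_h.close()
            out_m.close()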
Subjects
Genomics/methods, High-Throughput Nucleotide Sequencing/methods, Xenograft Model Antitumor Assays, Animals, Breast Neoplasms/genetics, Breast Neoplasms/metabolism, Gene Expression Profiling, Humans, Mice, Mutation, Sequence Alignment, Sequence Analysis, DNA, Sequence Analysis, RNA
ABSTRACT
Recent advancements in sequencing technologies have reduced both the time and cost of sequencing whole bacterial genomes. High-throughput Next-Generation Sequencing (NGS) technology has led to the generation of enormous amounts of data on microbial populations, publicly available across various repositories. As a consequence, it has become possible to study and compare the genomes of different bacterial strains within a species or genus in terms of evolution, ecology and diversity. Studying the pan-genome provides insights into microevolution, global composition and diversity in the virulence and pathogenesis of a species. It can also assist in identifying drug targets and proposing vaccine candidates. The effective analysis of these large genome datasets necessitates the development of robust tools. Current methods for building a pan-genome do not support direct input of raw reads from the sequencer but require the reads to be preprocessed into an assembled protein/gene sequence file or a binary matrix of orthologous genes/proteins. We have designed an easy-to-use integrated pipeline, NGSPanPipe, which can identify the pan-genome directly from short reads. The output from the pipeline is compatible with other pan-genome analysis tools. We evaluated our pipeline against other approaches for developing a pan-genome, i.e. reference-based assembly and de novo assembly, using simulated reads of Mycobacterium tuberculosis. The single-script pipeline (pipeline.pl) is applicable to all bacterial strains. It integrates multiple in-house Perl scripts and is freely accessible from https://github.com/Biomedinformatics/NGSPanPipe .
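A pan-genome is commonly summarized as the binary gene presence/absence matrix mentioned above. A minimal sketch of building one from per-strain ortholog-group assignments (the input format is an assumption of this sketch):

    def pan_genome_matrix(strain_genes):
        """strain_genes: dict strain -> set of ortholog-group IDs."""
        all_genes = sorted(set.union(*strain_genes.values()))
        matrix = {s: [int(g in genes) for g in all_genes]
                  for s, genes in strain_genes.items()}
        core = [g for g in all_genes
                if all(g in genes for genes in strain_genes.values())]
        return all_genes, matrix, core

    genes, m, core = pan_genome_matrix({
        "strainA": {"gyrA", "katG", "pncA"},
        "strainB": {"gyrA", "katG"},
    })
    # core == ["gyrA", "katG"]; "pncA" is accessory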
Subjects
Bacteria/genetics, Genome, Bacterial, Bacteria/classification, Bacteria/isolation & purification, Databases, Genetic, High-Throughput Nucleotide Sequencing
ABSTRACT
BACKGROUND: Next-generation sequencing has been used by investigators to address a diverse range of biological problems through, for example, polymorphism and mutation discovery and microRNA profiling. However, compared to conventional sequencing, the error rates for next-generation sequencing are often higher, which impacts downstream genomic analysis. Recently, Wang et al. (BMC Bioinformatics 13:185, 2012) proposed a shadow regression approach to estimate the error rates for next-generation sequencing data, based on the assumption of a linear relationship between the number of reads sequenced and the number of reads containing errors (denoted as shadows). However, this linear read-shadow relationship may not be appropriate for all types of sequence data, so it is necessary to estimate the error rates in a more reliable way without assuming linearity. We propose an empirical error rate estimation approach that employs cubic and robust smoothing splines to model the relationship between the number of reads sequenced and the number of shadows. RESULTS: We performed simulation studies using a frequency-based approach to generate the read and shadow counts directly, which can mimic the structure of real sequence count data. Using simulation, we investigated the performance of the proposed approach and compared it to that of shadow linear regression. The proposed approach provided more accurate error rate estimates than shadow linear regression for all scenarios tested. We also applied the proposed approach to assess the error rates of sequence data from the MicroArray Quality Control project, a mutation screening study, the Encyclopedia of DNA Elements project, and bacteriophage PhiX DNA samples. CONCLUSIONS: The proposed empirical error rate estimation approach does not assume a linear relationship between the error-free read and shadow counts and provides more accurate estimates of error rates for next-generation, short-read sequencing data.
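The spline idea can be sketched directly with SciPy: instead of forcing shadows = a x reads (a straight line), fit a smoothing spline to the read/shadow counts and read an error proxy off the fitted curve. The counts below are made up for illustration, and the shadow-proportion proxy is a simplification of the paper's estimator.

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    # Hypothetical counts of sequenced reads and their error-containing shadows.
    reads = np.array([10.0, 50.0, 100.0, 500.0, 1000.0, 5000.0, 10000.0])
    shadows = np.array([1.0, 4.0, 11.0, 48.0, 105.0, 520.0, 1180.0])

    spline = UnivariateSpline(reads, shadows, k=3, s=float(len(reads)))  # cubic
    proxy_rate = spline(reads) / reads  # fitted shadow proportion per read count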
Subjects
High-Throughput Nucleotide Sequencing/methods, Sequence Analysis, DNA/methods, Sequence Analysis, RNA/methods, Bacteriophage phi X 174/genetics, Computer Simulation, DNA, Viral/genetics, Genomics, Humans, K562 Cells, Linear Models, Oligonucleotide Array Sequence Analysis
ABSTRACT
BACKGROUND: Streptococcus suis is divided into 29 serotypes based on serological reactions against the capsular polysaccharide (CPS). Multiplex PCR tests targeting the cps locus are also used to determine S. suis serotypes, but they cannot differentiate between serotypes 1 and 14, or between serotypes 2 and 1/2. Here, we developed a pipeline permitting in silico serotype determination from whole-genome sequencing (WGS) short-read data that can readily identify all 29 S. suis serotypes. RESULTS: We sequenced the genomes of 121 strains representing all 29 known S. suis serotypes. We then combined available software into an automated pipeline permitting in silico serotyping of strains by differential alignment of short-read sequencing data to a custom S. suis cps loci database. Strains of the serotype pairs 1 and 14, and 2 and 1/2, could be differentiated by a missense mutation in the cpsK gene. We report a 99% match between coagglutination- and pipeline-determined serotypes for strains in our collection. We used 375 additional S. suis genomes downloaded from the NCBI Sequence Read Archive (SRA) to validate the pipeline; validation with the SRA WGS data resulted in a 92% match. Included pipeline subroutines permitted us to assess strain virulence-marker content and to obtain multilocus sequence types directly from WGS data. CONCLUSIONS: Our pipeline permits rapid and accurate determination of S. suis serotypes, and other lineage information, directly from WGS data. By discriminating between serotypes 1 and 14, and between serotypes 2 and 1/2, our approach solves a three-decade-old S. suis typing issue.
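A toy version of the serotype-calling logic helps fix ideas: call the serotype whose cps reference is best covered by the aligned reads, and flag the pairs that share cps loci for resolution at the discriminating cpsK site. The input format, the 0.9 coverage cutoff and the untypable fallback are illustrative assumptions, not the pipeline's actual parameters.

    AMBIGUOUS = {"1": "14", "14": "1", "2": "1/2", "1/2": "2"}

    def call_serotype(cps_coverage, min_cov=0.9):
        """cps_coverage: dict serotype -> fraction of its cps locus covered."""
        best, cov = max(cps_coverage.items(), key=lambda kv: kv[1])
        if cov < min_cov:
            return "untypable"
        if best in AMBIGUOUS:
            # Pairs 1/14 and 2/1-2 share cps loci; the final call requires the
            # discriminating cpsK missense site, not modeled in this toy code.
            return best + " or " + AMBIGUOUS[best]
        return best

    # call_serotype({"2": 0.98, "1/2": 0.97, "9": 0.12}) -> "2 or 1/2"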
Subjects
Serogroup, Serotyping, Streptococcus suis/genetics, Streptococcus suis/isolation & purification, Bacterial Capsules, Bacterial Proteins, Base Sequence, DNA, Bacterial/genetics, Gene Targeting, Genes, Bacterial, Genetic Loci, Genome, Bacterial, Multiplex Polymerase Chain Reaction, Polysaccharides, Bacterial/classification, Polysaccharides, Bacterial/genetics, Polysaccharides, Bacterial/immunology, Polysaccharides, Bacterial/isolation & purification, Sequence Alignment, Sequence Analysis, DNA, Streptococcus suis/classification, Streptococcus suis/immunology, Virulence/genetics, Virulence Factors
ABSTRACT
Cancer is a multifaceted disease arising from numerous genomic aberrations that have been identified as a result of advancements in sequencing technologies. While next-generation sequencing (NGS), which uses short reads, has transformed cancer research and diagnostics, it is limited by read length. Third-generation sequencing (TGS), led by the Pacific Biosciences and Oxford Nanopore Technologies platforms, employs long-read sequences, which have marked a paradigm shift in cancer research. Cancer genomes often harbour complex events, and TGS, with its ability to span large genomic regions, has facilitated their characterisation, providing a better understanding of how complex rearrangements affect cancer initiation and progression. TGS has also characterised the entire transcriptome of various cancers, revealing cancer-associated isoforms that could serve as biomarkers or therapeutic targets. Furthermore, TGS has advanced cancer research by improving genome assemblies, detecting complex variants, and providing a more complete picture of transcriptomes and epigenomes. This review focuses on TGS and its growing role in cancer research. We investigate its advantages and limitations, providing a rigorous scientific analysis of its use in detecting previously hidden aberrations missed by NGS. This promising technology holds immense potential for both research and clinical applications, with far-reaching implications for cancer diagnosis and treatment.
ABSTRACT
Accurate reconstruction of Escherichia coli antibiotic resistance gene (ARG) plasmids from Illumina sequencing data has proven to be a challenge with current bioinformatic tools. In this work, we present an improved method to reconstruct E. coli plasmids using short reads. We developed plasmidEC, an ensemble classifier that identifies plasmid-derived contigs by combining the output of three different binary classification tools. We showed that plasmidEC is especially suited to classifying contigs derived from ARG plasmids, with a high recall of 0.941. Additionally, we optimized gplas, a graph-based tool that bins plasmid-predicted contigs into distinct plasmid predictions. The resulting gplas2 is more effective at recovering plasmids with large sequencing-coverage variations and can be combined with the output of any binary classifier. The combination of plasmidEC with gplas2 showed high completeness (median = 0.818) and F1-score (median = 0.812) when reconstructing ARG plasmids, and exceeded the binning capacity of the reference-based method MOB-suite. In the absence of long-read data, our method offers an excellent alternative for reconstructing ARG plasmids in E. coli.
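The ensemble step reduces to a vote over per-contig predictions from the three binary classifiers; a majority rule is one natural way to combine them (assumed here as an illustration of the idea, not necessarily plasmidEC's exact rule):

    def majority_vote(predictions):
        """predictions: dict contig -> list of 'plasmid'/'chromosome' labels."""
        return {contig: ("plasmid" if labels.count("plasmid") >= 2
                         else "chromosome")
                for contig, labels in predictions.items()}

    calls = majority_vote({
        "contig_1": ["plasmid", "plasmid", "chromosome"],
        "contig_2": ["chromosome", "chromosome", "plasmid"],
    })
    # calls == {"contig_1": "plasmid", "contig_2": "chromosome"}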
Subjects
Escherichia coli, High-Throughput Nucleotide Sequencing, Escherichia coli/genetics, Anti-Bacterial Agents/pharmacology, Drug Resistance, Microbial, Plasmids/genetics
ABSTRACT
Diplocarpon coronariae is a fungal pathogen that is prevalent in low-input apple production. Over the past 15 years, it has become increasingly widespread in Europe. However, comprehensive insights into its biology and pathogenicity remain limited. One particular aspect is the rarity of the sexual morph of this pathogen, a phenomenon hitherto unobserved in Europe. Diplocarpon coronariae reproduces through a heterothallic mating system that requires at least two different mating types for sexual reproduction. The genes determining the mating types are located at the mating-type locus. In this study, D. coronariae strain DC1_JKI from Dresden, Germany, was sequenced and used to unravel the structure of the mating-type locus. Using short-read and long-read sequencing methods, the first gapless and near-complete telomere-to-telomere genome assembly of D. coronariae was achieved. The assembled genome spans 51.2 Mbp and comprises 21 chromosome-scale contigs of high completeness. The generated genome sequence was used to elucidate in silico the structure of the mating-type locus, identified as MAT1-2. Furthermore, an examination of MAT1-1 and MAT1-2 frequencies across a diverse set of samples sourced from Europe and Asia revealed the exclusive presence of MAT1-2 in European samples, whereas both MAT idiomorphs were present in Asian counterparts. Our findings suggest an explanation for the absence of the sexual morph, potentially linked to the absence of the second mating-type idiomorph of D. coronariae in European apple orchards.
ABSTRACT
Comprehensive characterization of structural variation in natural populations has only become feasible in the last decade. To investigate the population genomic nature of structural variation, reproducible and high-confidence structural variation callsets are first required. We created a population-scale reference of the genome-wide landscape of structural variation across 33 Nordic house sparrows (Passer domesticus). To produce a consensus callset across all samples using short-read data, we compared heuristic-based quality filtering with visual curation (Samplot/PlotCritic and Samplot-ML) approaches. We demonstrate that curation of structural variants is important for reducing putative false positives, and that the time invested in this step is justified by the potential costs of analyzing short-read-discovered structural variation datasets that include many potential false positives. We find that even a lenient manual curation strategy (e.g. applied by a single curator) can reduce the proportion of putative false positives by up to 80%, thus enriching the proportion of high-confidence variants. Crucially, with a lenient manual curation strategy applied by a single curator, nearly all (>99%) variants rejected as putative false positives were also classified as such by a more stringent curation strategy using three additional curators. Furthermore, variants rejected by manual curation failed to reflect the population structure expected from SNPs, whereas variants passing curation did. Combining heuristic-based quality filtering with rapid manual curation of structural variants in short-read data can therefore be a time- and cost-effective first step for functional and population genomic studies requiring high-confidence structural variation callsets.
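As an illustration of what heuristic-based quality filtering can look like before manual curation, the sketch below screens SV records on a few simple criteria. The field names and thresholds are illustrative assumptions, not the study's actual filters.

    def passes_heuristics(sv, min_support=5, min_len=50, max_len=100_000):
        """sv: dict with 'filter' (caller FILTER field), 'support'
        (supporting reads) and 'svlen' (absolute SV length in bp)."""
        return (sv["filter"] == "PASS"
                and sv["support"] >= min_support
                and min_len <= sv["svlen"] <= max_len)

    # candidates = [sv for sv in sv_calls if passes_heuristics(sv)]
    # Surviving candidates then go to visual curation (e.g. Samplot images)
    # before the consensus callset is built.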