Search | VHL Regional Portal

1.

A comprehensive tandem repeat catalog of the human genome.

Chiu, Readman; Rajan-Babu, Indhu-Shree; Friedman, Jan M; Birol, Inanc.

medRxiv ; 2024 Jun 20.

Article in English | MEDLINE | ID: mdl-38947075

ABSTRACT

With the increasing availability of long-read sequencing data, high-quality human genome assemblies, and software for fully characterizing tandem repeats, genome-wide genotyping of tandem repeat loci on a population scale becomes more feasible. Such efforts not only expand our knowledge of the tandem repeat landscape in the human genome but also enhance our ability to differentiate pathogenic tandem repeat mutations from benign polymorphisms. To this end, we analyzed 272 genomes assembled using datasets from three public initiatives that employed different long-read sequencing technologies. Here, we report a catalog of over 18 million tandem repeat loci, many of which were previously unannotated. Some of these loci are highly polymorphic, and many of them reside within coding sequences.

2.

Systematic assessment of long-read RNA-seq methods for transcript identification and quantification.

Pardo-Palacios, Francisco J; Wang, Dingjie; Reese, Fairlie; Diekhans, Mark; Carbonell-Sala, Sílvia; Williams, Brian; Loveland, Jane E; De María, Maite; Adams, Matthew S; Balderrama-Gutierrez, Gabriela; Behera, Amit K; Gonzalez Martinez, Jose M; Hunt, Toby; Lagarde, Julien; Liang, Cindy E; Li, Haoran; Meade, Marcus Jerryd; Moraga Amador, David A; Prjibelski, Andrey D; Birol, Inanc; Bostan, Hamed; Brooks, Ashley M; Çelik, Muhammed Hasan; Chen, Ying; Du, Mei R M; Felton, Colette; Göke, Jonathan; Hafezqorani, Saber; Herwig, Ralf; Kawaji, Hideya; Lee, Joseph; Li, Jian-Liang; Lienhard, Matthias; Mikheenko, Alla; Mulligan, Dennis; Nip, Ka Ming; Pertea, Mihaela; Ritchie, Matthew E; Sim, Andre D; Tang, Alison D; Wan, Yuk Kei; Wang, Changqing; Wong, Brandon Y; Yang, Chen; Barnes, If; Berry, Andrew E; Capella-Gutierrez, Salvador; Cousineau, Alyssa; Dhillon, Namrita; Fernandez-Gonzalez, Jose M.

Nat Methods ; 2024 Jun 07.

Article in English | MEDLINE | ID: mdl-38849569

ABSTRACT

The Long-read RNA-Seq Genome Annotation Assessment Project Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. Using different protocols and sequencing platforms, the consortium generated over 427 million long-read sequences from complementary DNA and direct RNA datasets, encompassing human, mouse and manatee species. Developers utilized these data to address challenges in transcript isoform detection, quantification and de novo transcript detection. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. Incorporating additional orthogonal data and replicate samples is advised when aiming to detect rare and novel transcripts or using reference-free approaches. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.

3.

Conifers Concentrate Large Numbers of NLR Immune Receptor Genes on One Chromosome.

Woudstra, Yannick; Tumas, Hayley; van Ghelder, Cyril; Hung, Tin Hang; Ilska, Joana J; Girardi, Sebastien; A'Hara, Stuart; McLean, Paul; Cottrell, Joan; Bohlmann, Joerg; Bousquet, Jean; Birol, Inanc; Woolliams, John A; MacKay, John J.

Genome Biol Evol ; 16(6), 2024 06 04.

Article in English | MEDLINE | ID: mdl-38787537

ABSTRACT

Nucleotide-binding domain and leucine-rich repeat (NLR) immune receptor genes form a major line of defense in plants, acting in both pathogen recognition and resistance machinery activation. NLRs are reported to form large gene clusters in limber pine (Pinus flexilis), but it is unknown how widespread this genomic architecture may be among the extant species of conifers (Pinophyta). We used comparative genomic analyses to assess patterns in the abundance, diversity, and genomic distribution of NLR genes. Chromosome-level whole genome assemblies and high-density linkage maps in the Pinaceae, Cupressaceae, Taxaceae, and other gymnosperms were scanned for NLR genes using existing and customized pipelines. The discovered genes were mapped across chromosomes and linkage groups and analyzed phylogenetically for evolutionary history. Conifer genomes are characterized by dense clusters of NLR genes, highly localized on one chromosome. These clusters are rich in TNL-encoding genes, which seem to have formed through multiple tandem duplication events. In contrast to angiosperms and nonconiferous gymnosperms, genomic clustering of NLR genes is ubiquitous in conifers. NLR-dense genomic regions are likely to influence a large part of the plant's resistance, informing our understanding of adaptation to biotic stress and the development of genetic resources through breeding.

Subject(s)

Chromosomes, Plant , NLR Proteins , Tracheophyta , NLR Proteins/genetics , Chromosomes, Plant/genetics , Tracheophyta/genetics , Phylogeny , Genome, Plant , Evolution, Molecular , Plant Proteins/genetics , Multigene Family

4.

ntEmbd: Deep learning embedding for nucleotide sequences.

Hafezqorani, Saber; Nip, Ka Ming; Birol, Inanc.

bioRxiv ; 2024 May 02.

Article in English | MEDLINE | ID: mdl-38746190

ABSTRACT

Enabled by the explosion of data and a substantial increase in computational power, deep learning has transformed fields such as computer vision and natural language processing (NLP), and it has been applied successfully to many transcriptomic analysis tasks. A core advantage of deep learning is its inherent capability to incorporate feature computation within the machine learning models. This results in a comprehensive and machine-readable representation of sequences, facilitating downstream classification and clustering tasks. Compared to machine translation problems in NLP, feature embedding is particularly challenging for transcriptomic studies because the sequences are strings thousands of nucleotides in length, which makes the long-term dependencies between features from different parts of the sequence even more difficult to capture. This highlights the need for nucleotide sequence embedding methods that are capable of learning input sequence features implicitly. Here we introduce ntEmbd, a deep learning embedding tool that captures dependencies between different features of the sequences and learns a latent representation for given nucleotide sequences. We further provide two sample use cases, describing how learned RNA features can be used in downstream analysis. The first use case demonstrates ntEmbd's utility in classifying coding and noncoding RNA, benchmarked against existing tools, and the second explores the utility of learned representations in identifying adapter sequences in nanopore RNA-seq reads. The tool as well as the trained models are freely available on GitHub at https://github.com/bcgsc/ntEmbd.

5.

Transcriptomic profiling of Rana [Lithobates] catesbeiana back skin during natural and thyroid hormone-induced metamorphosis under different temperature regimes with particular emphasis on innate immune system components.

Corrie, Lorissa M; Kuecks-Winger, Haley; Ebrahimikondori, Hossein; Birol, Inanc; Helbing, Caren C.

Comp Biochem Physiol Part D Genomics Proteomics ; 50: 101238, 2024 Jun.

Article in English | MEDLINE | ID: mdl-38714098

ABSTRACT

As amphibians undergo thyroid hormone (TH)-dependent metamorphosis from an aquatic tadpole to the terrestrial frog, their innate immune system must adapt to the new environment. Skin is a primary line of defense, yet this organ undergoes extensive remodelling during metamorphosis and how it responds to TH is poorly understood. Temperature modulation, which regulates metamorphic timing, is a unique way to uncover early TH-induced transcriptomic events. Metamorphosis of premetamorphic tadpoles is induced by exogenous TH administration at 24 °C but is paused at 5 °C. However, at 5 °C a "molecular memory" of TH exposure is retained that results in an accelerated metamorphosis upon shifting to 24 °C. We used RNA-sequencing to identify changes in Rana (Lithobates) catesbeiana back skin gene expression during natural and TH-induced metamorphosis. During natural metamorphosis, significant differential expression (DE) was observed in >6500 transcripts including classic TH-responsive transcripts (thrb and thibz), heat shock proteins, and innate immune system components: keratins, mucins, and antimicrobial peptides (AMPs). Premetamorphic tadpoles maintained at 5 °C showed 83 DE transcripts within 48 h after TH administration, including thibz which has previously been identified as a molecular memory component in other tissues. Over 3600 DE transcripts were detected in TH-treated tadpoles at 24 °C or when tadpoles held at 5 °C were shifted to 24 °C. Gene ontology (GO) terms related to transcription, RNA metabolic processes, and translation were enriched in both datasets and immune related GO terms were observed in the temperature-modulated experiment. Our findings have implications on survival as climate change affects amphibia worldwide.

Subject(s)

Gene Expression Profiling , Immunity, Innate , Metamorphosis, Biological , Skin , Temperature , Thyroid Hormones , Transcriptome , Animals , Metamorphosis, Biological/drug effects , Immunity, Innate/drug effects , Skin/drug effects , Skin/metabolism , Thyroid Hormones/metabolism , Transcriptome/drug effects , Rana catesbeiana/genetics , Rana catesbeiana/growth & development , Larva/growth & development , Larva/genetics , Larva/drug effects , Amphibian Proteins/genetics

6.

Establishing association between HLA-C*04:01 and severe COVID-19.

Warren, René L; Abraham, Rohan; Calingo, Marc; Garant, Jean-Michel; Jones, Steven J M; Birol, Inanc.

HLA ; 103(1): e15355, 2024 Jan.

Article in English | MEDLINE | ID: mdl-38273454

Subject(s)

COVID-19 , Humans , HLA-C Antigens/genetics , Alleles , SARS-CoV-2 , Gene Frequency

7.

Long-insert sequence capture detects high copy numbers in a defence-related beta-glucosidase gene βglu-1 with large variations in white spruce but not Norway spruce.

Hung, Tin Hang; Wu, Ernest T Y; Zeltins, Pauls; Jansons, Aris; Ullah, Aziz; Erbilgin, Nadir; Bohlmann, Joerg; Bousquet, Jean; Birol, Inanc; Clegg, Sonya M; MacKay, John J.

BMC Genomics ; 25(1): 118, 2024 Jan 27.

Article in English | MEDLINE | ID: mdl-38281030

ABSTRACT

Conifers are long-lived and slow-evolving, thus requiring effective defences against their fast-evolving insect natural enemies. The copy number variation (CNV) of two key acetophenone biosynthesis genes Ugt5/Ugt5b and βglu-1 may provide a plausible mechanism underlying the constitutively variable defence in white spruce (Picea glauca) against its primary defoliator, spruce budworm. This study develops a long-insert sequence capture probe set (Picea_hung_p1.0) for quantifying copy number of βglu-1-like, Ugt5-like genes and single-copy genes on 38 Norway spruce (Picea abies) and 40 P. glauca individuals from eight and nine provenances across Europe and North America respectively. We developed local assemblies (Piabi_c1.0 and Pigla_c.1.0), full-length transcriptomes (PIAB_v1 and PIGL_v1), and gene models to characterise the diversity of βglu-1 and Ugt5 genes. We observed very large copy numbers of βglu-1, with up to 381 copies in a single P. glauca individual. We observed among-provenance CNV of βglu-1 in P. glauca but not P. abies. Ugt5b was predominantly single-copy in both species. This study generates critical hypotheses for testing the emergence and mechanism of extreme CNV, the dosage effect on phenotype, and the varying copy number of genes with the same pathway. We demonstrate new approaches to overcome experimental challenges in genomic research in conifer defences.

Subject(s)

Picea , Humans , Picea/genetics , Picea/metabolism , DNA Copy Number Variations , beta-Glucosidase/genetics , Genomics , Transcriptome

8.

Genomic structures and regulation patterns at HPV integration sites in cervical cancer.

Porter, Vanessa L; O'Neill, Kieran; MacLennan, Signe; Corbett, Richard D; Ng, Michelle; Culibrk, Luka; Hamadeh, Zeid; Iden, Marissa; Schmidt, Rachel; Tsaih, Shirng-Wern; Chang, Glenn; Fan, Jeremy; Nip, Ka Ming; Akbari, Vahid; Chan, Simon K; Hopkins, James; Moore, Richard A; Chuah, Eric; Mungall, Karen L; Mungall, Andrew J; Birol, Inanc; Jones, Steven J M; Rader, Janet S; Marra, Marco A.

bioRxiv ; 2023 Nov 05.

Article in English | MEDLINE | ID: mdl-37961641

ABSTRACT

Human papillomavirus (HPV) integration has been implicated in transforming HPV infection into cancer, but its genomic consequences have been difficult to study using short-read technologies. To resolve the dysregulation associated with HPV integration, we performed long-read sequencing on 63 cervical cancer genomes. We identified six categories of integration events based on HPV-human genomic structures. Of all HPV integrants, defined as two HPV-human breakpoints bridged by an HPV sequence, 24% contained variable copies of HPV between the breakpoints, a phenomenon we termed heterologous integration. Analysis of DNA methylation within and in proximity to the HPV genome at individual integration events revealed relationships between methylation status of the integrant and its orientation and structure. Dysregulation of the human epigenome and neighboring gene expression in cis with the HPV-integrated allele was observed over megabase-ranges of the genome. By elucidating the structural, epigenetic, and allele-specific impacts of HPV integration, we provide insight into the role of integrated HPV in cervical cancer.

9.

aaHash: recursive amino acid sequence hashing.

Wong, Johnathan; Kazemi, Parham; Coombe, Lauren; Warren, René L; Birol, Inanç.

Bioinform Adv ; 3(1): vbad162, 2023.

Article in English | MEDLINE | ID: mdl-38023332

ABSTRACT

Motivation: K-mer hashing is a common operation in many foundational bioinformatics problems. However, generic string hashing algorithms are not optimized for this application. Strings in bioinformatics use specific alphabets, a trait leveraged for nucleic acid sequences in earlier work. We note that amino acid sequences, with complexities and context that cannot be captured by generic hashing algorithms, can also benefit from a domain-specific hashing algorithm. Such a hashing algorithm can accelerate and improve the sensitivity of bioinformatics applications developed for protein sequences. Results: Here, we present aaHash, a recursive hashing algorithm tailored for amino acid sequences. This algorithm utilizes multiple hash levels to represent biochemical similarities between amino acids. aaHash performs ~10× faster than generic string hashing algorithms in hashing adjacent k-mers. Availability and implementation: aaHash is available online at https://github.com/bcgsc/btllib and is free for academic use.
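The idea of recursive (rolling) k-mer hashing mentioned in this abstract can be illustrated with a generic cyclic-polynomial hash. This is only a minimal sketch and not the aaHash algorithm itself: the seed table, the 64-bit width, and the function names `hash_kmer` and `rolling_hashes` are assumptions made for illustration, and the multiple hash levels that encode biochemical similarity are not modelled here.

```python
import hashlib

# The 20 standard amino acids; other symbols are not handled in this sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = (1 << 64) - 1


def _seed(residue):
    """Derive a fixed 64-bit seed per residue (hypothetical seed table)."""
    digest = hashlib.sha256(residue.encode()).digest()
    return int.from_bytes(digest[:8], "big")


SEEDS = {aa: _seed(aa) for aa in AMINO_ACIDS}


def rol(x, r):
    """Rotate a 64-bit integer left by r bits."""
    r %= 64
    return ((x << r) | (x >> (64 - r))) & MASK


def hash_kmer(kmer):
    """Hash one k-mer from scratch: O(k) work."""
    h = 0
    for aa in kmer:
        h = rol(h, 1) ^ SEEDS[aa]
    return h


def rolling_hashes(seq, k):
    """Yield the hash of every k-mer in seq, updating in O(1) per step."""
    if k <= 0 or len(seq) < k:
        return
    h = hash_kmer(seq[:k])
    yield h
    for i in range(k, len(seq)):
        outgoing, incoming = seq[i - k], seq[i]
        # Shift the window: rotate, cancel the residue that left,
        # then mix in the residue that entered.
        h = rol(h, 1) ^ rol(SEEDS[outgoing], k) ^ SEEDS[incoming]
        yield h
```

Each update reuses the previous hash value instead of rehashing all k residues, which is why hashing adjacent k-mers this way can be faster than applying a generic string hash to every window.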

10.

Assembly and annotation of the black spruce genome provide insights on spruce phylogeny and evolution of stress response.

Lo, Theodora; Coombe, Lauren; Gagalova, Kristina K; Marr, Alex; Warren, René L; Kirk, Heather; Pandoh, Pawan; Zhao, Yongjun; Moore, Richard A; Mungall, Andrew J; Ritland, Carol; Pavy, Nathalie; Jones, Steven J M; Bohlmann, Joerg; Bousquet, Jean; Birol, Inanç; Thomson, Ashley.

G3 (Bethesda) ; 14(1), 2023 Dec 29.

Article in English | MEDLINE | ID: mdl-37875130

ABSTRACT

Black spruce (Picea mariana [Mill.] B.S.P.) is a dominant conifer species in the North American boreal forest that plays important ecological and economic roles. Here, we present the first genome assembly of P. mariana with a reconstructed genome size of 18.3 Gbp and NG50 scaffold length of 36.0 kbp. A total of 66,332 protein-coding sequences were predicted in silico and annotated based on sequence homology. We analyzed the evolutionary relationships between P. mariana and 5 other spruces for which complete nuclear and organelle genome sequences were available. The phylogenetic tree estimated from mitochondrial genome sequences agrees with biogeography; specifically, P. mariana was strongly supported as a sister lineage to P. glauca and 3 other taxa found in western North America, followed by the European Picea abies. We obtained mixed topologies with weaker statistical support in phylogenetic trees estimated from nuclear and chloroplast genome sequences, indicative of ancient reticulate evolution affecting these 2 genomes. Clustering of protein-coding sequences from the 6 Picea taxa and 2 Pinus species resulted in 34,776 orthogroups, 560 of which appeared to be specific to P. mariana. Analysis of these specific orthogroups and dN/dS analysis of positive selection signatures for 497 single-copy orthogroups identified gene functions mostly related to plant development and stress response. The P. mariana genome assembly and annotation provides a valuable resource for forest genetics research and applications in this broadly distributed species, especially in relation to climate adaptation.
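For readers unfamiliar with the NG50 statistic quoted in this abstract, the following sketch shows how it is conventionally computed and how it differs from N50. The function names and the toy scaffold lengths are illustrative assumptions; they are not taken from the paper or from any assembly-evaluation tool.

```python
def _length_at_half(lengths, target_size):
    """Return the scaffold length at which the cumulative sum of
    scaffolds, taken longest first, reaches half of target_size.
    Returns None if the scaffolds never cover half of the target."""
    half = target_size / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return None


def n50(lengths):
    """N50: the target is the total assembly size."""
    return _length_at_half(lengths, sum(lengths))


def ng50(lengths, genome_size):
    """NG50: the target is an (estimated) genome size instead."""
    return _length_at_half(lengths, genome_size)
```

With toy scaffold lengths of 80, 70, 50, 40, 30 and 20 and an assumed genome size of 400, `n50` returns 70 while `ng50` returns 50, because NG50 is measured against the genome size rather than against the sum of the assembled scaffolds.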

Subject(s)

Picea , Phylogeny , Picea/genetics , North America

11.

Systematic assessment of long-read RNA-seq methods for transcript identification and quantification.

Pardo-Palacios, Francisco J; Wang, Dingjie; Reese, Fairlie; Diekhans, Mark; Carbonell-Sala, Sílvia; Williams, Brian; Loveland, Jane E; De María, Maite; Adams, Matthew S; Balderrama-Gutierrez, Gabriela; Behera, Amit K; Gonzalez, Jose M; Hunt, Toby; Lagarde, Julien; Liang, Cindy E; Li, Haoran; Jerryd Meade, Marcus; Moraga Amador, David A; Prjibelski, Andrey D; Birol, Inanc; Bostan, Hamed; Brooks, Ashley M; Hasan Çelik, Muhammed; Chen, Ying; Du, Mei R M; Felton, Colette; Göke, Jonathan; Hafezqorani, Saber; Herwig, Ralf; Kawaji, Hideya; Lee, Joseph; Liang Li, Jian; Lienhard, Matthias; Mikheenko, Alla; Mulligan, Dennis; Ming Nip, Ka; Pertea, Mihaela; Ritchie, Matthew E; Sim, Andre D; Tang, Alison D; Kei Wan, Yuk; Wang, Changqing; Wong, Brandon Y; Yang, Chen; Barnes, If; Berry, Andrew; Capella, Salvador; Dhillon, Namrita; Fernandez-Gonzalez, Jose M; Ferrández-Peral, Luis.

bioRxiv ; 2023 Jul 27.

Article in English | MEDLINE | ID: mdl-37546854

ABSTRACT

The Long-read RNA-Seq Genome Annotation Assessment Project (LRGASP) Consortium was formed to evaluate the effectiveness of long-read approaches for transcriptome analysis. The consortium generated over 427 million long-read sequences from cDNA and direct RNA datasets, encompassing human, mouse, and manatee species, using different protocols and sequencing platforms. These data were utilized by developers to address challenges in transcript isoform detection and quantification, as well as de novo transcript isoform identification. The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth, whereas greater read depth improved quantification accuracy. In well-annotated genomes, tools based on reference sequences demonstrated the best performance. When aiming to detect rare and novel transcripts or when using reference-free approaches, incorporating additional orthogonal data and replicate samples is advised. This collaborative study offers a benchmark for current practices and provides direction for future method development in transcriptome analysis.

12.

Genomic virulence features of Beauveria bassiana as a biocontrol agent for the mountain pine beetle population.

Li, Janet X; Fernandez, Kleinberg X; Ritland, Carol; Jancsik, Sharon; Engelhardt, Daniel B; Coombe, Lauren; Warren, René L; van Belkum, Marco J; Carroll, Allan L; Vederas, John C; Bohlmann, Joerg; Birol, Inanc.

BMC Genomics ; 24(1): 390, 2023 Jul 10.

Article in English | MEDLINE | ID: mdl-37430186

ABSTRACT

BACKGROUND: The mountain pine beetle, Dendroctonus ponderosae, is an irruptive bark beetle that causes extensive mortality to many pine species within the forests of western North America. Driven by climate change and wildfire suppression, a recent mountain pine beetle (MPB) outbreak has spread across more than 18 million hectares, including areas to the east of the Rocky Mountains that comprise populations and species of pines not previously affected. Despite its impacts, there are few tactics available to control MPB populations. Beauveria bassiana is an entomopathogenic fungus used as a biological agent in agriculture and forestry and has potential as a management tactic for the mountain pine beetle population. This work investigates the phenotypic and genomic variation between B. bassiana strains to identify optimal strains against a specific insect. RESULTS: Using comparative genome and transcriptome analyses of eight B. bassiana isolates, we have identified the genetic basis of virulence, which includes oosporein production. Genes unique to the more virulent strains included functions in biosynthesis of mycotoxins, membrane transporters, and transcription factors. Significant differential expression of genes related to virulence, transmembrane transport, and stress response was identified between the different strains, as well as up to nine-fold upregulation of genes involved in the biosynthesis of oosporein. Differential correlation analysis revealed transcription factors that may be involved in regulating oosporein production. CONCLUSION: This study provides a foundation for the selection and/or engineering of the most effective strain of B. bassiana for the biological control of mountain pine beetle and other insect pest populations.

Subject(s)

Beauveria , Coleoptera , Animals , Beauveria/genetics , Virulence/genetics , Genomics

13.

Linear time complexity de novo long read genome assembly with GoldRush.

Wong, Johnathan; Coombe, Lauren; Nikolic, Vladimir; Zhang, Emily; Nip, Ka Ming; Sidhu, Puneet; Warren, René L; Birol, Inanç.

Nat Commun ; 14(1): 2906, 2023 05 22.

Article in English | MEDLINE | ID: mdl-37217507

ABSTRACT

Current state-of-the-art de novo long read genome assemblers follow the Overlap-Layout-Consensus paradigm. While read-to-read overlap - its most costly step - was improved in modern long read genome assemblers, these tools still often require excessive RAM when assembling a typical human dataset. Our work departs from this paradigm, foregoing all-vs-all sequence alignments in favor of a dynamic data structure implemented in GoldRush, a de novo long read genome assembly algorithm with linear time complexity. We tested GoldRush on Oxford Nanopore Technologies long sequencing read datasets with different base error profiles sourced from three human cell lines, rice, and tomato. Here, we show that GoldRush achieves assembly scaffold NGA50 lengths of 18.3-22.2, 0.3 and 2.6 Mbp, for the genomes of human, rice, and tomato, respectively, and assembles each genome within a day, using at most 54.5 GB of random-access memory, demonstrating the scalability of our genome assembly paradigm and its implementation.

Subject(s)

Algorithms , Genome , Humans , Sequence Analysis, DNA , High-Throughput Nucleotide Sequencing

14.

Reference-free assembly of long-read transcriptome sequencing data with RNA-Bloom2.

Nip, Ka Ming; Hafezqorani, Saber; Gagalova, Kristina K; Chiu, Readman; Yang, Chen; Warren, René L; Birol, Inanc.

Nat Commun ; 14(1): 2940, 2023 05 22.

Article in English | MEDLINE | ID: mdl-37217540

ABSTRACT

Long-read sequencing technologies have improved significantly since their emergence. Their read lengths, potentially spanning entire transcripts, are advantageous for reconstructing transcriptomes. Existing long-read transcriptome assembly methods are primarily reference-based and, to date, there has been little focus on reference-free transcriptome assembly. We introduce RNA-Bloom2 (https://github.com/bcgsc/RNA-Bloom), a reference-free assembly method for long-read transcriptome sequencing data. Using simulated datasets and spike-in control data, we show that the transcriptome assembly quality of RNA-Bloom2 is competitive with that of reference-based methods. Furthermore, we find that RNA-Bloom2 requires 27.0 to 80.6% of the peak memory and 3.6 to 10.8% of the total wall-clock runtime of a competing reference-free method. Finally, we showcase RNA-Bloom2 in assembling a transcriptome sample of Picea sitchensis (Sitka spruce). Since our method does not rely on a reference, it further sets the groundwork for large-scale comparative transcriptomics where high-quality draft genome assemblies are not readily available.

Subject(s)

RNA , Transcriptome , Transcriptome/genetics , High-Throughput Nucleotide Sequencing/methods , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods

15.

aaHash: recursive amino acid sequence hashing.

Wong, Johnathan; Kazemi, Parham; Coombe, Lauren; Warren, René L; Birol, Inanç.

bioRxiv ; 2023 May 10.

Article in English | MEDLINE | ID: mdl-37214907

ABSTRACT

Motivation: K-mer hashing is a common operation in many foundational bioinformatics problems. However, generic string hashing algorithms are not optimized for this application. Strings in bioinformatics use specific alphabets, a trait leveraged for nucleic acid sequences in earlier work. We note that amino acid sequences, with complexities and context that cannot be captured by generic hashing algorithms, can also benefit from a domain-specific hashing algorithm. Such a hashing algorithm can accelerate and improve the sensitivity of bioinformatics applications developed for protein sequences. Results: Here, we present aaHash, a recursive hashing algorithm tailored for amino acid sequences. This algorithm utilizes multiple hash levels to represent biochemical similarities between amino acids. aaHash performs ~10X faster than generic string hashing algorithms in hashing adjacent k-mers. Availability and implementation: aaHash is available online at https://github.com/bcgsc/btllib and is free for academic use.

16.

ntLink: A Toolkit for De Novo Genome Assembly Scaffolding and Mapping Using Long Reads.

Coombe, Lauren; Warren, René L; Wong, Johnathan; Nikolic, Vladimir; Birol, Inanc.

Curr Protoc ; 3(4): e733, 2023 Apr.

Article in English | MEDLINE | ID: mdl-37039735

ABSTRACT

With the increasing affordability and accessibility of genome sequencing data, de novo genome assembly is an important first step to a wide variety of downstream studies and analyses. Therefore, bioinformatics tools that enable the generation of high-quality genome assemblies in a computationally efficient manner are essential. Recent developments in long-read sequencing technologies have greatly benefited genome assembly work, including scaffolding, by providing long-range evidence that can aid in resolving the challenging repetitive regions of complex genomes. ntLink is a flexible and resource-efficient genome scaffolding tool that utilizes long-read sequencing data to improve upon draft genome assemblies built from any sequencing technologies, including the same long reads. Instead of using read alignments to identify candidate joins, ntLink utilizes minimizer-based mappings to infer how input sequences should be ordered and oriented into scaffolds. Recent improvements to ntLink have added important features such as overlap detection, gap-filling, and in-code scaffolding iterations. Here, we present three basic protocols demonstrating how to use each of these new features to yield highly contiguous genome assemblies, while still maintaining ntLink's proven computational efficiency. Further, as we illustrate in the alternate protocols, the lightweight minimizer-based mappings that enable ntLink scaffolding can also be utilized for other downstream applications, such as misassembly detection. With its modularity and multiple modes of execution, ntLink has broad benefit to the genomics community, from genome scaffolding and beyond. ntLink is an open-source project and is freely available from https://github.com/bcgsc/ntLink. © 2023 The Authors. Current Protocols published by Wiley Periodicals LLC. 
Basic Protocol 1: ntLink scaffolding using overlap detection
Basic Protocol 2: ntLink scaffolding with gap-filling
Basic Protocol 3: Running in-code iterations of ntLink scaffolding
Alternate Protocol 1: Generating long-read to contig mappings with ntLink
Alternate Protocol 2: Using ntLink mappings for genome assembly correction with Tigmint-long
Support Protocol: Installing ntLink.
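
The minimizer-based mapping described in this abstract can be pictured with a small window-minimizer sketch. This is an illustrative assumption of how minimizers are commonly computed, not ntLink's implementation: the hash function, the parameters `k` and `w`, and the function names are hypothetical, and strand (reverse complement) handling is ignored.

```python
import hashlib


def kmer_hash(kmer):
    """Map a k-mer to a 64-bit integer (arbitrary choice of hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minimizers(seq, k, w):
    """Return (position, k-mer) pairs: for every window of w consecutive
    k-mers, keep the k-mer with the smallest hash (leftmost on ties)."""
    hashes = [kmer_hash(seq[i:i + k]) for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(hashes) - w + 1):
        window = range(start, start + w)
        chosen.add(min(window, key=lambda i: (hashes[i], i)))
    return [(i, seq[i:i + k]) for i in sorted(chosen)]


def shared_minimizers(read, contig, k, w):
    """List the read's minimizers that also occur among the contig's
    minimizers; shared minimizers suggest the read maps to the contig."""
    contig_kmers = {kmer for _, kmer in minimizers(contig, k, w)}
    return [(pos, kmer) for pos, kmer in minimizers(read, k, w)
            if kmer in contig_kmers]
```

Because only a subset of k-mers is kept and compared, a long read can be associated with contigs without computing a base-level alignment, which is the general reason minimizer sketches are considered lightweight.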

Subject(s)

High-Throughput Nucleotide Sequencing , Software , High-Throughput Nucleotide Sequencing/methods , Genomics/methods , Sequence Analysis, DNA/methods , Genome

17.

Characterization and simulation of metagenomic nanopore sequencing data with Meta-NanoSim.

Yang, Chen; Lo, Theodora; Nip, Ka Ming; Hafezqorani, Saber; Warren, René L; Birol, Inanc.

Gigascience ; 12, 2023 03 20.

Article in English | MEDLINE | ID: mdl-36939007

ABSTRACT

BACKGROUND: Nanopore sequencing is crucial to metagenomic studies as its kilobase-long reads can contribute to resolving genomic structural differences among microbes. However, sequencing platform-specific challenges, including high base-call error rate, nonuniform read lengths, and the presence of chimeric artifacts, necessitate specifically designed analytical algorithms. The use of simulated datasets with characteristics that are true to the sequencing platform under evaluation is a cost-effective way to assess the performance of bioinformatics tools with the ground truth in a controlled environment. RESULTS: Here, we present Meta-NanoSim, a fast and versatile utility that characterizes and simulates the unique properties of nanopore metagenomic reads. It improves upon state-of-the-art methods on microbial abundance estimation through a base-level quantification algorithm. Meta-NanoSim can simulate complex microbial communities composed of both linear and circular genomes and can stream reference genomes from online servers directly. Simulated datasets showed high congruence with experimental data in terms of read length, error profiles, and abundance levels. We demonstrate that Meta-NanoSim simulated data can facilitate the development of metagenomic algorithms and guide experimental design through a metagenome assembly benchmarking task. CONCLUSIONS: The Meta-NanoSim characterization module investigates read features, including chimeric information and abundance levels, while the simulation module simulates large and complex multisample microbial communities with different abundance profiles. All trained models and the software are freely accessible at GitHub: https://github.com/bcgsc/NanoSim.

Subject(s)

Nanopore Sequencing , Nanopores , Metagenome , Nanopore Sequencing/methods , Sequence Analysis, DNA/methods , Computer Simulation , Metagenomics/methods , Software , Algorithms , High-Throughput Nucleotide Sequencing/methods

18.

Models and data of AMPlify: a deep learning tool for antimicrobial peptide prediction.

Li, Chenkai; Warren, René L; Birol, Inanc.

BMC Res Notes ; 16(1): 11, 2023 Feb 02.

Article in English | MEDLINE | ID: mdl-36732807

ABSTRACT

OBJECTIVES: Antibiotic resistance is a rising global threat to human health and is prompting researchers to seek effective alternatives to conventional antibiotics, which include antimicrobial peptides (AMPs). Recently, we have reported AMPlify, an attentive deep learning model for predicting AMPs in databases of peptide sequences. In our tests, AMPlify outperformed the state-of-the-art. We have illustrated its use on data describing the American bullfrog (Rana [Lithobates] catesbeiana) genome. Here we present the model files and training/test data sets we used in that study. The original model (the balanced model) was trained on a balanced set of AMP and non-AMP sequences curated from public databases. In this data note, we additionally provide a model trained on an imbalanced set, in which non-AMP sequences far outnumber AMP sequences. We note that the balanced and imbalanced models would serve different use cases, and both would serve the research community, facilitating the discovery and development of novel AMPs. DATA DESCRIPTION: This data note provides two sets of models, as well as two AMP and four non-AMP sequence sets for training and testing the balanced and imbalanced models. Each model set includes five single sub-models that form an ensemble model. The first model set corresponds to the original model trained on a balanced training set that has been described in the original AMPlify manuscript, while the second model set was trained on an imbalanced training set.

Subject(s)

Antimicrobial Peptides , Deep Learning , Animals , Amino Acid Sequence , Anti-Bacterial Agents , Rana catesbeiana/genetics

19.

Associating Biological Activity and Predicted Structure of Antimicrobial Peptides from Amphibians and Insects.

Richter, Amelia; Sutherland, Darcy; Ebrahimikondori, Hossein; Babcock, Alana; Louie, Nathan; Li, Chenkai; Coombe, Lauren; Lin, Diana; Warren, René L; Yanai, Anat; Kotkoff, Monica; Helbing, Caren C; Hof, Fraser; Hoang, Linda M N; Birol, Inanc.

Antibiotics (Basel) ; 11(12), 2022 Nov 27.

Article in English | MEDLINE | ID: mdl-36551368

ABSTRACT

Antimicrobial peptides (AMPs) are a diverse class of short, often cationic biological molecules that present promising opportunities in the development of new therapeutics to combat antimicrobial resistance. Newly developed in silico methods offer the ability to rapidly discover numerous novel AMPs with a variety of physicochemical properties. Herein, using the rAMPage AMP discovery pipeline, we bioinformatically identified 51 AMP candidates from amphibia and insect RNA-seq data and present their in-depth characterization. The studied AMPs demonstrate activity against a panel of bacterial pathogens and have undetected or low toxicity to red blood cells and human cultured cells. Amino acid sequence analysis revealed that 30 of these bioactive peptides belong to either the Brevinin-1, Brevinin-2, Nigrocin-2, or Apidaecin AMP families. Prediction of three-dimensional structures using ColabFold indicated an association between peptides predicted to adopt a helical structure and broad-spectrum antibacterial activity against the Gram-negative and Gram-positive species tested in our panel. These findings highlight the utility of associating the diverse sequences of novel AMPs with their estimated peptide structures in categorizing AMPs and predicting their antimicrobial activity.

20.

The western redcedar genome reveals low genetic diversity in a self-compatible conifer.

Shalev, Tal J; Gamal El-Dien, Omnia; Yuen, Macaire M S; Shengqiang, Shu; Jackman, Shaun D; Warren, René L; Coombe, Lauren; van der Merwe, Lise; Stewart, Ada; Boston, Lori B; Plott, Christopher; Jenkins, Jerry; He, Guifen; Yan, Juying; Yan, Mi; Guo, Jie; Breinholt, Jesse W; Neves, Leandro G; Grimwood, Jane; Rieseberg, Loren H; Schmutz, Jeremy; Birol, Inanc; Kirst, Matias; Yanchuk, Alvin D; Ritland, Carol; Russell, John H; Bohlmann, Joerg.

Genome Res ; 32(10): 1952-1964, 2022 Oct.

Article in English | MEDLINE | ID: mdl-36109148

ABSTRACT

We assembled the 9.8-Gbp genome of western redcedar (WRC; Thuja plicata), an ecologically and economically important conifer species of the Cupressaceae. The genome assembly, derived from a uniquely inbred tree produced through five generations of self-fertilization (selfing), was determined to be 86% complete by BUSCO analysis, one of the most complete genome assemblies for a conifer. Population genomic analysis revealed WRC to be one of the most genetically depauperate wild plant species, with an effective population size of approximately 300 and no significant genetic differentiation across its geographic range. Nucleotide diversity, π, is low for a continuous tree species, with many loci showing zero diversity, and the ratio of π at zero- to fourfold degenerate sites is relatively high (approximately 0.33), suggestive of weak purifying selection. Using an array of genetic lines derived from up to five generations of selfing, we explored the relationship between genetic diversity and mating system. Although overall heterozygosity was found to decline faster than expected during selfing, heterozygosity persisted at many loci, and nearly 100 loci were found to deviate from expectations of genetic drift, suggestive of associative overdominance. Nonreference alleles at such loci often harbor deleterious mutations and are rare in natural populations, implying that balanced polymorphisms are maintained by linkage to dominant beneficial alleles. This may account for how WRC remains responsive to natural and artificial selection, despite low genetic diversity.

Subject(s)

Tracheophyta , Tracheophyta/genetics , Self-Fertilization/genetics , Alleles , Heterozygote , Polymorphism, Genetic , Genetic Variation , Selection, Genetic
