Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts.

Gafurov, Askar; Vinar, Tomás; Medvedev, Paul; Brejová, Brona.

bioRxiv ; 2023 Nov 22.

Article in English | MEDLINE | ID: mdl-38045397

ABSTRACT

An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes, conserved elements, and epigenetic modifications. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other. We study the problem of assigning statistical significance to such a comparison based on a null model representing two random unrelated annotations. Previous approaches to this problem remain too slow or inaccurate. To incorporate more background information into such analyses and avoid biased results, we propose a new null model based on a Markov chain which differentiates among several genomic contexts. These contexts can capture various confounding factors, such as GC content or sequencing gaps. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistics and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models. We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. The use of genomic contexts to correct for GC-bias also resulted in the reversal of some previously published findings. Availability: The software is freely available at https://github.com/fmfi-compbio/mcdp2 under the MIT licence. All data for reproducibility are available at https://github.com/fmfi-compbio/mcdp2-reproducibility.
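The final step described in the abstract above, turning an expectation and a variance into a p-value through a normal approximation, can be illustrated with a minimal sketch. The function name `normal_approx_p_value` and its parameters are hypothetical and are not taken from the MCDP2 software. The sketch assumes the mean and variance of the test statistic under the null model have already been computed by some other means.

```python
import math


def normal_approx_p_value(observed, mean, variance, tail="upper"):
    """Approximate a one-sided p-value for `observed` using a normal
    distribution with the given mean and variance (hypothetical helper)."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    # Standardize the observed test statistic.
    z = (observed - mean) / math.sqrt(variance)
    if tail == "upper":
        # Enrichment: P(Z >= z) = erfc(z / sqrt(2)) / 2
        return 0.5 * math.erfc(z / math.sqrt(2))
    if tail == "lower":
        # Depletion: P(Z <= z) = erfc(-z / sqrt(2)) / 2
        return 0.5 * math.erfc(-z / math.sqrt(2))
    raise ValueError("tail must be 'upper' or 'lower'")
```

Using `math.erfc` rather than `1 - cdf` keeps small tail probabilities from being rounded to zero, which matters when many annotation pairs are compared and very small p-values are expected.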

VirPool: model-based estimation of SARS-CoV-2 variant proportions in wastewater samples.

Gafurov, Askar; Baláz, Andrej; Amman, Fabian; Borsová, Kristína; Cabanová, Viktória; Klempa, Boris; Bergthaler, Andreas; Vinar, Tomás; Brejová, Brona.

BMC Bioinformatics ; 23(1): 551, 2022 Dec 19.

Article in English | MEDLINE | ID: mdl-36536300

ABSTRACT

BACKGROUND: The genomes of SARS-CoV-2 are classified into variants, some of which are monitored as variants of concern (e.g. the Delta variant B.1.617.2 or Omicron variant B.1.1.529). Proportions of these variants circulating in a human population are typically estimated by large-scale sequencing of individual patient samples. Sequencing a mixture of SARS-CoV-2 RNA molecules from wastewater provides a cost-effective alternative, but requires methods for estimating variant proportions in a mixed sample. RESULTS: We propose a new method based on a probabilistic model of sequencing reads, capturing sequence diversity present within individual variants, as well as sequencing errors. The algorithm is implemented in an open source Python program called VirPool. We evaluate the accuracy of VirPool on several simulated and real sequencing data sets from both Illumina and nanopore sequencing platforms, including wastewater samples from Austria and France monitoring the onset of the Alpha variant. CONCLUSIONS: VirPool is a versatile tool for wastewater and other mixed-sample analysis that can handle both short- and long-read sequencing data. Our approach does not require pre-selection of characteristic mutations for variant profiles, it is able to use the entire length of reads instead of just the most informative positions, and can also capture haplotype dependencies within a single read.
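The general task named in the abstract above, estimating variant proportions in a mixed sample, can be illustrated with a generic expectation-maximization sketch for mixture weights. This is a textbook technique and is not claimed to be VirPool's actual algorithm; the abstract does not describe the optimization procedure. The function `estimate_proportions` and its input format (one row per read, one likelihood per variant) are assumptions made for illustration.

```python
def estimate_proportions(read_likelihoods, n_iter=200):
    """Estimate mixture proportions by EM (illustrative sketch only).

    read_likelihoods[r][k] is assumed to hold P(read r | variant k),
    computed beforehand by some read-level probabilistic model.
    """
    n_variants = len(read_likelihoods[0])
    weights = [1.0 / n_variants] * n_variants  # start from a uniform mixture
    for _ in range(n_iter):
        totals = [0.0] * n_variants
        for row in read_likelihoods:
            # E-step: posterior probability that this read came from each variant.
            joint = [w * p for w, p in zip(weights, row)]
            norm = sum(joint)
            if norm == 0.0:
                continue  # read is impossible under every variant; skip it
            for k, j in enumerate(joint):
                totals[k] += j / norm
        total = sum(totals)
        if total == 0.0:
            break
        # M-step: new proportions are the normalized expected read counts.
        weights = [t / total for t in totals]
    return weights
```

For example, three reads that can only come from the first variant and one read that can only come from the second, `[[1, 0], [1, 0], [1, 0], [0, 1]]`, yield proportions of 0.75 and 0.25.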

Subject(s)

COVID-19 , SARS-CoV-2 , Wastewater , Humans , RNA, Viral , SARS-CoV-2/genetics , SARS-CoV-2/isolation & purification , Wastewater/virology

Markov chains improve the significance computation of overlapping genome annotations.

Gafurov, Askar; Brejová, Brona; Medvedev, Paul.

Bioinformatics ; 38(Suppl 1): i203-i211, 2022 06 24.

Article in English | MEDLINE | ID: mdl-35758770

ABSTRACT

MOTIVATION: Genome annotations are a common way to represent genomic features such as genes, regulatory elements or epigenetic modifications. The amount of overlap between two annotations is often used to ascertain if there is an underlying biological connection between them. In order to distinguish between true biological association and overlap by pure chance, a robust measure of significance is required. One common way to do this is to determine if the number of intervals in the reference annotation that intersect the query annotation is statistically significant. However, currently employed statistical frameworks are often either inefficient or inaccurate when computing P-values on the scale of the whole human genome. RESULTS: We show that finding the P-values under the typically used 'gold' null hypothesis is NP-hard. This motivates us to reformulate the null hypothesis using Markov chains. To be able to measure the fidelity of our Markovian null hypothesis, we develop a fast direct sampling algorithm to estimate the P-value under the gold null hypothesis. We then present an open-source software tool MCDP that computes the P-values under the Markovian null hypothesis in O(m^2 + n) time and O(m) memory, where m and n are the numbers of intervals in the reference and query annotations, respectively. Notably, MCDP runtime and memory usage are independent from the genome length, allowing it to outperform previous approaches in runtime and memory usage by orders of magnitude on human genome annotations, while maintaining the same level of accuracy. AVAILABILITY AND IMPLEMENTATION: The software is available at https://github.com/fmfi-compbio/mc-overlaps. All data for reproducibility are available at https://github.com/fmfi-compbio/mc-overlaps-reproducibility. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
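The test statistic mentioned in the abstract above, the number of reference intervals that intersect the query annotation, can be computed as in the following sketch. It is not taken from the MCDP code base. The function names are hypothetical, and the sketch assumes intervals on a single chromosome given as half-open `(start, end)` pairs, as in BED files.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def count_overlapping(reference, query):
    """Count reference intervals that share at least one base with the query."""
    merged = merge_intervals(query)
    ends = [end for _, end in merged]  # sorted, because merged is disjoint and sorted
    count = 0
    for r_start, r_end in reference:
        # Index of the first merged query interval that ends after r_start.
        i = bisect_right(ends, r_start)
        # That interval overlaps the reference interval only if it starts before r_end.
        if i < len(merged) and merged[i][0] < r_end:
            count += 1
    return count
```

With `reference = [(0, 10), (20, 30), (40, 50)]` and `query = [(5, 8), (30, 35), (45, 60)]`, the count is 2: under half-open coordinates, `(20, 30)` and `(30, 35)` only touch and do not share a base.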

Subject(s)

Genome, Human , Software , Gold , Humans , Markov Chains , Reproducibility of Results

Nanopore sequencing of SARS-CoV-2: Comparison of short and long PCR-tiling amplicon protocols.

Brejová, Brona; Borsová, Kristína; Hodorová, Viktória; Cabanová, Viktória; Gafurov, Askar; Fricová, Dominika; Nebohácová, Martina; Vinar, Tomás; Klempa, Boris; Nosek, Jozef.

PLoS One ; 16(10): e0259277, 2021.

Article in English | MEDLINE | ID: mdl-34714886

ABSTRACT

Surveillance of the SARS-CoV-2 variants including the quickly spreading mutants by rapid and near real-time sequencing of the viral genome provides an important tool for effective health policy decision making in the ongoing COVID-19 pandemic. Here we evaluated PCR-tiling of short (~400-bp) and long (~2 and ~2.5-kb) amplicons combined with nanopore sequencing on a MinION device for analysis of the SARS-CoV-2 genome sequences. Analysis of several sequencing runs demonstrated that using the long amplicon schemes outperforms the original protocol based on the 400-bp amplicons. It also illustrated common artefacts and problems associated with the PCR-tiling approach, such as uneven genome coverage, variable fraction of discarded sequencing reads, including human and bacterial contamination, as well as the presence of reads derived from the viral sub-genomic RNAs.

Subject(s)

COVID-19/diagnosis , Nanopore Sequencing/methods , Pandemics , SARS-CoV-2 , Humans , SARS-CoV-2/genetics , SARS-CoV-2/isolation & purification

Isometric gene tree reconciliation revisited.

Brejová, Brona; Gafurov, Askar; Pardubská, Dana; Sabo, Michal; Vinar, Tomás.

Algorithms Mol Biol ; 12: 17, 2017.

Article in English | MEDLINE | ID: mdl-28630644

ABSTRACT

BACKGROUND: Isometric gene tree reconciliation is a gene tree/species tree reconciliation problem where both the gene tree and the species tree include branch lengths, and these branch lengths must be respected by the reconciliation. The problem was introduced by Ma et al. in 2008 in the context of reconstructing evolutionary histories of genomes in the infinite sites model. RESULTS: In this paper, we show that the original algorithm by Ma et al. is incorrect, and we propose a modified algorithm that addresses the problems that we discovered. We have also improved the running time from [Formula: see text] to [Formula: see text], where N is the total number of nodes in the two input trees. Finally, we examine two new variants of the problem: reconciliation of two unrooted trees and scaling of branch lengths of the gene tree during reconciliation of two rooted trees. CONCLUSIONS: We provide several new algorithms for isometric reconciliation of trees. Some questions in this area remain open; most importantly extensions of the problem allowing for imprecise estimates of branch lengths.
