Search | BVS IEC

Beware the Jaccard: the choice of similarity measure is important and non-trivial in genomic colocalisation analysis.

Salvatore, Stefania; Dagestad Rand, Knut; Grytten, Ivar; Ferkingstad, Egil; Domanska, Diana; Holden, Lars; Gheorghe, Marius; Mathelier, Anthony; Glad, Ingrid; Kjetil Sandve, Geir.

Brief Bioinform ; 21(5): 1523-1530, 2020 09 25.

Article in English | MEDLINE | ID: mdl-31624847

ABSTRACT

The generation and systematic collection of genome-wide data is ever-increasing. This vast amount of data has enabled researchers to study relations between a variety of genomic and epigenomic features, including genetic variation, gene regulation and phenotypic traits. Such relations are typically investigated by comparatively assessing genomic co-occurrence. Technically, this corresponds to assessing the similarity of pairs of genome-wide binary vectors. A variety of similarity measures have been proposed for this problem in other fields like ecology. However, while several of these measures have been employed for assessing genomic co-occurrence, their appropriateness for the genomic setting has never been investigated. We show that the choice of similarity measure may strongly influence results and propose two alternative modelling assumptions that can be used to guide this choice. On both simulated and real genomic data, the Jaccard index is strongly altered by dataset size and should be used with caution. The Forbes coefficient (fold change) and tetrachoric correlation are less influenced by dataset size, but one should be aware of increased variance for small datasets. All results on simulated and real data can be inspected and reproduced at https://hyperbrowser.uio.no/sim-measure.

Subjects

Genomics/methods, Algorithms, Datasets as Topic, Gene Expression Regulation, Genetic Variation, Humans
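To make the contrast in the abstract above concrete, the following is a minimal sketch of two of the named measures applied to binary vectors. It is not taken from the paper or from the HyperBrowser code: the function names `jaccard`, `forbes` and the helper `track` are hypothetical, and the Forbes coefficient is written in its usual form as observed overlap divided by the overlap expected under independence. The toy tracks are constructed so that the observed overlap equals the expected overlap in both cases.

```python
import math


def _check(a, b):
    """Both tracks must cover the same positions."""
    if len(a) != len(b):
        raise ValueError("tracks must have the same length")


def jaccard(a, b):
    """Jaccard index: positions covered by both / positions covered by either."""
    _check(a, b)
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else math.nan


def forbes(a, b):
    """Forbes coefficient (fold change): observed overlap / expected overlap."""
    _check(a, b)
    n = len(a)
    both = sum(1 for x, y in zip(a, b) if x and y)
    n_a = sum(1 for x in a if x)
    n_b = sum(1 for y in b if y)
    if n_a == 0 or n_b == 0:
        return math.nan
    return n * both / (n_a * n_b)


def track(n, start, stop):
    """Hypothetical helper: binary vector of length n covering [start, stop)."""
    return [1 if start <= i < stop else 0 for i in range(n)]


# Two small tracks (10 covered positions each, 1 shared position).
small_a, small_b = track(100, 0, 10), track(100, 9, 19)
# Two large tracks (50 covered positions each, 25 shared positions).
large_a, large_b = track(100, 0, 50), track(100, 25, 75)

for name, a, b in [("small", small_a, small_b), ("large", large_a, large_b)]:
    print(name, "jaccard =", round(jaccard(a, b), 3), "forbes =", forbes(a, b))
```

In this toy setup the Forbes coefficient is the same for the small and the large pair, while the Jaccard index differs between them, which illustrates the kind of size dependence the abstract warns about. Real tracks are not contiguous blocks like these, so this shows the arithmetic only.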

Assessing graph-based read mappers against a baseline approach highlights strengths and weaknesses of current methods.

Grytten, Ivar; Rand, Knut D; Nederbragt, Alexander J; Sandve, Geir K.

BMC Genomics ; 21(1): 282, 2020 Apr 06.

Article in English | MEDLINE | ID: mdl-32252628

ABSTRACT

BACKGROUND: Graph-based reference genomes have become popular as they allow read mapping and follow-up analyses in settings where the exact haplotypes underlying a high-throughput sequencing experiment are not precisely known. Two recent papers show that mapping to graph-based reference genomes can improve accuracy as compared to methods using linear references. Both of these methods index the sequences for most paths up to a certain length in the graph in order to enable direct mapping of reads containing common variants. However, the combinatorial explosion of possible paths through nearby variants also leads to a huge search space and an increased chance of false positive alignments to highly variable regions. RESULTS: We here assess three prominent graph-based read mappers against a hybrid baseline approach that combines an initial path determination with a tuned linear read mapping method. We show, using a previously proposed benchmark, that this simple approach is able to improve overall accuracy of read-mapping to graph-based reference genomes. CONCLUSIONS: Our method is implemented in a tool, Two-step Graph Mapper, which is available at https://github.com/uio-bmi/two_step_graph_mapper along with data and scripts for reproducing the experiments. Our method highlights characteristics of the current generation of graph-based read mappers and shows potential for improvement for future graph-based read mappers.

Subjects

Computational Biology/methods, Human Genome, High-Throughput Nucleotide Sequencing/methods, Humans, Sequence Alignment

Graph Peak Caller: Calling ChIP-seq peaks on graph-based reference genomes.

Grytten, Ivar; Rand, Knut D; Nederbragt, Alexander J; Storvik, Geir O; Glad, Ingrid K; Sandve, Geir K.

PLoS Comput Biol ; 15(2): e1006731, 2019 02.

Article in English | MEDLINE | ID: mdl-30779737

ABSTRACT

Graph-based representations are considered to be the future for reference genomes, as they allow integrated representation of the steadily increasing data on individual variation. Currently available tools allow de novo assembly of graph-based reference genomes, alignment of new read sets to the graph representation as well as certain analyses like variant calling and haplotyping. We here present a first method for calling ChIP-Seq peaks on read data aligned to a graph-based reference genome. The method is a graph generalization of the peak caller MACS2, and is implemented in an open source tool, Graph Peak Caller. By using the existing tool vg to build a pan-genome of Arabidopsis thaliana, we validate our approach by showing that Graph Peak Caller with a pan-genome reference graph can trace variants within peaks that are not part of the linear reference genome, and find peaks that in general are more motif-enriched than those found by MACS2.

Subjects

Chromatin Immunoprecipitation/methods, Genomics/methods, DNA Sequence Analysis/methods, Algorithms, Arabidopsis/genetics, Genome/genetics, Protein Binding, Software, Transcription Factors

Coordinates and intervals in graph-based reference genomes.

Rand, Knut D; Grytten, Ivar; Nederbragt, Alexander J; Storvik, Geir O; Glad, Ingrid K; Sandve, Geir K.

BMC Bioinformatics ; 18(1): 263, 2017 May 18.

Article in English | MEDLINE | ID: mdl-28521770

ABSTRACT

BACKGROUND: It has been proposed that future reference genomes should be graph structures in order to better represent the sequence diversity present in a species. However, there is currently no standard method to represent genomic intervals, such as the positions of genes or transcription factor binding sites, on graph-based reference genomes. RESULTS: We formalize offset-based coordinate systems on graph-based reference genomes and introduce methods for representing intervals on these reference structures. We show the advantage of our methods by representing genes on a graph-based representation of the newest assembly of the human genome (GRCh38) and its alternative loci for regions that are highly variable. CONCLUSION: More complex reference genomes, containing alternative loci, require methods to represent genomic data on these structures. Our proposed notation for genomic intervals makes it possible to fully utilize the alternative loci of the GRCh38 assembly and potential future graph-based reference genomes. We have made a Python package for representing such intervals on offset-based coordinate systems, available at https://github.com/uio-cels/offsetbasedgraph . An interactive web-tool using this Python package to visualize genes on a graph created from GRCh38 is available at https://github.com/uio-cels/genomicgraphcoords .

Subjects

Computer Graphics, Human Genome, Genomics/methods, Algorithms, Genetic Loci, Humans, Internet, Messenger RNA/genetics, Messenger RNA/metabolism, DNA Sequence Analysis, Software
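The idea of offset-based coordinates described in the abstract above can be sketched as a small data structure: a position is a pair of a graph block and an offset within it, and an interval is a start position, an end position, and the ordered blocks it passes through. This is an illustrative assumption about how such a representation might look, not the API of the offsetbasedgraph package; the class names, the `length` method, the block names and the convention that the end offset is exclusive are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Position:
    """A position on the graph: a block identifier plus an offset within it."""
    block: str
    offset: int


@dataclass(frozen=True)
class Interval:
    """An interval: start, end (exclusive offset), and the blocks it traverses."""
    start: Position
    end: Position
    blocks: Tuple[str, ...]

    def __post_init__(self):
        # The path must begin in the start block and finish in the end block.
        if not self.blocks:
            raise ValueError("an interval must traverse at least one block")
        if self.blocks[0] != self.start.block or self.blocks[-1] != self.end.block:
            raise ValueError("start/end blocks must match the traversed path")

    def length(self, block_lengths: Dict[str, int]) -> int:
        """Number of bases covered, given the length of every block."""
        if len(self.blocks) == 1:
            return self.end.offset - self.start.offset
        first = block_lengths[self.blocks[0]] - self.start.offset
        middle = sum(block_lengths[b] for b in self.blocks[1:-1])
        return first + middle + self.end.offset


# Hypothetical graph: a main path interrupted by one alternative locus.
block_lengths = {"main_1": 100, "alt_1": 30, "main_2": 50}

# A gene that starts on the main path, runs through the alternative locus,
# and ends back on the main path.
gene = Interval(
    start=Position("main_1", 90),
    end=Position("main_2", 20),
    blocks=("main_1", "alt_1", "main_2"),
)
print(gene.length(block_lengths))
```

Storing the traversed blocks explicitly is what distinguishes this from a linear interval: the same start and end positions can describe different intervals if more than one path connects them.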

Ten simple rules for quick and dirty scientific programming.

Balaban, Gabriel; Grytten, Ivar; Rand, Knut Dagestad; Scheffer, Lonneke; Sandve, Geir Kjetil.

PLoS Comput Biol ; 17(3): e1008549, 2021 03.

Article in English | MEDLINE | ID: mdl-33705383

Subjects

Software, Biomedical Research, Computational Biology, Humans, Software Design

KAGE: fast alignment-free graph-based genotyping of SNPs and short indels.

Grytten, Ivar; Dagestad Rand, Knut; Sandve, Geir Kjetil.

Genome Biol ; 23(1): 209, 2022 10 04.

Article in English | MEDLINE | ID: mdl-36195962

ABSTRACT

Genotyping is a core application of high-throughput sequencing. We present KAGE, a genotyper for SNPs and short indels that is inspired by recent developments within graph-based genome representations and alignment-free methods. KAGE uses a pan-genome representation of the population to efficiently and accurately predict genotypes. Two novel ideas improve both the speed and accuracy: a Bayesian model incorporates genotypes from thousands of individuals to improve prediction accuracy, and a computationally efficient method leverages correlation between variants. We show that the accuracy of KAGE is at par with the best existing alignment-free genotypers, while being an order of magnitude faster.

Subjects

INDEL Mutation, Single Nucleotide Polymorphism, Algorithms, Bayes Theorem, Human Genome, Genotype, Genotyping Techniques, High-Throughput Nucleotide Sequencing, Humans, DNA Sequence Analysis

simAIRR: simulation of adaptive immune repertoires with realistic receptor sequence sharing for benchmarking of immune state prediction methods.

Kanduri, Chakravarthi; Scheffer, Lonneke; Pavlovic, Milena; Rand, Knut Dagestad; Chernigovskaya, Maria; Pirvandy, Oz; Yaari, Gur; Greiff, Victor; Sandve, Geir K.

Gigascience ; 12, 2022 12 28.

Article in English | MEDLINE | ID: mdl-37848619

ABSTRACT

BACKGROUND: Machine learning (ML) has gained significant attention for classifying immune states in adaptive immune receptor repertoires (AIRRs) to support the advancement of immunodiagnostics and therapeutics. Simulated data are crucial for the rigorous benchmarking of AIRR-ML methods. Existing approaches to generating synthetic benchmarking datasets result in the generation of naive repertoires missing the key feature of many shared receptor sequences (selected for common antigens) found in antigen-experienced repertoires. RESULTS: We demonstrate that a common approach to generating simulated AIRR benchmark datasets can introduce biases, which may be exploited for undesired shortcut learning by certain ML methods. To mitigate undesirable access to true signals in simulated AIRR datasets, we devised a simulation strategy (simAIRR) that constructs antigen-experienced-like repertoires with a realistic overlap of receptor sequences. simAIRR can be used for constructing AIRR-level benchmarks based on a range of assumptions (or experimental data sources) for what constitutes receptor-level immune signals. This includes the possibility of making or not making any prior assumptions regarding the similarity or commonality of immune state-associated sequences that will be used as true signals. We demonstrate the real-world realism of our proposed simulation approach by showing that basic ML strategies perform similarly on simAIRR-generated and real-world experimental AIRR datasets. CONCLUSIONS: This study sheds light on the potential shortcut learning opportunities for ML methods that can arise with the state-of-the-art way of simulating AIRR datasets. simAIRR is available as a Python package: https://github.com/KanduriC/simAIRR.

Subjects

Benchmarking, Computer Simulation

The immuneML ecosystem for machine learning analysis of adaptive immune receptor repertoires.

Pavlovic, Milena; Scheffer, Lonneke; Motwani, Keshav; Kanduri, Chakravarthi; Kompova, Radmila; Vazov, Nikolay; Waagan, Knut; Bernal, Fabian L M; Costa, Alexandre Almeida; Corrie, Brian; Akbar, Rahmad; Al Hajj, Ghadi S; Balaban, Gabriel; Brusko, Todd M; Chernigovskaya, Maria; Christley, Scott; Cowell, Lindsay G; Frank, Robert; Grytten, Ivar; Gundersen, Sveinung; Haff, Ingrid Hobæk; Hovig, Eivind; Hsieh, Ping-Han; Klambauer, Günter; Kuijjer, Marieke L; Lund-Andersen, Christin; Martini, Antonio; Minotto, Thomas; Pensar, Johan; Rand, Knut; Riccardi, Enrico; Robert, Philippe A; Rocha, Artur; Slabodkin, Andrei; Snapkov, Igor; Sollid, Ludvig M; Titov, Dmytro; Weber, Cédric R; Widrich, Michael; Yaari, Gur; Greiff, Victor; Sandve, Geir Kjetil.

Nat Mach Intell ; 3(11): 936-944, 2021 Nov.

Article in English | MEDLINE | ID: mdl-37396030

ABSTRACT

Adaptive immune receptor repertoires (AIRR) are key targets for biomedical research as they record past and ongoing adaptive immune responses. The capacity of machine learning (ML) to identify complex discriminative sequence patterns renders it an ideal approach for AIRR-based diagnostic and therapeutic discovery. To date, widespread adoption of AIRR ML has been inhibited by a lack of reproducibility, transparency, and interoperability. immuneML (immuneml.uio.no) addresses these concerns by implementing each step of the AIRR ML process in an extensible, open-source software ecosystem that is based on fully specified and shareable workflows. To facilitate widespread user adoption, immuneML is available as a command-line tool and through an intuitive Galaxy web interface, and extensive documentation of workflows is provided. We demonstrate the broad applicability of immuneML by (i) reproducing a large-scale study on immune state prediction, (ii) developing, integrating, and applying a novel deep learning method for antigen specificity prediction, and (iii) showcasing streamlined interpretability-focused benchmarking of AIRR ML.

GSuite HyperBrowser: integrative analysis of dataset collections across the genome and epigenome.

Simovski, Boris; Vodák, Daniel; Gundersen, Sveinung; Domanska, Diana; Azab, Abdulrahman; Holden, Lars; Holden, Marit; Grytten, Ivar; Rand, Knut; Drabløs, Finn; Johansen, Morten; Mora, Antonio; Lund-Andersen, Christin; Fromm, Bastian; Eskeland, Ragnhild; Gabrielsen, Odd Stokke; Ferkingstad, Egil; Nakken, Sigve; Bengtsen, Mads; Nederbragt, Alexander Johan; Thorarensen, Hildur Sif; Akse, Johannes Andreas; Glad, Ingrid; Hovig, Eivind; Sandve, Geir Kjetil.

Gigascience ; 6(7): 1-12, 2017 07 01.

Article in English | MEDLINE | ID: mdl-28459977

ABSTRACT

Background: Recent large-scale undertakings such as ENCODE and Roadmap Epigenomics have generated experimental data mapped to the human reference genome (as genomic tracks) representing a variety of functional elements across a large number of cell types. Despite the high potential value of these publicly available data for a broad variety of investigations, little attention has been given to the analytical methodology necessary for their widespread utilisation. Findings: We here present a first principled treatment of the analysis of collections of genomic tracks. We have developed novel computational and statistical methodology to permit comparative and confirmatory analyses across multiple and disparate data sources. We delineate a set of generic questions that are useful across a broad range of investigations and discuss the implications of choosing different statistical measures and null models. Examples include contrasting analyses across different tissues or diseases. The methodology has been implemented in a comprehensive open-source software system, the GSuite HyperBrowser. To make the functionality accessible to biologists, and to facilitate reproducible analysis, we have also developed a web-based interface providing an expertly guided and customizable way of utilizing the methodology. With this system, many novel biological questions can flexibly be posed and rapidly answered. Conclusions: Through a combination of streamlined data acquisition, interoperable representation of dataset collections, and customizable statistical analysis with guided setup and interpretation, the GSuite HyperBrowser represents a first comprehensive solution for integrative analysis of track collections across the genome and epigenome. The software is available at: https://hyperbrowser.uio.no.

Subjects

Datasets as Topic/standards, Genetic Epigenesis, Epigenomics/methods, Human Genome, Software, Whole Genome Sequencing/methods, Epigenomics/standards, Humans, Whole Genome Sequencing/standards
