Pesquisa | Biblioteca Virtual em Saúde Fiocruz

1.

Polishing copy number variant calls on exome sequencing data via deep learning.

Özden, Furkan; Alkan, Can; Çiçek, A Ercüment.

Genome Res ; 32(6): 1170-1182, 2022 06.

Artigo em Inglês | MEDLINE | ID: mdl-35697522

RESUMO

Accurate and efficient detection of copy number variants (CNVs) is of critical importance owing to their significant association with complex genetic diseases. Although algorithms that use whole-genome sequencing (WGS) data provide stable results with mostly valid statistical assumptions, copy number detection on whole-exome sequencing (WES) data shows comparatively lower accuracy. This is unfortunate as WES data are cost-efficient, compact, and relatively ubiquitous. The bottleneck is primarily due to the noncontiguous nature of the targeted capture: biases in targeted genomic hybridization, GC content, targeting probes, and sample batching during sequencing. Here, we present a novel deep learning model, DECoNT, which uses the matched WES and WGS data, and learns to correct the copy number variations reported by any off-the-shelf WES-based germline CNV caller. We train DECoNT on the 1000 Genomes Project data, and we show that we can efficiently triple the duplication call precision and double the deletion call precision of the state-of-the-art algorithms. We also show that our model consistently improves the performance independent of (1) sequencing technology, (2) exome capture kit, and (3) CNV caller. Using DECoNT as a universal exome CNV call polisher has the potential to improve the reliability of germline CNV detection on WES data sets.

Assuntos

Aprendizado Profundo , Exoma , Algoritmos , Variações do Número de Cópias de DNA , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Reprodutibilidade dos Testes , Sequenciamento do Exoma

2.

Identification of protein-protein interaction bridges for multiple sclerosis.

Yazici, Gözde; Kurt Vatandaslar, Burcu; Aydin Canturk, Ilknur; Aydinli, Fatmagul I; Arici Duz, Ozge; Karakoc, Emre; Kerman, Bilal E; Alkan, Can.

Bioinformatics ; 39(4)2023 04 03.

Artigo em Inglês | MEDLINE | ID: mdl-37018152

RESUMO

MOTIVATION: Identifying and prioritizing disease-related proteins is an important scientific problem to develop proper treatments. Network science has become an important discipline to prioritize such proteins. Multiple sclerosis, an autoimmune disease for which there is still no cure, is characterized by a damaging process called demyelination. Demyelination is the destruction of myelin, a structure facilitating fast transmission of neuron impulses, and oligodendrocytes, the cells producing myelin, by immune cells. Identifying the proteins that have special features on the network formed by the proteins of oligodendrocyte and immune cells can reveal useful information about the disease. RESULTS: We investigated the most significant protein pairs that we define as bridges among the proteins providing the interaction between the two cells in demyelination, in the networks formed by the oligodendrocyte and each type of two immune cells (i.e. macrophage and T-cell) using network analysis techniques and integer programming. The reason, we investigated these specialized hubs was that a problem related to these proteins might impose a bigger damage in the system. We showed that 61%-100% of the proteins our model detected, depending on parameterization, have already been associated with multiple sclerosis. We further observed the mRNA expression levels of several proteins we prioritized significantly decreased in human peripheral blood mononuclear cells of multiple sclerosis patients. We therefore present a model, BriFin, which can be used for analyzing processes where interactions of two cell types play an important role. AVAILABILITY AND IMPLEMENTATION: BriFin is available at https://github.com/BilkentCompGen/brifin.

Assuntos

Esclerose Múltipla , Humanos , Leucócitos Mononucleares , Oligodendroglia/fisiologia , Neurônios , Bainha de Mielina

3.

FastRemap: a tool for quickly remapping reads between genome assemblies.

Kim, Jeremie S; Firtina, Can; Cavlak, Meryem Banu; Senol Cali, Damla; Alkan, Can; Mutlu, Onur.

Bioinformatics ; 38(19): 4633-4635, 2022 09 30.

Artigo em Inglês | MEDLINE | ID: mdl-35976109

RESUMO

MOTIVATION: A genome read dataset can be quickly and efficiently remapped from one reference to another similar reference (e.g., between two reference versions or two similar species) using a variety of tools, e.g., the commonly used CrossMap tool. With the explosion of available genomic datasets and references, high-performance remapping tools will be even more important for keeping up with the computational demands of genome assembly and analysis. RESULTS: We provide FastRemap, a fast and efficient tool for remapping reads between genome assemblies. FastRemap provides up to a 7.82× speedup (6.47×, on average) and uses as low as 61.7% (80.7%, on average) of the peak memory consumption compared to the state-of-the-art remapping tool, CrossMap. AVAILABILITY AND IMPLEMENTATION: FastRemap is written in C++. Source code and user manual are freely available at: github.com/CMU-SAFARI/FastRemap. Docker image available at: https://hub.docker.com/r/alkanlab/fastremap. Also available in Bioconda at: https://anaconda.org/bioconda/fastremap-bio.

Assuntos

Sequenciamento de Nucleotídeos em Larga Escala , Software , Análise de Sequência de DNA/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Genômica/métodos , Genoma

4.

CONGA: Copy number variation genotyping in ancient genomes and low-coverage sequencing data.

Söylev, Arda; Çokoglu, Sevim Seda; Koptekin, Dilek; Alkan, Can; Somel, Mehmet.

PLoS Comput Biol ; 18(12): e1010788, 2022 12.

Artigo em Inglês | MEDLINE | ID: mdl-36516232

RESUMO

To date, ancient genome analyses have been largely confined to the study of single nucleotide polymorphisms (SNPs). Copy number variants (CNVs) are a major contributor of disease and of evolutionary adaptation, but identifying CNVs in ancient shotgun-sequenced genomes is hampered by typical low genome coverage (<1×) and short fragments (<80 bps), precluding standard CNV detection software to be effectively applied to ancient genomes. Here we present CONGA, tailored for genotyping CNVs at low coverage. Simulations and down-sampling experiments suggest that CONGA can genotype deletions >1 kbps with F-scores >0.75 at ≥1×, and distinguish between heterozygous and homozygous states. We used CONGA to genotype 10,002 outgroup-ascertained deletions across a heterogenous set of 71 ancient human genomes spanning the last 50,000 years, produced using variable experimental protocols. A fraction of these (21/71) display divergent deletion profiles unrelated to their population origin, but attributable to technical factors such as coverage and read length. The majority of the sample (50/71), despite originating from nine different laboratories and having coverages ranging from 0.44×-26× (median 4×) and average read lengths 52-121 bps (median 69), exhibit coherent deletion frequencies. Across these 50 genomes, inter-individual genetic diversity measured using SNPs and CONGA-genotyped deletions are highly correlated. CONGA-genotyped deletions also display purifying selection signatures, as expected. CONGA thus paves the way for systematic CNV analyses in ancient genomes, despite the technical challenges posed by low and variable genome coverage.

Assuntos

Variações do Número de Cópias de DNA , Genômica , Humanos , Variações do Número de Cópias de DNA/genética , Genótipo , Genômica/métodos , Genoma Humano/genética , Genética Populacional , Polimorfismo de Nucleotídeo Único/genética

5.

Author Correction: Comparative and demographic analysis of orang-utan genomes.

Locke, Devin P; Hillier, LaDeana W; Warren, Wesley C; Worley, Kim C; Nazareth, Lynne V; Muzny, Donna M; Yang, Shiaw-Pyng; Wang, Zhengyuan; Chinwalla, Asif T; Minx, Pat; Mitreva, Makedonka; Cook, Lisa; Delehaunty, Kim D; Fronick, Catrina; Schmidt, Heather; Fulton, Lucinda A; Fulton, Robert S; Nelson, Joanne O; Magrini, Vincent; Pohl, Craig; Graves, Tina A; Markovic, Chris; Cree, Andy; Dinh, Huyen H; Hume, Jennifer; Kovar, Christie L; Fowler, Gerald R; Lunter, Gerton; Meader, Stephen; Heger, Andreas; Ponting, Chris P; Marques-Bonet, Tomas; Alkan, Can; Chen, Lin; Cheng, Ze; Kidd, Jeffrey M; Eichler, Evan E; White, Simon; Searle, Stephen; Vilella, Albert J; Chen, Yuan; Flicek, Paul; Ma, Jian; Raney, Brian; Suh, Bernard; Burhans, Richard; Herrero, Javier; Haussler, David; Faria, Rui; Fernando, Olga.

Nature ; 608(7924): E36, 2022 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-35962045

6.

SneakySnake: a fast and accurate universal genome pre-alignment filter for CPUs, GPUs and FPGAs.

Alser, Mohammed; Shahroodi, Taha; Gómez-Luna, Juan; Alkan, Can; Mutlu, Onur.

Bioinformatics ; 36(22-23): 5282-5290, 2021 Apr 01.

Artigo em Inglês | MEDLINE | ID: mdl-33315064

RESUMO

MOTIVATION: We introduce SneakySnake, a highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for computationally costly sequence alignment. The key idea of SneakySnake is to reduce the approximate string matching (ASM) problem to the single net routing (SNR) problem in VLSI chip layout. In the SNR problem, we are interested in finding the optimal path that connects two terminals with the least routing cost on a special grid layout that contains obstacles. The SneakySnake algorithm quickly solves the SNR problem and uses the found optimal path to decide whether or not performing sequence alignment is necessary. Reducing the ASM problem into SNR also makes SneakySnake efficient to implement on CPUs, GPUs and FPGAs. RESULTS: SneakySnake significantly improves the accuracy of pre-alignment filtering by up to four orders of magnitude compared to the state-of-the-art pre-alignment filters, Shouji, GateKeeper and SHD. For short sequences, SneakySnake accelerates Edlib (state-of-the-art implementation of Myers's bit-vector algorithm) and Parasail (state-of-the-art sequence aligner with a configurable scoring function), by up to 37.7× and 43.9× (>12× on average), respectively, with its CPU implementation, and by up to 413× and 689× (>400× on average), respectively, with FPGA and GPU acceleration. For long sequences, the CPU implementation of SneakySnake accelerates Parasail and KSW2 (sequence aligner of minimap2) by up to 979× (276.9× on average) and 91.7× (31.7× on average), respectively. As SneakySnake does not replace sequence alignment, users can still obtain all capabilities (e.g. configurable scoring functions) of the aligner of their choice, unlike existing acceleration efforts that sacrifice some aligner capabilities. AVAILABILITYAND IMPLEMENTATION: https://github.com/CMU-SAFARI/SneakySnake. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

7.

Realizing the potential of blockchain technologies in genomics.

Ozercan, Halil Ibrahim; Ileri, Atalay Mert; Ayday, Erman; Alkan, Can.

Genome Res ; 28(9): 1255-1263, 2018 09.

Artigo em Inglês | MEDLINE | ID: mdl-30076130

RESUMO

Genomics data introduce a substantial computational burden as well as data privacy and ownership issues. Data sets generated by high-throughput sequencing platforms require immense amounts of computational resources to align to reference genomes and to call and annotate genomic variants. This problem is even more pronounced if reanalysis is needed for new versions of reference genomes, which may impose high loads to existing computational infrastructures. Additionally, after the compute-intensive analyses are completed, the results are either kept in centralized repositories with access control, or distributed among stakeholders using standard file transfer protocols. This imposes two main problems: (1) Centralized servers become gatekeepers of the data, essentially acting as an unnecessary mediator between the actual data owners and data users; and (2) servers may create single points of failure both in terms of service availability and data privacy. Therefore, there is a need for secure and decentralized platforms for data distribution with user-level data governance. A new technology, blockchain, may help ameliorate some of these problems. In broad terms, the blockchain technology enables decentralized, immutable, incorruptible public ledgers. In this Perspective, we aim to introduce current developments toward using blockchain to address several problems in omics, and to provide an outlook of possible future implications of the blockchain technology to life sciences.

Assuntos

Anonimização de Dados , Genômica/métodos , Algoritmos , Big Data , Genoma Humano , Genômica/normas , Genômica/tendências , Humanos

8.

Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions.

Senol Cali, Damla; Kim, Jeremie S; Ghose, Saugata; Alkan, Can; Mutlu, Onur.

Brief Bioinform ; 20(4): 1542-1559, 2019 07 19.

Artigo em Inglês | MEDLINE | ID: mdl-29617724

RESUMO

Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) the choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of bottlenecks we have found, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of the nanopore sequencing technology.

Assuntos

Genômica/métodos , Sequenciamento por Nanoporos/métodos , Animais , Mapeamento Cromossômico , Biologia Computacional , Escherichia coli/genética , Genoma Bacteriano , Genômica/estatística & dados numéricos , Genômica/tendências , Humanos , Sequenciamento por Nanoporos/estatística & dados numéricos , Sequenciamento por Nanoporos/tendências , Análise de Sequência de DNA , Software

9.

Apollo: a sequencing-technology-independent, scalable and accurate assembly polishing algorithm.

Firtina, Can; Kim, Jeremie S; Alser, Mohammed; Senol Cali, Damla; Cicek, A Ercument; Alkan, Can; Mutlu, Onur.

Bioinformatics ; 36(12): 3669-3679, 2020 06 01.

Artigo em Inglês | MEDLINE | ID: mdl-32167530

RESUMO

MOTIVATION: Third-generation sequencing technologies can sequence long reads that contain as many as 2 million base pairs. These long reads are used to construct an assembly (i.e. the subject's genome), which is further used in downstream genome analysis. Unfortunately, third-generation sequencing technologies have high sequencing error rates and a large proportion of base pairs in these long reads is incorrectly identified. These errors propagate to the assembly and affect the accuracy of genome analysis. Assembly polishing algorithms minimize such error propagation by polishing or fixing errors in the assembly by using information from alignments between reads and the assembly (i.e. read-to-assembly alignment information). However, current assembly polishing algorithms can only polish an assembly using reads from either a certain sequencing technology or a small assembly. Such technology-dependency and assembly-size dependency require researchers to (i) run multiple polishing algorithms and (ii) use small chunks of a large genome to use all available readsets and polish large genomes, respectively. RESULTS: We introduce Apollo, a universal assembly polishing algorithm that scales well to polish an assembly of any size (i.e. both large and small genomes) using reads from all sequencing technologies (i.e. second- and third-generation). Our goal is to provide a single algorithm that uses read sets from all available sequencing technologies to improve the accuracy of assembly polishing and that can polish large genomes. Apollo (i) models an assembly as a profile hidden Markov model (pHMM), (ii) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm and (iii) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments with real readsets demonstrate that Apollo is the only algorithm that (i) uses reads from any sequencing technology within a single run and (ii) scales well to polish large assemblies without splitting the assembly into multiple parts. AVAILABILITY AND IMPLEMENTATION: Source code is available at https://github.com/CMU-SAFARI/Apollo. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Algoritmos , Software , Sequenciamento de Nucleotídeos em Larga Escala , Polônia , Análise de Sequência de DNA , Tecnologia

10.

An integrated map of structural variation in 2,504 human genomes.

Sudmant, Peter H; Rausch, Tobias; Gardner, Eugene J; Handsaker, Robert E; Abyzov, Alexej; Huddleston, John; Zhang, Yan; Ye, Kai; Jun, Goo; Fritz, Markus Hsi-Yang; Konkel, Miriam K; Malhotra, Ankit; Stütz, Adrian M; Shi, Xinghua; Casale, Francesco Paolo; Chen, Jieming; Hormozdiari, Fereydoun; Dayama, Gargi; Chen, Ken; Malig, Maika; Chaisson, Mark J P; Walter, Klaudia; Meiers, Sascha; Kashin, Seva; Garrison, Erik; Auton, Adam; Lam, Hugo Y K; Mu, Xinmeng Jasmine; Alkan, Can; Antaki, Danny; Bae, Taejeong; Cerveira, Eliza; Chines, Peter; Chong, Zechen; Clarke, Laura; Dal, Elif; Ding, Li; Emery, Sarah; Fan, Xian; Gujral, Madhusudan; Kahveci, Fatma; Kidd, Jeffrey M; Kong, Yu; Lameijer, Eric-Wubbo; McCarthy, Shane; Flicek, Paul; Gibbs, Richard A; Marth, Gabor; Mason, Christopher E; Menelaou, Androniki.

Nature ; 526(7571): 75-81, 2015 Oct 01.

Artigo em Inglês | MEDLINE | ID: mdl-26432246

RESUMO

Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations. Analysing this set, we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes. We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events. Our catalogue will enhance future studies into structural variant demography, functional impact and disease association.

Assuntos

Variação Genética/genética , Genoma Humano/genética , Mapeamento Físico do Cromossomo , Sequência de Aminoácidos , Predisposição Genética para Doença , Genética Médica , Genética Populacional , Estudo de Associação Genômica Ampla , Genômica , Genótipo , Haplótipos/genética , Homozigoto , Humanos , Dados de Sequência Molecular , Taxa de Mutação , Polimorfismo de Nucleotídeo Único/genética , Locos de Características Quantitativas/genética , Análise de Sequência de DNA , Deleção de Sequência/genética

11.

Shouji: a fast and efficient pre-alignment filter for sequence alignment.

Alser, Mohammed; Hassan, Hasan; Kumar, Akash; Mutlu, Onur; Alkan, Can.

Bioinformatics ; 35(21): 4255-4263, 2019 11 01.

Artigo em Inglês | MEDLINE | ID: mdl-30923804

RESUMO

MOTIVATION: The ability to generate massive amounts of sequencing data continues to overwhelm the processing capability of existing algorithms and compute infrastructures. In this work, we explore the use of hardware/software co-design and hardware acceleration to significantly reduce the execution time of short sequence alignment, a crucial step in analyzing sequenced genomes. We introduce Shouji, a highly parallel and accurate pre-alignment filter that remarkably reduces the need for computationally-costly dynamic programming algorithms. The first key idea of our proposed pre-alignment filter is to provide high filtering accuracy by correctly detecting all common subsequences shared between two given sequences. The second key idea is to design a hardware accelerator that adopts modern field-programmable gate array (FPGA) architectures to further boost the performance of our algorithm. RESULTS: Shouji significantly improves the accuracy of pre-alignment filtering by up to two orders of magnitude compared to the state-of-the-art pre-alignment filters, GateKeeper and SHD. Our FPGA-based accelerator is up to three orders of magnitude faster than the equivalent CPU implementation of Shouji. Using a single FPGA chip, we benchmark the benefits of integrating Shouji with five state-of-the-art sequence aligners, designed for different computing platforms. The addition of Shouji as a pre-alignment step reduces the execution time of the five state-of-the-art sequence aligners by up to 18.8×. Shouji can be adapted for any bioinformatics pipeline that performs sequence alignment for verification. Unlike most existing methods that aim to accelerate sequence alignment, Shouji does not sacrifice any of the aligner capabilities, as it does not modify or replace the alignment step. AVAILABILITY AND IMPLEMENTATION: https://github.com/CMU-SAFARI/Shouji. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Software , Algoritmos , Genoma , Alinhamento de Sequência , Análise de Sequência de DNA , Design de Software

12.

Discovery of tandem and interspersed segmental duplications using high-throughput sequencing.

Soylev, Arda; Le, Thong Minh; Amini, Hajar; Alkan, Can; Hormozdiari, Fereydoun.

Bioinformatics ; 35(20): 3923-3930, 2019 10 15.

Artigo em Inglês | MEDLINE | ID: mdl-30937433

RESUMO

MOTIVATION: Several algorithms have been developed that use high-throughput sequencing technology to characterize structural variations (SVs). Most of the existing approaches focus on detecting relatively simple types of SVs such as insertions, deletions and short inversions. In fact, complex SVs are of crucial importance and several have been associated with genomic disorders. To better understand the contribution of complex SVs to human disease, we need new algorithms to accurately discover and genotype such variants. Additionally, due to similar sequencing signatures, inverted duplications or gene conversion events that include inverted segmental duplications are often characterized as simple inversions, likewise, duplications and gene conversions in direct orientation may be called as simple deletions. Therefore, there is still a need for accurate algorithms to fully characterize complex SVs and thus improve calling accuracy of more simple variants. RESULTS: We developed novel algorithms to accurately characterize tandem, direct and inverted interspersed segmental duplications using short read whole genome sequencing datasets. We integrated these methods to our TARDIS tool, which is now capable of detecting various types of SVs using multiple sequence signatures such as read pair, read depth and split read. We evaluated the prediction performance of our algorithms through several experiments using both simulated and real datasets. In the simulation experiments, using a 30× coverage TARDIS achieved 96% sensitivity with only 4% false discovery rate. For experiments that involve real data, we used two haploid genomes (CHM1 and CHM13) and one human genome (NA12878) from the Illumina Platinum Genomes set. Comparison of our results with orthogonal PacBio call sets from the same genomes revealed higher accuracy for TARDIS than state-of-the-art methods. Furthermore, we showed a surprisingly low false discovery rate of our approach for discovery of tandem, direct and inverted interspersed segmental duplications prediction on CHM1 (<5% for the top 50 predictions). AVAILABILITY AND IMPLEMENTATION: TARDIS source code is available at https://github.com/BilkentCompGen/tardis, and a corresponding Docker image is available at https://hub.docker.com/r/alkanlab/tardis/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Sequenciamento de Nucleotídeos em Larga Escala , Duplicações Segmentares Genômicas , Algoritmos , Genoma Humano , Genômica , Humanos , Software

13.

Hercules: a profile HMM-based hybrid error correction algorithm for long reads.

Firtina, Can; Bar-Joseph, Ziv; Alkan, Can; Cicek, A Ercument.

Nucleic Acids Res ; 46(21): e125, 2018 11 30.

Artigo em Inglês | MEDLINE | ID: mdl-30124947

RESUMO

Choosing whether to use second or third generation sequencing platforms can lead to trade-offs between accuracy and read length. Several types of studies require long and accurate reads. In such cases researchers often combine both technologies and the erroneous long reads are corrected using the short reads. Current approaches rely on various graph or alignment based techniques and do not take the error profile of the underlying technology into account. Efficient machine learning algorithms that address these shortcomings have the potential to achieve more accurate integration of these two technologies. We propose Hercules, the first machine learning-based long read error correction algorithm. Hercules models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. The algorithm learns a posterior transition/emission probability distribution for each long read to correct errors in these reads. We show on two DNA-seq BAC clones (CH17-157L1 and CH17-227A2) that Hercules-corrected reads have the highest mapping rate among all competing algorithms and have the highest accuracy when the breadth of coverage is high. On a large human CHM1 cell line WGS data set, Hercules is one of the few scalable algorithms; and among those, it achieves the highest accuracy.

Assuntos

Algoritmos , Biologia Computacional/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Aprendizado de Máquina , Software , Humanos , Reprodutibilidade dos Testes

14.

Fast characterization of segmental duplications in genome assemblies.

Numanagic, Ibrahim; Gökkaya, Alim S; Zhang, Lillian; Berger, Bonnie; Alkan, Can; Hach, Faraz.

Bioinformatics ; 34(17): i706-i714, 2018 09 01.

Artigo em Inglês | MEDLINE | ID: mdl-30423092

RESUMO

Motivation: Segmental duplications (SDs) or low-copy repeats, are segments of DNA > 1 Kbp with high sequence identity that are copied to other regions of the genome. SDs are among the most important sources of evolution, a common cause of genomic structural variation and several are associated with diseases of genomic origin including schizophrenia and autism. Despite their functional importance, SDs present one of the major hurdles for de novo genome assembly due to the ambiguity they cause in building and traversing both state-of-the-art overlap-layout-consensus and de Bruijn graphs. This causes SD regions to be misassembled, collapsed into a unique representation, or completely missing from assembled reference genomes for various organisms. In turn, this missing or incorrect information limits our ability to fully understand the evolution and the architecture of the genomes. Despite the essential need to accurately characterize SDs in assemblies, there has been only one tool that was developed for this purpose, called Whole-Genome Assembly Comparison (WGAC); its primary goal is SD detection. WGAC is comprised of several steps that employ different tools and custom scripts, which makes this strategy difficult and time consuming to use. Thus there is still a need for algorithms to characterize within-assembly SDs quickly, accurately, and in a user friendly manner. Results: Here we introduce SEgmental Duplication Evaluation Framework (SEDEF) to rapidly detect SDs through sophisticated filtering strategies based on Jaccard similarity and local chaining. We show that SEDEF accurately detects SDs while maintaining substantial speed up over WGAC that translates into practical run times of minutes instead of weeks. Notably, our algorithm captures up to 25% 'pairwise error' between segments, whereas previous studies focused on only 10%, allowing us to more deeply track the evolutionary history of the genome. Availability and implementation: SEDEF is available at https://github.com/vpc-ccg/sedef.

Assuntos

Genoma , Duplicações Segmentares Genômicas , Algoritmos , Genômica , Humanos

15.

Great ape genetic diversity and population history.

Prado-Martinez, Javier; Sudmant, Peter H; Kidd, Jeffrey M; Li, Heng; Kelley, Joanna L; Lorente-Galdos, Belen; Veeramah, Krishna R; Woerner, August E; O'Connor, Timothy D; Santpere, Gabriel; Cagan, Alexander; Theunert, Christoph; Casals, Ferran; Laayouni, Hafid; Munch, Kasper; Hobolth, Asger; Halager, Anders E; Malig, Maika; Hernandez-Rodriguez, Jessica; Hernando-Herraez, Irene; Prüfer, Kay; Pybus, Marc; Johnstone, Laurel; Lachmann, Michael; Alkan, Can; Twigg, Dorina; Petit, Natalia; Baker, Carl; Hormozdiari, Fereydoun; Fernandez-Callejo, Marcos; Dabad, Marc; Wilson, Michael L; Stevison, Laurie; Camprubí, Cristina; Carvalho, Tiago; Ruiz-Herrera, Aurora; Vives, Laura; Mele, Marta; Abello, Teresa; Kondova, Ivanela; Bontrop, Ronald E; Pusey, Anne; Lankester, Felix; Kiyang, John A; Bergl, Richard A; Lonsdorf, Elizabeth; Myers, Simon; Ventura, Mario; Gagneux, Pascal; Comas, David.

Nature ; 499(7459): 471-5, 2013 Jul 25.

Artigo em Inglês | MEDLINE | ID: mdl-23823723

RESUMO

Most great ape genetic variation remains uncharacterized; however, its study is critical for understanding population history, recombination, selection and susceptibility to disease. Here we sequence to high coverage a total of 79 wild- and captive-born individuals representing all six great ape species and seven subspecies and report 88.8 million single nucleotide polymorphisms. Our analysis provides support for genetically distinct populations within each species, signals of gene flow, and the split of common chimpanzees into two distinct groups: Nigeria-Cameroon/western and central/eastern populations. We find extensive inbreeding in almost all wild populations, with eastern gorillas being the most extreme. Inferred effective population sizes have varied radically over time in different lineages and this appears to have a profound effect on the genetic diversity at, or close to, genes in almost all species. We discover and assign 1,982 loss-of-function variants throughout the human and great ape lineages, determining that the rate of gene loss has not been different in the human branch compared to other internal branches in the great ape phylogeny. This comprehensive catalogue of great ape genome diversity provides a framework for understanding evolution and a resource for more effective management of wild and captive great ape populations.

Assuntos

Variação Genética , Hominidae/genética , África , Animais , Animais Selvagens/genética , Animais de Zoológico/genética , Sudeste Asiático , Evolução Molecular , Fluxo Gênico/genética , Genética Populacional , Genoma/genética , Gorilla gorilla/classificação , Gorilla gorilla/genética , Hominidae/classificação , Humanos , Endogamia , Pan paniscus/classificação , Pan paniscus/genética , Pan troglodytes/classificação , Pan troglodytes/genética , Filogenia , Polimorfismo de Nucleotídeo Único/genética , Densidade Demográfica

16.

Demographically-Based Evaluation of Genomic Regions under Selection in Domestic Dogs.

Freedman, Adam H; Schweizer, Rena M; Ortega-Del Vecchyo, Diego; Han, Eunjung; Davis, Brian W; Gronau, Ilan; Silva, Pedro M; Galaverni, Marco; Fan, Zhenxin; Marx, Peter; Lorente-Galdos, Belen; Ramirez, Oscar; Hormozdiari, Farhad; Alkan, Can; Vilà, Carles; Squire, Kevin; Geffen, Eli; Kusak, Josip; Boyko, Adam R; Parker, Heidi G; Lee, Clarence; Tadigotla, Vasisht; Siepel, Adam; Bustamante, Carlos D; Harkins, Timothy T; Nelson, Stanley F; Marques-Bonet, Tomas; Ostrander, Elaine A; Wayne, Robert K; Novembre, John.

PLoS Genet ; 12(3): e1005851, 2016 Mar.

Artigo em Inglês | MEDLINE | ID: mdl-26943675

RESUMO

Controlling for background demographic effects is important for accurately identifying loci that have recently undergone positive selection. To date, the effects of demography have not yet been explicitly considered when identifying loci under selection during dog domestication. To investigate positive selection on the dog lineage early in the domestication, we examined patterns of polymorphism in six canid genomes that were previously used to infer a demographic model of dog domestication. Using an inferred demographic model, we computed false discovery rates (FDR) and identified 349 outlier regions consistent with positive selection at a low FDR. The signals in the top 100 regions were frequently centered on candidate genes related to brain function and behavior, including LHFPL3, CADM2, GRIK3, SH3GL2, MBP, PDE7B, NTAN1, and GLRA1. These regions contained significant enrichments in behavioral ontology categories. The 3rd top hit, CCRN4L, plays a major role in lipid metabolism, that is supported by additional metabolism related candidates revealed in our scan, including SCP2D1 and PDXC1. Comparing our method to an empirical outlier approach that does not directly account for demography, we found only modest overlaps between the two methods, with 60% of empirical outliers having no overlap with our demography-based outlier detection approach. Demography-aware approaches have lower-rates of false discovery. Our top candidates for selection, in addition to expanding the set of neurobehavioral candidate genes, include genes related to lipid metabolism, suggesting a dietary target of selection that was important during the period when proto-dogs hunted and fed alongside hunter-gatherers.

Assuntos

Genética Populacional , Genômica , Metabolismo dos Lipídeos/genética , Seleção Genética , Animais , Demografia , Cães , Genoma , Polimorfismo de Nucleotídeo Único

17.

GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies.

Kim, Jeremie S; Senol Cali, Damla; Xin, Hongyi; Lee, Donghyuk; Ghose, Saugata; Alser, Mohammed; Hassan, Hasan; Ergin, Oguz; Alkan, Can; Mutlu, Onur.

BMC Genomics ; 19(Suppl 2): 89, 2018 May 09.

Artigo em Inglês | MEDLINE | ID: mdl-29764378

RESUMO

BACKGROUND: Seed location filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. State-of-the-art read mappers 1) quickly generate possible mapping locations for seeds (i.e., smaller segments) within each read, 2) extract reference sequences at each of the mapping locations, and 3) check similarity between each read and its associated reference sequences with a computationally-expensive algorithm (i.e., sequence alignment) to determine the origin of the read. A seed location filter comes into play before alignment, discarding seed locations that alignment would deem a poor match. The ideal seed location filter would discard all poor match locations prior to alignment such that there is no wasted computation on unnecessary alignments. RESULTS: We propose a novel seed location filtering algorithm, GRIM-Filter, optimized to exploit 3D-stacked memory systems that integrate computation within a logic layer stacked under memory layers, to perform processing-in-memory (PIM). GRIM-Filter quickly filters seed locations by 1) introducing a new representation of coarse-grained segments of the reference genome, and 2) using massively-parallel in-memory operations to identify read presence within each coarse-grained segment. Our evaluations show that for a sequence alignment error tolerance of 0.05, GRIM-Filter 1) reduces the false negative rate of filtering by 5.59x-6.41x, and 2) provides an end-to-end read mapper speedup of 1.81x-3.65x, compared to a state-of-the-art read mapper employing the best previous seed location filtering algorithm. CONCLUSION: GRIM-Filter exploits 3D-stacked memory, which enables the efficient use of processing-in-memory, to overcome the memory bandwidth bottleneck in seed location filtering. We show that GRIM-Filter significantly improves the performance of a state-of-the-art read mapper. GRIM-Filter is a universal seed location filter that can be applied to any read mapper. We hope that our results provide inspiration for new works to design other bioinformatics algorithms that take advantage of emerging technologies and new processing paradigms, such as processing-in-memory using 3D-stacked memory devices.

Assuntos

Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Algoritmos , Bases de Dados Genéticas , Genoma Humano , Humanos , Software

18.

Discovery and genotyping of novel sequence insertions in many sequenced individuals.

Kavak, Pinar; Lin, Yen-Yi; Numanagic, Ibrahim; Asghari, Hossein; Güngör, Tunga; Alkan, Can; Hach, Faraz.

Bioinformatics ; 33(14): i161-i169, 2017 Jul 15.

Artigo em Inglês | MEDLINE | ID: mdl-28881988

RESUMO

MOTIVATION: Despite recent advances in algorithms design to characterize structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. RESULT: Here, we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous versus homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects. AVAILABILITY AND IMPLEMENTATION: Pamir is available at https://github.com/vpc-ccg/pamir . CONTACT: fhach@{sfu.ca, prostatecentre.com } or calkan@cs.bilkent.edu.tr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Genoma Humano , Variação Estrutural do Genoma , Técnicas de Genotipagem/métodos , Mutação INDEL , Análise de Sequência de DNA/métodos , Software , Algoritmos , Genômica/métodos , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Humanos

19.

GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping.

Alser, Mohammed; Hassan, Hasan; Xin, Hongyi; Ergin, Oguz; Mutlu, Onur; Alkan, Can.

Bioinformatics ; 33(21): 3355-3363, 2017 Nov 01.

Artigo em Inglês | MEDLINE | ID: mdl-28575161

RESUMO

MOTIVATION: High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments -called short reads- that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and 'candidate' locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (i) it is implemented using quadratic-time dynamic programming algorithms and (ii) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper's execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before invoking computationally costly alignment algorithms. RESULTS: We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96%) while providing, on average, 90-fold and 130-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively. The addition of GateKeeper as a pre-alignment step can reduce the verification time of the mrFAST mapper by a factor of 10. AVAILABILITY AND IMPLEMENTATION: https://github.com/BilkentCompGen/GateKeeper. CONTACT: mohammedalser@bilkent.edu.tr or onur.mutlu@inf.ethz.ch or calkan@cs.bilkent.edu.tr. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Assuntos

Sequenciamento de Nucleotídeos em Larga Escala/métodos , Análise de Sequência de DNA/métodos , Software , Algoritmos , Genoma Humano , Humanos , Alinhamento de Sequência/métodos

20.

Toolkit for automated and rapid discovery of structural variants.

Soylev, Arda; Kockan, Can; Hormozdiari, Fereydoun; Alkan, Can.

Methods ; 129: 3-7, 2017 10 01.

Artigo em Inglês | MEDLINE | ID: mdl-28583483

RESUMO

Structural variations (SV) are broadly defined as genomic alterations that affect >50bp of DNA, which are shown to have significant effect on evolution and disease. The advent of high throughput sequencing (HTS) technologies and the ability to perform whole genome sequencing (WGS), makes it feasible to study these variants in depth. However, discovery of all forms of SV using WGS has proven to be challenging as the short reads produced by the predominant HTS platforms (<200bp for current technologies) and the fact that most genomes include large amounts of repeats make it very difficult to unambiguously map and accurately characterize such variants. Furthermore, existing tools for SV discovery are primarily developed for only a few of the SV types, which may have conflicting sequence signatures (i.e. read pairs, read depth, split reads) with other, untargeted SV classes. Here we are introduce a new framework, Tardis, which combines multiple read signatures into a single package to characterize most SV types simultaneously, while preventing such conflicts. Tardis also has a modular structure that makes it easy to extend for the discovery of additional forms of SV.

Assuntos

Variação Estrutural do Genoma/genética , Genômica , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Software , Algoritmos , Genoma Humano , Sequenciamento de Nucleotídeos em Larga Escala/tendências , Humanos , Análise de Sequência de DNA , Sequenciamento Completo do Genoma

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA