Results 1 - 20 of 186
1.
Nat Methods ; 20(9): 1346-1354, 2023 09.
Article in English | MEDLINE | ID: mdl-37580559

ABSTRACT

Even though the recent advances in 'complete genomics' revealed the previously inaccessible genomic regions, analysis of variations in centromeres and other extra-long tandem repeats (ETRs) faces an algorithmic challenge since there are currently no tools for accurate sequence comparison of ETRs. Counterintuitively, the classical alignment approaches, such as the Smith-Waterman algorithm, fail to construct biologically adequate alignments of ETRs. We present UniAligner, the parameter-free sequence alignment algorithm with sequence-dependent alignment scoring that automatically changes for any pair of compared sequences. UniAligner prioritizes matches of rare substrings that are more likely to be relevant to the evolutionary relationship between two sequences. We apply UniAligner to estimate the mutation rates in human centromeres, and quantify the extremely high rate of large duplications and deletions in centromeres. This high rate suggests that centromeres may represent some of the most rapidly evolving regions of the human genome with respect to their structural organization.


Subject(s)
Algorithms; Genomics; Humans; Sequence Alignment; Genomics/methods; Genome, Human
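The idea of prioritizing matches of rare substrings, described in the abstract above, can be illustrated with a toy anchor-chaining scheme. This is only a sketch and is not UniAligner's algorithm or scoring: the function names, the k-mer length, and the weight 1 / (count_a * count_b) are assumptions chosen for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping substrings of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def rare_kmer_anchors(a, b, k=4):
    """Return (i, j, weight) for each pair of positions sharing a k-mer.

    Hypothetical weighting: a k-mer that occurs often in either sequence
    contributes less, so rare shared k-mers dominate the score.
    """
    count_a = Counter(kmers(a, k))
    count_b = Counter(kmers(b, k))
    positions_b = {}
    for j, kmer in enumerate(kmers(b, k)):
        positions_b.setdefault(kmer, []).append(j)
    anchors = []
    for i, kmer in enumerate(kmers(a, k)):
        for j in positions_b.get(kmer, []):
            anchors.append((i, j, 1.0 / (count_a[kmer] * count_b[kmer])))
    return anchors


def best_chain(anchors):
    """Heaviest chain of anchors increasing in both coordinates (O(n^2) DP)."""
    if not anchors:
        return 0.0, []
    anchors = sorted(anchors)
    best = [w for _, _, w in anchors]
    prev = [-1] * len(anchors)
    for x, (i, j, w) in enumerate(anchors):
        for y in range(x):
            pi, pj, _ = anchors[y]
            if pi < i and pj < j and best[y] + w > best[x]:
                best[x] = best[y] + w
                prev[x] = y
    end = max(range(len(anchors)), key=best.__getitem__)
    score = best[end]
    chain = []
    while end != -1:
        chain.append(anchors[end][:2])
        end = prev[end]
    return score, chain[::-1]
```

On two identical sequences without repeated k-mers every anchor has weight 1, whereas in a pure repeat such as "AAAAAA" each anchor is down-weighted, which is the qualitative behavior the abstract describes.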
2.
Nature ; 585(7823): 79-84, 2020 09.
Article in English | MEDLINE | ID: mdl-32663838

ABSTRACT

After two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no single chromosome has been finished end to end, and hundreds of unresolved gaps persist [1,2]. Here we present a human genome assembly that surpasses the continuity of GRCh38 [2], along with a gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome [3], we reconstructed the centromeric satellite DNA array (approximately 3.1 Mb) and closed the 29 remaining gaps in the current reference, including new sequences from the human pseudoautosomal regions and from cancer-testis ampliconic gene families (CT-X and GAGE). These sequences will be integrated into future human reference genome releases. In addition, the complete chromosome X, combined with the ultra-long nanopore data, allowed us to map methylation patterns across complex tandem repeats and satellite arrays. Our results demonstrate that finishing the entire human genome is now within reach, and the data presented here will facilitate ongoing efforts to complete the other human chromosomes.


Subject(s)
Chromosomes, Human, X/genetics; Genome, Human/genetics; Telomere/genetics; Centromere/genetics; CpG Islands/genetics; DNA Methylation; DNA, Satellite/genetics; Female; Humans; Hydatidiform Mole/genetics; Male; Pregnancy; Reproducibility of Results; Testis/metabolism
3.
Genome Res ; 32(6): 1152-1169, 2022 06.
Article in English | MEDLINE | ID: mdl-35545447

ABSTRACT

The V(D)J recombination process rearranges the variable (V), diversity (D), and joining (J) genes in the immunoglobulin (IG) loci to generate antibody repertoires. Annotation of these loci across various species and predicting the V, D, and J genes (IG genes) are critical for studies of the adaptive immune system. However, because the standard gene finding algorithms are not suitable for predicting IG genes, they have been semimanually annotated in very few species. We developed the IGDetective algorithm for predicting IG genes and applied it to species with the assembled IG loci. IGDetective generated the first large collection of IG genes across many species and enabled their evolutionary analysis, including the analysis of the "bat IG diversity" hypothesis. This analysis revealed extremely conserved V genes in evolutionarily distant species, indicating that these genes may be subjected to the same selective pressure, for example, pressure driven by common pathogens. IGDetective also revealed extremely diverged V genes and a new family of evolutionarily conserved V genes in bats with unusual noncanonical cysteines. Moreover, unlike all other previously reported antibodies, these cysteines are located within complementarity-determining regions. Because cysteines form disulfide bonds, we hypothesize that these cysteine-rich V genes might generate antibodies with noncanonical conformations and could potentially form a unique part of the immune repertoire in bats. We also analyzed the diversity landscape of the recombination signal sequences and revealed their features that trigger the high/low usage of the IG genes.


Subject(s)
Antibody Diversity; V(D)J Recombination; Antibodies; Complementarity Determining Regions/genetics; Genes, Immunoglobulin
4.
Genome Res ; 32(11-12): 2119-2133, 2022.
Article in English | MEDLINE | ID: mdl-36418060

ABSTRACT

The advent of long and accurate "HiFi" reads has greatly improved our ability to generate complete metagenome-assembled genomes (MAGs), enabling "complete metagenomics" studies that were nearly impossible to conduct with short reads. In particular, HiFi reads simplify the identification and phasing of mutations in MAGs: It is increasingly feasible to distinguish between positions that are prone to mutations and positions that rarely ever mutate, and to identify co-occurring groups of mutations. However, the problems of identifying rare mutations in MAGs, estimating the false-discovery rate (FDR) of these identifications, and phasing identified mutations remain open in the context of HiFi data. We present strainFlye, a pipeline for the FDR-controlled identification and analysis of rare mutations in MAGs assembled using HiFi reads. We show that deep HiFi sequencing has the potential to reveal and phase tens of thousands of rare mutations in a single MAG, identify hotspots and coldspots of these mutations, and detail MAGs' growth dynamics.


Subject(s)
Bacteria; Metagenome; Bacteria/genetics; Metagenomics; Mutation
5.
Genome Res ; 32(11-12): 2107-2118, 2022.
Article in English | MEDLINE | ID: mdl-36379716

ABSTRACT

Recent advancements in long-read sequencing have enabled the telomere-to-telomere (complete) assembly of a human genome and are now contributing to the haplotype-resolved complete assemblies of multiple human genomes. Because the accuracy of read mapping tools deteriorates in highly repetitive regions, there is a need to develop accurate, error-exposing (detecting potential assembly errors), and diploid-aware (distinguishing different haplotypes) tools for read mapping in complete assemblies. We describe the first accurate, error-exposing, and partially diploid-aware VerityMap tool for long-read mapping to complete assemblies.


Subject(s)
Genome, Human; High-Throughput Nucleotide Sequencing; Humans; Sequence Analysis, DNA; Repetitive Sequences, Nucleic Acid; Diploidy
6.
Genome Res ; 32(6): 1137-1151, 2022 06.
Article in English | MEDLINE | ID: mdl-35545449

ABSTRACT

Recent advances in long-read sequencing opened a possibility to address the long-standing questions about the architecture and evolution of human centromeres. They also emphasized the need for centromere annotation (partitioning human centromeres into monomers and higher-order repeats [HORs]). Although there was a half-century-long series of semi-manual studies of centromere architecture, a rigorous centromere annotation algorithm is still lacking. Moreover, an automated centromere annotation is a prerequisite for studies of genetic diseases associated with centromeres and evolutionary studies of centromeres across multiple species. Although the monomer decomposition (transforming a centromere into a monocentromere written in the monomer alphabet) and the HOR decomposition (representing a monocentromere in the alphabet of HORs) are currently viewed as two separate problems, we show that they should be integrated into a single framework in such a way that HOR (monomer) inference affects monomer (HOR) inference. We thus developed the HORmon algorithm that integrates the monomer/HOR inference and automatically generates the human monomers/HORs that are largely consistent with the previous semi-manual inference.


Subject(s)
Algorithms; Centromere; Centromere/genetics; Humans
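The HOR decomposition step defined in the abstract above (rewriting a monocentromere in the alphabet of HORs) can be illustrated with a toy example. This is a sketch under strong assumptions and not the HORmon algorithm: the HOR is given in advance rather than inferred, matching is exact and greedy, and the function name and the label "H" are hypothetical.

```python
def hor_decompose(monocentromere, hor, hor_name="H"):
    """Rewrite a monomer string, replacing exact copies of `hor` by `hor_name`.

    Monomers that are not part of a complete HOR copy are kept as they are.
    """
    out = []
    i = 0
    while i < len(monocentromere):
        if monocentromere[i:i + len(hor)] == hor:
            out.append(hor_name)
            i += len(hor)
        else:
            out.append(monocentromere[i])
            i += 1
    return out


# Monomers are single letters here; "ABC" plays the role of the HOR.
print(hor_decompose("ABCABCABDABC", "ABC"))
# ['H', 'H', 'A', 'B', 'D', 'H']
```

The partial copy "ABD" stays in the monomer alphabet, which hints at why the abstract argues that monomer and HOR inference should inform each other.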
7.
Genome Res ; 32(4): 791-804, 2022 04.
Article in English | MEDLINE | ID: mdl-35361626

ABSTRACT

An important challenge in vaccine development is to figure out why a vaccine succeeds in some individuals and fails in others. Although antibody repertoires hold the key to answering this question, there have been very few personalized immunogenomics studies so far aimed at revealing how variations in immunoglobulin genes affect a vaccine response. We conducted an immunosequencing study of 204 calves vaccinated against bovine respiratory disease (BRD) with the goal to reveal variations in immunoglobulin genes and somatic hypermutations that impact the efficacy of vaccine response. Our study represents the largest longitudinal personalized immunogenomics study reported to date across all species, including humans. To analyze the generated data set, we developed an algorithm for identifying variations of the immunoglobulin genes (as well as frequent somatic hypermutations) that affect various features of the antibody repertoire and titers of neutralizing antibodies. In contrast to relatively short human antibodies, cattle have a large fraction of ultralong antibodies that have opened new therapeutic opportunities. Our study reveals that ultralong antibodies are a key component of the immune response against the costliest disease of beef cattle in North America. The detected variants of the cattle immunoglobulin genes, which are implicated in the success/failure of the BRD vaccine, have the potential to direct the selection of individual cattle for ongoing breeding programs.


Subject(s)
Cattle Diseases; Vaccines; Animals; Antibodies; Cattle; Cattle Diseases/prevention & control; North America; Vaccines/genetics
8.
Mol Cell Proteomics ; 21(7): 100254, 2022 07.
Article in English | MEDLINE | ID: mdl-35654359

ABSTRACT

All human diseases involve proteins, yet our current tools to characterize and quantify them are limited. To better elucidate proteins across space, time, and molecular composition, we provide a projection of more than 10 years for technologies to meet the challenges that protein biology presents. With a broad perspective, we discuss grand opportunities to transition the science of proteomics into a more propulsive enterprise. Extrapolating recent trends, we describe a next generation of approaches to define, quantify, and visualize the multiple dimensions of the proteome, thereby transforming our understanding and interactions with human disease in the coming decade.


Subject(s)
Proteome; Proteomics; Humans; Proteome/metabolism; Proteomics/methods
9.
Genome Res ; 30(11): 1547-1558, 2020 11.
Article in English | MEDLINE | ID: mdl-32948615

ABSTRACT

The V(DD)J recombination is currently viewed as an aberrant and inconsequential variant of the canonical V(D)J recombination. Moreover, since the classical 12/23 rule for the V(D)J recombination fails to explain the V(DD)J recombination, the molecular mechanism of tandem D-D fusions has remained unknown since they were discovered three decades ago. Revealing this mechanism is a biomedically important goal since tandem fusions contribute to broadly neutralizing antibodies with ultralong CDR3s. We reveal previously overlooked cryptic nonamers in the recombination signal sequences of human IGHD genes and demonstrate that these nonamers explain the vast majority of tandem fusions in human repertoires. We further reveal large clonal lineages formed by tandem fusions in antigen-stimulated immunosequencing data sets, suggesting that such data sets contain many more tandem fusions than previously thought and that about a quarter of large clonal lineages with unusually long CDR3s are generated through tandem fusions. Finally, we developed the SEARCH-D algorithm for identifying D genes in mammalian genomes and applied it to the recently completed Vertebrate Genomes Project assemblies, nearly doubling the number of mammalian species with known D genes. Our analysis revealed cryptic nonamers in RSSs of many mammalian genomes, thus demonstrating that the V(DD)J recombination is not a "bug" but an important feature preserved throughout mammalian evolution.


Subject(s)
Complementarity Determining Regions/genetics; V(D)J Recombination; Algorithms; Animals; Antigens; Genes, Immunoglobulin Heavy Chain; Humans; Mammals/genetics; Tandem Repeat Sequences
10.
Genome Res ; 30(6): 898-909, 2020 06.
Article in English | MEDLINE | ID: mdl-32540955

ABSTRACT

Long-range sequencing information is required for haplotype phasing, de novo assembly, and structural variation detection. Current long-read sequencing technologies can provide valuable long-range information but at a high cost with low accuracy and high DNA input requirements. We have developed a single-tube Transposase Enzyme Linked Long-read Sequencing (TELL-seq) technology, which enables a low-cost, high-accuracy, and high-throughput short-read second-generation sequencer to generate over 100 kb of long-range sequencing information with as little as 0.1 ng input material. In a PCR tube, millions of clonally barcoded beads are used to uniquely barcode long DNA molecules in an open bulk reaction without dilution and compartmentation. The barcoded linked-reads are used to successfully assemble genomes ranging from microbes to human. These linked-reads also generate megabase-long phased blocks and provide a cost-effective tool for detecting structural variants in a genome, which are important to identify compound heterozygosity in recessive Mendelian diseases and discover genetic drivers and diagnostic biomarkers in cancers.


Subject(s)
Gene Library; High-Throughput Nucleotide Sequencing; Sequence Analysis, DNA; Computational Biology/methods; DNA Barcoding, Taxonomic/methods; Genetic Variation; Genome, Human; Genomics/methods; HLA Antigens/genetics; Haplotypes; High-Throughput Nucleotide Sequencing/methods; High-Throughput Nucleotide Sequencing/standards; Humans; Sequence Analysis, DNA/methods; Sequence Analysis, DNA/standards; Workflow
11.
Nat Methods ; 17(11): 1103-1110, 2020 11.
Article in English | MEDLINE | ID: mdl-33020656

ABSTRACT

Long-read sequencing technologies have substantially improved the assemblies of many isolate bacterial genomes as compared to fragmented short-read assemblies. However, assembling complex metagenomic datasets remains difficult even for state-of-the-art long-read assemblers. Here we present metaFlye, which addresses important long-read metagenomic assembly challenges, such as uneven bacterial composition and intra-species heterogeneity. First, we benchmarked metaFlye using simulated and mock bacterial communities and show that it consistently produces assemblies with better completeness and contiguity than state-of-the-art long-read assemblers. Second, we performed long-read sequencing of the sheep microbiome and applied metaFlye to reconstruct 63 complete or nearly complete bacterial genomes within single contigs. Finally, we show that long-read assembly of human microbiomes enables the discovery of full-length biosynthetic gene clusters that encode biomedically important natural products.


Subject(s)
Genome, Bacterial/genetics; Genome, Human/genetics; Metagenome/genetics; Metagenomics/methods; Microbiota/genetics; Algorithms; Animals; Benchmarking; Gastrointestinal Microbiome/genetics; Humans; Sequence Analysis, DNA/methods; Sheep; Software; Species Specificity
12.
Genome Res ; 29(6): 961-968, 2019 06.
Article in English | MEDLINE | ID: mdl-31048319

ABSTRACT

Although plasmids are important for bacterial survival and adaptation, plasmid detection and assembly from genomic, let alone metagenomic, samples remain challenging. The recently developed plasmidSPAdes assembler addressed some of these challenges in the case of isolate genomes but stopped short of detecting plasmids in metagenomic assemblies, an untapped source of yet to be discovered plasmids. We present the metaplasmidSPAdes tool for plasmid assembly in metagenomic data sets that reduced the false positive rate of plasmid detection compared with the state-of-the-art approaches. We assembled plasmids in diverse data sets and have shown that thousands of plasmids remained below the radar in already completed genomic and metagenomic studies. Our analysis revealed the extreme variability of plasmids and has led to the discovery of many novel plasmids (including many plasmids carrying antibiotic-resistance genes) without significant similarities to currently known ones.


Subject(s)
Computational Biology; Genomics; Metagenome; Metagenomics; Plasmids/genetics; Computational Biology/methods; Databases, Genetic; Datasets as Topic; Genomics/methods; Humans; Metagenomics/methods; Molecular Sequence Annotation
13.
Genome Res ; 29(8): 1352-1362, 2019 08.
Article in English | MEDLINE | ID: mdl-31160374

ABSTRACT

Predicting biosynthetic gene clusters (BGCs) is critically important for discovery of antibiotics and other natural products. While BGC prediction from complete genomes is a well-studied problem, predicting BGCs in fragmented genomic assemblies remains challenging. The existing BGC prediction tools often assume that each BGC is encoded within a single contig in the genome assembly, a condition that is violated for most sequenced microbial genomes where BGCs are often scattered through several contigs, making it difficult to reconstruct them. The situation is even more severe in shotgun metagenomics, where the contigs are often short, and the existing tools fail to predict a large fraction of long BGCs. While it is difficult to assemble BGCs in a single contig, the structure of the genome assembly graph often provides clues on how to combine multiple contigs into segments encoding long BGCs. We describe biosyntheticSPAdes, a tool for predicting BGCs in assembly graphs and demonstrate that it greatly improves the reconstruction of BGCs from genomic and metagenomics data sets.


Subject(s)
Genes, Bacterial; Metagenome; Metagenomics/methods; Multigene Family; Software; Contig Mapping; Datasets as Topic; Dental Plaque/microbiology; Gingiva/microbiology; Humans; Internet; Mouth Mucosa/microbiology; Pharynx/microbiology; Protein Biosynthesis; Tongue/microbiology
14.
Bioinformatics ; 37(Suppl_1): i196-i204, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252949

ABSTRACT

MOTIVATION: Recent advances in long-read sequencing technologies led to rapid progress in centromere assembly in the last year and, for the first time, opened a possibility to address the long-standing questions about the architecture and evolution of human centromeres. However, since these advances have not yet been accompanied by the development of the centromere-specific bioinformatics algorithms, even the fundamental questions (e.g. centromere annotation by deriving the complete set of human monomers and high-order repeats), let alone more complex questions (e.g. explaining how monomers and high-order repeats evolved) about human centromeres remain open. Moreover, even though there was a four-decade-long series of studies aimed at cataloging all human monomers and high-order repeats, the rigorous algorithmic definitions of these concepts are still lacking. Thus, the development of a centromere annotation tool is a prerequisite for follow-up personalized biomedical studies of centromeres across the human population and evolutionary studies of centromeres across various species. RESULTS: We describe CentromereArchitect, the first tool for the centromere annotation in a newly sequenced genome, apply it to the recently generated complete assembly of a human genome by the Telomere-to-Telomere consortium, generate the complete set of human monomers and high-order repeats for 'live' centromeres, and reveal a vast set of hybrid monomers that may represent the focal points of centromere evolution. AVAILABILITY AND IMPLEMENTATION: CentromereArchitect is publicly available on https://github.com/ablab/stringdecomposer/tree/ismb2021. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Centromere; Genome; Algorithms; Base Sequence; Centromere/genetics; Humans; Telomere
15.
Genome Res ; 28(6): 901-909, 2018 06.
Article in English | MEDLINE | ID: mdl-29735604

ABSTRACT

Although segmental duplications (SDs) represent hotbeds for genomic rearrangements and emergence of new genes, there are still no easy-to-use tools for identifying SDs. Moreover, while most previous studies focused on recently emerged SDs, detection of ancient SDs remains an open problem. We developed an SDquest algorithm for SD finding and applied it to analyzing SDs in human, gorilla, and mouse genomes. Our results demonstrate that previous studies missed many SDs in these genomes and show that SDs account for at least 6.05% of the human genome (version hg19), a 17% increase as compared to the previous estimate. Moreover, SDquest classified 6.42% of the latest GRCh38 version of the human genome as SDs, a large increase as compared to previous studies. We thus propose to re-evaluate evolution of SDs based on their accurate representation across multiple genomes. Toward this goal, we analyzed the complex mosaic structure of SDs and decomposed mosaic SDs into elementary SDs, a prerequisite for follow-up evolutionary analysis. We also introduced the concept of the breakpoint graph of mosaic SDs that revealed SD hotspots and suggested that some SDs may have originated from circular extrachromosomal DNA (ecDNA), not unlike ecDNA that contributes to accelerated evolution in cancer.


Subject(s)
Evolution, Molecular; Gorilla gorilla/genetics; Mammals/genetics; Segmental Duplications, Genomic/genetics; Animals; Genome, Human/genetics; Humans; Mice; Species Specificity
16.
Bioinformatics ; 36(Suppl_1): i93-i101, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657390

ABSTRACT

MOTIVATION: Recent attempts to assemble extra-long tandem repeats (such as centromeres) faced the challenge of translating long error-prone reads from the nucleotide alphabet into the alphabet of repeat units. Human centromeres represent a particularly complex type of high-order repeats (HORs) formed by chromosome-specific monomers. Given a set of all human monomers, translating a read from a centromere into the monomer alphabet is modeled as the String Decomposition Problem. The accurate translation of reads into the monomer alphabet turns the notoriously difficult problem of assembling centromeres from reads (in the nucleotide alphabet) into a more tractable problem of assembling centromeres from translated reads. RESULTS: We describe a StringDecomposer (SD) algorithm for solving this problem, benchmark it on the set of long error-prone Oxford Nanopore reads generated by the Telomere-to-Telomere consortium and identify a novel (rare) monomer that extends the set of known X-chromosome specific monomers. Our identification of a novel monomer emphasizes the importance of identification of all (even rare) monomers for future centromere assembly efforts and evolutionary studies. To further analyze novel monomers, we applied SD to the set of recently generated long accurate Pacific Biosciences HiFi reads. This analysis revealed that the set of known human monomers and HORs remains incomplete. SD opens a possibility to generate a complete set of human monomers and HORs for use in the ongoing efforts to generate the complete assembly of the human genome. AVAILABILITY AND IMPLEMENTATION: StringDecomposer is publicly available on https://github.com/ablab/stringdecomposer. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Centromere; Nanopores; Algorithms; Centromere/genetics; High-Throughput Nucleotide Sequencing; Humans; Sequence Analysis, DNA; Tandem Repeat Sequences
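The String Decomposition Problem named in the abstract above (translating an error-prone read into the monomer alphabet) can be illustrated with a small dynamic program. This is a minimal sketch, not the StringDecomposer implementation: it assumes unit-cost edit distance, tries every block boundary by brute force, and uses hypothetical names (`decompose`, `max_block`) and toy monomers.

```python
def edit_distance(s, t):
    """Unit-cost Levenshtein distance between two strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete from s
                           cur[j - 1] + 1,             # insert into s
                           prev[j - 1] + (cs != ct)))  # match or substitute
        prev = cur
    return prev[-1]


def decompose(read, monomers):
    """Split `read` into blocks and label each block with its closest monomer.

    `monomers` maps a label to its sequence. Returns (total_cost, labels),
    where total_cost is the summed edit distance over all blocks.
    """
    # Assumption: no block is longer than twice the longest monomer.
    max_block = 2 * max(len(seq) for seq in monomers.values())
    inf = float("inf")
    best = [inf] * (len(read) + 1)
    back = [None] * (len(read) + 1)
    best[0] = 0
    for end in range(1, len(read) + 1):
        for start in range(max(0, end - max_block), end):
            block = read[start:end]
            for name, seq in monomers.items():
                cost = best[start] + edit_distance(block, seq)
                if cost < best[end]:
                    best[end] = cost
                    back[end] = (start, name)
    labels = []
    pos = len(read)
    while pos > 0:
        pos, name = back[pos]
        labels.append(name)
    return best[len(read)], labels[::-1]


monomers = {"A": "ACGTAC", "B": "TTGGCC"}
# Third block carries one substitution relative to monomer A.
print(decompose("ACGTAC" + "TTGGCC" + "ACGAAC", monomers))
# (1, ['A', 'B', 'A'])
```

The brute-force search over block boundaries is far too slow for real reads; it is used here only to keep the recurrence easy to read.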
17.
Bioinformatics ; 36(14): 4126-4129, 2020 08 15.
Article in English | MEDLINE | ID: mdl-32413137

ABSTRACT

MOTIVATION: Although the set of currently known viruses has been steadily expanding, only a tiny fraction of the Earth's virome has been sequenced so far. Shotgun metagenomic sequencing provides an opportunity to reveal novel viruses but faces the computational challenge of identifying viral genomes that are often difficult to detect in metagenomic assemblies. RESULTS: We describe a MetaviralSPAdes tool for identifying viral genomes in metagenomic assembly graphs that is based on analyzing variations in the coverage depth between viruses and bacterial chromosomes. We benchmarked MetaviralSPAdes on diverse metagenomic datasets, verified our predictions using a set of virus-specific Hidden Markov Models and demonstrated that it improves on the state-of-the-art viral identification pipelines. AVAILABILITY AND IMPLEMENTATION: MetaviralSPAdes includes ViralAssembly, ViralVerify and ViralComplete modules that are available as standalone packages: https://github.com/ablab/spades/tree/metaviral_publication, https://github.com/ablab/viralVerify/ and https://github.com/ablab/viralComplete/. CONTACT: d.antipov@spbu.ru. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Software; Viruses; Algorithms; Metagenome; Metagenomics; Sequence Analysis, DNA; Viruses/genetics
18.
Bioinformatics ; 36(Suppl_1): i75-i83, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657355

ABSTRACT

MOTIVATION: Extra-long tandem repeats (ETRs) are widespread in eukaryotic genomes and play an important role in fundamental cellular processes, such as chromosome segregation. Although emerging long-read technologies have enabled ETR assemblies, the accuracy of such assemblies is difficult to evaluate since there are no tools for their quality assessment. Moreover, since the mapping of error-prone reads to ETRs remains an open problem, it is not clear how to polish draft ETR assemblies. RESULTS: To address these problems, we developed the TandemTools software that includes the TandemMapper tool for mapping reads to ETRs and the TandemQUAST tool for polishing ETR assemblies and their quality assessment. We demonstrate that TandemTools not only reveals errors in ETR assemblies but also improves the recently generated assemblies of human centromeres. AVAILABILITY AND IMPLEMENTATION: https://github.com/ablab/TandemTools. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
High-Throughput Nucleotide Sequencing; Software; Eukaryota; Humans; Sequence Analysis, DNA; Tandem Repeat Sequences
19.
PLoS Comput Biol ; 16(4): e1007837, 2020 04.
Article in English | MEDLINE | ID: mdl-32339161

ABSTRACT

Immunoglobulin genes are formed through V(D)J recombination, which joins the variable (V), diversity (D), and joining (J) germline genes. Since variations in germline genes have been linked to various diseases, personalized immunogenomics focuses on finding alleles of germline genes across various patients. Although reconstruction of V and J genes is a well-studied problem, the more challenging task of reconstructing D genes remained open until the IgScout algorithm was developed in 2019. In this work, we address limitations of IgScout by developing a probabilistic MINING-D algorithm for D gene reconstruction, apply it to hundreds of immunosequencing datasets from multiple species, and validate the newly inferred D genes by analyzing diverse whole genome sequencing datasets and haplotyping heterozygous V genes.


Subject(s)
Computational Biology/methods; Genes, Immunoglobulin/genetics; Immunoglobulin D/genetics; Algorithms; Animals; Databases, Genetic; Humans; Immunity/genetics
20.
IEEE Trans Inf Theory ; 67(6): 3295-3314, 2021 Jun.
Article in English | MEDLINE | ID: mdl-34176957

ABSTRACT

The problem of reconstructing a string from its error-prone copies, the trace reconstruction problem, was introduced by Vladimir Levenshtein two decades ago. While there has been considerable theoretical work on trace reconstruction, practical solutions have only recently started to emerge in the context of two rapidly developing research areas: immunogenomics and DNA data storage. In immunogenomics, traces correspond to mutated copies of genes, with mutations generated naturally by the adaptive immune system. In DNA data storage, traces correspond to noisy copies of DNA molecules that encode digital data, with errors being artifacts of the data retrieval process. In this paper, we introduce several new trace generation models and open questions relevant to trace reconstruction for immunogenomics and DNA data storage, survey theoretical results on trace reconstruction, and highlight their connections to computational biology. Throughout, we discuss the applicability and shortcomings of known solutions and suggest future research directions.
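To make the notion of a trace in the abstract above concrete, the sketch below simulates the classical deletion channel, in which each symbol of the original string is dropped independently with probability q. This is the standard textbook model and not one of the new trace generation models the paper introduces; the function names are hypothetical.

```python
import random


def deletion_trace(s, q, rng):
    """Return one trace of `s`: each symbol is deleted independently with probability q."""
    return "".join(c for c in s if rng.random() >= q)


def is_subsequence(t, s):
    """True if `t` can be obtained from `s` by deleting symbols."""
    remaining = iter(s)
    return all(c in remaining for c in t)


rng = random.Random(0)  # fixed seed so the run is repeatable
original = "ACGTACGTAC"
traces = [deletion_trace(original, 0.2, rng) for _ in range(5)]
for t in traces:
    print(t, is_subsequence(t, original))
```

Every trace is a subsequence of the original string; the reconstruction problem asks how many such traces are needed to recover it.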
