1.
J Bioinform Comput Biol ; 22(4): 2450019, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39215522

ABSTRACT

A sequence graph represents the genetic variation of a pan-genome more concisely and space-efficiently than multiple linear reference genomes. To accelerate the alignment of reads to the graph, an index of the graph-based reference genome is used to obtain candidate locations. However, the potential combinatorial explosion of nodes in the sequence graph considerably increases the index space and the peak memory usage of the alignment process, especially for large-scale datasets. Existing methods typically address this by pruning complex regions or extending the seed length, which sacrifices alignment recall while reducing space usage only slightly. We present the Sparse-index of Graph (SIG) and the alignment algorithm SIG-Aligner, which index and align at a lower memory cost. SIG builds a non-overlapping minimizer index inside the nodes of the sequence graph, and SIG-Aligner filters out most false-positive matches using a method based on the pigeonhole principle. In computational experiments on human pan-genome graphs, SIG reduces index memory by 50% to 75% compared with Giraffe while preserving superior or comparable alignment accuracy and faster alignment time.
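
The pigeonhole principle mentioned in this abstract can be illustrated with a minimal sketch. It assumes a plain string reference rather than a sequence graph, and the function names `split_into_seeds` and `passes_pigeonhole_filter` are hypothetical; it is not the SIG-Aligner implementation. The idea: if a read aligns with at most e errors, then splitting it into e + 1 non-overlapping pieces guarantees that at least one piece matches exactly.

```python
def split_into_seeds(read, max_errors):
    """Split a read into max_errors + 1 non-overlapping pieces of near-equal length."""
    n_seeds = max_errors + 1
    if len(read) < n_seeds:
        raise ValueError("read is too short to yield max_errors + 1 non-empty seeds")
    base, extra = divmod(len(read), n_seeds)
    seeds, start = [], 0
    for i in range(n_seeds):
        length = base + (1 if i < extra else 0)
        seeds.append((start, read[start:start + length]))
        start += length
    return seeds


def passes_pigeonhole_filter(read, reference, max_errors):
    """Keep a read only if at least one of its seeds occurs exactly in the reference.

    With at most max_errors edits, at least one of the max_errors + 1 seeds
    must be error-free, so a read with no exact seed hit cannot align.
    """
    return any(seed in reference for _, seed in split_into_seeds(read, max_errors))
```

A read that fails this check can be discarded without running a full alignment; a read that passes still needs verification, since an exact seed hit is necessary but not sufficient.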


Subject(s)
Algorithms , Sequence Alignment , Sequence Analysis, DNA , Humans , Sequence Alignment/methods , Sequence Alignment/statistics & numerical data , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/statistics & numerical data , Genome, Human , Software , Genomics/methods , Genomics/statistics & numerical data , Genome
2.
Brief Bioinform ; 25(4), 2024 May 23.
Article in English | MEDLINE | ID: mdl-38985929

ABSTRACT

Recent advances in sequencing, mass spectrometry, and cytometry technologies have enabled researchers to collect multiple omics data types from a single sample. These large datasets have led to a growing consensus that a holistic approach is needed to identify new candidate biomarkers and unveil mechanisms underlying disease etiology, a key to precision medicine. While many reviews and benchmarks have been conducted on unsupervised approaches, their supervised counterparts have received less attention in the literature and no gold standard has emerged yet. In this work, we present a thorough comparison of a selection of six methods, representative of the main families of intermediate integrative approaches (matrix factorization, multiple kernel methods, ensemble learning, and graph-based methods). As a non-integrative control, random forest was performed on concatenated and separated data types. Methods were evaluated for classification performance on both simulated and real-world datasets, the latter being carefully selected to cover different medical applications (infectious diseases, oncology, and vaccines) and data modalities. A total of 15 simulation scenarios were designed from the real-world datasets to explore a large and realistic parameter space (e.g. sample size, dimensionality, class imbalance, effect size). On real data, the method comparison showed that integrative approaches performed better than or as well as their non-integrative counterpart. By contrast, DIABLO and the four random forest alternatives outperformed the others across the majority of simulation scenarios. The strengths and limitations of these methods are discussed in detail, as well as guidelines for future applications.


Subject(s)
Computational Biology , Humans , Computational Biology/methods , Algorithms , Genomics/methods , Genomics/statistics & numerical data , Multiomics
3.
PLoS Comput Biol ; 20(7): e1012241, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38985831

ABSTRACT

Dimension reduction tools preserving similarity and graph structure such as t-SNE and UMAP can capture complex biological patterns in high-dimensional data. However, these tools typically are not designed to separate effects of interest from unwanted effects due to confounders. We introduce the partial embedding (PARE) framework, which enables removal of confounders from any distance-based dimension reduction method. We then develop partial t-SNE and partial UMAP and apply these methods to genomic and neuroimaging data. For lower-dimensional visualization, our results show that the PARE framework can remove batch effects in single-cell sequencing data as well as separate clinical and technical variability in neuroimaging measures. We demonstrate that the PARE framework extends dimension reduction methods to highlight biological patterns of interest while effectively removing confounding effects.


Subject(s)
Algorithms , Computational Biology , Neuroimaging , Humans , Neuroimaging/methods , Computational Biology/methods , Genomics/methods , Genomics/statistics & numerical data , Single-Cell Analysis/methods , Single-Cell Analysis/statistics & numerical data
4.
J Comput Biol ; 29(2): 155-168, 2022 02.
Article in English | MEDLINE | ID: mdl-35108101

ABSTRACT

k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
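
The mutation model described in this abstract admits a short worked example. Under the stated assumptions (independent per-nucleotide mutation with probability r, no spurious k-mer matches), a k-mer is mutated exactly when at least one of its k nucleotides is, so its mutation probability is 1 - (1 - r)^k. By linearity of expectation, the expected number of mutated k-mers is that probability times the number of k-mers, even though neighbouring k-mers overlap. The sketch below uses hypothetical function names, and the point estimate is a simple inversion of the expectation, not the hypothesis tests or confidence intervals derived in the paper.

```python
def prob_kmer_mutated(r, k):
    """Probability that a k-mer contains at least one mutated nucleotide."""
    return 1.0 - (1.0 - r) ** k


def expected_mutated_kmers(num_kmers, r, k):
    """Expected number of mutated k-mers among num_kmers (linearity of expectation)."""
    return num_kmers * prob_kmer_mutated(r, k)


def estimate_r(num_mutated, num_kmers, k):
    """Point estimate of r obtained by inverting the expectation formula."""
    return 1.0 - (1.0 - num_mutated / num_kmers) ** (1.0 / k)
```

For instance, feeding `expected_mutated_kmers(1000, 0.01, 21)` back into `estimate_r` recovers r = 0.01 up to floating-point error.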


Subject(s)
Mutation , Sequence Analysis, DNA/statistics & numerical data , Algorithms , Base Sequence , Computational Biology , Confidence Intervals , Genomics/statistics & numerical data , Humans , Models, Genetic , Sequence Alignment/statistics & numerical data , Software
5.
PLoS Biol ; 20(2): e3001536, 2022 02.
Article in English | MEDLINE | ID: mdl-35167588

ABSTRACT

The importance of sampling from globally representative populations has been well established in human genomics. In human microbiome research, however, we lack a full understanding of the global distribution of sampling in research studies. This information is crucial to better understand global patterns of microbiome-associated diseases and to extend the health benefits of this research to all populations. Here, we analyze the country of origin of all 444,829 human microbiome samples that are available from the world's 3 largest genomic data repositories, including the Sequence Read Archive (SRA). The samples are from 2,592 studies of 19 body sites, including 220,017 samples of the gut microbiome. We show that more than 71% of samples with a known origin come from Europe, the United States, and Canada, including 46.8% from the US alone, despite the country representing only 4.3% of the global population. We also find that central and southern Asia is the most underrepresented region: Countries such as India, Pakistan, and Bangladesh account for more than a quarter of the world population but make up only 1.8% of human microbiome samples. These results demonstrate a critical need to ensure more global representation of participants in microbiome studies.


Subject(s)
Gastrointestinal Microbiome/genetics , Genomics/methods , Metagenome/genetics , Metagenomics/methods , Microbiota/genetics , Asia , Bangladesh , Canada , Developed Countries , Europe , Genomics/statistics & numerical data , Geography , Humans , India , Metagenomics/statistics & numerical data , Pakistan , United States
6.
J Comput Biol ; 29(1): 19-22, 2022 01.
Article in English | MEDLINE | ID: mdl-34985990

ABSTRACT

Although the availability of various sequencing technologies allows us to capture different genome properties at single-cell resolution, with the exception of a few co-assaying technologies, applying different sequencing assays on the same single cell is impossible. Single-cell alignment using optimal transport (SCOT) is an unsupervised algorithm that addresses this limitation by using optimal transport to align single-cell multiomics data. First, it preserves the local geometry by constructing a k-nearest neighbor (k-NN) graph for each data set (or domain) to capture the intra-domain distances. SCOT then finds a probabilistic coupling matrix that minimizes the discrepancy between the intra-domain distance matrices. Finally, it uses the coupling matrix to project one single-cell data set onto another through barycentric projection, thus aligning them. SCOT requires tuning only two hyperparameters and is robust to the choice of one. Furthermore, the Gromov-Wasserstein distance in the algorithm can guide SCOT's hyperparameter tuning in a fully unsupervised setting when no orthogonal alignment information is available. Thus, SCOT is a fast and accurate alignment method that provides a heuristic for hyperparameter selection in a real-world unsupervised single-cell data alignment scenario. We provide a tutorial for SCOT and make its source code publicly available on GitHub.
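
The final projection step described in this abstract can be sketched generically. Given a coupling matrix between two data sets, barycentric projection maps each point of the first set to the coupling-weighted average of the points in the second. The function name and the list-of-lists representation are assumptions made for illustration; SCOT's published code may organize this step differently, and computing the coupling itself (the Gromov-Wasserstein step) is not shown.

```python
def barycentric_projection(coupling, targets):
    """Project each source point onto the target domain.

    coupling[i][j] is the transport mass between source point i and target j;
    targets[j] is the coordinate vector of target point j.
    """
    dim = len(targets[0])
    projected = []
    for row in coupling:
        mass = sum(row)
        if mass == 0:
            raise ValueError("a source point with zero coupling mass cannot be projected")
        # Weighted average of target coordinates, normalised by the row mass.
        projected.append(
            [sum(w * t[d] for w, t in zip(row, targets)) / mass for d in range(dim)]
        )
    return projected
```

A source point coupled to a single target lands exactly on that target; a point coupled equally to two targets lands at their midpoint.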


Subject(s)
Algorithms , Sequence Alignment/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Computational Biology , Databases, Genetic/statistics & numerical data , Genomics/statistics & numerical data , Heuristics , Humans , Neural Networks, Computer , Sequence Analysis/statistics & numerical data , Software , Unsupervised Machine Learning
7.
J Comput Biol ; 29(1): 56-73, 2022 01.
Article in English | MEDLINE | ID: mdl-34986026

ABSTRACT

Over the past decade, a promising line of cancer research has utilized machine learning to mine statistical patterns of mutations in cancer genomes for information. Recent work shows that these statistical patterns, commonly referred to as "mutational signatures," have diverse therapeutic potential as biomarkers for cancer therapies. However, translating this potential into reality is hindered by limited access to sequencing in the clinic. Almost all methods for mutational signature analysis (MSA) rely on whole genome or whole exome sequencing data, while sequencing in the clinic is typically limited to small gene panels. To improve clinical access to MSA, we considered the question of whether targeted panels could be designed for the purpose of mutational signature detection. Here we present ScalpelSig, to our knowledge the first algorithm that automatically designs genomic panels optimized for detection of a given mutational signature. The algorithm learns from data to identify genome regions that are particularly indicative of signature activity. Using a cohort of breast cancer genomes as training data, we show that ScalpelSig panels substantially improve accuracy of signature detection compared to baselines. We find that some ScalpelSig panels even approach the performance of whole exome sequencing, which observes over 10 × as much genomic material. We test our algorithm under a variety of conditions, showing that its performance generalizes to another dataset of breast cancers, to smaller panel sizes, and to lesser amounts of training data.


Subject(s)
Algorithms , DNA Mutational Analysis/statistics & numerical data , Genomics/statistics & numerical data , Breast Neoplasms/genetics , Cohort Studies , Computational Biology , Databases, Genetic/statistics & numerical data , Female , Humans , Machine Learning , Mutation , Whole Genome Sequencing/statistics & numerical data
8.
J Comput Biol ; 29(2): 169-187, 2022 02.
Article in English | MEDLINE | ID: mdl-35041495

ABSTRACT

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Kuhnle et al. then showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously, in linear time and space with respect to the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared with other read aligners (PuffAligner, Bowtie2, BWA-MEM, and CHIC), MONI used 2-11 times less memory and was 2-32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references.


Subject(s)
Algorithms , Genomics/statistics & numerical data , Sequence Alignment/statistics & numerical data , Software , Computational Biology , Databases, Genetic/statistics & numerical data , Genome, Bacterial , Genome, Human , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Salmonella/genetics , Sequence Analysis, DNA/statistics & numerical data , Wavelet Analysis
9.
J Comput Biol ; 29(2): 188-194, 2022 02.
Article in English | MEDLINE | ID: mdl-35041518

ABSTRACT

Efficiently finding maximal exact matches (MEMs) between a sequence read and a database of genomes is a key first step in read alignment. But until recently, it was unknown how to build a data structure in O(r) space that supports efficient MEM finding, where r is the number of runs in the Burrows-Wheeler Transform. In 2021, Rossi et al. showed how to build a small auxiliary data structure called thresholds in addition to the r-index in O(r) space. This addition enables efficient MEM finding using the r-index. In this article, we present the tool that implements this solution, which we call MONI. Namely, we give a high-level view of the main components of the data structure and show how the source code can be downloaded, compiled, and used to find MEMs between a set of sequence reads and a set of genomes.


Subject(s)
Algorithms , Sequence Alignment/statistics & numerical data , Software , Computational Biology , Databases, Genetic/statistics & numerical data , Genome, Human , Genomics/statistics & numerical data , Humans , Sequence Analysis, DNA/statistics & numerical data
10.
J Comput Biol ; 29(2): 140-154, 2022 02.
Article in English | MEDLINE | ID: mdl-35049334

ABSTRACT

k-mer counts are important features used by many bioinformatics pipelines. Existing k-mer counting methods focus on optimizing either time or memory usage, and output very large count tables that explicitly represent k-mers together with their counts. Storing k-mers is not needed if the set of k-mers is known, making it possible to keep only counters and their association to k-mers. Solutions avoiding explicit representation of k-mers include Minimal Perfect Hash Functions (MPHFs) and Count-Min sketches. We introduce Set-Min sketch, a sketching technique for representing associative maps inspired by Count-Min, and apply it to the problem of representing k-mer count tables. Set-Min is provably more accurate than both Count-Min and Max-Min, an improved variant of Count-Min for static datasets that we define here. We show that Set-Min sketch provides a very low error rate, in terms of both the probability and the size of errors, at the expense of a very moderate memory increase. On the other hand, Set-Min sketches are shown to take up to an order of magnitude less space than MPHF-based solutions, for fully assembled genomes and large k. The space efficiency of Set-Min in this case takes advantage of the power-law distribution of k-mer counts in genomic datasets.
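
For orientation, the Count-Min baseline named in this abstract can be sketched as follows. This is the standard Count-Min sketch, not the Set-Min or Max-Min variants introduced in the paper; the class name, the table dimensions, and the SHA-256-based row hashing are illustrative assumptions. Each item increments one counter per row, and a query returns the minimum over rows, so an estimate can overcount on collisions but never undercount.

```python
import hashlib


class CountMinSketch:
    """Standard Count-Min sketch over string items (e.g. k-mers)."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the row minimum is an upper bound.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))
```

Note that the k-mers themselves are never stored, only counters, which is the property the abstract highlights for sketch-based count tables.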


Subject(s)
Computational Biology/methods , Genomics/statistics & numerical data , Software , Algorithms , Animals , Computer Graphics , Databases, Genetic/statistics & numerical data , Genome, Human , Humans , Models, Statistical , Molecular Sequence Annotation/statistics & numerical data
11.
J Comput Biol ; 29(1): 3-18, 2022 01.
Article in English | MEDLINE | ID: mdl-35050714

ABSTRACT

Recent advances in sequencing technologies have allowed us to capture various aspects of the genome at single-cell resolution. However, with the exception of a few co-assaying technologies, it is not possible to simultaneously apply different sequencing assays on the same single cell. In this scenario, computational integration of multi-omic measurements is crucial to enable joint analyses. This integration task is particularly challenging due to the lack of sample-wise or feature-wise correspondences. We present single-cell alignment with optimal transport (SCOT), an unsupervised algorithm that uses Gromov-Wasserstein optimal transport to align single-cell multi-omics data sets. SCOT performs on par with the current state-of-the-art unsupervised alignment methods, is faster, and requires tuning of fewer hyperparameters. More importantly, SCOT uses a self-tuning heuristic to guide hyperparameter selection based on the Gromov-Wasserstein distance. Thus, in the fully unsupervised setting, SCOT aligns single-cell data sets better than the existing methods without requiring any orthogonal correspondence information.


Subject(s)
Algorithms , Genomics/statistics & numerical data , Sequence Alignment/statistics & numerical data , Single-Cell Analysis/statistics & numerical data , Computational Biology , Computer Simulation , Databases, Genetic/statistics & numerical data , Humans , Models, Statistical , Unsupervised Machine Learning
13.
PLoS Comput Biol ; 17(11): e1009161, 2021 11.
Article in English | MEDLINE | ID: mdl-34762640

ABSTRACT

Network propagation refers to a class of algorithms that integrate information from input data across connected nodes in a given network. These algorithms have wide applications in systems biology, protein function prediction, inferring condition-specifically altered sub-networks, and prioritizing disease genes. Despite the popularity of network propagation, there is a lack of comparative analyses of different algorithms on real data and little guidance on how to select and parameterize the various algorithms. Here, we address this problem by analyzing different combinations of network normalization and propagation methods and by demonstrating schemes for the identification of optimal parameter settings on real proteome and transcriptome data. Our work highlights the risk of a 'topology bias' caused by the incorrect use of network normalization approaches. Capitalizing on the fact that network propagation is a regularization approach, we show that minimizing the bias-variance tradeoff can be utilized for selecting optimal parameters. The application to real multi-omics data demonstrated that optimal parameters could also be obtained by either maximizing the agreement between different omics layers (e.g. proteome and transcriptome) or by maximizing the consistency between biological replicates. Furthermore, we exemplified the utility and robustness of network propagation on multi-omics datasets for identifying ageing-associated genes in brain and liver tissues of rats and for elucidating molecular mechanisms underlying prostate cancer progression. Overall, this work compares different network propagation approaches and it presents strategies for how to use network propagation algorithms to optimally address a specific research question at hand.
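
One widely used member of the algorithm class described in this abstract is random walk with restart, sketched below. The column normalisation, the default restart probability of 0.5, and the function name are illustrative assumptions, not the paper's recommended settings; the abstract itself stresses that the normalisation and parameter choices matter and can introduce a topology bias.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Propagate seed scores over a network given as a dense adjacency matrix."""
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    # Column-normalise so each node spreads its score evenly over its neighbours
    # (one possible normalisation; isolated nodes simply spread nothing).
    walk = [
        [adjacency[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    total = sum(seeds)
    p0 = [s / total for s in seeds]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(walk[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p
```

On a three-node path with the seed at one end, the scores converge to 7/12, 1/3, and 1/12, decreasing with distance from the seed.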


Subject(s)
Algorithms , Computational Biology/methods , Aging/genetics , Aging/metabolism , Animals , Bias , Brain/metabolism , Computational Biology/statistics & numerical data , Data Interpretation, Statistical , Disease Progression , Gene Expression Profiling/statistics & numerical data , Gene Regulatory Networks , Genomics/statistics & numerical data , Humans , Liver/metabolism , Male , Prostatic Neoplasms/etiology , Prostatic Neoplasms/genetics , Prostatic Neoplasms/metabolism , Protein Interaction Maps , Proteomics/statistics & numerical data , RNA, Messenger/genetics , RNA, Messenger/metabolism , Rats , Systems Biology
14.
PLoS Comput Biol ; 17(11): e1009449, 2021 11.
Article in English | MEDLINE | ID: mdl-34780468

ABSTRACT

The cost of sequencing a genome is dropping much faster than the cost of assembling and finishing it. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show, using a mix of theoretical and empirical analysis, that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that this has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to the 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had a median error of 4%, in contrast to other methods that had a median error of 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git.


Subject(s)
Algorithms , Genome , Genomics/statistics & numerical data , Repetitive Sequences, Nucleic Acid , Software , Animals , Computational Biology , Computer Simulation , Databases, Genetic/statistics & numerical data , Humans , Invertebrates/classification , Invertebrates/genetics , Least-Squares Analysis , Linear Models , Mammals/classification , Mammals/genetics , Models, Genetic , Phylogeny , Plants/classification , Plants/genetics , Vertebrates/classification , Vertebrates/genetics
15.
Clin Epigenetics ; 13(1): 179, 2021 09 25.
Article in English | MEDLINE | ID: mdl-34563241

ABSTRACT

BACKGROUND: Nasal intestinal-type adenocarcinomas (ITAC) are strongly related to chronic wood dust exposure: The intestinal phenotype relies on CDX2 overexpression but underlying molecular mechanisms remain unknown. Our objectives were to investigate transcriptomic and methylation differences between healthy non-exposed and tumor olfactory cleft mucosae and to compare transcriptomic profiles between non-exposed, wood dust-exposed and ITAC mucosa cells. METHODS: We conducted a prospective monocentric study (NCT0281823) including 16 woodworkers with ITAC, 16 healthy exposed woodworkers and 13 healthy, non-exposed, controls. We compared tumor samples with healthy non-exposed samples, both in transcriptome and in methylome analyses. We also investigated wood dust-induced transcriptome modifications of exposed (without tumor) male woodworkers' samples and of contralateral sides of woodworkers with tumors. We conducted in parallel transcriptome and methylome analysis, and then, the transcriptome analysis was focused on the genes highlighted in methylome analysis. We replicated our results on dataset GSE17433. RESULTS: Several clusters of genes enabled the distinction between healthy and ITAC samples. Transcriptomic and IHC analysis confirmed a constant overexpression of CDX2 in ITAC samples, without any specific DNA methylation profile regarding the CDX2 locus. ITAC woodworkers also exhibited a specific transcriptomic profile in their contralateral (non-tumor) olfactory cleft, different from that of other exposed woodworkers, suggesting that they had a different exposure or a different susceptibility. Two top-loci (CACNA1C/CACNA1C-AS1 and SLC26A10) were identified with a hemimethylated profile, but only CACNA1C appeared to be overexpressed both in transcriptomic analysis and in immunohistochemistry. 
CONCLUSIONS: Several clusters of genes enable the distinction between healthy mucosa and ITAC samples, even in the contralateral nasal fossa, thus paving the way for a simple diagnostic tool for ITAC in male woodworkers. CACNA1C might be considered a master gene of ITAC and should be further investigated. TRIAL REGISTRATION: NIH ClinicalTrials, NCT0281823, registered May 23, 2016, https://www.clinicaltrials.gov/NCT0281823 .


Subject(s)
Calcium Channels, L-Type/metabolism , Genomics/methods , Intestinal Neoplasms/genetics , Nose Neoplasms/genetics , Adenocarcinoma/epidemiology , Adenocarcinoma/genetics , Aged , Calcium Channels, L-Type/genetics , DNA Methylation/drug effects , Female , Genomics/instrumentation , Genomics/statistics & numerical data , Humans , Intestinal Neoplasms/epidemiology , Male , Middle Aged , Nose Neoplasms/epidemiology , Occupational Exposure/analysis , Wood
16.
PLoS Comput Biol ; 17(8): e1009224, 2021 08.
Article in English | MEDLINE | ID: mdl-34383739

ABSTRACT

Computational integrative analysis has become a significant approach in the data-driven exploration of biological problems. Many integration methods for cancer subtyping have been proposed, but evaluating these methods has become a complicated problem due to the lack of gold standards. Moreover, questions of practical importance remain to be addressed regarding the impact of selecting appropriate data types and combinations on the performance of integrative studies. Here, we constructed three classes of benchmarking datasets of nine cancers in TCGA by considering all the eleven combinations of four multi-omics data types. Using these datasets, we conducted a comprehensive evaluation of ten representative integration methods for cancer subtyping in terms of accuracy measured by combining both clustering accuracy and clinical significance, robustness, and computational efficiency. We subsequently investigated the influence of different omics data on cancer subtyping and the effectiveness of their combinations. Refuting the widely held intuition that incorporating more types of omics data always produces better results, our analyses showed that there are situations where integrating more omics data negatively impacts the performance of integration methods. Our analyses also suggested several effective combinations for most cancers under our studies, which may be of particular interest to researchers in omics data analysis.


Subject(s)
Computational Biology/methods , Neoplasms/classification , Neoplasms/genetics , Algorithms , Biomarkers, Tumor/genetics , Data Interpretation, Statistical , Databases, Genetic/statistics & numerical data , Deep Learning , Female , Genomics/statistics & numerical data , Humans , Male , Unsupervised Machine Learning
17.
PLoS Comput Biol ; 17(8): e1009254, 2021 08.
Article in English | MEDLINE | ID: mdl-34343164

ABSTRACT

Driven by the necessity to survive environmental pathogens, the human immune system has evolved exceptional diversity and plasticity, to which several factors contribute including inheritable structural polymorphism of the underlying genes. Characterizing this variation is challenging due to the complexity of these loci, which contain extensive regions of paralogy, segmental duplication and high copy-number repeats, but recent progress in long-read sequencing and optical mapping techniques suggests this problem may now be tractable. Here we assess this by using long-read sequencing platforms from PacBio and Oxford Nanopore, supplemented with short-read sequencing and Bionano optical mapping, to sequence DNA extracted from CD14+ monocytes and peripheral blood mononuclear cells from a single European individual identified as HV31. We use this data to build a de novo assembly of eight genomic regions encoding four key components of the immune system, namely the human leukocyte antigen, immunoglobulins, T cell receptors, and killer-cell immunoglobulin-like receptors. Validation of our assembly using k-mer based and alignment approaches suggests that it has high accuracy, with estimated base-level error rates below 1 in 10 kb, although we identify a small number of remaining structural errors. We use the assembly to identify heterozygous and homozygous structural variation in comparison to GRCh38. Despite analyzing only a single individual, we find multiple large structural variants affecting core genes at all three immunoglobulin regions and at two of the three T cell receptor regions. Several of these variants are not accurately callable using current algorithms, implying that further methodological improvements are needed. Our results demonstrate that assessing haplotype variation in these regions is possible given sufficiently accurate long-read and associated data. 
Continued reductions in the cost of these technologies will enable application of these methods to larger samples and provide a broader catalogue of germline structural variation at these loci, an important step toward making these regions accessible to large-scale genetic association studies.


Subject(s)
Genetic Variation , Genome, Human/immunology , Immune System , Algorithms , Computational Biology , DNA Copy Number Variations , Genomics/methods , Genomics/statistics & numerical data , HLA Antigens/genetics , Haplotypes , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans , Immunogenetic Phenomena , Immunoglobulins/genetics , Receptors, Antigen, T-Cell/genetics , Receptors, KIR/genetics , Sequence Analysis, DNA/statistics & numerical data
18.
Genome Biol ; 22(1): 208, 2021 07 13.
Article in English | MEDLINE | ID: mdl-34256818

ABSTRACT

One challenge facing omics association studies is the loss of statistical power when adjusting for confounders and multiple testing. The traditional statistical procedure involves fitting a confounder-adjusted regression model for each omics feature, followed by multiple testing correction. Here we show that the traditional procedure is not optimal and present a new approach, 2dFDR, a two-dimensional false discovery rate control procedure, for powerful confounder adjustment in multiple testing. Through extensive evaluation, we demonstrate that 2dFDR is more powerful than the traditional procedure, and in the presence of strong confounding and weak signals, the power improvement could be more than 100%.
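
The multiple testing correction step of the traditional procedure described in this abstract can be illustrated with the Benjamini-Hochberg step-up rule. The abstract does not name a specific correction, so Benjamini-Hochberg is an assumed, commonly used choice here; this sketch is not the 2dFDR procedure, and the function name is hypothetical. The input would be one confounder-adjusted p-value per omics feature.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return the sorted indices of hypotheses rejected at FDR level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_passing_rank = 0
    for rank, i in enumerate(order, start=1):
        # Step-up rule: find the largest rank whose p-value is under rank * alpha / m.
        if pvalues[i] <= rank * alpha / m:
            largest_passing_rank = rank
    # Reject every hypothesis ranked at or below that rank, even if an
    # intermediate p-value missed its own threshold.
    return sorted(order[:largest_passing_rank])
```

With p-values 0.001, 0.03, 0.035, and 0.9, the second value misses its own threshold of 0.025, but the third passes 0.0375, so the first three hypotheses are all rejected.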


Subject(s)
Algorithms , Genome-Wide Association Study , Genomics/statistics & numerical data , Atlases as Topic , Carcinoma, Hepatocellular/genetics , Carcinoma, Hepatocellular/metabolism , DNA Methylation , Datasets as Topic , False Positive Reactions , Gastrointestinal Microbiome/genetics , Genomics/methods , Hepatitis B/genetics , Hepatitis B/metabolism , Hepatitis B virus/genetics , Hepatitis B virus/pathogenicity , Humans , Linear Models , Liver Neoplasms/genetics , Liver Neoplasms/metabolism
19.
PLoS Comput Biol ; 17(7): e1009229, 2021 07.
Article in English | MEDLINE | ID: mdl-34280186

ABSTRACT

Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of large-scale sequencing data. We present a novel graph definition named the Hamming-Shifting graph to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines, aiming to link all pairs of distinct reads that have a small Hamming distance or a small shifting offset or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the lightest-weight edges, and we prove a very high probability of successfully detecting these edges. The resulting graph creates a full mutual reference of the reads to cascade a code-minimized transfer of every child-read for an optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph, and achieved a 10-30% greater file size reduction compared to the best compression results using existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly.
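
The two ingredients named in this abstract, Hamming distance between reads and indexing by lexicographically minimal k-mers, can be sketched as follows. This is an illustrative simplification and not the authors' implementation: it uses a single minimal k-mer per read where the abstract mentions multiple, it ignores shifting offsets, and all function names are hypothetical.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length reads differ."""
    if len(a) != len(b):
        raise ValueError("reads must have equal length")
    return sum(x != y for x, y in zip(a, b))


def minimal_kmer(read, k):
    """Return the lexicographically smallest k-mer contained in a read."""
    if k <= 0 or k > len(read):
        raise ValueError("k must be between 1 and the read length")
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_reads(reads, k):
    """Group reads by their minimal k-mer.

    Reads sharing a bucket are candidate neighbours; only those pairs
    would then need an explicit distance check.
    """
    buckets = {}
    for read in reads:
        buckets.setdefault(minimal_kmer(read, k), []).append(read)
    return buckets
```

For example, `bucket_reads(["GATTACA", "TTACAGG"], 3)` places both reads under the key `"ACA"`, because the two reads overlap with a shift and share that minimal 3-mer.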


Subject(s)
Algorithms , Data Compression/methods , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Animals , Computational Biology , Computer Graphics , Data Compression/statistics & numerical data , Databases, Genetic/statistics & numerical data , Genomics/statistics & numerical data , High-Throughput Nucleotide Sequencing/statistics & numerical data , Humans
20.
Mol Genet Genomics ; 296(5): 1103-1119, 2021 Sep.
Article in English | MEDLINE | ID: mdl-34170407

ABSTRACT

In genome-wide quantitative trait locus (QTL) mapping studies, multiple quantitative traits are often measured along with the marker genotypes. Multi-trait QTL (MtQTL) analysis, which includes multiple quantitative traits together in a single model, is an efficient technique to increase the power of QTL identification. The two most widely used classical approaches for MtQTL mapping are Gaussian Mixture Model-based MtQTL (GMM-MtQTL) and Linear Regression Model-based MtQTL (LRM-MtQTL) analyses. There are two types of LRM-MtQTL approach, known as least squares-based LRM-MtQTL (LS-LRM-MtQTL) and maximum likelihood-based LRM-MtQTL (ML-LRM-MtQTL). These three classical approaches are equivalent alternatives for QTL detection, but ML-LRM-MtQTL is computationally faster than GMM-MtQTL and LS-LRM-MtQTL. However, one major limitation common to all the above classical approaches is that they are very sensitive to outliers, which leads to misleading results. Therefore, in this study, we developed an LRM-based robust MtQTL approach, called LRM-RobMtQTL, for the backcross population based on the robust estimation of regression parameters by maximizing the β-likelihood function induced from the β-divergence with multivariate normal distribution. When β = 0, the proposed LRM-RobMtQTL method reduces to the classical ML-LRM-MtQTL approach. Simulation studies showed that both ML-LRM-MtQTL and LRM-RobMtQTL methods identified the same QTL positions in the absence of outliers. However, in the presence of outliers, only the proposed method was able to identify all the true QTL positions. Real data analysis showed that, in the presence of outliers, only our LRM-RobMtQTL approach identified all the QTL positions that both methods identified in the absence of outliers. We conclude that our proposed LRM-RobMtQTL analysis approach outperforms the classical MtQTL analysis methods.


Subject(s)
Genomics/methods , Quantitative Trait Loci , Animals , Chromosome Mapping , Computer Simulation , Female , Genetics, Population/methods , Genomics/statistics & numerical data , Hordeum/genetics , Likelihood Functions , Mice, Inbred Strains
...