Results 1 - 20 of 30
1.
PLoS Comput Biol ; 20(7): e1012265, 2024 Jul.
Article in English | MEDLINE | ID: mdl-39058741

ABSTRACT

Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) is a valuable experimental tool to study the immune state in health and following immune challenges such as infectious diseases, (auto)immune diseases, and cancer. Several tools have been developed to reconstruct B cell and T cell receptor sequences from AIRR-seq data and infer B and T cell clonal relationships. However, currently available tools offer limited parallelization across samples, scalability or portability to high-performance computing infrastructures. To address this need, we developed nf-core/airrflow, an end-to-end bulk and single-cell AIRR-seq processing workflow which integrates the Immcantation Framework following BCR and TCR sequencing data analysis best practices. The Immcantation Framework is a comprehensive toolset, which allows the processing of bulk and single-cell AIRR-seq data from raw read processing to clonal inference. nf-core/airrflow is written in Nextflow and is part of the nf-core project, which collects community contributed and curated Nextflow workflows for a wide variety of analysis tasks. We assessed the performance of nf-core/airrflow on simulated sequencing data with sequencing errors and show example results with real datasets. To demonstrate the applicability of nf-core/airrflow to the high-throughput processing of large AIRR-seq datasets, we validated and extended previously reported findings of convergent antibody responses to SARS-CoV-2 by analyzing 97 COVID-19 infected individuals and 99 healthy controls, including a mixture of bulk and single-cell sequencing datasets. Using this dataset, we extended the convergence findings to 20 additional subjects, highlighting the applicability of nf-core/airrflow to validate findings in small in-house cohorts with reanalysis of large publicly available AIRR datasets.


Subjects
COVID-19, Computational Biology, T-Cell Antigen Receptors, SARS-CoV-2, Workflow, Humans, COVID-19/immunology, COVID-19/virology, COVID-19/genetics, SARS-CoV-2/immunology, SARS-CoV-2/genetics, T-Cell Antigen Receptors/genetics, T-Cell Antigen Receptors/immunology, Computational Biology/methods, B-Cell Antigen Receptors/genetics, B-Cell Antigen Receptors/immunology, Software, Single-Cell Analysis/methods, High-Throughput Nucleotide Sequencing/methods, Adaptive Immunity/genetics, B-Lymphocytes/immunology, T-Lymphocytes/immunology
2.
Appl Environ Microbiol ; 89(6): e0007923, 2023 Jun 28.
Article in English | MEDLINE | ID: mdl-37191555

ABSTRACT

Bacteriophages have received recent attention for their therapeutic potential to treat antibiotic-resistant bacterial infections. One particular idea in phage therapy is to use phages that not only directly kill their bacterial hosts but also rely on particular bacterial receptors, such as proteins involved in virulence or antibiotic resistance. In such cases, the evolution of phage resistance would correspond to the loss of those receptors, an approach termed evolutionary steering. We previously found that during experimental evolution, phage U136B can exert selection pressure on Escherichia coli to lose or modify its receptor, the antibiotic efflux protein TolC, often resulting in reduced antibiotic resistance. However, for TolC-reliant phages like U136B to be used therapeutically, we also need to study their own evolutionary potential. Understanding phage evolution is critical for the development of improved phage therapies as well as the tracking of phage populations during infection. Here, we characterized phage U136B evolution in 10 replicate experimental populations. We quantified phage dynamics that resulted in five surviving phage populations at the end of the 10-day experiment. We found that phages from all five surviving populations had evolved higher rates of adsorption on either ancestral or coevolved E. coli hosts. Using whole-genome and whole-population sequencing, we established that these higher rates of adsorption were associated with parallel molecular evolution in phage tail protein genes. These findings will be useful in future studies to predict how key phage genotypes and phenotypes influence phage efficacy and survival despite the evolution of host resistance. IMPORTANCE Antibiotic resistance is a persistent problem in health care and a factor that may help maintain bacterial diversity in natural environments. Bacteriophages ("phages") are viruses that specifically infect bacteria. 
We previously discovered and characterized a phage called U136B, which infects bacteria through TolC. TolC is an antibiotic resistance protein that helps bacteria pump antibiotics out of the cell. Over short timescales, phage U136B can be used to evolutionarily "steer" bacterial populations to lose or modify the TolC protein, sometimes reducing antibiotic resistance. In this study, we investigate whether U136B itself evolves to better infect bacterial cells. We discovered that the phage can readily evolve specific mutations that increase its infection rate. This work will be useful for understanding how phages can be used to treat bacterial infections.


Subjects
Bacteriophages, Bacteriophages/genetics, Escherichia coli/genetics, Adsorption, Mutation, Anti-Bacterial Agents/pharmacology
3.
Am J Hum Genet ; 100(4): 581-591, 2017 Apr 06.
Article in English | MEDLINE | ID: mdl-28285767

ABSTRACT

Efforts to decipher the causal relationships between differences in gene regulation and corresponding differences in phenotype have been stymied by several basic technical challenges. Although detecting local, cis-eQTLs is now routine, trans-eQTLs, which are distant from the genes of origin, are far more difficult to find because millions of SNPs must currently be compared to thousands of transcripts. Here, we demonstrate an alternative approach: we looked for SNPs associated with the expression of many genes simultaneously and found that hundreds of trans-eQTLs each affect hundreds of transcripts in lymphoblastoid cell lines across three African populations. These trans-eQTLs target the same genes across the three populations and show the same direction of effect. We discovered that target transcripts of a high-confidence set of trans-eQTLs encode proteins that interact more frequently than expected by chance, are bound by the same transcription factors, and are enriched for pathway annotations indicative of roles in basic cell homeostasis. We thus demonstrate that our approach can uncover trans-acting transcriptional control circuits that affect co-regulated groups of genes: a key to understanding how cellular pathways and processes are orchestrated.


Subjects
Gene Expression Regulation, Quantitative Trait Loci, Genetic Transcription, Algorithms, Black People/genetics, Cell Line, Gene Expression Profiling, HapMap Project, Humans, Single Nucleotide Polymorphism, Protein Interaction Maps
4.
Nature ; 498(7453): 220-3, 2013 Jun 13.
Article in English | MEDLINE | ID: mdl-23665959

ABSTRACT

Congenital heart disease (CHD) is the most frequent birth defect, affecting 0.8% of live births. Many cases occur sporadically and impair reproductive fitness, suggesting a role for de novo mutations. Here we compare the incidence of de novo mutations in 362 severe CHD cases and 264 controls by analysing exome sequencing of parent-offspring trios. CHD cases show a significant excess of protein-altering de novo mutations in genes expressed in the developing heart, with an odds ratio of 7.5 for damaging (premature termination, frameshift, splice site) mutations. Similar odds ratios are seen across the main classes of severe CHD. We find a marked excess of de novo mutations in genes involved in the production, removal or reading of histone 3 lysine 4 (H3K4) methylation, or ubiquitination of H2BK120, which is required for H3K4 methylation. There are also two de novo mutations in SMAD2, which regulates H3K27 methylation in the embryonic left-right organizer. The combination of both activating (H3K4 methylation) and inactivating (H3K27 methylation) chromatin marks characterizes 'poised' promoters and enhancers, which regulate expression of key developmental genes. These findings implicate de novo point mutations in several hundreds of genes that collectively contribute to approximately 10% of severe CHD.


Subjects
Heart Diseases/congenital, Heart Diseases/genetics, Histones/metabolism, Adult, Case-Control Studies, Child, Chromatin/chemistry, Chromatin/metabolism, DNA Mutational Analysis, Genetic Enhancer Elements/genetics, Exome/genetics, Female, Developmental Genes/genetics, Heart Diseases/metabolism, Histones/chemistry, Humans, Lysine/chemistry, Lysine/metabolism, Male, Methylation, Mutation, Odds Ratio, Genetic Promoter Regions/genetics
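
The abstract above reports an odds ratio for damaging de novo mutations in cases versus controls. As a minimal sketch of how such a ratio is computed from a 2x2 table, the following may help; the function name is an assumption and the counts are hypothetical placeholders, not the study's data.

```python
def odds_ratio(case_hits, case_total, control_hits, control_total):
    """Odds ratio for carrying a mutation in cases versus controls."""
    case_misses = case_total - case_hits
    control_misses = control_total - control_hits
    if case_misses == 0 or control_hits == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    # (odds of a hit among cases) / (odds of a hit among controls)
    return (case_hits * control_misses) / (case_misses * control_hits)


# Hypothetical counts for illustration only (not taken from the paper).
example = odds_ratio(case_hits=12, case_total=100, control_hits=3, control_total=100)
```

A value above 1 indicates that the mutation class is more common among cases; a real analysis would also report a confidence interval or a test of significance.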
5.
Nature ; 485(7397): 237-41, 2012 Apr 04.
Article in English | MEDLINE | ID: mdl-22495306

ABSTRACT

Multiple studies have confirmed the contribution of rare de novo copy number variations to the risk for autism spectrum disorders. But whereas de novo single nucleotide variants have been identified in affected individuals, their contribution to risk has yet to be clarified. Specifically, the frequency and distribution of these mutations have not been well characterized in matched unaffected controls, and such data are vital to the interpretation of de novo coding mutations observed in probands. Here we show, using whole-exome sequencing of 928 individuals, including 200 phenotypically discordant sibling pairs, that highly disruptive (nonsense and splice-site) de novo mutations in brain-expressed genes are associated with autism spectrum disorders and carry large effects. On the basis of mutation rates in unaffected individuals, we demonstrate that multiple independent de novo single nucleotide variants in the same gene among unrelated probands reliably identifies risk alleles, providing a clear path forward for gene discovery. Among a total of 279 identified de novo coding mutations, there is a single instance in probands, and none in siblings, in which two independent nonsense variants disrupt the same gene, SCN2A (sodium channel, voltage-gated, type II, α subunit), a result that is highly unlikely by chance.


Subjects
Autistic Disorder/genetics, Exome/genetics, Exons/genetics, Genetic Predisposition to Disease/genetics, Mutation/genetics, Nerve Tissue Proteins/genetics, Sodium Channels/genetics, Alleles, Nonsense Codon/genetics, Genetic Heterogeneity, Humans, NAV1.2 Voltage-Gated Sodium Channel, RNA Splice Sites/genetics, Siblings
6.
Nature ; 482(7383): 98-102, 2012 Jan 22.
Article in English | MEDLINE | ID: mdl-22266938

ABSTRACT

Hypertension affects one billion people and is a principal reversible risk factor for cardiovascular disease. Pseudohypoaldosteronism type II (PHAII), a rare Mendelian syndrome featuring hypertension, hyperkalaemia and metabolic acidosis, has revealed previously unrecognized physiology orchestrating the balance between renal salt reabsorption and K(+) and H(+) excretion. Here we used exome sequencing to identify mutations in kelch-like 3 (KLHL3) or cullin 3 (CUL3) in PHAII patients from 41 unrelated families. KLHL3 mutations are either recessive or dominant, whereas CUL3 mutations are dominant and predominantly de novo. CUL3 and BTB-domain-containing kelch proteins such as KLHL3 are components of cullin-RING E3 ligase complexes that ubiquitinate substrates bound to kelch propeller domains. Dominant KLHL3 mutations are clustered in short segments within the kelch propeller and BTB domains implicated in substrate and cullin binding, respectively. Diverse CUL3 mutations all result in skipping of exon 9, producing an in-frame deletion. Because dominant KLHL3 and CUL3 mutations both phenocopy recessive loss-of-function KLHL3 mutations, they may abrogate ubiquitination of KLHL3 substrates. Disease features are reversed by thiazide diuretics, which inhibit the Na-Cl cotransporter in the distal nephron of the kidney; KLHL3 and CUL3 are expressed in this location, suggesting a mechanistic link between KLHL3 and CUL3 mutations, increased Na-Cl reabsorption, and disease pathogenesis. These findings demonstrate the utility of exome sequencing in disease gene identification despite the combined complexities of locus heterogeneity, mixed models of transmission and frequent de novo mutation, and establish a fundamental role for KLHL3 and CUL3 in blood pressure, K(+) and pH homeostasis.


Subjects
Carrier Proteins/genetics, Cullin Proteins/genetics, Hypertension/genetics, Mutation/genetics, Pseudohypoaldosteronism/genetics, Water-Electrolyte Imbalance/genetics, Signal Transducing Adaptor Proteins, Amino Acid Sequence, Animals, Base Sequence, Blood Pressure/genetics, Carrier Proteins/chemistry, Cohort Studies, Cullin Proteins/chemistry, Electrolytes, Exons/genetics, Female, Gene Expression Profiling, Dominant Genes/genetics, Recessive Genes/genetics, Genotype, Homeostasis/genetics, Humans, Hydrogen-Ion Concentration, Hypertension/complications, Hypertension/physiopathology, Male, Mice, Microfilament Proteins, Molecular Models, Molecular Sequence Data, Phenotype, Potassium/metabolism, Pseudohypoaldosteronism/complications, Pseudohypoaldosteronism/physiopathology, Sodium Chloride/metabolism, Water-Electrolyte Imbalance/complications, Water-Electrolyte Imbalance/physiopathology
7.
Proc Natl Acad Sci U S A ; 110(38): E3640-9, 2013 Sep 17.
Article in English | MEDLINE | ID: mdl-24003131

ABSTRACT

Despite considerable efforts to sequence hypermutated cancers such as melanoma, distinguishing cancer-driving genes from thousands of recurrently mutated genes remains a significant challenge. To circumvent the problematic background mutation rates and identify new melanoma driver genes, we carried out a low-copy piggyBac transposon mutagenesis screen in mice. We induced eleven melanomas with mutation burdens that were 100-fold lower relative to human melanomas. Thirty-eight implicated genes, including two known drivers of human melanoma, were classified into three groups based on high, low, or background-level mutation frequencies in human melanomas, and we further explored the functional significance of genes in each group. For two genes overlooked by prevailing discovery methods, we found that loss of membrane associated guanylate kinase, WW and PDZ domain containing 2 and protein tyrosine phosphatase, receptor type, O cooperated with the v-raf murine sarcoma viral oncogene homolog B (BRAF) recurrent V600E mutation to promote cellular transformation. Moreover, for infrequently mutated genes often disregarded by current methods, we discovered recurrent mitogen-activated protein kinase kinase kinase 1 (Map3k1)-activating insertions in our screen, mirroring recurrent MAP3K1 up-regulation in human melanomas. Aberrant expression of Map3k1 enabled growth factor-autonomous proliferation and drove BRAF-independent ERK signaling, thus shedding light on alternative means of activating this prominent signaling pathway in melanoma. In summary, our study contributes several previously undescribed genes involved in melanoma and establishes an important proof-of-principle for the utility of the low-copy transposon mutagenesis approach for identifying cancer-driving genes, especially those masked by hypermutation.


Subjects
DNA Transposable Elements/genetics, Neoplastic Gene Expression Regulation/physiology, MAP Kinase Kinase Kinase 1/metabolism, Melanoma/genetics, Insertional Mutagenesis/genetics, Signal Transduction/physiology, Animals, Western Blotting, DNA Primers/genetics, Neoplastic Gene Expression Regulation/genetics, Genetic Testing, HEK293 Cells, Humans, Immunohistochemistry, Mice, Transgenic Mice, Reverse Transcriptase Polymerase Chain Reaction, Signal Transduction/genetics, Species Specificity
8.
BMC Bioinformatics ; 15: 231, 2014 Jul 03.
Article in English | MEDLINE | ID: mdl-24990767

ABSTRACT

BACKGROUND: Current research suggests that a small set of "driver" mutations are responsible for tumorigenesis while a larger body of "passenger" mutations occur in the tumor but do not progress the disease. Due to recent pharmacological successes in treating cancers caused by driver mutations, a variety of methodologies that attempt to identify such mutations have been developed. Based on the hypothesis that driver mutations tend to cluster in key regions of the protein, the development of cluster identification algorithms has become critical. RESULTS: We have developed a novel methodology, SpacePAC (Spatial Protein Amino acid Clustering), that identifies mutational clustering by considering the protein tertiary structure directly in 3D space. By combining the mutational data in the Catalogue of Somatic Mutations in Cancer (COSMIC) and the spatial information in the Protein Data Bank (PDB), SpacePAC is able to identify novel mutation clusters in many proteins such as FGFR3 and CHRM2. In addition, SpacePAC is better able to localize the most significant mutational hotspots as demonstrated in the cases of BRAF and ALK. The R package is available on Bioconductor at: http://www.bioconductor.org/packages/release/bioc/html/SpacePAC.html. CONCLUSION: SpacePAC adds a valuable tool to the identification of mutational clusters while considering protein tertiary structure.


Subjects
Computational Biology/methods, Mutation, Proteins/chemistry, Proteins/genetics, Algorithms, Cluster Analysis, Protein Databases, Neoplasm Genes/genetics, Humans, Neoplasms/genetics, Protein Tertiary Structure
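
The idea of looking for mutational clusters directly in 3D space, described in the abstract above, can be illustrated with a toy example. The sketch below is not the SpacePAC algorithm and does not use the SpacePAC package; it only counts mutations inside a sphere centered on each residue and reports the densest one. The function name, coordinates, and counts are all hypothetical.

```python
import math


def densest_sphere(coords, mutation_counts, radius):
    """Return (center_residue, mutations_inside) for the sphere of the given
    radius, centered on a residue, that contains the most mutations."""
    best_center, best_total = None, -1
    for center, center_xyz in coords.items():
        total = sum(
            mutation_counts.get(residue, 0)
            for residue, xyz in coords.items()
            if math.dist(center_xyz, xyz) <= radius
        )
        if total > best_total:
            best_center, best_total = center, total
    return best_center, best_total


# Hypothetical residue coordinates (in angstroms) and mutation counts.
coords = {1: (0.0, 0.0, 0.0), 2: (3.0, 0.0, 0.0), 3: (6.0, 0.0, 0.0), 4: (30.0, 0.0, 0.0)}
mutation_counts = {1: 2, 2: 5, 3: 4, 4: 1}
center, total = densest_sphere(coords, mutation_counts, radius=4.0)
```

A real method would also need to assess whether the densest sphere holds more mutations than expected by chance, which this sketch leaves out.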
9.
bioRxiv ; 2024 Jan 28.
Article in English | MEDLINE | ID: mdl-38293151

ABSTRACT

Adaptive Immune Receptor Repertoire sequencing (AIRR-seq) is a valuable experimental tool to study the immune state in health and following immune challenges such as infectious diseases, (auto)immune diseases, and cancer. Several tools have been developed to reconstruct B cell and T cell receptor sequences from AIRR-seq data and infer B and T cell clonal relationships. However, currently available tools offer limited parallelization across samples, scalability or portability to high-performance computing infrastructures. To address this need, we developed nf-core/airrflow, an end-to-end bulk and single-cell AIRR-seq processing workflow which integrates the Immcantation Framework following BCR and TCR sequencing data analysis best practices. The Immcantation Framework is a comprehensive toolset, which allows the processing of bulk and single-cell AIRR-seq data from raw read processing to clonal inference. nf-core/airrflow is written in Nextflow and is part of the nf-core project, which collects community contributed and curated Nextflow workflows for a wide variety of analysis tasks. We assessed the performance of nf-core/airrflow on simulated sequencing data with sequencing errors and show example results with real datasets. To demonstrate the applicability of nf-core/airrflow to the high-throughput processing of large AIRR-seq datasets, we validated and extended previously reported findings of convergent antibody responses to SARS-CoV-2 by analyzing 97 COVID-19 infected individuals and 99 healthy controls, including a mixture of bulk and single-cell sequencing datasets. Using this dataset, we extended the convergence findings to 20 additional subjects, highlighting the applicability of nf-core/airrflow to validate findings in small in-house cohorts with reanalysis of large publicly available AIRR datasets. nf-core/airrflow is available free of charge, under the MIT license on GitHub (https://github.com/nf-core/airrflow). 
Detailed documentation and example results are available on the nf-core website (https://nf-co.re/airrflow).

10.
Genome Res ; 20(7): 960-71, 2010 Jul.
Article in English | MEDLINE | ID: mdl-20430783

ABSTRACT

Recent metagenomics studies have begun to sample the genomic diversity among disparate habitats and relate this variation to features of the environment. Membrane proteins are an intuitive, but thus far overlooked, choice in this type of analysis as they directly interact with the environment, receiving signals from the outside and transporting nutrients. Using global ocean sampling (GOS) data, we found nearly 900,000 membrane proteins in large-scale metagenomic sequence, approximately a fifth of which are completely novel, suggesting a large space of hitherto unexplored protein diversity. Using GPS coordinates for the GOS sites, we extracted additional environmental features via interpolation from the World Ocean Database, the National Center for Ecological Analysis and Synthesis, and empirical models of dust occurrence. This allowed us to study membrane protein variation in terms of natural features, such as phosphate and nitrate concentrations, and also in terms of human impacts, such as pollution and climate change. We show that there is widespread variation in membrane protein content across marine sites, which is correlated with changes in both oceanographic variables and human factors. Furthermore, using these data, we developed an approach, protein families and environment features network (PEN), to quantify and visualize the correlations. PEN identifies small groups of covarying environmental features and membrane protein families, which we call "bimodules." Using this approach, we find that the affinity of phosphate transporters is related to the concentration of phosphate and that the occurrence of iron transporters is connected to the amount of shipping, pollution, and iron-containing dust.


Subjects
Environment, Gene Regulatory Networks, Membrane Proteins/genetics, Metagenomics, Proteins/genetics, Biological Adaptation/genetics, Cluster Analysis, Gene Regulatory Networks/physiology, Geography, Humans, Marine Biology/methods, Membrane Proteins/analysis, Multigene Family/physiology, Oceans and Seas, Phylogeny, Principal Component Analysis, DNA Sequence Analysis
11.
Bioinformatics ; 27(8): 1152-4, 2011 Apr 15.
Article in English | MEDLINE | ID: mdl-21349863

ABSTRACT

We have implemented aggregation and correlation toolbox (ACT), an efficient, multifaceted toolbox for analyzing continuous signal and discrete region tracks from high-throughput genomic experiments, such as RNA-seq or ChIP-chip signal profiles from the ENCODE and modENCODE projects, or lists of single nucleotide polymorphisms from the 1000 Genomes Project. It is able to generate aggregate profiles of a given track around a set of specified anchor points, such as transcription start sites. It is also able to correlate related tracks and analyze them for saturation, i.e., how much of a certain feature is covered with each new succeeding experiment. The ACT site contains downloadable code in a variety of formats, interactive web servers (for use on small quantities of data), example datasets, documentation and a gallery of outputs. Here, we explain the components of the toolbox in more detail and apply them in various contexts. AVAILABILITY: ACT is available at http://act.gersteinlab.org. CONTACT: pi@gersteinlab.org.


Subjects
Genomics/methods, Software, Single Nucleotide Polymorphism, Transcription Initiation Site
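
An aggregate profile of a signal track around anchor points, as mentioned in the abstract above, can be sketched in a few lines. This is an illustration of the general idea only, not code from the ACT toolbox; the function name, the track values, and the rule for skipping incomplete windows are assumptions.

```python
def aggregate_profile(signal, anchors, flank):
    """Average the signal at each offset in [-flank, +flank] around the anchors.
    Anchors whose window would run off either end of the track are skipped."""
    windows = [
        signal[a - flank : a + flank + 1]
        for a in anchors
        if a - flank >= 0 and a + flank < len(signal)
    ]
    if not windows:
        raise ValueError("no anchor has a complete window")
    # zip(*windows) groups the values that share the same offset from the anchor.
    return [sum(column) / len(windows) for column in zip(*windows)]


# Hypothetical per-base signal with anchors at positions 2, 7 and 9.
signal = [0, 1, 4, 1, 0, 0, 2, 6, 2, 0]
profile = aggregate_profile(signal, anchors=[2, 7, 9], flank=1)
```

Here the anchor at position 9 is dropped because its window extends past the end of the track, so the profile averages the windows around positions 2 and 7.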
12.
Mol Syst Biol ; 7: 522, 2011 Aug 02.
Article in English | MEDLINE | ID: mdl-21811232

ABSTRACT

To study allele-specific expression (ASE) and binding (ASB), that is, differences between the maternally and paternally derived alleles, we have developed a computational pipeline (AlleleSeq). Our pipeline initially constructs a diploid personal genome sequence (and corresponding personalized gene annotation) using genomic sequence variants (SNPs, indels, and structural variants), and then identifies allele-specific events with significant differences in the number of mapped reads between maternal and paternal alleles. There are many technical challenges in the construction and alignment of reads to a personal diploid genome sequence that we address, for example, bias of reads mapping to the reference allele. We have applied AlleleSeq to variation data for NA12878 from the 1000 Genomes Project as well as matched, deeply sequenced RNA-Seq and ChIP-Seq data sets generated for this purpose. In addition to observing fairly widespread allele-specific behavior within individual functional genomic data sets (including results consistent with X-chromosome inactivation), we can study the interaction between ASE and ASB. Furthermore, we investigate the coordination between ASE and ASB from multiple transcription factors events using a regulatory network framework. Correlation analyses and network motifs show mostly coordinated ASB and ASE.


Subjects
Alleles, DNA-Binding Proteins/genetics, Gene Regulatory Networks, RNA Sequence Analysis, Cell Line, Chromosome Mapping, Human X Chromosomes/genetics, Human Y Chromosomes/genetics, DNA-Binding Proteins/metabolism, Genetic Databases, Gene Expression Regulation, Human Genome, Humans, Molecular Sequence Annotation, Oligonucleotide Array Sequence Analysis, Single Nucleotide Polymorphism, Transcription Factors/genetics, Transcription Factors/metabolism
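
The abstract above describes identifying allele-specific events from significant differences in the number of reads mapped to the maternal and paternal alleles, without naming the test. One common choice, assumed here purely for illustration, is an exact two-sided binomial test against a 50/50 allelic ratio. The function name and read counts below are hypothetical and are not taken from the AlleleSeq pipeline.

```python
from math import comb


def binomial_two_sided_p(maternal_reads, paternal_reads):
    """Exact two-sided binomial p-value under a 50/50 allelic ratio."""
    n = maternal_reads + paternal_reads
    if n == 0:
        return 1.0
    # With p = 0.5 every outcome k has probability comb(n, k) / 2**n, so
    # outcomes can be compared by their binomial coefficients alone.
    observed = comb(n, maternal_reads)
    tail = sum(comb(n, k) for k in range(n + 1) if comb(n, k) <= observed)
    return tail / 2 ** n


# Hypothetical read counts at one heterozygous site.
p_skewed = binomial_two_sided_p(9, 1)
p_balanced = binomial_two_sided_p(5, 5)
```

In a genome-wide setting these per-site p-values would still need correction for multiple testing, which is outside the scope of this sketch.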
13.
Proc Natl Acad Sci U S A ; 106(5): 1374-9, 2009 Feb 03.
Article in English | MEDLINE | ID: mdl-19164758

ABSTRACT

Recently, approaches have been developed to sample the genetic content of heterogeneous environments (metagenomics). However, by what means these sequences link distinct environmental conditions with specific biological processes is not well understood. Thus, a major challenge is how the usage of particular pathways and subnetworks reflects the adaptation of microbial communities across environments and habitats, i.e., how network dynamics relates to environmental features. Previous research has treated environments as discrete, somewhat simplified classes (e.g., terrestrial vs. marine), and searched for obvious metabolic differences among them (i.e., treating the analysis as a typical classification problem). However, environmental differences result from combinations of many factors, which often vary only slightly. Therefore, we introduce an approach that employs correlation and regression to relate multiple, continuously varying factors defining an environment to the extent of particular microbial pathways present in a geographic site. Moreover, rather than looking only at individual correlations (one-to-one), we adapted canonical correlation analysis and related techniques to define an ensemble of weighted pathways that maximally covaries with a combination of environmental variables (many-to-many), which we term a metabolic footprint. Applied to available aquatic datasets, we identified footprints predictive of their environment that can potentially be used as biosensors. For example, we show a strong multivariate correlation between the energy-conversion strategies of a community and multiple environmental gradients (e.g., temperature). Moreover, we identified covariation in amino acid transport and cofactor synthesis, suggesting that limiting amounts of cofactor can (partially) explain increased import of amino acids in nutrient-limited conditions.


Subjects
Genomics, Microbiology, Amino Acids/biosynthesis, Biosensing Techniques, Lipid Metabolism, Polysaccharides/metabolism
14.
Hum Hered ; 72(2): 85-97, 2011.
Article in English | MEDLINE | ID: mdl-21934324

ABSTRACT

BACKGROUND: Genetic association studies, thus far, have focused on the analysis of individual main effects of SNP markers. Nonetheless, there is a clear need for modeling epistasis or gene-gene interactions to better understand the biologic basis of existing associations. Tree-based methods have been widely studied as tools for building prediction models based on complex variable interactions. An understanding of the power of such methods for the discovery of genetic associations in the presence of complex interactions is of great importance. Here, we systematically evaluate the power of three leading algorithms: random forests (RF), Monte Carlo logic regression (MCLR), and multifactor dimensionality reduction (MDR). METHODS: We use the algorithm-specific variable importance measures (VIMs) as statistics and employ permutation-based resampling to generate the null distribution and associated p values. The power of the three is assessed via simulation studies. Additionally, in a data analysis, we evaluate the associations between individual SNPs in pro-inflammatory and immunoregulatory genes and the risk of non-Hodgkin lymphoma. RESULTS: The power of RF is highest in all simulation models, that of MCLR is similar to RF in half, and that of MDR is consistently the lowest. CONCLUSIONS: Our study indicates that the power of RF VIMs is most reliable. However, in addition to tuning parameters, the power of RF is notably influenced by the type of variable (continuous vs. categorical) and the chosen VIM.


Subjects
Data Mining/methods, Genetic Epistasis, Genetic Association Studies, Algorithms, Computer Simulation, Genetic Loci, Human Genome, Haplotypes, Humans, Non-Hodgkin Lymphoma/genetics, Monte Carlo Method, Single Nucleotide Polymorphism
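
The METHODS section of the abstract above uses permutation-based resampling to turn a variable importance measure into a p value. The sketch below shows that resampling step in isolation. It substitutes a toy statistic (absolute difference in mean genotype between cases and controls) for the RF, MCLR, and MDR importance measures, and all names and data are hypothetical.

```python
import random


def permutation_p_value(statistic, features, labels, n_permutations=1000, seed=0):
    """Estimate a p-value by recomputing `statistic` on shuffled labels."""
    rng = random.Random(seed)
    observed = statistic(features, labels)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real feature-label association
        if statistic(features, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_permutations + 1)


def mean_difference(genotypes, labels):
    """Toy importance measure: |mean genotype in cases - mean in controls|."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


# Hypothetical minor-allele counts (0, 1 or 2) and case (1) / control (0) labels.
genotypes = [2, 2, 1, 2, 2, 1, 0, 0, 1, 0, 0, 0]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p = permutation_p_value(mean_difference, genotypes, labels)
```

Because only the labels are shuffled, the same routine works with any importance measure that can be passed in as `statistic`.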
15.
Appl Environ Microbiol ; 77(23): 8400-8, 2011 Dec.
Article in English | MEDLINE | ID: mdl-21948847

ABSTRACT

Vertical transmission of obligate symbionts generates a predictable evolutionary history of symbionts that reflects that of their hosts. In insects, evolutionary associations between symbionts and their hosts have been investigated primarily among species, leaving population-level processes largely unknown. In this study, we investigated the tsetse (Diptera: Glossinidae) bacterial symbiont, Wigglesworthia glossinidia, to determine whether observed codiversification of symbiont and tsetse host species extends to a single host species (Glossina fuscipes fuscipes) in Uganda. To explore symbiont genetic variation in G. f. fuscipes populations, we screened two variable loci (lon and lepA) from the Wigglesworthia glossinidia bacterium in the host species Glossina fuscipes fuscipes (W. g. fuscipes) and examined phylogeographic and demographic characteristics in multiple host populations. Symbiont genetic variation was apparent within and among populations. We identified two distinct symbiont lineages, in northern and southern Uganda. Incongruence length difference (ILD) tests indicated that the two lineages corresponded exactly to northern and southern G. f. fuscipes mitochondrial DNA (mtDNA) haplogroups (P = 1.0). Analysis of molecular variance (AMOVA) confirmed that most variation was partitioned between the northern and southern lineages defined by host mtDNA (85.44%). However, ILD tests rejected finer-scale congruence within the northern and southern populations (P = 0.009). This incongruence was potentially due to incomplete lineage sorting that resulted in novel combinations of symbiont genetic variants and host background. Identifying these novel combinations may have public health significance, since tsetse is the sole vector of sleeping sickness and Wigglesworthia is known to influence host vector competence. Thus, understanding the adaptive value of these host-symbiont combinations may afford opportunities to develop vector control methods.


Subjects
Genetic Variation, Phylogeography, Symbiosis, Tsetse Flies/microbiology, Wigglesworthia/classification, Wigglesworthia/isolation & purification, Animals, Mitochondrial DNA/chemistry, Mitochondrial DNA/genetics, Molecular Sequence Data, Protease La/genetics, DNA Sequence Analysis, Transcriptional Elongation Factors/genetics, Tsetse Flies/genetics, Uganda, Wigglesworthia/genetics, Wigglesworthia/physiology
16.
Proteins ; 78(2): 309-24, 2010 Feb 01.
Article in English | MEDLINE | ID: mdl-19705487

ABSTRACT

Advances in structure determination have made possible the analysis of large macromolecular complexes (some with nearly 10,000 residues, such as GroEL). The large-scale conformational changes associated with these complexes require new approaches. Historically, a crucial component of motion analysis has been the identification of moving rigid blocks from the comparison of different conformations. However, existing tools do not allow consistent block identification in very large structures. Here, we describe a novel method, RigidFinder, for identifying rigid blocks from different conformations across many scales, from large complexes to small loops. RigidFinder defines rigidity in terms of blocks in which inter-residue distances are conserved across conformations. Distance conservation, unlike the averaged values (e.g., RMSD) used by many other methods, allows for sensitive identification of motions. A further distinguishing feature of our method is that it can find blocks made from nonconsecutive fragments of multiple polypeptide chains. In our implementation, we use an efficient quasi-dynamic programming search algorithm that allows for real-time application to very large structures. RigidFinder can be used at a dedicated web server (http://rigidfinder.molmovdb.org). The server also provides links to examples at various scales such as loop closure, domain motions, partial refolding, and subunit shifts. Moreover, we describe here the detailed application of RigidFinder to four large structures: Pyruvate Phosphate Dikinase, T7 RNA polymerase, RNA polymerase II, and GroEL. The results of the method are in excellent agreement with the expert-described rigid blocks.
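
The distance-conservation idea in this abstract can be illustrated with a minimal sketch. It assumes each conformation is a list of 3D coordinates (one point per residue, in the same order in both conformations) and uses a simple greedy growth from a seed residue. The greedy search, the function names, and the 0.5 tolerance are illustrative assumptions; this is not the quasi-dynamic programming search used by RigidFinder.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Return the full matrix of Euclidean distances between points."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]


def distance_conserved_pairs(conf_a, conf_b, tolerance=0.5):
    """Pairs (i, j), i < j, whose distance changes by at most `tolerance`."""
    da = pairwise_distances(conf_a)
    db = pairwise_distances(conf_b)
    n = len(conf_a)
    return {
        (i, j)
        for i, j in combinations(range(n), 2)
        if abs(da[i][j] - db[i][j]) <= tolerance
    }


def greedy_rigid_block(conf_a, conf_b, seed, tolerance=0.5):
    """Grow a block from `seed`: a residue joins only if its distance to
    every residue already in the block is conserved (hypothetical helper)."""
    conserved = distance_conserved_pairs(conf_a, conf_b, tolerance)
    block = [seed]
    for k in range(len(conf_a)):
        if k == seed:
            continue
        if all((min(k, m), max(k, m)) in conserved for m in block):
            block.append(k)
    return sorted(block)


# Toy example: residues 0-2 stay fixed, residue 3 swings away.
conf_a = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
conf_b = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 5, 0)]
print(greedy_rigid_block(conf_a, conf_b, seed=0))
```

Because the criterion works on pairwise distances rather than on a superposition-averaged value such as RMSD, a block may contain residues that are not consecutive in sequence, which matches the property highlighted in the abstract.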


Subjects
Algorithms , Proteins/chemistry , Animals , Chaperonin 60/chemistry , Computer Simulation , DNA-Directed RNA Polymerases/chemistry , Protein Databases , Humans , Molecular Models , Motion , Protein Conformation , Pyruvate Orthophosphate Dikinase/chemistry , RNA Polymerase II/chemistry , Viral Proteins/chemistry
17.
PLoS Comput Biol ; 5(7): e1000432, 2009 Jul.
Article in English | MEDLINE | ID: mdl-19593373

ABSTRACT

The goal of human genome re-sequencing is obtaining an accurate assembly of an individual's genome. Recently, there has been great excitement in the development of many technologies for this (e.g. medium- and short-read sequencing from companies such as 454 and SOLiD, and high-density oligo-arrays from Affymetrix and NimbleGen), with even more expected to appear. The costs and sensitivities of these technologies differ considerably from each other. As an important goal of personal genomics is to reduce the cost of re-sequencing to an affordable point, it is worthwhile to consider optimally integrating technologies. Here, we build a simulation toolbox that will help us optimally combine different technologies for genome re-sequencing, especially in reconstructing large structural variants (SVs). SV reconstruction is considered the most challenging step in human genome re-sequencing. (It is sometimes even harder than de novo assembly of small genomes because of the duplications and repetitive sequences in the human genome.) To this end, we formulate canonical problems that are representative of issues in reconstruction and are of small enough scale to be computationally tractable and simulatable. Using semi-realistic simulations, we show how we can combine different technologies to optimally solve the assembly at low cost. With mappability maps, our simulations efficiently handle the inhomogeneous repeat-containing structure of the human genome and the computational complexity of practical assembly algorithms. They quantitatively show how combining different read lengths is more cost-effective than using one length, how an optimal mixed sequencing strategy for reconstructing large novel SVs usually also gives accurate detection of SNPs/indels, how paired-end reads can improve reconstruction efficiency, and how adding in arrays is more efficient than just sequencing for disentangling some complex SVs. Our strategy should facilitate the sequencing of human genomes at maximum accuracy and low cost.


Subjects
Genomics/methods , Genetic Models , Chromosome Mapping/economics , Chromosome Mapping/methods , Computer Simulation , Genetic Databases , Genomics/economics , Statistical Models , Oligonucleotide Array Sequence Analysis , DNA Sequence Analysis/economics , DNA Sequence Analysis/methods , Software
18.
BMC Med Genomics ; 13(Suppl 7): 78, 2020 07 21.
Article in English | MEDLINE | ID: mdl-32693796

ABSTRACT

BACKGROUND: Genomic variants are considered sensitive information, revealing potentially private facts about individuals. Therefore, it is important to control access to such data. A key aspect of controlled access is secure storage and efficient query of access logs, so that potential misuse can be detected. However, there are challenges to securing logs, such as designing against the consequences of "single points of failure". A potential approach to circumvent these challenges is blockchain technology, which is currently popular in cryptocurrency due to its properties of security, immutability, and decentralization. One of the tasks of the iDASH (Integrating Data for Analysis, Anonymization, and Sharing) Secure Genome Analysis Competition in 2018 was to develop time- and space-efficient blockchain-based ledgering solutions to log and query user activity accessing genomic datasets across multiple sites, using MultiChain. METHODS: MultiChain is a specific blockchain platform that offers "data streams" embedded in the chain for rapid and secure data storage. We devised a storage protocol taking advantage of the keys in the MultiChain data streams and created a data frame from the chain allowing efficient query. Our solution to the iDASH competition was selected as the winner at a workshop held in San Diego, CA in October 2018. Although our solution worked well in the challenge, it has the drawback that it requires downloading all the data from the chain and keeping it locally in memory for fast query. To address this, we provide an alternate "bigmem" solution that uses indices rather than local storage for rapid queries. RESULTS: We profiled the performance of both of our solutions using logs with 100,000 to 600,000 entries, both for querying the chain and inserting data into it. The challenge solution requires 12 seconds and 120 MB of memory for querying from 100,000 entries. The memory requirement increases linearly and reaches 470 MB for a chain with 600,000 entries. Although our alternate bigmem solution is slower and requires more memory (408 seconds and 250 MB, respectively, for 100,000 entries), the memory requirement increases at a slower rate and reaches only 360 MB for 600,000 entries. CONCLUSION: Overall, we demonstrate that genomic access log files can be stored and queried efficiently with blockchain. Beyond this, our protocol potentially could be applied to other types of health data such as electronic health records.
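
The two query strategies contrasted in this abstract (a full local copy versus key-based indices) can be sketched roughly as follows. Plain Python dictionaries stand in for the chain and its stream keys; no MultiChain API is used, and the class names, field names, and log entries are all hypothetical. It is only meant to show the general trade-off, not the authors' protocol.

```python
from collections import defaultdict


class InMemoryLog:
    """Strategy 1 (assumed): copy every entry locally, then scan to query."""

    def __init__(self, entries):
        self.rows = list(entries)  # full local copy of the log

    def query(self, **criteria):
        return [
            row for row in self.rows
            if all(row.get(field) == value for field, value in criteria.items())
        ]


class KeyIndexedLog:
    """Strategy 2 (assumed): keep only a (field, value) -> positions index
    locally and fetch matching entries from the store on demand."""

    def __init__(self, entries):
        self.store = {}                 # stand-in for the remote chain
        self.index = defaultdict(set)   # local index of keys to positions
        for position, entry in enumerate(entries):
            self.store[position] = entry
            for field, value in entry.items():
                self.index[(field, value)].add(position)

    def query(self, **criteria):
        if not criteria:
            return [self.store[p] for p in sorted(self.store)]
        hit_sets = [self.index.get(item, set()) for item in criteria.items()]
        hits = set.intersection(*hit_sets)
        return [self.store[p] for p in sorted(hits)]


# Hypothetical access-log entries.
log = [
    {"user": "u1", "node": "site_a", "activity": "FILE_ACCESS"},
    {"user": "u2", "node": "site_b", "activity": "VIEW_CONTENTS"},
    {"user": "u1", "node": "site_b", "activity": "VIEW_CONTENTS"},
]
print(InMemoryLog(log).query(user="u1"))
print(KeyIndexedLog(log).query(user="u1", node="site_b"))
```

Both classes return the same results for the same query; they differ in what must be held locally, which is the dimension along which the abstract compares the two solutions.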


Subjects
Blockchain , Datasets as Topic , Genomics , Information Storage and Retrieval , Humans
19.
Neuron ; 99(2): 302-314.e4, 2018 07 25.
Article in English | MEDLINE | ID: mdl-29983323

ABSTRACT

Congenital hydrocephalus (CH), featuring markedly enlarged brain ventricles, is thought to arise from failed cerebrospinal fluid (CSF) homeostasis and is treated with lifelong surgical CSF shunting with substantial morbidity. CH pathogenesis is poorly understood. Exome sequencing of 125 CH trios and 52 additional probands identified three genes with significant burden of rare damaging de novo or transmitted mutations: TRIM71 (p = 2.15 × 10⁻⁷), SMARCC1 (p = 8.15 × 10⁻¹⁰), and PTCH1 (p = 1.06 × 10⁻⁶). Additionally, two de novo duplications were identified at the SHH locus, encoding the PTCH1 ligand (p = 1.2 × 10⁻⁴). Together, these probands account for ∼10% of studied cases. Strikingly, all four genes are required for neural tube development and regulate ventricular zone neural stem cell fate. These results implicate impaired neurogenesis (rather than active CSF accumulation) in the pathogenesis of a subset of CH patients, with potential diagnostic, prognostic, and therapeutic ramifications.


Subjects
Hydrocephalus/diagnosis , Hydrocephalus/genetics , Mutation/genetics , Neural Stem Cells/physiology , Cohort Studies , Exome/genetics , Female , Humans , Male , Neural Stem Cells/pathology , Patched-1 Receptor/genetics , Pedigree , Transcription Factors/genetics , Exome Sequencing/methods
20.
Sci Rep ; 7(1): 4287, 2017 06 27.
Article in English | MEDLINE | ID: mdl-28655895

ABSTRACT

Despite efforts to interrogate human genome variation through large-scale databases, systematic preference toward populations of Caucasian descent has resulted in an unintended reduction of power in studying non-Caucasians. Here we report a compilation of coding variants from 1,055 healthy Korean individuals (KOVA; Korean Variant Archive). The samples were sequenced to a mean depth of 75x, yielding 101 singleton variants per individual. Population genetics analysis demonstrates that the Korean population is a distinct ethnic group comparable to other discrete ethnic groups in Africa and Europe, providing a rationale for such independent genomic datasets. Indeed, KOVA conferred 22.8% increased variant filtering power in addition to the Exome Aggregation Consortium (ExAC) when used on Korean exomes. Functional assessment of nonsynonymous variants supported the presence of purifying selection in Koreans. Analysis of copy number variants detected 5.2 deletions and 10.3 amplifications per individual, with an increased fraction of novel variants among smaller and rarer copy number variable segments. We also report a list of germline variants that are associated with increased tumor susceptibility. This catalog can function as a critical addition to the pre-existing variant databases in pursuing genetic studies of Korean individuals.


Subjects
Asian People/genetics , Genetic Databases , Genetic Variation , Population Genetics , DNA Copy Number Variations , Exome , Genetic Predisposition to Disease , Germ-Line Mutation , Humans , Neoplasms/genetics , Single Nucleotide Polymorphism , Republic of Korea