Results 1 - 20 of 110
1.
Cell ; 185(18): 3426-3440.e19, 2022 09 01.
Article in English | MEDLINE | ID: mdl-36055201

ABSTRACT

The 1000 Genomes Project (1kGP) is the largest fully open resource of whole-genome sequencing (WGS) data consented for public distribution without access or use restrictions. The final, phase 3 release of the 1kGP included 2,504 unrelated samples from 26 populations and was based primarily on low-coverage WGS. Here, we present a high-coverage 3,202-sample WGS 1kGP resource, which now includes 602 complete trios, sequenced to a depth of 30X using Illumina. We performed single-nucleotide variant (SNV) and short insertion and deletion (INDEL) discovery and generated a comprehensive set of structural variants (SVs) by integrating multiple analytic methods through a machine learning model. We show gains in sensitivity and precision of variant calls compared to phase 3, especially among rare SNVs as well as INDELs and SVs across the frequency spectrum. We also generated an improved reference imputation panel, making variants discovered here accessible for association studies.


Subjects
Genome, Human , Whole Genome Sequencing , Female , High-Throughput Nucleotide Sequencing/methods , Humans , INDEL Mutation , Male , Polymorphism, Single Nucleotide
2.
RNA ; 28(4): 478-492, 2022 04.
Article in English | MEDLINE | ID: mdl-35110373

ABSTRACT

Polymorphism drives survival under stress and provides adaptability. Genetic polymorphism of ribosomal RNA (rRNA) genes derives from internal repeat variation of this multicopy gene, and from interindividual variation. A considerable amount of rRNA sequence heterogeneity has been proposed but has been challenging to estimate given the scarcity of accurate reference sequences. We identified four rDNA copies on chromosome 21 (GRCh38) with 99% similarity to the recently introduced reference sequence KY962518.1. We customized a GATK bioinformatics pipeline using the four rDNA loci, spanning a total of 145 kb, for variant calling and used high-coverage whole-genome sequencing (WGS) data from the 1000 Genomes Project to analyze variants in 2504 individuals from 26 populations. We identified a total of 3791 variant positions. The variants were positioned nonrandomly on the rRNA gene. Invariant regions included the promoter, early 5' ETS, most of 18S, 5.8S, ITS1, and large areas of the intragenic spacer. A total of 470 variant positions were observed on 28S rRNA. The majority of the 28S rRNA variants were located on highly flexible human-expanded rRNA helical folds ES7L and ES27L, suggesting that these represent positions of diversity and are potentially under continuous evolution. Several variants were validated based on RNA-seq analyses. Population analyses showed remarkable ancestry-linked genetic variance and the presence of both high penetrance and frequent variants in the 5' ETS, ITS2, and 28S regions segregating according to the continental populations. These findings provide a genetic view of rRNA gene array heterogeneity and raise the need to functionally assess how the 28S rRNA variants affect ribosome functions.


Subjects
Genetic Heterogeneity , Genome , DNA, Ribosomal/genetics , Genes, rRNA/genetics , Humans , RNA, Ribosomal/genetics , RNA, Ribosomal, 18S , RNA, Ribosomal, 28S/genetics
3.
Hemoglobin ; 48(2): 71-78, 2024 Mar.
Article in English | MEDLINE | ID: mdl-38632980

ABSTRACT

This study explored a noninvasive method for diagnosing SEA-thalassemia and investigated whether regional factors affect its accuracy. The method uses a public database and bioinformatics software to construct parental haplotypes for the proband and predicts fetal genotypes using relative haplotype dosage. We screened and downloaded sequencing data of couples who were both SEA-thalassemia carriers from the China National GeneBank public data platform, and matched the sequencing data format with that of the reference panel using Ubuntu system tools. We then used Beagle software to construct parental haplotypes and predicted fetal haplotypes by relative haplotype dosage. Finally, we used a hidden Markov model and the Viterbi algorithm to determine fetal pathogenic haplotypes. All noninvasive fetal genotype diagnosis results were compared with gold standard gap-PCR electrophoresis results. Our method was successful in diagnosing 13 families with SEA-thalassemia carriers. The best diagnostic results were obtained when Southern Han Chinese was used as the reference panel, and 10 families showed full agreement between our noninvasive diagnostic results and the gap-PCR electrophoresis results. Accuracy was higher when haplotypes were constructed with a Southern Han Chinese reference panel than with a Beijing Han Chinese reference panel. The combined use of public databases and relative haplotype dosage for diagnosing SEA-thalassemia is a feasible approach. Our method produces the best noninvasive diagnostic results when the test samples and population reference panel are closely matched in both ethnicity and geography. When constructing parental haplotypes with our method, it is important to consider the effect of region in addition to population background alone.
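The decoding step named in this abstract (a hidden Markov model solved with the Viterbi algorithm) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the two hidden states (which parental haplotype the fetus inherited), the observation symbols, and all probabilities are hypothetical and are not taken from the study.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for a sequence of observations."""
    # Work in log space to avoid numerical underflow on long marker sequences.
    scores = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
               for s in states}]
    back = [{}]
    for obs in observations[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            row[s] = (prev[best_prev] + math.log(trans_p[best_prev][s])
                      + math.log(emit_p[s][obs]))
            ptr[s] = best_prev
        scores.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    last = max(states, key=lambda s: scores[-1][s])
    path = [last]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical setup: hidden state = inherited parental haplotype;
# observation = which haplotype looks over-represented at each SNP.
states = ["hap1", "hap2"]
start_p = {"hap1": 0.5, "hap2": 0.5}
trans_p = {"hap1": {"hap1": 0.99, "hap2": 0.01},
           "hap2": {"hap1": 0.01, "hap2": 0.99}}
emit_p = {"hap1": {"over1": 0.8, "over2": 0.2},
          "hap2": {"over1": 0.2, "over2": 0.8}}
obs = ["over1", "over1", "over2", "over1", "over1"]
path = viterbi(obs, states, start_p, trans_p, emit_p)
```

Because switching haplotypes is assumed to be rare (recombination), a single discordant marker is treated as noise rather than as a switch, which is the main reason to decode the whole sequence jointly instead of calling each SNP independently.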


Subjects
Haplotypes , Humans , Female , Pregnancy , Thalassemia/genetics , Thalassemia/diagnosis , Databases, Genetic , Prenatal Diagnosis/methods , Noninvasive Prenatal Testing/methods , Genotype , China/epidemiology
4.
Hum Reprod ; 38(Supplement_2): ii57-ii68, 2023 Nov 20.
Article in English | MEDLINE | ID: mdl-37982420

ABSTRACT

STUDY QUESTION: Was polycystic ovary syndrome (PCOS), which impairs fertility and adheres to the evolutionary paradox, subject to evolutionary selection during ancestral times, and did it rapidly diminish in prevalence? SUMMARY ANSWER: This study strengthened the hypothesis that positive selection of genetic variants occurred and may account for the high prevalence of PCOS observed today. WHAT IS KNOWN ALREADY: PCOS is a complex endocrine disorder characterized by both reproductive and metabolic disturbances. As a heritable disease that impairs fertility, PCOS should diminish rapidly in prevalence; however, it is the most common cause of female subfertility globally. Few scientific genetic studies have attempted to provide evidence for the positive selection of gene variants underlying PCOS. STUDY DESIGN, SIZE, DURATION: We performed an evolutionary analysis of 2,504 individuals from 14 populations of the 1000 Genomes Project. PARTICIPANTS/MATERIALS, SETTING, METHODS: We tested the signature of positive selection for 37 single-nucleotide polymorphisms (SNPs) associated with PCOS in previous genome-wide association studies using six parameters of positive selection. MAIN RESULTS AND THE ROLE OF CHANCE: Analyzing the evolutionary indices together, there was obvious positive selection at the PCOS-related SNP loci, especially within the original evolution window of humans, demonstrated by significant Tajima's D values. Compared to the genome background, six of the 37 SNPs in or close to five genes (DENN domain-containing protein 1A: DENND1A, aminopeptidase O (formerly chromosome 9 open reading frame 3): AOPEP, thyroid adenoma-associated protein: THADA, diacylglycerol kinase iota: DGKI, and netrin receptor UNC5C: UNC5C) showed significant evidence of positive selection, among which DENND1A, AOPEP, and THADA represent the set of most established susceptibility genes for PCOS. LIMITATIONS, REASONS FOR CAUTION: First, only well-documented SNPs were selected from well-designed experiments.
Second, it is difficult to determine which hypothesis of PCOS evolution is at play. After considering the most significant functions of these genes, we found that they had a wide variety of functions with no obvious association between them. WIDER IMPLICATIONS OF THE FINDINGS: Our findings provide additional evidence for the positive evolution of PCOS. Our analyses require confirmation in a larger study with more evolutionary indicators and larger data range. Further research to identify the roles of the DENND1A, AOPEP, THADA, DGKI, and UNC5C genes is also necessary. STUDY FUNDING/COMPETING INTEREST(S): This study was supported by the National Key Research and Development Program of China (2021YFC2700400 and 2021YFC2700701), Basic Science Center Program of NSFC (31988101), CAMS Innovation Fund for Medical Sciences (2021-I2M-5-001), National Natural Science Foundation of China (82192874, 31871509, and 82071606), Shandong Provincial Key Research and Development Program (2020ZLYS02), Taishan Scholars Program of Shandong Province (ts20190988), and Fundamental Research Funds of Shandong University. The authors have no conflicts of interest to disclose. TRIAL REGISTRATION NUMBER: N/A.
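Tajima's D, the statistic cited in the main results, compares two estimates of genetic diversity: the mean number of pairwise differences and the number of segregating sites scaled by sample size. The sketch below uses the standard published formula; the toy haplotypes and the function names are illustrative assumptions and do not reproduce the study's pipeline or its other five selection parameters.

```python
import math

def diversity_summary(haplotypes):
    """Return (n, S, pi) for equal-length 0/1 haplotype strings."""
    n = len(haplotypes)
    seg_sites, pi = 0, 0.0
    for column in zip(*haplotypes):
        k = column.count("1")            # derived-allele count at this site
        if 0 < k < n:
            seg_sites += 1
            # Average pairwise difference contributed by this site.
            pi += 2.0 * k * (n - k) / (n * (n - 1))
    return n, seg_sites, pi

def tajimas_d(n, seg_sites, pi):
    """Tajima's D from sample size, segregating sites, and mean pairwise differences."""
    if seg_sites == 0:
        raise ValueError("Tajima's D is undefined without segregating sites")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / math.sqrt(variance)

# Toy data (hypothetical): four haplotypes over four sites.
n, s, pi = diversity_summary(["0011", "0101", "0000", "0001"])
d_toy = tajimas_d(n, s, pi)
```

Negative values indicate an excess of rare variants relative to neutral expectation, and positive values an excess of intermediate-frequency variants; which sign is read as evidence of which kind of selection depends on the study design and the genomic background used for comparison.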


Subjects
Infertility, Female , Polycystic Ovary Syndrome , Humans , Female , Polycystic Ovary Syndrome/genetics , Genome-Wide Association Study , Fertility , Reproduction
5.
Biochem Genet ; 61(5): 1675-1703, 2023 Oct.
Article in English | MEDLINE | ID: mdl-36725786

ABSTRACT

In Brazil, high levels of agricultural activity are reflected in the consumption of enormous amounts of pesticides. The production of grain in Brazil has been estimated at 289.8 million tons in the 2022 harvest, an expansion of 14.7% compared with 2021. These advances are likely associated with a progressive increase in the occupational exposure of the population to pesticides. The Paraoxonase 1 gene (PON1) is involved in liver detoxification; the rs662 variant of this gene modifies the activity of the enzyme. The repair of pesticide-induced genetic damage depends on the protein produced by the X-Ray Repair Cross-Complementing Group 1 gene (XRCC1), whose function is impaired by the rs25487 variant. The present study describes the frequencies of the rs662 and rs25487 variants and their haplotypes in a sample of 494 unrelated individuals from the state of Goiás, Brazil, and compares them with those of other populations worldwide to assess variation in the distribution of these SNPs. The A allele of the rs25487 variant had a frequency of 26% in the Goiás population, and the modified rs662 G allele had a frequency of 42.8%. Four haplotypes were recorded for the rs25487 (G > A) and rs662 (A > G) markers, with a frequency of 11.9% being recorded for the A-G haplotype (both modified alleles), 30.8% for the G-G haplotype, 14.3% for the A-A haplotype, and 42.8% for the G-A haplotype (both wild-type alleles). We demonstrated the distribution of important SNPs associated with pesticide exposure in an area with a high agricultural activity level, Central Brazil.
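The reported haplotype and allele frequencies can be cross-checked against each other, since each allele frequency is the sum of the frequencies of the haplotypes that carry it. The short sketch below does that arithmetic using only the figures in the abstract; the function name is illustrative.

```python
# Haplotype frequencies (%) from the abstract, keyed as (rs25487 allele, rs662 allele).
hap_freq = {
    ("A", "G"): 11.9,
    ("G", "G"): 30.8,
    ("A", "A"): 14.3,
    ("G", "A"): 42.8,
}

def marginal_allele_freq(freqs, locus_index, allele):
    """Sum the frequencies of all haplotypes carrying `allele` at the given locus."""
    return sum(f for hap, f in freqs.items() if hap[locus_index] == allele)

rs25487_a = marginal_allele_freq(hap_freq, 0, "A")   # 11.9 + 14.3
rs662_g = marginal_allele_freq(hap_freq, 1, "G")     # 11.9 + 30.8
total = sum(hap_freq.values())
```

The sums come to about 26.2% for the rs25487 A allele (reported as 26%) and 42.7% for the rs662 G allele (reported as 42.8%), and the four haplotypes total 99.8%. The small gaps are presumably rounding in the published figures.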


Subjects
Pesticides , Polymorphism, Single Nucleotide , Humans , Genotype , Brazil , Incidence , Aryldialkylphosphatase/genetics , X-ray Repair Cross Complementing Protein 1/genetics
6.
BMC Bioinformatics ; 23(1): 401, 2022 Sep 29.
Article in English | MEDLINE | ID: mdl-36175857

ABSTRACT

BACKGROUND: Population variant analysis is of great importance for gathering insights into the links between human genotype and phenotype. The 1000 Genomes Project established a valuable reference for human genetic variation; however, the integrative use of the corresponding data with other datasets within existing repositories and pipelines is not fully supported. Particularly, there is a pressing need for flexible and fast selection of population partitions based on their variant and metadata-related characteristics. RESULTS: Here, we target general germline or somatic mutation data sources for their seamless inclusion within an interoperable-format repository, supporting integration among them and with other genomic data, as well as their integrated use within bioinformatic workflows. In addition, we provide VarSum, a data summarization service working on sub-populations of interest selected using filters on population metadata and/or variant characteristics. The service is developed as an optimized computational framework with an Application Programming Interface (API) that can be called from within any existing computing pipeline or programming script. Provided example use cases of biological interest show the relevance, power and ease of use of the API functionalities. CONCLUSIONS: The proposed data integration pipeline and data set extraction and summarization API pave the way for solid computational infrastructures that quickly process cumbersome variation data, and allow biologists and bioinformaticians to easily perform scalable analysis on user-defined partitions of large cohorts from increasingly available genetic variation studies. With the current tendency to large (cross)nation-wide sequencing and variation initiatives, we expect an ever growing need for the kind of computational support hereby proposed.


Subjects
Genomics , Metadata , Computational Biology , Genotype , Humans , Software
7.
Hum Mutat ; 43(12): 1979-1993, 2022 12.
Article in English | MEDLINE | ID: mdl-36054329

ABSTRACT

Detection of de novo variants (DNVs) is critical for studies of disease-related variation and mutation rates. To accelerate DNV calling, we developed a graphics processing unit (GPU)-based workflow. We applied our workflow to whole-genome sequencing data from three sequenced parent-child cohorts: the Simons Simplex Collection (SSC), Simons Foundation Powering Autism Research (SPARK), and the 1000 Genomes Project (1000G), which were sequenced using DNA from blood, saliva, and lymphoblastoid cell lines (LCLs), respectively. The SSC and SPARK DNV callsets were within expectations for number of DNVs, percent at CpG sites, phasing to the paternal chromosome of origin, and average allele balance. However, the 1000G DNV callset was not within expectations and contained excessive DNVs that are likely cell line artifacts. Mutation signature analysis revealed that 30% of 1000G DNV signatures matched B-cell lymphoma. Furthermore, we found variants in DNA repair genes and at ClinVar pathogenic or likely pathogenic sites and a significant excess of protein-coding DNVs in IGLL5, a gene known to be involved in B-cell lymphomas. Our study provides a new rapid DNV caller for the field and elucidates important implications of using sequencing data from LCLs for reference building and disease-related projects.
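The core idea of trio-based DNV calling, and the allele balance metric used above as a quality check, can be shown with a naive filter. This is a sketch under stated assumptions: the `Call` record, the thresholds, and the rule itself are hypothetical and far simpler than the GPU-based workflow the abstract describes.

```python
from dataclasses import dataclass

@dataclass
class Call:
    genotype: str    # e.g. "0/0", "0/1", "1/1"
    depth: int       # total read depth at the site
    alt_reads: int   # reads supporting the alternate allele

def is_candidate_dnv(child, father, mother, min_depth=10, min_ab=0.25, max_ab=0.75):
    """Naive trio rule: heterozygous child, hom-ref parents with no alt reads."""
    if child.genotype not in ("0/1", "1/0"):
        return False
    if father.genotype != "0/0" or mother.genotype != "0/0":
        return False
    if child.depth == 0 or min(child.depth, father.depth, mother.depth) < min_depth:
        return False
    if father.alt_reads > 0 or mother.alt_reads > 0:
        return False
    # Allele balance: a true germline het should sit near 0.5.
    allele_balance = child.alt_reads / child.depth
    return min_ab <= allele_balance <= max_ab

clean = is_candidate_dnv(Call("0/1", 30, 14), Call("0/0", 28, 0), Call("0/0", 31, 0))
parental_evidence = is_candidate_dnv(Call("0/1", 30, 14), Call("0/0", 28, 2), Call("0/0", 31, 0))
skewed = is_candidate_dnv(Call("0/1", 30, 3), Call("0/0", 28, 0), Call("0/0", 31, 0))
```

A variant that arose during cell line culture tends to be present in only a fraction of cells, so its allele balance drifts away from 0.5. That is one reason average allele balance is a useful sanity check on a DNV callset derived from LCLs.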


Subjects
Neoplasms , Humans , Alleles , Mutation , Neoplasms/genetics , Whole Genome Sequencing
8.
Am J Hum Genet ; 105(1): 78-88, 2019 07 03.
Article in English | MEDLINE | ID: mdl-31178127

ABSTRACT

Relationship estimation and segment detection between individuals is an important aspect of disease gene mapping. Existing methods are either tailored for computational efficiency or require phasing to improve accuracy. We developed TRUFFLE, a method that integrates computational techniques and statistical principles for the identification and visualization of identity-by-descent (IBD) segments using un-phased data. By skipping the haplotype phasing step and, instead, relying on a simpler region-based approach, our method is computationally efficient while maintaining inferential accuracy. In addition, an error model corrects for segment break-ups that occur as a consequence of genotyping errors. TRUFFLE can estimate relatedness for 3.1 million pairs from the 1000 Genomes Project data in a few minutes on a typical laptop computer. Consistent with expectation, we identified only three second cousin or closer pairs across different populations, while commonly used methods identified a large number of such pairs. Similarly, within populations, we identified many fewer related pairs. Compared to methods relying on phased data, TRUFFLE has comparable accuracy but is drastically faster and has fewer broken segments. We also identified specific local genomic regions that are commonly shared within populations, suggesting selection. When applied to pedigree data, we observed 99.6% accuracy in detecting 1st to 5th degree relationships. As genomic datasets become much larger, TRUFFLE can enable disease gene mapping through implicit shared haplotypes by accurate IBD segment detection.
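One generic way to look for shared segments in unphased data is to scan for opposing homozygous genotypes: two people who share a haplotype identical by descent across a region should not be homozygous for different alleles there, apart from genotyping errors. The sketch below illustrates only that idea. It is an assumption for illustration, not TRUFFLE's algorithm or error model, and the genotype coding (0, 1, 2 copies of the alternate allele) and marker threshold are hypothetical.

```python
def ibs0_free_segments(geno_a, geno_b, min_markers=5):
    """Return (start, end) index pairs, end exclusive, of runs without opposing homozygotes."""
    if len(geno_a) != len(geno_b):
        raise ValueError("genotype vectors must have the same length")
    segments, start = [], None
    for i, (a, b) in enumerate(zip(geno_a, geno_b)):
        opposing = abs(a - b) == 2       # 0 vs 2: no allele shared at this marker
        if opposing:
            if start is not None and i - start >= min_markers:
                segments.append((start, i))
            start = None
        elif start is None:
            start = i
    # Close a run that extends to the last marker.
    if start is not None and len(geno_a) - start >= min_markers:
        segments.append((start, len(geno_a)))
    return segments

person_a = [0, 1, 2, 2, 1, 0, 0, 2, 1, 1, 0, 2]
person_b = [2, 1, 2, 1, 1, 0, 1, 0, 1, 2, 0, 2]
long_runs = ibs0_free_segments(person_a, person_b, min_markers=5)
all_runs = ibs0_free_segments(person_a, person_b, min_markers=4)
```

In this naive version a single miscalled genotype splits a true segment in two, which is the kind of segment break-up the abstract says an error model is needed to correct.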


Subjects
Chromosome Mapping/methods , Genetic Predisposition to Disease , Genetics, Population , Genome-Wide Association Study/methods , Polymorphism, Single Nucleotide , Quantitative Trait, Heritable , Software , Algorithms , Computer Simulation , Female , Genetic Linkage , Genome, Human , Genomics , Germ-Line Mutation , Haplotypes , Humans , Male , Models, Genetic , Pedigree
9.
RNA ; 25(12): 1765-1778, 2019 12.
Article in English | MEDLINE | ID: mdl-31519742

ABSTRACT

Circular RNAs (circRNAs) are abundant in eukaryotic transcriptomes and have been linked to various human disorders. However, understanding of the genetic control of circular RNA expression is in its early stages. Here we present the first integrated analysis of circRNAs and genome sequence variation from lymphoblastoid cell lines of the 1000 Genomes Project. We identified thousands of circRNAs in the RNA-seq data and show their association with local single-nucleotide polymorphic sites, referred to as circQTLs, which influence the circRNA transcript abundance. Strikingly, we found that circQTLs exist independently of eQTLs, with most circQTLs having no effect on mRNA expression. Only a fraction of the polymorphic sites are shared and linked to both circRNA and mRNA expression, with mostly similar effects on circular and linear RNA. A shared intronic QTL, rs55928920, of the HMSD gene drives the circular and linear expression in opposite directions, potentially modulating circRNA levels at the expense of mRNA. Finally, circQTLs and eQTLs are largely independent and exist in separate linkage disequilibrium (LD) blocks, with circQTLs highly enriched for functional genomic elements and regulatory regions. This study reveals a previously uncharacterized role of DNA sequence variation in human circular RNA regulation.


Subjects
Gene Expression Regulation , Genetic Variation , RNA, Circular , Gene Expression Profiling , Gene Regulatory Networks , Genome, Human , High-Throughput Nucleotide Sequencing , Humans , MicroRNAs/genetics , Polymorphism, Single Nucleotide , Quantitative Trait Loci , RNA, Messenger/genetics , Sequence Analysis, DNA , Transcriptome
10.
BMC Biol ; 18(1): 167, 2020 11 13.
Article in English | MEDLINE | ID: mdl-33187521

ABSTRACT

BACKGROUND: Structural variants comprise diverse genomic arrangements including deletions, insertions, inversions, and translocations, which can generally be detected in humans through sequence comparison to the reference genome. Among structural variants, insertions are the least frequently identified variants, mainly due to ascertainment bias in the reference genome, lack of previous sequence knowledge, and low complexity of typical insertion sequences. Though recent developments in long-read sequencing deliver promise in annotating individual non-reference insertions, population-level catalogues on non-reference insertion variants have not been identified and the possible functional roles of these hidden variants remain elusive. RESULTS: To detect non-reference insertion variants, we developed a pipeline, InserTag, which generates non-reference contigs by local de novo assembly and then infers the full-sequence of insertion variants by tracing contigs from non-human primates and other human genome assemblies. Application of the pipeline to data from 2535 individuals of the 1000 Genomes Project helped identify 1696 non-reference insertion variants and re-classify the variants as retention of ancestral sequences or novel sequence insertions based on the ancestral state. Genotyping of the variants showed that individuals had, on average, 0.92-Mbp sequences missing from the reference genome, 92% of the variants were common (allele frequency > 5%) among human populations, and more than half of the variants were major alleles. Among human populations, African populations were the most divergent and had the most non-reference sequences, which was attributed to the greater prevalence of high-frequency insertion variants. The subsets of insertion variants were in high linkage disequilibrium with phenotype-associated SNPs and showed signals of recent continent-specific selection. 
CONCLUSIONS: Non-reference insertion variants represent an important type of genetic variation in the human population, and our developed pipeline, InserTag, provides the frameworks for the detection and genotyping of non-reference sequences missing from human populations.


Subjects
Contig Mapping , Gene Frequency , Genome, Human , Mutagenesis, Insertional , Humans
11.
BMC Biol ; 18(1): 38, 2020 04 13.
Article in English | MEDLINE | ID: mdl-32279660

ABSTRACT

BACKGROUND: The advent of next generation sequencing (NGS) has allowed the discovery of short and long non-coding RNAs (ncRNAs) in an unbiased manner using reverse genetics approaches, enabling the discovery of multiple categories of ncRNAs and characterization of the way their expression is regulated. We previously showed that the identities and abundances of microRNA isoforms (isomiRs) and transfer RNA-derived fragments (tRFs) are tightly regulated, and that they depend on a person's sex and population origin, as well as on tissue type, tissue state, and disease type. Here, we characterize the regulation and distribution of fragments derived from ribosomal RNAs (rRNAs). rRNAs form a group that includes four (5S, 5.8S, 18S, 28S) rRNAs encoded by the human nuclear genome and two (12S, 16S) by the mitochondrial genome. rRNAs constitute the most abundant RNA type in eukaryotic cells. RESULTS: We analyzed rRNA-derived fragments (rRFs) across 434 transcriptomic datasets obtained from lymphoblastoid cell lines (LCLs) derived from healthy participants of the 1000 Genomes Project. The 434 datasets represent five human populations and both sexes. We examined each of the six rRNAs and their respective rRFs, and did so separately for each population and sex. Our analysis shows that all six rRNAs produce rRFs with unique identities, normalized abundances, and lengths. The rRFs arise from the 5'-end (5'-rRFs), the interior (i-rRFs), and the 3'-end (3'-rRFs) or straddle the 5' or 3' terminus of the parental rRNA (x-rRFs). Notably, a large number of rRFs are produced in a population-specific or sex-specific manner. Preliminary evidence suggests that rRF production is also tissue-dependent. Of note, we find that rRF production is not affected by the identity of the processing laboratory or the library preparation kit. 
CONCLUSIONS: Our findings suggest that rRFs are produced in a regimented manner by currently unknown processes that are influenced by both ubiquitous as well as population-specific and sex-specific factors. The properties of rRFs mirror the previously reported properties of isomiRs and tRFs and have implications for the study of homeostasis and disease.


Subjects
MicroRNAs/genetics , RNA, Ribosomal/genetics , Aged , Cell Line , Female , Humans , Male , MicroRNAs/metabolism , Middle Aged , RNA, Ribosomal/metabolism , Sex Factors , Transcriptome
12.
Yi Chuan ; 43(10): 962-971, 2021 Oct 20.
Article in English | MEDLINE | ID: mdl-34702708

ABSTRACT

Microhaplotypes (MHs), defined by two or more closely linked single nucleotide polymorphisms within a short segment of DNA, are a type of molecular marker. As emerging forensic genetic markers, MHs have no stutter artefacts and higher polymorphism, and permit the design of smaller amplicons. In order to identify the markers from a genome-wide perspective and explore their potential application further, we constructed the most comprehensive MH dataset to date, based on the whole genome sequencing data of 105 Han individuals in Southern China from the 1000 Genomes Project. The results showed that there were 9,490,075 MH loci in the range of 350 bp in the human genome, and the distribution density of microhaplotypes suggests gene variation. Polymorphism analysis of MHs from various base spans showed that the polymorphism of MHs could reach or exceed that of common short tandem repeat sites. In addition, based on their flexible assembly, a scheme to build the public database of microhaplotypes was proposed.
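The abstract does not say which polymorphism measure was used. One measure often applied to microhaplotypes is the effective number of alleles, Ae = 1 / sum(p_i^2), where p_i is the frequency of haplotype i at the locus. The sketch below computes it for a hypothetical locus; the haplotype strings and counts are invented for illustration.

```python
from collections import Counter

def effective_number_of_alleles(haplotypes):
    """Ae = 1 / sum(p_i^2) over the haplotype frequencies observed at one locus."""
    if not haplotypes:
        raise ValueError("at least one observed haplotype is required")
    counts = Counter(haplotypes)
    total = sum(counts.values())
    return 1.0 / sum((c / total) ** 2 for c in counts.values())

# Hypothetical three-SNP microhaplotype observed on 10 chromosomes.
observed = ["ACG"] * 4 + ["ATG"] * 3 + ["GCA"] * 2 + ["GTA"] * 1
ae = effective_number_of_alleles(observed)
```

With frequencies 0.4, 0.3, 0.2, and 0.1 the locus behaves like one with about 3.3 equally frequent alleles. A single biallelic SNP can never exceed 2, which is why combining linked SNPs into one marker raises discriminating power.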


Subjects
DNA Fingerprinting , High-Throughput Nucleotide Sequencing , China , Forensic Genetics , Gene Frequency , Genetics, Population , Genomics , Haplotypes , Humans , Microsatellite Repeats , Polymorphism, Single Nucleotide/genetics
13.
BMC Bioinformatics ; 21(1): 14, 2020 Jan 10.
Article in English | MEDLINE | ID: mdl-31924160

ABSTRACT

BACKGROUND: Linkage disequilibrium (LD), the non-random association of alleles at different loci, defines population-specific haplotypes which vary by genomic ancestry. Assessment of allelic frequencies and LD patterns from a variety of ancestral populations enables researchers to better understand population histories as well as improve genetic understanding of diseases in which risk varies by ethnicity. RESULTS: We created an interactive web module which allows for quick geographic visualization of linkage disequilibrium (LD) patterns between two user-specified germline variants across geographic populations included in the 1000 Genomes Project. Interactive maps and a downloadable, sortable summary table allow researchers to easily compute and compare allele frequencies and LD statistics of dbSNP catalogued variants. The geographic mapping of each SNP's allele frequencies by population as well as visualization of LD statistics allows the user to easily trace geographic allelic correlation patterns and examine population-specific differences. CONCLUSIONS: LDpop is a free and publicly available cross-platform web tool which can be accessed online at https://ldlink.nci.nih.gov/?tab=ldpop.
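The LD statistics usually reported for a pair of biallelic variants are D, D', and r², all derived from haplotype counts. The sketch below uses the standard definitions; it is not LDpop's implementation, and the counts are hypothetical.

```python
def ld_statistics(n_ab, n_a_b, n_a_b2, n_ab2):
    """Compute (D, D', r^2) from counts of haplotypes AB, Ab, aB, ab, in that order."""
    total = n_ab + n_a_b + n_a_b2 + n_ab2
    p_a = (n_ab + n_a_b) / total      # frequency of allele A at locus 1
    p_b = (n_ab + n_a_b2) / total     # frequency of allele B at locus 2
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("LD is undefined when either locus is monomorphic")
    d = n_ab / total - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r_squared = d * d / denom
    return d, d_prime, r_squared

# Hypothetical counts for one population: AB=50, Ab=10, aB=10, ab=30.
d, d_prime, r_squared = ld_statistics(50, 10, 10, 30)
```

Running the same function on counts from different populations is what makes population-specific LD visible: the same variant pair can be tightly correlated in one population and nearly independent in another.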


Subjects
Genome-Wide Association Study , Linkage Disequilibrium , User-Computer Interface , Alleles , Gene Frequency , Genomics/methods , Haplotypes , Humans , Polymorphism, Single Nucleotide
14.
BMC Genomics ; 21(1): 842, 2020 Nov 30.
Article in English | MEDLINE | ID: mdl-33256598

ABSTRACT

BACKGROUND: The core promoter controls transcription initiation. However, little is known about core promoter diversity in the human genome and its relationship with disease. We hypothesized that, as a functionally important component of the genome, the core promoter could be under evolutionary selection, as reflected by high diversification that adjusts gene expression for better adaptation to different environments. RESULTS: Applying the "Exome-based Variant Detection in Core-promoters" method, we analyzed human core promoter diversity using 2682 exome data sets from 25 worldwide human populations sequenced by the 1000 Genomes Project. Collectively, we identified 31,996 variants in the core promoter region (-100 to +100) of 12,509 human genes (https://dbhcpd.fhs.um.edu.mo). Analysis of these variation data identified highly ethnic-specific patterns of core promoter variation between populations, the genes with highly variable core promoters, the motifs affected by the variants, and the functional pathways involved. eQTL testing revealed that 12% of core promoter variants can significantly alter gene expression levels. By comparison with GWAS data, we located 163 variants that coincide with GWAS-identified, trait-associated variants for multiple diseases; half of these variants can alter gene expression. CONCLUSION: Our data reveal the highly diversified nature of the core promoter in the human genome and highlight that core promoter variation could play important roles not only in gene expression regulation but also in disease predisposition.


Subjects
Gene Expression Regulation , Genome, Human , Biological Evolution , Gene Expression , Humans , Promoter Regions, Genetic
15.
Am J Hum Genet ; 100(4): 635-649, 2017 Apr 06.
Article in English | MEDLINE | ID: mdl-28366442

ABSTRACT

The vast majority of genome-wide association studies (GWASs) are performed in Europeans, and their transferability to other populations is dependent on many factors (e.g., linkage disequilibrium, allele frequencies, genetic architecture). As medical genomics studies become increasingly large and diverse, gaining insights into population history and consequently the transferability of disease risk measurement is critical. Here, we disentangle recent population history in the widely used 1000 Genomes Project reference panel, with an emphasis on populations underrepresented in medical studies. To examine the transferability of single-ancestry GWASs, we used published summary statistics to calculate polygenic risk scores for eight well-studied phenotypes. We identify directional inconsistencies in all scores; for example, height is predicted to decrease with genetic distance from Europeans, despite robust anthropological evidence that West Africans are as tall as Europeans on average. To gain deeper quantitative insights into GWAS transferability, we developed a complex trait coalescent-based simulation framework considering effects of polygenicity, causal allele frequency divergence, and heritability. As expected, correlations between true and inferred risk are typically highest in the population from which summary statistics were derived. We demonstrate that scores inferred from European GWASs are biased by genetic drift in other populations even when choosing the same causal variants and that biases in any direction are possible and unpredictable. This work cautions that summarizing findings from large-scale GWASs may have limited portability to other populations using standard approaches and highlights the need for generalized risk prediction methods and the inclusion of more diverse individuals in medical genomics.
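A polygenic risk score in its basic additive form is the sum, over variants, of the effect size from GWAS summary statistics multiplied by the number of effect alleles an individual carries. The sketch below shows only that sum; the variant names, effect sizes, and dosages are hypothetical, and steps such as LD clumping or p-value thresholding that a real analysis would include are omitted.

```python
def polygenic_score(dosages, effect_sizes):
    """Additive score: sum of (effect size x effect-allele dosage) over shared variants."""
    shared = [snp for snp in effect_sizes if snp in dosages]
    if not shared:
        raise ValueError("no variants in common between dosages and effect sizes")
    return sum(effect_sizes[snp] * dosages[snp] for snp in shared)

# Hypothetical per-allele effect sizes from a GWAS summary-statistics file.
effect_sizes = {"snp1": 0.12, "snp2": -0.05, "snp3": 0.30}
# Hypothetical effect-allele dosages (0, 1, or 2) for one individual.
individual = {"snp1": 2, "snp2": 1, "snp3": 0}
score = polygenic_score(individual, effect_sizes)
```

The score inherits every assumption baked into the effect sizes. If allele frequencies and LD with the causal variants differ in another population, the same weights yield shifted score distributions there, which is the transferability problem the abstract describes.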


Subjects
Genetic Predisposition to Disease , Racial Groups/genetics , Americas , Genetics, Medical , Genetics, Population , Haplotypes , Human Genome Project , Humans , Multifactorial Inheritance
16.
BMC Bioinformatics ; 20(1): 26, 2019 Jan 15.
Article in English | MEDLINE | ID: mdl-30646839

ABSTRACT

BACKGROUND: Simulation of genetic variants data is frequently required for the evaluation of statistical methods in the fields of human and animal genetics. Although a number of high-quality genetic simulators have been developed, many of them require advanced knowledge in population genetics or in computation to be used effectively. In addition, generating simulated data in the context of family-based studies demands sophisticated methods and advanced computer programming. RESULTS: To address these issues, we propose a new user-friendly and integrated R package, sim1000G, which simulates variants in genomic regions among unrelated individuals or among families. The only input needed is a raw phased Variant Call Format (VCF) file. Haplotypes are extracted to compute linkage disequilibrium (LD) in the simulated genomic regions and for the generation of new genotype data among unrelated individuals. The covariance across variants is used to preserve the LD structure of the original population. Pedigrees of arbitrary sizes are generated by modeling recombination events with sim1000G. To illustrate the application of sim1000G, various scenarios are presented assuming unrelated individuals from a single population or two distinct populations, or alternatively for three-generation pedigree data. Sim1000G can capture allele frequency diversity, short and long-range linkage disequilibrium (LD) patterns and subtle population differences in LD structure without the need of any tuning parameters. CONCLUSION: Sim1000G fills a gap in the vast area of genetic variants simulators by its simplicity and independence from external tools. Currently, it is one of the few simulation packages completely integrated into R and able to simulate multiple genetic variants among unrelated individuals and within families. Its implementation will facilitate the application and development of computational methods for association studies with both rare and common variants.


Subjects
Computational Biology/methods , Genetic Linkage , Genetic Markers , Population Genetics , Genetic Models , Single Nucleotide Polymorphism , Software , Female , Humans , Linkage Disequilibrium , Male , Pedigree
17.
Genet Epidemiol ; 42(7): 636-647, 2018 10.
Article in English | MEDLINE | ID: mdl-30156736

ABSTRACT

Complex traits can share a substantial proportion of their polygenic heritability. However, genome-wide polygenic correlations between pairs of traits can mask heterogeneity in their shared polygenic effects across loci. We propose a novel method (weighted maximum likelihood-regional polygenic correlation [RPC]) to evaluate polygenic correlation between two complex traits in small genomic regions using summary association statistics. Our method tests for evidence that the polygenic effect at a given region affects two traits concurrently. We show through simulations that our method is well calibrated, powerful, and more robust to misspecification of linkage disequilibrium than other methods under a polygenic model. As small genomic regions are more likely to harbor specific genetic effects, our method is ideal to identify heterogeneity in shared polygenic correlation across regions. We illustrate the usefulness of our method by addressing two questions related to cardiometabolic traits. First, we explored how RPC can inform on the strong epidemiological association between high-density lipoprotein cholesterol and coronary artery disease (CAD), suggesting a key role for triglyceride metabolism. Second, we investigated the potential role of PPARγ activators in the prevention of CAD. Our results provide a compelling argument that shared heritability between complex traits is highly heterogeneous across loci.


Subjects
Linkage Disequilibrium/genetics , Multifactorial Inheritance/genetics , HDL Cholesterol/genetics , Computer Simulation , Coronary Artery Disease/drug therapy , Coronary Artery Disease/genetics , Genetic Loci , Human Genome , Genome-Wide Association Study , Haplotypes/genetics , Humans , Genetic Models , PPAR gamma/metabolism , Phenotype , Single Nucleotide Polymorphism/genetics , Risk Factors , Thiazolidinediones/therapeutic use
18.
BMC Genomics ; 20(1): 620, 2019 Aug 16.
Article in English | MEDLINE | ID: mdl-31416423

ABSTRACT

BACKGROUND: Data from the 1000 Genomes project is quite often used as a reference for human genomic analysis. However, its accuracy needs to be assessed to understand the quality of predictions made using this reference. We present here an assessment of the genotyping, phasing, and imputation accuracy of the data in the 1000 Genomes project. We compare the phased haplotype calls from the 1000 Genomes project to experimentally phased haplotypes for 28 of the same individuals sequenced using the 10X Genomics platform. RESULTS: We observe that phasing and imputation for rare variants are unreliable, which likely reflects the limited sample size of the 1000 Genomes project data. Further, it appears that using a population-specific reference panel does not improve the accuracy of imputation over using the entire 1000 Genomes data set as a reference panel. We also note that the error rates and trends depend on the choice of definition of error, and hence any error reporting needs to take these definitions into account. CONCLUSIONS: The quality of the 1000 Genomes data needs to be considered while using this database for further studies. This work presents an analysis that can be used for these assessments.
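
The point that error rates depend on the definition of error can be made concrete with a short sketch. The abstract does not say which definitions the authors used; switch error and Hamming error are two widely used phasing metrics chosen here as an assumption, and the function names and toy genotypes are hypothetical. Both call sets are assumed to be restricted to sites that are heterozygous and genotype-concordant.

```python
def phase_agreement(reference, calls):
    """Per-site flags: True where the called phase matches the reference phase."""
    if len(reference) != len(calls):
        raise ValueError("call sets must cover the same sites")
    return [r == c for r, c in zip(reference, calls)]

def switch_error_rate(reference, calls):
    """Fraction of adjacent site pairs where phase agreement flips."""
    same = phase_agreement(reference, calls)
    if len(same) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(same, same[1:]) if a != b)
    return switches / (len(same) - 1)

def hamming_error_rate(reference, calls):
    """Fraction of sites on the wrong haplotype under the best labeling."""
    same = phase_agreement(reference, calls)
    if not same:
        return 0.0
    mismatches = same.count(False)
    # Haplotype labels are arbitrary, so take the better of the two labelings.
    return min(mismatches, len(same) - mismatches) / len(same)

# Toy phased heterozygous genotypes at five sites; values are made up.
reference = [(0, 1), (1, 0), (0, 1), (0, 1), (1, 0)]
calls = [(0, 1), (1, 0), (1, 0), (1, 0), (1, 0)]

ser = switch_error_rate(reference, calls)   # 2 switches over 4 adjacent pairs
her = hamming_error_rate(reference, calls)  # 2 of 5 sites mismatched
```

The same pair of call sets yields 0.5 under one definition and 0.4 under the other, which is why a reported error rate is hard to interpret without its definition.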


Subjects
Human Genome/genetics , Haplotypes/genetics , Racial Groups/genetics , Gene Frequency/genetics , High-Throughput Nucleotide Sequencing , Human Genome Project , Humans , Single Nucleotide Polymorphism , Racial Groups/ethnology , Scientific Experimental Error
19.
RNA ; 23(1): 14-22, 2017 01.
Article in English | MEDLINE | ID: mdl-27807179

ABSTRACT

As most RNA structures are elusive to structure determination, obtaining solvent accessible surface areas (ASAs) of nucleotides in an RNA structure is an important first step to characterize potential functional sites and core structural regions. Here, we developed RNAsnap, the first machine-learning method trained on protein-bound RNA structures for solvent accessibility prediction. Built on sequence profiles from multiple sequence alignment (RNAsnap-prof), the method provided robust prediction in fivefold cross-validation and an independent test (Pearson correlation coefficients, r, between predicted and actual ASA values are 0.66 and 0.63, respectively). Application of the method to 6178 mRNAs revealed its positive correlation to mRNA accessibility by dimethyl sulphate (DMS) experimentally measured in vivo (r = 0.37) but not in vitro (r = 0.07), despite the lack of training on mRNAs and the fact that DMS accessibility is only an approximation to solvent accessibility. We further found strong association across coding and noncoding regions between predicted solvent accessibility of the mutation site of a single nucleotide variant (SNV) and the frequency of that variant in the population for 2.2 million SNVs obtained in the 1000 Genomes Project. Moreover, mapping solvent accessibility of RNAs to the human genome indicated that introns, 5' cap of 5' and 3' cap of 3' untranslated regions, are more solvent accessible, consistent with their respective functional roles. These results support conformational selections as the mechanism for the formation of RNA-protein complexes and highlight the utility of genome-scale characterization of RNA tertiary structures by RNAsnap. The server and its stand-alone downloadable version are available at http://sparks-lab.org.


Subjects
RNA/chemistry , RNA/genetics , Solvents/chemistry , Computational Biology/methods , Human Genome , Humans , Machine Learning , Molecular Models , Molecular Conformation
20.
Int J Immunogenet ; 46(2): 49-58, 2019 Apr.
Article in English | MEDLINE | ID: mdl-30659741

ABSTRACT

Allele-specific analyses to understand frequency differences across populations, particularly populations not well studied, are important to help identify variants that may have a functional effect on disease mechanisms and phenotypic predisposition, facilitating new Genome-Wide Association Studies (GWAS). We aimed to compare the allele frequency of 11 asthma-associated and 16 liver disease-associated single nucleotide polymorphisms (SNPs) between the Estonian, HapMap and 1000 Genomes Project populations. When comparing EGCUT with HapMap populations, the largest difference in allele frequencies was observed with the Maasai population in Kinyawa, Kenya, with 12 SNP variants reporting statistical significance. Similarly, when comparing EGCUT with 1000 Genomes Project populations, the largest difference in allele frequencies was observed with pooled African populations with 22 SNP variants reporting statistical significance. For 11 asthma-associated and 16 liver disease-associated SNPs, Estonians are genetically similar to other European populations but significantly different from African populations. Understanding differences in genetic architecture between ethnic populations is important to facilitate new GWAS targeted at underserved ethnic groups to enable novel genetic findings to aid the development of new therapies to reduce morbidity and mortality.
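
A per-SNP comparison of allele frequencies between two populations can be sketched as follows. The abstract does not name the statistical test used, so the Pearson chi-square test on a 2x2 table of allele counts is an assumption here, as are the function name and the example counts, which are invented and not taken from the study.

```python
import math

def allele_chi2(alt_a, total_a, alt_b, total_b):
    """Pearson chi-square (1 df) on a 2x2 table of allele counts.

    alt_* is the alternate-allele count and total_* the number of
    chromosomes sampled in each population.
    """
    table = [[alt_a, total_a - alt_a], [alt_b, total_b - alt_b]]
    n = total_a + total_b
    row = [total_a, total_b]
    col = [alt_a + alt_b, n - alt_a - alt_b]
    if 0 in row or 0 in col:
        return 0.0, 1.0  # no variation to test
    stat = 0.0
    for r in range(2):
        for c in range(2):
            expected = row[r] * col[c] / n
            stat += (table[r][c] - expected) ** 2 / expected
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: 30 of 200 chromosomes vs 90 of 200 chromosomes.
stat, p_value = allele_chi2(30, 200, 90, 200)
```

With 27 SNPs tested against several populations, a multiple-testing correction would normally be applied to the resulting p-values; the abstract does not state which, if any, was used.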


Subjects
Asthma/genetics , Gene Frequency/genetics , Population Genetics , Human Genome , HapMap Project , Liver Diseases/genetics , Single Nucleotide Polymorphism/genetics , Estonia , Humans