Results 1 - 20 of 15,627
1.
Nat Immunol ; 19(7): 674-684, 2018 07.
Article in English | MEDLINE | ID: mdl-29925982

ABSTRACT

Genome-wide association studies are transformative in revealing the polygenetic basis of common diseases, with autoimmune diseases leading the charge. Although the field is just over 10 years old, advances in understanding the underlying mechanistic pathways of these conditions, which result from a dense multifactorial blend of genetic, developmental and environmental factors, have already been informative, including insights into therapeutic possibilities. Nevertheless, the challenge of identifying the actual causal genes and pathways and their biological effects on altering disease risk remains for many identified susceptibility regions. It is this fundamental knowledge that will underpin the revolution in patient stratification, the discovery of therapeutic targets and clinical trial design in the next 20 years. Here we outline recent advances in analytical and phenotyping approaches and the emergence of large cohorts with standardized gene-expression data and other phenotypic data that are fueling a bounty of discovery and improved understanding of human physiology.


Subjects
Autoimmune Diseases/genetics , Autoimmune Diseases/microbiology , Chromosome Mapping , Genetic Variation , Genome-Wide Association Study , Humans , Infections/complications , Microbiota , Random Allocation , Sample Size
2.
Nature ; 626(7999): 491-499, 2024 Feb.
Article in English | MEDLINE | ID: mdl-38356064

ABSTRACT

Social scientists have increasingly turned to the experimental method to understand human behaviour. One critical issue that makes solving social problems difficult is scaling up the idea from a small group to a larger group in more diverse situations. The urgency of scaling policies impacts us every day, whether it is protecting the health and safety of a community or enhancing the opportunities of future generations. Yet, a common result is that, when we scale up ideas, most experience a 'voltage drop': that is, on scaling, the cost-benefit profile depreciates considerably. Here I argue that, to reduce voltage drops, we must optimally generate policy-based evidence. Optimality requires answering two crucial questions: what information should be generated and in what sequence. The economics underlying the science of scaling provides insights into these questions, which are in some cases at odds with conventional approaches. For example, there are important situations in which I advocate flipping the traditional social science research model to an approach that, from the beginning, produces the type of policy-based evidence that the science of scaling demands. To do so, I propose augmenting efficacy trials by including relevant tests of scale in the original discovery process, which forces the scientist to naturally start with a recognition of the big picture: what information do I need to have scaling confidence?


Subjects
Sample Size , Social Sciences , Humans , Social Sciences/methods , Social Sciences/standards , Behavioral Research/methods , Cost-Benefit Analysis
3.
Nature ; 620(7974): 595-599, 2023 Aug.
Article in English | MEDLINE | ID: mdl-37558871

ABSTRACT

Migratory songbirds have the remarkable ability to extract directional information from the Earth's magnetic field1,2. The exact mechanism of this light-dependent magnetic compass sense, however, is not fully understood. The most promising hypothesis focuses on the quantum spin dynamics of transient radical pairs formed in cryptochrome proteins in the retina3-5. Frustratingly, much of the supporting evidence for this theory is circumstantial, largely because of the extreme challenges posed by genetic modification of wild birds. Drosophila has therefore been recruited as a model organism, and several influential reports of cryptochrome-mediated magnetic field effects on fly behaviour have been widely interpreted as support for a radical pair-based mechanism in birds6-23. Here we report the results of an extensive study testing magnetic field effects on 97,658 flies moving in a two-arm maze and on 10,960 flies performing the spontaneous escape behaviour known as negative geotaxis. Under meticulously controlled conditions and with vast sample sizes, we have been unable to find evidence for magnetically sensitive behaviour in Drosophila. Moreover, after reassessment of the statistical approaches and sample sizes used in the studies that we tried to replicate, we suggest that many, if not all, of the original results were false positives. Our findings therefore cast considerable doubt on the existence of magnetic sensing in Drosophila and thus strongly suggest that night-migratory songbirds remain the organism of choice for elucidating the mechanism of light-dependent magnetoreception.


Subjects
Drosophila melanogaster , Magnetic Fields , Negative Results , Animals , Animal Migration , Cryptochromes/metabolism , Songbirds/physiology , Drosophila melanogaster/physiology , Models, Animal , Escape Reaction , Maze Learning , Sample Size , Light
4.
Nature ; 610(7933): 704-712, 2022 10.
Article in English | MEDLINE | ID: mdl-36224396

ABSTRACT

Common single-nucleotide polymorphisms (SNPs) are predicted to collectively explain 40-50% of phenotypic variation in human height, but identifying the specific variants and associated regions requires huge sample sizes1. Here, using data from a genome-wide association study of 5.4 million individuals of diverse ancestries, we show that 12,111 independent SNPs that are significantly associated with height account for nearly all of the common SNP-based heritability. These SNPs are clustered within 7,209 non-overlapping genomic segments with a mean size of around 90 kb, covering about 21% of the genome. The density of independent associations varies across the genome and the regions of increased density are enriched for biologically relevant genes. In out-of-sample estimation and prediction, the 12,111 SNPs (or all SNPs in the HapMap 3 panel2) account for 40% (45%) of phenotypic variance in populations of European ancestry but only around 10-20% (14-24%) in populations of other ancestries. Effect sizes, associated regions and gene prioritization are similar across ancestries, indicating that reduced prediction accuracy is likely to be explained by linkage disequilibrium and differences in allele frequency within associated regions. Finally, we show that the relevant biological pathways are detectable with smaller sample sizes than are needed to implicate causal genes and variants. Overall, this study provides a comprehensive map of specific genomic regions that contain the vast majority of common height-associated variants. Although this map is saturated for populations of European ancestry, further research is needed to achieve equivalent saturation in other ancestries.


Subjects
Body Height , Chromosome Mapping , Polymorphism, Single Nucleotide , Humans , Body Height/genetics , Gene Frequency/genetics , Genome, Human/genetics , Genome-Wide Association Study , Haplotypes/genetics , Linkage Disequilibrium/genetics , Polymorphism, Single Nucleotide/genetics , Europe/ethnology , Sample Size , Phenotype
5.
Nature ; 612(7941): 720-724, 2022 12.
Article in English | MEDLINE | ID: mdl-36477530

ABSTRACT

Tobacco and alcohol use are heritable behaviours associated with 15% and 5.3% of worldwide deaths, respectively, due largely to broad increased risk for disease and injury1-4. These substances are used across the globe, yet genome-wide association studies have focused largely on individuals of European ancestries5. Here we leveraged global genetic diversity across 3.4 million individuals from four major clines of global ancestry (approximately 21% non-European) to power the discovery and fine-mapping of genomic loci associated with tobacco and alcohol use, to inform function of these loci via ancestry-aware transcriptome-wide association studies, and to evaluate the genetic architecture and predictive power of polygenic risk within and across populations. We found that increases in sample size and genetic diversity improved locus identification and fine-mapping resolution, and that a large majority of the 3,823 associated variants (from 2,143 loci) showed consistent effect sizes across ancestry dimensions. However, polygenic risk scores developed in one ancestry performed poorly in others, highlighting the continued need to increase sample sizes of diverse ancestries to realize any potential benefit of polygenic prediction.


Subjects
Alcohol Drinking , Genetic Predisposition to Disease , Genetic Variation , Internationality , Multifactorial Inheritance , Tobacco Use , Humans , Genetic Predisposition to Disease/genetics , Genetic Variation/genetics , Genome-Wide Association Study/methods , Multifactorial Inheritance/genetics , Risk Factors , Tobacco Use/genetics , Alcohol Drinking/genetics , Transcriptome , Sample Size , Genetic Loci/genetics , Europe/ethnology
6.
PLoS Biol ; 22(4): e3002456, 2024 Apr.
Article in English | MEDLINE | ID: mdl-38603525

ABSTRACT

A recent article claimed that researchers need not increase the overall sample size for a study that includes both sexes. This Formal Comment points out that the study assumed the two sexes to have the same variance, and explains why this is an unrealistic assumption.


Subjects
Research Design , Male , Female , Humans , Sample Size
7.
PLoS Biol ; 22(1): e3002423, 2024 Jan.
Article in English | MEDLINE | ID: mdl-38190355

ABSTRACT

Power analysis currently dominates sample size determination for experiments, particularly in grant and ethics applications. Yet, this focus could paradoxically result in suboptimal study design because publication biases towards studies with the largest effects can lead to the overestimation of effect sizes. In this Essay, we propose a paradigm shift towards better study designs that focus less on statistical power. We also advocate for (pre)registration and obligatory reporting of all results (regardless of statistical significance), better facilitation of team science and multi-institutional collaboration that incorporates heterogenization, and the use of prospective and living meta-analyses to generate generalizable results. Such changes could make science more effective and, potentially, more equitable, helping to cultivate better collaborations.


Subjects
Research Design , Prospective Studies , Sample Size , Publication Bias
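
To make concrete how a power analysis turns an assumed effect size into a sample size, and why an overestimated effect size matters, here is a minimal sketch using the standard normal-approximation formula for a two-sided, two-sample comparison of means. The function names, the defaults (alpha = 0.05, power = 0.80) and the example effect sizes 0.8 and 0.5 are illustrative assumptions, not values taken from the essay.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample comparison of means.

    Normal approximation; effect_size is a standardized mean difference.
    """
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = _STD_NORMAL.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

def achieved_power(true_effect_size, n, alpha=0.05):
    """Approximate power with n subjects per group (far tail ignored)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(true_effect_size * math.sqrt(n / 2) - z_alpha)

# Hypothetical scenario: the literature suggests 0.8, the true effect is 0.5.
planned_n = n_per_group(0.8)   # sample size planned from the inflated estimate
honest_n = n_per_group(0.5)    # sample size the true effect would require
power_if_true_is_smaller = achieved_power(0.5, planned_n)
```

Under these assumptions the study planned around the inflated effect size has well under the intended 80% power, which is the mechanism by which publication bias can undermine power-based planning.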
8.
Nature ; 590(7845): 290-299, 2021 02.
Article in English | MEDLINE | ID: mdl-33568819

ABSTRACT

The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes)1. In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%.


Subjects
Genetic Variation/genetics , Genome, Human/genetics , Genomics , National Heart, Lung, and Blood Institute (U.S.) , Precision Medicine , Cytochrome P-450 CYP2D6/genetics , Haplotypes/genetics , Heterozygote , Humans , INDEL Mutation , Loss of Function Mutation , Mutagenesis , Phenotype , Polymorphism, Single Nucleotide , Population Density , Precision Medicine/standards , Quality Control , Sample Size , United States , Whole Genome Sequencing/standards
9.
Nature ; 600(7890): 695-700, 2021 12.
Article in English | MEDLINE | ID: mdl-34880504

ABSTRACT

Surveys are a crucial tool for understanding public opinion and behaviour, and their accuracy depends on maintaining statistical representativeness of their target populations by minimizing biases from all sources. Increasing data size shrinks confidence intervals but magnifies the effect of survey bias: an instance of the Big Data Paradox1. Here we demonstrate this paradox in estimates of first-dose COVID-19 vaccine uptake in US adults from 9 January to 19 May 2021 from two large surveys: Delphi-Facebook2,3 (about 250,000 responses per week) and Census Household Pulse4 (about 75,000 every two weeks). In May 2021, Delphi-Facebook overestimated uptake by 17 percentage points (14-20 percentage points with 5% benchmark imprecision) and Census Household Pulse by 14 (11-17 percentage points with 5% benchmark imprecision), compared to a retroactively updated benchmark the Centers for Disease Control and Prevention published on 26 May 2021. Moreover, their large sample sizes led to minuscule margins of error on the incorrect estimates. By contrast, an Axios-Ipsos online panel5 with about 1,000 responses per week following survey research best practices6 provided reliable estimates and uncertainty quantification. We decompose observed error using a recent analytic framework1 to explain the inaccuracy in the three surveys. We then analyse the implications for vaccine hesitancy and willingness. We show how a survey of 250,000 respondents can produce an estimate of the population mean that is no more accurate than an estimate from a simple random sample of size 10. Our central message is that data quality matters more than data quantity, and that compensating the former with the latter is a mathematically provable losing proposition.


Subjects
COVID-19 Vaccines/administration & dosage , Health Care Surveys , Vaccination/statistics & numerical data , Benchmarking , Bias , Big Data , COVID-19/epidemiology , COVID-19/prevention & control , Centers for Disease Control and Prevention, U.S. , Datasets as Topic/standards , Female , Health Care Surveys/standards , Humans , Male , Research Design , Sample Size , Social Media , United States/epidemiology , Vaccination Hesitancy/statistics & numerical data
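
The claim that a very large but biased sample can be less accurate than a small random one can be illustrated with a toy simulation. This is only a sketch under invented assumptions: the population size, the true uptake rate and the response probabilities (vaccinated people respond at 30%, unvaccinated at 20%) are hypothetical, and the code does not reproduce the error decomposition used in the paper.

```python
import random

def simulate(pop_size=200_000, true_rate=0.6, srs_n=1_000, seed=0):
    """Compare a large self-selected sample with a small simple random sample."""
    rng = random.Random(seed)

    # Finite population: 1 = vaccinated, 0 = not vaccinated.
    population = [1 if rng.random() < true_rate else 0 for _ in range(pop_size)]
    truth = sum(population) / pop_size

    # Large survey with selection bias: the chance of responding
    # depends on the outcome being measured (hypothetical rates).
    respond_prob = {1: 0.30, 0: 0.20}
    big_sample = [y for y in population if rng.random() < respond_prob[y]]
    big_estimate = sum(big_sample) / len(big_sample)

    # Small simple random sample: every person equally likely to be drawn.
    srs = rng.sample(population, srs_n)
    srs_estimate = sum(srs) / srs_n

    return truth, len(big_sample), big_estimate, srs_estimate

truth, big_n, big_estimate, srs_estimate = simulate()
big_error = abs(big_estimate - truth)
srs_error = abs(srs_estimate - truth)
```

In this setup the biased survey has tens of thousands of respondents and a tiny nominal margin of error, yet its estimate sits far from the truth, while the random sample of 1,000 lands much closer.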
10.
Brief Bioinform ; 25(3)2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38581417

ABSTRACT

Untargeted metabolomics based on liquid chromatography-mass spectrometry technology is quickly gaining widespread application, given its ability to depict the global metabolic pattern in biological samples. However, the data are noisy and plagued by the lack of clear identity of data features measured from samples. Multiple potential matchings exist between data features and known metabolites, while the truth can only be one-to-one matches. Some existing methods attempt to reduce the matching uncertainty, but are far from being able to remove the uncertainty for most features. The existence of the uncertainty causes major difficulty in downstream functional analysis. To address these issues, we develop a novel approach for Bayesian Analysis of Untargeted Metabolomics data (BAUM) to integrate previously separate tasks into a single framework, including matching uncertainty inference, metabolite selection and functional analysis. By incorporating the knowledge graph between variables and using relatively simple assumptions, BAUM can analyze datasets with small sample sizes. By allowing different confidence levels of feature-metabolite matching, the method is applicable to datasets in which feature identities are partially known. Simulation studies demonstrate that, compared with other existing methods, BAUM achieves better accuracy in selecting important metabolites that tend to be functionally consistent and assigning confidence scores to feature-metabolite matches. We analyze a COVID-19 metabolomics dataset and a mouse brain metabolomics dataset using BAUM. Even with a very small sample size of 16 mice per group, BAUM is robust and stable. It finds pathways that conform to existing knowledge, as well as novel pathways that are biologically plausible.


Subjects
Metabolomics , Mice , Animals , Bayes Theorem , Sample Size , Uncertainty , Metabolomics/methods , Computer Simulation
11.
PLoS Biol ; 21(3): e3002007, 2023 03.
Article in English | MEDLINE | ID: mdl-36862747

ABSTRACT

We assess inferential quality in the field of differential expression profiling by high-throughput sequencing (HT-seq) based on analysis of datasets submitted from 2008 to 2020 to the NCBI GEO data repository. We take advantage of the parallel differential expression testing over thousands of genes, whereby each experiment leads to a large set of p-values, the distribution of which can indicate the validity of assumptions behind the test. From a well-behaved p-value set, π0, the fraction of genes that are not differentially expressed, can be estimated. We found that only 25% of experiments resulted in theoretically expected p-value histogram shapes, although there is a marked improvement over time. Uniform p-value histogram shapes, indicative of <100 actual effects, were extremely few. Furthermore, although many HT-seq workflows assume that most genes are not differentially expressed, 37% of experiments have π0 values of less than 0.5, as if most genes changed their expression level. Most HT-seq experiments have very small sample sizes and are expected to be underpowered. Nevertheless, the estimated π0 values do not have the expected association with N, suggesting widespread problems with controlling the false discovery rate (FDR) in these experiments. Both the fractions of different p-value histogram types and the π0 values are strongly associated with the differential expression analysis program used by the original authors. While we could double the proportion of theoretically expected p-value distributions by removing low-count features from the analysis, this treatment did not remove the association with the analysis program. Taken together, our results indicate widespread bias in the differential expression profiling field and the unreliability of statistical methods used to analyze HT-seq data.


Subjects
Gene Expression Profiling , High-Throughput Nucleotide Sequencing , Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Sample Size
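
As a rough illustration of how π0 can be read off a well-behaved p-value set, the sketch below uses Storey's threshold-based estimator: p-values above a cutoff are assumed to come almost entirely from true nulls, which are uniformly distributed. This is one common estimator and not necessarily the one the authors used; the cutoff of 0.5 and the simulated mixture (8,000 null genes, 2,000 affected genes) are assumptions for the example.

```python
import random

def estimate_pi0(p_values, lam=0.5):
    """Storey-type estimate of the fraction of true null hypotheses.

    Null p-values are uniform on [0, 1], so about pi0 * m * (1 - lam)
    of the m p-values are expected to fall above the cutoff lam.
    """
    m = len(p_values)
    above = sum(1 for p in p_values if p > lam)
    return min(1.0, above / (m * (1 - lam)))

# Hypothetical experiment: 80% of genes unaffected, 20% with real effects
# whose p-values pile up near zero.
rng = random.Random(1)
null_p = [rng.random() for _ in range(8_000)]
effect_p = [rng.random() ** 6 for _ in range(2_000)]
pi0_hat = estimate_pi0(null_p + effect_p)
```

The estimate is conservative by design: any affected gene whose p-value still lands above the cutoff is counted as a null, so it tends to sit slightly above the true fraction.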
12.
PLoS Biol ; 21(6): e3002129, 2023 06.
Article in English | MEDLINE | ID: mdl-37289836

ABSTRACT

In recent years, there has been a strong drive to improve the inclusion of animals of both sexes in the design of in vivo research studies, driven by a need to increase sex representation in fundamental biology and drug development. This has resulted in inclusion mandates by funding bodies and journals, alongside numerous published manuscripts highlighting the issue and providing guidance to scientists. However, progress is slow and barriers to the routine use of both sexes remain. A frequent, major concern is the perceived need for a higher overall sample size to achieve an equivalent level of statistical power, which would result in an increased ethical and resource burden. This perception arises from either the belief that sex inclusion will increase variability in the data (either through a baseline difference or a treatment effect that depends on sex), thus reducing the sensitivity of statistical tests, or from misapprehensions about the correct way to analyse the data, including disaggregation or pooling by sex. Here, we conduct an in-depth examination of the consequences of including both sexes on statistical power. We performed simulations by constructing artificial datasets that encompass a range of outcomes that may occur in studies examining a treatment effect in the context of both sexes. This includes both baseline sex differences and situations in which the size of the treatment effect depends on sex in both the same and opposite directions. The data were then analysed using either a factorial analysis approach, which is appropriate for the design, or a t test approach following pooling or disaggregation of the data, which are common but erroneous strategies. The results demonstrate that there is no loss of power to detect treatment effects when splitting the sample size across sexes in most scenarios, provided that the data are analysed using an appropriate factorial analysis method (e.g., two-way ANOVA). In the rare situations where power is lost, the benefit of understanding the role of sex outweighs the power considerations. Additionally, use of the inappropriate analysis pipelines results in a loss of statistical power. Therefore, we recommend analysing data collected from both sexes using factorial analysis and splitting the sample size across male and female mice as a standard strategy.


Subjects
Research Design , Sex Characteristics , Male , Female , Mice , Animals , Sample Size , Analysis of Variance
13.
Nature ; 580(7802): 232-234, 2020 04.
Article in English | MEDLINE | ID: mdl-32269340

ABSTRACT

Environmental change is rapidly accelerating, and many species will need to adapt to survive1. Ensuring that protected areas cover populations across a broad range of environmental conditions could safeguard the processes that lead to such adaptations1-3. However, international conservation policies have largely neglected these considerations when setting targets for the expansion of protected areas4. Here we show that, of 19,937 vertebrate species globally5-8, the representation of environmental conditions across their habitats in protected areas (hereafter, niche representation) is inadequate for 4,836 (93.1%) amphibian, 8,653 (89.5%) bird and 4,608 (90.9%) terrestrial mammal species. Expanding existing protected areas to cover these gaps would encompass 33.8% of the total land surface, exceeding the current target of 17% that has been adopted by governments. Priority locations for expanding the system of protected areas to improve niche representation occur in global biodiversity hotspots9, including Colombia, Papua New Guinea, South Africa and southwest China, as well as across most of the major land masses of the Earth. Conversely, we also show that planning for the expansion of protected areas without explicitly considering environmental conditions would marginally reduce the land area required to 30.7%, but that this would lead to inadequate niche representation for 7,798 (39.1%) species. As the governments of the world prepare to renegotiate global conservation targets, policymakers have the opportunity to help to maintain the adaptive potential of species by considering niche representation within protected areas1,2.


Subjects
Conservation of Natural Resources/legislation & jurisprudence , Ecosystem , Environmental Policy/legislation & jurisprudence , Internationality , Animals , Biodiversity , Federal Government , International Cooperation/legislation & jurisprudence , Sample Size
14.
Nature ; 581(7809): 459-464, 2020 05.
Article in English | MEDLINE | ID: mdl-32461653

ABSTRACT

Naturally occurring human genetic variants that are predicted to inactivate protein-coding genes provide an in vivo model of human gene inactivation that complements knockout studies in cells and model organisms. Here we report three key findings regarding the assessment of candidate drug targets using human loss-of-function variants. First, even essential genes, in which loss-of-function variants are not tolerated, can be highly successful as targets of inhibitory drugs. Second, in most genes, loss-of-function variants are sufficiently rare that genotype-based ascertainment of homozygous or compound heterozygous 'knockout' humans will await sample sizes that are approximately 1,000 times those presently available, unless recruitment focuses on consanguineous individuals. Third, automated variant annotation and filtering are powerful, but manual curation remains crucial for removing artefacts, and is a prerequisite for recall-by-genotype efforts. Our results provide a roadmap for human knockout studies and should guide the interpretation of loss-of-function variants in drug development.


Subjects
Genes, Essential/drug effects , Genes, Essential/genetics , Loss of Function Mutation/genetics , Molecular Targeted Therapy , Artifacts , Automation , Consanguinity , Exons/genetics , Gain of Function Mutation/genetics , Gene Frequency , Gene Knockdown Techniques , Heterozygote , Homozygote , Humans , Huntingtin Protein/genetics , Leucine-Rich Repeat Serine-Threonine Protein Kinase-2/genetics , Neurodegenerative Diseases/genetics , Prion Proteins/genetics , Reproducibility of Results , Sample Size , tau Proteins/genetics
15.
Nature ; 586(7831): 757-762, 2020 10.
Article in English | MEDLINE | ID: mdl-33057194

ABSTRACT

De novo mutations in protein-coding genes are a well-established cause of developmental disorders1. However, genes known to be associated with developmental disorders account for only a minority of the observed excess of such de novo mutations1,2. Here, to identify previously undescribed genes associated with developmental disorders, we integrate healthcare and research exome-sequence data from 31,058 parent-offspring trios of individuals with developmental disorders, and develop a simulation-based statistical test to identify gene-specific enrichment of de novo mutations. We identified 285 genes that were significantly associated with developmental disorders, including 28 that had not previously been robustly associated with developmental disorders. Although we detected more genes associated with developmental disorders, much of the excess of de novo mutations in protein-coding genes remains unaccounted for. Modelling suggests that more than 1,000 genes associated with developmental disorders have not yet been described, many of which are likely to be less penetrant than the currently known genes. Research access to clinical diagnostic datasets will be critical for completing the map of genes associated with developmental disorders.


Subjects
DNA Mutational Analysis , Data Analysis , Databases, Genetic , Datasets as Topic , Delivery of Health Care/statistics & numerical data , Developmental Disabilities/genetics , Genetic Diseases, Inborn/genetics , Cohort Studies , DNA Copy Number Variations/genetics , Developmental Disabilities/diagnosis , Europe , Female , Genetic Diseases, Inborn/diagnosis , Germ-Line Mutation/genetics , Haploinsufficiency/genetics , Humans , Male , Mutation, Missense/genetics , Penetrance , Perinatal Death , Sample Size
16.
Proc Natl Acad Sci U S A ; 120(2): e2207046120, 2023 01 10.
Article in English | MEDLINE | ID: mdl-36603029

ABSTRACT

Recent research identifies and corrects bias, such as excess dispersion, in the leading sample eigenvector of a factor-based covariance matrix estimated from a high-dimension low sample size (HL) data set. We show that eigenvector bias can have a substantial impact on variance-minimizing optimization in the HL regime, while bias in estimated eigenvalues may have little effect. We describe a data-driven eigenvector shrinkage estimator in the HL regime called "James-Stein for eigenvectors" (JSE) and its close relationship with the James-Stein (JS) estimator for a collection of averages. We show, both theoretically and with numerical experiments, that, for certain variance-minimizing problems of practical importance, efforts to correct eigenvalues have little value in comparison to the JSE correction of the leading eigenvector. When certain extra information is present, JSE is a consistent estimator of the leading eigenvector.


Subjects
Bias , Sample Size
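
For readers unfamiliar with the James-Stein (JS) estimator for a collection of averages that the abstract relates JSE to, the sketch below shows the classical positive-part version that shrinks each observed average toward the grand mean. It is not the JSE eigenvector estimator itself, whose details the abstract does not give. The function name and the assumption that every average has the same known noise variance are illustrative.

```python
def james_stein(averages, noise_var):
    """Positive-part James-Stein shrinkage toward the grand mean.

    averages:  observed averages, one per group (at least four groups).
    noise_var: known variance of each observed average (assumed equal).
    """
    p = len(averages)
    if p < 4:
        raise ValueError("shrinking toward the grand mean needs at least 4 averages")

    grand_mean = sum(averages) / p
    spread = sum((x - grand_mean) ** 2 for x in averages)
    if spread == 0:
        return list(averages)  # nothing to shrink

    # Shrinkage factor: close to 1 when the averages are widely dispersed
    # relative to the noise, close to 0 when the dispersion is mostly noise.
    shrink = max(0.0, 1 - (p - 3) * noise_var / spread)
    return [grand_mean + shrink * (x - grand_mean) for x in averages]

shrunk = james_stein([1.0, 2.0, 3.0, 4.0, 5.0], noise_var=1.0)
```

In the example the spread around the grand mean of 3 is 10, so the factor is 1 - 2/10 = 0.8 and each average moves 20% of the way toward 3.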
17.
Brief Bioinform ; 25(1)2023 11 22.
Article in English | MEDLINE | ID: mdl-38037235

ABSTRACT

OBJECTIVE: The performance of popular genome-wide association study (GWAS) models has not yet been examined in a consistent manner under the scenario of genetic admixture, which introduces several challenging aspects: heterogeneity of minor allele frequency (MAF), wide spectrum of case-control ratio, varying effect sizes, etc. METHODS: We generated a cohort of synthetic individuals (N = 19,234) that simulates (i) a large sample size; (ii) two-way admixture (Native American and European ancestry) and (iii) a binary phenotype. We then benchmarked three popular GWAS tools [generalized linear mixed model associated test (GMMAT), scalable and accurate implementation of generalized mixed model (SAIGE) and Tractor] by computing inflation factors and power calculations under different MAFs, case-control ratios, sample sizes and varying ancestry proportions. We also employed a cohort of Peruvians (N = 249) to further examine the performance of the testing models on (i) real genetic and phenotype data and (ii) small sample sizes. RESULTS: In the synthetic cohort, SAIGE performed better than GMMAT and Tractor in terms of type-I error rate, especially under severely unbalanced case-control ratios. By contrast, power analysis identified Tractor as the best method to pinpoint ancestry-specific causal variants, but it showed decreased power when the effect size displayed limited heterogeneity between ancestries. In the Peruvian cohort, only Tractor identified two suggestive loci (P-value $\le 1 \times 10^{-5}$) associated with Native American ancestry. DISCUSSION: The current study illustrates best practice and limitations for available GWAS tools under the scenario of genetic admixture. Incorporating local ancestry in GWAS analyses boosts power, although careful consideration of complex scenarios (small sample sizes, imbalanced case-control ratios, MAF heterogeneity) is needed.


Subjects
Benchmarking , Genome-Wide Association Study , Humans , Genome-Wide Association Study/methods , Gene Frequency , Phenotype , Sample Size , Polymorphism, Single Nucleotide
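
The inflation factors mentioned in the methods are commonly computed as the genomic inflation factor: the median observed association statistic divided by the median of a chi-square distribution with one degree of freedom. The sketch below assumes two-sided p-values from 1-df tests; the abstract does not say exactly how the authors computed theirs, and the function name is hypothetical.

```python
from statistics import NormalDist, median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def inflation_factor(p_values):
    """Genomic inflation factor (lambda) from two-sided 1-df p-values.

    Each p-value in (0, 1] is converted back to a chi-square statistic
    as the square of the matching standard normal quantile.
    """
    std_normal = NormalDist()
    chi2_stats = [std_normal.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2_stats) / CHI2_1DF_MEDIAN

# Evenly spread p-values mimic a well-calibrated test (lambda near 1);
# squaring them makes every p-value smaller, mimicking inflation.
calibrated = [(i + 0.5) / 1000 for i in range(1000)]
lambda_calibrated = inflation_factor(calibrated)
lambda_inflated = inflation_factor([p ** 2 for p in calibrated])
```

Values well above 1 suggest an excess of small p-values across the genome, which for a simulated null phenotype points to a poorly controlled type-I error rate.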
18.
Bioinformatics ; 40(3)2024 Mar 04.
Article in English | MEDLINE | ID: mdl-38430463

ABSTRACT

MOTIVATION: Large-scale gene expression studies allow gene network construction to uncover associations among genes. To study direct associations among genes, partial correlation-based networks are preferred over marginal correlations. However, FDR control for partial correlation-based network construction is not well-studied. In addition, currently available partial correlation-based methods cannot take existing biological knowledge to help network construction while controlling FDR. RESULTS: In this paper, we propose a method called Partial Correlation Graph with Information Incorporation (PCGII). PCGII estimates partial correlations between each pair of genes by regularized node-wise regression that can incorporate prior knowledge while controlling the effects of all other genes. It handles high-dimensional data where the number of genes can be much larger than the sample size and controls FDR at the same time. We compare PCGII with several existing approaches through extensive simulation studies and demonstrate that PCGII has better FDR control and higher power. We apply PCGII to a plant gene expression dataset where it recovers confirmed regulatory relationships and a hub node, as well as several direct associations that shed light on potential functional relationships in the system. We also introduce a method to supplement observed data with a pseudogene to apply PCGII when no prior information is available, which also allows checking FDR control and power for real data analysis. AVAILABILITY AND IMPLEMENTATION: R package is freely available for download at https://cran.r-project.org/package=PCGII.


Subjects
Algorithms , Gene Regulatory Networks , Computer Simulation , Plant Genes , Sample Size
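
The abstract above centers on FDR control when deciding which gene pairs receive an edge. For orientation only, the sketch below shows the generic Benjamini-Hochberg step-up procedure applied to a list of per-edge p-values. This is not PCGII's own inference procedure, which the abstract does not detail, and the function name `benjamini_hochberg` is hypothetical:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Generic Benjamini-Hochberg step-up rule: find the largest rank k such
    that the k-th smallest p-value is <= (k / m) * alpha, then reject the
    k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    if m == 0:
        return []

    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest rank whose p-value falls under its step-up threshold.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank

    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected
```

In a network setting, each p-value would correspond to one candidate edge (one gene pair), and the rejected hypotheses would be the edges kept in the graph. The procedure is known to control FDR under independence or positive dependence of the tests, an assumption that may not hold for partial correlations.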
19.
Bioinformatics ; 40(4), 2024 Mar 29.
Article in English | MEDLINE | ID: mdl-38569898

ABSTRACT

MOTIVATION: Research is improving our understanding of how the microbiome interacts with the human body and of its impact on human health. Existing machine learning methods have shown great potential in discriminating healthy from diseased microbiome states. However, machine learning-based prediction using microbiome data faces challenges such as small sample sizes, imbalance between cases and controls and the high cost of collecting large numbers of samples. To address these challenges, we propose a deep learning framework, phylaGAN, to augment existing datasets with generated microbiome data using a combination of a conditional generative adversarial network (C-GAN) and an autoencoder. Conditional generative adversarial networks train two models against each other to compute larger simulated datasets that are representative of the original dataset. The autoencoder maps the original and the generated samples onto a common subspace to make the prediction more accurate. RESULTS: Extensive evaluation and predictive analysis were conducted on two datasets, a T2D study and a cirrhosis study, showing improvements in mean AUC with data augmentation of 11% and 5%, respectively. External validation on a cohort classifying obese versus lean subjects, with a smaller sample size, showed an improvement in mean AUC of close to 32% when augmented through phylaGAN compared with using the original cohort. Our findings not only indicate that generative adversarial networks can create samples that mimic the original data across various diversity metrics, but also highlight the potential of enhancing disease prediction through machine learning models trained on synthetic data. AVAILABILITY AND IMPLEMENTATION: https://github.com/divya031090/phylaGAN.


Subjects
Benchmarking , Microbiota , Humans , Machine Learning , Sample Size
20.
Hum Genomics ; 18(1): 25, 2024 Mar 14.
Article in English | MEDLINE | ID: mdl-38486307

ABSTRACT

With the development of next-generation sequencing technology, de novo variants (DNVs) with deleterious effects can be identified and investigated for their effects on birth defects such as congenital heart disease (CHD). However, statistical power is still limited for such studies because of the small sample size due to the high cost of recruiting and sequencing samples and the low occurrence of DNVs. DNV analysis is further complicated by genetic heterogeneity across diseased individuals. Therefore, it is critical to jointly analyze DNVs with other types of genomic/biological information to improve statistical power to identify genes associated with birth defects. In this review, we discuss the general workflow, recent developments in statistical methods, and future directions for DNV analysis.


Subjects
Genetic Heterogeneity , Genomics , Humans , High-Throughput Nucleotide Sequencing , Sample Size , Workflow