ABSTRACT
Rare copy-number variants (rCNVs) include deletions and duplications that occur infrequently in the global human population and can confer substantial risk for disease. In this study, we aimed to quantify the properties of haploinsufficiency (i.e., deletion intolerance) and triplosensitivity (i.e., duplication intolerance) throughout the human genome. We harmonized and meta-analyzed rCNVs from nearly one million individuals to construct a genome-wide catalog of dosage sensitivity across 54 disorders, which defined 163 dosage sensitive segments associated with at least one disorder. These segments were typically gene dense and often harbored dominant dosage sensitive driver genes, which we were able to prioritize using statistical fine-mapping. Finally, we designed an ensemble machine-learning model to predict probabilities of dosage sensitivity (pHaplo & pTriplo) for all autosomal genes, which identified 2,987 haploinsufficient and 1,559 triplosensitive genes, including 648 that were uniquely triplosensitive. This dosage sensitivity resource will provide broad utility for human disease research and clinical genetics.
Subjects
DNA Copy Number Variations, Human Genome, DNA Copy Number Variations/genetics, Gene Dosage, Haploinsufficiency/genetics, Humans
ABSTRACT
Genome-wide sequencing of human populations has revealed substantial variation among genes in the intensity of purifying selection acting on damaging genetic variants [1]. Although genes under the strongest selective constraint are highly enriched for associations with Mendelian disorders, most of these genes are not associated with disease and therefore the nature of the selection acting on them is not known [2]. Here we show that genetic variants that damage these genes are associated with markedly reduced reproductive success, primarily owing to increased childlessness, with a stronger effect in males than in females. We present evidence that increased childlessness is probably mediated by genetically associated cognitive and behavioural traits, which may mean that male carriers are less likely to find reproductive partners. This reduction in reproductive success may account for 20% of purifying selection against heterozygous variants that ablate protein-coding genes. Although this genetic association may only account for a very minor fraction of the overall likelihood of being childless (less than 1%), especially when compared to more influential sociodemographic factors, it may influence how genes evolve over time.
Subjects
Reproduction, Genetic Selection, Chromosome Mapping, Female, Heterozygote, Humans, Male, Phenotype, Reproduction/genetics
ABSTRACT
BACKGROUND: Pediatric disorders include a range of highly penetrant, genetically heterogeneous conditions amenable to genomewide diagnostic approaches. Finding a molecular diagnosis is challenging but can have profound lifelong benefits. METHODS: We conducted a large-scale sequencing study involving more than 13,500 families with probands with severe, probably monogenic, difficult-to-diagnose developmental disorders from 24 regional genetics services in the United Kingdom and Ireland. Standardized phenotypic data were collected, and exome sequencing and microarray analyses were performed to investigate novel genetic causes. We developed an iterative variant analysis pipeline and reported candidate variants to clinical teams for validation and diagnostic interpretation to inform communication with families. Multiple regression analyses were performed to evaluate factors affecting the probability of diagnosis. RESULTS: A total of 13,449 probands were included in the analyses. On average, we reported 1.0 candidate variant per parent-offspring trio and 2.5 variants per singleton proband. Using clinical and computational approaches to variant classification, we made a diagnosis in approximately 41% of probands (5502 of 13,449). Of 3599 probands in trios who received a diagnosis by clinical assertion, approximately 76% had a pathogenic de novo variant. Another 22% of probands (2997 of 13,449) had variants of uncertain significance in genes that were strongly linked to monogenic developmental disorders. Recruitment in a parent-offspring trio had the largest effect on the probability of diagnosis (odds ratio, 4.70; 95% confidence interval [CI], 4.16 to 5.31). 
Probands were less likely to receive a diagnosis if they were born extremely prematurely (i.e., 22 to 27 weeks' gestation; odds ratio, 0.39; 95% CI, 0.22 to 0.68), had in utero exposure to antiepileptic medications (odds ratio, 0.44; 95% CI, 0.29 to 0.67), had mothers with diabetes (odds ratio, 0.52; 95% CI, 0.41 to 0.67), or were of African ancestry (odds ratio, 0.51; 95% CI, 0.31 to 0.78). CONCLUSIONS: Among probands with severe, probably monogenic, difficult-to-diagnose developmental disorders, multimodal analysis of genomewide data had good diagnostic power, even after previous attempts at diagnosis. (Funded by the Health Innovation Challenge Fund and Wellcome Sanger Institute.)
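The headline proportions in the RESULTS can be rechecked from the counts quoted in the abstract itself. A minimal Python sketch follows; the helper name `percent` is illustrative and not part of the study's pipeline.

```python
def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage, rounded to one decimal."""
    return round(100.0 * numerator / denominator, 1)


# Counts taken directly from the abstract.
TOTAL_PROBANDS = 13_449

# Probands who received a diagnosis.
diagnosed = percent(5502, TOTAL_PROBANDS)

# Probands with variants of uncertain significance in strongly linked genes.
uncertain = percent(2997, TOTAL_PROBANDS)

print(f"diagnosed: {diagnosed}%  uncertain significance: {uncertain}%")
```

Both values agree with the rounded figures reported in the abstract (approximately 41% and 22%).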
Subjects
Genomics, Rare Diseases, Child, Humans, Exome, Ireland/epidemiology, United Kingdom/epidemiology, Rare Diseases/diagnosis, Rare Diseases/epidemiology, Rare Diseases/genetics, Oligonucleotide Array Sequence Analysis, Genetic Association Studies, Neurodevelopmental Disorders/diagnosis, Neurodevelopmental Disorders/genetics, Congenital Abnormalities/diagnosis, Congenital Abnormalities/genetics, Growth Disorders/diagnosis, Growth Disorders/genetics, Facies, Child Behavior Disorders/diagnosis, Child Behavior Disorders/genetics, Inborn Genetic Diseases/diagnosis, Inborn Genetic Diseases/genetics
ABSTRACT
De novo mutations in protein-coding genes are a well-established cause of developmental disorders [1]. However, genes known to be associated with developmental disorders account for only a minority of the observed excess of such de novo mutations [1,2]. Here, to identify previously undescribed genes associated with developmental disorders, we integrate healthcare and research exome-sequence data from 31,058 parent-offspring trios of individuals with developmental disorders, and develop a simulation-based statistical test to identify gene-specific enrichment of de novo mutations. We identified 285 genes that were significantly associated with developmental disorders, including 28 that had not previously been robustly associated with developmental disorders. Although we detected more genes associated with developmental disorders, much of the excess of de novo mutations in protein-coding genes remains unaccounted for. Modelling suggests that more than 1,000 genes associated with developmental disorders have not yet been described, many of which are likely to be less penetrant than the currently known genes. Research access to clinical diagnostic datasets will be critical for completing the map of genes associated with developmental disorders.
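The abstract does not spell out its simulation-based test, so the sketch below is not the authors' method. It illustrates the general idea of gene-specific enrichment with a simpler, commonly used stand-in: comparing an observed de novo count against a Poisson expectation. The per-gene mutation rate and the observed count are hypothetical values chosen only for illustration; the trio count is the one quoted in the abstract.

```python
import math


def poisson_sf(observed: int, expected: float) -> float:
    """Return P(X >= observed) for X ~ Poisson(expected)."""
    if observed <= 0:
        return 1.0
    cdf_below = sum(
        math.exp(-expected) * expected**k / math.factorial(k)
        for k in range(observed)
    )
    return max(0.0, 1.0 - cdf_below)


def expected_de_novos(per_gene_rate: float, n_trios: int) -> float:
    """Expected de novo count in one gene across all trios.

    Assumption: per_gene_rate is per haploid genome per generation,
    so each child contributes two opportunities for a new mutation.
    """
    return 2 * per_gene_rate * n_trios


N_TRIOS = 31_058            # from the abstract
HYPOTHETICAL_RATE = 1e-6    # illustrative per-gene mutation rate
HYPOTHETICAL_OBSERVED = 5   # illustrative observed de novo count

expected = expected_de_novos(HYPOTHETICAL_RATE, N_TRIOS)
p_value = poisson_sf(HYPOTHETICAL_OBSERVED, expected)
print(f"expected={expected:.4f}  p={p_value:.3g}")
```

In a genome-wide scan, a p-value like this would still need correction for the number of genes tested.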
Subjects
DNA Mutational Analysis, Data Analysis, Genetic Databases, Datasets as Topic, Delivery of Health Care/statistics & numerical data, Developmental Disabilities/genetics, Inborn Genetic Diseases/genetics, Cohort Studies, DNA Copy Number Variations/genetics, Developmental Disabilities/diagnosis, Europe, Female, Inborn Genetic Diseases/diagnosis, Germ-Line Mutation/genetics, Haploinsufficiency/genetics, Humans, Male, Missense Mutation/genetics, Penetrance, Perinatal Death, Sample Size
ABSTRACT
Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-function variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes [1]. Here we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence predicted loss-of-function variants in this cohort after filtering for artefacts caused by sequencing and annotation errors. Using an improved model of human mutation rates, we classify human protein-coding genes along a spectrum that represents tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve the power of gene discovery for both common and rare diseases.
Subjects
Exome/genetics, Essential Genes/genetics, Genetic Variation/genetics, Human Genome/genetics, Adult, Brain/metabolism, Cardiovascular Diseases/genetics, Cohort Studies, Genetic Databases, Female, Genetic Predisposition to Disease/genetics, Genome-Wide Association Study, Humans, Loss of Function Mutation/genetics, Male, Mutation Rate, Proprotein Convertase 9/genetics, Messenger RNA/genetics, Reproducibility of Results, Exome Sequencing, Whole Genome Sequencing
ABSTRACT
Clinical genetic testing of protein-coding regions identifies a likely causative variant in only around half of developmental disorder (DD) cases. The contribution of regulatory variation in non-coding regions to rare disease, including DD, remains very poorly understood. We screened 9,858 probands from the Deciphering Developmental Disorders (DDD) study for de novo mutations in the 5' untranslated regions (5' UTRs) of genes within which variants have previously been shown to cause DD through a dominant haploinsufficient mechanism. We identified four single-nucleotide variants and two copy-number variants upstream of MEF2C in a total of ten individual probands. We developed multiple bespoke and orthogonal experimental approaches to demonstrate that these variants cause DD through three distinct loss-of-function mechanisms, disrupting transcription, translation, and/or protein function. These non-coding region variants represent 23% of likely diagnoses identified in MEF2C in the DDD cohort, but these would all be missed in standard clinical genetics approaches. Nonetheless, these variants are readily detectable in exome sequence data, with 30.7% of 5' UTR bases across all genes well covered in the DDD dataset. Our analyses show that non-coding variants upstream of genes within which coding variants are known to cause DD are an important cause of severe disease and demonstrate that analyzing 5' UTRs can increase diagnostic yield. We also show how non-coding variants can help inform both the disease-causing mechanism underlying protein-coding variants and dosage tolerance of the gene.
Subjects
5' Untranslated Regions, Developmental Disabilities/etiology, Genetic Predisposition to Disease, Loss of Function Mutation, Child, Cohort Studies, DNA Copy Number Variations, Developmental Disabilities/pathology, Humans, MEF2 Transcription Factors/genetics, Exome Sequencing
ABSTRACT
This corrects the article DOI: 10.1038/nature19356.
ABSTRACT
A major goal of biomedicine is to understand the function of every gene in the human genome. Loss-of-function mutations can disrupt both copies of a given gene in humans and phenotypic analysis of such 'human knockouts' can provide insight into gene function. Consanguineous unions are more likely to result in offspring carrying homozygous loss-of-function mutations. In Pakistan, consanguinity rates are notably high. Here we sequence the protein-coding regions of 10,503 adult participants in the Pakistan Risk of Myocardial Infarction Study (PROMIS), designed to understand the determinants of cardiometabolic diseases in individuals from South Asia. We identified individuals carrying homozygous predicted loss-of-function (pLoF) mutations, and performed phenotypic analysis involving more than 200 biochemical and disease traits. We enumerated 49,138 rare (<1% minor allele frequency) pLoF mutations. These pLoF mutations are estimated to knock out 1,317 genes, each in at least one participant. Homozygosity for pLoF mutations at PLA2G7 was associated with absent enzymatic activity of soluble lipoprotein-associated phospholipase A2; at CYP2F1, with higher plasma interleukin-8 concentrations; at TREH, with lower concentrations of apoB-containing lipoprotein subfractions; at either A3GALT2 or NRG4, with markedly reduced plasma insulin C-peptide concentrations; and at SLC9A3R1, with mediators of calcium and phosphate signalling. Heterozygous deficiency of APOC3 has been shown to protect against coronary heart disease; we identified APOC3 homozygous pLoF carriers in our cohort. We recruited these human knockouts and challenged them with an oral fat load. Compared with family members lacking the mutation, individuals with APOC3 knocked out displayed marked blunting of the usual post-prandial rise in plasma triglycerides. 
Overall, these observations provide a roadmap for a 'human knockout project', a systematic effort to understand the phenotypic consequences of complete disruption of genes in humans.
Subjects
Consanguinity, DNA Mutational Analysis, Gene Deletion, Genes/genetics, Genetic Association Studies/methods, Homozygote, Phenotype, 1-Alkyl-2-acetylglycerophosphocholine Esterase/deficiency, 1-Alkyl-2-acetylglycerophosphocholine Esterase/genetics, Apolipoprotein C-III/deficiency, Apolipoprotein C-III/genetics, Cohort Studies, Coronary Disease/blood, Coronary Disease/genetics, Cytochrome P450 Family 2/genetics, Dietary Fats/pharmacology, Exome/genetics, Fasting/blood, Female, Gene Frequency, Humans, Interleukin-8/blood, Male, Middle Aged, Myocardial Infarction/blood, Myocardial Infarction/genetics, Neuregulins/genetics, Pakistan, Pedigree, Phosphoproteins/genetics, Postprandial Period, RNA Splice Sites/genetics, Reverse Genetics/methods, Sodium-Hydrogen Exchangers/genetics, Triglycerides/blood
ABSTRACT
Approximately one-third of all mammalian genes are essential for life. Phenotypes resulting from knockouts of these genes in mice have provided tremendous insight into gene function and congenital disorders. As part of the International Mouse Phenotyping Consortium effort to generate and phenotypically characterize 5,000 knockout mouse lines, here we identify 410 lethal genes during the production of the first 1,751 unique gene knockouts. Using a standardized phenotyping platform that incorporates high-resolution 3D imaging, we identify phenotypes at multiple time points for previously uncharacterized genes and additional phenotypes for genes with previously reported mutant phenotypes. Unexpectedly, our analysis reveals that incomplete penetrance and variable expressivity are common even on a defined genetic background. In addition, we show that human disease genes are enriched for essential genes, thus providing a dataset that facilitates the prioritization and validation of mutations identified in clinical sequencing efforts.
Subjects
Mammalian Embryo/embryology, Mammalian Embryo/metabolism, Essential Genes/genetics, Lethal Genes/genetics, Mutation/genetics, Phenotype, Animals, Conserved Sequence/genetics, Disease, Genome-Wide Association Study, High-Throughput Screening Assays, Humans, Three-Dimensional Imaging, Mice, Inbred C57BL Mice, Knockout Mice, Penetrance, Single Nucleotide Polymorphism/genetics, Sequence Homology
ABSTRACT
Large-scale reference data sets of human genetic variation are critical for the medical and functional interpretation of DNA sequence changes. Here we describe the aggregation and analysis of high-quality exome (protein-coding region) DNA sequence data for 60,706 individuals of diverse ancestries generated as part of the Exome Aggregation Consortium (ExAC). This catalogue of human genetic diversity contains an average of one variant every eight bases of the exome, and provides direct evidence for the presence of widespread mutational recurrence. We have used this catalogue to calculate objective metrics of pathogenicity for sequence variants, and to identify genes subject to strong selection against various classes of mutation; identifying 3,230 genes with near-complete depletion of predicted protein-truncating variants, with 72% of these genes having no currently established human disease phenotype. Finally, we demonstrate that these data can be used for the efficient filtering of candidate disease-causing variants, and for the discovery of human 'knockout' variants in protein-coding genes.
Subjects
Exome/genetics, Genetic Variation/genetics, DNA Mutational Analysis, Datasets as Topic, Humans, Phenotype, Proteome/genetics, Rare Diseases/genetics, Sample Size
ABSTRACT
Variation in RNA splicing (i.e., alternative splicing) plays an important role in many diseases. Variants near 5' and 3' splice sites often affect splicing, but the effects of these variants on splicing and disease have not been fully characterized beyond the two "essential" splice nucleotides flanking each exon. Here we provide quantitative measurements of tolerance to mutational disruptions by position and reference allele-alternative allele combinations. We show that certain reference alleles are particularly sensitive to mutations, regardless of the alternative alleles into which they are mutated. Using public RNA-seq data, we demonstrate that individuals carrying such variants have significantly lower levels of the correctly spliced transcript, compared to individuals without them, and confirm that these specific substitutions are highly enriched for known Mendelian mutations. Our results propose a more refined definition of the "splice region" and offer a new way to prioritize and provide functional interpretation of variants identified in diagnostic sequencing and association studies.
Subjects
Alternative Splicing/genetics, Mutation/genetics, Nucleotides/genetics, RNA Splice Sites/genetics, RNA Splicing/genetics, Alleles, Exons/genetics, Humans
ABSTRACT
Worldwide, hundreds of thousands of humans have had their genomes or exomes sequenced, and access to the resulting data sets can provide valuable information for variant interpretation and understanding gene function. Here, we present a lightweight, flexible browser framework to display large population datasets of genetic variation. We demonstrate its use for exome sequence data from 60,706 individuals in the Exome Aggregation Consortium (ExAC). The ExAC browser provides gene- and transcript-centric displays of variation, a critical view for clinical applications. Additionally, we provide a variant display, which includes population frequency and functional annotation data as well as short read support for the called variant. This browser is open-source, freely available at http://exac.broadinstitute.org, and has already been used extensively by clinical laboratories worldwide.
Subjects
Computational Biology/methods, Genetic Databases, Exome, Genomics/methods, Web Browser, Genome-Wide Association Study/methods, Humans, Software, User-Computer Interface
ABSTRACT
Using robust, integrated analysis of multiple genomic datasets, we show that genes depleted for non-synonymous de novo mutations form a subnetwork of 72 members under strong selective constraint. We further show this subnetwork is preferentially expressed in the early development of the human hippocampus and is enriched for genes mutated in neurological Mendelian disorders. We thus conclude that carefully orchestrated developmental processes are under strong constraint in early brain development, and perturbations caused by mutation have adverse outcomes subject to strong purifying selection. Our findings demonstrate that selective forces can act on groups of genes involved in the same process, supporting the notion that purifying selection can act coordinately on multiple genes. Our approach provides a statistically robust, interpretable way to identify the tissues and developmental times where groups of disease genes are active.
Subjects
Gene Regulatory Networks/genetics, Inborn Genetic Diseases/genetics, Genome/genetics, Hippocampus/embryology, Protein Interaction Maps/genetics, Genetic Variation/genetics, Humans, Genetic Models, Mutation/genetics
ABSTRACT
Autism spectrum disorders (ASD) are believed to have genetic and environmental origins, yet in only a modest fraction of individuals can specific causes be identified. To identify further genetic risk factors, here we assess the role of de novo mutations in ASD by sequencing the exomes of ASD cases and their parents (n = 175 trios). Fewer than half of the cases (46.3%) carry a missense or nonsense de novo variant, and the overall rate of mutation is only modestly higher than the expected rate. In contrast, the proteins encoded by genes that harboured de novo missense or nonsense mutations showed a higher degree of connectivity among themselves and to previous ASD genes as indexed by protein-protein interaction screens. The small increase in the rate of de novo events, when taken together with the protein interaction results, is consistent with an important but limited role for de novo point mutations in ASD, similar to that documented for de novo copy number variants. Genetic models incorporating these data indicate that most of the observed de novo events are unconnected to ASD; those that do confer risk are distributed across many genes and are incompletely penetrant (that is, not necessarily sufficient for disease). Our results support polygenic models in which spontaneous coding mutations in any of a large number of genes increase risk by 5- to 20-fold. Despite the challenge posed by such models, results from de novo events and a large parallel case-control study provide strong evidence in favour of CHD8 and KATNAL2 as genuine autism risk factors.
Subjects
Autistic Disorder/genetics, DNA-Binding Proteins/genetics, Exons/genetics, Genetic Predisposition to Disease/genetics, Mutation/genetics, Transcription Factors/genetics, Case-Control Studies, Exome/genetics, Family Health, Humans, Genetic Models, Multifactorial Inheritance/genetics, Phenotype, Poisson Distribution, Protein Interaction Maps
ABSTRACT
Reactive oxygen species (ROS) such as hydrogen peroxide (H2O2) govern cellular homeostasis by inducing signaling. H2O2 modulates the activity of phosphatases and many other signaling molecules through oxidation of critical cysteine residues, which led to the notion that initiation of ROS signaling is broad and nonspecific, and thus fundamentally distinct from other signaling pathways. Here, we report that H2O2 signaling bears hallmarks of a regular signal transduction cascade. It is controlled by hierarchical signaling events resulting in a focused response as the results place the mitochondrial respiratory chain upstream of tyrosine-protein kinase Lyn, Lyn upstream of tyrosine-protein kinase SYK (Syk), and Syk upstream of numerous targets involved in signaling, transcription, translation, metabolism, and cell cycle regulation. The active mediators of H2O2 signaling colocalize as H2O2 induces mitochondria-associated Lyn and Syk phosphorylation, and a pool of Lyn and Syk reside in the mitochondrial intermembrane space. Finally, the same intermediaries control the signaling response in tissues and species responsive to H2O2 as the respiratory chain, Lyn, and Syk were similarly required for H2O2 signaling in mouse B cells, fibroblasts, and chicken DT40 B cells. Consistent with a broad role, the Syk pathway is coexpressed across tissues, is of early metazoan origin, and displays evidence of evolutionary constraint in the human. These results suggest that H2O2 signaling is under control of a signal transduction pathway that links the respiratory chain to the mitochondrial intermembrane space-localized, ubiquitous, and ancient Syk pathway in hematopoietic and nonhematopoietic cells.
Subjects
Electron Transport, Hydrogen Peroxide/metabolism, Mitochondrial Membranes/metabolism, Signal Transduction, Animals, Cultured Cells, Chickens, Enzyme Activation, Intracellular Signaling Peptides and Proteins/metabolism, Mice, Phosphorylation, Protein-Tyrosine Kinases/metabolism, Reactive Oxygen Species/metabolism, Syk Kinase, Tyrosine/metabolism
ABSTRACT
Autism spectrum disorders (ASDs) are a highly heterogeneous group of conditions--phenotypically and genetically--although the link between phenotypic variation and differences in genetic architecture is unclear. This study aimed to determine whether differences in cognitive impairment and symptom severity reflect variation in the degree to which ASD cases reflect de novo or familial influences. Using data from more than 2,000 simplex cases of ASD, we examined the relationship between intelligence quotient (IQ), behavior and language assessments, and rate of de novo loss of function (LOF) mutations and family history of broadly defined psychiatric disease (depressive disorders, bipolar disorder, and schizophrenia; history of psychiatric hospitalization). Proband IQ was negatively associated with de novo LOF rate (P = 0.03) and positively associated with family history of psychiatric disease (P = 0.003). Female cases had a higher frequency of sporadic genetic events across the severity distribution (P = 0.01). High rates of LOF mutation and low frequencies of family history of psychiatric illness were seen in individuals who were unable to complete a traditional IQ test, a group with the greatest degree of language and behavioral impairment. These analyses provide strong evidence that familial risk for neuropsychiatric disease becomes more relevant to ASD etiology as cases become higher functioning. The findings of this study reinforce that there are many routes to the diagnostic category of autism and could lead to genetic studies with more specific insights into individual cases.
Subjects
Pervasive Child Development Disorders/diagnosis, Pervasive Child Development Disorders/genetics, Behavior, Bipolar Disorder/genetics, Pervasive Child Development Disorders/epidemiology, Cognition Disorders, Female, Genetic Predisposition to Disease, Humans, Intelligence Tests, Male, Mutation, Phenotype, Regression Analysis, Risk Factors, Schizophrenia/genetics, Seizures
ABSTRACT
We report on results from whole-exome sequencing (WES) of 1,039 subjects diagnosed with autism spectrum disorders (ASD) and 870 controls selected from the NIMH repository to be of similar ancestry to cases. The WES data came from two centers using different methods to produce sequence and to call variants from it. Therefore, an initial goal was to ensure the distribution of rare variation was similar for data from different centers. This proved straightforward by filtering called variants by fraction of missing data, read depth, and balance of alternative to reference reads. Results were evaluated using seven samples sequenced at both centers and by results from the association study. Next we addressed how the data and/or results from the centers should be combined. Gene-based analysis of association was an obvious choice, but should statistics for association be combined across centers (meta-analysis) or should data be combined and then analyzed (mega-analysis)? Because of the nature of many gene-based tests, we showed by theory and simulations that mega-analysis has better power than meta-analysis. Finally, before analyzing the data for association, we explored the impact of population structure on rare variant analysis in these data. Like other recent studies, we found evidence that population structure can confound case-control studies by the clustering of rare variants in ancestry space; yet, unlike some recent studies, for these data we found that principal component-based analyses were sufficient to control for ancestry and produce test statistics with appropriate distributions. After using a variety of gene-based tests and both meta- and mega-analysis, we found no new risk genes for ASD in this sample. Our results suggest that standard gene-based tests will require much larger samples of cases and controls before being effective for gene discovery, even for a disorder like ASD.
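The three quality filters named in the abstract (fraction of missing data, read depth, and balance of alternative to reference reads) can be sketched as a single predicate. The abstract gives no thresholds, so every cutoff below is a hypothetical placeholder, and the `VariantCall` record is an illustrative structure rather than the study's data format. The allele-balance check as written assumes a heterozygous call.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    missing_fraction: float  # fraction of samples with no genotype at this site
    read_depth: int          # total reads covering the site in this sample
    alt_reads: int           # reads supporting the alternative allele
    ref_reads: int           # reads supporting the reference allele


def passes_filters(
    call: VariantCall,
    max_missing: float = 0.10,                 # hypothetical cutoff
    min_depth: int = 10,                       # hypothetical cutoff
    balance_range: tuple = (0.3, 0.7),         # hypothetical cutoff
) -> bool:
    """Return True if the call passes all three quality filters."""
    informative_reads = call.alt_reads + call.ref_reads
    if informative_reads == 0:
        return False  # allele balance is undefined without supporting reads
    allele_balance = call.alt_reads / informative_reads
    low, high = balance_range
    return (
        call.missing_fraction <= max_missing
        and call.read_depth >= min_depth
        and low <= allele_balance <= high
    )


calls = [
    VariantCall(0.02, 30, 14, 16),  # well covered and balanced
    VariantCall(0.02, 30, 2, 28),   # skewed toward the reference allele
]
kept = [c for c in calls if passes_filters(c)]
print(len(kept))
```

Applying the same predicate to calls from both sequencing centers is one way to make the retained rare variation comparable before any association test.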
Subjects
Pervasive Child Development Disorders/genetics, Exome, Genome-Wide Association Study, Case-Control Studies, Child, Pervasive Child Development Disorders/physiopathology, Genetic Predisposition to Disease, Genetic Variation, Humans, Population Control, DNA Sequence Analysis, Software
ABSTRACT
Prioritizing missense variants for further experimental investigation is a key challenge in current sequencing studies for exploring complex and Mendelian diseases. A large number of in silico tools have been employed for the task of pathogenicity prediction, including PolyPhen-2, SIFT, FatHMM, MutationTaster-2, MutationAssessor, Combined Annotation Dependent Depletion, LRT, phyloP, and GERP++, as well as optimized methods of combining tool scores, such as Condel and Logit. Due to the wealth of these methods, an important practical question to answer is which of these tools generalize best, that is, correctly predict the pathogenic character of new variants. We here demonstrate in a study of 10 tools on five datasets that such a comparative evaluation of these tools is hindered by two types of circularity: they arise due to (1) the same variants or (2) different variants from the same protein occurring both in the datasets used for training and for evaluation of these tools, which may lead to overly optimistic results. We show that comparative evaluations of predictors that do not address these types of circularity may erroneously conclude that circularity confounded tools are most accurate among all tools, and may even outperform optimized combinations of tools.
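Both kinds of circularity described above arise when training and evaluation data share variants, or share proteins. One common safeguard, shown here only as an illustrative sketch and not as the procedure used in the study, is to split the data by protein so that no protein contributes variants to both sets. The function name, the identifiers, and the 20% test fraction are all assumptions.

```python
import random
from collections import defaultdict


def split_by_protein(variants, test_fraction=0.2, seed=0):
    """Split (variant_id, protein_id) pairs so no protein spans both sets.

    Whole proteins are assigned to the test set, which rules out both the
    same variant and different variants of the same protein appearing in
    training and evaluation data.
    """
    by_protein = defaultdict(list)
    for variant_id, protein_id in variants:
        by_protein[protein_id].append((variant_id, protein_id))

    proteins = sorted(by_protein)              # deterministic starting order
    random.Random(seed).shuffle(proteins)      # reproducible shuffle
    n_test = max(1, round(len(proteins) * test_fraction))
    test_proteins = set(proteins[:n_test])

    train = [v for p in proteins if p not in test_proteins for v in by_protein[p]]
    test = [v for p in proteins if p in test_proteins for v in by_protein[p]]
    return train, test


example = [("v1", "P1"), ("v2", "P1"), ("v3", "P2"),
           ("v4", "P3"), ("v5", "P4"), ("v6", "P5")]
train_set, test_set = split_by_protein(example, test_fraction=0.4, seed=1)
shared = {p for _, p in train_set} & {p for _, p in test_set}
print(len(train_set), len(test_set), shared)
```

A split like this only controls the evaluator's own data. It cannot remove overlap with the data a published tool was originally trained on, which has to be checked separately.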