Results 1 - 20 of 32
1.
Nucleic Acids Res ; 52(D1): D1143-D1154, 2024 Jan 05.
Article in English | MEDLINE | ID: mdl-38183205

ABSTRACT

Machine Learning-based scoring and classification of genetic variants aids the assessment of clinical findings and is employed to prioritize variants in diverse genetic studies and analyses. Combined Annotation-Dependent Depletion (CADD) is one of the first methods for the genome-wide prioritization of variants across different molecular functions and has been continuously developed and improved since its original publication. Here, we present our most recent release, CADD v1.7. We explored and integrated new annotation features, among them state-of-the-art protein language model scores (Meta ESM-1v), regulatory variant effect predictions (from sequence-based convolutional neural networks) and sequence conservation scores (Zoonomia). We evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, we tested CADD on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym and, for regulatory effect prediction, we used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The inclusion of new features further improved the overall performance of CADD. As with previous releases, all data sets, genome-wide CADD v1.7 scores, scripts for on-site scoring and an easy-to-use webserver are readily provided via https://cadd.bihealth.org/ or https://cadd.gs.washington.edu/ to the community.


Subject(s)
Genetic Variation , Genome, Human , Machine Learning , Software , Nucleotides , Humans
2.
bioRxiv ; 2023 Mar 06.
Article in English | MEDLINE | ID: mdl-36945371

ABSTRACT

The human genome contains millions of candidate cis-regulatory elements (CREs) with cell-type-specific activities that shape both health and myriad disease states. However, we lack a functional understanding of the sequence features that control the activity and cell-type-specific features of these CREs. Here, we used lentivirus-based massively parallel reporter assays (lentiMPRAs) to test the regulatory activity of over 680,000 sequences, representing a nearly comprehensive set of all annotated CREs among three cell types (HepG2, K562, and WTC11), finding 41.7% to be functional. By testing sequences in both orientations, we find promoters to have significant strand orientation effects. We also observe that their 200 nucleotide cores function as non-cell-type-specific 'on switches' providing similar expression levels to their associated gene. In contrast, enhancers have weaker orientation effects, but increased tissue-specific characteristics. Utilizing our lentiMPRA data, we develop sequence-based models to predict CRE function with high accuracy and delineate regulatory motifs. Testing an additional lentiMPRA library encompassing 60,000 CREs in all three cell types, we further identified factors that determine cell-type specificity. Collectively, our work provides an exhaustive catalog of functional CREs in three widely used cell lines, and showcases how large-scale functional measurements can be used to dissect regulatory grammar.

3.
BMC Bioinformatics ; 23(Suppl 2): 154, 2022 Dec 12.
Article in English | MEDLINE | ID: mdl-36510125

ABSTRACT

BACKGROUND: Cis-regulatory regions (CRRs) are non-coding regions of the DNA that fine control the spatio-temporal pattern of transcription; they are involved in a wide range of pivotal processes such as the development of specific cell-lines/tissues and the dynamic cell response to physiological stimuli. Recent studies showed that genetic variants occurring in CRRs are strongly correlated with pathogenicity or deleteriousness. Considering the central role of CRRs in the regulation of physiological and pathological conditions, the correct identification of CRRs and of their tissue-specific activity status through Machine Learning methods plays a major role in dissecting the impact of genetic variants on human diseases. Unfortunately, the problem is still open, though some promising results have been already reported by (deep) machine-learning based methods that predict active promoters and enhancers in specific tissues or cell lines by encoding epigenetic or spectral features directly extracted from DNA sequences. RESULTS: We present the experiments we performed to compare two Deep Neural Networks, a Feed-Forward Neural Network model working on epigenomic features, and a Convolutional Neural Network model working only on genomic sequence, targeted to the identification of enhancer- and promoter-activity in specific cell lines. While performing experiments to understand how the experimental setup influences the prediction performance of the methods, we particularly focused on (1) automatic model selection performed by Bayesian optimization and (2) exploring different data rebalancing setups for reducing negative unbalancing effects. 
CONCLUSIONS: Results show that (1) automatic model selection by Bayesian optimization improves the quality of the learner; (2) data rebalancing considerably impacts the prediction performance of the models; test set rebalancing may provide over-optimistic results, and should therefore be cautiously applied; (3) despite working on sequence data, convolutional models obtain performance close to those of feed forward models working on epigenomic information, which suggests that also sequence data carries informative content for CRR-activity prediction. We therefore suggest combining both models/data types in future works.


Subject(s)
Deep Learning , Humans , Bayes Theorem , Regulatory Sequences, Nucleic Acid , Neural Networks, Computer , Machine Learning
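
The caution about test set rebalancing in entry 3 can be illustrated with a small arithmetic sketch. It assumes a hypothetical classifier with fixed true-positive and false-positive rates (the values 0.8 and 0.05 and the class sizes are illustrative, not taken from the paper) and compares precision on a 1:1 test set with precision on a 1:100 test set.

```python
def precision(tpr: float, fpr: float, n_pos: int, n_neg: int) -> float:
    """Expected precision for given true/false positive rates and class sizes."""
    tp = tpr * n_pos   # expected true positives
    fp = fpr * n_neg   # expected false positives
    return tp / (tp + fp)

# Hypothetical classifier: same TPR and FPR in both settings (assumed values).
balanced = precision(0.8, 0.05, n_pos=100, n_neg=100)     # rebalanced test set
natural = precision(0.8, 0.05, n_pos=100, n_neg=10_000)   # imbalanced test set

print(f"precision, 1:1 test set:   {balanced:.3f}")
print(f"precision, 1:100 test set: {natural:.3f}")
```

The classifier itself is unchanged between the two calls; only the class ratio of the test set differs, which is enough to make the rebalanced estimate look far better.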
4.
J Clin Endocrinol Metab ; 107(7): e3048-e3057, 2022 06 16.
Article in English | MEDLINE | ID: mdl-35276006

ABSTRACT

CONTEXT: Many different inherited and acquired conditions can result in premature bone fragility/low bone mass disorders (LBMDs). OBJECTIVE: We aimed to elucidate the impact of genetic testing on differential diagnosis of adult LBMDs and at defining clinical criteria for predicting monogenic forms. METHODS: Four clinical centers broadly recruited a cohort of 394 unrelated adult women before menopause and men younger than 55 years with a bone mineral density (BMD) Z-score < -2.0 and/or pathological fractures. After exclusion of secondary causes or unequivocal clinical/biochemical hallmarks of monogenic LBMDs, all participants were genotyped by targeted next-generation sequencing. RESULTS: In total, 20.8% of the participants carried rare disease-causing variants (DCVs) in genes known to cause osteogenesis imperfecta (COL1A1, COL1A2), hypophosphatasia (ALPL), and early-onset osteoporosis (LRP5, PLS3, and WNT1). In addition, we identified rare DCVs in ENPP1, LMNA, NOTCH2, and ZNF469. Three individuals had autosomal recessive, 75 autosomal dominant, and 4 X-linked disorders. A total of 9.7% of the participants harbored variants of unknown significance. A regression analysis revealed that the likelihood of detecting a DCV correlated with a positive family history of osteoporosis, peripheral fractures (> 2), and a high normal body mass index (BMI). In contrast, mutation frequencies did not correlate with age, prevalent vertebral fractures, BMD, or biochemical parameters. In individuals without monogenic disease-causing rare variants, common variants predisposing for low BMD (eg, in LRP5) were overrepresented. CONCLUSION: The overlapping spectra of monogenic adult LBMD can be easily disentangled by genetic testing and the proposed clinical criteria can help to maximize the diagnostic yield.


Subject(s)
Osteogenesis Imperfecta , Osteoporosis , Spinal Fractures , Adult , Bone Density/genetics , Female , Genotype , High-Throughput Nucleotide Sequencing , Humans , Male , Mutation , Osteogenesis Imperfecta/diagnosis , Osteogenesis Imperfecta/genetics , Osteoporosis/diagnosis , Osteoporosis/genetics
5.
Gigascience ; 12, 2022 12 28.
Article in English | MEDLINE | ID: mdl-37083939

ABSTRACT

BACKGROUND: Genome sequencing efforts for individuals with rare Mendelian disease have increased the research focus on the noncoding genome and the clinical need for methods that prioritize potentially disease causal noncoding variants. Some tools for assessment of variant pathogenicity as well as annotations are not available for the current human genome build (GRCh38), for which the adoption in databases, software, and pipelines was slow. RESULTS: Here, we present an updated version of the Regulatory Mendelian Mutation (ReMM) score, retrained on features and variants derived from the GRCh38 genome build. Like its GRCh37 version, it achieves good performance on its highly imbalanced data. To improve accessibility and provide users with a toolbox to score their variant files and look up scores in the genome, we developed a website and API for easy score lookup. CONCLUSIONS: Scores of the GRCh38 genome build are highly correlated to the prior release with a performance increase due to the better coverage of features. For prioritization of noncoding mutations in imbalanced datasets, the ReMM score performed much better than other variation scores. Prescored whole-genome files of GRCh37 and GRCh38 genome builds are cited in the article and the website; UCSC genome browser tracks, and an API are available at https://remm.bihealth.org.


Subject(s)
Genome, Human , Software , Humans , Mutation , Databases, Genetic
6.
Genome Med ; 13(1): 31, 2021 02 22.
Article in English | MEDLINE | ID: mdl-33618777

ABSTRACT

BACKGROUND: Splicing of genomic exons into mRNAs is a critical prerequisite for the accurate synthesis of human proteins. Genetic variants impacting splicing underlie a substantial proportion of genetic disease, but are challenging to identify beyond those occurring at donor and acceptor dinucleotides. To address this, various methods aim to predict variant effects on splicing. Recently, deep neural networks (DNNs) have been shown to achieve better results in predicting splice variants than other strategies. METHODS: It has been unclear how best to integrate such process-specific scores into genome-wide variant effect predictors. Here, we use a recently published experimental data set to compare several machine learning methods that score variant effects on splicing. We integrate the best of those approaches into general variant effect prediction models and observe the effect on classification of known pathogenic variants. RESULTS: We integrate two specialized splicing scores into CADD (Combined Annotation Dependent Depletion; cadd.gs.washington.edu ), a widely used tool for genome-wide variant effect prediction that we previously developed to weight and integrate diverse collections of genomic annotations. With this new model, CADD-Splice, we show that inclusion of splicing DNN effect scores substantially improves predictions across multiple variant categories, without compromising overall performance. CONCLUSIONS: While splice effect scores show superior performance on splice variants, specialized predictors cannot compete with other variant scores in general variant interpretation, as the latter account for nonsense and missense effects that do not alter splicing. Although only shown here for splice scores, we believe that the applied approach will generalize to other specific molecular processes, providing a path for the further improvement of genome-wide variant effect prediction.


Subject(s)
Deep Learning , Genetic Variation , Genome-Wide Association Study , RNA Splicing/genetics , Base Sequence , Exons/genetics , Humans , Introns/genetics
8.
PLoS One ; 15(12): e0237412, 2020.
Article in English | MEDLINE | ID: mdl-33259518

ABSTRACT

Regulatory regions, like promoters and enhancers, cover an estimated 5-15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. By training gkm-SVM and CNN models on open chromatin data and corresponding negative training dataset, both learners and two approaches for negative training data are compared. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell-type, predicting cell-type specific elements, and predicting elements' relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing overall best results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs achieved and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need of hyperparameter optimization.


Subject(s)
Regulatory Sequences, Nucleic Acid/genetics , A549 Cells , Cell Line, Tumor , Chromatin/genetics , DNA/genetics , Genome/genetics , Genomics/methods , HeLa Cells , Hep G2 Cells , Humans , K562 Cells , MCF-7 Cells , Neural Networks, Computer , Promoter Regions, Genetic/genetics , Sequence Analysis, DNA , Support Vector Machine
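
One of the two negative-set strategies named in entry 8, sequence shuffles of the positive sequences, can be sketched as follows. This is a minimal illustration using a plain mononucleotide shuffle; the function names and example sequences are hypothetical, and the paper's actual shuffling procedure (for example, whether k-mer frequencies are preserved) may differ.

```python
import random

def shuffled_negative(seq: str, rng: random.Random) -> str:
    """Return a random permutation of seq, keeping its nucleotide composition."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def build_negative_set(positives: list[str], seed: int = 0) -> list[str]:
    """Create one shuffled negative per positive sequence (illustrative helper)."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [shuffled_negative(s, rng) for s in positives]

# Toy positive sequences (made up for the example).
positives = ["ACGTACGTGG", "TTGACCAGTA"]
negatives = build_negative_set(positives)

for pos, neg in zip(positives, negatives):
    print(pos, "->", neg)
```

Each negative has the same length and base composition as its positive but no preserved motif order, which is consistent with the abstract's observation that such models can end up learning features of artificial sequences.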
9.
Nat Protoc ; 15(8): 2387-2412, 2020 08.
Article in English | MEDLINE | ID: mdl-32641802

ABSTRACT

Massively parallel reporter assays (MPRAs) can simultaneously measure the function of thousands of candidate regulatory sequences (CRSs) in a quantitative manner. In this method, CRSs are cloned upstream of a minimal promoter and reporter gene, alongside a unique barcode, and introduced into cells. If the CRS is a functional regulatory element, it will lead to the transcription of the barcode sequence, which is measured via RNA sequencing and normalized for cellular integration via DNA sequencing of the barcode. This technology has been used to test thousands of sequences and their variants for regulatory activity, to decipher the regulatory code and its evolution, and to develop genetic switches. Lentivirus-based MPRA (lentiMPRA) produces 'in-genome' readouts and enables the use of this technique in hard-to-transfect cells. Here, we provide a detailed protocol for lentiMPRA, along with a user-friendly Nextflow-based computational pipeline-MPRAflow-for quantifying CRS activity from different MPRA designs. The lentiMPRA protocol takes ~2 months, which includes sequencing turnaround time and data processing with MPRAflow.


Subject(s)
Lentivirus/genetics , Regulatory Sequences, Nucleic Acid/genetics , Sequence Analysis, DNA/methods , Workflow , Base Sequence
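
The normalization described in entry 9, where RNA barcode counts are divided by DNA barcode counts, can be sketched as a log2 ratio per candidate regulatory sequence (CRS). This is a simplified illustration and not the MPRAflow implementation: the function name, the pseudocount, and the toy counts are assumptions, and a real pipeline would typically also account for sequencing depth and replicates.

```python
import math
from collections import defaultdict

def crs_activity(barcode_counts, pseudocount: float = 1.0) -> dict[str, float]:
    """Compute log2(RNA/DNA) per CRS from (crs_id, rna_count, dna_count) tuples.

    Counts of all barcodes assigned to the same CRS are summed first.
    The pseudocount (an assumption here) avoids division by zero.
    """
    rna = defaultdict(float)
    dna = defaultdict(float)
    for crs_id, rna_count, dna_count in barcode_counts:
        rna[crs_id] += rna_count
        dna[crs_id] += dna_count
    return {
        crs_id: math.log2((rna[crs_id] + pseudocount) / (dna[crs_id] + pseudocount))
        for crs_id in rna
    }

# Toy data: two barcodes for crs1, one barcode for crs2.
counts = [("crs1", 30, 10), ("crs1", 33, 5), ("crs2", 7, 15)]
activity = crs_activity(counts)
print(activity)
```

With these toy counts, crs1 has more RNA than DNA reads and gets a positive score, while crs2 gets a negative one.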
10.
Gigascience ; 9(5), 2020 05 01.
Article in English | MEDLINE | ID: mdl-32444882

ABSTRACT

BACKGROUND: Several prediction problems in computational biology and genomic medicine are characterized by both big data as well as a high imbalance between examples to be learned, whereby positive examples can represent a tiny minority with respect to negative examples. For instance, deleterious or pathogenic variants are overwhelmed by the sea of neutral variants in the non-coding regions of the genome: thus, the prediction of deleterious variants is a challenging, highly imbalanced classification problem, and classical prediction tools fail to detect the rare pathogenic examples among the huge amount of neutral variants or undergo severe restrictions in managing big genomic data. RESULTS: To overcome these limitations we propose parSMURF, a method that adopts a hyper-ensemble approach and oversampling and undersampling techniques to deal with imbalanced data, and parallel computational techniques to both manage big genomic data and substantially speed up the computation. The synergy between Bayesian optimization techniques and the parallel nature of parSMURF enables efficient and user-friendly automatic tuning of the hyper-parameters of the algorithm, and allows specific learning problems in genomic medicine to be easily fit. Moreover, by using MPI parallel and machine learning ensemble techniques, parSMURF can manage big data by partitioning them across the nodes of a high-performance computing cluster. Results with synthetic data and with single-nucleotide variants associated with Mendelian diseases and with genome-wide association study hits in the non-coding regions of the human genome, involving millions of examples, show that parSMURF achieves state-of-the-art results and an 80-fold speed-up with respect to the sequential version.
CONCLUSIONS: parSMURF is a parallel machine learning tool that can be trained to learn different genomic problems, and its multiple levels of parallelization and high scalability allow us to efficiently fit problems characterized by big and imbalanced genomic data. The C++ OpenMP multi-core version tailored to a single workstation and the C++ MPI/OpenMP hybrid multi-core and multi-node parSMURF version tailored to a High Performance Computing cluster are both available at https://github.com/AnacletoLAB/parSMURF.


Subject(s)
Computational Biology/methods , Genetic Predisposition to Disease , Genetic Variation , Genome-Wide Association Study/methods , Software , Algorithms , Databases, Genetic , Genomics/methods , Humans , Machine Learning , Reproducibility of Results
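
The combination of oversampling and undersampling mentioned in entry 10 can be sketched in a few lines. This is not the parSMURF algorithm: it uses plain random duplication of positives and random subsampling of negatives as a stand-in, and the function name, `over_factor`, and `neg_per_pos` are hypothetical parameters chosen for the example.

```python
import random

def rebalance(pos, neg, over_factor: int = 2, neg_per_pos: int = 3, seed: int = 0):
    """Randomly oversample the minority class and undersample the majority class.

    over_factor: target size of the positive set relative to its original size.
    neg_per_pos: number of negatives kept per (oversampled) positive.
    """
    rng = random.Random(seed)
    # Oversampling: add randomly chosen duplicates of existing positives.
    extra = [rng.choice(pos) for _ in range(len(pos) * (over_factor - 1))]
    pos_out = list(pos) + extra
    # Undersampling: keep a random subset of negatives without replacement.
    n_neg = min(len(neg), neg_per_pos * len(pos_out))
    neg_out = rng.sample(list(neg), n_neg)
    return pos_out, neg_out

# Toy data: 2 positives against 1000 negatives.
pos = ["p1", "p2"]
neg = [f"n{i}" for i in range(1000)]
pos_bal, neg_bal = rebalance(pos, neg)
print(len(pos_bal), len(neg_bal))
```

In an ensemble setting, a resampling step of this kind would be repeated on different partitions of the negatives so that each base learner sees a more balanced training set.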
11.
Nat Commun ; 10(1): 3583, 2019 08 08.
Article in English | MEDLINE | ID: mdl-31395865

ABSTRACT

The majority of common variants associated with common diseases, as well as an unknown proportion of causal mutations for rare diseases, fall in noncoding regions of the genome. Although catalogs of noncoding regulatory elements are steadily improving, we have a limited understanding of the functional effects of mutations within them. Here, we perform saturation mutagenesis in conjunction with massively parallel reporter assays on 20 disease-associated gene promoters and enhancers, generating functional measurements for over 30,000 single nucleotide substitutions and deletions. We find that the density of putative transcription factor binding sites varies widely between regulatory elements, as does the extent to which evolutionary conservation or integrative scores predict functional effects. These data provide a powerful resource for interpreting the pathogenicity of clinically observed mutations in these disease-associated regulatory elements, and comprise a rich dataset for the further development of algorithms that aim to predict the regulatory effects of noncoding mutations.


Subject(s)
Computational Biology/methods , Disease/genetics , Mutagenesis , Regulatory Elements, Transcriptional/genetics , Cell Line , Cloning, Molecular , Genome, Human/genetics , Genomic Library , High-Throughput Nucleotide Sequencing , Humans , Polymorphism, Single Nucleotide
12.
Am J Hum Genet ; 105(3): 631-639, 2019 09 05.
Article in English | MEDLINE | ID: mdl-31353024

ABSTRACT

Notch signaling is an established developmental pathway for brain morphogenesis. Given that Delta-like 1 (DLL1) is a ligand for the Notch receptor and that a few individuals with developmental delay, intellectual disability, and brain malformations have microdeletions encompassing DLL1, we hypothesized that insufficiency of DLL1 causes a human neurodevelopmental disorder. We performed exome sequencing in individuals with neurodevelopmental disorders. The cohort was identified using known Matchmaker Exchange nodes such as GeneMatcher. This method identified 15 individuals from 12 unrelated families with heterozygous pathogenic DLL1 variants (nonsense, missense, splice site, and one whole gene deletion). The most common features in our cohort were intellectual disability, autism spectrum disorder, seizures, variable brain malformations, muscular hypotonia, and scoliosis. We did not identify an obvious genotype-phenotype correlation. Analysis of one splice site variant showed an in-frame insertion of 12 bp. In conclusion, heterozygous DLL1 pathogenic variants cause a variable neurodevelopmental phenotype and multi-systemic features. The clinical and molecular data support haploinsufficiency as a mechanism for the pathogenesis of this DLL1-related disorder and affirm the importance of DLL1 in human brain development.


Subject(s)
Calcium-Binding Proteins/genetics , Haploinsufficiency , Membrane Proteins/genetics , Neurodevelopmental Disorders/genetics , Cohort Studies , Female , Humans , Ligands , Male , Pedigree , Exome Sequencing
13.
Hum Mutat ; 40(9): 1280-1291, 2019 09.
Article in English | MEDLINE | ID: mdl-31106481

ABSTRACT

The integrative analysis of high-throughput reporter assays, machine learning, and profiles of epigenomic chromatin state in a broad array of cells and tissues has the potential to significantly improve our understanding of noncoding regulatory element function and its contribution to human disease. Here, we report results from the CAGI 5 regulation saturation challenge where participants were asked to predict the impact of nucleotide substitution at every base pair within five disease-associated human enhancers and nine disease-associated promoters. A library of mutations covering all bases was generated by saturation mutagenesis and altered activity was assessed in a massively parallel reporter assay (MPRA) in relevant cell lines. Reporter expression was measured relative to plasmid DNA to determine the impact of variants. The challenge was to predict the functional effects of variants on reporter expression. Comparative analysis of the full range of submitted prediction results identifies the most successful models of transcription factor binding sites, machine learning algorithms, and ways to choose among or incorporate diverse datatypes and cell-types for training computational models. These results have the potential to improve the design of future studies on more diverse sets of regulatory elements and aid the interpretation of disease-associated genetic variation.


Subject(s)
DNA/chemistry , Epigenomics/methods , Point Mutation , Binding Sites , Cell Line , Chromatin/genetics , DNA/metabolism , Enhancer Elements, Genetic , Genetic Predisposition to Disease , Humans , Machine Learning , Promoter Regions, Genetic , Transcription Factors/metabolism
14.
Sci Rep ; 8(1): 14611, 2018 10 02.
Article in English | MEDLINE | ID: mdl-30279461

ABSTRACT

A genome-wide evaluation of the effects of ionizing radiation on mutation induction in the mouse germline has identified multisite de novo mutations (MSDNs) as a marker for previous exposure. Here we present the results of a small pilot study of whole genome sequencing in offspring of soldiers who served in radar units on weapon systems that were emitting high-frequency radiation. We found cases of exceptionally high MSDN rates as well as an increased mean in our cohort: while an MSDN is detected on average in 1 out of 5 offspring of unexposed controls, we observed 12 MSDNs in 18 offspring altogether, including a family with 6 MSDNs in 3 offspring. Moreover, we found two translocations, also resulting from neighboring mutations. Our findings indicate that MSDNs might be suited in principle for the assessment of DNA damage from ionizing radiation also in humans. However, as exact person-related dose values in risk groups are usually not available, the interpretation of MSDNs in single families would benefit from larger molecular epidemiologic studies on this new biomarker.


Subject(s)
Genome, Human , Germ-Line Mutation , Paternal Exposure , Radiation, Ionizing , Adult , Animals , Base Sequence , Cohort Studies , Computational Biology/methods , Female , Humans , Infant, Newborn , Male , Mice , Military Personnel , Mutation Rate , Pilot Projects , Risk Factors , Whole Genome Sequencing
15.
J Transl Med ; 16(1): 23, 2018 02 06.
Article in English | MEDLINE | ID: mdl-29409514

ABSTRACT

BACKGROUND: Cancer vaccines can effectively establish clinically relevant tumor immunity. Novel sequencing approaches rapidly identify the mutational fingerprint of tumors, thus allowing to generate personalized tumor vaccines within a few weeks from diagnosis. Here, we report the case of a 62-year-old patient receiving a four-peptide-vaccine targeting the two sole mutations of his pancreatic tumor, identified via exome sequencing. METHODS: Vaccination started during chemotherapy in second complete remission and continued monthly thereafter. We tracked IFN-γ+ T cell responses against vaccine peptides in peripheral blood after 12, 17 and 34 vaccinations by analyzing T-cell receptor (TCR) repertoire diversity and epitope-binding regions of peptide-reactive T-cell lines and clones. By restricting analysis to sorted IFN-γ-producing T cells we could assure epitope-specificity, functionality, and TH1 polarization. RESULTS: A peptide-specific T-cell response against three of the four vaccine peptides could be detected sequentially. Molecular TCR analysis revealed a broad vaccine-reactive TCR repertoire with clones of discernible specificity. Four identical or convergent TCR sequences could be identified at more than one time-point, indicating timely persistence of vaccine-reactive T cells. One dominant TCR expressing a dual TCRVα chain could be found in three T-cell clones. The observed T-cell responses possibly contributed to clinical outcome: The patient is alive 6 years after initial diagnosis and in complete remission for 4 years now. CONCLUSIONS: Therapeutic vaccination with a neoantigen-derived four-peptide vaccine resulted in a diverse and long-lasting immune response against these targets which was associated with prolonged clinical remission. These data warrant confirmation in a larger proof-of concept clinical trial.


Subject(s)
CD4-Positive T-Lymphocytes/immunology , Cancer Vaccines/immunology , Carcinoma, Pancreatic Ductal/therapy , Epitopes/immunology , Monitoring, Immunologic , Pancreatic Neoplasms/therapy , Receptors, Antigen, T-Cell, alpha-beta/genetics , Vaccines, Subunit/immunology , Amino Acid Sequence , Carcinoma, Pancreatic Ductal/blood , Carcinoma, Pancreatic Ductal/immunology , Carcinoma, Pancreatic Ductal/secondary , Humans , Male , Middle Aged , Pancreatic Neoplasms/blood , Pancreatic Neoplasms/immunology , Pancreatic Neoplasms/secondary , Peptides/chemistry , Peptides/immunology , Treatment Outcome , Vaccination
16.
Genome Med ; 10(1): 3, 2018 01 09.
Article in English | MEDLINE | ID: mdl-29310717

ABSTRACT

BACKGROUND: Glycosylphosphatidylinositol biosynthesis defects (GPIBDs) cause a group of phenotypically overlapping recessive syndromes with intellectual disability, for which pathogenic mutations have been described in 16 genes of the corresponding molecular pathway. An elevated serum activity of alkaline phosphatase (AP), a GPI-linked enzyme, has been used to assign GPIBDs to the phenotypic series of hyperphosphatasia with mental retardation syndrome (HPMRS) and to distinguish them from another subset of GPIBDs, termed multiple congenital anomalies hypotonia seizures syndrome (MCAHS). However, the increasing number of individuals with a GPIBD shows that hyperphosphatasia is a variable feature that is not ideal for a clinical classification. METHODS: We studied the discriminatory power of multiple GPI-linked substrates that were assessed by flow cytometry in blood cells and fibroblasts of 39 and 14 individuals with a GPIBD, respectively. On the phenotypic level, we evaluated the frequency of occurrence of clinical symptoms and analyzed the performance of computer-assisted image analysis of the facial gestalt in 91 individuals. RESULTS: We found that certain malformations such as Morbus Hirschsprung and diaphragmatic defects are more likely to be associated with particular gene defects (PIGV, PGAP3, PIGN). However, especially at the severe end of the clinical spectrum of HPMRS, there is a high phenotypic overlap with MCAHS. Elevation of AP has also been documented in some of the individuals with MCAHS, namely those with PIGA mutations. Although the impairment of GPI-linked substrates is supposed to play the key role in the pathophysiology of GPIBDs, we could not observe gene-specific profiles for flow cytometric markers or a correlation between their cell surface levels and the severity of the phenotype. In contrast, it was facial recognition software that achieved the highest accuracy in predicting the disease-causing gene in a GPIBD. 
CONCLUSIONS: Due to the overlapping clinical spectrum of both HPMRS and MCAHS in the majority of affected individuals, the elevation of AP and the reduced surface levels of GPI-linked markers in both groups, a common classification as GPIBDs is recommended. The effectiveness of computer-assisted gestalt analysis for the correct gene inference in a GPIBD and probably beyond is remarkable and illustrates how the information contained in human faces is pivotal in the delineation of genetic entities.


Subject(s)
Flow Cytometry/methods , Glycosylphosphatidylinositols/biosynthesis , Image Processing, Computer-Assisted , Abnormalities, Multiple/metabolism , Automation , Biomarkers/metabolism , Humans , Intellectual Disability/metabolism , Phenotype , Phosphorus Metabolism Disorders/metabolism , Syndrome
17.
BMC Bioinformatics ; 18(1): 449, 2017 Oct 12.
Article in English | MEDLINE | ID: mdl-29025394

ABSTRACT

BACKGROUND: The prediction of human gene-abnormal phenotype associations is a fundamental step toward the discovery of novel genes associated with human disorders, especially when no genes are known to be associated with a specific disease. In this context the Human Phenotype Ontology (HPO) provides a standard categorization of the abnormalities associated with human diseases. While the problem of the prediction of gene-disease associations has been widely investigated, the related problem of gene-phenotypic feature (i.e., HPO term) associations has been largely overlooked, even if for most human genes no HPO term associations are known and despite the increasing application of the HPO to relevant medical problems. Moreover most of the methods proposed in literature are not able to capture the hierarchical relationships between HPO terms, thus resulting in inconsistent and relatively inaccurate predictions. RESULTS: We present two hierarchical ensemble methods that we formally prove to provide biologically consistent predictions according to the hierarchical structure of the HPO. The modular structure of the proposed methods, that consists in a "flat" learning first step and a hierarchical combination of the predictions in the second step, allows the predictions of virtually any flat learning method to be enhanced. The experimental results show that hierarchical ensemble methods are able to predict novel associations between genes and abnormal phenotypes with results that are competitive with state-of-the-art algorithms and with a significant reduction of the computational complexity. CONCLUSIONS: Hierarchical ensembles are efficient computational methods that guarantee biologically meaningful predictions that obey the true path rule, and can be used as a tool to improve and make consistent the HPO terms predictions starting from virtually any flat learning method. The implementation of the proposed methods is available as an R package from the CRAN repository.


Subject(s)
Algorithms , Biological Ontologies , Area Under Curve , Genetic Association Studies , Humans , Molecular Sequence Annotation , Phenotype , ROC Curve
18.
Sci Rep ; 7(1): 2959, 2017 06 07.
Article in English | MEDLINE | ID: mdl-28592878

ABSTRACT

Disease- and trait-associated variants represent a tiny minority of all known genetic variation, so there is necessarily an imbalance between the small set of available disease-associated variants and the much larger set of non-deleterious genomic variation, especially in non-coding regulatory regions of the human genome. Machine Learning (ML) methods for predicting disease-associated non-coding variants face a chicken-and-egg problem: such variants cannot be easily found without ML, but ML cannot begin to be effective until a sufficient number of instances have been found. Most state-of-the-art ML-based methods do not adopt specific imbalance-aware learning techniques to deal with the imbalanced data that naturally arise in several genome-wide variant scoring problems, which results in a significant reduction of sensitivity and precision. We present a novel method that adopts imbalance-aware learning strategies based on resampling techniques and a hyper-ensemble approach, and that outperforms state-of-the-art methods in two different contexts: the prediction of non-coding variants associated with Mendelian and with complex diseases. We show that imbalance-aware ML is key to the design of robust and accurate prediction algorithms, and we provide a method and an easy-to-use software tool that can be effectively applied to this challenging prediction task.
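
As a rough illustration of resampling for an ensemble, the sketch below splits the majority (negative) class into disjoint slices, undersamples each slice, and pairs it with the full minority (positive) set, so that each ensemble member would train on a roughly balanced subset. It is a generic sketch, not the authors' tool: the function name `balanced_partitions`, the `ratio` parameter, and the toy data are assumptions, and any minority oversampling, the base learners, and the averaging of their predictions are left out.

```python
import random


def balanced_partitions(positives, negatives, n_parts, ratio=1, seed=0):
    """Build one (positives, negatives) training set per ensemble member.

    ratio caps the negatives in each set at ratio * len(positives) (assumed knob).
    """
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    # Disjoint slices of the majority class, one per ensemble member.
    parts = [shuffled[i::n_parts] for i in range(n_parts)]
    training_sets = []
    for part in parts:
        # Undersample the slice so the classes are closer to balanced.
        kept = part[: ratio * len(positives)]
        training_sets.append((list(positives), kept))
    return training_sets


# Hypothetical data: 2 disease-associated variants versus 20 background variants.
pos = ["p1", "p2"]
neg = [f"n{i}" for i in range(20)]
for p, n in balanced_partitions(pos, neg, n_parts=4, ratio=2):
    print(len(p), "positives,", len(n), "negatives")
```

Every positive example is reused in each subset while each negative appears in at most one, which is what keeps the per-member class ratio under control.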


Subject(s)
Genetic Predisposition to Disease , Genetic Variation , Machine Learning , Untranslated RNA , Algorithms , Genome-Wide Association Study , Humans , Genetic Models , Mutation , Reproducibility of Results , Software
19.
Genome Med ; 8(1): 130, 2016 12 13.
Article in English | MEDLINE | ID: mdl-27964746

ABSTRACT

BACKGROUND: The last two human genome assemblies have extended the previous linear golden-path paradigm of the human genome to a graph-like model to better represent regions with a high degree of structural variability. The new model offers opportunities to improve the technical validity of variant calling in whole-genome sequencing (WGS). METHODS: We developed an algorithm that analyzes the patterns of variant calls in the 178 structurally variable regions of the GRCh38 genome assembly, and infers whether a given sample is most likely to contain sequences from the primary assembly, an alternate locus, or their heterozygous combination at each of these 178 regions. We investigate 121 in-house WGS datasets that have been aligned to the GRCh37 and GRCh38 assemblies. RESULTS: We show that stretches of sequences that are largely but not entirely identical between the primary assembly and an alternate locus can result in multiple variant calls against regions of the primary assembly. In WGS analysis, this results in characteristic and recognizable patterns of variant calls at positions that we term alignable scaffold-discrepant positions (ASDPs). In 121 in-house genomes, on average 51.8±3.8 of the 178 regions were found to correspond best to an alternate locus rather than the primary assembly sequence, and filtering these genomes with our algorithm led to the identification of 7863 variant calls per genome that colocalized with ASDPs. Additionally, we found that 437 of 791 genome-wide association study hits located within one of the regions corresponded to ASDPs. CONCLUSIONS: Our algorithm uses the information contained in the 178 structurally variable regions of the GRCh38 genome assembly to avoid spurious variant calls in cases where samples contain an alternate locus rather than the corresponding segment of the primary assembly. These results suggest the great potential of fully incorporating the resources of graph-like genome assemblies into variant calling, but also underscore the importance of developing computational resources that will allow a full reconstruction of the genotype in personal genomes. Our algorithm is freely available at https://github.com/charite/asdpex.
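
The per-region inference described under METHODS can be pictured with a toy decision rule. This is only a sketch and does not reproduce the published algorithm: it assumes each ASDP in a region has already been labelled by its observed call (`"ref"` for no variant call, `"het"`, or `"hom"`), and the function name `classify_region` and the 0.8 threshold are hypothetical.

```python
def classify_region(asdp_calls, min_fraction=0.8):
    """Guess which haplotype configuration best explains the calls in one region.

    asdp_calls: list with one label per ASDP in the region: "ref", "het" or "hom"
                (hypothetical encoding of the variant calls).
    """
    n = len(asdp_calls)
    if n == 0:
        return "undetermined"
    hom = asdp_calls.count("hom") / n
    het = asdp_calls.count("het") / n
    ref = asdp_calls.count("ref") / n
    if hom >= min_fraction:
        # Nearly every ASDP is a homozygous call: the sample looks like the alternate locus.
        return "alternate"
    if het >= min_fraction:
        # Nearly every ASDP is heterozygous: one primary and one alternate haplotype.
        return "heterozygous"
    if ref >= min_fraction:
        # Almost no ASDP calls: the sample matches the primary assembly.
        return "primary"
    return "undetermined"


print(classify_region(["hom"] * 9 + ["ref"]))  # mostly homozygous calls
print(classify_region(["het"] * 5))            # all heterozygous calls
print(classify_region(["ref"] * 10))           # no calls at ASDPs
```

Under such a rule, variant calls that coincide with ASDPs in a region classified as "alternate" or "heterozygous" would be candidates for filtering as alignment-driven rather than genuine variation.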


Subject(s)
Algorithms , Genetic Variation , Human Genome , Heterozygote , Sequence Alignment/methods , DNA Sequence Analysis/methods , Humans
20.
Am J Hum Genet ; 99(3): 595-606, 2016 09 01.
Article in English | MEDLINE | ID: mdl-27569544

ABSTRACT

The interpretation of non-coding variants remains a major challenge in the application of whole-genome sequencing to Mendelian disease, especially for single-nucleotide and other small non-coding variants. Here we present Genomiser, an analysis framework that is able not only to score the relevance of variation in the non-coding genome, but also to associate regulatory variants with specific Mendelian diseases. Genomiser scores variants through either existing methods such as CADD or a bespoke machine learning method and combines these scores with allele frequency, regulatory sequences, chromosomal topological domains, and phenotypic relevance to discover variants associated with specific Mendelian disorders. Overall, Genomiser is able to identify causal regulatory variants as the top candidate in 77% of simulated whole genomes, allowing effective detection and discovery of regulatory variants in Mendelian disease.
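
To make the idea of combining several lines of evidence into one ranking concrete, here is a minimal sketch. It is not Genomiser's scoring scheme: the multiplicative combination, the `max_frequency` cutoff, and all names (`Candidate`, `combined_score`, `rank`) are assumptions chosen for illustration, and regulatory-sequence and topological-domain information is omitted.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    variant_id: str
    pathogenicity: float     # scaled variant score in [0, 1], e.g. from a variant scorer
    allele_frequency: float  # population allele frequency in [0, 1]
    phenotype_match: float   # similarity between gene and patient phenotypes in [0, 1]


def combined_score(c, max_frequency=0.01):
    """Multiply the evidence terms; variants above the frequency cutoff score zero."""
    if c.allele_frequency > max_frequency:
        return 0.0
    # Rarer variants keep more of their score (1.0 when the variant is unobserved).
    rarity = 1.0 - c.allele_frequency / max_frequency
    return c.pathogenicity * rarity * c.phenotype_match


def rank(candidates):
    """Order candidates from most to least likely under the toy score."""
    return sorted(candidates, key=combined_score, reverse=True)


# Hypothetical candidates from one simulated genome.
candidates = [
    Candidate("a", pathogenicity=0.9, allele_frequency=0.0, phenotype_match=0.8),
    Candidate("b", pathogenicity=0.95, allele_frequency=0.05, phenotype_match=0.9),
    Candidate("c", pathogenicity=0.5, allele_frequency=0.001, phenotype_match=0.9),
]
print([c.variant_id for c in rank(candidates)])
```

Variant "b" has the highest pathogenicity score but is too common to be plausible for a rare Mendelian disorder, so it drops to the bottom once frequency is taken into account.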


Subject(s)
Algorithms , Inborn Genetic Diseases/genetics , Human Genome/genetics , Mutation/genetics , Gene Frequency , Genome-Wide Association Study , Humans , Machine Learning , Open Reading Frames/genetics , Phenotype , Point Mutation/genetics