CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions.

Schubach, Max; Maass, Thorben; Nazaretyan, Lusiné; Röner, Sebastian; Kircher, Martin

Affiliation

Schubach M; Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany.
Maass T; Institute of Human Genetics, University Hospital Schleswig-Holstein, University of Lübeck, Lübeck, Germany.
Nazaretyan L; Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany.
Röner S; Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany.
Kircher M; Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Berlin, Germany.

Nucleic Acids Res ; 52(D1): D1143-D1154, 2024 Jan 05.

Article in English | MEDLINE | ID: mdl-38183205

ABSTRACT

Machine Learning-based scoring and classification of genetic variants aids the assessment of clinical findings and is employed to prioritize variants in diverse genetic studies and analyses. Combined Annotation-Dependent Depletion (CADD) is one of the first methods for the genome-wide prioritization of variants across different molecular functions and has been continuously developed and improved since its original publication. Here, we present our most recent release, CADD v1.7. We explored and integrated new annotation features, among them state-of-the-art protein language model scores (Meta ESM-1v), regulatory variant effect predictions (from sequence-based convolutional neural networks) and sequence conservation scores (Zoonomia). We evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, we tested CADD on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym and, for regulatory effect prediction, we used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The inclusion of new features further improved the overall performance of CADD. As with previous releases, all data sets, genome-wide CADD v1.7 scores, scripts for on-site scoring and an easy-to-use webserver are readily provided via https://cadd.bihealth.org/ or https://cadd.gs.washington.edu/ to the community.

Subject(s)

Genetic Variation; Genome, Human; Machine Learning; Software; Nucleotides; Humans

Full text: 1 | Collection: 01-international | Database: MEDLINE | Main subject: Genetic Variation / Software / Genome, Human / Machine Learning | Type of study: Prognostic_studies / Risk_factors_studies | Limits: Humans | Language: En | Journal: Nucleic Acids Res | Year: 2024 | Document type: Article | Country of affiliation: Germany
