Note: Scoring VCF files with CADD v1.7 is still rather slow if many new variants need to be calculated from scratch (e.g., if many insertions/deletions or multi-nucleotide substitutions are included). Where possible, use the pre-scored whole-genome and pre-calculated indel files directly. We are very sorry for the inconvenience.
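One way to act on the note above is to separate the variants that can be looked up in the pre-scored SNV files from those that would have to be scored from scratch. The following is a minimal sketch, not part of CADD itself: the function names `variant_class` and `split_vcf_records` are hypothetical, and it assumes plain tab-separated VCF data lines with REF in column 4 and ALT in column 5.

```python
def variant_class(ref, alt):
    """Classify one REF/ALT pair by allele content and length."""
    bases = set("ACGTN")
    # Symbolic or missing alleles (e.g. "<DEL>", "*", ".") are not handled here.
    if not ref or not alt or not set(ref.upper()) <= bases or not set(alt.upper()) <= bases:
        return "other"
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) == len(alt):
        return "MNV"
    return "indel"


def split_vcf_records(lines):
    """Split VCF data lines into SNVs and all remaining variants."""
    snvs, others = [], []
    for line in lines:
        # Skip blank lines and VCF header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        # Multi-allelic records list several ALT alleles separated by commas.
        for alt in alts.split(","):
            record = (chrom, pos, ref, alt)
            if variant_class(ref, alt) == "SNV":
                snvs.append(record)
            else:
                others.append(record)
    return snvs, others


example_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG,AT\t.\t.\t.",
    "1\t200\t.\tAC\tGT\t.\t.\t.",
]
snvs, others = split_vcf_records(example_lines)
print(len(snvs), "SNV(s) for lookup;", len(others), "variant(s) left to score")
```

Only the records in `others` would then need the slower from-scratch scoring, provided they are not already covered by the pre-calculated indel files.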
What is Combined Annotation Dependent Depletion (CADD)?
CADD is a tool for scoring the deleteriousness of single nucleotide variants, multi-nucleotide substitutions, and insertion/deletion variants in the human genome.
While many variant annotation and scoring tools are around, most annotations tend to exploit a single information type (e.g. conservation) and/or are restricted in scope (e.g. to missense changes). Thus, a broadly applicable metric that objectively weights and integrates diverse information is needed. Combined Annotation Dependent Depletion (CADD) is a framework that integrates multiple annotations into one metric by contrasting variants that survived natural selection with simulated mutations.
C-scores strongly correlate with allelic diversity, pathogenicity of both coding and non-coding variants, and experimentally measured regulatory effects, and they rank causal variants highly within individual genome sequences. Finally, C-scores of complex trait-associated variants from genome-wide association studies (GWAS) are significantly higher than those of matched controls and correlate with study sample size, likely reflecting the increased accuracy of larger GWAS.
CADD can quantitatively prioritize functional, deleterious, and disease-causal variants across a wide range of functional categories, effect sizes, and genetic architectures, and it can be used to prioritize causal variation in both research and clinical settings.
In addition to this website, CADD has been described in four publications. The most recent manuscript describes CADD v1.7, an extension to the annotations included in the model. Most prominently, this version improves the scoring of coding variants with features derived from the ESM-1v protein language model as well as the scoring of regulatory variants with features derived from a convolutional neural network trained on regions of open chromatin:
Schubach M, Maass T, Nazaretyan L, Röner S, Kircher M.
CADD v1.7: Using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions.
Nucleic Acids Res. 2024 Jan 5. doi: 10.1093/nar/gkad989.
PubMed PMID: 38183205.
Then there is CADD-Splice (CADD v1.6), which specifically improved the prediction of splicing effects:
Rentzsch P, Schubach M, Shendure J, Kircher M.
CADD-Splice—improving genome-wide variant effect prediction using deep learning-derived splice scores.
Genome Med. 2021 Feb 22. doi: 10.1186/s13073-021-00835-9.
PubMed PMID: 33618777.
Our third manuscript describes the updates between the initial publication and CADD v1.4, introduces CADD for GRCh38 and explains how we envision the use of CADD. It was published by Nucleic Acids Research in 2018:
Rentzsch P, Witten D, Cooper GM, Shendure J, Kircher M.
CADD: predicting the deleteriousness of variants throughout the human genome.
Nucleic Acids Res. 2018 Oct 29. doi: 10.1093/nar/gky1016.
PubMed PMID: 30371827.
Finally, the original manuscript describing the method was published by Nature Genetics in 2014:
Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J.
A general framework for estimating the relative pathogenicity of human genetic variants.
Nat Genet. 2014 Feb 2. doi: 10.1038/ng.2892.
PubMed PMID: 24487276.
How can I obtain CADD scores?
CADD scores are freely available for all non-commercial applications. If you are planning on using them in a commercial application, you can obtain a license through the UW CoMotion Express Licensing System. If in doubt about whether you need a license for your application, please contact Martin Kircher, Jay Shendure and Gregory M. Cooper. CADD is currently developed by Martin Kircher, Max Schubach, Thorben Maaß, and Lusine Nazaretyan. Former developers are Philipp Rentzsch, Daniela M. Witten, Gregory M. Cooper, and Jay Shendure.
We have pre-computed CADD-based scores (C-scores) for all approximately 9 billion possible single nucleotide variants (SNVs) of the reference genome, a selection of short insertions/deletions, and some large variant sets (e.g. gnomAD, ExAC, 1000 Genomes, ESP). We also provide a simple lookup for SNVs and enable scoring of short insertions/deletions. Ranges of scores can be natively visualized in the UCSC Genome Browser or using our custom tracks (for CADD v1.6 hg19/GRCh37 and CADD v1.6 hg38/GRCh38).
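As an illustration of working with pre-computed scores, the sketch below reads a small tab-separated excerpt and looks up individual SNVs. It rests on several assumptions: the column layout (chromosome, position, reference allele, alternative allele, raw score, scaled score) is assumed rather than taken from this page, the helper name `load_scores` is hypothetical, and the score values in `SAMPLE` are invented for demonstration and are not real CADD output.

```python
import csv
import io


def load_scores(handle):
    """Read tab-separated score lines into a dict keyed by (chrom, pos, ref, alt)."""
    scores = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip empty lines and comment/header lines starting with '#'.
        if not row or row[0].startswith("#"):
            continue
        chrom, pos, ref, alt, raw, scaled = row[:6]
        scores[(chrom, int(pos), ref, alt)] = (float(raw), float(scaled))
    return scores


# Invented example data in the assumed column layout (NOT real CADD scores).
SAMPLE = (
    "## illustrative header line\n"
    "#Chrom\tPos\tRef\tAlt\tRawScore\tPHRED\n"
    "1\t10001\tT\tA\t0.10\t4.0\n"
    "1\t10001\tT\tC\t0.25\t6.5\n"
    "1\t10001\tT\tG\t0.05\t3.0\n"
)

scores = load_scores(io.StringIO(SAMPLE))
print(scores.get(("1", 10001, "T", "C")))  # (raw, scaled) pair, or None if absent
```

Loading everything into a dictionary is only sensible for small excerpts. For a genome-wide file with billions of rows, one would normally query a position-indexed, compressed file by region instead of reading it into memory.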