Panmask provides a list of easy and hard regions for short-read variant calling against the human reference genome GRCh38. Small variants in the easy regions are straightforward to call: most variant callers achieve 98-99.5% accuracy within them. The easy regions cover 87.9% of GRCh38, 92.6% of coding regions and 95.8% of pathogenic variants in ClinVar. Restricting analysis to the panmask regions may help reduce variant calling artifacts and simplify variant filtering. The region files can be downloaded from Zenodo.
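
As an illustration of how an easy-region list might be used for filtering, here is a minimal Python sketch. It assumes the regions are distributed as BED intervals (0-based, half-open) and that variants come from a VCF (1-based `POS`); the function names `load_bed`, `in_regions` and `filter_vcf` are hypothetical and are not part of panmask. It only tests the `POS` coordinate, so long indels that extend past a region boundary are not handled specially.

```python
import bisect
from collections import defaultdict


def load_bed(lines):
    """Parse BED lines into {chrom: (starts, ends)} with sorted, merged intervals."""
    raw = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and BED header lines
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))
    regions = {}
    for chrom, ivs in raw.items():
        ivs.sort()
        merged = []
        for s, e in ivs:
            if merged and s <= merged[-1][1]:
                # overlapping or adjacent: extend the previous interval
                merged[-1][1] = max(merged[-1][1], e)
            else:
                merged.append([s, e])
        regions[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return regions


def in_regions(regions, chrom, pos):
    """Return True if the 1-based position `pos` lies inside a BED interval."""
    if chrom not in regions:
        return False
    starts, ends = regions[chrom]
    p = pos - 1  # convert 1-based VCF POS to a 0-based coordinate
    i = bisect.bisect_right(starts, p) - 1  # last interval starting at or before p
    return i >= 0 and p < ends[i]


def filter_vcf(vcf_lines, regions):
    """Yield header lines unchanged and only those records whose POS is in `regions`."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if in_regions(regions, chrom, int(pos)):
            yield line
```

For real data, an established interval tool is likely a better choice than this sketch; the code is only meant to make the coordinate conventions explicit.
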
GRCh38 easy regions (where variant calls tend to be accurate in most samples):
- umap-k100: Umap for 100bp single-end reads, published in Karimzadeh et al (2018)
- ENCODE: ENCODE blacklist regions v2, published in Amemiya et al (2019)
- GIAB-easy: GIAB genome stratification v3.5, published in Dwarshuis et al (2024)
- Illumina: originally developed by Illumina, reimplemented by Taylor and McCoy from JHU.
- 1000G mask: developed for the 1000 Genomes Project. Developers unknown.
- highRepro: intersection of highly reproducible regions generated by Pan et al (2022)
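
Several of these sets are naturally combined or compared by intersecting intervals, and coverage figures such as "87.9% of GRCh38" are the total length of a region set divided by the genome length. The sketch below shows both operations for a single chromosome. It is an illustration only, not the procedure used to build any of the sets above; the function names are hypothetical, and the inputs are assumed to be sorted, non-overlapping, half-open `(start, end)` intervals.

```python
def intersect(a, b):
    """Intersect two sorted, non-overlapping lists of half-open (start, end) intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        s = max(a[i][0], b[j][0])
        e = min(a[i][1], b[j][1])
        if s < e:  # the two current intervals overlap
            out.append((s, e))
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def covered_fraction(intervals, genome_length):
    """Fraction of `genome_length` bases covered by non-overlapping intervals."""
    return sum(e - s for s, e in intervals) / genome_length
```

For example, `intersect([(0, 10), (20, 30)], [(5, 25)])` gives `[(5, 10), (20, 25)]`, which covers 10 of every 100 bases on a hypothetical 100 bp sequence.
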
HG002 confident regions (where small variant calls can be trusted):
- HG002-GIAB: NIST confident regions, v4.2.1
- HG002-Q100: T2TQ100 v1.1/20241113
Other datasets used for evaluation:
- Short-read small variant calls, published in Baid et al (2020). Only VCFs called from HG002 PCR-free NovaSeq data at 30X are used.

Data files in this repo are released under CC0 and will be available at GigaDB.