GMP performs basecase squaring ~1.5x faster than a general a*b multiplication by noticing that the cross products are symmetric (a_i*a_j equals a_j*a_i), so each off-diagonal product is computed once and doubled rather than computed twice.
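To make the idea concrete, here is a minimal sketch in plain C of a schoolbook multiply next to a squaring routine that uses the symmetry. It is not GMP's or CGBN's actual code: the 32-bit limb size and the names `mul_basecase` and `sqr_basecase` are my own assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: schoolbook multiply, r[0..2n-1] = a * b.
 * Uses n*n limb products. Assumes 32-bit limbs, least significant first. */
static void mul_basecase(uint32_t *r, const uint32_t *a, const uint32_t *b, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < 2 * n; i++) r[i] = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t carry = 0;
        for (size_t j = 0; j < n; j++) {
            uint64_t t = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        r[i + n] = (uint32_t)carry;
    }
}

/* Hypothetical helper: squaring, r[0..2n-1] = a * a.
 * Uses n*(n-1)/2 off-diagonal products plus n diagonal squares. */
static void sqr_basecase(uint32_t *r, const uint32_t *a, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < 2 * n; i++) r[i] = 0;

    /* Step 1: accumulate each product a[i]*a[j] with i < j exactly once. */
    for (size_t i = 0; i + 1 < n; i++) {
        uint64_t carry = 0;
        for (size_t j = i + 1; j < n; j++) {
            uint64_t t = (uint64_t)a[i] * a[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)t;
            carry = t >> 32;
        }
        r[i + n] = (uint32_t)carry;
    }

    /* Step 2: double the off-diagonal sum with a one-bit left shift. */
    uint32_t top = 0;
    for (size_t i = 0; i < 2 * n; i++) {
        uint32_t next = r[i] >> 31;
        r[i] = (r[i] << 1) | top;
        top = next;
    }

    /* Step 3: add the diagonal squares a[i]^2 at limb position 2*i. */
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sq = (uint64_t)a[i] * a[i];
        uint64_t lo = (uint64_t)r[2 * i] + (uint32_t)sq + carry;
        r[2 * i] = (uint32_t)lo;
        uint64_t hi = (uint64_t)r[2 * i + 1] + (sq >> 32) + (lo >> 32);
        r[2 * i + 1] = (uint32_t)hi;
        carry = hi >> 32;
    }
}
```

In this sketch the squaring does n*(n+1)/2 limb multiplications instead of n*n, at the cost of an extra shift pass and a diagonal pass. How that maps onto a thread-per-limb layout on the GPU is exactly the part I'm unsure about.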
I've read over the core_mont.cu code, and I believe that core_mont.cu is used when TPI=limbs. In this case it's not clear to me how to do fast exp.
Have you thought about this possibility?
Have you seen anyone take advantage of this on a GPU?