This is more of a question than an issue, but why predict the input embedding rather than the final latent at t+1? Did predicting the final latent lead to degenerate solutions?