Question about pretraining

Hi, thank you for sharing this great work! I’m curious about the pretraining process for this model. Could you please share **the amount of compute used (number and type of GPUs) and the total pretraining time**? This information would help me better understand the scale of the training. Thanks in advance!