Definitions of the world, camera, and human frames

Hi authors, could you please elaborate on the definitions of the world, camera, and human coordinate frames (and the rendering view) used in the paper and code? I think this information would be really helpful for readers trying to understand and apply the method.
I am finding it hard to work out the transformations from the code alone. Thank you so much for your help!