Dear Author,
First of all, I would like to express my gratitude for your excellent work and for open-sourcing such a valuable resource. After carefully reading your paper and going through the code, I have a question regarding the Over-Trust Logit Penalty section, specifically about the definition of Equation (6).
From my understanding, OPERA adds an attention penalty to the probability of the top-k tokens selected at each beam search step. However, $H(h_t)$ and the attention score appear to be different quantities. Could you clarify whether the subtraction in the equation has a clear physical interpretation, or whether there is specific reasoning behind it?
I appreciate your time and look forward to your clarification!
Best regards,