Why set the label tokens the same as the input token #637

kaimoxuan123 · 2024-08-19T21:37:08Z

System Info

CUDA: 12.1

Information

The official example scripts
My own modified scripts

🐛 Describe the bug

Hello, I am studying toxicchat_dataset.py to learn how to generate instruction datasets for fine-tuning Llama 3.1.
https://github.com/meta-llama/llama-recipes/blob/main/src/llama_recipes/datasets/toxicchat_dataset.py#L17

In this code:
combined_tokens = {
    "input_ids": list(prompt_tokens),
    "labels": list(prompt_tokens)
}
return dict(combined_tokens, attention_mask=[1]*len(combined_tokens["input_ids"]))
Since our task is to predict the next token, why can the labels be set exactly the same as the input_ids? Why don't we shift the input_ids by one token to the right to get the labels?
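For context on this question, here is a minimal sketch of one common convention, in which the one-token shift happens inside the loss computation rather than in the dataset. As far as I understand, Hugging Face transformers causal-LM models work this way, which would make it valid to pass labels identical to input_ids. The function names `pairs_after_shift` and `shifted_next_token_loss` are hypothetical and written in plain Python for illustration; they are not taken from llama-recipes or transformers.

```python
import math


def pairs_after_shift(input_ids, labels):
    """Return (context token, target token) pairs after the internal shift."""
    # Position t is paired with labels[t + 1]; the last input has no target.
    return list(zip(input_ids[:-1], labels[1:]))


def shifted_next_token_loss(logits, labels, ignore_index=-100):
    """Mean cross-entropy where the logits at position t predict labels[t + 1].

    logits: list of per-position score lists (one score per vocabulary id).
    labels: list of token ids, the same length as logits.
    """
    shift_logits = logits[:-1]  # drop the last position: nothing follows it
    shift_labels = labels[1:]   # drop the first label: nothing predicts it
    total, count = 0.0, 0
    for row, target in zip(shift_logits, shift_labels):
        if target == ignore_index:
            continue  # masked positions do not contribute to the loss
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


input_ids = [5, 8, 2, 9]
labels = list(input_ids)  # same as input_ids, as in the dataset code
print(pairs_after_shift(input_ids, labels))  # [(5, 8), (8, 2), (2, 9)]
```

Under this assumption, shifting the labels again in the dataset would make each position predict the token two steps ahead instead of the next one.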

Error logs

Read the code to get some experience

Expected behavior

The labels are the input_ids shifted to the right by one token.
