<|endoftext|> token isn't encoded correctly #140

@ttumiel

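import torch
from mingpt.bpe import BPETokenizer

tokenizer = BPETokenizer()

# Encoding the special token string yields 7 ordinary BPE tokens, not the single id 50256.
print(tokenizer("<|endoftext|>")) # tensor([[  27,   91,  437, 1659, 5239,   91,   29]])
# Decoding id 50256 does produce the special token string...
print(tokenizer.decode(torch.tensor([50256]))) # '<|endoftext|>'
# ...but re-encoding that string does not round-trip back to 50256.
print(tokenizer(tokenizer.decode(torch.tensor([50256])))) # tensor([[  27,   91,  437, 1659, 5239,   91,   29]])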
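One possible workaround, sketched under assumptions rather than taken from minGPT itself: split the input on the special token string before running BPE, and map each special token directly to its reserved id. The names `encode_with_special`, `SPECIAL_TOKENS`, and `toy_encode` are hypothetical; `toy_encode` is only a byte-level stand-in so the snippet runs on its own, and a real ordinary-text encoder would take its place. The id 50256 is the one used in the reproduction above.

```python
import re

# Special token string -> reserved id (50256 is the id from the reproduction above).
SPECIAL_TOKENS = {"<|endoftext|>": 50256}

def encode_with_special(text, encode_ordinary, special_tokens=SPECIAL_TOKENS):
    """Encode text, mapping each special token string to its single reserved id."""
    # The capturing group makes re.split keep the special tokens in its output.
    pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"
    ids = []
    for piece in re.split(pattern, text):
        if not piece:
            continue  # re.split yields empty strings at the edges of a match
        if piece in special_tokens:
            ids.append(special_tokens[piece])
        else:
            ids.extend(encode_ordinary(piece))
    return ids

def toy_encode(text):
    # Hypothetical stand-in for a real BPE encoder: one id per UTF-8 byte.
    return list(text.encode("utf-8"))

print(encode_with_special("<|endoftext|>", toy_encode))    # [50256]
print(encode_with_special("hi<|endoftext|>", toy_encode))  # [104, 105, 50256]
```

Whether special tokens in user-supplied text should be honored at all is a design choice; a caller who treats the input as untrusted may prefer the current behavior and opt in to this handling explicitly.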
