microsoft/codebert-base produces two SEP tokens

Hey guys,

This post is basically a copy of this one. The original did not get any attention, but I think the problem might be important, and this is probably a better place to ask. If this is the wrong place to post, please let me know where it belongs.

I noticed that the "microsoft/codebert-base" tokenizer adds two SEP tokens between two sentences. I don't think this is intended behavior, but feel free to correct me.

This is the example I originally posted:

from transformers import AutoTokenizer, AutoModel
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoint = "microsoft/codebert-base"

model = AutoModel.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

from datasets import load_dataset

raw_dataset = load_dataset('json', data_files='/home/<user>/Data/<DataDir>/dataset_v1.jsonl', split='train')

def toke(example):
    # tokenize the two sentences as a pair; this is where the double SEP shows up
    return tokenizer(example["sentence1"], example["sentence2"])

tokenized_dataset = raw_dataset.select(list(range(10000))).map(toke, batched=True)

print(tokenized_dataset[7]['sentence1'])
print(tokenized_dataset[7]['sentence2'])
print(tokenized_dataset[7]['input_ids'])

Output:

train_nan_df.head()
test_df['ImageId'] = np.array(os.listdir ('…/input/test_images/'))
[0, 21714, 1215, 10197, 1215, 36807, 4, 3628, 43048, 2, 2, 21959, 1215, 36807, 48759, 8532, 28081, 44403, 5457, 46446, 4, 30766, 1640, 366, 4, 8458, 41292, 31509, 49445, 46797, 73, 21959, 1215, 39472, 73, 108, 35122, 2]
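
Note the two consecutive 2s in the middle of the input_ids; id 2 is the tokenizer's SEP token, which can be checked directly (a quick sketch, not part of the original run):

# id 2 is the </s> (SEP) token, so the "2, 2" in the middle is a double separator
# sitting between sentence1 and sentence2
tokenizer.convert_ids_to_tokens([2, 2])
# ['</s>', '</s>']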

Here is another example using publicly available data:
In:

!pip install transformers datasets

from transformers import AutoTokenizer, AutoModel
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

checkpoint = "microsoft/codebert-base"

model = AutoModel.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

tokenizer.sep_token_id

Out:

2

In:

tokenizer("this is the first sentence", "this is the second sentence")

Out:

{'input_ids': [0, 9226, 16, 5, 78, 3645, 2, 2, 9226, 16, 5, 200, 3645, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
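
Converting these ids back to tokens makes the double separator visible (a rough check I am adding here; the output is summarized in the comment):

tokens = tokenizer.convert_ids_to_tokens(
    [0, 9226, 16, 5, 78, 3645, 2, 2, 9226, 16, 5, 200, 3645, 2]
)
# tokens begins with '<s>', ends with '</s>', and has '</s>', '</s>' back to back
# in the middle, i.e. the pair layout <s> A </s></s> B </s>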

Also, calling the corresponding tokenizer.tokenize and tokenizer.convert_tokens_to_ids functions does not produce two SEP tokens.
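
For reference, this is the kind of call I mean (a sketch, not the exact code I ran):

tokens = tokenizer.tokenize("this is the first sentence")
ids = tokenizer.convert_tokens_to_ids(tokens)
# tokenize() / convert_tokens_to_ids() do not add special tokens at all,
# so neither <s> nor </s> appears here, let alone a doubled SEP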

It looks like this is the expected behavior for some of the tokenizers, at least according to this GitHub issue.
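
For completeness, a quick comparison with plain roberta-base (which CodeBERT builds on, as far as I know) shows the same pattern, i.e. the RoBERTa pair format <s> A </s></s> B </s>:

from transformers import AutoTokenizer

# roberta-base uses the same sentence-pair template as microsoft/codebert-base
roberta_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
print(roberta_tokenizer("this is the first sentence", "this is the second sentence")["input_ids"])
# the output also contains two consecutive 2s (the sep_token_id) between the sentences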

Thanks for your reply! I did not expect to still get an answer, but the post you shared answered my question! Thank you :smiley: