Hey Guys,
this post is basically a copy of this one. The original did not get any attention, but I think the issue might be important, and this is probably a better place to ask. If this is the wrong place to post, please let me know where it belongs.
I noticed that the "microsoft/codebert-base" tokenizer adds two SEP tokens between two sentences. I don't think this is intended behavior, but feel free to correct me.
This is the example I originally posted:
from transformers import AutoTokenizer, AutoModel
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = "microsoft/codebert-base"
model = AutoModel.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
from datasets import load_dataset
raw_dataset = load_dataset('json', data_files='/home/<user>/Data/<DataDir>/dataset_v1.jsonl', split='train')
def toke(example):
    return tokenizer(example["sentence1"], example["sentence2"])
tokenized_dataset = raw_dataset.select(list(range(10000))).map(toke, batched=True)
print(tokenized_dataset[7]['sentence1'])
print(tokenized_dataset[7]['sentence2'])
print(tokenized_dataset[7]['input_ids'])
Output:
train_nan_df.head()
test_df["ImageId"] = np.array(os.listdir("…/input/test_images/"))
[0, 21714, 1215, 10197, 1215, 36807, 4, 3628, 43048, 2, 2, 21959, 1215, 36807, 48759, 8532, 28081, 44403, 5457, 46446, 4, 30766, 1640, 366, 4, 8458, 41292, 31509, 49445, 46797, 73, 21959, 1215, 39472, 73, 108, 35122, 2]
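For reference, converting the ids back to token strings makes the duplicated separator visible. This is just a small sketch, reusing the tokenized_dataset from above (token id 2 is the tokenizer's sep_token_id):

# Map the ids from example 7 back to token strings; the two </s>
# separator tokens appear back to back between sentence1 and sentence2.
tokens = tokenizer.convert_ids_to_tokens(tokenized_dataset[7]['input_ids'])
print(tokens)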
Here is another example using publicly available data:
In:
!pip install transformers datasets
from transformers import AutoTokenizer, AutoModel
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = "microsoft/codebert-base"
model = AutoModel.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.sep_token_id
Out:
2
In:
tokenizer("this is the first sentence", "this is the second sentence")
Out:
{'input_ids': [0, 9226, 16, 5, 78, 3645, 2, 2, 9226, 16, 5, 200, 3645, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
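To narrow down where the extra token comes from, the special-token template itself can be inspected. A sketch of what I tried, using build_inputs_with_special_tokens, the helper the tokenizer uses to wrap a sequence pair in its special tokens:

# Ids for each sentence without any special tokens added.
ids_a = tokenizer("this is the first sentence", add_special_tokens=False)["input_ids"]
ids_b = tokenizer("this is the second sentence", add_special_tokens=False)["input_ids"]
# Let the tokenizer add its special tokens around the pair.
print(tokenizer.build_inputs_with_special_tokens(ids_a, ids_b))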
Calling tokenizer.tokenize and tokenizer.convert_tokens_to_ids directly does not produce two SEP tokens either.
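What I mean by that last point, as a rough sketch (assembling the pair by hand with a single SEP, which is the layout I expected from the tokenizer call as well):

# tokenize() / convert_tokens_to_ids() do not add special tokens on their own.
tokens_a = tokenizer.tokenize("this is the first sentence")
tokens_b = tokenizer.tokenize("this is the second sentence")

# Assembled by hand: CLS, sentence1, a single SEP, sentence2, SEP.
ids = (
    [tokenizer.cls_token_id]
    + tokenizer.convert_tokens_to_ids(tokens_a)
    + [tokenizer.sep_token_id]
    + tokenizer.convert_tokens_to_ids(tokens_b)
    + [tokenizer.sep_token_id]
)
print(ids)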