I have two text features, "title" and "abstract". I want to tokenize these two columns in a Dataset.
# https://huggingface.co/jjzha/jobbert-base-cased
from transformers import AutoTokenizer

checkpoint = 'jjzha/jobbert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.model_max_length = 512

def tokenize(batch):
    # Tokenize both text columns, padded/truncated to 512 tokens
    tokenized_text1 = tokenizer(batch["title"], truncation=True, padding='max_length', max_length=512)
    tokenized_text2 = tokenizer(batch["abstract"], truncation=True, padding='max_length', max_length=512)
    return {'input_ids1': tokenized_text1['input_ids'],
            'attention_mask1': tokenized_text1['attention_mask'],
            'token_type_ids1': tokenized_text1['token_type_ids'],
            'input_ids2': tokenized_text2['input_ids'],
            'attention_mask2': tokenized_text2['attention_mask'],
            # I want every row of this column to be 512 ones
            'token_type_ids2': [1] * len(tokenized_text2['token_type_ids'])  # 100 * [[1] * 512]
            }

dataset_dict.map(tokenize, batched=True)['train'].to_pandas()  # ['input_ids1']
In the code above, on the "token_type_ids2" line I want to build token_type_ids that are all 1s for the second text feature, but it is not working as expected.
When I examine the tokenized dataset, the "token_type_ids2" column contains only the single value 1 in every row, whereas it should be a list of 1s of length 512.
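To make this concrete, here is a minimal reproduction with two made-up rows (the toy data below is hypothetical and only for illustration; the tokenizer and tokenize function are the ones defined above):

from datasets import Dataset

# Two made-up rows, just to show the shape of the output
toy = Dataset.from_dict({
    "title": ["job ad one", "job ad two"],
    "abstract": ["first abstract", "second abstract"],
})
toy_tokenized = toy.map(tokenize, batched=True)

print(toy_tokenized[0]["token_type_ids1"][:5])  # a list, e.g. [0, 0, 0, 0, 0] -- as expected
print(toy_tokenized[0]["token_type_ids2"])      # prints 1, not a list of 512 ones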
How can I solve this problem?
By the way, if I hard-code the value there as 100 * [[1] * 512] (100 is the size of my dataset), the correct token_type_ids show up. But shouldn't the map function apply the tokenize function to each row?
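For reference, this is what I mean by hard-coding it (a sketch only; the separate function name tokenize_hardcoded is just for this example):

def tokenize_hardcoded(batch):
    tokenized_text1 = tokenizer(batch["title"], truncation=True, padding='max_length', max_length=512)
    tokenized_text2 = tokenizer(batch["abstract"], truncation=True, padding='max_length', max_length=512)
    return {'input_ids1': tokenized_text1['input_ids'],
            'attention_mask1': tokenized_text1['attention_mask'],
            'token_type_ids1': tokenized_text1['token_type_ids'],
            'input_ids2': tokenized_text2['input_ids'],
            'attention_mask2': tokenized_text2['attention_mask'],
            # 100 hard-coded lists of 512 ones, one per row of my 100-row dataset
            'token_type_ids2': 100 * [[1] * 512]
            }

dataset_dict.map(tokenize_hardcoded, batched=True)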