I have two text features, "title" and "abstract". I want to tokenize these two columns in a Dataset.
# https://huggingface.co/jjzha/jobbert-base-cased
from transformers import AutoTokenizer

checkpoint = 'jjzha/jobbert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.model_max_length = 512

def tokenize(batch):
    # Tokenize both text columns, padded/truncated to 512 tokens
    tokenized_text1 = tokenizer(batch["title"], truncation=True, padding='max_length', max_length=512)
    tokenized_text2 = tokenizer(batch["abstract"], truncation=True, padding='max_length', max_length=512)
    return {'input_ids1': tokenized_text1['input_ids'],
            'attention_mask1': tokenized_text1['attention_mask'],
            'token_type_ids1': tokenized_text1['token_type_ids'],
            'input_ids2': tokenized_text2['input_ids'],
            'attention_mask2': tokenized_text2['attention_mask'],
            # I want every row of this column to be 512 ones
            'token_type_ids2': [1] * len(tokenized_text2['token_type_ids'])  # 100 * [[1] * 512]
            }

dataset_dict.map(tokenize, batched=True)['train'].to_pandas()  # ['input_ids1']
In the code above, on the "token_type_ids2" line I want to build token_type_ids that are all 1s for the second text feature, but it is not working as expected.
When I examine the tokenized dataset, the "token_type_ids2" column contains only the single value 1 in every row, whereas it should be a list of 1s of length 512.
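To make this concrete, here is a minimal reproduction with two made-up rows (the toy data below is hypothetical and only for illustration; the tokenizer and tokenize function are the ones defined above):

from datasets import Dataset

# Two made-up rows, just to show the shape of the output
toy = Dataset.from_dict({
    "title": ["job ad one", "job ad two"],
    "abstract": ["first abstract", "second abstract"],
})
toy_tokenized = toy.map(tokenize, batched=True)

print(toy_tokenized[0]["token_type_ids1"][:5])  # a list, e.g. [0, 0, 0, 0, 0] -- as expected
print(toy_tokenized[0]["token_type_ids2"])      # prints 1, not a list of 512 ones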
How can I solve this problem?
By the way, if I hard-code the value there as 100 * [[1] * 512] (100 is the size of my dataset), the correct token_type_ids show up. But shouldn't the map function apply the tokenize function to each row?
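For reference, this is what I mean by hard-coding it (a sketch only; the separate function name tokenize_hardcoded is just for this example):

def tokenize_hardcoded(batch):
    tokenized_text1 = tokenizer(batch["title"], truncation=True, padding='max_length', max_length=512)
    tokenized_text2 = tokenizer(batch["abstract"], truncation=True, padding='max_length', max_length=512)
    return {'input_ids1': tokenized_text1['input_ids'],
            'attention_mask1': tokenized_text1['attention_mask'],
            'token_type_ids1': tokenized_text1['token_type_ids'],
            'input_ids2': tokenized_text2['input_ids'],
            'attention_mask2': tokenized_text2['attention_mask'],
            # 100 hard-coded lists of 512 ones, one per row of my 100-row dataset
            'token_type_ids2': 100 * [[1] * 512]
            }

dataset_dict.map(tokenize_hardcoded, batched=True)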