I’m trying to retrain a NER model. When I apply DataCollatorForTokenClassification to pad the batches, the tokenized inputs and the labels end up with different shapes and this error is raised:
/usr/local/lib/python3.7/dist-packages/transformers/data/data_collator.py in <dictcomp>(.0)
326 ]
327
--> 328 batch = {k: torch.tensor(v, dtype=torch.int64) for k, v in batch.items()}
329 return batch
330
ValueError: expected sequence of length 512 at dim 1 (got 513)
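As far as I can tell the failing line is the torch.tensor call on the assembled batch, which cannot build a rectangular tensor when one key's sequences are 513 long while the rest are 512. A minimal sketch of that failure mode, outside the collator:

import torch

# If, say, input_ids are padded to 512 but labels come out 513 long,
# torch.tensor receives a ragged nested list and raises the same error.
ragged = [[0] * 512, [0] * 513]
torch.tensor(ragged, dtype=torch.int64)
# ValueError: expected sequence of length 512 at dim 1 (got 513)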
- Tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint,
                                          max_length=512, truncation=True,
                                          padding="max_length")
PreTrainedTokenizerFast(name_or_path='pierreguillou/ner-bert-large-cased-pt-lenerbr', vocab_size=29794, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})
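For context, the features handed to the collator come out of a word-to-subword label alignment step roughly like the sketch below (the function and column names are illustrative, not my exact code):

def tokenize_and_align_labels(examples):
    # Truncation/max_length are passed explicitly at call time here,
    # rather than relying on the kwargs given to from_pretrained above.
    tokenized = tokenizer(examples["tokens"],
                          is_split_into_words=True,
                          truncation=True,
                          max_length=512)
    all_labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        # Special tokens ([CLS], [SEP]) map to None and get -100 so the loss ignores them.
        all_labels.append([-100 if w is None else word_labels[w] for w in word_ids])
    tokenized["labels"] = all_labels
    return tokenized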
- Data collator
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(
    tokenizer,
    max_length=512,
    padding="max_length",
    label_pad_token_id=-100)
DataCollatorForTokenClassification(tokenizer=PreTrainedTokenizerFast(name_or_path='pierreguillou/ner-bert-large-cased-pt-lenerbr', vocab_size=29794, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}), padding='max_length', max_length=512, pad_to_multiple_of=None, label_pad_token_id=-100, return_tensors='pt')
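For completeness, the collator is used through the Trainer roughly like this (model and training_args are not shown, and tokenized_dataset is my name for the preprocessed dataset):

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer)
trainer.train()  # the ValueError above is raised while a batch is being collated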
- Tokenized training example
['[CLS]',
'analis',
'##e',
'da',
'defesa',
'da',
'interessa',
'##da',
'pres',
'##tadora',
'do',
'servi',
'##co',
'de',
'comunica',
'##ca',
'##o',
'multi',
'##mid',
'##ia',
'-',
's',
'##c',
'##m',
'e',
'servi',
'##co',
'telef',
'##oni',
'##co',
'fixo',
'comu',
'##tado',
'-',
's',
'##t',
'##f',
'##c',
'[SEP]']