TypeError: decode() got an unexpected keyword argument 'clean_up_tokenization_spaces'

I’m using this code to try an Indonesian mBART pretrained model, which I’m going to use for summarization.

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, pipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
from indobenchmark import IndoNLGTokenizer
import time
import tensorflow as tf
import os, re, logging
import pandas as pd

model1 = "indobenchmark/indobart-v2"
print(model1)

tokenizer = IndoNLGTokenizer.from_pretrained("indobenchmark/indobart-v2")
model = AutoModelForSeq2SeqLM.from_pretrained(model1)

summarizer = pipeline(
    "summarization", model=model, tokenizer=tokenizer, 
    num_beams=5, do_sample=True, no_repeat_ngram_size=3
)

summarizer(
    "some indonesian article",
    min_length=20,
    max_length=144,
)

The code throws this error:

TypeError: decode() got an unexpected keyword argument 'clean_up_tokenization_spaces'

Complete stack trace: err mbart indobenchmark · GitHub. Note that the error appears when running in a Jupyter notebook.

I always get this kind of error when running inference through pipeline. What’s actually going wrong here?

I also tried generating a summary with the following method, and it runs normally.

def sumt5m(model, pr):
    # Tokenize the article (uses the globally defined tokenizer)
    input_ids = tokenizer.encode(pr, max_length=10240, return_tensors='pt')
    # Generate the summary with beam search
    summary_ids = model.generate(input_ids,
                                 max_length=100,
                                 num_beams=2,
                                 repetition_penalty=2.5,
                                 length_penalty=1.0,
                                 early_stopping=True,
                                 no_repeat_ngram_size=2,
                                 use_cache=True)
    # Decode the generated ids back into text
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
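
For reference, I call it like this (the article string here is just a hypothetical placeholder):

article = "some indonesian article"  # hypothetical input text
print(sumt5m(model, article))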

It turned out I was using a custom tokenizer from indobenchmark, and its code is a bit outdated: its decode() override doesn’t accept the clean_up_tokenization_spaces argument that newer versions of transformers pass during pipeline inference. I fixed it by adding the clean_up_tokenization_spaces argument so the method correctly overrides decode() from the PreTrainedTokenizer base class, like this:

    def decode(self, inputs, skip_special_tokens=False, clean_up_tokenization_spaces: bool = True):
        outputs = super().decode(inputs, skip_special_tokens=skip_special_tokens,
                                 clean_up_tokenization_spaces=clean_up_tokenization_spaces)
        # Undo SentencePiece tokenization: drop plain spaces and turn '▁' markers back into spaces
        return outputs.replace(' ', '').replace('▁', ' ')
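
A slightly more future-proof variant (just a sketch, assuming you are editing or subclassing IndoNLGTokenizer yourself) forwards any extra keyword arguments to the base class, so new decode() arguments introduced in later transformers releases don’t break the override again:

    def decode(self, inputs, skip_special_tokens=False, **kwargs):
        # Forward all extra keyword arguments (e.g. clean_up_tokenization_spaces)
        # to PreTrainedTokenizer.decode so newer transformers versions keep working
        outputs = super().decode(inputs, skip_special_tokens=skip_special_tokens, **kwargs)
        # Same post-processing as the original tokenizer: drop plain spaces and
        # turn the SentencePiece '▁' marker back into spaces
        return outputs.replace(' ', '').replace('▁', ' ')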