I’m using the code below to try an Indonesian MBART pretrained model. I’m going to use it for summarization.
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, pipeline
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
from indobenchmark import IndoNLGTokenizer
import time
import tensorflow as tf
import os, re, logging
import pandas as pd
model1 = "indobenchmark/indobart-v2"
print(model1)
tokenizer = IndoNLGTokenizer.from_pretrained("indobenchmark/indobart-v2")
model = AutoModelForSeq2SeqLM.from_pretrained(model1)
summarizer = pipeline(
    "summarization", model=model, tokenizer=tokenizer,
    num_beams=5, do_sample=True, no_repeat_ngram_size=3
)
summarizer(
    "some indonesian article",
    min_length=20,
    max_length=144,
)
The code throws this error:
TypeError: decode() got an unexpected keyword argument 'clean_up_tokenization_spaces'
Complete stack trace: err mbart indobenchmark · GitHub. Note that this error shows up in a Jupyter notebook.
I always get an error like this when running inference through pipeline. What’s actually going wrong here?
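From the stack trace it looks like the summarization pipeline forwards clean_up_tokenization_spaces to the tokenizer’s decode(), which the custom IndoNLGTokenizer apparently doesn’t accept. Would a small subclass that drops the unsupported keyword before delegating be a reasonable workaround? This is only a sketch on my part, assuming the custom decode() supports just skip_special_tokens:

class PatchedIndoNLGTokenizer(IndoNLGTokenizer):
    # Sketch: accept and discard extra keyword arguments the pipeline forwards
    # (e.g. clean_up_tokenization_spaces), then delegate to the original decode().
    def decode(self, token_ids, skip_special_tokens=False, **kwargs):
        return super().decode(token_ids, skip_special_tokens=skip_special_tokens)

tokenizer = PatchedIndoNLGTokenizer.from_pretrained("indobenchmark/indobart-v2")

Or is there a proper fix on the pipeline/tokenizer side?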
I also use the following method to generate a summary, and it runs normally:
def sumt5m(model, pr):
    input_ids = tokenizer.encode(pr, max_length=10240, return_tensors='pt')
    summary_ids = model.generate(input_ids,
                                 max_length=100,
                                 num_beams=2,
                                 repetition_penalty=2.5,
                                 length_penalty=1.0,
                                 early_stopping=True,
                                 no_repeat_ngram_size=2,
                                 use_cache=True)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
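For reference, this is how I call it (the article string is just a placeholder):

summary = sumt5m(model, "some indonesian article")
print(summary)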