Hi,
I’d like to run some fine-tuning experiments with this German BART model, but am finding it difficult to even get started due to the lack of documentation.
From what I can tell, the model is configured as FSMTForConditionalGeneration, which requires language tags to be specified when loading the tokenizer. My naïve guess would be to specify something like ['de', 'de'] (for German) or ['src', 'tgt']; however, doing either of these simply maps any input text to a sequence of identical token IDs (presumably unknown tokens). Below is a minimal example.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# my attempt at passing language tags when loading the tokenizer
tokenizer = AutoTokenizer.from_pretrained("timo/timo-BART-german", ['de', 'de'])

text = "Meine Freunde sind nett aber sie essen zu viel Kuchen."
input_ids = tokenizer([text], add_special_tokens=False, return_tensors='pt')['input_ids']
print(input_ids)
# tensor([[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
#          3, 3, 3, 3, 3, 3, 3, 3, 3, 3]])
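If the tags are instead meant to be passed to the tokenizer as a keyword argument, I'd expect something like the sketch below, but that's purely a guess on my part: I'm assuming the parameter is called langs and that FSMTTokenizer forwards it from from_pretrained, so please correct me if that's off.

from transformers import FSMTTokenizer

# assumption on my part: language tags supplied via a `langs` keyword argument
tokenizer = FSMTTokenizer.from_pretrained("timo/timo-BART-german", langs=["de", "de"])

text = "Meine Freunde sind nett aber sie essen zu viel Kuchen."
print(tokenizer([text], return_tensors="pt")["input_ids"])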
Is anyone able to point me in the direction of a good tutorial/guide on how to get started with community models? Or better yet, @timo, any chance of providing a model card for this model to give an idea of its status/usability?
Thanks in advance!