Keyword generation using T5

Hey folks,

I am trying to fine-tune T5 for keyword generation.

Because of how I created my dataset (extracting keywords from a summary of the actual text), the gold keywords might not appear in the actual text. For this reason, a token classification task would not work, hence my idea of text-to-text generation!

Ideally this would work like:

>>> # tokenizer and model are the fine-tuned T5 tokenizer and model
>>> input_text = "question: What are the keywords? context: [DOCUMENT]"
>>> inputs = tokenizer(input_text, return_tensors="pt")
>>> outputs = model.generate(**inputs)
>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
finance - crisis - stocks
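
To make the data format concrete, here is roughly how I imagine building each training pair and reading the generated string back into a list. This is only a minimal sketch: the helper names `build_example` and `parse_keywords` and the " - " separator are my own assumptions for illustration, not part of any library.

```python
SEPARATOR = " - "  # assumed delimiter, matching the desired output format


def build_example(document, keywords):
    """Turn one (document, gold keywords) pair into source/target strings."""
    source = f"question: What are the keywords? context: {document}"
    target = SEPARATOR.join(keywords)
    return {"source": source, "target": target}


def parse_keywords(generated):
    """Split a generated string into unique keywords, preserving order."""
    keywords = []
    for part in generated.split(SEPARATOR):
        keyword = part.strip()
        # Skip empty fragments and keywords the model repeated.
        if keyword and keyword not in keywords:
            keywords.append(keyword)
    return keywords


example = build_example("Stocks fell sharply.", ["finance", "crisis", "stocks"])
print(example["target"])
print(parse_keywords("finance - crisis - stocks - crisis"))
```

Splitting on the full " - " separator rather than a bare hyphen should keep hyphenated keywords such as "long-term debt" intact.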

My main concern is that LMs are trained to generate syntactically correct sentences, whereas my target is just a delimited list of keywords. Would this be problematic?
Has anyone tried a similar task, or does anyone have suggestions on how I could tackle this problem from a different angle? 🤓

Cheers!