I am trying to fine-tune a T5 model for keyword generation.
Because of the way I created my dataset (extracting keywords from a summary of the actual text), the gold keywords might not be present in the actual text. For this reason, a token classification task would not work, hence my idea of using text-to-text generation.
Ideally this would work like:
>>> input_text = "question: What are the keywords? context: [DOCUMENT]"
>>> inputs = tokenizer(input_text, return_tensors="pt")
>>> outputs = model.generate(**inputs)
>>> print(tokenizer.decode(outputs[0], skip_special_tokens=True))
finance - crisis - stocks
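For fine-tuning, the source/target pairs could be built roughly as follows. This is only a minimal sketch: `build_example` and `parse_keywords` are hypothetical helper names, and the `" - "` separator is simply the one used in the desired output above. The tokenization and training loop are left out.

```python
from typing import List, Tuple

SEPARATOR = " - "  # assumed separator, taken from the desired output format


def build_example(document: str, keywords: List[str]) -> Tuple[str, str]:
    """Turn one (document, gold keywords) pair into source and target strings."""
    source = f"question: What are the keywords? context: {document}"
    target = SEPARATOR.join(keywords)
    return source, target


def parse_keywords(generated: str) -> List[str]:
    """Split a generated string back into a list of keywords.

    Splitting on the separator including its surrounding spaces keeps
    hyphenated keywords such as "text-to-text" intact.
    """
    return [kw.strip() for kw in generated.split(SEPARATOR) if kw.strip()]


source, target = build_example(
    "Stocks fell sharply during the crisis.",
    ["finance", "crisis", "stocks"],
)
print(source)
print(target)
print(parse_keywords(target))
```

Since the target is just a separator-joined string, the model only has to learn that output format during fine-tuning, and the generated string can be split back into a list afterwards.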
My main concern is that language models are trained to generate syntactically correct sentences, whereas my target is just a list of keywords. Would this be problematic?
Has anyone tried a similar task, or does anyone have suggestions on how I could tackle this problem from a different angle?