Finetune BERT for information extraction

Hello,

What is the best way to fine-tune/pre-train a model (BERT, T5, BART, or something else) so that it converts text into extracted JSON, e.g.:

My name is John → { name: John }

Assuming my data comes in the form of:

# train.tsv
text,extraction
name:   Jill,name:Jill
My name is Jack,name:Jack
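
Loading pairs like this into a Dataset seems straightforward, roughly along these lines (the file name, separator, and the 10% test split are just placeholders):

import pandas as pd
from datasets import Dataset

# Read the text/extraction pairs; sep="\t" if the file is really
# tab-separated, sep="," if it looks like the sample above.
df = pd.read_csv("train.tsv", sep=",")

# Hold out a small split for evaluation (size picked arbitrarily).
raw_datasets = Dataset.from_pandas(df).train_test_split(test_size=0.1)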

So far I’ve tried:

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_checkpoint = "google/mt5-small"
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# ... skipping for brevity

args = Seq2SeqTrainingArguments(...)

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
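
The part I'm skipping above is essentially the standard seq2seq tokenization, roughly like this (column names from the data sample above, max lengths arbitrary):

max_input_length = 128    # arbitrary
max_target_length = 64    # arbitrary

def preprocess(batch):
    # Tokenize the input text and the target extraction string.
    model_inputs = tokenizer(batch["text"], max_length=max_input_length, truncation=True)
    labels = tokenizer(text_target=batch["extraction"], max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = raw_datasets.map(preprocess, batched=True, remove_columns=["text", "extraction"])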

Is it a good idea to use an existing Seq2Seq model? Or is there a way to transfer the weights of the Transformer layers and discard the heads?

The reason I ask is that, while the encoder weights are clearly valuable, I feel the decoder weights and LM head may actually be counterproductive, since they were pre-trained on natural language rather than JSON (or any other structured output). So my intuition would be to re-initialize them randomly and train them from scratch.
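
Concretely, what I imagine is something like the following sketch (no idea if this is the right way to go about it): load the pretrained checkpoint, build a second, randomly initialized model from the same config, and copy over only the encoder weights.

from transformers import AutoConfig, AutoModelForSeq2SeqLM

checkpoint = "google/mt5-small"
pretrained = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# Fresh model of the same architecture, every weight randomly initialized.
config = AutoConfig.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_config(config)

# Transfer only the pretrained encoder; its state dict also carries the shared
# input embeddings, while the decoder blocks and lm_head stay random.
model.encoder.load_state_dict(pretrained.encoder.state_dict())

But I don't know whether discarding the decoder like this actually helps in practice, or whether it just makes training slower, hence the question.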