Hello,
What is the best way to fine-tune or pre-train a model (BERT, T5, BART, or something else) so that it converts text into extracted JSON, e.g.:
My name is John
→ { name: John }
Assuming my data comes in the form of:
# train.tsv
text,extraction
name: Jill,name:Jill
My name is Jack,name:Jack
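For context, the preprocessing that produces the tokenized_datasets used further down is roughly the following (a simplified sketch; the max_length values are placeholders, and the file is actually comma-separated despite the .tsv name, so I load it with the csv builder):

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")  # same checkpoint as below

# Despite the .tsv extension, the file above is comma-separated, so load it as csv.
raw = load_dataset("csv", data_files="train.tsv")["train"]
raw_datasets = raw.train_test_split(test_size=0.1)  # produces "train" and "test" splits

def preprocess(examples):
    # Tokenize the input sentences and the target extraction strings.
    model_inputs = tokenizer(examples["text"], max_length=64, truncation=True)
    labels = tokenizer(text_target=examples["extraction"], max_length=64, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = raw_datasets.map(
    preprocess, batched=True, remove_columns=raw_datasets["train"].column_names
)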
So far I’ve tried:
from transformers import (
    AutoModelForSeq2SeqLM, AutoTokenizer,
    Seq2SeqTrainingArguments, DataCollatorForSeq2Seq, Seq2SeqTrainer,
)

model_checkpoint = "google/mt5-small"
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

# ... preprocessing (sketched above) and compute_metrics skipped for brevity

args = Seq2SeqTrainingArguments(...)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
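After training, my assumption is that inference would look roughly like this (untested sketch):

inputs = tokenizer("My name is John", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# hoped-for output: name:John  (which I would then parse/wrap into JSON downstream)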
Is it a good idea to use an existing Seq2Seq model? Or is there a way to transfer the weights of the Transformer layers and discard the heads?
The reason I ask is that while the encoder weights are clearly valuable, I suspect the decoder weights and LM head may actually be counterproductive, since they were pre-trained on natural language rather than JSON (or any structured format). My intuition is to re-initialize them randomly and train those weights from scratch.
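If that intuition is worth testing, my rough idea (untested; the attribute names are just what mT5 exposes) is to build a randomly initialized model from the config and copy over only the pretrained encoder:

from transformers import AutoConfig, MT5ForConditionalGeneration

checkpoint = "google/mt5-small"
pretrained = MT5ForConditionalGeneration.from_pretrained(checkpoint)

# Fresh model with the same architecture but randomly initialized weights.
config = AutoConfig.from_pretrained(checkpoint)
model = MT5ForConditionalGeneration(config)

# Copy only the pretrained encoder. Its embed_tokens is the shared embedding
# matrix, so the decoder's input embeddings come along too, while the decoder
# blocks stay randomly initialized. mT5 doesn't tie lm_head to the embeddings,
# so the output head stays random as well.
model.encoder.load_state_dict(pretrained.encoder.state_dict())

# `model` would then go into the same Seq2SeqTrainer setup as above.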