Training a new VisionEncoderDecoder model for a new language

I am trying to train a new VisionEncoderDecoder model for a new language (Indonesian, i.e. Bahasa Indonesia).
The initial performance is pretty bad (72% CER). I would appreciate advice from anyone who has tried a similar task.

Based on my understanding, Indonesian uses the Latin alphabet (without special characters such as the German Eszett, ß). So my plan is as follows:

  1. Encoder: pretrained ViT model ("google/vit-base-patch16-224")
  2. Decoder: pretrained Indonesian RoBERTa language model ("cahya/roberta-base-indonesian-1.5G")

I have a few questions about initializing the decoder weights from the pretrained RoBERTa model.
The parameter names in the RoBERTa model differ from the decoder's parameter names, so some mapping is needed. I did the following steps, and I am wondering whether they contain any errors.


    from transformers import RobertaForCausalLM, TrOCRConfig, TrOCRForCausalLM, ViTModel

    def load_wts(decoder, lm):
        # Suffix mapping: RoBERTa parameter name -> TrOCR decoder parameter name.
        param_name_dict = {
            "attention.self.query.weight": "self_attn.q_proj.weight",
            "attention.self.query.bias": "self_attn.q_proj.bias",
            "attention.self.key.weight": "self_attn.k_proj.weight",
            "attention.self.key.bias": "self_attn.k_proj.bias",
            "attention.self.value.weight": "self_attn.v_proj.weight",
            "attention.self.value.bias": "self_attn.v_proj.bias",
            "attention.output.dense.weight": "self_attn.out_proj.weight",
            "attention.output.dense.bias": "self_attn.out_proj.bias",
            "attention.output.LayerNorm.weight": "self_attn_layer_norm.weight",
            "attention.output.LayerNorm.bias": "self_attn_layer_norm.bias",
            "output.LayerNorm.weight": "final_layer_norm.weight",
            "position_embeddings.weight": "embed_positions.weight",
            "LayerNorm.bias": "layernorm_embedding.bias",
        }
        wts = lm.state_dict()
        new_wts = {}
        for key in wts.keys():
            # rename_param (not shown) returns the decoder name for a RoBERTa key,
            # or a falsy value when the key has no counterpart in the decoder.
            nkey = rename_param(key)
            if nkey:
                new_wts[nkey] = wts[key]
        # strict=False: decoder parameters without a match keep their initial values.
        decoder.load_state_dict(new_wts, strict=False)
        return decoder

    encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")

    lmconfig = TrOCRConfig(vocab_size=indonesian_lm_vocab_size, …, )  # set the remaining fields to match the Indonesian LM
    decoder = TrOCRForCausalLM(lmconfig)
    lm = RobertaForCausalLM.from_pretrained("cahya/roberta-base-indonesian-1.5G")
    decoder = load_wts(decoder, lm)
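
The `rename_param` helper called in `load_wts` is not shown above. The following is only a minimal sketch of the kind of renaming it is meant to do, not necessarily the exact helper. It assumes that the RoBERTa keys look like `roberta.encoder.layer.<i>.<suffix>` and `roberta.embeddings.<suffix>`, and that the decoder keys look like `model.decoder.layers.<i>.<suffix>` and `model.decoder.<suffix>`. Those prefixes, and the names `LAYER_SUFFIXES` and `EMBEDDING_SUFFIXES`, are assumptions that should be checked against the output of `state_dict().keys()` on both models. The tables contain only the entries listed in the post.

```python
import re

# Per-layer suffixes, copied from the mapping table in the post.
LAYER_SUFFIXES = {
    "attention.self.query.weight": "self_attn.q_proj.weight",
    "attention.self.query.bias": "self_attn.q_proj.bias",
    "attention.self.key.weight": "self_attn.k_proj.weight",
    "attention.self.key.bias": "self_attn.k_proj.bias",
    "attention.self.value.weight": "self_attn.v_proj.weight",
    "attention.self.value.bias": "self_attn.v_proj.bias",
    "attention.output.dense.weight": "self_attn.out_proj.weight",
    "attention.output.dense.bias": "self_attn.out_proj.bias",
    "attention.output.LayerNorm.weight": "self_attn_layer_norm.weight",
    "attention.output.LayerNorm.bias": "self_attn_layer_norm.bias",
    "output.LayerNorm.weight": "final_layer_norm.weight",
}

# Embedding-level suffixes, also copied from the post.
EMBEDDING_SUFFIXES = {
    "position_embeddings.weight": "embed_positions.weight",
    "LayerNorm.bias": "layernorm_embedding.bias",
}

# Assumed key layout on the RoBERTa side (verify against the real state_dict).
LAYER_KEY = re.compile(r"^roberta\.encoder\.layer\.(\d+)\.(.+)$")
EMBED_PREFIX = "roberta.embeddings."


def rename_param(key):
    """Map a RoBERTa key to an assumed decoder key, or return None if unmapped."""
    match = LAYER_KEY.match(key)
    if match:
        layer_idx, suffix = match.groups()
        target = LAYER_SUFFIXES.get(suffix)  # exact suffix lookup, not substring
        return f"model.decoder.layers.{layer_idx}.{target}" if target else None
    if key.startswith(EMBED_PREFIX):
        target = EMBEDDING_SUFFIXES.get(key[len(EMBED_PREFIX):])
        return f"model.decoder.{target}" if target else None
    return None


print(rename_param("roberta.encoder.layer.3.attention.self.query.weight"))
print(rename_param("roberta.encoder.layer.0.output.LayerNorm.weight"))
print(rename_param("lm_head.dense.weight"))
```

The exact-suffix lookup is deliberate: with substring matching, `output.LayerNorm.weight` would also match `attention.output.LayerNorm.weight`, and `LayerNorm.bias` would match every layer norm bias in the model.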

As the paper mentions, this does not initialize the weights of the encoder-decoder attention (cross-attention) layers,
since they do not exist in the RoBERTa language model.
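
One way to check this is to compare the decoder's parameter names with the names that were actually loaded. The sketch below uses a hypothetical helper, `split_missing_keys` (my own name, not a library function), that works on plain lists of key names. It assumes that cross-attention parameters contain `.encoder_attn` in their names, which should be verified against `decoder.state_dict().keys()`. The key names in the example are toy values for illustration only.

```python
def split_missing_keys(decoder_keys, loaded_keys, cross_attn_marker=".encoder_attn"):
    """Split decoder keys that received no weights into two sorted lists:
    cross-attention keys (expected to be missing) and all other keys."""
    missing = sorted(set(decoder_keys) - set(loaded_keys))
    cross = [k for k in missing if cross_attn_marker in k]
    other = [k for k in missing if cross_attn_marker not in k]
    return cross, other


# Toy key names, not taken from a real checkpoint.
decoder_keys = [
    "model.decoder.layers.0.self_attn.q_proj.weight",
    "model.decoder.layers.0.encoder_attn.q_proj.weight",
    "model.decoder.layers.0.encoder_attn_layer_norm.weight",
    "model.decoder.layers.0.fc1.weight",
]
loaded_keys = ["model.decoder.layers.0.self_attn.q_proj.weight"]

cross, other = split_missing_keys(decoder_keys, loaded_keys)
print("cross-attention keys left at their initial values:", cross)
print("other keys left at their initial values:", other)
```

With the real models, `decoder.load_state_dict(new_wts, strict=False)` returns an object with `missing_keys` and `unexpected_keys` attributes, so `missing_keys` could be inspected directly instead of computing the set difference. Anything in the second list would point to a gap in the name mapping rather than to the missing cross-attention layers.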

If you have any advice, please let me know. Thank you very much.