What does the decoder with past values mean?

Those three parts consist of the encoder, the “decoder” (which actually consists of the decoder with the language modeling head), and the “decoder” with pre-computed key/values as additional inputs. This specific export comes from the fact that during the first pass, the decoder has no pre-computed key/values hidden-states, while during the rest of the generation past key/values will be used to speed up sequential decoding.

Can you explain in detail the difference between the decoder with the LM head and the decoder with pre-computed key/values? I find both of them very confusing. It is mentioned in the passage quoted above.

I would also like to know where decoder_with_LM_Head and decoder_with_Past_head are used during inference.

Hi @Rkoy,

Since #241, it is possible to export only one decoder: this decoder does not take pre-computed key/values as inputs. As a result, the past_key_values are recomputed at each generation step. To enable this export you only need to set use_cache to False when calling the from_pretrained method. To speed up decoding by reusing the key/values hidden-states that were already computed in the previous generation steps, you need to export a second decoder that takes the pre-computed key/values as additional inputs.
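To make it concrete where each decoder is used during inference, here is a toy sketch of the generation loop in plain Python. It is not the library's code: `key_value`, `next_token`, `decoder`, `decoder_with_past` and `generate` are hypothetical stand-ins, and the arithmetic only mimics the shape of the computation (one cached entry per token). The point is the control flow: the plain decoder runs on the first step (or on every step when use_cache is False), and the decoder with past runs on all later steps.

```python
from typing import List, Optional, Tuple

VOCAB_SIZE = 10
EOS = 0


def key_value(token: int, position: int) -> int:
    # Toy stand-in for the key/value projection of one token.
    return (token * 7 + position * 3) % VOCAB_SIZE


def next_token(past: List[int], encoder_state: int) -> int:
    # Toy stand-in for attention over all key/values followed by the LM head.
    return (sum(past) + encoder_state) % VOCAB_SIZE


def decoder(input_ids: List[int], encoder_state: int) -> Tuple[int, List[int]]:
    # No past inputs: key/values are computed for every token seen so far.
    past = [key_value(tok, pos) for pos, tok in enumerate(input_ids)]
    return next_token(past, encoder_state), past


def decoder_with_past(
    last_token: int, past: List[int], encoder_state: int
) -> Tuple[int, List[int]]:
    # Past inputs given: only the newest token's key/value is computed.
    past = past + [key_value(last_token, len(past))]
    return next_token(past, encoder_state), past


def generate(
    encoder_state: int, start_token: int, max_new_tokens: int, use_cache: bool
) -> List[int]:
    ids = [start_token]
    past: Optional[List[int]] = None
    for _ in range(max_new_tokens):
        if not use_cache:
            # Single exported decoder: recompute everything at each step.
            token, _ = decoder(ids, encoder_state)
        elif past is None:
            # First pass: nothing is cached yet, so use the plain decoder.
            token, past = decoder(ids, encoder_state)
        else:
            # Later passes: feed only the last token plus the cache.
            token, past = decoder_with_past(ids[-1], past, encoder_state)
        ids.append(token)
        if token == EOS:
            break
    return ids
```

In this sketch both settings produce the same tokens; the cached path just avoids recomputing the key/values of earlier tokens, which is where the speed-up comes from.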
