How can I use a single BERT model for generation with the 'past_key_values' mechanism?

I really like the rich text generation APIs of this project, especially the 'past_key_values' mechanism, which makes generation efficient. I use UniLM, which sadly is not implemented in huggingface. I'm eager to implement UniLM with the 'past_key_values' mechanism, but I have run into a lot of difficulties.
The structure of UniLM is virtually the same as BERT, except for the type of attention mask. So first I tried 'BertForMaskedLM', but its forward function doesn't support 'past_key_values'. Then I tried 'BertModel', but the shape of the past_key_values it returns is strange, so at the next decoding step, passing the past_key_values back into the generate function raises:

--> 930  past_key_values_length = past_key_values[0][0].shape[2] if past_key_values is not None else 0
IndexError: tuple index out of range
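
The failing line reads the cached length from dimension 2 of the first layer's key tensor, so it appears to expect one (key, value) pair per layer, each shaped (batch, num_heads, seq_len, head_dim). Below is a minimal sketch of that check, assuming this tuple layout. The names `cached_length` and `fake_tensor` are hypothetical and are not part of transformers; `fake_tensor` only stands in for a real tensor so the snippet runs without torch.

```python
from types import SimpleNamespace


def cached_length(past_key_values):
    """Return the number of cached positions, or 0 when no usable cache exists."""
    # None, an empty tuple, or an empty first entry would all make
    # past_key_values[0][0] fail, so treat them as "nothing cached".
    if not past_key_values or not past_key_values[0]:
        return 0
    # Assumed layout: one (key, value) pair per layer, each shaped
    # (batch, num_heads, seq_len, head_dim); seq_len is dimension 2.
    return past_key_values[0][0].shape[2]


def fake_tensor(*shape):
    """Hypothetical stand-in for a tensor: only carries a .shape tuple."""
    return SimpleNamespace(shape=shape)


# Two layers, each caching 5 positions for 12 heads of size 64.
two_layers = tuple(
    (fake_tensor(1, 12, 5, 64), fake_tensor(1, 12, 5, 64)) for _ in range(2)
)
```

Printing `cached_length(...)` on whatever 'BertModel' returns before passing it on may help show whether the cache is empty or just shaped differently.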

What is the simplest way to implement UniLM so that it can use the rich text generation APIs, especially the 'past_key_values' mechanism? Please help me, thank you very much!
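
For context, the attention mask difference mentioned above can be sketched as follows for the sequence-to-sequence setting described in the UniLM paper: source tokens attend to the whole source segment, and each target token attends to the source plus the target tokens up to and including itself. This is only an illustration in plain Python; the function name `unilm_seq2seq_mask` is hypothetical, and a real implementation would build the mask as a tensor.

```python
def unilm_seq2seq_mask(src_len, tgt_len):
    """Build a (src_len + tgt_len) square mask; 1 means "query row may attend to key column"."""
    total = src_len + tgt_len
    mask = [[0] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < src_len:
                # Every position, source or target, sees the full source segment.
                mask[q][k] = 1
            elif q >= src_len and k <= q:
                # Target positions see target tokens up to and including themselves.
                mask[q][k] = 1
    return mask


# Example: 2 source tokens followed by 2 target tokens.
for row in unilm_seq2seq_mask(2, 2):
    print(row)
```

With this pattern the source rows never depend on the target, which is why caching their keys and values once and decoding the target step by step seems feasible in principle.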

Any updates?