How to combine TF-IDF weights with transformers?

Hi, I’m trying to implement the approach suggested by this paper. In their words:

“we apply the TF-IDF score in the BERT mask layer, making the different attention score for the embedding crossing”
“through the attention mechanism of BERT model, we converted the distance between two words at any position to 1, which effectively solves the difficult long-term dependence problem in NLP. So we can directly use the feature representation of BERT as the word embedding feature of the following task.”

I can’t figure out how to implement this. Pretty much every example of doing arithmetic on model outputs operates on the hidden states or the pooled output, and I’m not sure whether TF-IDF weighting makes any sense there.
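The paper's wording is ambiguous, but one common reading is to use TF-IDF scores as per-token weights when pooling BERT's token representations into a sentence vector, instead of mean pooling or the `[CLS]` output. Below is a minimal, dependency-free sketch of that idea: the corpus, the toy per-token vectors, and all function names are my own assumptions standing in for real transformer hidden states, not the paper's actual implementation.

```python
import math

# Hypothetical toy corpus of pre-tokenized documents. In practice these
# would be the tokenized inputs you feed to BERT.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
]

def idf(corpus):
    """Smoothed inverse document frequency per token."""
    n = len(corpus)
    df = {}
    for doc in corpus:
        for tok in set(doc):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log(n / c) + 1.0 for tok, c in df.items()}

def tfidf_weights(doc, idf_scores):
    """Per-token TF-IDF scores, normalised to sum to 1 so they can be
    used as pooling weights."""
    tf = {tok: doc.count(tok) / len(doc) for tok in doc}
    raw = [tf[tok] * idf_scores[tok] for tok in doc]
    total = sum(raw)
    return [w / total for w in raw]

def weighted_pool(token_vectors, weights):
    """TF-IDF-weighted average of per-token vectors.
    token_vectors would be BERT's last hidden states for the sentence."""
    dim = len(token_vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, token_vectors))
            for i in range(dim)]

idf_scores = idf(corpus)
doc = corpus[0]
weights = tfidf_weights(doc, idf_scores)

# Stand-in 4-dim "hidden states", one per token, purely for illustration.
vectors = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]
sentence_vec = weighted_pool(vectors, weights)
```

With real models you would replace `vectors` with `outputs.last_hidden_state` from a Hugging Face `AutoModel`, aligning TF-IDF scores to the tokenizer's subword tokens (one practical wrinkle: a word's TF-IDF score has to be shared or split across its subword pieces). Injecting the scores directly into the attention computation, as the quoted "mask layer" sentence might suggest, would instead require modifying the model's attention-mask/bias inputs.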

I am also interested in the same question. Any luck with this @ebrky ?