Can we use a tokenizer from one architecture with a model from another?

I have a BERT tokenizer that is pre-trained on some dataset. Now I want to fine-tune a RoBERTa model on the task at hand. So in this scenario:

  1. Can I use the BERT tokenizer's output as input to a RoBERTa model?
  2. Does such a setup make sense between autoregressive and non-autoregressive models, e.g. using a BERT tokenizer with an XLNet model?
  3. Do these kinds of setups make sense at all?

From what I understand, this can be implemented but doesn't make much sense. Still, I could use some experience or clarification in this direction.

Hi sps,

I think it would be possible to use a BERT tokenizer with a RoBERTa model, but you would have to train the RoBERTa model from scratch. You wouldn't be able to take advantage of transfer learning by starting from a pre-trained RoBERTa checkpoint.
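As a minimal sketch of what that could look like with the transformers library (assuming the public "bert-base-uncased" tokenizer stands in for your own pre-trained tokenizer; the RoBERTa model is built from a fresh config with random weights, not a pre-trained checkpoint):

```python
from transformers import BertTokenizerFast, RobertaConfig, RobertaForMaskedLM

# Stand-in for your own pre-trained BERT tokenizer
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Build a RoBERTa config that matches the tokenizer, then initialize the
# model from scratch (random weights, so no transfer learning)
config = RobertaConfig(
    vocab_size=tokenizer.vocab_size,        # embedding table must match the tokenizer's vocab
    pad_token_id=tokenizer.pad_token_id,    # BERT's [PAD] id, not RoBERTa's default of 1
    bos_token_id=tokenizer.cls_token_id,
    eos_token_id=tokenizer.sep_token_id,
    max_position_embeddings=514,            # leave room for RoBERTa's position-id offset
)
model = RobertaForMaskedLM(config)

# The tokenizer's ids now line up with the (untrained) embedding layer
batch = tokenizer("Some text from your dataset", return_tensors="pt")
outputs = model(**batch)
```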

Why would you want to do that?

You might run into problems with things like the [SEP] and [CLS] tokens, which follow different conventions in BERT and RoBERTa, though I expect you could write some code to deal with that.
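For example, using the public checkpoints just to show the conventions (the exact ids are whatever those vocabularies define):

```python
from transformers import BertTokenizerFast, RobertaTokenizerFast

bert_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
roberta_tok = RobertaTokenizerFast.from_pretrained("roberta-base")

# BERT wraps a sentence as  [CLS] ... [SEP]  and pads with [PAD];
# RoBERTa wraps it as       <s> ... </s>     and pads with <pad>.
print(bert_tok.cls_token, bert_tok.sep_token, bert_tok.pad_token_id)
print(roberta_tok.bos_token, roberta_tok.eos_token, roberta_tok.pad_token_id)

# The simplest fix is to tell the model config which ids the tokenizer
# actually uses (as in the sketch above), rather than remapping the tokenizer.
```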

A tokenizer splits your text up into chunks and replaces each chunk with a numerical id. BERT and RoBERTa do this in different ways (WordPiece vs. byte-level BPE), but that shouldn't make the systems incompatible: any embedding layer should be able to learn to use the ids that come out of a WordPiece, byte-pair encoding or SentencePiece tokenizer.
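You can see the different splitting behaviour directly. The exact pieces depend on the vocabularies, but the point is that both tokenizers end up emitting integer ids:

```python
from transformers import BertTokenizerFast, RobertaTokenizerFast

text = "Tokenizers split text into subword pieces."
bert_tok = BertTokenizerFast.from_pretrained("bert-base-uncased")     # WordPiece
roberta_tok = RobertaTokenizerFast.from_pretrained("roberta-base")    # byte-level BPE

# WordPiece marks word-internal pieces with "##"; byte-level BPE marks
# word-initial pieces with a leading "Ġ" (an encoded space).
print(bert_tok.tokenize(text))
print(roberta_tok.tokenize(text))

# Either way, the model only ever sees integer ids that index an embedding table.
print(bert_tok(text)["input_ids"])
print(roberta_tok(text)["input_ids"])
```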

Have you seen this intro to tokenizers? [Summary of the tokenizers — transformers 4.11.1 documentation]


Yes, I agree with you. This type of setup largely doesn’t make sense.

Actually, I wanted to use an autoregressive model such as XLNet, but I don't have an XLNet model for my specific data, so I don't have an appropriate tokenizer either. I had this wild thought: could I take some pre-existing tokenizer (trained on data similar to mine) and feed its output to an XLNet model? The argument being (as you mentioned) that the layers would learn something anyhow; in the worst case they would learn from scratch, similar to a fine-tuning setting.

But yeah, I also think this can't/shouldn't be done, as we wouldn't know how much worse the results have become.