I want to do binary text classification, and of course I would prefer to take a transformer model that has already been pretrained on English and fine-tune it on my data.
But in my classification task, it makes a big difference whether special characters are present or not. For example, the text “i like dogs” has label 0, while the text “i like “dogs”” has label 1.
The problem is that every pretrained language model I have found so far ignores special characters in its tokenizer in one way or another (either they are stripped out, or they are mapped to the unk token, etc.).
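
To make the problem concrete, here is a minimal sketch of the kind of check I mean. The two toy tokenizers are hypothetical stand-ins written only for illustration (they are not taken from any real library), and `lost_characters` is a helper name I made up. It accepts any function that turns a string into a list of token strings, so a real tokenizer's tokenize function could presumably be passed in the same way.

```python
import re
from typing import Callable, List


def lost_characters(text: str, tokenize: Callable[[str], List[str]]) -> List[str]:
    """Return the special characters of `text` that appear in none of the tokens."""
    surviving = set("".join(tokenize(text)))
    return [
        ch
        for ch in text
        if not ch.isalnum() and not ch.isspace() and ch not in surviving
    ]


def toy_cleaning_tokenizer(text: str) -> List[str]:
    # Hypothetical stand-in for a tokenizer that cleans away punctuation.
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_keeping_tokenizer(text: str) -> List[str]:
    # Hypothetical stand-in that keeps each punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)


plain = "i like dogs"
quoted = 'i like "dogs"'

# With the cleaning tokenizer, both texts collapse to the same tokens,
# so a classifier on top of it cannot tell label 0 from label 1.
print(toy_cleaning_tokenizer(plain) == toy_cleaning_tokenizer(quoted))  # True
print(lost_characters(quoted, toy_cleaning_tokenizer))  # ['"', '"']
print(lost_characters(quoted, toy_keeping_tokenizer))   # []
```

A tokenizer suits this task only if the two example texts produce different token sequences, which is what the first check tests.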
Do you know of a suitable model?