hi nbroad,
[I am not an expert]
I think it depends on the specifics of your data. I have a similar issue with mine. For example, some of my texts include repeated # characters or “l@@k”, designed to catch a viewer’s eye. I decided to delete this kind of thing, because it isn’t really language, and it is likely that BERT didn’t often see it during training. All it tells me (or BERT) about the text is that the writer was trying to catch some viewers’ attention.
It’s a bit tricky, because some special characters do carry meaning in some contexts. For example, “p/x” for “part exchange” might be frequent enough to have some meaning to BERT.
As a compromise, when I cleaned my data, I deleted all occurrences of | # * ] [ \ . Then I kept single occurrences of ! ( ) - ? , £ / +, but deleted any repeated occurrences.
In my case, it wasn’t necessary to remove the whole sentence if it contained “######”, because the text that remained after removing the offending “######” still made a meaningful sentence.
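For what it's worth, here is a minimal sketch of that kind of cleaning in Python. The name `clean_text` is just something I made up for illustration, and replacing deleted characters with a space (rather than with nothing) is my own assumption, so that words on either side don't get glued together. The character lists are the ones from the compromise above, and I have treated the backslash as the last character in the "delete" list.

```python
import re

# Characters that are deleted wherever they occur: | # * ] [ \
DELETE_CHARS = re.compile(r"[|#*\[\]\\]")
# Characters that are kept once; a run of the same character collapses to one.
COLLAPSE_RUNS = re.compile(r"([!()\-?,£/+])\1+")
WHITESPACE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip eye-catching symbol noise but keep the rest of the sentence."""
    # Assumption: swap deleted characters for a space so neighbouring words stay apart.
    text = DELETE_CHARS.sub(" ", text)
    # "!!!" becomes "!", "£££" becomes "£", while a lone "/" in "p/x" is untouched.
    text = COLLAPSE_RUNS.sub(r"\1", text)
    # Tidy up the extra spaces left behind by the deletions.
    return WHITESPACE.sub(" ", text).strip()


print(clean_text("###### Great car!!! p/x welcome ######"))
# Great car! p/x welcome
```

Note that this only touches the characters listed, so something like “l@@k” would need its own rule.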
I haven’t yet decided what to do about numbers. It might be that BERT is able to make some sense of values such as 1984 or £2000, even when its WordPiece tokenizer has to split a number into several sub-word pieces rather than keep it as a single token. One thing I have recently realised is that my data include numbers with commas in (e.g. £2,000). BERT’s tokenizer splits on punctuation, so the comma breaks the number into separate pieces, and I’m pretty sure it would get a better representation if I removed the commas (i.e. cleaned it to £2000).
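Removing the commas can be done with one regular expression. This is only a sketch: `strip_thousands_commas` is a name I have invented, and the rule (drop a comma that sits between a digit and a group of exactly three digits) is a heuristic I am assuming is good enough, not something I have tested on real data.

```python
import re

# A comma with a digit before it and exactly three digits after it.
THOUSANDS_COMMA = re.compile(r"(?<=\d),(?=\d{3}(?!\d))")


def strip_thousands_commas(text: str) -> str:
    """Turn '£2,000' into '£2000' while leaving ordinary commas alone."""
    return THOUSANDS_COMMA.sub("", text)


print(strip_thousands_commas("£2,000 ono, built in 1984, 1,234,567 miles"))
# £2000 ono, built in 1984, 1234567 miles
```

It will get things wrong if two separate numbers are written with a comma and no space between them, so it is worth checking a sample of the output by eye.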
I don’t think it would be right to remove numbers altogether, but I’m starting to wonder if it would be useful to replace numbers with descriptors, such as “a few / lots / hundreds / thousands / millions / billions / recent date / historical date”. In some cases, it might be necessary to extract the actual numbers and include them as separate features.
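To make the descriptor idea a bit more concrete, here is a rough sketch. Everything specific in it is an assumption on my part: the function names, the size thresholds, the rule that a bare four-digit number between 1000 and 2100 is a year, and the cut-off of 2000 for "recent". It assumes thousands commas have already been removed, and it returns the original numbers as well, so they can be used as separate features.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def describe_number(value: float) -> str:
    """Map a number to a coarse size descriptor (thresholds are arbitrary)."""
    if value < 10:
        return "a few"
    if value < 100:
        return "lots"
    if value < 1_000:
        return "hundreds"
    if value < 1_000_000:
        return "thousands"
    if value < 1_000_000_000:
        return "millions"
    return "billions"


def replace_numbers(text: str, recent_from: int = 2000):
    """Return (text with numbers replaced by descriptors, list of the numbers)."""
    extracted = []

    def _substitute(match):
        raw = match.group(0)
        value = float(raw)
        extracted.append(value)  # keep the real value for use as a feature
        # Assumption: a four-digit integer in 1000..2100 with no "£" in front is a year.
        looks_like_year = raw.isdigit() and len(raw) == 4 and 1000 <= value <= 2100
        after_pound = match.start() > 0 and text[match.start() - 1] == "£"
        if looks_like_year and not after_pound:
            return "recent date" if value >= recent_from else "historical date"
        return describe_number(value)

    return NUMBER.sub(_substitute, text), extracted


print(replace_numbers("built in 1984, sold for £2000"))
# ('built in historical date, sold for £thousands', [1984.0, 2000.0])
```

Whether the replaced text actually helps the model is exactly the kind of thing that would need testing against the untouched numbers.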
So far as I know, the data that BERT was trained on wasn’t purged of special characters. I think it is likely that BERT will do a good job on data that is similar in style to the data it was trained on (books and Wikipedia articles).
As usual, if in doubt: try it out.