How much cleaning for transformers?

I know that BERT has tokens for numbers, punctuation, and special characters (e.g. #@!%). If I’m training a language model, should I

  1. Keep numbers, punctuation, and special characters
  2. Remove only the aforementioned characters, leaving the rest of the sentence untouched
  3. Remove the whole sentence if it contains any of those

Hi nbroad,

[I am not an expert]

I think it depends on the specifics of your data. I have a similar issue with mine. For example, some of my texts include repeated # characters or “l@@k”, designed to catch a viewer’s eye. I decided to delete this kind of thing, because it isn’t really language, and Bert probably didn’t see it often during training. All it tells me (or Bert) is that the writer was trying to catch a viewer’s attention; it doesn’t say much else about the text.

It’s a bit tricky, because some special characters might have some meaning in some contexts, for example “p/x” for “part exchange” might be frequent enough to have some meaning to Bert.

As a compromise, when I cleaned my data, I deleted all occurrences of | # * ] [ \ . Then I kept single occurrences of ! ( ) - ? , £ / +, but deleted any repeated occurrences.

In my case, it wasn’t necessary to remove the whole sentence if it contained “######”, because the text that remained after removing the offending “######” still made a meaningful sentence.
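
The rules above could be sketched in Python roughly as follows. This is only an illustration, not the exact code used: the function name `clean_text` is made up, and reading "deleted any repeated occurrences" as "drop the whole run" (rather than collapsing it to a single character) is an assumption.

```python
import re

# Characters deleted wherever they occur: | # * ] [ \
ALWAYS_DELETE = re.compile(r"[|#*\]\[\\]")

# Characters that are kept when single. A run of two or more of the same
# character is removed entirely (assumption; use r"\1" as the replacement
# instead of "" to collapse a run down to one character).
REPEATED = re.compile(r"([!()\-?,£/+])\1+")


def clean_text(text: str) -> str:
    """Apply both rules, then tidy up the whitespace left behind."""
    text = ALWAYS_DELETE.sub("", text)
    text = REPEATED.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("###### Great car!!! p/x considered, call now!"))
```

Single characters such as the "/" in "p/x" survive, so abbreviations that might carry meaning are left alone.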

I haven’t yet decided what to do about numbers. It might be that Bert is able to make some sense of values such as 1984 or £2000, even when its WordPiece tokenizer has to split them into several pieces rather than keeping each number as a single token. One thing I have recently realised is that my data include numbers with commas in (e.g. £2,000), and I’m pretty sure they would get a better representation if I removed the commas (i.e. cleaned them to £2000).
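
A minimal sketch of the comma clean-up, assuming the commas are always thousands separators (a comma between a digit and a group of exactly three digits). The function name `strip_thousands_commas` is hypothetical.

```python
import re

# A comma preceded by a digit and followed by exactly three digits.
# Ordinary commas ("in 1984, 2000 people") are followed by a space,
# so they are left untouched.
THOUSANDS_COMMA = re.compile(r"(?<=\d),(?=\d{3}(?!\d))")


def strip_thousands_commas(text: str) -> str:
    """Turn '£2,000' into '£2000' without touching other commas."""
    return THOUSANDS_COMMA.sub("", text)


print(strip_thousands_commas("Yours for £2,000, or £1,250,000 for the lot"))
```

A list written without spaces, such as "100,200", would also be merged, so it is worth checking a sample of the data first.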

I don’t think it would be right to remove numbers altogether, but I’m starting to wonder if it would be useful to replace numbers with descriptors, such as " a few / lots / hundreds / thousands / millions / billions / recent date / historical date ". In some cases, it might be necessary to extract the actual numbers and include them as separate features.
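
One way the descriptor idea might look in Python. Everything specific here is an assumption made for illustration: the bucket thresholds, the rule that a bare four-digit number between 1000 and 2099 is a year, the year 2000 as the cut-off for "recent", and the function names. It also assumes thousands commas have already been removed.

```python
import re

# Optional pound sign followed by a run of digits.
NUMBER = re.compile(r"(£?)(\d+)")

RECENT_YEAR_START = 2000  # assumed boundary between "historical" and "recent"


def describe(match):
    """Map one matched number to a coarse descriptor."""
    currency, digits = match.group(1), match.group(2)
    value = int(digits)
    # Heuristic: a four-digit number with no currency sign is treated as a year.
    if not currency and len(digits) == 4 and 1000 <= value <= 2099:
        return "recent date" if value >= RECENT_YEAR_START else "historical date"
    if value < 10:
        word = "a few"
    elif value < 100:
        word = "lots"
    elif value < 1_000:
        word = "hundreds"
    elif value < 1_000_000:
        word = "thousands"
    elif value < 1_000_000_000:
        word = "millions"
    else:
        word = "billions"
    return currency + word


def numbers_to_descriptors(text: str) -> str:
    return NUMBER.sub(describe, text)


print(numbers_to_descriptors("5 doors, 1984 model, £2000"))
```

If the exact values matter for the task, the same regular expression could be used to pull them out as separate features before they are replaced.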

So far as I know, the data that Bert was trained on wasn’t purged of special characters. I think it is likely that Bert will do a good job on data that is similar in style to the data it was trained on (books and Wikipedia articles).

As usual, if in doubt: try it out.


Hi @rgwatwormhill,

Thank you for the response. Even though you preface your post by claiming you are not an expert, I appreciate your post nonetheless.

I asked the forum a question earlier (it went unanswered) about replacing special characters or bizarre strings with understandable ones. For instance, I figured a hyperlink that goes https://www… would not make much sense to Bert, but the word hyperlink might. This is similar to what you were saying: replacing the numbers with words might have a positive impact.
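
A rough sketch of the hyperlink replacement in Python. The pattern is deliberately loose (anything starting with http://, https:// or www. up to the next whitespace) and is an assumption rather than a complete URL matcher; `replace_urls` and the example address are made up for illustration.

```python
import re

# Match a URL-like string, but do not swallow sentence punctuation
# that directly follows it: the last matched character must not be
# one of . , ; : ! ? )
URL = re.compile(r"(?:https?://|www\.)\S*[^\s.,;:!?)]")


def replace_urls(text: str, placeholder: str = "hyperlink") -> str:
    """Swap every URL-like string for a plain word."""
    return URL.sub(placeholder, text)


print(replace_urls("See https://example.com/page for details."))
```

Running the cleaned and uncleaned versions through the same training setup would show whether the placeholder actually helps.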

I think the only way to find this out will be to try it both ways and see what happens. I’m sure there are a few people out there that have conclusive answers to this, but I’m not sure they will respond to this thread.

I recently stumbled on this video that came out almost a year ago. It covers WordPiece embeddings, and it also goes over what types of tokens appear in the vocabulary. See here:

It confirms that there are characters from non-English languages, symbols, numbers, and other special characters in the vocabulary of the base version of BERT.

My question was more about training my own model with MLM (masked language modelling): how much should I remove? I’ll probably just have to try and see.
