How much cleaning for transformers?

I know that BERT has tokens for numbers, punctuation, and special characters (e.g. #@!%). If I’m training a language model, should I

  1. Keep numbers, punctuation, and special characters
  2. Remove only the aforementioned characters, leaving the rest of the sentence untouched
  3. Remove the whole sentence if it contains any of those.

hi nbroad,

[I am not an expert]

I think it depends on the specifics of your data. I have a similar issue with mine. For example, some of my texts include repeated # characters or “l@@k”, designed to catch a viewer’s eye. I decided to delete this kind of thing because it isn’t really language, and BERT probably didn’t see it often during pre-training. All it tells me about the text is that the writer was trying to catch viewers’ attention; it doesn’t tell me (or BERT) much else.

It’s a bit tricky, because some special characters do carry meaning in certain contexts: “p/x” for “part exchange”, for example, might be frequent enough to mean something to BERT.

As a compromise, when I cleaned my data, I deleted all occurrences of | # * ] [ \ . Then I kept single occurrences of ! ( ) - ? , £ / +, but deleted any repeated occurrences.
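
Roughly speaking, that rule set amounts to something like this in Python (just a sketch of the idea, not the exact script I ran):

```python
import re

def clean_special_chars(text):
    # Delete every occurrence of | # * ] [ \ outright.
    text = re.sub(r"[|#*\]\[\\]", "", text)
    # Keep a lone ! ( ) - ? , £ / +, but drop runs of two or more of the same character.
    text = re.sub(r"([!()\-?,£/+])\1+", "", text)
    return text

print(clean_special_chars("L@@K!!! ###### part exchange (p/x) available - £2,000"))
```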

In my case, it wasn’t necessary to remove the whole sentence when it contained “######”, because what remained after stripping the offending “######” was still a meaningful sentence.

I haven’t yet decided what to do about numbers. It might be that BERT is able to make some sense of values such as 1984 or £2000, even if it has to break them up into individual word pieces. One thing I have recently realised is that my data include numbers with commas in them (e.g. £2,000), and I’m pretty sure those would get a better representation if I removed the commas (i.e. cleaned £2,000 to £2000).
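
One easy way to check is just to run both versions through the tokenizer (assuming the transformers library is installed; the exact splits may vary by checkpoint):

```python
import re
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def strip_commas_in_numbers(text):
    # "£2,000" -> "£2000"; ordinary commas between words are left alone.
    return re.sub(r"(?<=\d),(?=\d)", "", text)

for text in ["£2,000", strip_commas_in_numbers("£2,000")]:
    print(text, "->", tokenizer.tokenize(text))
```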

I don’t think it would be right to remove numbers altogether, but I’m starting to wonder if it would be useful to replace numbers with descriptors such as “a few / lots / hundreds / thousands / millions / billions / recent date / historical date”. In some cases, it might be necessary to extract the actual numbers and include them as separate features.
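
As a toy illustration of that idea (the buckets and cut-offs below are made up, purely to show the shape of the transformation):

```python
import re

def describe_number(match):
    value = float(match.group(0).replace(",", ""))
    # Hypothetical buckets -- the thresholds are illustrative, not tuned on any data.
    if value.is_integer() and 1900 <= value <= 2030:
        return "recent date"
    if value.is_integer() and 1000 <= value < 1900:
        return "historical date"
    if value >= 1_000_000_000:
        return "billions"
    if value >= 1_000_000:
        return "millions"
    if value >= 1_000:
        return "thousands"
    if value >= 100:
        return "hundreds"
    if value >= 10:
        return "lots"
    return "a few"

def replace_numbers(text):
    return re.sub(r"\d[\d,]*(?:\.\d+)?", describe_number, text)

print(replace_numbers("Built in 1984, it sold for £2,500,000 to 3 buyers."))
# -> "Built in recent date, it sold for £millions to a few buyers."
```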

So far as I know, the data that BERT was trained on wasn’t purged of special characters. I think it is likely that BERT will do a good job on data that is similar in style to the data it was trained on (books and Wikipedia articles).

As usual, if in doubt: try it out.


Hi @rgwatwormhill,

Thank you for the response. Even though you preface it by saying you’re not an expert, I appreciate it nonetheless.

I asked a question on the forum earlier (it went unanswered) about replacing special characters or bizarre strings with understandable ones. For instance, I figured a hyperlink that goes https://www… would not make much sense to BERT, but the word “hyperlink” might. This is similar to what you were saying: replacing numbers with words might make a positive impact.
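
For the hyperlink case, what I have in mind is just a regex substitution, something like this rough sketch (the URL pattern is deliberately crude):

```python
import re

def replace_links(text):
    # Very rough URL pattern; a real cleaner would probably want something stricter.
    return re.sub(r"https?://\S+|www\.\S+", "hyperlink", text)

print(replace_links("See https://www.example.com/page?id=3 for details."))
# -> "See hyperlink for details."
```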

I think the only way to find out will be to try it both ways and see what happens. I’m sure there are a few people out there who have conclusive answers to this, but I’m not sure they will respond to this thread.

I recently stumbled on this video that came out almost a year ago. It covers WordPiece embeddings, but it also goes over what types of tokens appear in the vocabulary. See here: https://youtu.be/zJW57aCBCTk?t=906

It confirms that there are characters from non-English languages, symbols, numbers, and other special characters in the vocabulary of the base version of BERT.
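
You can also poke at the vocabulary directly to see this for yourself (a quick sketch; the exact counts depend on which checkpoint you load):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
vocab = tokenizer.get_vocab()

digit_tokens = [tok for tok in vocab if tok.isdigit()]
symbol_tokens = [tok for tok in vocab if len(tok) == 1 and not tok.isalnum()]

print(len(vocab), "tokens in total")
print(len(digit_tokens), "purely numeric tokens, e.g.", digit_tokens[:5])
print("some single-character symbols:", symbol_tokens[:10])
```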

I was thinking in terms of training my own model with MLM: how much should I remove? I’ll probably just have to try and see.
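
If I do try it, the comparison itself should be straightforward: fine-tune the same checkpoint with the masked-language-modelling objective on each cleaned variant of the corpus and compare the losses. A rough sketch of the setup (the file name is a placeholder and the arguments are only illustrative):

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# "cleaned_corpus.txt" is a placeholder for whichever cleaning variant is being tested.
dataset = load_dataset("text", data_files={"train": "cleaned_corpus.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

# The collator handles the random masking for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-cleaning-test", num_train_epochs=1),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```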
