Fine-tuned BERT model: how to deal with abbreviations and English as a non-first language?

Hello,

I am a software developer dabbling in ML without much experience. I am using the Hugging Face library to fine-tune a BERT model on some industry-related data to provide sentiment analysis for groups of text. I work for a healthcare company, and we are trying to get a sentiment analysis of shift notes from carers, but there are a few problems. I understand it's hard to grasp where I'm coming from without seeing the training dataset, but on real-world data the analysis is correct around 87% of the time and it is good at picking up context. The notes it gets wrong usually contain one of the problems below, so I'm wondering if there is a way to get around them or make up for these errors?

  1. Abbreviations are used a lot for things like locations, businesses, and general words/phrases. Do I need to add the abbreviations to the dataset? For example, a note might say “OT today”, meaning they went to the Occupational Therapist today, which should be analysed as positive. Would I need to train on the abbreviation as well as the expanded form?
    Also, when it is an abbreviation of something I wouldn't know (which may be specific to their company/location/industry), how could I help the model? For example, the note “Friendlies for a LMW fasting test.” came back negative with 0.99 confidence, and there is nothing in it that should read as negative based on the training data.

  2. A lot of workers do not have English as their first language, so there are misspelled words and wrong word choices (correctly spelt words used in the wrong context), e.g. “I didnt no what that is” (where “no” should be “know”). Does this have much of an impact on the model's performance?

  3. Do numbers play a role in the analysis? There are notes like “Physio at 10” that come back negative, even though there is nothing in the dataset that should make them negative.

  4. Does punctuation play a big role in determining context as well? I get notes like this one, which came back negative with 0.85 confidence (names changed for privacy): “8:30 TL Kieran start shift Ben came and cut trees shapes with jigsaw Planing with John 11:00 TL Kieran finished shift”. The person didn't input any punctuation, and the context is hard to follow.

Happy for any sort of advice or directions to look in, as I am very new to the area and keen to explore more. :)

Like you, I am just dabbling, so I could be wrong here! Perhaps it's not just the data. Isn't a fine-tuned base model's accuracy around 90% to begin with?

A few things (rough code sketches for each point are below the list):

  1. Abbreviations - I generally preprocess my data to replace uncommon abbreviations before training and prediction. It's easier than training the model to recognise them, especially where my datasets are small (see the first sketch below).

  2. Misspelled words may not cause problems if your text is long enough that there is enough other context. You can also consider preprocessing with a spelling/grammar correction step - there's an overview of grammatical error correction work here: https://github.com/sebastianruder/NLP-progress/blob/master/english/grammatical_error_correction.md (second sketch below).

  3. Not sure about numbers - I would have assumed that BERT's pre-training data allows it to recognise that 10 is similar to ten. The third sketch below shows how to inspect what the tokenizer actually feeds the model.

  4. Again, my understanding is that BERT is pre-trained on punctuated text - not sure how it deals with missing or incorrect punctuation, though. The same tokenizer check applies here.

  5. You can consider taking all of the misclassified inputs and creating another training set to fine-tune your model further (last sketch below).
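
For point 1, here is a minimal sketch of the kind of abbreviation preprocessing I mean - a plain dictionary substitution run over the text before it ever reaches the tokenizer. The glossary entries are made-up examples; you'd build your own from domain knowledge, and you'd apply the same function to both training data and inference inputs so they stay consistent:

```python
import re

# Hypothetical glossary - build this from your own domain knowledge.
ABBREVIATIONS = {
    "OT": "occupational therapist",
    "TL": "team leader",
    "physio": "physiotherapist",
}

# One pattern matching any abbreviation on word boundaries, so "OT"
# doesn't also rewrite the "ot" inside other words.
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in ABBREVIATIONS) + r")\b"
)

def expand_abbreviations(text: str) -> str:
    """Replace known abbreviations with their expanded forms."""
    return _pattern.sub(lambda m: ABBREVIATIONS[m.group(1)], text)

print(expand_abbreviations("OT today, planing with the TL"))
# -> "occupational therapist today, planing with the team leader"
```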
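For point 2, a simple non-word spell-check pass is easy to bolt on - this sketch uses the pyspellchecker package, which is one option among many. Note its limitation: it only catches words that aren't in the dictionary, so a real-word error like “no” for “know” passes straight through; fixing those needs the context-aware grammatical error correction tools from the link above:

```python
from spellchecker import SpellChecker  # pip install pyspellchecker

spell = SpellChecker()

def correct_spelling(text: str) -> str:
    """Replace dictionary-unknown words with their most likely correction."""
    words = text.split()
    unknown = spell.unknown(words)  # returns the misspelled words, lowercased
    fixed = [
        (spell.correction(w) or w) if w.lower() in unknown else w
        for w in words
    ]
    return " ".join(fixed)

# "didnt" gets corrected; "no" is a valid word, so it is left alone.
print(correct_spelling("I didnt no what that is"))
```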
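For points 3 and 4, one useful habit is to print what the tokenizer produces for a troublesome note - numbers and punctuation survive as tokens, and “10” and “ten” are different tokens, so any similarity between them comes from pre-training rather than the input itself. A sketch, assuming a stock `bert-base-uncased` tokenizer (swap in your own checkpoint name):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # your model here

# Inspect exactly what the model receives for each problem note.
for text in [
    "Physio at 10",
    "Physio at ten",
    "8:30 TL Kieran start shift Ben came and cut trees shapes with jigsaw",
]:
    print(text, "->", tok.tokenize(text))
```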
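And for point 5, a rough sketch of that loop: run the current model over your labelled data, keep what it gets wrong, and do another gentle fine-tuning pass. The checkpoint name and example data are placeholders; in practice you'd also mix in some correctly-classified notes so the model doesn't drift on what it already does well:

```python
import torch
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "your-fine-tuned-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

labelled = [("OT today", 1), ("Physio at 10", 1)]  # placeholder (text, label) pairs

def predict(text: str) -> int:
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return model(**inputs).logits.argmax(dim=-1).item()

# Keep only the examples the current model misclassifies.
hard = [(t, y) for t, y in labelled if predict(t) != y]

if hard:
    ds = Dataset.from_dict({"text": [t for t, _ in hard],
                            "label": [y for _, y in hard]})
    ds = ds.map(lambda b: tokenizer(b["text"], truncation=True,
                                    padding="max_length", max_length=128),
                batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="hard-examples-round",
                               num_train_epochs=1,
                               learning_rate=1e-5),  # low LR to limit forgetting
        train_dataset=ds,
    )
    trainer.train()
```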