Strange SHAP analysis for text classification with BERT

I'm using a BERT model and noticed that, in the SHAP explanations, the model doesn't seem to use the whole word in some cases. I'm wondering if this is because of the tokeniser or because of the phrases in my dataset. There are 12 classes, 9 of which each account for less than 10% of the data. There are about 1200 phrases in the whole dataset.
I think the reason is either that I don't have enough data or that I'm doing something wrong. Maybe I'm using the tokeniser incorrectly. I've already removed the rare classes and am now working with only 3 classes; performance is better now, but the phenomenon still persists.


Tokenizers do not necessarily split text into whole words. Often they split larger words into smaller chunks. For example:

The word "banana" may be split into "ban", "an", and "a". This is efficient because the tokens "a" and "an" occur frequently on their own. Words are split into puzzle pieces so that many words can be constructed from a small set of pieces.

When this happens we usually give "an" and "a" the label -100, which tells the loss function to ignore them. However, "ban" is still used normally.
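If you want to see this concretely, here is a minimal sketch using the standard `AutoTokenizer` API; the exact split depends on the checkpoint's vocabulary, and the -100 labelling is just the usual convention for token-level labels (it matches the default `ignore_index` of `CrossEntropyLoss`):

```python
from transformers import AutoTokenizer

# Minimal sketch: the exact subword pieces depend on the checkpoint's vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("banana")
print(tokens)  # e.g. ['banana'] or ['ban', '##an', '##a'], depending on the vocab

# For token-level labels, continuation pieces are usually given -100 so that
# CrossEntropyLoss (default ignore_index=-100) skips them in the loss.
word_label = 3                                      # hypothetical class id
labels = [word_label] + [-100] * (len(tokens) - 1)  # label only the first piece
print(labels)
```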

From what I can tell through experiments, the attention seems to reconstruct the association between the tokens even though the last two tokens are ignored by the loss function. And so the model still “sees” the trailing tokens.

This may explain why the model doesn't seem to "use the whole word", even though it does still "see" the whole word.

Thank you swtb for your response.

I want to show you the problem with an image of my work.
The text is in Italian, so I fine-tuned this version of BERT, dbmdz/bert-base-italian-xxl-uncased (dbmdz/bert-base-italian-xxl-uncased · Hugging Face), using AutoModelForSequenceClassification.
In the example, the word “gessato” has two highlights, one for the piece “ges” and the other for “sato”. I don't understand why this happens; at this point I rule out that the tokenizer is to blame.
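Something along these lines reproduces the kind of plot in the image (a rough sketch, not my exact code; the checkpoint path and the example sentence are placeholders):

```python
import shap
import transformers

# Sketch of producing a SHAP text explanation for a fine-tuned classifier.
# The model directory and example sentence below are placeholders.
model_dir = "path/to/finetuned-italian-bert"  # hypothetical local checkpoint
tokenizer = transformers.AutoTokenizer.from_pretrained(model_dir)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_dir)
pipe = transformers.pipeline("text-classification", model=model,
                             tokenizer=tokenizer, return_all_scores=True)

explainer = shap.Explainer(pipe)  # SHAP builds a text masker from the tokenizer
shap_values = explainer(["esempio di frase con la parola gessato"])  # made-up sentence
shap.plots.text(shap_values)      # one attribution per token, hence "ges" / "sato"
```

Because SHAP attributes per token, a word that the tokenizer splits into two pieces will always get two separate highlights.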
I think the problem is with the model. I work with little data and my model overfits after a few epochs. The results for many classes are not good in either validation or test (for example, I reach 85% recall or precision for the most important ones). Maybe this is the problem: the BERT model may not have found the best features for my task, or the last layers of my model may not have learned to extract good features.

“ges” and “sato” are likely separate tokens. I think you may be chasing a red herring. What do your loss curves (train and validation) look like? What other metrics do you log? What parameters have you experimented with? Have you experimented with other models too, such as XLM-RoBERTa?

XLM-RoBERTa (huggingface.co)

This is the situation; it smells of overfitting.

I used a focal loss with gamma = 5, with the class weight computed as (1 - p)**gamma where p is the class proportion; lr = 0.00004, weight_decay = 8.471051501108374e-08, dropout for the classification layer set to 0.1415, and early stopping with patience = 5.
I used Adam as the optimizer with a cosine scheduler; the warmup steps are 10% of (|training_dataset| / batch_size) and num_training_steps = len(df_train_data) * epochs.
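In code, that setup is roughly the following sketch (the class proportions, dataset size, batch size and epoch count are made-up placeholders, and the model is a stand-in just to make it runnable):

```python
import torch
import torch.nn.functional as F
from transformers import get_cosine_schedule_with_warmup

# Sketch of the loss/optimizer setup described above. The class proportions,
# dataset size, batch size and epoch count are made-up placeholders.
gamma = 5.0
class_proportions = torch.tensor([0.50, 0.30, 0.20])   # hypothetical p per class
class_weights = (1.0 - class_proportions) ** gamma     # weight = (1 - p) ** gamma

def focal_loss(logits, targets, weights, gamma=5.0):
    """Class-weighted multi-class focal loss: (1 - p_t)^gamma * weighted CE."""
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets, weight=weights, reduction="none")
    p_t = log_probs.exp().gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_t) ** gamma * ce).mean()

model = torch.nn.Linear(8, 3)  # stand-in for the BERT classification model
optimizer = torch.optim.Adam(model.parameters(), lr=0.00004,
                             weight_decay=8.471051501108374e-08)

n_train, batch_size, epochs = 1200, 16, 10              # placeholder sizes
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * n_train / batch_size),   # 10% of |train| / batch_size
    num_training_steps=n_train * epochs,                # len(df_train_data) * epochs
)

# Tiny synthetic batch just to show the loss call.
logits, targets = torch.randn(4, 3), torch.tensor([0, 2, 1, 0])
print(focal_loss(logits, targets, class_weights, gamma))
```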

I also posted the classification reports for the test set.

Your weight decay seems rather irrelevant at ~1e-8. I often see this parameter default to 0.01; that may help with the overfitting.

Have you experimented without the weighted loss? Maybe you are accelerating the training of under-represented classes too much, or penalising over-represented classes too much.

Also, try using a larger model and a smaller model to see how that affects scores. I believe XLM-RoBERTa does support Italian.

I find the cosine scheduler and warmup are not the most impactful settings; you may consider removing them while you optimise other parameters. Fewer moving parts is always helpful. Given that you are using Adam, you will find it naturally adapts its learning rate without needing a scheduler (see the sketch below).
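Putting those suggestions together, a simplified baseline could look something like this sketch (the checkpoint name, label count and learning rate are carried over from your posts, not tuned values):

```python
import torch
from transformers import AutoModelForSequenceClassification

# Simplified baseline: conventional 0.01 weight decay, unweighted cross-entropy,
# and no LR scheduler while tuning the rest. Model name and num_labels are
# placeholders based on the thread.
model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-italian-xxl-uncased", num_labels=3
)
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5, weight_decay=0.01)
loss_fn = torch.nn.CrossEntropyLoss()   # no class weights, no focal term
# ...training loop unchanged, just without the cosine schedule and warmup.
```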

Lastly, it is also possible that your data needs a clean-out or some kind of augmentation.

You are indeed right about the weight decay; that could be the first change to make. As for the other models, I have not tried them because I'm following the directives I was given, but since I have little data it might be worth trying a smaller BERT model and other models that support Italian. I will also try simplifying the scheduler to see what comes out. In any case, by reducing the number of classes to 3 (or a few more), performance increases: some classes reach a 90% F1-score on the validation set, but more still needs to be done. So yes, I want to try smaller models, since I have little data and smaller models usually work better in these cases.

PS: by cleaning data do you mean stop-word and punctuation removal and stemming? I always believed that this preprocessing was not necessary for transformer models.

Yes, if your dataset is small you will struggle to satisfy a larger model.

I strongly recommend using RoBERTa, as it uses the SentencePiece tokeniser, whereas BERT, I think, uses WordPiece. SentencePiece has notable advantages.
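You can compare the two tokenisers side by side like this (the splits in the comments are only guesses; the actual pieces depend on each vocabulary):

```python
from transformers import AutoTokenizer

# Compare how the WordPiece and SentencePiece tokenisers split the same word.
bert_tok = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-uncased")
xlmr_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

word = "gessato"
print("BERT (WordPiece):     ", bert_tok.tokenize(word))  # e.g. ['ges', '##sato']
print("XLM-R (SentencePiece):", xlmr_tok.tokenize(word))  # e.g. ['▁gess', 'ato']
```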

OK, I will try XLM-RoBERTa. I think there should be a RoBERTa for Italian called GilBERTo; maybe I can try both.

Just one last question: by cleaning data you mean stop-word and punctuation removal and stemming, right? I always believed that this preprocessing was not necessary for transformer models.

XLM-RoBERTa is multilingual anyway, so you can use the official base and large models, though one that is specifically Italian may be better :slight_smile: