Strange SHAP analysis for text classification with BERT

I'm using a BERT model and noticed that, in the SHAP explanations, the model doesn't seem to use the whole word in some cases. I'm wondering if this is because of the tokeniser or because of the phrases in my dataset. There are 12 classes, 9 of which each account for less than 10% of the data. There are about 1200 phrases in the whole dataset.
I think the reason is either that I don't have enough data or that I'm doing something wrong. Maybe I'm using the tokeniser incorrectly. I've already removed the rare classes and am now working with only 3 classes; performance is better now, but the phenomenon still persists.


Tokenizers do not necessarily split text into whole words. Often they split larger words into smaller chunks. For example:

The word "banana" may be split into "ban", "an", and "a". This is efficient because the tokens "a" and "an" occur frequently on their own. Words are split into puzzle pieces so that many words can be constructed from a small set of pieces.

When this happens we usually give "an" and "a" the label -100, which tells the loss function to ignore them. However, "ban" is still used normally.
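If you want to see this concretely, here is a minimal sketch using the standard `AutoTokenizer` API; the exact split depends on the checkpoint's vocabulary, and the -100 labelling is just the usual convention for token-level labels (it matches the default `ignore_index` of `CrossEntropyLoss`):

```python
from transformers import AutoTokenizer

# Minimal sketch: the exact subword pieces depend on the checkpoint's vocabulary.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("banana")
print(tokens)  # e.g. ['banana'] or ['ban', '##an', '##a'], depending on the vocab

# For token-level labels, continuation pieces are usually given -100 so that
# CrossEntropyLoss (default ignore_index=-100) skips them in the loss.
word_label = 3                                      # hypothetical class id
labels = [word_label] + [-100] * (len(tokens) - 1)  # label only the first piece
print(labels)
```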

From what I can tell through experiments, the attention seems to reconstruct the association between the tokens even though the last two tokens are ignored by the loss function. And so the model still “sees” the trailing tokens.

This may explain why the model doesn't seem to "use the whole word", even though it does still "see" the whole word.

Thank you swtb for your response.

I want to show you the problem with an image of my work.
The text is in Italian, so I fine-tuned this version of BERT, dbmdz/bert-base-italian-xxl-uncased (dbmdz/bert-base-italian-xxl-uncased · Hugging Face), using AutoModelForSequenceClassification.
In the example, the word “gessato” has two highlights, one for the piece “ges” and the other for “sato”. I don't understand why this happens; at this point I rule out that the tokenizer is to blame.
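Something along these lines reproduces the kind of plot in the image (a rough sketch, not my exact code; the checkpoint path and the example sentence are placeholders):

```python
import shap
import transformers

# Sketch of producing a SHAP text explanation for a fine-tuned classifier.
# The model directory and example sentence below are placeholders.
model_dir = "path/to/finetuned-italian-bert"  # hypothetical local checkpoint
tokenizer = transformers.AutoTokenizer.from_pretrained(model_dir)
model = transformers.AutoModelForSequenceClassification.from_pretrained(model_dir)
pipe = transformers.pipeline("text-classification", model=model,
                             tokenizer=tokenizer, return_all_scores=True)

explainer = shap.Explainer(pipe)  # SHAP builds a text masker from the tokenizer
shap_values = explainer(["esempio di frase con la parola gessato"])  # made-up sentence
shap.plots.text(shap_values)      # one attribution per token, hence "ges" / "sato"
```

Because SHAP attributes per token, a word that the tokenizer splits into two pieces will always get two separate highlights.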
I think the problem is with the model. I work with little data and my model overfits after a few epochs. The results for many classes are not good in either validation or test (for example, I reach 85% recall or precision for the most important ones). Maybe this is the problem: the BERT model may not have found the best features for my task, or the last layers of my model may not have learned to extract good features.

“ges” and “sato” are likely separate tokens. I think you may be chasing a red herring. What do your loss curves (train and validation) look like? What other metrics do you log? What parameters have you experimented with? Have you experimented with other models too, such as XLM-RoBERTa?

XLM-RoBERTa (huggingface.co)

This is the situation; it smells of overfitting.

I used a focal loss with gamma = 5, with the class weight computed as (1 - p)**gamma where p is the class proportion; lr = 0.00004, weight_decay = 8.471051501108374e-08, dropout for the classification layer set to 0.1415, and early stopping with patience = 5.
I used Adam as the optimizer with a cosine scheduler; the warmup steps are 10% of (|training_dataset| / batch_size) and num_training_steps = len(df_train_data) * epochs.
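In code, that setup is roughly the following sketch (the class proportions, dataset size, batch size and epoch count are made-up placeholders, and the model is a stand-in just to make it runnable):

```python
import torch
import torch.nn.functional as F
from transformers import get_cosine_schedule_with_warmup

# Sketch of the loss/optimizer setup described above. The class proportions,
# dataset size, batch size and epoch count are made-up placeholders.
gamma = 5.0
class_proportions = torch.tensor([0.50, 0.30, 0.20])   # hypothetical p per class
class_weights = (1.0 - class_proportions) ** gamma     # weight = (1 - p) ** gamma

def focal_loss(logits, targets, weights, gamma=5.0):
    """Class-weighted multi-class focal loss: (1 - p_t)^gamma * weighted CE."""
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets, weight=weights, reduction="none")
    p_t = log_probs.exp().gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_t) ** gamma * ce).mean()

model = torch.nn.Linear(8, 3)  # stand-in for the BERT classification model
optimizer = torch.optim.Adam(model.parameters(), lr=0.00004,
                             weight_decay=8.471051501108374e-08)

n_train, batch_size, epochs = 1200, 16, 10              # placeholder sizes
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * n_train / batch_size),   # 10% of |train| / batch_size
    num_training_steps=n_train * epochs,                # len(df_train_data) * epochs
)

# Tiny synthetic batch just to show the loss call.
logits, targets = torch.randn(4, 3), torch.tensor([0, 2, 1, 0])
print(focal_loss(logits, targets, class_weights, gamma))
```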

I also posted the classification reports for the test set.

Your weight decay seems rather irrelevant at ~1e-8. I often see this parameter default to 0.01; that may help with the overfitting.

Have you experimented without the weighted loss? Maybe you are accelerating the training of under-represented classes too much, or penalising over-represented classes too much.

Also, try using a larger model and a smaller model to see how that affects scores. I believe XLM-RoBERTa does support Italian.

I find the cosine scheduler and warmup are not the most impactful settings; you may consider removing them while you optimise other parameters. Fewer moving parts is always helpful. Given that you are using Adam, you will find it naturally adapts its learning rate without needing a scheduler (see the sketch below).
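Putting those suggestions together, a simplified baseline could look something like this sketch (the checkpoint name, label count and learning rate are carried over from your posts, not tuned values):

```python
import torch
from transformers import AutoModelForSequenceClassification

# Simplified baseline: conventional 0.01 weight decay, unweighted cross-entropy,
# and no LR scheduler while tuning the rest. Model name and num_labels are
# placeholders based on the thread.
model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-italian-xxl-uncased", num_labels=3
)
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5, weight_decay=0.01)
loss_fn = torch.nn.CrossEntropyLoss()   # no class weights, no focal term
# ...training loop unchanged, just without the cosine schedule and warmup.
```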

Lastly, it is also possible that your data needs a clean-out or some kind of augmentation.

You are indeed right about the weight decay; that could be the first change to make. As for the other models, I have not tried them because I'm following the directives I was given, but since I have little data it might be worth trying a smaller BERT model and other models that support Italian. I will also try simplifying the scheduler to see what comes out. In any case, by reducing the number of classes to 3 (or a few more), performance increases: some classes reach a 90% F1-score on the validation set, but more still needs to be done. So yes, I want to try smaller models, since I have little data and smaller models usually work better in these cases.

PS: by cleaning data do you mean stop-word and punctuation removal and stemming? I always believed that this preprocessing was not necessary for transformer models.

Yes, if your dataset is small you will struggle to satisfy a larger model.

I strongly recommend using RoBERTa, as it uses the SentencePiece tokeniser, whereas BERT, I think, uses WordPiece. SentencePiece has notable advantages.
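You can compare the two tokenisers side by side like this (the splits in the comments are only guesses; the actual pieces depend on each vocabulary):

```python
from transformers import AutoTokenizer

# Compare how the WordPiece and SentencePiece tokenisers split the same word.
bert_tok = AutoTokenizer.from_pretrained("dbmdz/bert-base-italian-xxl-uncased")
xlmr_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

word = "gessato"
print("BERT (WordPiece):     ", bert_tok.tokenize(word))  # e.g. ['ges', '##sato']
print("XLM-R (SentencePiece):", xlmr_tok.tokenize(word))  # e.g. ['▁gess', 'ato']
```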

OK, I will try XLM-RoBERTa. I think there should be a RoBERTa for Italian called GilBERTo; maybe I can try both.

Just one last question: by cleaning data you mean stop-word and punctuation removal and stemming, right? I always believed that this preprocessing was not necessary for transformer models.

XLM-RoBERTa is multilingual anyway, so you can use the official base and large models, though one that is specifically Italian may be better :slight_smile: