Doubts about the tokenization strategy and about explaining models with SHAP

Hi there!

I have a question about the tokenizers used by Transformer models; to keep things simple I will restrict the discussion to BERT.

BERT uses the WordPiece tokenizer, so its features are word pieces plus some whole words. The advantages of this approach are a small vocabulary, robustness to unseen words, and computational efficiency.
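
To illustrate what I mean, here is a minimal sketch of how WordPiece splits rare or long words into pieces (assuming the `transformers` package; "bert-base-uncased" is just an example checkpoint, not necessarily the one we use):

```python
from transformers import AutoTokenizer

# Load BERT's WordPiece tokenizer (example checkpoint).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Words missing from the vocabulary are split into pieces marked with "##".
print(tokenizer.tokenize("embeddings are useful"))
# typically something like: ['em', '##bed', '##ding', '##s', 'are', 'useful']
```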

I’m working on a text-classification project, and an important module of it concerns XAI: I was asked to use SHAP to explain BERT’s classifications.
Unsurprisingly, we found that SHAP highlights word pieces when explaining BERT’s predictions, but this is not acceptable to my boss.
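
For context, our setup looks roughly like this (a minimal sketch; "bert-base-uncased" stands in for our fine-tuned classification checkpoint):

```python
import shap
from transformers import pipeline

# Placeholder for our fine-tuned text-classification model.
classifier = pipeline("text-classification", model="bert-base-uncased", top_k=None)

# When SHAP wraps a transformers pipeline, it builds its Text masker from the
# model's own tokenizer, so the explanation units are word pieces.
explainer = shap.Explainer(classifier)
shap_values = explainer(["Tokenization is unavoidable"])

# The highlighted units are word-piece tokens (e.g. 'token', '##ization', ...),
# not whole words.
print(shap_values[0].data)
```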

He wants SHAP to highlight whole words, not pieces of words, but with the WordPiece tokenizer that is not directly possible, because the algorithm works on split words!

Now he has asked me to train a new tokenizer that produces word-level features for BERT, but I am not convinced by this strategy, because:

  1. as far as I can tell, Hugging Face does not provide a word-level tokenizer that can be built from scratch;
  2. it would result in a huge vocabulary that would have to be built from the domain documents, which are very few (about a hundred), plus a huge generic corpus like the ones used for BERT and RoBERTa, in order to cover as many words as possible and increase resilience to new words;
  3. it would be expensive in terms of time, effort and resources: BERT would have to be retrained, since its knowledge is based on the features produced by the WordPiece tokenizer, and to do things properly I would also have to redo the hyperparameter tuning, which would be very demanding since I do everything myself;
  4. I have not found any papers (in IEEE or Elsevier journals) or anything on the web where someone trains a word-level tokenizer from scratch for BERT and similar models.

I am convinced that the best way forward is a post-processing strategy: retrieve the embeddings of the word pieces of a split word, reconstruct an embedding for the whole word from them, and then work out how SHAP can use this reconstructed representation.
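
To make the idea concrete, here is a rough sketch of the reconstruction step only (not a SHAP integration, which is the open part; the checkpoint name is a placeholder and mean pooling is just one possible way to merge the pieces):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint; in practice this would be our fine-tuned model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Tokenization is unavoidable"
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_size)

# Map each word piece back to the word it came from, skipping special tokens.
pieces_per_word = {}
for pos, word_id in enumerate(enc.word_ids(0)):
    if word_id is not None:
        pieces_per_word.setdefault(word_id, []).append(hidden[pos])

# One embedding per original word, obtained here by mean-pooling its pieces.
word_embeddings = {w: torch.stack(v).mean(dim=0) for w, v in pieces_per_word.items()}
```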

Has anyone trained a word-level tokenizer from scratch?
Has anyone used a post-processing technique to retrieve the correct embedding and representation of a split word for SHAP?