German Sentiment (tokenizer)?

Hi,

I want to use the German Sentiment ("German Sentiment Classification with BERT") repository (https://github.com/oliverguhr/german-sentiment-lib), an easy-to-use Python package for deep-learning-based German sentiment classification.

I have a dataset of tweets in German. Do I need to use a tokenizer before applying this model? And if yes, which one?

In the repository's example, no tokenizer is used.

Thank you!

Hi @Al3ksandra,

the repo uses the oliverguhr/german-sentiment-bert model, which is basically bert-base-german-cased fine-tuned on sentiment datasets. You can verify this by comparing the vocab of the original German BERT model with that of the fine-tuned one:

$ curl https://huggingface.co/oliverguhr/german-sentiment-bert/resolve/main/vocab.txt > vocab_sentiment.txt
$ curl https://huggingface.co/bert-base-german-cased/resolve/main/vocab.txt > vocab_german_bert.txt

$ diff vocab_sentiment.txt vocab_german_bert.txt

There is no difference: the fine-tuned model uses the same vocabulary, and therefore the same tokenizer, as bert-base-german-cased.
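
If you prefer to check this from Python, a minimal sketch (assuming the transformers library is installed; the example sentence is made up) is to tokenize the same text with both checkpoints and compare the results:

from transformers import AutoTokenizer

# Load the tokenizer that ships with each checkpoint.
sentiment_tok = AutoTokenizer.from_pretrained("oliverguhr/german-sentiment-bert")
base_tok = AutoTokenizer.from_pretrained("bert-base-german-cased")

text = "Das Wetter ist heute wunderbar!"

# Both should produce identical tokens and input ids, since the
# fine-tuned model kept the original vocabulary.
print(sentiment_tok.tokenize(text))
print(base_tok.tokenize(text))
assert sentiment_tok(text)["input_ids"] == base_tok(text)["input_ids"]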

You do not need your own tokenizer or cleanup steps, because the package has built-in methods for both. When you call the predict_sentiment method, your input text is cleaned and tokenized internally before it reaches the model.
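
For illustration, here is a minimal usage sketch following the german-sentiment-lib README (install with pip install germansentiment; the example tweets and the printed output are made up):

from germansentiment import SentimentModel

# Loads oliverguhr/german-sentiment-bert (downloaded on first use).
model = SentimentModel()

# Raw tweets can be passed in directly; cleaning and tokenization
# happen inside predict_sentiment.
tweets = [
    "Was für ein großartiger Tag!",
    "Das war wirklich enttäuschend.",
]

print(model.predict_sentiment(tweets))
# Expected output along the lines of: ['positive', 'negative']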

I hope this helps 🙂


Thanks, Stefan! It helps 😃