I want to use the German Sentiment repository ("German Sentiment Classification with BERT", GitHub: oliverguhr/german-sentiment-lib - an easy-to-use Python package for deep-learning-based German sentiment classification).
I have a dataset of tweets in German. Do I need to run a tokenizer before applying this model? And if yes, which one?
In the example in the repository, no tokenizer is used.
Hi @Al3ksandra,
the repo uses the oliverguhr/german-sentiment-bert model, which is basically bert-base-german-cased fine-tuned on sentiment datasets. This can be verified by comparing the vocab of the original German BERT model with the fine-tuned one:
```shell
$ curl -L https://huggingface.co/oliverguhr/german-sentiment-bert/resolve/main/vocab.txt > vocab_sentiment.txt
$ curl -L https://huggingface.co/bert-base-german-cased/resolve/main/vocab.txt > vocab_german_bert.txt
$ diff vocab_sentiment.txt vocab_german_bert.txt
```
The diff shows no difference, so both models use the same vocabulary and therefore the same tokenizer.
You do not need your own tokenizer or clean-up steps, because the package has built-in methods. When you call the
predict_sentiment method, your input text is tokenized and cleaned automatically before prediction:
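As a rough sketch, the built-in clean-up resembles the following simplified re-implementation. Note this is an illustration, not the library's verbatim source; the exact regex patterns here are assumptions:

```python
import re

# Patterns approximating the pre-processing that german-sentiment-lib
# applies before tokenization (assumed for illustration):
URL_PATTERN = re.compile(r"https?://\S+")       # strip links
MENTION_PATTERN = re.compile(r"@\S+")           # strip @-mentions
WHITESPACE_PATTERN = re.compile(r"\s+")         # collapse whitespace

def clean_text(text: str) -> str:
    """Remove URLs and @-mentions, then normalize whitespace."""
    text = URL_PATTERN.sub("", text)
    text = MENTION_PATTERN.sub("", text)
    text = WHITESPACE_PATTERN.sub(" ", text)
    return text.strip()

print(clean_text("@user Das ist super! https://example.com"))
# → Das ist super!
```

In practice you just pass your raw tweets to the package, e.g. `SentimentModel().predict_sentiment(["Das ist super!"])` from `germansentiment`, and the clean-up plus BERT tokenization happen inside the call.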
I hope this helps