Hi,
I want to use the German Sentiment ("German Sentiment Classification with BERT") repository (link: GitHub - oliverguhr/german-sentiment-lib: An easy to use python package for deep learning-based German sentiment classification).
I have a dataset of tweets in German. Do I need to run a tokenizer before applying this model? If so, which one? The example in the repository does not use any tokenizer.
Thank you!
Hi @Al3ksandra,

the repo uses the oliverguhr/german-sentiment-bert model:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from typing import List
import torch
import re

class SentimentModel():
    def __init__(self, model_name: str = "oliverguhr/german-sentiment-bert"):
        if torch.cuda.is_available():
            self.device = 'cuda'
        else:
            self.device = 'cpu'
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model = self.model.to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
```
This is essentially bert-base-german-cased fine-tuned on sentiment datasets, which you can verify by comparing the vocab of the original German BERT model with that of the fine-tuned one:
```bash
$ curl -L https://huggingface.co/oliverguhr/german-sentiment-bert/resolve/main/vocab.txt > vocab_sentiment.txt
$ curl -L https://huggingface.co/bert-base-german-cased/resolve/main/vocab.txt > vocab_german_bert.txt
$ diff vocab_sentiment.txt vocab_german_bert.txt
```

The diff produces no output, so the two vocabularies are identical.
You do not need your own tokenizer or cleanup steps, because the class has them built in. When you call the predict_sentiment method, your input text is cleaned and tokenized first:
```python
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
self.clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
self.clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)
```
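To see what that cleanup does to a tweet, here is a small standalone sketch that applies the same three regexes (the sample tweet, the helper name, and the exact order of steps are my own; the repo's clean_text method does a bit more, e.g. number replacement and lowercasing):

```python
import re

# The same cleanup patterns as in the SentimentModel above
clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)

def clean_text(text: str) -> str:
    # Order matters: strip URLs and @mentions first, then drop every
    # remaining character that is not a letter, umlaut, ß, or space.
    text = clean_http_urls.sub('', text)
    text = clean_at_mentions.sub('', text)
    text = clean_chars.sub('', text)
    return ' '.join(text.split())  # collapse repeated whitespace

tweet = "@bahn Der Zug ist schon wieder verspätet!! https://example.com"
print(clean_text(tweet))  # -> Der Zug ist schon wieder verspätet
```

Note that punctuation, digits, URLs, and mentions are all removed, while German umlauts and ß survive.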
```python
def predict_sentiment(self, texts: List[str], output_probabilities = False) -> List[str]:
    texts = [self.clean_text(text) for text in texts]
    # add_special_tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
    # truncation=True limits the number of tokens to the model's maximum (512)
    encoded = self.tokenizer.batch_encode_plus(texts, padding=True, add_special_tokens=True, truncation=True, return_tensors="pt")
    encoded = encoded.to(self.device)
    with torch.no_grad():
        logits = self.model(**encoded)
    label_ids = torch.argmax(logits[0], axis=1)
    if output_probabilities == False:
        return [self.model.config.id2label[label_id.item()] for label_id in label_ids]
    else:
        predictions = torch.softmax(logits[0], dim=-1).tolist()
        # ... (the repo goes on to return the labels together with these probabilities)
```
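The argmax/softmax step at the end can be illustrated with plain Python (the logit values and the id2label mapping below are made up for illustration; the real mapping comes from model.config.id2label):

```python
import math

# Hypothetical raw scores (logits) for one input text; the real model
# produces one score per sentiment class.
logits = [2.1, -0.5, 0.3]
id2label = {0: "positive", 1: "negative", 2: "neutral"}  # illustrative mapping

# argmax: index of the largest logit -> predicted class id
label_id = max(range(len(logits)), key=lambda i: logits[i])

# softmax: normalize the logits into probabilities that sum to 1
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

print(id2label[label_id])  # -> positive
```

This is exactly what predict_sentiment does, just vectorized with torch over a whole batch.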
I hope this helps!