Hi,
I want to use the German Sentiment ("German Sentiment Classification with BERT") repository (link: GitHub - oliverguhr/german-sentiment-lib: An easy to use python package for deep learning-based German sentiment classification).
I have a dataset of tweets in German. Do I need to run a tokenizer before applying this model? If so, which one? The example in the repository does not use any tokenizer.
Thank you!
Hi @Al3ksandra,

the repo uses the oliverguhr/german-sentiment-bert model:
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from typing import List
import torch
import re

class SentimentModel():
    def __init__(self, model_name: str = "oliverguhr/german-sentiment-bert"):
        if torch.cuda.is_available():
            self.device = 'cuda'
        else:
            self.device = 'cpu'
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model = self.model.to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
```
This is essentially bert-base-german-cased fine-tuned on sentiment datasets, which you can verify by comparing the vocab of the original German BERT model with that of the fine-tuned one:
```bash
$ curl -L https://huggingface.co/oliverguhr/german-sentiment-bert/resolve/main/vocab.txt > vocab_sentiment.txt
$ curl -L https://huggingface.co/bert-base-german-cased/resolve/main/vocab.txt > vocab_german_bert.txt
$ diff vocab_sentiment.txt vocab_german_bert.txt
```

The diff produces no output, so the two vocabularies are identical.
You do not need your own tokenizer or cleanup steps, because the class has them built in. When you call the predict_sentiment method, your input text is cleaned and tokenized first:
```python
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
self.clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
self.clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)
```
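To see what that cleanup does to a tweet, here is a small standalone sketch that applies the same three regexes (the sample tweet, the helper name, and the exact order of steps are my own; the repo's clean_text method does a bit more, e.g. number replacement and lowercasing):

```python
import re

# The same cleanup patterns as in the SentimentModel above
clean_chars = re.compile(r'[^A-Za-züöäÖÜÄß ]', re.MULTILINE)
clean_http_urls = re.compile(r'https*\S+', re.MULTILINE)
clean_at_mentions = re.compile(r'@\S+', re.MULTILINE)

def clean_text(text: str) -> str:
    # Order matters: strip URLs and @mentions first, then drop every
    # remaining character that is not a letter, umlaut, ß, or space.
    text = clean_http_urls.sub('', text)
    text = clean_at_mentions.sub('', text)
    text = clean_chars.sub('', text)
    return ' '.join(text.split())  # collapse repeated whitespace

tweet = "@bahn Der Zug ist schon wieder verspätet!! https://example.com"
print(clean_text(tweet))  # -> Der Zug ist schon wieder verspätet
```

Note that punctuation, digits, URLs, and mentions are all removed, while German umlauts and ß survive.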
```python
def predict_sentiment(self, texts: List[str], output_probabilities = False) -> List[str]:
    texts = [self.clean_text(text) for text in texts]
    # add_special_tokens takes care of adding [CLS], [SEP], <s>... tokens in the right way for each model.
    # truncation=True limits the number of tokens to the model's maximum (512)
    encoded = self.tokenizer.batch_encode_plus(texts, padding=True, add_special_tokens=True, truncation=True, return_tensors="pt")
    encoded = encoded.to(self.device)
    with torch.no_grad():
        logits = self.model(**encoded)
    label_ids = torch.argmax(logits[0], axis=1)
    if output_probabilities == False:
        return [self.model.config.id2label[label_id.item()] for label_id in label_ids]
    else:
        predictions = torch.softmax(logits[0], dim=-1).tolist()
        # ... (the repo goes on to return the labels together with these probabilities)
```
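The argmax/softmax step at the end can be illustrated with plain Python (the logit values and the id2label mapping below are made up for illustration; the real mapping comes from model.config.id2label):

```python
import math

# Hypothetical raw scores (logits) for one input text; the real model
# produces one score per sentiment class.
logits = [2.1, -0.5, 0.3]
id2label = {0: "positive", 1: "negative", 2: "neutral"}  # illustrative mapping

# argmax: index of the largest logit -> predicted class id
label_id = max(range(len(logits)), key=lambda i: logits[i])

# softmax: normalize the logits into probabilities that sum to 1
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]

print(id2label[label_id])  # -> positive
```

This is exactly what predict_sentiment does, just vectorized with torch over a whole batch.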
I hope this helps!