German NLP Repository

I am an MSc student at the University of Siegen with a keen interest in training NLP models specific to the German language. It would be really great to meet people who want to contribute German models or datasets. So far I have trained and shared 3 German models, including question answering models and an NER model for the legal domain.
Feel free to experiment and share your reviews.

More models to come :hugs::hugs:


Hi @Sahajtomar,

I am interested in German/English models that are useful for sentiment analysis and text classification in general (topic detection). I have labeled datasets for both tasks.

Can you recommend any available model in the repository?

I am facing a blocking issue on a multi-label text classification task (described in a different issue item). Would you happen to know how this could be implemented with the current version of Transformers? There seem to be some breaking changes around version 3, which make it difficult to review a working sample :frowning:

Thanx Dirk

Hey,
Sure, there are two options:
1) There are multilingual models such as Universal Sentence Encoder or Sentence-BERT / RoBERTa models, which you can use to get embeddings and then train a simple ML model on top of them.
2) There are also zero-shot classification models trained on the XNLI dataset, which you can use directly. Kindly see this model

Also, I am training a model on the NLI task specifically for German. As soon as I am done I will upload the model. Enjoy :hugs:
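For anyone who wants to try option 1, here is a minimal sketch of the "embeddings plus a simple model" idea. Everything in it is an assumption for illustration: `toy_embed` is a hypothetical stand-in for a real multilingual sentence encoder (it only hashes character trigrams), and the "simple ML model" is a nearest-centroid classifier written in plain Python so that the snippet runs without any downloads.

```python
import zlib
from math import sqrt
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Embedder = Callable[[str], Vector]


def toy_embed(text: str, dim: int = 32) -> Vector:
    """Hypothetical stand-in for a sentence encoder: hashed character trigrams."""
    vec = [0.0] * dim
    padded = f" {text.lower()} "
    for i in range(len(padded) - 2):
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % dim
        vec[bucket] += 1.0
    return vec


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fit_centroids(texts: Sequence[str], labels: Sequence[str],
                  embed: Embedder) -> Dict[str, Vector]:
    """Average the embeddings of each class to get one centroid per label."""
    sums: Dict[str, Vector] = {}
    counts: Dict[str, int] = {}
    for text, label in zip(texts, labels):
        vec = embed(text)
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(text: str, centroids: Dict[str, Vector], embed: Embedder) -> str:
    """Return the label whose centroid is most similar to the text embedding."""
    vec = embed(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


if __name__ == "__main__":
    train_texts = ["Das Essen war sehr gut", "The service was terrible"]
    train_labels = ["positive", "negative"]
    centroids = fit_centroids(train_texts, train_labels, toy_embed)
    print(predict("Das Essen war gut", centroids, toy_embed))
```

In practice you would pass a real encoder as `embed` (any function that maps a string to a vector works) and would probably swap the centroid classifier for something like logistic regression once you have more labeled data.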

Hi,

I am actually looking for a TensorFlow model that supports German and English. I am looking at
bert-base-multilingual-uncased-sentiment

right now, but this seems to have problems when I load it as TFBertForSequenceClassification or TFAutoModelForSequenceClassification (?). I get an exception:

cannot reshape array of size 3840 into shape (768,20)

when I execute:
from transformers import TFAutoModelForSequenceClassification

tf_model = TFAutoModelForSequenceClassification.from_pretrained(
    mydrivePath,
    label2id=label2index,
    id2label=index2label,
    from_pt=True,
)

(where label2id and id2label are just dictionaries that map categories to IDs and back).

Anyway, if you happen to know a good TensorFlow model that can be used for text classification in English and German, I would appreciate the hint :slight_smile:

Thanx Dirk
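A note on the reshape error in the post above: the numbers look consistent with a mismatch between the classification head stored in the checkpoint and the number of labels being requested. This is a reading of the arithmetic only, assuming a hidden size of 768 (standard for BERT-base) and a label2id dictionary with 20 entries; it is not a confirmed diagnosis. The helper name below is made up for illustration.

```python
def labels_in_checkpoint(weight_size: int, hidden_size: int) -> int:
    """Infer how many output labels a flat classifier weight of this size holds."""
    if weight_size % hidden_size != 0:
        raise ValueError("weight size is not a multiple of the hidden size")
    return weight_size // hidden_size


hidden_size = 768        # assumed: BERT-base hidden dimension
stored_size = 3840       # from the error message
requested_labels = 20    # from the target shape (768, 20) in the error message

stored_labels = labels_in_checkpoint(stored_size, hidden_size)
needed_size = hidden_size * requested_labels

print(f"checkpoint head: {stored_labels} labels ({stored_size} weights)")
print(f"requested head: {requested_labels} labels ({needed_size} weights)")
```

If that reading is right, the stored head has 5 outputs, so it cannot be reshaped into a 20-label head; the head would need to be re-initialized for the new label set rather than loaded from the checkpoint.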

Hey, I’m Patrick, a German Research Engineer at Hugging Face. I will be joining the “Wav2Vec2 Fine-tuning week” starting on Monday next week - see: [Open-to-the-community] XLSR-Wav2Vec2 Fine-Tuning Week for Low-Resource Languages - #14 by ayameRushia .

If you want to participate and have any questions regarding fine-tuning a German speech recognition model, feel free to ping me here :hugs:


Hey @patrickvonplaten, I am facing issues with Colab disk space. The uncompressed data is 22 GB for the German language. What are other options to train on large datasets?

Hey Sahajtomar,

One option would be to use Colab Pro, but we are currently trying to organize more GPU & RAM compute for you guys.

Hi @Sahajtomar,

Thanks for sharing your trained German models. I would like to know if it would be possible to add a license file with details to your German zero-shot model?

Thanks.

Hi,
Could you let me know what a license file is?

By license file I mean a file that describes the terms and conditions under which the published model can be used. For example, the BERT model is published under the following license: bert/LICENSE at master · google-research/bert · GitHub

Thanks.