Pretrain RoBERTa for Kannada

RoBERTa for Kannada

Currently, there are only two models available for hate speech detection in the Hugging Face model hub. By pre-training a RoBERTa model, we hope to make language models more accessible for Kannada, one of the oldest Indic languages.

2. Language

Kannada

3. Model

A randomly initialized RoBERTa model.

4. Datasets

Here are some datasets containing Kannada sentences (preprocessing required):

  1. Automate Text-based Workflows at Scale
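The preprocessing needed is not spelled out in this proposal. As a minimal sketch, assuming the goal is simply to keep lines that are mostly written in Kannada script (Unicode block U+0C80 to U+0CFF), normalize whitespace, and drop exact duplicates, something like the following could be a starting point. The function names and the 0.5 threshold are hypothetical choices for illustration, not part of any dataset's tooling.

```python
import re
import unicodedata

# Kannada Unicode block: U+0C80 to U+0CFF.
KANNADA_RE = re.compile(r"[\u0C80-\u0CFF]")


def kannada_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Kannada block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if KANNADA_RE.match(c)) / len(chars)


def clean_lines(lines, min_ratio=0.5):
    """Yield normalized, de-duplicated lines that are mostly Kannada.

    min_ratio is an assumed threshold; tune it for the corpus at hand.
    """
    seen = set()
    for line in lines:
        line = unicodedata.normalize("NFC", line)  # canonical Unicode form
        line = " ".join(line.split())              # collapse runs of whitespace
        if not line or line in seen:
            continue
        if kannada_ratio(line) >= min_ratio:
            seen.add(line)
            yield line
```

Because `clean_lines` is a generator, it can be run over a large text file line by line without loading the whole corpus into memory.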

5. Training scripts

A masked language modeling script for Flax is available here. The same script can probably be used.
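For readers unfamiliar with masked language modeling, here is a rough sketch of the token masking step in plain Python. It follows the commonly used BERT/RoBERTa recipe (select about 15% of tokens; of those, replace 80% with the mask token, 10% with a random token, and leave 10% unchanged). It is an illustration under those assumptions, not the code of the Flax script itself, and the function name, token ids, and vocabulary size below are hypothetical.

```python
import random


def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) for one token sequence.

    labels hold the original id at selected positions and -100 elsewhere;
    -100 is a conventional "ignore this position" value for the loss.
    """
    if rng is None:
        rng = random.Random()
    masked = list(input_ids)
    labels = [-100] * len(input_ids)
    for i, tok in enumerate(input_ids):
        # Never mask special tokens; select other tokens with mlm_probability.
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = mask_token_id              # 80%: mask token
        elif roll < 0.9:
            masked[i] = rng.randrange(vocab_size)  # 10%: random token id
        # remaining 10%: keep the original token
    return masked, labels


# Hypothetical ids: 0 and 2 stand in for start/end tokens, 4 for the mask token.
ids = [0, 11, 12, 13, 14, 15, 2]
masked, labels = mask_tokens(ids, mask_token_id=4, vocab_size=100,
                             special_ids=(0, 2), rng=random.Random(0))
```

Masking is applied afresh each time a batch is built, so the model sees different masked positions for the same sentence across epochs.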

6. Desired project outcome

To fine-tune the pre-trained model for sentiment analysis of Kannada sentences.

More datasets containing Kannada sentences:

  1. Kannada News Dataset | Kaggle
  2. Kannada Covid-19 Sentiment Analysis Dataset | Kaggle

Some more datasets:

  1. Kannada Wikipedia Articles | Kaggle
  2. Samanantar | Kaggle

Let’s define it!