Sentiment analysis of Sinhala language using deep learning networks

1. Description

The main objective of the project is to test deep learning models for identifying sentiment in Sinhala text. A Facebook dataset is used to train and test the models. The model code is already written; only the training and testing phases remain. Since Sinhala remains a low-resource language for NLP, this project will help improve the current tools and provide insight into the current state of the field.
Another objective is to migrate the current code to JAX using Flax, Haiku, and other libraries. We also aim to test libraries like Trax, which provides basic deep learning models, and the currently popular Transformers.

2. Language

The models are trained on Sinhala-language text.

3. Model

The models that will be tested are:

  • RNN
  • LSTM
  • GRU
  • BiLSTM
  • Baseline models combined with a CNN
  • Stacked LSTM and BiLSTM
  • HAHNN
  • Capsule networks

4. Dataset

The dataset contains 526,732 Sinhala and English Facebook posts extracted from CrowdTangle. It consists of a decade’s worth of content from Facebook pages popular in Sri Lanka.
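
Because the dataset mixes Sinhala and English posts, one preprocessing step that may be useful is separating the two by script. The following is a minimal sketch, not the project's actual pipeline: the function names and the 0.5 threshold are assumptions made for illustration. It relies only on the fact that the Unicode Sinhala block spans U+0D80 to U+0DFF.

```python
# Minimal sketch: flag posts that are mostly written in Sinhala script.
# The function names and the 0.5 threshold are illustrative assumptions,
# not part of the original project code.

SINHALA_START, SINHALA_END = 0x0D80, 0x0DFF  # Unicode Sinhala block


def _in_sinhala_block(ch: str) -> bool:
    return SINHALA_START <= ord(ch) <= SINHALA_END


def sinhala_ratio(text: str) -> float:
    """Share of script characters that fall in the Sinhala block.

    Sinhala vowel signs are combining marks, so they are counted by code
    point range rather than with str.isalpha(). Digits, punctuation,
    whitespace and emoji are ignored.
    """
    sinhala = sum(1 for ch in text if _in_sinhala_block(ch))
    other = sum(1 for ch in text if ch.isalpha() and not _in_sinhala_block(ch))
    total = sinhala + other
    return sinhala / total if total else 0.0


def is_mostly_sinhala(text: str, threshold: float = 0.5) -> bool:
    """True if at least `threshold` of the script characters are Sinhala."""
    return sinhala_ratio(text) >= threshold


# Hypothetical example posts, not taken from the dataset.
posts = ["සිංහල", "good morning", "සිංහල news"]
sinhala_posts = [p for p in posts if is_mostly_sinhala(p)]
```

Code-mixed posts are common on social media, so the threshold would likely need tuning against a manually checked sample.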

5. Training scripts

The following links contain the model scripts

Main models

6. Challenges

Several models still need to be adjusted and tested.

7. Desired project outcome

Performance measures of each model
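
The proposal does not say which performance measures will be reported. As one possible reading, the sketch below computes accuracy, per-class precision, recall and F1, and macro-averaged F1 from gold and predicted labels. The function name `sentiment_report` and the label strings are hypothetical; the real label set depends on how the dataset is annotated.

```python
from collections import Counter


def sentiment_report(y_true, y_pred):
    """Accuracy, per-class precision/recall/F1, and macro-F1.

    Illustrative helper (hypothetical name), written with the standard
    library only so it can be applied to any model's predictions.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[gold] += 1  # gold class gains a false negative

    per_class = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(tp.values()) / len(y_true) if y_true else 0.0
    macro_f1 = (
        sum(m["f1"] for m in per_class.values()) / len(labels) if labels else 0.0
    )
    return {"accuracy": accuracy, "macro_f1": macro_f1, "per_class": per_class}


# Hypothetical labels and predictions for illustration only.
gold = ["positive", "negative", "positive", "negative"]
pred = ["positive", "negative", "negative", "negative"]
report = sentiment_report(gold, pred)
```

Macro-F1 weights every class equally, which can be more informative than accuracy when sentiment classes are imbalanced.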

8. Reads

The following links can be useful to better understand the project and
what has been done previously.

https://sencat.lk/

Hello there! I would like to join this project.

I am an undergraduate highly interested in sentiment analysis. Sinhala is my mother tongue and the language closest to my heart. Therefore, to engage in a sentiment analysis project that could make a valuable contribution to the Sinhala research community would be a dream come true for me.

My current timezone is GMT+5:30. If I get the opportunity, I can contribute by improving the existing models and creating new ones better tailored to the domain. I’m sure this could be a state-of-the-art NLP project and a valuable resource for the Sinhala research community, which currently suffers from a lack of resources.

If the proposal is good enough, could you please accept this project? @Suzana @valhalla @osanseviero @patrickvonplaten

Thanks for the cool proposal @graw !

The project is really cool :slight_smile: Just regarding the models, we don’t really have any of those implemented in Transformers, so this might take some time… Would it be sensible to pretrain a RoBERTa model and fine-tune it afterwards, maybe?

Given that the project lasts only a week, implementing and trying out all those models may be a bit too time-consuming.

Putting you guys down officially, though :slight_smile:

Also do you have links to the dataset? :slight_smile:

Thank you so much, you are a lifesaver. The Transformer part is fine; I will work on it afterwards. The dataset is not technically available to the public because of the new Facebook regulations. I can provide it to my teammates but cannot make it public. However, I can add the paper about the dataset, and if you are interested you can ask the original authors for access.
Dataset paper

Hello!

It's great to see enthusiasm for the Sinhala language. We have a RoBERTa model for Sinhala on Hugging Face, which I trained on the OSCAR dataset (~800MB). I have a proposal to train a GPT2 model for Sinhala with OSCAR and the C4M dataset (which has ~3GB of Sinhala data). Feel free to join: PreTrain GPT2 for Sinhala from scratch. I think this could be a very good downstream task.

Thanks!