Introduction to NLP in Turkish
Turkish in Perspective
The Turkish alphabet has 29 letters, 8 of which are vowels. It is based on the Latin alphabet, with the additional letters “ç”, “ö”, “ü”, “ğ”, “ı”, and “ş”. The basic word order of a Turkish sentence is subject-object-verb. Turkish is an agglutinative language: suffixes are attached to stems to derive new words or to inflect existing ones. Unlike many Germanic languages, Turkish has no grammatical gender; a single pronoun, “o”, refers to people of any gender and to objects. Native Turkish words have no suffixes that mark gender in job titles, and nouns are not preceded by articles. For formal address, the second-person plural pronoun “siz” is used.
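The extra letters matter as soon as text is normalized. As a minimal sketch (the helper names `turkish_lower` and `turkish_upper` are illustrative and not part of any library), the snippet below checks the letter counts given above and shows Turkish-aware case conversion. Python's built-in `str.lower()` and `str.upper()` follow the default Unicode mappings, which do not pair “I” with “ı” or “i” with “İ”.

```python
# Illustrative helpers; the names are hypothetical, not from a library.
TURKISH_ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"
TURKISH_VOWELS = "aeıioöuü"


def turkish_lower(text: str) -> str:
    """Lowercase text using the Turkish pairs I -> ı and İ -> i."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def turkish_upper(text: str) -> str:
    """Uppercase text using the Turkish pairs i -> İ and ı -> I."""
    return text.replace("i", "İ").replace("ı", "I").upper()


print(len(TURKISH_ALPHABET), len(TURKISH_VOWELS))
print(turkish_lower("ISPARTA"), turkish_lower("İSTANBUL"))
print(turkish_upper("izmir"))
```

Whether this kind of normalization is needed depends on the model: a cased model such as the one used later in this post keeps the distinction between upper and lower case, so lowercasing would usually be skipped there.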
Turkish NLP with Hugging Face
Currently, there are 48 models that can make predictions on Turkish text and 41 datasets that include Turkish examples. One of the most widely used models is bert-base-turkish-cased by the Munich Digitization Center, a popular base language model for Turkish NLP tasks.
Loading a Dataset
Let’s dive into a dataset! Loading a dataset is easy enough with the datasets library.
We will load the “turkish_product_reviews” dataset and fine-tune “savasy/bert-base-turkish-sentiment-cased” on it.
# uncomment and install datasets, if not installed
# !pip install datasets
from datasets import load_dataset
dataset = load_dataset('turkish_product_reviews', split="train")
Let’s examine the dataset:
In [1]: dataset
Out[1]: Dataset({
    features: ['sentence', 'sentiment'],
    num_rows: 23516
})
In [2]: dataset[0]["sentence"]
Out[2]: 'beklentimin altında bir ürün kaliteli değil'
In English, this review reads roughly: “a product below my expectations, not good quality”.
This dataset does not come with separate test or validation sets, so we split the training set twice: once to hold out a test set and once more to hold out a validation set.
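from sklearn.model_selection import train_test_split
texts, labels = list(dataset["sentence"]), list(dataset["sentiment"])  # review texts and their sentiment labels
train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=.2)  # hold out the test set
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=.2)  # hold out the validation set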
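To make the effect of splitting twice concrete, here is a library-free sketch of the same idea. The function `split_twice` and its parameters are hypothetical names used only for illustration, and the exact set sizes can differ by a row or so from what scikit-learn produces because of rounding.

```python
import random


def split_twice(items, test_size=0.2, val_size=0.2, seed=0):
    """Shuffle once, then carve off a test set and a validation set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # First split: hold out a fraction of everything as the test set.
    n_test = int(len(shuffled) * test_size)
    test, rest = shuffled[:n_test], shuffled[n_test:]

    # Second split: hold out a fraction of what remains as the validation set.
    n_val = int(len(rest) * val_size)
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


# Row indices stand in for the 23516 reviews in the dataset.
train, val, test = split_twice(range(23516))
print(len(train), len(val), len(test))
```

Because the second 20% is taken from the remaining 80%, the overall proportions come out at roughly 64% training, 16% validation, and 20% test.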
Using a Pre-Trained Model
The model below is a Turkish BERT that has already been fine-tuned for sentiment analysis; we will fine-tune it further on the dataset above.
You can find more details on its model page.
from transformers import AutoTokenizer, AutoModel
model_name = 'savasy/bert-base-turkish-sentiment-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
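Before any further fine-tuning, it can be useful to see what the checkpoint predicts out of the box. The following is a rough sketch rather than the notebook's code: `predict_sentiment` is a hypothetical helper, and it assumes the checkpoint ships a sequence-classification head, which is why it loads `AutoModelForSequenceClassification` (`AutoModel` alone returns the encoder without a classification head). PyTorch is assumed as the backend.

```python
def predict_sentiment(texts, model_name="savasy/bert-base-turkish-sentiment-cased"):
    """Return one predicted label per input text (a sketch, not the notebook's code)."""
    # Imported here so that defining the function needs neither library installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # Assumption: the checkpoint includes a sequence-classification head.
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    # Pad and truncate so the batch forms one rectangular tensor.
    encoded = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**encoded).logits
    predicted_ids = logits.argmax(dim=-1).tolist()

    # Label names are read from the checkpoint's own config.
    return [model.config.id2label[i] for i in predicted_ids]


# Example call (downloads the model on first use):
# predict_sentiment(["beklentimin altında bir ürün kaliteli değil"])
```

Reading the label names from `model.config.id2label` avoids hard-coding which index means positive or negative.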
Check out the example notebook for a full application of the process above.
To-do:
- Add Turkish translation
- Add comments in Turkish to the notebook
- Add a native TensorFlow implementation