Indonesian NLP - Introductions

Hello there! This is the introduction thread for Indonesian NLP enthusiasts.

Welcome, and please feel free to introduce yourself with any of the following:

  • Your name, Github, Hugging Face, and/or Twitter handle
  • Your interest in Indonesian NLP
  • Some projects you are working on or interested in starting
  • Any other languages that you speak, any personal interests, or anything else

Hi there Mas @cahya, thanks for initiating this thread!

I’m Wilson Wongso (@w11wo on GitHub and HF), a second-year Computer Science undergraduate from Jakarta, Indonesia. I’m still relatively new to NLP/Hugging Face and am still learning, but I’m having fun so far!

I recently trained some small language models with GPT-2 and RoBERTa as a side project during semester break, and I’m interested in creating language models for regional languages of Indonesia like Javanese, Sundanese, Medanese, etc.

Looking forward to connecting with the community here!


Hi, my name is Cahya. I work as a system and software engineer in Vienna, Austria. My interest in ML/NLP started in early 2017 with simple text classification in TensorFlow.

Currently I like to experiment with conversational AI, open-domain question answering, and text summarization. I built some Indonesian language models, which are hosted here, and helped add some existing Indonesian NLP datasets to the collection of Hugging Face datasets.

I hope we can connect and work together on interesting Indonesian NLP projects. One project I would like to try is creating an mBART model covering a collection of the existing languages in Indonesia (at least the 15 most used ones), such as Javanese, Sundanese, or Minangkabau. This could later be used for machine translation among these languages, or for other seq2seq tasks.

My handles:


Hello everyone, my name is Warto, from IAIN Purwokerto, Indonesia. My interests are NLP and text mining. I am a doctoral student at Dian Nuswantoro University Semarang, and my research topic is information extraction.
I just finished annotating Indonesian news on the COVID-19 topic.


Hi, my name is Akmal. My Hugging Face and GitHub username is Wikidepia.

I have zero background in NLP/machine learning :sweat_smile:
Currently I am interested in creating Indonesian transformer models like T5 and GPT-2, thanks to TFRC :smiley: I am also translating English datasets like PAWS.

Nice to meet you all!


@cahya thanks for sharing your models. Which variant do you reckon would be the best fit (size, inference speed, classification accuracy) if I were to fine-tune it to classify address strings in Indonesian? I’m currently experimenting with cahya/bert-base-indonesian-522M, solving an NER problem.


Hi @yptheangel,
Glad that you want to use my models. I would suggest cahya/bert-base-indonesian-1.5G for classification accuracy, since it was trained with more data. If you want a smaller model with faster inference, I would suggest cahya/distilbert-base-indonesian, which used cahya/bert-base-indonesian-1.5G as the teacher.
I have also fine-tuned this BERT model for NER (cahya/bert-base-indonesian-NER · Hugging Face), using the dataset id_nergrit_corpus · Datasets at Hugging Face. However, I still need to write a model card/documentation for it.
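For anyone curious how the distilled model above relates to its teacher: knowledge distillation trains the student to match the teacher’s temperature-softened output distribution. A minimal, framework-free sketch of the soft-target loss (plain Python with made-up logits; not the actual training code):

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: higher T softens the distribution.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Cross-entropy between the teacher's softened distribution and the
    # student's; the T**2 factor keeps gradient magnitudes comparable.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q)) * temperature ** 2

# Hypothetical per-class logits for one training example.
teacher = [2.0, 0.5, -1.0]
student = [1.8, 0.4, -0.9]
loss = distillation_loss(teacher, student)
```

In the real DistilBERT recipe this soft-target term is combined with the usual masked-language-modeling loss and a hidden-state cosine loss; the sketch shows only the distillation term itself.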


Hi mas Akmal, nice to see you here too. It’s great that you also built several Indonesian models, and I really appreciate that you created/translated several datasets for Indonesian NLP. Seeing your models and the datasets you created, I am not sure you really have zero background in NLP/machine learning :grin:
Btw, how long do you still have access to TFRC?


I recently extended my TFRC trial, so I still have around 50 days of access.


That is great; I still haven’t used my TFRC access. Maybe we could collaborate to build something new with it :slight_smile:


That’s awesome!


Hi, my name is Reza, and I’m new to NLP, especially to using Hugging Face.
I have a question: is it okay to train a language model (like BERT) on text with many typos (Twitter-like sentences)?
We want to build an LM that can be used for many tasks, but we need inference to be fast enough (<500 ms).


Maybe you could try the ByT5 model; the authors write that it is robust to noise such as Twitter text.
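For context on why ByT5 is robust to typos: it operates on raw UTF-8 bytes rather than a subword vocabulary, so a misspelling perturbs only a few byte tokens instead of producing unseen subwords. A minimal sketch of its byte-to-id mapping (ids 0–2 are reserved for the pad/eos/unk specials, so each byte maps to its value + 3; a simplification of the real tokenizer):

```python
def byt5_token_ids(text: str) -> list:
    # ByT5 reserves ids 0..2 for <pad>, </s>, <unk>, so each UTF-8
    # byte maps to (byte value + 3); no vocabulary lookup is needed.
    return [b + 3 for b in text.encode("utf-8")]

clean = byt5_token_ids("selamat")   # "congratulations" in Indonesian
typo  = byt5_token_ids("selamaat")  # one inserted character
# The typo inserts just one extra byte token; the rest are unchanged,
# whereas a subword tokenizer might split the typo into very different pieces.
```

This is why noisy Twitter-style text degrades ByT5 less than subword models, at the cost of longer input sequences.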


Hi everyone, my name is Alvin (GitHub: alvinwatner, Twitter: watneralvin). I have been learning NLP for only about the past 8 months, right after finishing my undergraduate studies.

My previous work, for my bachelor’s degree, was solving the protein folding problem using deep reinforcement learning, which is linked to NLP in the sense of sequential symbols, though it poses a different objective function. I am currently conducting research on leveraging various types of keyphrases in question generation; the challenges are quite similar to summarization, but the task is less popular. In particular, I’m interested in any reading comprehension task and aim to build a robust machine that performs well by understanding language phenomena such as textual entailment and coreference resolution, not by taking spurious statistical shortcuts.


Hello friends, my name is Rapha (GitHub: , Twitter: ). I’m working on an online translator for the Tetun language (the main language spoken in Timor-Leste). It currently supports Tetun-English only, but I’d like to add Tetun-Indonesian, since this is a frequent request from the app’s users.

Like Indonesian, Tetun is an Austronesian language, and so I’m very interested in Indonesian NLP because it’s a “sibling” language to Tetun with a lot more resources.

I speak Indonesian and would love to get involved in some pure-Indonesian NLP projects, time permitting! @cahya I really like your idea of creating an mBART model for the languages of Indonesia; did you get a chance to try it?


Hi everyone, just to let you know that PyCon ID will be on 4-5 Dec this year, and the call for proposals is now open at PyCon Indonesia 2021, Online.

It would be great to have a talk on Indonesian NLP! Those interested, please submit a talk proposal.


Hi Raphael, sorry for the late response. Yes, it would be nice if we could have Tetun-Indonesian; the challenge here is its parallel corpus.

Good that you like the idea of creating mBART for some Indonesian languages; unfortunately, I haven’t started it yet. Maybe we should collaborate on it.

Btw, we have a Telegram channel discussing Indonesian NLP and Hugging Face; if you like, I can send you an invitation.


Hi Cahya, you’re right about the parallel corpus. I tried to train a multilingual translation model for Tetun-English-Indonesian, but the Tetun-Indonesian quality was poor for this reason.

Yes, I would be happy to join the Telegram channel! Thanks in advance for the invite.

Hi everyone and Pak @cahya!
I just recently found this discussion and am so excited to see so many like-minded people here!

I’m Ivo (@ivokun on GitHub, HF, and other platforms). I work as an associate research engineer. I started learning ML (NLP in particular) in 2017, when my previous company tried to develop article generation (and failed miserably). I then took a graduate degree with NLG as my main research topic.

After a year of not working with NLP, I have recently been able to continue my research on NLG. I tried to train GPT3-XL (with GPT-Neo’s script) on the Indonesian (Bahasa Indonesia) subset of the OSCAR dataset. I just got TRC access and want to fully utilize it this week.

I hope we can collaborate in the near future!


Hi @ivokun, nice to see another person interested in text generation for Indonesian.
Btw, did you mean GPT2-XL? We have experience pretraining GPT2-Large on a 68 GB dataset; it took around 6 days for 1 epoch using a TPU v3-8. GPT2-XL is roughly twice the size of GPT2-Large, so on the Indonesian OSCAR dataset (around 30 GB) you would also need around 6 days per epoch. I am just not sure if you have enough RAM to train GPT2-XL on a TPU v3-8.
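The epoch estimate above follows from a rough scaling rule: time per epoch grows approximately linearly with both dataset size and model size. A back-of-the-envelope sketch (the parameter counts, ~0.77B for GPT2-Large and ~1.5B for GPT2-XL, are the published figures; the result is an estimate, not a benchmark):

```python
def scaled_epoch_days(base_days, base_gb, base_params_b, new_gb, new_params_b):
    # Crude first-order approximation: epoch time scales linearly with
    # data volume (GB) and with parameter count (billions).
    return base_days * (new_gb / base_gb) * (new_params_b / base_params_b)

# Baseline from the post: GPT2-Large (~0.77B params), 68 GB, 6 days/epoch on TPU v3-8.
# Estimate for GPT2-XL (~1.5B params) on the ~30 GB Indonesian OSCAR subset:
days = scaled_epoch_days(6, 68, 0.77, 30, 1.5)
```

The estimate comes out at roughly 5 days per epoch, consistent with the "around 6 days" figure; in practice memory pressure on a v3-8 (e.g. smaller per-core batch sizes) would push this higher.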
Btw, did you mean gpt2-xl? We had experience to pretrain gpt2-large on 68GB datasets, it took around 6 days for 1 epoch using tpu v3-8. If you want to train gpt2-xl, which is twice of the size of gpt-large, on Indonesian oscar dataset (around 30GB), you would need also around 6 days for an epoch. I am just not sure if you have enough ram for training gpt2-xl on tpu v3-8.