Indonesian NLP - Introductions

cahya · February 21, 2021, 9:37am

Hello there! This is the introduction thread for Indonesian NLP enthusiasts.

Welcome, and please feel free to introduce yourself with any of the following:

Your name, Github, Hugging Face, and/or Twitter handle
Your interest in Indonesian NLP
Some projects you are working on or interested in starting
Any other languages that you speak, any personal interests, or anything else

w11wo · February 21, 2021, 9:58am

Hi there Mas @cahya, thanks for initiating this thread!

I’m Wilson Wongso (@w11wo in Github and HF), a year-2 Computer Science undergraduate student from Jakarta, Indonesia. I’m still relatively new and am still learning NLP/Hugging Face, but am having fun thus far!

I trained some small language models with GPT-2 and RoBERTa recently as my side project during semester break and am interested to create language models for native Indonesian languages like Javanese, Sundanese, Medanese, etc.

Looking forward to connecting with the community here!

cahya · February 21, 2021, 1:04pm

Hi, my name is Cahya. I work as a system and software engineer in Vienna, Austria. My interest in ML / NLP started in early 2017 with a simple text classification with Tensorflow.

Currently I like to experiment with Conversational AI, Open Domain Question Answering, and Text Summarization. I built some Indonesian language models which are hosted here and helped to put some existing Indonesian NLP datasets to the collection of Hf datasets.

I hope we could connect and work together on interesting Indonesian nlp projects. One of the projects I would like to try is creating an MBART model with a collection of some of existing languages in Indonesia (at least the 15 most used one) such as Javanese, Sundanese or Minangkabau. This could later be used for machine translation among these languages or other seq2seq tasks.

My handles:

Huggingface: cahya
Github: cahya-wirawan

warto · February 23, 2021, 9:40am

Hello guys, my name is warto from IAIN Purwokerto Indonesia. My interest on NLP and text mining. I am doctoral student at Dian Nuswantoro University Semarang. My research topic about information extraction.
I just finished annotate Indonesian news with covid19 topic

Wikidepia · February 24, 2021, 12:18am

Hi, My name is Akmal. My Huggingface, GitHub username is Wikidepia.

I have zero background on NLP / Machine learning
Currently i am interested in creating Indonesia transformer models like T5 and GPT-2. Thanks to TFRC Also translating english dataset like PAWS.

Nice to meet you all!

yptheangel · March 14, 2021, 6:05am

@cahya thx for sharing your models, which variant do you reckon will be the best fit(size, inference speed, classification accuracy) if I were to fine tune to classify address strings in indonesian? currently experimenting with cahya/bert-base-indonesian-522M. Solving a NER problem

cahya · March 15, 2021, 12:24pm

Hi @yptheangel
Glad that you want to use my models. I would suggest to use cahya/bert-base-indonesian-1.5G model for classification accuracy since it was trained with more data. If you want to use smaller model with faster inference speed, I would suggest the model cahya/distilbert-base-indonesian, which used cahya/bert-base-indonesian-1.5G as the teacher.
I have also fine tuned this bert model for NER cahya/bert-base-indonesian-NER · Hugging Face, which used the NER dataset id_nergrit_corpus · Datasets at Hugging Face. However, I still need to write model card/documentation about it.

cahya · March 15, 2021, 12:40pm

Hi mas Akmal, nice to see you here also. Great that you built also several Indonesian models, I also really appreciate that you created/translated several datasets for Indonesian NLP. If I see your models and the datasets you created, I am not sure if you really have zero background on NLP/Machine Learning
Btw, how long do you still have access to TFRC?

Wikidepia · March 15, 2021, 12:54pm

I recently extended my TFRC trial, so i still have around 50days of access.

cahya · March 15, 2021, 12:56pm

That is great, I still don’t use my access to TFRC. Maybe we could collaborate to build something new with our access

Wikidepia · March 16, 2021, 4:34am

Thats awesome!

reza · March 28, 2021, 7:07am

Hi, My name is Reza, and im new to nlp especially utilizing hugging face.
I have a question, is it okay to train language model (like bert) with many typo word (twitter-like sentence) ?
we want to make lm so it can be used for many task, but we need inference time fast enough (<500ms)

cahya · June 24, 2021, 5:35pm

Maybe you could try the ByT5 model, the author wrote that their model is robust to noise like text in twitter

raphaelmerx · June 30, 2021, 3:36am

Halo teman2, my name is Rapha, (github: github.com/raphaelmerx/, twitter: https://twitter.com/RaphaelMerx/), I’m working on tetun.org, an online translator for the Tetun language (which is the main language spoken in Timor-Leste). Currently tetun.org supports Tetun-English only, but I’d like to add Tetun-Indonesian, since this is a frequent request by the app users.

Like Indonesian, Tetun is an Austronesian language, and so I’m very interested in Indonesian NLP because it’s a “sibling” language to Tetun with a lot more resources.

I speak Indonesian, and would love to get involved in some pure-Indonesian NLP projects, time permits! @cahya I really like your idea of creating an mBART model for the languages of Indonesia, did you get a chance to try it?

raphaelmerx · October 1, 2021, 10:22pm

Hi everyone, just to let you know that PyconID will be on 4-5 Dec this year, and the call for proposal is now open at PaperCall.io - PyCon Indonesia 2021, Online.

Would be great to have a talk on Indonesian NLP! For those interested please submit a talk proposal.

cahya · October 1, 2021, 11:37pm

Hi Raphael, sorry for late response. Yes, it would be nice if we could have tetun-indonesian, the challenge here is its parallel corpus.

Good that you like the idea for creating mbart for some Indonesian language, unfortunately I still don’t start it yet. Maybe we should collaborate to do it.

Btw, we have a telegram channel discussing Indonesian nlp and huggingface, if you like, I can send you an invitation

raphaelmerx · October 2, 2021, 5:14am

Hi Cahya, you’re right about the parallel corpus, I tried to train a multilingual translation model Tetun - English - Indonesian but the Tetun-Indonesian quality was poor for this reason.

Yes I would be happy to join the Telegram channel! Thanks in advance for the invite.

ivokun · November 15, 2021, 5:05am

Hi everyone and Pak @cahya!
I just recently found this discussion and am so excited with a lot of like-minded here!

I’m Ivo (@ivokun in Github, HF, or other platforms). I work as an associate research engineer. I started to learn ML (NLP in particular) in 2017 when my previous company try to develop article generation (and failed miserably). Then took a graduate degree with NLG as my main research topic.

After a year of not working with NLP, recently, I can continue my research on NLG again. I tried to train GPT3-XL (with GPTNeo’s script) with the Bahasa Indonesia subset of the Oscar dataset. Just got TRC and want to fully utilize it this week.

I hope we can collaborate in the near future!

cahya · November 15, 2021, 8:09pm

Hi @ivokun, nice also to see another Person interested in text generation for Indonesian.
Btw, did you mean gpt2-xl? We had experience to pretrain gpt2-large on 68GB datasets, it took around 6 days for 1 epoch using tpu v3-8. If you want to train gpt2-xl, which is twice of the size of gpt-large, on Indonesian oscar dataset (around 30GB), you would need also around 6 days for an epoch. I am just not sure if you have enough ram for training gpt2-xl on tpu v3-8.

cahya · November 15, 2021, 8:16pm

As mentioned above, we have telegram group where we discuss about indonesian nlp/huggingface, I sent you an invitation in case you want to join us.

Topic		Replies	Views
Thai NLP - Introductions Languages at Hugging Face	3	1640	October 10, 2022
Turkish NLP - Introductions Languages at Hugging Face	31	5613	April 10, 2024
Hindi NLP Introduction 🔥 Languages at Hugging Face	39	4818	March 8, 2021
Italian NLP - Introductions Languages at Hugging Face	8	4179	February 28, 2023
Korean NLP - Introductions Languages at Hugging Face	2	1241	June 27, 2023

Indonesian NLP - Introductions

Related topics