Indonesian NLP - Introductions

Hello there! This is the introduction thread for Indonesian NLP enthusiasts.

Welcome, and please feel free to introduce yourself with any of the following:

  • Your name, GitHub, Hugging Face, and/or Twitter handle
  • Your interest in Indonesian NLP
  • Some projects you are working on or interested in starting
  • Any other languages that you speak, any personal interests, or anything else

Hi there Mas @cahya, thanks for initiating this thread!

I’m Wilson Wongso (@w11wo on GitHub and HF), a second-year Computer Science undergraduate student from Jakarta, Indonesia. I’m still relatively new and am still learning NLP/Hugging Face, but am having fun thus far!

I recently trained some small language models with GPT-2 and RoBERTa as a side project during semester break, and I am interested in creating language models for native Indonesian languages like Javanese, Sundanese, Medanese, etc.

Looking forward to connecting with the community here!


Hi, my name is Cahya. I work as a system and software engineer in Vienna, Austria. My interest in ML / NLP started in early 2017 with simple text classification in TensorFlow.

Currently I like to experiment with Conversational AI, Open Domain Question Answering, and Text Summarization. I built some Indonesian language models which are hosted here, and helped add some existing Indonesian NLP datasets to the Hugging Face Datasets collection.

I hope we can connect and work together on interesting Indonesian NLP projects. One of the projects I would like to try is creating an mBART model covering some of the languages spoken in Indonesia (at least the 15 most widely used ones), such as Javanese, Sundanese, or Minangkabau. This could later be used for machine translation among these languages or for other seq2seq tasks.

My handles:


Hello guys, my name is Warto, from IAIN Purwokerto, Indonesia. My interests are NLP and text mining. I am a doctoral student at Dian Nuswantoro University, Semarang, and my research topic is information extraction.
I just finished annotating Indonesian news articles on the COVID-19 topic.


Hi, my name is Akmal. My Hugging Face and GitHub username is Wikidepia.

I have zero background in NLP / machine learning :sweat_smile:
Currently I am interested in creating Indonesian transformer models like T5 and GPT-2, thanks to TFRC :smiley: I am also translating English datasets like PAWS.

Nice to meet you all!


@cahya thanks for sharing your models. Which variant do you reckon would be the best fit (size, inference speed, classification accuracy) if I were to fine-tune it to classify address strings in Indonesian? I am currently experimenting with cahya/bert-base-indonesian-522M to solve a NER problem.

Hi @yptheangel
Glad that you want to use my models. I would suggest using the cahya/bert-base-indonesian-1.5G model for classification accuracy, since it was trained on more data. If you want a smaller model with faster inference, I would suggest cahya/distilbert-base-indonesian, which used cahya/bert-base-indonesian-1.5G as the teacher.
I have also fine-tuned this BERT model for NER as cahya/bert-base-indonesian-NER, using the NER dataset id_nergrit_corpus. However, I still need to write the model card/documentation for it.
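
For anyone who wants to try these checkpoints, here is a minimal sketch of how the choice could be expressed in code. The checkpoint names come from the reply above; the `CHECKPOINTS` labels, `pick_checkpoint`, and `load_ner_pipeline` are hypothetical helper names, and the sketch assumes the `transformers` library is installed and that the NER checkpoint loads with the standard token-classification pipeline. Note that the two base checkpoints would still need fine-tuning before they can label anything.

```python
# Checkpoint names are taken from the post above; the dictionary keys are
# hypothetical labels chosen for this sketch.
CHECKPOINTS = {
    "accuracy": "cahya/bert-base-indonesian-1.5G",  # trained on more data
    "speed": "cahya/distilbert-base-indonesian",    # distilled, smaller and faster
}

# Checkpoint already fine-tuned for NER, as mentioned in the post.
NER_CHECKPOINT = "cahya/bert-base-indonesian-NER"


def pick_checkpoint(priority: str) -> str:
    """Return the base checkpoint to fine-tune for a given priority."""
    try:
        return CHECKPOINTS[priority]
    except KeyError:
        raise ValueError(
            f"unknown priority {priority!r}, expected one of {sorted(CHECKPOINTS)}"
        ) from None


def load_ner_pipeline(model_name: str = NER_CHECKPOINT):
    """Build a token-classification pipeline (downloads the model on first use)."""
    # Imported here so the helpers above work without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entities.
    return pipeline(
        "token-classification",
        model=model_name,
        aggregation_strategy="simple",
    )


# Example usage (requires network access to fetch the model):
#   ner = load_ner_pipeline()
#   print(ner("Jalan Jenderal Sudirman No. 1, Jakarta Pusat"))
```

Whether the existing NER labels are a good match for address strings is something to check on your own data; fine-tuning one of the base checkpoints on annotated addresses may still be needed.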

Hi Mas Akmal, nice to see you here too. Great that you have also built several Indonesian models, and I really appreciate that you created/translated several datasets for Indonesian NLP. Looking at your models and the datasets you created, I am not sure you really have zero background in NLP/machine learning :grin:
Btw, how much longer do you have access to TFRC?

I recently extended my TFRC trial, so I still have around 50 days of access.

That is great, I haven’t used my TFRC access yet. Maybe we could collaborate to build something new with our access :slight_smile:

That’s awesome!

Hi, my name is Reza, and I’m new to NLP, especially to using Hugging Face.
I have a question: is it okay to train a language model (like BERT) on text with many typos (Twitter-like sentences)?
We want to build an LM that can be used for many tasks, but we need inference to be fast enough (<500 ms).
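
On the latency requirement, one way to check a candidate model against the 500 ms budget is to time it directly. Below is a minimal sketch using only the standard library. `measure_latency_ms` and `within_budget` are hypothetical helper names, and `predict` stands for whatever callable wraps your model (for example, a Hugging Face pipeline); none of this is specific to any library.

```python
import statistics
import time

BUDGET_MS = 500.0  # latency target mentioned in the post


def measure_latency_ms(predict, texts, warmup=2, runs=10):
    """Time predict(text) per call; return (median_ms, worst_ms)."""
    # Warm-up passes so one-time costs (lazy loading, caches) are not counted.
    for _ in range(warmup):
        for text in texts:
            predict(text)

    timings = []
    for _ in range(runs):
        for text in texts:
            start = time.perf_counter()
            predict(text)
            timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings), max(timings)


def within_budget(latency_ms, budget_ms=BUDGET_MS):
    """True if the measured latency is under the budget."""
    return latency_ms < budget_ms


# Example usage with any callable that takes one string:
#   median_ms, worst_ms = measure_latency_ms(my_model, ["contoh kalimat twitter"])
#   print(median_ms, worst_ms, within_budget(worst_ms))
```

It is worth measuring on the same hardware and with the same typical input length you expect in production, since both affect the result a lot.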