Indonesian NLP - Introductions

I’m not sure whether the naming is right, but according to GPT-Neo’s repo, GPT3-XL refers to the GPT-3 architecture with 1.3B parameters. I followed their Colab to train on the Indonesian OSCAR dataset.

My training procedure:

  • Split the OSCAR dataset into txt files of 100,000 examples each
  • Trained a new tokenizer (a sketch of this step is below the list)
  • Tokenized the whole corpus and saved it as TFRecord files
  • Trained the model for 600,000 steps with a batch size of 256
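
Roughly, the tokenizer and TFRecord steps could look like the sketch below, assuming the Hugging Face `tokenizers` library and TensorFlow; the paths, vocab size, and feature name are illustrative guesses, not the exact values from the GPT-Neo Colab.

```python
# Sketch of the tokenizer + TFRecord steps; the paths, vocab size and the
# "text" feature name are assumptions, not the GPT-Neo Colab's values.
from pathlib import Path

import tensorflow as tf
from tokenizers import ByteLevelBPETokenizer

# 1. Train a new byte-level BPE tokenizer on the chunked OSCAR txt files.
files = [str(p) for p in Path("oscar_id_chunks").glob("*.txt")]
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=files, vocab_size=50257, special_tokens=["<|endoftext|>"])
Path("tokenizer_id").mkdir(exist_ok=True)
tokenizer.save_model("tokenizer_id")  # writes vocab.json and merges.txt

# 2. Tokenize one chunk and write the token ids into a TFRecord file.
with tf.io.TFRecordWriter("oscar_id_000.tfrecords") as writer:
    with open(files[0], encoding="utf-8") as f:
        for line in f:
            ids = tokenizer.encode(line).ids
            feature = tf.train.Feature(int64_list=tf.train.Int64List(value=ids))
            example = tf.train.Example(
                features=tf.train.Features(feature={"text": feature})
            )
            writer.write(example.SerializeToString())
```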

It takes ~12 hours per 500 steps on a TPUv2-8 (so in total it should take ~600 days to complete! :laughing: :laughing:). I’ve only used Colab Pro so far, and training doesn’t use much memory (<3GB). This time I was just sanity-checking the training procedure, so I haven’t used a TPUv3-8 from TRC yet. It should accelerate training even with a bigger model, right?
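
For reference, the back-of-the-envelope math behind the ~600-day figure:

```python
# Back-of-the-envelope check of the ~600-day estimate quoted above.
steps_total = 600_000
steps_per_chunk = 500
hours_per_chunk = 12  # observed: ~12 hours per 500 steps on a TPUv2-8

total_hours = steps_total / steps_per_chunk * hours_per_chunk
print(total_hours / 24)  # -> 600.0 days
```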

Hello Pak @cahya and all members, I’m Ari (GitHub & Hugging Face: arynas) from Jogjakarta.

I work as an AI & Machine Learning Researcher at netray[dot]id and have done research in Computer Vision, NLP, and Speech Recognition. My current research focus is creating a wav2vec model from scratch for the Indonesian and Javanese languages. My dream is to create pretrained speech models for all local languages in Indonesia.
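
In case it is useful to others, here is a minimal sketch of what initializing a wav2vec 2.0 model from scratch (random weights, no pretrained checkpoint) looks like with the `transformers` library; the default base-size config is an assumption, not the actual setup described above.

```python
# Minimal sketch: a randomly initialized wav2vec 2.0 model, ready for
# pretraining from scratch; the default base-size config is an assumption.
from transformers import Wav2Vec2Config, Wav2Vec2ForPreTraining

config = Wav2Vec2Config()               # base-size wav2vec 2.0 architecture
model = Wav2Vec2ForPreTraining(config)  # random weights, no pretrained checkpoint
print(f"{model.num_parameters():,} parameters")
```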

I hope we can collaborate and create some interesting projects in the future!


Hello Pak @arynas, nice to see you here.

We are doing research at GitHub - indonesian-nlp/multilingual-asr: Multilingual Speech Recognition for Indonesian Languages. Currently it covers Indonesian, Javanese, and Sundanese, but we hope to add other languages if we can get their speech datasets.

Feel free to join our Telegram group if you like. I’ll send you the link.

Hello Mr. Cahya, can you send me the Telegram group link so I can join? I’m confused about some transformer models. Thanks!

Hi there, my name is Sandy (GitHub: ilos-vigil, Hugging Face: ilos-vigil). My current interests are NLP and lightweight DL models. Currently I’m trying to pretrain a light LM with long sequences using the BigBird architecture.
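
For context, a minimal sketch of configuring a lightweight BigBird LM for long sequences with the `transformers` library; all sizes below are illustrative assumptions, not the actual settings of the model described above.

```python
# Minimal sketch of a lightweight BigBird LM for long sequences; every
# size below is an illustrative assumption, not the actual settings.
from transformers import BigBirdConfig, BigBirdForMaskedLM

config = BigBirdConfig(
    max_position_embeddings=4096,   # long input sequences
    attention_type="block_sparse",  # BigBird's sparse attention pattern
    hidden_size=512,                # smaller than bigbird-roberta-base
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
)
model = BigBirdForMaskedLM(config)  # randomly initialized for MLM pretraining
print(f"{model.num_parameters():,} parameters")
```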
