Indonesian NLP - Introductions

I’m not sure whether the naming is right, but according to GPT-Neo’s repo, GPT3-XL refers to the GPT-3 architecture with 1.3B parameters. I followed their Colab to train on the Indonesian OSCAR dataset.

My training procedure:

  • Split the OSCAR dataset into txt files of 100,000 examples each
  • Trained a new tokenizer (a sketch of this step is below the list)
  • Tokenized the whole corpus and saved it as TFRecord files
  • Trained the model for 600,000 steps with a batch size of 256
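
Roughly, the tokenizer and TFRecord steps could look like the sketch below, assuming the Hugging Face `tokenizers` library and TensorFlow; the paths, vocab size, and feature name are illustrative guesses, not the exact values from the GPT-Neo Colab.

```python
# Sketch of the tokenizer + TFRecord steps; the paths, vocab size and the
# "text" feature name are assumptions, not the GPT-Neo Colab's values.
from pathlib import Path

import tensorflow as tf
from tokenizers import ByteLevelBPETokenizer

# 1. Train a new byte-level BPE tokenizer on the chunked OSCAR txt files.
files = [str(p) for p in Path("oscar_id_chunks").glob("*.txt")]
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=files, vocab_size=50257, special_tokens=["<|endoftext|>"])
Path("tokenizer_id").mkdir(exist_ok=True)
tokenizer.save_model("tokenizer_id")  # writes vocab.json and merges.txt

# 2. Tokenize one chunk and write the token ids into a TFRecord file.
with tf.io.TFRecordWriter("oscar_id_000.tfrecords") as writer:
    with open(files[0], encoding="utf-8") as f:
        for line in f:
            ids = tokenizer.encode(line).ids
            feature = tf.train.Feature(int64_list=tf.train.Int64List(value=ids))
            example = tf.train.Example(
                features=tf.train.Features(feature={"text": feature})
            )
            writer.write(example.SerializeToString())
```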

It takes ~12 hours per 500 steps on a TPUv2-8 (so in total it should take ~600 days to complete! :laughing: :laughing:). I’ve only used Colab Pro so far, and training doesn’t use much memory (<3GB). This time I was just sanity-checking the training procedure, so I haven’t used a TPUv3-8 from TRC yet. It should accelerate training even with a bigger model, right?
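
For reference, the back-of-the-envelope math behind the ~600-day figure:

```python
# Back-of-the-envelope check of the ~600-day estimate quoted above.
steps_total = 600_000
steps_per_chunk = 500
hours_per_chunk = 12  # observed: ~12 hours per 500 steps on a TPUv2-8

total_hours = steps_total / steps_per_chunk * hours_per_chunk
print(total_hours / 24)  # -> 600.0 days
```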

Hello Pak @cahya and all members, I’m Ari (GitHub & Hugging Face: arynas) from Jogjakarta.

I work as an AI & Machine Learning Researcher at netray[dot]id and have done research in Computer Vision, NLP, and Speech Recognition. My current research focus is creating a wav2vec model from scratch for the Indonesian and Javanese languages. My dream is to create pretrained speech models for all local languages in Indonesia.
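
In case it is useful to others, here is a minimal sketch of what initializing a wav2vec 2.0 model from scratch (random weights, no pretrained checkpoint) looks like with the `transformers` library; the default base-size config is an assumption, not the actual setup described above.

```python
# Minimal sketch: a randomly initialized wav2vec 2.0 model, ready for
# pretraining from scratch; the default base-size config is an assumption.
from transformers import Wav2Vec2Config, Wav2Vec2ForPreTraining

config = Wav2Vec2Config()               # base-size wav2vec 2.0 architecture
model = Wav2Vec2ForPreTraining(config)  # random weights, no pretrained checkpoint
print(f"{model.num_parameters():,} parameters")
```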

I hope we can collaborate and create some interesting projects in the future!


Hello Pak @arynas, nice to see you here.

We are doing research at GitHub - indonesian-nlp/multilingual-asr: Multilingual Speech Recognition for Indonesian Languages. Currently it covers Indonesian, Javanese, and Sundanese, but we hope to add other languages if we can get their speech datasets.

Feel free to join our Telegram group if you like. I’ll send you the link.

Hello Mr. Cahya, can you send me the Telegram group link so I can join? I’m confused about some transformer models. Thanks!

Hi there, my name is Sandy (GitHub: ilos-vigil, Hugging Face: ilos-vigil). My current interests are NLP and lightweight DL models. Currently I’m trying to pretrain a light LM with long sequences using the BigBird architecture.
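
For context, a minimal sketch of configuring a lightweight BigBird LM for long sequences with the `transformers` library; all sizes below are illustrative assumptions, not the actual settings of the model described above.

```python
# Minimal sketch of a lightweight BigBird LM for long sequences; every
# size below is an illustrative assumption, not the actual settings.
from transformers import BigBirdConfig, BigBirdForMaskedLM

config = BigBirdConfig(
    max_position_embeddings=4096,   # long input sequences
    attention_type="block_sparse",  # BigBird's sparse attention pattern
    hidden_size=512,                # smaller than bigbird-roberta-base
    num_hidden_layers=6,
    num_attention_heads=8,
    intermediate_size=2048,
)
model = BigBirdForMaskedLM(config)  # randomly initialized for MLM pretraining
print(f"{model.num_parameters():,} parameters")
```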
