PreTrain RoBERTa/T5 from scratch for Programming Languages

RoBERTa/T5 for Programming Languages

Currently, there is no MLM-based model on the hub that was trained from scratch on a variety of programming languages (a few related models are available; see UPDATE 1 below). For this project, the goal is to create a strong programming-language generation model.

2. Language

Primary Languages: C++, Python, Java

Apart from that, we can also consider the following languages based on availability. We have scraped a large set of programming-language code; however, we are still in the process of discussing licensing issues with the data owner. These are the scraped languages so far:

lang_dict = {
    "C":["GNU C", "GNU C11"], 
    "C++":[ "GNU C++", "GNU C++0x", "GNU C++11", "GNU C++14", "GNU C++17", "GNU C++17 Diagnostics", "MS C++", "MS C++ 2017", "Clang++17 Diagnostics"], 
    "C#":[".NET Core C#", "Mono C#", "MS C#"],
    "D":["D"],
    "Delphi": ["Delphi"],
    "FPC":["FPC"],
    "Go":["Go"],
    "Haskell":["Haskell"],
    "Java":["Java 6", "Java 7", "Java 8", "Java 11", "Kotlin"],
    "JavaScript":["JavaScript", "Node.js"],
    "Ocaml":["Ocaml"],
    "PascalABC.NET":["PascalABC.NET"],
    "Perl":["Perl"],
    "PHP":["PHP"],
    "Python 2":["PyPy 2", "Python 2"],
    "Python 3":["PyPy 3", "Python 3"],
    "Ruby":["Ruby"],
    "Rust":["Rust"],
    "Scala":["Scala"]
}

Please note that we cannot share our own code dataset without resolving the licensing issue. However, there is enough data available online to train an LM.

3. Model

A randomly initialized RoBERTa model.
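For illustration, a from-scratch (randomly initialized) Flax RoBERTa model can be created directly from a config, roughly as sketched below; the hyperparameters shown are placeholders, not final choices.

from transformers import RobertaConfig, FlaxRobertaForMaskedLM

# Placeholder hyperparameters -- the real values will depend on the tokenizer
# we train and on the compute budget.
config = RobertaConfig(
    vocab_size=50_265,
    max_position_embeddings=514,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    type_vocab_size=1,
)

# Instantiating the Flax class from a config (instead of from_pretrained)
# yields randomly initialized weights.
model = FlaxRobertaForMaskedLM(config, seed=42)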

4. Datasets

We can use datasets from the following sources (a loading sketch for CodeSearchNet follows the list):

  1. GitHub - facebookresearch/TransCoder: Public release of the TransCoder research project https://arxiv.org/pdf/2006.03511.pdf
  2. GitHub - github/CodeSearchNet: Datasets, tools, and benchmarks for representation learning of code.
  3. If we can resolve the licensing issue, we can share our own datasets via hugging-face dataset pipeline.
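As a rough sketch (assuming the CodeSearchNet dump is hosted on the hub under the name code_search_net, with one configuration per language), option 2 could be loaded through the datasets library:

from datasets import load_dataset

# Assumed dataset name and configuration; one config per language (python, java, go, ...).
dataset = load_dataset("code_search_net", "python", split="train")

# Each example carries the raw function source in a string field
# (e.g. "func_code_string"; exact field names may differ between versions).
print(dataset[0]["func_code_string"][:200])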

5. Training scripts

A masked language modeling script for Flax is available here.

6. Challenges

  • Data pre-processing
  • Sample preparation for LM (see the masking sketch below)
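For the sample-preparation point, one option is the standard BERT/RoBERTa masking recipe: select 15% of tokens, replace 80% of those with the mask token, 10% with a random token, and leave 10% unchanged. A minimal NumPy sketch (mask_token_id and vocab_size are hypothetical arguments; special tokens are ignored here for brevity):

import numpy as np

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15, rng=None):
    # Standard MLM masking: 15% of positions are selected for prediction;
    # of those, 80% become <mask>, 10% a random token, 10% stay unchanged.
    rng = rng or np.random.default_rng()
    input_ids = np.array(input_ids)
    labels = input_ids.copy()

    selected = rng.random(input_ids.shape) < mlm_probability
    labels[~selected] = -100  # loss is only computed on selected positions

    # 80% of selected positions -> mask token
    masked = selected & (rng.random(input_ids.shape) < 0.8)
    input_ids[masked] = mask_token_id

    # half of the remaining selected positions (i.e. 10% overall) -> random token
    randomized = selected & ~masked & (rng.random(input_ids.shape) < 0.5)
    input_ids[randomized] = rng.integers(0, vocab_size, size=int(randomized.sum()))

    # the rest of the selected positions are left unchanged
    return input_ids, labels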

7. Desired project outcome

The desired project output is an MLM/T5 model that is able to generate programming-language code.

8. Reads

The most important read would be the following Colab notebook:

Apart from that, we need to check the tokenization procedure used in the following projects (a minimal tokenizer-training sketch follows this list):

  1. GitHub - facebookresearch/TransCoder: Public release of the TransCoder research project https://arxiv.org/pdf/2006.03511.pdf
  2. GitHub - github/CodeSearchNet: Datasets, tools, and benchmarks for representation learning of code.
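As a starting point for the tokenization step, a byte-level BPE tokenizer (the same family RoBERTa uses) can be trained directly on raw source files with the tokenizers library; the file paths and vocabulary size below are placeholders:

import os
from tokenizers import ByteLevelBPETokenizer

# Placeholder paths -- point these at the raw source files of the corpus.
files = ["data/train.py.txt", "data/train.java.txt", "data/train.cpp.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=50_265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("tokenizer", exist_ok=True)
tokenizer.save_model("tokenizer")  # writes vocab.json and merges.txt

The saved vocab.json/merges.txt should then be loadable with RobertaTokenizerFast for pre-training.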

UPDATE 1: Given the interest from many people, I must add that there are already some pre-trained code models on the Hugging Face hub which I was not aware of at the time of writing this proposal. Model link: the Hugging Face model hub.

UPDATE 2: We can try training a T5-like model on programming languages instead of an MLM-like model.
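If we go the T5 route, the analogous from-scratch initialization in Flax would look roughly like this (placeholder hyperparameters close to t5-base; the real vocab size depends on the tokenizer we train):

from transformers import T5Config, FlaxT5ForConditionalGeneration

# Placeholder configuration; adjust once the tokenizer and compute budget are fixed.
config = T5Config(
    vocab_size=32_128,
    d_model=768,
    num_layers=12,
    num_heads=12,
)
model = FlaxT5ForConditionalGeneration(config, seed=42)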

5 Likes

Interesting project. I am in!

2 Likes

I am also available for this project.

1 Like

Very interesting idea!
I am aware of the FAIR paper and have always wanted to extend that work in some way.
This would be a great opportunity for me to learn JAX/HF and also contribute to the translation of programming languages.
Count me in!

This seems very interesting. I would like to be a part of it.

1 Like

I’m also extremely interested in this. Count me in if there’s still space!

Hello!

I’m really interested in joining this team as well and improving upon existing models. This project could yield models that go from NL to code, from code to NL, and possibly from one programming language to another.

Edit: Would it be possible to make two teams since the size limit is 5? I am interested in turning this into a new research project and trying out pretraining modified RoBERTa and T5 architectures. I can share more details upon joining/meeting the team.

3 Likes

Hi @taisazero! I can understand your excitement. In the current scenario, we will only have 7 days to train our model, so I think it will be quite difficult to turn this into a research project. How about this: if the proposal gets accepted, we first concentrate on training a good LM from scratch. Later on, if some members of the group want to collaborate further, they can do so. I am as excited as you are, but I fear that we don’t have enough time to pull this off as a research project unless someone has already planned everything.

1 Like

Great to see so much interest! We can definitely make multiple teams or even think about one big team with access to multiple TPUs.

4 Likes

Yes, I definitely agree – we have to walk before we run :’). I definitely want to stick to a plan we can accomplish in the allotted time (7 days); I’m happy with the proposal and excited to learn and work with others on it. I hope to collaborate with others on this topic as a research project in the future. I think this is a great first step.

Also nice to meet you all! :’)

Great to also hear that we have the option to create one big team or multiple teams!

I wonder if our big group (or each smaller team, if we go that route) can train multiple architectures (e.g. both RoBERTa and T5) so that each of us gets the chance to learn/practice JAX. Again, if we have time, it might be great to fine-tune our pretrained model on benchmark code generation tasks, e.g. CoNaLa and Shellcode_IA32 (self-plug right here :’))

Hope this proposal makes it through!

1 Like

@sbmaruf Very interesting idea. I would love to join this project.

@sbmaruf This is a great idea. I am really interested in it and want to join the project. I have some experience fine-tuning transformers and I am interested in learning the new framework to speed up the training process.

2 Likes

Interesting project. I wanna be a part of this.

1 Like

So many interesting projects :grinning_face_with_smiling_eyes: Are you guys maybe planning to work on this even after the community week ends?

Awesome - finalizing this project!

1 Like

Please join the Discord server here.
We will have a meeting today (2 July) at 6 PM SGT.

@tasnim @jackal1586 @naruto7 @au1206 @gagan3012 @kmfoda @taisazero @Vaibhavbrkn @zcheng2046 @mp1

1 Like

Please join the Discord server here.
We will have a meeting today (2 July) at 6 PM SGT. @hassiahk

1 Like

Awesome, this group has a lot of participation :slight_smile:

Giving you guys direct access to TPUs tomorrow! I split the team randomly into two in the official Google sheet, but this shouldn’t change anything - just that you have access to 2 TPU v3-8s :slight_smile:

It might make organization a bit easier to split the work across two VMs!

2 Likes

Thanks @sbmaruf. Looking forward to it! I get a “no text channels” error when I click on the link provided for the discord channel though?

Discord server link: Flax-HuggingFace-Community-Week
After logging in, find the channel in the left panel or try the link above again.

Otherwise, shoot me an email: sbmaruf at gmail dot com.

1 Like