PreTrain RoBERTa/T5 from scratch for Programming Languages

RoBERTa/T5 for Programming Languages

Currently, there is no MLM-based model on the hub that was trained from scratch on a variety of programming languages (a few related models are available; see UPDATE 1 below). For this project, the goal is to create a strong programming-language generation model.

2. Language

Primary Languages: C++, Python, Java

Apart from that, we can also consider the following languages based on availability. We have scraped a large set of programming-language code; however, we are still in the process of discussing licensing issues with the data owner. These are the scraped languages so far:

lang_dict = {
    "C":["GNU C", "GNU C11"], 
    "C++":[ "GNU C++", "GNU C++0x", "GNU C++11", "GNU C++14", "GNU C++17", "GNU C++17 Diagnostics", "MS C++", "MS C++ 2017", "Clang++17 Diagnostics"], 
    "C#":[".NET Core C#", "Mono C#", "MS C#"],
    "D":["D"],
    "Delphi": ["Delphi"],
    "FPC":["FPC"],
    "Go":["Go"],
    "Haskell":["Haskell"],
    "Java":["Java 6", "Java 7", "Java 8", "Java 11", "Kotlin"],
    "JavaScript":["JavaScript", "Node.js"],
    "Ocaml":["Ocaml"],
    "PascalABC.NET":["PascalABC.NET"],
    "Perl":["Perl"],
    "PHP":["PHP"],
    "Python 2":["PyPy 2", "Python 2"],
    "Python 3":["PyPy 3", "Python 3"],
    "Ruby":["Ruby"],
    "Rust":["Rust"],
    "Scala":["Scala"]
}

Please note that we cannot share our own code dataset without resolving the licensing issue. However, there is enough data available online to train an LM.

3. Model

A randomly initialized RoBERTa model.
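For illustration, a from-scratch (randomly initialized) Flax RoBERTa model can be created directly from a config, roughly as sketched below; the hyperparameters shown are placeholders, not final choices.

from transformers import RobertaConfig, FlaxRobertaForMaskedLM

# Placeholder hyperparameters -- the real values will depend on the tokenizer
# we train and on the compute budget.
config = RobertaConfig(
    vocab_size=50_265,
    max_position_embeddings=514,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    type_vocab_size=1,
)

# Instantiating the Flax class from a config (instead of from_pretrained)
# yields randomly initialized weights.
model = FlaxRobertaForMaskedLM(config, seed=42)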

4. Datasets

We can use datasets from the following sources (a loading sketch for CodeSearchNet follows the list):

  1. GitHub - facebookresearch/TransCoder: Public release of the TransCoder research project https://arxiv.org/pdf/2006.03511.pdf
  2. GitHub - github/CodeSearchNet: Datasets, tools, and benchmarks for representation learning of code.
  3. If we can resolve the licensing issue, we can share our own datasets via hugging-face dataset pipeline.
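As a rough sketch (assuming the CodeSearchNet dump is hosted on the hub under the name code_search_net, with one configuration per language), option 2 could be loaded through the datasets library:

from datasets import load_dataset

# Assumed dataset name and configuration; one config per language (python, java, go, ...).
dataset = load_dataset("code_search_net", "python", split="train")

# Each example carries the raw function source in a string field
# (e.g. "func_code_string"; exact field names may differ between versions).
print(dataset[0]["func_code_string"][:200])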

5. Training scripts

A masked language modeling script for Flax is available here.

6. Challenges

  • Data pre-processing
  • Sample preparation for LM (see the masking sketch below)
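For the sample-preparation point, one option is the standard BERT/RoBERTa masking recipe: select 15% of tokens, replace 80% of those with the mask token, 10% with a random token, and leave 10% unchanged. A minimal NumPy sketch (mask_token_id and vocab_size are hypothetical arguments; special tokens are ignored here for brevity):

import numpy as np

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15, rng=None):
    # Standard MLM masking: 15% of positions are selected for prediction;
    # of those, 80% become <mask>, 10% a random token, 10% stay unchanged.
    rng = rng or np.random.default_rng()
    input_ids = np.array(input_ids)
    labels = input_ids.copy()

    selected = rng.random(input_ids.shape) < mlm_probability
    labels[~selected] = -100  # loss is only computed on selected positions

    # 80% of selected positions -> mask token
    masked = selected & (rng.random(input_ids.shape) < 0.8)
    input_ids[masked] = mask_token_id

    # half of the remaining selected positions (i.e. 10% overall) -> random token
    randomized = selected & ~masked & (rng.random(input_ids.shape) < 0.5)
    input_ids[randomized] = rng.integers(0, vocab_size, size=int(randomized.sum()))

    # the rest of the selected positions are left unchanged
    return input_ids, labels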

7. Desired project outcome

The desired project output is an MLM/T5 model that is able to generate programming-language code.

8. Reads

The most important read would be the following Colab notebook:

Apart from that, we need to check the tokenization procedure used in the following projects (a minimal tokenizer-training sketch follows this list):

  1. GitHub - facebookresearch/TransCoder: Public release of the TransCoder research project https://arxiv.org/pdf/2006.03511.pdf
  2. GitHub - github/CodeSearchNet: Datasets, tools, and benchmarks for representation learning of code.
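As a starting point for the tokenization step, a byte-level BPE tokenizer (the same family RoBERTa uses) can be trained directly on raw source files with the tokenizers library; the file paths and vocabulary size below are placeholders:

import os
from tokenizers import ByteLevelBPETokenizer

# Placeholder paths -- point these at the raw source files of the corpus.
files = ["data/train.py.txt", "data/train.java.txt", "data/train.cpp.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=50_265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("tokenizer", exist_ok=True)
tokenizer.save_model("tokenizer")  # writes vocab.json and merges.txt

The saved vocab.json/merges.txt should then be loadable with RobertaTokenizerFast for pre-training.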

UPDATE 1: Given the interest from many people, I must add that there are already some pre-trained code models on the Hugging Face hub which I was not aware of at the time of writing this proposal. Model link: the Hugging Face model hub.

UPDATE 2: We can try training a T5-like model on programming languages instead of an MLM-like model.
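If we go the T5 route, the analogous from-scratch initialization in Flax would look roughly like this (placeholder hyperparameters close to t5-base; the real vocab size depends on the tokenizer we train):

from transformers import T5Config, FlaxT5ForConditionalGeneration

# Placeholder configuration; adjust once the tokenizer and compute budget are fixed.
config = T5Config(
    vocab_size=32_128,
    d_model=768,
    num_layers=12,
    num_heads=12,
)
model = FlaxT5ForConditionalGeneration(config, seed=42)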

5 Likes

Interesting project. I am in!

2 Likes

I am also available for this project.

1 Like

Very interesting idea!
I am aware of the FAIR paper and have always wanted to extend that work in some way.
This would be a great opportunity for me to learn JAX/HF and also contribute to the translation of programming languages.
Count me in!

This seems very interesting. I would like to be a part of it.

1 Like

I’m also extremely interested in this. Count me in if there’s still space!

Hello!

I’m really interested in joining this team as well and improving upon existing models. This project could yield models that go from NL to code, from code to NL, and possibly from one programming language to another.

Edit: Would it be possible to make two teams since the size limit is 5? I am interested in turning this into a new research project and trying out pretraining modified RoBERTa and T5 architectures. I can share more details upon joining/meeting the team.

3 Likes

Hi @taisazero! I can understand your excitement. In the current scenario, we will only have 7 days to train our model, so I think it will be quite difficult to turn this into a research project. How about this: if the proposal gets accepted, we first concentrate on training a good LM from scratch. Later on, if some members of the group want to collaborate further, they can do so. I am as excited as you are, but I fear that we don’t have enough time to pull this off as a research project unless someone has already planned everything.

1 Like

Great to see so much interest! We can definitely make multiple teams or even think about one big team with access to multiple TPUs.

4 Likes

Yes, I definitely agree – we have to walk before we run :’). I definitely want to stick to a plan we can accomplish in the allotted time (7 days); I’m happy with the proposal and excited to learn and work with others on it. I hope to collaborate with others on this topic as a research project in the future. I think this is a great first step.

Also nice to meet you all! :’)

Great to also hear that we have the option to create one big team or multiple teams!

I wonder if our big group (or each smaller team, if we go that route) can train multiple architectures (e.g. both RoBERTa and T5) so that each of us gets the chance to learn/practice JAX. Again, if we have time, it might be great to fine-tune our pretrained model on benchmark code generation tasks, e.g. CoNaLa and Shellcode_IA32 (self-plug right here :’))

Hope this proposal makes it through!

1 Like

@sbmaruf Very interesting idea. I would love to join this project.

@sbmaruf This is a great idea. I am really interested in it and want to join the project. I have some experience fine-tuning transformers and I am interested in learning the new framework to speed up the training process.

2 Likes

Interesting project. I wanna be a part of this.

1 Like

So many interesting projects :grinning_face_with_smiling_eyes: Are you guys maybe planning to work on this even after the community week ends?

Awesome - finalizing this project!

1 Like

Please join the Discord server here.
We will have a meeting today (2 July) at 6 PM SGT.

@tasnim @jackal1586 @naruto7 @au1206 @gagan3012 @kmfoda @taisazero @Vaibhavbrkn @zcheng2046 @mp1

1 Like

Please join the Discord server here.
We will have a meeting today (2 July) at 6 PM SGT. @hassiahk

1 Like

Awesome, this group has a lot of participation :slight_smile:

Giving you guys direct access to TPUs tomorrow! I split the team randomly into two in the official Google sheet, but this shouldn’t change anything - just that you have access to 2 TPU v3-8s :slight_smile:

It might make organization a bit easier to split the work across two VMs!

2 Likes

Thanks @sbmaruf. Looking forward to it! I get a “no text channels” error when I click on the link provided for the discord channel though?

Discord server link: Flax-HuggingFace-Community-Week
After logging in, find the channel in the left panel or try the link above again.

Otherwise, shoot me an email: sbmaruf at gmail dot com.

1 Like