RoBERTa/T5 for Programming Languages
Currently, there are very few MLM-based models on the Hub that were trained from scratch on a variety of programming languages (some do exist; see UPDATE 1 below). For this project, the goal is to create a strong programming-language generation model.
2. Language
Primary Languages: C++, Python, Java
Apart from those, we can also consider the following languages, depending on data availability. We have scraped a large set of programming-language code; however, we are still discussing licensing issues with the data owner. These are the scraped languages so far:
lang_dict = {
    "C": ["GNU C", "GNU C11"],
    "C++": ["GNU C++", "GNU C++0x", "GNU C++11", "GNU C++14", "GNU C++17", "GNU C++17 Diagnostics", "MS C++", "MS C++ 2017", "Clang++17 Diagnostics"],
    "C#": [".NET Core C#", "Mono C#", "MS C#"],
    "D": ["D"],
    "Delphi": ["Delphi"],
    "FPC": ["FPC"],
    "Go": ["Go"],
    "Haskell": ["Haskell"],
    "Java": ["Java 6", "Java 7", "Java 8", "Java 11", "Kotlin"],
    "JavaScript": ["JavaScript", "Node.js"],
    "Ocaml": ["Ocaml"],
    "PascalABC.NET": ["PascalABC.NET"],
    "Perl": ["Perl"],
    "PHP": ["PHP"],
    "Python 2": ["PyPy 2", "Python 2"],
    "Python 3": ["PyPy 3", "Python 3"],
    "Ruby": ["Ruby"],
    "Rust": ["Rust"],
    "Scala": ["Scala"]
}
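During preprocessing, a small helper like the sketch below could map the raw compiler/runtime labels in the scraped data back to a canonical language name. This is a hypothetical helper of my own (only lang_dict above comes from the proposal):

# Invert lang_dict so that a raw label such as "GNU C++17" maps to "C++".
label_to_lang = {
    label: lang
    for lang, labels in lang_dict.items()
    for label in labels
}

def canonical_language(raw_label):
    """Return the canonical language for a raw compiler/runtime label, or None if unknown."""
    return label_to_lang.get(raw_label)

# Example: canonical_language("GNU C++17") -> "C++"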
Please note that we cannot share our own code dataset until the licensing issue is resolved. However, there is enough data available online to train an LM.
3. Model
A randomly initialized RoBERTa model.
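As a sketch (assuming the Flax implementation in 🤗 Transformers; the config values below are illustrative, not final), a randomly initialized model can be created directly from a config:

from transformers import RobertaConfig, FlaxRobertaForMaskedLM

# Illustrative hyperparameters; the final vocab size depends on the code tokenizer we train.
config = RobertaConfig(
    vocab_size=50265,
    max_position_embeddings=514,
    num_hidden_layers=12,
    num_attention_heads=12,
    hidden_size=768,
)

# Instantiating from a config (rather than from_pretrained) gives randomly initialized weights.
model = FlaxRobertaForMaskedLM(config, seed=0)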
4. Datasets
We can use datasets from the following sources (a loading sketch follows the list):
- GitHub - facebookresearch/TransCoder: Public release of the TransCoder research project https://arxiv.org/pdf/2006.03511.pdf
- GitHub - github/CodeSearchNet: Datasets, tools, and benchmarks for representation learning of code.
- If we can resolve the licensing issue, we can also share our own dataset via the Hugging Face datasets pipeline.
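For example, CodeSearchNet is already available through the 🤗 datasets hub. The dataset id and column name below are my assumptions based on the hub version and should be double-checked:

from datasets import load_dataset

# CodeSearchNet on the Hugging Face hub; one config per language ("python", "java", "go", ...).
dataset = load_dataset("code_search_net", "python")

# Each example contains the raw function source; "func_code_string" is the expected column name.
print(dataset["train"][0]["func_code_string"][:200])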
5. Training scripts
A masked language modeling script for Flax is available in the 🤗 Transformers examples (run_mlm_flax.py).
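The core of that script is standard BERT-style masking. Below is a minimal NumPy sketch of the idea (my own illustration, not the script's exact code): 15% of tokens are selected, of which 80% become the mask token, 10% become a random token, and 10% are left unchanged.

import numpy as np

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_probability=0.15, rng=None):
    """Return (masked_inputs, labels) for MLM; labels are -100 where no prediction is required."""
    rng = rng or np.random.default_rng(0)
    input_ids = np.array(input_ids)
    labels = input_ids.copy()

    # Choose which positions to predict (the real script also avoids masking special tokens).
    masked = rng.random(input_ids.shape) < mlm_probability
    labels[~masked] = -100  # ignored by the loss

    # 80% of the chosen positions -> mask token
    replace = masked & (rng.random(input_ids.shape) < 0.8)
    input_ids[replace] = mask_token_id

    # 10% of the chosen positions -> random token (half of the remaining 20%)
    random_tok = masked & ~replace & (rng.random(input_ids.shape) < 0.5)
    input_ids[random_tok] = rng.integers(0, vocab_size, size=int(random_tok.sum()))

    # The remaining 10% keep their original token.
    return input_ids, labels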
6. Challenges
- Data pre-processing
- Sample preparation for LM (see the chunking sketch after this list)
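For sample preparation, one common approach (a sketch based on my assumption of what the Flax MLM example does) is to tokenize whole files or functions, concatenate them, and cut the stream into fixed-length blocks:

def group_into_blocks(examples, block_size=512):
    """Concatenate pre-tokenized examples and split them into fixed-length blocks.

    Intended for datasets.Dataset.map(..., batched=True) after tokenization;
    examples["input_ids"] is a list of token-id lists.
    """
    concatenated = [tok for ids in examples["input_ids"] for tok in ids]
    total_length = (len(concatenated) // block_size) * block_size  # drop the ragged tail
    return {
        "input_ids": [
            concatenated[i : i + block_size]
            for i in range(0, total_length, block_size)
        ]
    }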
7. Desired project outcome
The desired project outcome is an MLM/T5 model that is able to generate programming-language code.
8. Reads
The most important read would be the following Colab notebook:
Apart from that, we need to check the tokenization procedure from the following (a tokenizer-training sketch is included after the list):
- GitHub - facebookresearch/TransCoder: Public release of the TransCoder research project https://arxiv.org/pdf/2006.03511.pdf
- GitHub - github/CodeSearchNet: Datasets, tools, and benchmarks for representation learning of code.
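As a starting point (my own sketch; the file paths and hyperparameters are placeholders), a RoBERTa-style byte-level BPE tokenizer can be trained directly on raw code files with the tokenizers library:

import os
from tokenizers import ByteLevelBPETokenizer

# Placeholder paths: plain-text files containing the scraped source code.
files = ["data/python_train.txt", "data/cpp_train.txt", "data/java_train.txt"]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=50265,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Saves vocab.json and merges.txt, which can be loaded by RobertaTokenizerFast.
os.makedirs("tokenizer", exist_ok=True)
tokenizer.save_model("tokenizer")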
UPDATE 1: Given the interest from many people, I should add that there are already some pre-trained models for code on the Hugging Face Hub which I was not aware of at the time of writing this proposal.
UPDATE 2: We can also try training a T5-like model on programming languages instead of an MLM-style model (a span-corruption sketch is shown below).
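For reference, T5-style pre-training uses span corruption rather than single-token masking: corrupted spans are replaced by sentinel tokens in the input, and the target reconstructs them. The sentinel names follow the T5 convention; the example itself is mine:

# Original code line:
#   def add(a, b): return a + b
#
# Span-corruption training pair (T5 style):
source = "def <extra_id_0>(a, b): <extra_id_1> a + b"
target = "<extra_id_0> add <extra_id_1> return <extra_id_2>"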