Pretrain GPT-Neo for Open Source GitHub Copilot Model

Open Source GitHub Copilot for auto generating code

I would like to train an open source version of the new awesome GitHub Copilot AI tool, which is based on GPT-3. Similar to what the awesome people behind GPT-Neo have done, having such an open source model would greatly help researchers understand what kinds of biases and limitations this type of code-autocompletion model might have, such as generating insecure code (I do research in this area and I know my team would love an open-sourced version to run experiments on, i.e. try and break it :nerd_face:)

2. Language

The model will be trained on different programming languages, such as C, C++, Java, Python, etc.

3. Model

GPT-Neo

4. Datasets

Datasets that (hopefully) contain high-quality source code.

Possible links to publicly available datasets include:

Some additional datasets that are not just method-level may need to be created.

5. Training scripts

I believe the standard causal language modeling (CLM) training script would do for this.

We can make use of run_clm_flax.py from the huggingface/transformers repository on GitHub.
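
If anyone wants to prototype locally before the Flax/TPU run, here is a minimal sketch of the same causal-LM setup in PyTorch. The checkpoint, dataset (code_search_net, which is method-level), and block size are just placeholder assumptions; the actual training would go through run_clm_flax.py.

    # Minimal causal-language-modeling sketch. Assumptions: GPT-Neo 1.3B checkpoint
    # (swap in a smaller GPT-Neo checkpoint for local testing), the code_search_net
    # Python split (method-level, used only for illustration), block_size=512.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer, TrainingArguments)

    checkpoint = "EleutherAI/gpt-neo-1.3B"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    tokenizer.pad_token = tokenizer.eos_token  # GPT-Neo's tokenizer has no pad token
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    raw = load_dataset("code_search_net", "python", split="train[:1%]")
    block_size = 512

    def tokenize(batch):
        return tokenizer(batch["whole_func_string"])

    def group_texts(examples):
        # Concatenate everything and cut it into fixed-size blocks, like run_clm does.
        concatenated = sum(examples["input_ids"], [])
        total = (len(concatenated) // block_size) * block_size
        return {"input_ids": [concatenated[i:i + block_size]
                              for i in range(0, total, block_size)]}

    tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
    lm_dataset = tokenized.map(group_texts, batched=True,
                               remove_columns=tokenized.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="gpt-neo-code",
                               per_device_train_batch_size=2,
                               num_train_epochs=1),
        train_dataset=lm_dataset,
        # mlm=False gives the causal LM objective; the collator also sets the labels.
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()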

6. (Optional) Challenges

The additional data may be a challenge. From what I can see of Copilot, it looks to be trained on entire files, not just code snippets. There are file-level datasets that exist, but they are a few years old and I don't think they cover many programming languages. The ones I listed above have multiple languages but are only method-level.

However, GitHub's API is pretty easy to use, so it would be fairly straightforward to create one from scratch, especially if we get some insight into how the Copilot dataset was generated :nerd_face:
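
To give an idea of what that could look like, here is a toy sketch using the GitHub REST API (search + contents endpoints). The language filter, the .py extension check, and the repo selection are placeholder assumptions, and a real crawl would need pagination, rate-limit handling, deduplication, and licence filtering.

    # Toy sketch: pull Python files from a few popular repos via the GitHub REST API.
    # Assumes a personal access token in the GITHUB_TOKEN environment variable.
    import base64
    import os
    import requests

    HEADERS = {
        "Accept": "application/vnd.github.v3+json",
        "Authorization": f"token {os.environ['GITHUB_TOKEN']}",
    }

    def top_repos(language="python", n=5):
        # Most-starred repositories for a given language.
        r = requests.get(
            "https://api.github.com/search/repositories",
            params={"q": f"language:{language}", "sort": "stars", "per_page": n},
            headers=HEADERS,
        )
        r.raise_for_status()
        return [item["full_name"] for item in r.json()["items"]]

    def repo_files(full_name, path=""):
        # Recursively walk the repository contents and yield decoded .py files.
        r = requests.get(f"https://api.github.com/repos/{full_name}/contents/{path}",
                         headers=HEADERS)
        r.raise_for_status()
        for entry in r.json():
            if entry["type"] == "dir":
                yield from repo_files(full_name, entry["path"])
            elif entry["name"].endswith(".py"):
                blob = requests.get(entry["url"], headers=HEADERS).json()
                yield entry["path"], base64.b64decode(blob["content"]).decode("utf-8", "replace")

    for repo in top_repos():
        for path, source in repo_files(repo):
            print(repo, path, len(source))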

7. (Optional) Desired project outcome

I'd love to have this open source model set up in a Visual Studio Code extension similar to the GitHub Copilot one. I've actually made a tutorial on doing this using the GPT-Neo model, so we could easily clean it up and release it free of charge forever, because from what I've seen on Twitter, GitHub Copilot might eventually be put behind a paywall :cry:.

8. (Optional) Reads

The following links can be useful to better understand the project and what has previously been done.

Very much interested to be part of this project. :raised_back_of_hand:

Really an interesting project

Count me in, Nathan.

Sounds cool! I am in.

I was actually planning to do the same, Nathan. Count me in. Let's discuss the plan then. Hope to hear from you soon.

I would love to be a part of this project!

It's so interesting, can I join this project?

@ncoop57 Count me in! Very interested in this project
I think we can also use the GitHub public repository dataset on Google Cloud Platform (BigQuery).
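
If we go that route, a rough sketch of querying it could look like the following. I'm assuming the public bigquery-public-data.github_repos dataset and using its smaller sample_* tables so the query stays cheap; the full tables are several terabytes.

    # Rough sketch: query the public GitHub dataset on BigQuery for Python files.
    # Assumes Google Cloud credentials are configured for the client.
    from google.cloud import bigquery

    client = bigquery.Client()

    query = """
    SELECT f.repo_name, f.path, c.content
    FROM `bigquery-public-data.github_repos.sample_files` AS f
    JOIN `bigquery-public-data.github_repos.sample_contents` AS c
      ON f.id = c.id
    WHERE f.path LIKE '%.py'
      AND c.binary = FALSE
    LIMIT 1000
    """

    for row in client.query(query).result():
        print(row.repo_name, row.path, len(row.content or ""))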

Also wanted to share how we trained huggingface/CodeBERTa-language-id (trained on code_search_net as well) – see the original model card: huggingface/CodeBERTa-small-v1 on the Hugging Face Hub.

That was 1.5 years ago so I think things have changed significantly since then!

In particular, I am super curious which tokenizer type is optimal for code (maybe char-based is a better option now, cc @patrickvonplaten).
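
For reference, a byte-level BPE like the one behind CodeBERTa can be trained on raw code files with the tokenizers library in a few lines; the file glob and vocab size below are placeholder choices.

    # Sketch: train a byte-level BPE tokenizer on raw code files with the
    # `tokenizers` library (file glob and vocab size are placeholder choices).
    from pathlib import Path
    from tokenizers import ByteLevelBPETokenizer

    files = [str(p) for p in Path("data/python").glob("**/*.py")]

    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=files,
        vocab_size=52_000,
        min_frequency=2,
        special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
    )

    Path("code-tokenizer").mkdir(exist_ok=True)
    tokenizer.save_model("code-tokenizer")  # writes vocab.json and merges.txt

    print(tokenizer.encode("def add(a, b):\n    return a + b").tokens)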

Hi Nathan, looks like a cool project! I would love to be a part of it! :slight_smile:

Would love to explore the GPT-Neo model more and also work on this amazing idea. I would love to be a part of it!

Super cool idea!

As Julien said, picking the right tokenizer will be important.

One could also consider fine-tuning GPT-Neo on code data instead of pre-training from scratch, as GPT-Neo is trained on the Pile dataset, which already contains GitHub data (almost 95 GB, as stated in the paper) as well as Stack Exchange data.
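
In code, the difference between the two options is mostly just how the weights are initialised (the checkpoint name below is one of the public GPT-Neo sizes, pick according to compute budget):

    # Sketch: fine-tune from the released GPT-Neo weights (option 1) versus
    # pre-training the same architecture from scratch (option 2).
    from transformers import AutoConfig, AutoModelForCausalLM

    checkpoint = "EleutherAI/gpt-neo-1.3B"

    # Option 1: start from the Pile-pretrained weights and fine-tune on code.
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # Option 2: same architecture, randomly initialised, pre-trained from scratch.
    config = AutoConfig.from_pretrained(checkpoint)
    scratch_model = AutoModelForCausalLM.from_config(config)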

Great to see so much interest! Let's officially define this project :slight_smile:

Added everyone who commented here to this sheet. Please leave a comment here or in the sheet if you want to change something.

There are already 10 members here; if more people join, we will need to split the team so that it is easier to manage. (cc @patrickvonplaten )

Omg I am sooooo happy to see so much excitement for this project :nerd_face::nerd_face:. We are gonna kill this y’all :sunglasses:.

I agree with Julien as well, the tokenizer will be important. Character or even byte level may be the way to go, but I worry we will run into memory issues if we have the model predicting large amounts of code, similar to Copilot. My research group tried regular old BPE but then added in the language's reserved keywords as special tokens to try and make it so that the BPE model didn't have too many superfluous tokens, but it's hard to say if that is optimal (rough sketch below).
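
For illustration, the keyword trick could look roughly like this on top of the GPT-Neo BPE tokenizer, using Python's own keyword list as a stand-in; other languages would need their own lists.

    # Sketch: add a language's reserved keywords to an existing BPE tokenizer so
    # they are never split into sub-word pieces (Python keywords as an example).
    import keyword
    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "EleutherAI/gpt-neo-1.3B"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)

    num_added = tokenizer.add_tokens(keyword.kwlist)  # 'def', 'return', 'lambda', ...
    print(f"added {num_added} keyword tokens")

    model = AutoModelForCausalLM.from_pretrained(checkpoint)
    model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix to match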

I love the idea of fine-tuning the model and using the Stack Exchange data, especially since a big part of Copilot is how you can prompt it with comments to generate your code. So, having all sorts of data with some mix of natural language and code would be best. We will need to define some cleaning criteria as well; maybe we could run a static analyzer to check for certain known vulnerabilities or insecurities. GitHub has their code-scanning tool that does this, and I know a few research tools as well that we could look at.
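
As a toy version of that kind of cleaning pass, something like the following could work as a first filter, with a couple of crude regex patterns standing in for a real analyzer (GitHub's code scanning / CodeQL, bandit, etc.); the dataset and column names are just placeholders.

    # Toy cleaning pass: drop samples that trip a couple of crude "insecure code"
    # patterns. A real pipeline would use a proper static analyzer instead of regexes.
    import re
    from datasets import load_dataset

    SUSPICIOUS = [
        re.compile(r"\beval\s*\("),            # arbitrary code execution
        re.compile(r"\bpickle\.loads\s*\("),   # unsafe deserialisation
        re.compile(r"password\s*=\s*['\"]"),   # hard-coded credentials
    ]

    def looks_clean(example):
        code = example["whole_func_string"]
        return not any(pattern.search(code) for pattern in SUSPICIOUS)

    dataset = load_dataset("code_search_net", "python", split="train[:1%]")
    cleaned = dataset.filter(looks_clean)
    print(len(dataset), "->", len(cleaned))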

There are also a few people who were interested on Twitter who haven't commented here. I'll message them to also post here.

  • Wrt tokenization of code, it may be useful to refer to section 4.3 of the TransCoder paper, which is on unsupervised translation of programming languages. In that work, javalang was used for Java, clang for C++, and the tokenize library for Python; tokenize is robust to differently formatted versions of the same function (see the small demo after this list).

    Then the BPE codes are learnt on the tokenized code files using fastBPE.
  • For data that includes both useful comments and code, we could look at code snippets on GeeksforGeeks and code samples such as those for TF and PyTorch available on the official websites.
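
A quick way to see that robustness with the standard library alone (the two function strings below are made-up examples):

    # Demo: Python's built-in tokenize module yields the same token stream for two
    # differently formatted versions of the same function.
    import io
    import tokenize

    v1 = "def add(a, b):\n    return a + b\n"
    v2 = "def add( a,b ) :\n\treturn a  +  b\n"

    def tokens(source):
        # Keep only the lexemes; drop layout-only tokens.
        skip = (tokenize.NL, tokenize.NEWLINE, tokenize.INDENT,
                tokenize.DEDENT, tokenize.ENDMARKER)
        return [tok.string
                for tok in tokenize.generate_tokens(io.StringIO(source).readline)
                if tok.type not in skip]

    assert tokens(v1) == tokens(v2)
    print(tokens(v1))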

Just my 2 cents

What about finetuning GPT-J?

Wow, such a nice idea :star_struck: Not sure if I still can join but would love to!

Count me in!

I would love to give a hand in this, open source ftw
