Pretrain GPT-Neo for Open Source GitHub Copilot Model

Let’s maybe make two teams here no? This way I can give you two TPU VMs tomorrow :slight_smile: Splitting the team in half now and adding all new members. Feel free to work together as much as possible, but should be a bit easier to organize when there are two teams I think.

Actually creating 3 projects for this - this is awesome! I’ll give you access to three TPU VMs tomorrow :slight_smile: Hope that’s fine. Different teams doesn’t mean that you can’t work together - it’s just so that you have an easier time splitting the work and have access to more TPUs. I’ve split the members somewhat randomly across the groups. Feel free to share the three TPU names internally with each other tomorrow and re-organize the three teams if it suits you better :slight_smile:

The more TPUs the merrier in my book :nerd_face:

Hi all of y’all, we are missing quite a few people from here in our discord channel. If you are still interested in helping out please say hello and get caught up in our discord channel. You can use this link here Flax-HuggingFace-Community-Week and just jump to the channel “copilot-code-synthesis”

If you aren’t in the slack channel or in one of the three teams in this spreadsheet Confirmed teams for Flax/JAX community week - Google Sheets please let me and @patrickvonplaten know, and I am sure he will be able to resolve the issue.

Hi @ncoop57 @patrickvonplaten, do we still have space left? I would like to join the effort. Thanks

I’m very interested in helping with things like safety, correctness of generated code, training efficiency, and so on by way of neurosymbolic techniques down the road, if that’s an interest. Also UX informed by years of expertise and experience in programming languages tool development, and evaluation methods informed by the same expertise and experience.

I’d be too nervous to contribute to anything closed, but extremely happy to contribute to something open source and transparent. Please let me know if my input and help is welcome here. I feel very strongly about getting this all right at some point.

I was thinking about it. Count me in! Would the new GPT J 6B make sense?

Awesome! That is one of the top models in the running. We are continuing to discuss everything on discord so hop on over there, we have a specific channel dedicated to all things models if you’d like to work on that task!

I am very interested in doing this also. Starting out with the GPTJ 6B would make sense, but we should also note that the corpus it was trained with (the Pile) includes GitHub data, so getting the licenses of each piece of code that GPTJ was trained on would be a big step in avoiding the issues that GitHub Copilot has.

The best way to contribute currently, since we have so many members right now, would be to just contribute to our repository and discuss things in our issues. Here is our repo if you are interested: GitHub - ncoop57/gpt-code-clippy: Full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57

I have personally been experimenting with GPT-Neo on text generation and would be interested in contributing to this project too.

@patrickvonplaten Is it still possible to join the team/get TPU access for this project?

Would also love to contribute! Have been working on the reverse, Code-to-NL, for about 2 years and assembled datasets by crawling GitHub for different languages as well. Also gathered a good bit of experience on code and language processing and some model modifications that gave us significant boosts. Still possible to join the party? :smile:

Hi, just want to know how this project is going now?