Auto-generation of Commit Messages | Project Proposal

Description

Pretrain a language model that automatically generates commit messages.
Version control systems are used in the development of many projects, so the tool could be relevant to a wide range of developers.

Model

Here we need either a Text2Text (sequence-to-sequence) model or a text generation model. It would be better to discuss the choice all together.

Dataset

I am not sure there are existing datasets that fit our needs, but we can get everything from the GitHub API.

How to collect data

Get all branches from the repository:

https://api.github.com/repos/huggingface/transformers/branches?per_page=100&page=2

Result (master as example):

[...
   {
       "name": "master",
       "commit": {
         "sha": "3ff2cde5ca4a2d3c622b827d9edf7e3d0b7f4fb7",
         "url":    "https://api.github.com/repos/huggingface/transformers/commits/3ff2cde5ca4a2d3c622b827d9edf7e3d0b7f4fb7"
       },
       "protected": true
   },
...
]
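
A minimal sketch of the branch-listing step, assuming the endpoint returns an empty array once `page` runs past the last page. The names `fetch_json` and `iter_branches` are hypothetical helpers, not part of any library, and the requests are unauthenticated, so GitHub's rate limit applies.

```python
import json
import urllib.request
from typing import Callable, Iterator

API_ROOT = "https://api.github.com"


def fetch_json(url: str):
    """Fetch a URL and decode its JSON body (hypothetical helper, no auth)."""
    request = urllib.request.Request(
        url, headers={"Accept": "application/vnd.github+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iter_branches(
    owner: str,
    repo: str,
    fetch: Callable = fetch_json,
    per_page: int = 100,
) -> Iterator[dict]:
    """Yield every branch object, page by page, until an empty page comes back."""
    page = 1
    while True:
        url = f"{API_ROOT}/repos/{owner}/{repo}/branches?per_page={per_page}&page={page}"
        batch = fetch(url)
        if not batch:  # assumption: an empty array marks the end of the listing
            break
        yield from batch
        page += 1
```

For example, `[b["commit"]["sha"] for b in iter_branches("huggingface", "transformers")]` would give the head commit of each branch. The `fetch` parameter is injectable so the pagination logic can be tried without network access.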

Get all commits from the branch:

  • page=1 and sha=3ff2cde5ca4a2d3c622b827d9edf7e3d0b7f4fb7 (the head of master from the result above) as an example
https://api.github.com/repos/huggingface/transformers/commits?per_page=100&sha=3ff2cde5ca4a2d3c622b827d9edf7e3d0b7f4fb7&page=1

Result:
An array of commits. The JSON of each commit is huge, so I will not paste it here. From each commit we only need its url, for example:

https://api.github.com/repos/huggingface/transformers/commits/24cbf6bc5a0b6a9bb5afdda6bb1a329ac980fa4b
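
A rough sketch of collecting commit URLs for several branches, assuming each item in the commit list carries `sha` and `url` fields. Since branches share most of their history, the sketch skips commits it has already seen. `fetch_json` and `collect_commit_urls` are hypothetical names introduced only for illustration.

```python
import json
import urllib.request
from typing import Callable, Iterable, List


def fetch_json(url: str):
    """Fetch a URL and decode its JSON body (hypothetical helper, no auth)."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def collect_commit_urls(
    owner: str,
    repo: str,
    branch_shas: Iterable[str],
    fetch: Callable = fetch_json,
    per_page: int = 100,
) -> List[str]:
    """Walk the commit list of every branch head and return unique commit URLs."""
    seen = set()
    urls = []
    for head_sha in branch_shas:
        page = 1
        while True:
            batch = fetch(
                f"https://api.github.com/repos/{owner}/{repo}/commits"
                f"?per_page={per_page}&sha={head_sha}&page={page}"
            )
            if not batch:  # assumption: an empty page means no more commits
                break
            for commit in batch:
                if commit["sha"] not in seen:  # same commit reachable from another branch
                    seen.add(commit["sha"])
                    urls.append(commit["url"])
            page += 1
    return urls
```

This simple version still requests the full history of every branch and only drops duplicates afterwards. Stopping early once a page contains only known commits might save requests, but that is an untested idea.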

Get commit message and patches:

  • Message - $.commit.message
  • Patches - array of changed files $.files, where each file has $.patch

Patch example:

@@ -193,7 +193,7 @@ It is recommended to pre-train Wav2Vec2 with Trainer + Deepspeed (please refer t\n Here is an example of how you can use DeepSpeed ZeRO-2 to pretrain a small Wav2Vec2 model:\n \n ```\n-PYTHONPATH=../../../src deepspeed --num_gpus 2 run_pretrain.py \\\n+PYTHONPATH=../../../src deepspeed --num_gpus 4 run_pretrain.py \\\n --output_dir=\"./wav2vec2-base-libri-100h\" \\\n --num_train_epochs=\"3\" \\\n --per_device_train_batch_size=\"32\" \\
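
A small sketch of turning one commit JSON into a training pair, following the `$.commit.message` and `$.files[*].patch` paths above. It assumes some files may come without a `patch` field (binary files, for instance) and skips them. The function name `to_training_example`, the output keys, and the `--- filename` header line are illustrative choices, not an agreed format.

```python
from typing import Dict, Optional


def to_training_example(commit: dict) -> Optional[Dict[str, str]]:
    """Build a {"diff", "message"} pair from one commit JSON, or None if unusable."""
    message = commit.get("commit", {}).get("message", "").strip()
    patches = [
        f"--- {changed['filename']}\n{changed['patch']}"
        for changed in commit.get("files", [])
        if "patch" in changed  # assumption: files without a patch are skipped
    ]
    if not message or not patches:
        return None
    return {"diff": "\n".join(patches), "message": message}


# Tiny hand-made commit, shaped like the API response but not real data.
sample = {
    "commit": {"message": "Use 4 GPUs in the example\n"},
    "files": [
        {"filename": "README.md", "patch": "@@ -1 +1 @@\n---num_gpus 2\n+--num_gpus 4"},
        {"filename": "logo.png"},
    ],
}
example = to_training_example(sample)
```

Here the diff would be the model input and the message the target. Whether to truncate long diffs or filter merge commits is left open for discussion.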

Training scripts

I think we will write our own script for this project, though we can also use any existing example script as a reference.

Expected result

A ready-to-use model that we can integrate with source code editors and browser extensions.

About

I am really interested in this project. I hope to find like-minded people and create a cool project :boom:
Even if you don’t want to participate - like and reply to this topic so that more people will see it! :heart:


Hello, Aleksey. This sounds interesting.
Something that might be useful to you is CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model.
I might be able to help you very passively. They have a codebase alongside the dataset over here.
Let me know.
Best wishes!
PS: Look at this community, Code.AI; you are more likely to find collaborators over there.

Thank you, @reshinthadith!
I guess our team has no limit and any help will be useful :hugs:

Hey @AlekseyKorshuk! I was part of the Git-T5 team during the JAX/Flax community event. We have some pre-trained models and datasets for fine-tuning. Our code is released under this repository.

By the way, I’m still working on this project. Recently we received OpenAI Codex API access, and we are currently working on a VSCode extension to make a copilot for commits. If you’re interested, please let me know! :slight_smile:

Hello @mozharovsky! I am interested in this :hugs: