Commit message generation is known to be a difficult task for the following reasons:
- No large, clean dataset is available for pre-training and fine-tuning
- (Almost) no pre-trained seq2seq models are available that were trained on both programming-language and natural-language corpora
- Context limitations (sequence length, external knowledge, etc.)
Recent research has shown that pre-trained transformer models (like T5 or BART) can capture code understanding and can be fine-tuned for various seq2seq programming downstream tasks (e.g., code summarization, code documentation generation, commit message generation).
Nevertheless, each of the listed downstream tasks deserves more thorough research. I’ve chosen commit message generation because it has many practical applications, such as automated version control systems.
To wrap up, these are the main goals of this research project:
- Release a publicly available dataset for programming language models pre-training
- Release a publicly available dataset for commit message generation fine-tuning
- Release a publicly available (pre)-trained T5 model for commit message generation
- Release all pre-processing, post-processing, and training scripts for further research
We’ll use English (natural language) and Python (programming language) for training.
This research is limited in time, so it would be great to show a strong baseline for at least one pair of languages. Later we’ll be able to extend the model for more language pairs.
We’ll be using a randomly initialized T5 model with a configuration similar to t5-base.
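As a rough sketch of what "similar to t5-base in configuration" could mean in practice, the snippet below writes a `config.json` (no weights) whose hyperparameters mirror the published t5-base configuration. The helper name `write_config` and the output directory are hypothetical, and the vocabulary size is an assumption that would likely change once a tokenizer is trained on code and English.

```python
import json
import tempfile
from pathlib import Path

# Hyperparameters mirroring the published t5-base configuration.
T5_BASE_LIKE_CONFIG = {
    "model_type": "t5",
    "vocab_size": 32128,  # assumption: depends on the tokenizer we end up training
    "d_model": 768,
    "d_kv": 64,
    "d_ff": 3072,
    "num_layers": 12,
    "num_heads": 12,
    "relative_attention_num_buckets": 32,
    "dropout_rate": 0.1,
    "layer_norm_epsilon": 1e-6,
    "feed_forward_proj": "relu",
    "is_encoder_decoder": True,
    "pad_token_id": 0,
    "eos_token_id": 1,
    "decoder_start_token_id": 0,
}


def write_config(output_dir):
    """Write config.json (configuration only, no weights) and return its path."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "config.json"
    path.write_text(json.dumps(T5_BASE_LIKE_CONFIG, indent=2))
    return path


if __name__ == "__main__":
    # Hypothetical directory name, created inside a temporary folder.
    with tempfile.TemporaryDirectory() as tmp:
        print(write_config(Path(tmp) / "t5-base-commit").read_text())
```

A model built from such a configuration alone starts from random weights, which is the intended starting point for pre-training here.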
Possible links to publicly available datasets include:
Please note that the BigQuery GitHub dump needs to be cleaned up before being used for pre-training/fine-tuning. We should also consider using only those repos whose license allows us to do so.
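The cleanup could start with something like the filter below. It is only a sketch: the record fields (`license`, `path`, `message`) are hypothetical and do not reflect the actual BigQuery schema, and both the license allowlist and the noise patterns are assumptions that would need review before real use.

```python
import re

# Assumption: an illustrative allowlist of permissive license identifiers.
ALLOWED_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"}

# Assumption: subject lines that say little about the actual change.
NOISE_PATTERNS = [
    re.compile(r"^merge (branch|pull request|remote-tracking)", re.IGNORECASE),
    re.compile(r"^(wip|fix|update|initial commit)\.?$", re.IGNORECASE),
]


def clean_message(message):
    """Keep only the subject line, with whitespace collapsed."""
    lines = message.strip().splitlines()
    subject = lines[0] if lines else ""
    return " ".join(subject.split())


def keep_commit(record, min_words=3):
    """Return True if a (hypothetical) commit record passes all filters."""
    if record.get("license", "").lower() not in ALLOWED_LICENSES:
        return False  # license does not clearly allow reuse
    if not record.get("path", "").endswith(".py"):
        return False  # only Python files are in scope
    subject = clean_message(record.get("message", ""))
    if len(subject.split()) < min_words:
        return False  # too short to be a useful target
    return not any(p.search(subject) for p in NOISE_PATTERNS)


if __name__ == "__main__":
    records = [
        {"license": "mit", "path": "a.py",
         "message": "Fix off-by-one error in tokenizer\n\nLonger body."},
        {"license": "gpl-3.0", "path": "b.py",
         "message": "Add retry logic to downloader"},
        {"license": "MIT", "path": "c.py",
         "message": "Merge branch 'dev' into main"},
    ]
    for record in records:
        print(record["path"], keep_commit(record))
```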
We can make use of run_summarization_flax.py to train the model.
The main challenges are preparing the dataset and setting up the training tasks in Flax. Further challenges include dealing with short context sizes and transferring knowledge across domains. I suppose we could try sparse attention blocks and techniques like RAG (retrieval-augmented generation) to bring in domain knowledge.
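To make the context-size problem concrete, here is a minimal sketch of how one changed file could be turned into a model input under a fixed budget. It assumes the input is a unified diff of the old and new file contents, and it counts whitespace-separated words as a crude stand-in for real tokenizer tokens. The function name `build_source` and its parameters are hypothetical.

```python
import difflib


def build_source(old_code, new_code, path, max_tokens=512, context_lines=1):
    """Build a unified diff for one file, cut off at a rough token budget."""
    diff = difflib.unified_diff(
        old_code.splitlines(),
        new_code.splitlines(),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        n=context_lines,  # fewer context lines leave more room for changes
        lineterm="",
    )
    kept, used = [], 0
    for line in diff:
        cost = len(line.split())  # assumption: word count approximates tokens
        if used + cost > max_tokens:
            break  # drop everything past the budget
        kept.append(line)
        used += cost
    return "\n".join(kept)


if __name__ == "__main__":
    old = "def add(a, b):\n    return a - b\n"
    new = "def add(a, b):\n    return a + b\n"
    print(build_source(old, new, "math_utils.py"))
```

Plain truncation like this silently discards later hunks, which is one reason sparse attention or retrieval may be worth exploring.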
The desired project outcome is to achieve the listed goals and have fun! A demo would be a simple application (e.g., a VSCode extension) that generates commit messages given the changed Python files.
The following links can be useful to better understand the project and what has previously been done in the research community.