Train T5 model for commit message generation

mozharovsky · June 24, 2021, 8:42am

T5 for commit message generation

Commit message generation is known to be a difficult task for the following reasons:

No available clean and large dataset for pre-training and fine-tuning
(Almost) No available seq2seq models pre-trained on both programming-language and natural-language corpora
Context limitations (sequence length, external knowledge, etc.)

Recent research [1] [2] [3] has shown that pre-trained transformer models (like T5 or BART) can learn the code understanding domain and be fine-tuned for various seq2seq programming downstream tasks (e.g. code summarization, code documentation generation, commit message generation, etc.).

Nevertheless, each of the listed downstream tasks deserves more thorough research. I’ve chosen commit message generation because it has many practical applications, such as automated version control systems.

To wrap up, these are the main goals of this research project:

Release a publicly available dataset for programming language models pre-training
Release a publicly available dataset for commit message generation fine-tuning
Release a publicly available (pre)-trained T5 model for commit message generation
Release all pre-processing, post-processing, and training scripts for further research

2. Language

We’ll use English and Python as the training languages.

Time for this research is limited, so it would be great to establish a strong baseline for at least one pair of languages. Later we’ll be able to extend the model to more language pairs.

3. Model

We’ll be using a randomly initialized T5 model with a configuration similar to t5-base.

4. Datasets

Possible links to publicly available datasets include:

Please note that the BigQuery GitHub dump needs to be cleaned up before being used for pre-training/fine-tuning. We should also consider using only those repos whose license allows us to do so.
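The clean-up and license filtering mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: the record fields (`license`, `message`, `diff`) and the set of accepted licenses are hypothetical and do not reflect the actual schema of the BigQuery GitHub dump or a decision made by the project.

```python
# Hypothetical set of licenses treated as usable; the real list is a project decision.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"}


def is_usable(record: dict) -> bool:
    """Keep commits from permissively licensed repos that have a meaningful message."""
    license_id = (record.get("license") or "").lower()
    message = (record.get("message") or "").strip()
    if license_id not in PERMISSIVE_LICENSES:
        return False
    # Drop empty messages and merge commits, which carry little signal.
    if not message or message.lower().startswith("merge "):
        return False
    return True


def clean_commits(records):
    """Filter records and drop exact (diff, message) duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for record in records:
        if not is_usable(record):
            continue
        message = record["message"].strip()
        key = (record.get("diff", ""), message)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**record, "message": message})
    return cleaned
```

A real pipeline would likely add more filters (bot commits, very long diffs, non-English messages), but the overall shape of filter first, then deduplicate, would stay the same.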

5. Training scripts

We can make use of run_summarization_flax.py to train the model.
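To feed commit data to a summarization-style script, each example has to be a source text (the diff) paired with a target text (the commit message). The sketch below builds such pairs with the standard library and serializes them as JSON lines. The field names `diff` and `message` and the helper names are assumptions made for illustration; as far as I know the Flax summarization example lets you pick the columns via `--text_column` and `--summary_column`, but please check the version of the script you use.

```python
import difflib
import json


def make_example(old_source: str, new_source: str, message: str, path: str = "file.py") -> dict:
    """Build one (diff, message) pair from the old and new contents of a file."""
    diff = "".join(
        difflib.unified_diff(
            old_source.splitlines(keepends=True),
            new_source.splitlines(keepends=True),
            fromfile=f"a/{path}",
            tofile=f"b/{path}",
        )
    )
    # "diff" is the model input, "message" is the generation target.
    return {"diff": diff, "message": message}


def to_json_lines(examples) -> str:
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(example) for example in examples)


if __name__ == "__main__":
    example = make_example("x = 1\n", "x = 2\n", "Bump x to 2")
    print(to_json_lines([example]))
```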

6. (Optional) Challenges

The main challenges are preparing the dataset and setting up the training tasks in Flax. Further challenges include dealing with short context sizes and transferring knowledge across domains. I suppose we could try sparse attention blocks and techniques like RAG to bring in domain knowledge.
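One simple baseline for the short-context problem, before trying sparse attention or retrieval, is to truncate the diff so that changed lines are kept first. The sketch below is only an illustration of that idea, not something the project has settled on: it counts whitespace-separated tokens as a stand-in for the real tokenizer, and the function names are hypothetical.

```python
def is_change(line: str) -> bool:
    """True for added/removed lines; file headers ("+++", "---") do not count."""
    return line[:1] in {"+", "-"} and not line.startswith(("+++", "---"))


def truncate_diff(diff: str, max_tokens: int) -> str:
    """Greedily keep lines that fit the budget, preferring changed lines.

    Token cost is approximated by whitespace splitting (an assumption; a real
    pipeline would count tokens with the model's tokenizer).
    """
    lines = diff.splitlines()
    keep = set()
    budget = max_tokens
    # Pass 1 takes changed lines, pass 2 takes everything else (headers, context).
    for want_change in (True, False):
        for index, line in enumerate(lines):
            if is_change(line) != want_change:
                continue
            cost = max(1, len(line.split()))
            # Lines that do not fit are skipped; later, cheaper lines may still fit.
            if cost <= budget:
                keep.add(index)
                budget -= cost
    # Emit the kept lines in their original order.
    return "\n".join(lines[i] for i in sorted(keep))
```

Because the selection is greedy, a long changed line can be dropped while a shorter one after it is kept; whether that trade-off is acceptable would need to be checked on real data.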

7. (Optional) Desired project outcome

The desired project outcome is to achieve the listed goals and have fun! A demo would be a simple application (e.g., a VSCode extension) that generates commit messages given the changed Python files.

8. (Optional) Reads

The following links can be useful to better understand the project and what has previously been done in the research community.

bhavnicksm · June 24, 2021, 11:45am

This idea is definitely quite interesting.
I would love to join in and be a part of this.

I have worked on seq2seq tasks like machine translation for low-resource languages and character transliteration for code-mixed conversations, and I have built and trained Transformer models in both PyTorch and TensorFlow. I have been meaning to find another seq2seq task to work on, and this is just ideal.

mozharovsky · June 24, 2021, 2:44pm

Hey, thanks for being interested in the project! I’ll be super happy to see you on board!

gagan3012 · June 24, 2021, 4:16pm

This seems very interesting. Can I please join this team?

mozharovsky · June 24, 2021, 6:43pm

Hey! Sure, welcome on board!

mozharovsky · June 24, 2021, 6:46pm

Does anybody know TypeScript btw?

Looking right now at VSCode Extensions API – it’s awesome! I’m sure we’ll be able to integrate the model into the commit message generation pipeline within an extension.

RastSD · June 25, 2021, 8:44am

I’m very interested and would like to join as well.

agemagician · June 25, 2021, 11:41am

There are already T5 models that were trained for this task:

This repo should help you to extend the pretrained CodeTrans models for more tasks or languages:

Good luck.

patrickvonplaten · June 25, 2021, 5:25pm

That’s a great project description - excited about the results!

bhavnicksm · June 25, 2021, 6:07pm

Hey @agemagician!

Thanks for pointing out great resources for this task.

I noticed that the CodeTrans paper, as well as the Git Commit Message Generator model available on the HuggingFace Hub, only covers Java for this task.

Since Java and Python are fundamentally different languages with different syntactic structure and semantics, I think fine-tuning that model may not yield the best results. Still, I would not mind trying it and comparing the results, if we have time.

agemagician · June 25, 2021, 8:02pm

CodeTrans actually has models that were trained on 9 programming languages plus English using self-supervised learning.

You can easily fine-tune them on any task and on almost all popular programming languages.

For example:

mozharovsky · June 25, 2021, 9:44pm

Thanks, @patrickvonplaten! Shall we consider forming two teams under this project? We have both pre-training and fine-tuning research experiments that can be parallelized across the teams.

mozharovsky · June 25, 2021, 10:19pm

Hey @agemagician,

CodeTrans is a great research project! Our research builds heavily on their results. Still, there is room to improve on their ideas and make pre-trained models stronger domain learners.

Besides, we’re missing publicly available high-quality data and benchmarks for us and other researchers to improve upon existing ideas and bring new ones.

We believe that the HuggingFace ecosystem is a perfect place to make research open for all, and hopefully, our research can reinforce these values further.

Regards,
Eugene

patrickvonplaten · June 29, 2021, 2:17pm

Very exciting project - putting it down! One team is good here. Since you are more than 5, we might have 2 TPUs for you!

mozharovsky · June 29, 2021, 4:08pm

Amazing news, thank you! Just a quick question – can we add more participants since we have a few more slots available?

bharat-raghunathan · June 29, 2021, 5:51pm

Hi, I would like to join this team if spots are still available! I have some experience in software engineering and GitHub APIs that may help.

mozharovsky · June 30, 2021, 8:32am

Hey, join us on our Discord channel!

patrickvonplaten · June 30, 2021, 12:35pm

Yes! This is fine

patrickvonplaten · July 1, 2021, 10:01am

added you

mozharovsky · July 2, 2021, 11:03am

Hey @jxuhf, welcome on board! Please join our Discord channel (#t5-commit-message).
