Train T5 model for commit message generation

1. T5 for commit message generation

Commit message generation is known to be a difficult task for the following reasons:

  1. No clean, large dataset is available for pre-training and fine-tuning
  2. (Almost) no pre-trained seq2seq models are available that cover both programming and natural language corpora
  3. Context limitations (sequence length, external knowledge, etc.)

Recent research [1] [2] [3] has shown that pre-trained transformer models (like T5 or BART) can capture the code domain and be fine-tuned for various seq2seq downstream programming tasks (e.g., code summarization, code documentation generation, commit message generation, etc.).

Nevertheless, each of the listed downstream tasks deserves more thorough research. I’ve chosen commit message generation because it has many practical applications, such as automatically suggesting commit messages in version control systems.

To wrap up, these are the main goals of this research project:

  1. Release a publicly available dataset for pre-training programming language models
  2. Release a publicly available dataset for fine-tuning on commit message generation
  3. Release a publicly available (pre-)trained T5 model for commit message generation
  4. Release all pre-processing, post-processing, and training scripts for further research

2. Language

We’ll use English and Python for training.

This research is limited in time, so it would be great to show a strong baseline for at least one language pair. Later, we’ll be able to extend the model to more language pairs.

3. Model

We’ll be using a randomly initialized T5 model with a configuration similar to t5-base.
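As a minimal sketch of what "randomly initialized, t5-base-sized" means in code (assuming `transformers` with Flax support is installed; reusing the t5-base tokenizer here is only an illustration, since the project may train its own vocabulary on code):

```python
from transformers import AutoTokenizer, FlaxT5ForConditionalGeneration, T5Config

# Take the t5-base configuration, but initialize the weights randomly
# instead of loading a pre-trained checkpoint.
config = T5Config.from_pretrained("t5-base")
model = FlaxT5ForConditionalGeneration(config, seed=42)

# For illustration we reuse the t5-base SentencePiece vocabulary; a real run
# might instead train a tokenizer on a mixed code + English corpus.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
print(config.d_model)  # 768 for a t5-base-sized model
```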

4. Datasets

Possible links to publicly available datasets include:

Please note that the BigQuery GitHub dump needs to be cleaned up before being used for pre-training/fine-tuning. We should also consider using only those repos whose license allows us to do so.
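For example, a first filtering pass could keep only commits from permissively licensed repositories. A rough sketch, assuming the public `bigquery-public-data.github_repos` dataset (table/column names and the license whitelist are assumptions to double-check before a real run):

```python
from google.cloud import bigquery

# Sketch: join commit messages with repository licenses and keep only
# permissively licensed repos. Identifiers below should be verified.
client = bigquery.Client()
query = """
    SELECT c.repo_name, c.subject, c.message
    FROM `bigquery-public-data.github_repos.sample_commits` AS c
    JOIN `bigquery-public-data.github_repos.licenses` AS l
      ON c.repo_name = l.repo_name
    WHERE l.license IN ('mit', 'apache-2.0', 'bsd-3-clause')
"""
for row in client.query(query).result():
    print(row.repo_name, row.subject)
```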

5. Training scripts

We can make use of run_summarization_flax.py to train the model.
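As a rough illustration of the expected input format (the column names are an assumption and have to match whatever text/summary columns the script is pointed at), a commit-message dataset can be serialized as JSON lines, with the diff playing the role of the document and the commit message playing the role of the summary:

```python
from datasets import Dataset

# Toy example: one (diff, commit message) pair in a text/summary layout.
records = {
    "text": [
        "@@ -1,2 +1,2 @@\n def add(a, b):\n-    return a - b\n+    return a + b",
    ],
    "summary": [
        "Fix add() to return the sum instead of the difference",
    ],
}
Dataset.from_dict(records).to_json("commit_msg_train.json")
```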

6. (Optional) Challenges

The main challenges are preparing the dataset and setting up the training tasks in Flax. Further challenges include the limited context size and transferring knowledge across domains. I suppose we could try sparse attention blocks and techniques like RAG to bring in external domain knowledge.
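To make the context limitation concrete, here is a minimal sketch (assuming the usual 512-token source length for T5): everything in a long diff beyond the limit is simply truncated away, which is exactly the information loss we would like to mitigate.

```python
from transformers import AutoTokenizer

# A long diff gets cut off at max_length; only the first 512 tokens survive.
tokenizer = AutoTokenizer.from_pretrained("t5-base")
long_diff = "\n".join(f"+    line_{i} = {i}" for i in range(2000))
encoded = tokenizer(long_diff, max_length=512, truncation=True)
print(len(encoded["input_ids"]))  # 512
```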

7. (Optional) Desired project outcome

The desired project outcome is to achieve the listed goals and have fun! A demo would be a simple application (e.g., a VSCode extension) that generates commit messages given the changed Python files.

8. (Optional) Reads

The following links can be useful to better understand the project and what has previously been done in the research community.

10 Likes

This idea is definitely quite :sparkles: interesting :sparkles:
I would love to join in and be a part of this :hugs:

I have worked on seq2seq tasks like machine translation for low-resource languages and character transliteration for code-mixed conversations, and I have built and trained Transformer models in both PyTorch and TensorFlow before. I have been meaning to find another seq2seq task to work on, and this is just :sparkles: ideal :sparkles:.

1 Like

Hey, thanks for being interested in the project! I’ll be super happy to see you on board! :hugs:

1 Like

This seems very interesting. Can I please join this team?

3 Likes

Hey! Sure, welcome on board! :hugs:

Does anybody know TypeScript btw? :grinning_face_with_smiling_eyes:

Looking right now at the VSCode Extensions API – it’s awesome! I’m sure we’ll be able to integrate the model into the commit message generation pipeline within an extension.

I’m very interested and would like to join as well :blush:

1 Like

There are already T5 models that were trained for this task:

This repo should help you extend the pre-trained CodeTrans models to more tasks or languages:

Good luck.

2 Likes

That’s a great project description - excited about the results!

2 Likes

Hey @agemagician! :hugs:

Thanks for pointing out great resources for this task. :heart:

I noticed that the CodeTrans paper, as well as the Git Commit Message Generator model available on the HuggingFace Hub, is only trained on Java.

I think that since Java and Python are fundamentally different languages with different syntactic structures and semantics, fine-tuning that model may not yield the best results. Though I would not mind trying it at all and comparing the results, if we have time.

1 Like

CodeTrans actually has models which were trained on 9 programming languages + the English language using self-supervised learning.

You can easily fine-tune them on any task and for almost all popular programming languages.

For example:

3 Likes

Thanks, @patrickvonplaten! Shall we consider forming two teams under this project? We have both pre-training and fine-tuning research experiments that can be parallelized across the teams.

1 Like

Hey @agemagician,

CodeTrans is a great research project! Our research builds heavily on their results. Still, there is room to improve on their ideas and make pre-trained models stronger domain learners.

Besides, we’re still missing publicly available, high-quality data and benchmarks that we and other researchers can use to improve upon existing ideas and bring in new ones.

We believe that the HuggingFace ecosystem is a perfect place to make research open for all, and hopefully, our research can reinforce these values further. :hugs:

Regards,
Eugene

2 Likes

Very exciting project - putting it down :slight_smile: One team is good here! Since you are more than 5, we might have 2 TPUs for you!

1 Like

Amazing news, thank you! Just a quick question – can we add more participants since we have a few more slots available?

1 Like

Hi, I would like to join this team if spots are still available! I have some experience in software engineering and the GitHub APIs that may help.

2 Likes

Hey, join us on our Discord channel! :wave:

1 Like

Yes! This is fine :slight_smile:

Added you :slight_smile:

2 Likes

Hey @jxuhf, welcome on board! Please, join our Discord channel (#t5-commit-message).