Please read the topic category description to understand what this is all about
Description
Applications like GitHub Copilot can automatically generate docstrings from a class or function name. The goal of this project is to fine-tune a Transformer like CodeT5 to do this ourselves!
Model(s)
Generating docstrings from source code can be modelled as a sequence-to-sequence task, so T5-style models such as CodeT5 are a good starting point.
Datasets
A good dataset for this task is code_search_net, but feel free to find alternative datasets if you can’t find your favourite programming language there.
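To make the sequence-to-sequence framing concrete, here is a minimal sketch of how one Python record could be turned into a (source, target) training pair: the function body without its docstring becomes the input, and the docstring becomes the target. The field name `whole_func_string` and the helper names are assumptions for illustration; check the dataset card for the actual schema before relying on them. It needs Python 3.9+ for `ast.unparse`.

```python
import ast


def split_function_and_docstring(source: str) -> tuple[str, str]:
    """Return (code_without_docstring, docstring) for the first function in `source`."""
    tree = ast.parse(source)
    # ast.walk is breadth-first, so this picks the outermost function first
    func = next(
        node for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    )
    docstring = ast.get_docstring(func) or ""
    first = func.body[0]
    has_docstring = (
        isinstance(first, ast.Expr)
        and isinstance(first.value, ast.Constant)
        and isinstance(first.value.value, str)
    )
    if has_docstring:
        # Drop the docstring node; keep a `pass` if it was the only statement
        func.body = func.body[1:] or [ast.Pass()]
    return ast.unparse(func), docstring


def to_seq2seq_example(record: dict) -> dict:
    """Map one record to model input/target strings.

    `whole_func_string` is an assumed field name, not a confirmed schema.
    """
    code, doc = split_function_and_docstring(record["whole_func_string"])
    return {"source": code, "target": doc}
```

For example, a record whose function is `def add(a, b)` with the docstring `Add two numbers.` would yield that docstring as `target` and the docstring-free function as `source`. Stripping the docstring from the input matters: otherwise the model can simply copy the answer.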
Challenges
Models like CodeT5 are rather large, so fine-tuning them can be demanding on compute and memory. You’ll also need to think about which metrics to use for this type of task.
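As a starting point for the metrics question, here is a minimal sketch of a token-overlap F1 score between a generated docstring and a reference. It is a deliberately simple stand-in, not a recommendation: the function name `token_f1` is hypothetical, the whitespace tokenisation is crude, and established text-generation metrics such as BLEU or ROUGE are likely better choices for a real evaluation.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between two strings (lowercased, whitespace-tokenised)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Overlap scores like this reward surface similarity only, so two docstrings that say the same thing in different words will score poorly. It is worth reading a sample of generations by hand alongside any automatic metric.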
Desired project outcomes
- Create a Streamlit or Gradio app on Spaces that can automatically generate a docstring from a class or function name in your favourite programming language!
- Don’t forget to push all your models and datasets to the Hub so others can build on them!
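For the app outcome, a minimal Gradio sketch might look like the following. The `generate_docstring` function here is a hypothetical placeholder that only echoes the detected name; in a real project it would call your fine-tuned model. `build_demo` is also an illustrative name, and Gradio is imported lazily so the placeholder can be tried without it installed.

```python
import re


def generate_docstring(code: str) -> str:
    """Placeholder for the fine-tuned model: returns a stub docstring.

    Hypothetical stand-in; replace the body with a call to your model.
    """
    match = re.search(r"(?:def|class)\s+(\w+)", code)
    name = match.group(1) if match else "this code"
    return f'"""Describe {name}."""'


def build_demo():
    """Wrap the generator in a simple text-in, text-out Gradio interface."""
    import gradio as gr  # imported here so the rest runs without Gradio

    return gr.Interface(
        fn=generate_docstring,
        inputs="text",
        outputs="text",
        title="Docstring generator",
    )


# To serve the app locally, call: build_demo().launch()
```

Keeping the generation logic in a plain function, separate from the interface, makes it easy to swap the placeholder for a model call and to test it without starting the app.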
Additional resources
Discord channel
To chat and organise with other people interested in this project, head over to our Discord and:
- Follow the instructions on the #join-course channel
- Join the #docstring-generator channel
Just make sure you comment here to indicate that you’ll be contributing to this project.