Please read the topic category description to understand what this is all about
Description
Many open-source projects on GitHub use Issues to triage feature requests, bugs, and so on. For example, check out the Issues tab of Transformers and Datasets to get an idea. The goal of this project is to pick your favourite open-source project and create a bot that can automatically assign one or more labels (e.g. bug, enhancement, etc.) to a new GitHub issue.
Model(s)
Any of the pretrained BERT-like models on the Hub should serve as a good basis for this project. Since issues often contain source code and project-specific jargon, you may find that first fine-tuning the language model on the issue text, before training the classifier, gives a boost in performance.
Datasets
For this project you’ll have to create your own dataset by downloading and processing the GitHub issues associated with an open-source project. You can do this with GitHub’s REST or GraphQL APIs. You can find an example dataset on the Hub.
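For the download step itself, here is a rough sketch using only Python's standard library and the REST endpoint `GET /repos/{owner}/{repo}/issues`. The helper names (`fetch_issues_page`, `issues_to_records`) and the record fields are illustrative choices, not part of any existing dataset. One thing to watch for: this endpoint also returns pull requests, which carry a `pull_request` key, so the sketch filters them out.

```python
import json
import urllib.request

API_URL = "https://api.github.com/repos/{owner}/{repo}/issues"


def fetch_issues_page(owner, repo, page=1, per_page=100, token=None):
    """Fetch one page of issues (open and closed) as a list of dicts."""
    base = API_URL.format(owner=owner, repo=repo)
    url = f"{base}?state=all&per_page={per_page}&page={page}"
    headers = {"Accept": "application/vnd.github+json"}
    if token:  # optional personal access token; raises the rate limit
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def issues_to_records(raw_issues):
    """Keep real issues only and reduce each to its text and label names."""
    records = []
    for issue in raw_issues:
        if "pull_request" in issue:  # the issues endpoint also lists PRs
            continue
        records.append({
            "title": issue.get("title") or "",
            "body": issue.get("body") or "",  # body can be null in the API
            "labels": [label["name"] for label in issue.get("labels", [])],
        })
    return records
```

To collect a whole repository you would call `fetch_issues_page` with increasing `page` values until it returns an empty list, passing each page through `issues_to_records`.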
Challenges
This is a multilabel classification task, so you’ll need to do some data exploration to figure out which classes can be feasibly detected.
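One simple way to start that exploration is to count how often each label occurs and keep only the ones with enough examples. The sketch below assumes each issue is a dict with a `labels` list of label names; the function names and the `min_count=50` default are arbitrary placeholders to tune for your project.

```python
from collections import Counter


def label_counts(records):
    """Count how many issues carry each label."""
    return Counter(
        label for record in records for label in set(record["labels"])
    )


def select_labels(records, min_count=50):
    """Return labels seen on at least min_count issues, sorted by name."""
    counts = label_counts(records)
    return sorted(label for label, n in counts.items() if n >= min_count)


def to_multi_hot(labels, classes):
    """Encode one issue's labels as a 0/1 vector over the kept classes."""
    present = set(labels)
    return [1.0 if c in present else 0.0 for c in classes]


# Toy data, made up purely for illustration.
toy = [{"labels": ["bug"]}, {"labels": ["bug", "docs"]}, {"labels": []}]
classes = select_labels(toy, min_count=2)  # only "bug" appears twice
targets = [to_multi_hot(r["labels"], classes) for r in toy]
```

Issues left with an all-zero target vector are worth a look too: you can keep them as "no label" examples or drop them, depending on how the project uses labels.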
Desired project outcomes
- Create a Streamlit or Gradio app on Spaces that ingests new GitHub issues from an open-source project and predicts the labels of each one.
- Don’t forget to push all your models and datasets to the Hub so others can build on them!
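For the prediction step inside the app, a multilabel classifier typically produces one score (logit) per label, and each score is turned into an independent probability with a sigmoid. A minimal sketch of that post-processing, assuming you already have the logits for one issue; the function names and the 0.5 threshold are assumptions you would tune on a validation set:

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_labels(logits, classes, threshold=0.5):
    """Return (label, probability) pairs whose probability reaches the threshold."""
    scored = [(c, sigmoid(z)) for c, z in zip(classes, logits)]
    return [(c, p) for c, p in scored if p >= threshold]


# Hypothetical logits for one issue over three labels.
predictions = predict_labels([2.0, -1.0, 0.3], ["bug", "enhancement", "question"])
```

Unlike a softmax, this lets an issue receive several labels at once, or none at all if every probability falls below the threshold.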
Additional resources
- Predicting Issues’ Labels with RoBERTa
- Check out this chapter of the course for more details
Discord channel
To chat and organise with other people interested in this project, head over to our Discord and:
- Follow the instructions on the #join-course channel
- Join the #github-issues-classification channel
Just make sure you comment here to indicate that you’ll be contributing to this project.