Create a GitHub issues tagger

lewtun · November 9, 2021, 7:12pm

Please read the topic category description to understand what this is all about

Description

Many open-source projects on GitHub use Issues to triage feature requests, bugs, and so on. For example, check out the Issues tab of Transformers and Datasets to get an idea. The goal of this project is to pick your favourite open-source project and create a bot that can automatically assign a Label (e.g. bug, enhancement, etc.) to a new GitHub issue.

Model(s)

Any of the pretrained BERT-like models on the Hub should serve as a good basis for this project. Given the domain is about source code, you may find that fine-tuning the language model first on the dataset gives a boost in performance.

Datasets

For this project you’ll have to create your own dataset by downloading and processing the GitHub issues associated with an open-source project. You can do this with GitHub’s REST or GraphQL APIs. You can find an example dataset on the Hub here:

lewtun/github-issues
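
As a rough starting point, the sketch below shows one way to pull issues from the REST API and reduce them to the fields a classifier needs. The helper names (`fetch_issue_page`, `issue_to_record`, `build_records`) and the choice of fields are assumptions made for illustration, not part of the example dataset. The REST issues endpoint also returns pull requests, which carry a `pull_request` key, so the sketch filters them out.

```python
import json
import urllib.request

# GitHub REST endpoint for listing a repository's issues (open and closed).
API_URL = (
    "https://api.github.com/repos/{owner}/{repo}/issues"
    "?state=all&per_page=100&page={page}"
)


def fetch_issue_page(owner, repo, page=1, token=None):
    """Download one page of raw issues. This performs a network request."""
    request = urllib.request.Request(API_URL.format(owner=owner, repo=repo, page=page))
    request.add_header("Accept", "application/vnd.github+json")
    if token:
        # A personal access token raises the rate limit (optional).
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def issue_to_record(issue):
    """Keep only the fields used for label prediction; return None for pull requests."""
    if "pull_request" in issue:
        return None
    return {
        "title": issue.get("title") or "",
        "body": issue.get("body") or "",  # the API returns null for empty bodies
        "labels": [label["name"] for label in issue.get("labels", [])],
    }


def build_records(raw_issues):
    """Convert raw API objects to records, dropping pull requests."""
    records = [issue_to_record(issue) for issue in raw_issues]
    return [record for record in records if record is not None]
```

Calling `build_records(fetch_issue_page("owner", "repo"))` with a real owner and repository name would give a list of dictionaries that can be written to JSON Lines and loaded with the Datasets library.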

Challenges

This is a multilabel classification task, so you’ll need to do some data exploration to figure out which classes can be feasibly detected.
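
One way to approach that exploration is sketched below: count how often each label occurs, keep only the labels with enough examples, and encode each issue as a multi-hot vector over the kept classes. The record format (a dictionary with a `labels` list) and the `min_count` cutoff are assumptions for illustration; the right cutoff depends on the project you pick.

```python
from collections import Counter


def label_counts(records):
    """Count how many issues carry each label."""
    return Counter(label for record in records for label in record["labels"])


def frequent_labels(records, min_count):
    """Return the labels that appear on at least `min_count` issues, sorted by name."""
    counts = label_counts(records)
    return sorted(label for label, n in counts.items() if n >= min_count)


def multi_hot(labels, classes):
    """Encode an issue's labels as a float vector aligned with `classes`."""
    present = set(labels)
    return [1.0 if name in present else 0.0 for name in classes]
```

Labels that fall below the cutoff are simply ignored by `multi_hot`, so an issue carrying only rare labels ends up with an all-zero target.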

Desired project outcomes

Create a Streamlit or Gradio app on Spaces that ingests new GitHub issues from an open-source project and predicts the Labels of each one.
Don’t forget to push all your models and datasets to the Hub so others can build on them!
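
Because an issue can carry several labels at once, the app needs to turn the model's per-class scores into a set of labels rather than a single winner. A minimal sketch, assuming the model returns one raw logit per class and that a fixed threshold of 0.5 is acceptable (both are assumptions; `predict_labels` is a hypothetical helper):

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_labels(logits, classes, threshold=0.5):
    """Return every class whose independent probability reaches the threshold."""
    return [
        name
        for name, logit in zip(classes, logits)
        if sigmoid(logit) >= threshold
    ]
```

Each class is scored independently here, so the result can be empty or contain several labels. The threshold could also be tuned per class on a validation set.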

Additional resources

Predicting Issues’ Labels with RoBERTa
Check out this chapter of the course for more details

Discord channel

To chat and organise with other people interested in this project, head over to our Discord and:

Follow the instructions on the #join-course channel
Join the #github-issues-classification channel

Just make sure you comment here to indicate that you’ll be contributing to this project

DELith · November 17, 2021, 2:13pm

I would like to proceed with this project if still possible

lewtun · November 17, 2021, 4:06pm

Hey @DELith yes you’re welcome to take on this project! I’ve created a channel on Discord (see topic description) in case you or others need it

gerardo · November 18, 2021, 5:03am

I would like to participate in this project.
When I search on discord for #github-issues-classification nothing happens.

I think there’s a disconnect with my discord and my hf.co discussion site.

lewtun · November 18, 2021, 10:05am

Hey @gerardo ! Did you first follow the instructions on the #join-course channel? You need to do that to see the team channels

gerardo · November 18, 2021, 3:21pm

This is the only information I see when I click #join-course

I think there’s something wrong with my setup.

lewtun · November 18, 2021, 9:24pm

Hey @gerardo what happens if you add the emoji to the message in #join-course? That should automatically open the rest of the team channels

gerardo · November 18, 2021, 9:57pm

To be added, please react to this message with

lewtun · November 18, 2021, 11:10pm

Don’t worry - happens to me all the time

DELith · November 19, 2021, 4:06pm

@lewtun How do you think the raw text from the issue body should be processed before being fed into the model (training/inference)? I mean the steps preceding the basic pipeline like tokenization and encoding? I am asking because it might have some markdown syntax as well as long sequences like exception tracebacks and stuff like that.

lewtun · November 19, 2021, 7:51pm

Hey @DELith I think you can tokenize the text as you normally would with any text. My suggestion would be to start with that first as a baseline and then see if there’s a need to improve the preprocessing
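
If the plain-tokenization baseline turns out to struggle with long tracebacks, one possible next step is to collapse fenced code blocks into a single placeholder before tokenizing. The sketch below is only an illustration of that idea, not something tested in this thread; the `clean_issue_body` helper and the `[CODE]` placeholder are assumptions.

```python
import re

# Build the Markdown fence marker (three backticks) without typing it literally.
FENCE_MARK = "`" * 3
FENCED_BLOCK = re.compile(
    re.escape(FENCE_MARK) + r".*?" + re.escape(FENCE_MARK),
    re.DOTALL,  # let "." match newlines so multi-line blocks are captured
)


def clean_issue_body(body, placeholder="[CODE]"):
    """Replace fenced code blocks with a placeholder and collapse whitespace."""
    text = FENCED_BLOCK.sub(placeholder, body)
    return re.sub(r"\s+", " ", text).strip()
```

Comparing validation scores with and without this step would show whether the extra preprocessing is worth keeping.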
