Create a GitHub issues tagger

:wave: Please read the topic category description to understand what this is all about

Description

Many open-source projects on GitHub use Issues to triage feature requests, bugs, and so on. For example, check out the Issues tab of :hugs: Transformers and :hugs: Datasets to get an idea. The goal of this project is to pick your favourite open-source project and create a bot that can automatically assign a Label (e.g. bug, enhancement, etc.) to a new GitHub issue.

Model(s)

Any of the pretrained BERT-like models on the Hub should serve as a good basis for this project. Given the domain is about source code, you may find that fine-tuning the language model first on the dataset gives a boost in performance.

Datasets

For this project you’ll have to create your own dataset by downloading and processing the GitHub issues associated with an open-source project. You can do this with GitHub’s REST or GraphQL APIs. An example dataset is also available on the Hub.
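As a rough starting point, here is a minimal sketch of pulling issues through the REST API using only the Python standard library. The helper names (fetch_issue_page, to_record, build_records) and the choice of fields to keep are assumptions made for illustration, not part of the project brief. The issues endpoint also returns pull requests, so the sketch filters those out via the pull_request key.

```python
import json
import urllib.request

API_URL = "https://api.github.com/repos/{owner}/{repo}/issues"


def fetch_issue_page(owner, repo, page=1, per_page=100, token=None):
    """Fetch one page of issues (open and closed) as a list of dicts."""
    base = API_URL.format(owner=owner, repo=repo)
    url = f"{base}?state=all&per_page={per_page}&page={page}"
    headers = {"Accept": "application/vnd.github+json"}
    if token:  # optional personal access token; raises the rate limit
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def to_record(issue):
    """Keep only the fields assumed useful for label prediction."""
    return {
        "title": issue.get("title") or "",
        "body": issue.get("body") or "",  # body can be null in the API response
        "labels": [label["name"] for label in issue.get("labels", [])],
    }


def build_records(issues):
    """Drop pull requests, which this endpoint returns alongside issues."""
    return [to_record(issue) for issue in issues if "pull_request" not in issue]
```

For example, build_records(fetch_issue_page("huggingface", "datasets")) would return the first page of records for :hugs: Datasets. Looping over page until an empty list comes back collects the rest.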

Challenges

This is a multilabel classification task, so you’ll need to do some data exploration to figure out which classes can be feasibly detected.
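One way to start that exploration is to count how often each label occurs and keep only the ones with enough examples to learn from. The sketch below assumes each issue is a dict with a "labels" list of strings; the function names and the min_count threshold of 50 are arbitrary choices for illustration. It also shows how the kept labels could be turned into multi-hot target vectors, which is the usual target format for multilabel classification.

```python
from collections import Counter


def frequent_labels(records, min_count=50):
    """Return the sorted labels that appear on at least min_count issues."""
    counts = Counter(label for record in records for label in record["labels"])
    return sorted(label for label, n in counts.items() if n >= min_count)


def multi_hot(labels, classes):
    """Encode an issue's labels as a 0/1 float vector aligned with classes."""
    present = set(labels)
    return [1.0 if c in present else 0.0 for c in classes]
```

Labels that fall below the threshold are simply ignored by multi_hot, so an issue carrying only rare labels ends up with an all-zero target.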

Desired project outcomes

  • Create a Streamlit or Gradio app on :hugs: Spaces that ingests new GitHub issues from an open-source project and predicts the Labels of each one.
  • Don’t forget to push all your models and datasets to the Hub so others can build on them!

Additional resources

Discord channel

To chat and organise with other people interested in this project, head over to our Discord and:

  • Follow the instructions on the #join-course channel

  • Join the #github-issues-classification channel

Just make sure you comment here to indicate that you’ll be contributing to this project :slight_smile:

I would like to proceed with this project if still possible

Hey @DELith yes you’re welcome to take on this project! I’ve created a channel on Discord (see topic description) in case you or others need it :slight_smile:

I would like to participate in this project.
When I search on Discord for #github-issues-classification nothing happens.

I think there’s a disconnect with my discord and my hf.co discussion site.

Hey @gerardo ! Did you first follow the instructions on the #join-course channel? You need to do that to see the team channels :slight_smile:


This is the only information I see when I click #join-course

I think there’s something wrong with my setup.

Hey @gerardo what happens if you add the :hugs: emoji to the message in #join-course? That should automatically open the rest of the team channels

To be added, please react to this message with :hugs:

:man_facepalming:t2:

Don’t worry - happens to me all the time :wink:

@lewtun How do you think the raw text from the issue body should be processed before being fed into the model (training/inference)? I mean the steps preceding the basic pipeline, like tokenization and encoding. I am asking because the body might contain markdown syntax as well as long sequences like exception tracebacks and so on.

Hey @DELith I think you can tokenize the text as you normally would with any text. My suggestion would be to start with that first as a baseline and then see if there’s a need to improve the preprocessing :slight_smile:
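If the baseline does suggest that markdown or long tracebacks are hurting, one possible next step is a light cleaning pass before tokenization. This is only a sketch under assumptions: the clean_issue_body name, the [CODE] placeholder, and the choice to collapse fenced blocks rather than keep them are all illustrative, and whether it helps is something to check against the baseline.

```python
import re

# Fenced blocks (three backticks ... three backticks), which often hold tracebacks
FENCE = re.compile(r"`{3}.*?`{3}", re.DOTALL)
# Markdown links of the form [text](url)
LINK = re.compile(r"\[([^\]]+)\]\([^)]+\)")


def clean_issue_body(text, code_placeholder=" [CODE] "):
    """Collapse fenced blocks, keep link text only, and normalise whitespace."""
    text = FENCE.sub(code_placeholder, text)
    text = LINK.sub(r"\1", text)
    return re.sub(r"\s+", " ", text).strip()
```

Anything that is still too long after cleaning can be cut off by the tokenizer itself, for example by passing truncation=True when encoding.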