Fire and Forget: Introducing LabelBot

LabelBot is a GitHub application that automatically tags an issue with a GitHub label.

It uses machine learning to read the issue title and comment body, estimate what the issue is about, post a short comment notifying the issue creator of the prediction, and then apply the label.


When an issue is opened, the bot predicts whether it should be labelled 🐞 bug, 💡 enhancement, or ❓ question, and applies the label automatically if appropriate.
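A minimal sketch of what this flow could look like, assuming a Flask webhook endpoint, a pickled scikit-learn pipeline saved as model.pkl, and a token in the GITHUB_TOKEN environment variable (all illustrative names, not taken from the actual bot; a real GitHub App would authenticate with an installation token):

```python
# Minimal sketch of the labelling flow: predict a label for a newly opened
# issue, comment on it, then apply the label via the GitHub REST API.
import os
import pickle

import requests
from flask import Flask, request

app = Flask(__name__)
with open("model.pkl", "rb") as f:          # hypothetical pre-trained classifier
    model = pickle.load(f)

API = "https://api.github.com"
HEADERS = {"Authorization": f"token {os.environ['GITHUB_TOKEN']}"}

@app.route("/webhook", methods=["POST"])
def on_issue_event():
    event = request.get_json()
    if event.get("action") != "opened":
        return "", 204

    issue = event["issue"]
    repo = event["repository"]["full_name"]
    text = f"{issue['title']} {issue.get('body') or ''}"

    label = model.predict([text])[0]        # "bug", "enhancement" or "question"

    # Comment first, then apply the predicted label.
    base = f"{API}/repos/{repo}/issues/{issue['number']}"
    requests.post(f"{base}/comments",
                  json={"body": f"LabelBot thinks this looks like a **{label}**."},
                  headers=HEADERS)
    requests.post(f"{base}/labels", json={"labels": [label]}, headers=HEADERS)
    return "", 200
```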

Try it out

[Screenshots: Prediction Information, Data Before Preprocessing, Data After Preprocessing, and the bot in action on GitHub]

Let's talk about data

We mined the GitHub API and GHTorrent dumps to collect label-specific issues from trending repositories:

  • Trending repositories are fetched from GHTorrent based upon their stars and forks.
  • GitHub API is used to fetch issues from these repositories based on their labels.
  • As of now, our dataset contains issues tagged with bug, question and enhancement.
  • To ensure that an issue is primarily related to the label we queried for, we take the following measure: if the issue is tagged with any other default GitHub label, it is ignored (see the sketch after this list).
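A sketch of what this collection step could look like, assuming the trending-repository list has already been extracted from GHTorrent; the query uses GitHub's public issue-search endpoint, and the filter drops any issue that also carries a second default label (function name and pagination handling are illustrative):

```python
# Sketch of the issue-collection step for one repository and one target label.
import requests

DEFAULT_LABELS = {"bug", "enhancement", "question", "documentation",
                  "duplicate", "invalid", "wontfix", "good first issue",
                  "help wanted"}

def fetch_issues(repo, label, token):
    """Return (title, body, label) triples for issues that carry the queried
    default label and no other default label."""
    url = "https://api.github.com/search/issues"
    params = {"q": f'repo:{repo} is:issue label:"{label}"', "per_page": 100}
    headers = {"Authorization": f"token {token}"}
    items = requests.get(url, params=params, headers=headers).json().get("items", [])

    rows = []
    for issue in items:
        names = {l["name"].lower() for l in issue["labels"]}
        # Ignore the issue if any *other* default GitHub label is attached.
        if names & (DEFAULT_LABELS - {label}):
            continue
        rows.append((issue["title"], issue.get("body") or "", label))
    return rows
```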

Target Label Selection

Initially, we fetched issues tagged with all of the default GitHub labels:

  • Documentation was removed due to insufficient data points. Issues tagged with documentation also did not show convincing similarity in terms of word embeddings.
  • Issues tagged with invalid, duplicate, wontfix and good first issue likewise did not show convincing similarity in terms of word embeddings (one way to run this check is sketched after this list).
  • This is understandable, because invalid, duplicate, wontfix and good first issue are applied to an issue by the maintainer based on characteristics other than the comment body.
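The similarity check can be run by embedding each issue and measuring how tightly the issues of one label cluster; the sketch below assumes spaCy's en_core_web_md vectors and a coherence threshold chosen by inspection (the embeddings actually used by the project may differ):

```python
# Sketch of the embedding-similarity check used to keep or drop a label.
import numpy as np
import spacy
from sklearn.metrics.pairwise import cosine_similarity

nlp = spacy.load("en_core_web_md")   # medium English model ships with word vectors

def coherence(texts):
    """Mean pairwise cosine similarity between issues of the same label;
    low values suggest the label is poorly characterised by its text."""
    vectors = np.vstack([nlp(t).vector for t in texts])
    sims = cosine_similarity(vectors)
    n = len(texts)
    return (sims.sum() - n) / (n * (n - 1))   # drop the diagonal of ones

# issues_by_label maps each candidate label to a list of issue bodies.
# Labels whose issues do not cluster together fall below the chosen threshold:
# kept = {lbl for lbl, txts in issues_by_label.items() if coherence(txts) > THRESHOLD}
```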

Model Selection

We compared the accuracy of four ML models on our dataset:

  • Linear Support Vector Classifier: 73.46%
  • Logistic Regression: 73.34%
  • Multinomial Naive Bayes: 69.74%
  • Random Forest Classifier: 50.59%

Upon comparing the accuracies, we chose the Linear Support Vector Classifier for the labelling bot.
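The comparison can be reproduced along these lines with scikit-learn; the TF-IDF features and the texts/labels variables are assumptions standing in for the preprocessed dataset described above:

```python
# Sketch of the model comparison on an 80-20 train-test split.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# texts: preprocessed issue title + body; labels: "bug"/"enhancement"/"question"
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

candidates = {
    "Linear Support Vector Classifier": LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Random Forest Classifier": RandomForestClassifier(),
}

for name, clf in candidates.items():
    pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    pipeline.fit(X_train, y_train)
    print(f"{name}: {pipeline.score(X_test, y_test):.2%}")
```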


Confusion Matrix

A confusion matrix is a table used to visualise the performance of a classification model.

  • We perform classification on the test set, which is obtained from an 80-20 train-test split.
  • The horizontal axis ("predicted") denotes the labels predicted by the model.
  • The vertical axis ("actual") denotes the labels the data points actually carry in the dataset.
  • The diagonal from the top-left to the bottom-right corner denotes accurate predictions.
  • We observe relatively higher accuracy for the bug and enhancement labels compared to question.
  • From this we conclude that issues labelled as question tend to be more open-ended in terms of their body (based on embedding similarity); a sketch for reproducing the matrix follows this list.
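The matrix can be produced and plotted with scikit-learn, reusing the train-test split from the model-comparison sketch above and refitting the chosen Linear SVC pipeline:

```python
# Sketch: confusion matrix of the chosen Linear SVC on the held-out 20% split.
# X_train, X_test, y_train, y_test come from the comparison sketch above.
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

best = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
best.fit(X_train, y_train)

ConfusionMatrixDisplay.from_estimator(
    best, X_test, y_test,
    labels=["bug", "enhancement", "question"],
)
plt.xlabel("predicted")   # match the axis names used in the figure
plt.ylabel("actual")
plt.show()
```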