2021H1 - Improve Machine Learning Webcompat Bot #194

Closed
karlcow opened this issue Apr 5, 2021 · 20 comments

karlcow commented Apr 5, 2021

We had two choices: either mastering the current infrastructure or re-using the infrastructure of BugBug.

Here is a summary of the previous discussions:

karlcow commented Apr 5, 2021

We should probably move this to a public forum of some sort.
Marco introduced me in 2018 to autowebcompat (which I had forgotten).

https://groups.google.com/g/mozilla.compatibility/c/4yQxX9EZGQ8/m/ARNK5Da7AQAJ

See https://github.com/marco-c/autowebcompat

Automatically detect web compatibility issues

Since then, the Open Innovation team started a project to auto-triage webcompat web-bugs. Then the Open Innovation team disappeared and the knowledge transfer was minimal.

The documentation of the webcompat ML project
https://webcompat-ml.readthedocs.io/en/latest/

We inherited a useful project that we do not master, and this is a big weakness.

Ksenia started to get an understanding of the project.
https://docs.google.com/document/d/1w3_AU4ZYJJLSkqOW5j_6K49JxmutEsxcpsiwHXz8Tbw/edit#heading=h.otfe6ib0565j

I’m pretty sure there is a lot more we could do with these tools. But first we need to get a better hold on them. And it would be cool if we can work together on making things useful for everyone.

karlcow commented Apr 5, 2021

Some pieces of a message by Marco Castelluccio @marco-c

Some context about bugbug

bugbug is currently being used for a few purposes.

It started as a project to automatically assign a type to bugs (defect vs enhancement vs task; back when we introduced the "type" field, we needed a way to fill it for already existing bugs), and then evolved to be a platform to build ML models on bug reports: we now have many models, some of which are being used on Bugzilla, e.g. to assign a type, to assign a component (https://hacks.mozilla.org/2019/04/teaching-machines-to-triage-firefox-bugs/), to close bugs detected as spam, to detect "regression" bugs, and so on.

Then, it evolved to be a platform to build ML models for generic software engineering purposes: we no longer only have models that operate on bug reports, but also on test data and patches/commits (e.g. to choose which tests to run for a given patch, https://hacks.mozilla.org/2020/07/testing-firefox-more-efficiently-with-machine-learning/, and to evaluate the regression risk associated with a patch), and so on.

bugbug also has an infrastructure (built to make it a good fit for ML projects) for Continuous Integration and Continuous Deployment using Taskcluster, and an HTTP service that other services can use for classification (e.g. Firefox CI uses it to choose which tests to run on try and on autoland, and the autonag bot uses it to classify bugs).

Deprecating webcompat-ml in favor of a bugbug-based model?

I know you don't use Bugzilla to track web-compat reports, so if you wanted to replace webcompat-ml with a bugbug-based classifier you would first need to add support to bugbug to retrieve GitHub issues (mozilla/bugbug#259), then define a new model similar to an already existing one (it is super easy to do, many external contributors have done it). I could guide you if you are interested.

The main benefit of this move would be to have a proven ML pipeline and service, with very limited maintenance (as I'm maintaining the infrastructure and the ML models part, your side of the maintenance would be limited to pinging me when you notice a problem :P) and no duplication. The main drawback would be that you have to reimplement some parts of webcompat-ml in bugbug (though it should be a really small part, just the "GitHub issue retrieval").
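For illustration, here is a minimal sketch of the "GitHub issue retrieval" part mentioned above. It is not bugbug's actual code (that work lives behind mozilla/bugbug#259); it only shows the shape of the task using the public GitHub REST API, with webcompat/web-bugs as an example repository.

```python
# Minimal sketch (not bugbug's actual retrieval code) of fetching every
# issue of a repository via the public GitHub REST API, with pagination.
import requests

GITHUB_API = "https://api.github.com"


def fetch_issues(owner, repo, token=None, state="all"):
    """Yield every issue of owner/repo, following GitHub's pagination."""
    headers = {"Accept": "application/vnd.github.v3+json"}
    if token:
        headers["Authorization"] = f"token {token}"

    page = 1
    while True:
        response = requests.get(
            f"{GITHUB_API}/repos/{owner}/{repo}/issues",
            headers=headers,
            params={"state": state, "per_page": 100, "page": page},
        )
        response.raise_for_status()
        issues = response.json()
        if not issues:
            break
        for issue in issues:
            # The issues endpoint also returns pull requests; skip them.
            if "pull_request" not in issue:
                yield issue
        page += 1


if __name__ == "__main__":
    for issue in fetch_issues("webcompat", "web-bugs"):
        print(issue["number"], issue["title"])
```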

autowebcompat

Regarding autowebcompat, I think it would be an interesting project. It is a totally different kind of project (to summarize for Joe and Sylvestre: a crawler that navigates the web to automatically find web compatibility issues by comparing screenshots from different browsers using a ML model). It is more of a research project though, as it is a shot in the dark: it might work and find lots of issues, but it might not work or might work and find few issues. We have no way to tell the likelihood it would work without investing some time in it.

This kind of project could also be useful for other similar purposes: e.g. instead of looking for web compat issues, it could find issues by comparing a normal Firefox run with a tracking-protection-enabled Firefox run.

Moreover, even if it didn't work, parts of it could be reused for other purposes (which might not be interesting for you, but could be interesting for us). E.g. navigating the web with an ASAN build of Firefox in search of crashes.

karlcow commented Apr 5, 2021

from @ksy36

I have a couple of questions about the best way to handle the data.

  1. In our current project we store pretty much all the data pulled from GitHub for all webcompat issues in the ES database, but we only use the title, the body, and one of the milestones of each issue to build a CSV for the model. I just wanted to make sure that I only need to store the required fields in the JSON file (3 fields per issue in our case), as the rest are redundant?
    I'm thinking of creating a special use case for webcompat while keeping the GitHub retrieval code agnostic and saving all data by default for a specified org/repo.
  2. As DBs are stored on a server, how do I upload one when it's ready? I've also noticed that some are stored on Taskcluster (for example https://github.com/mozilla/bugbug/blob/master/bugbug/phabricator.py#L23) and some are stored in S3; I'm not sure what the difference is.

karlcow commented Apr 5, 2021

From @marco-c

1. In our current project we store pretty much all the data pulled from GitHub for all webcompat issues in the ES database, but we only use the title, the body, and one of the milestones of each issue to build a CSV for the model. I just wanted to make sure that I only need to store the required fields in the JSON file (3 fields per issue in our case), as the rest are redundant?
   I'm thinking of creating a special use case for webcompat while keeping the GitHub retrieval code agnostic and saving all data by default for a specified org/repo.

I'd suggest storing everything (unless it makes the DB huge, but I imagine it doesn't), and then we can just filter what we need later as part of the model training. That's what I do for Bugzilla data too: you never know what you might need in the future. And it's simpler, as you don't need special cases.
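As an illustration of "store everything, filter later" (this is not bugbug's actual db module), a sketch that projects the three fields the old webcompat-ml model used out of a full JSON-lines dump at training time; the field names follow GitHub's issue JSON:

```python
# Illustrative sketch only (not bugbug's actual db module): store the full
# GitHub payload, then project out the few fields the model needs at
# training time.
import json


def training_rows(db_path):
    """Read a JSON-lines dump of full GitHub issues and keep only the
    fields the old webcompat-ml model used: title, body and milestone."""
    with open(db_path, encoding="utf-8") as db:
        for line in db:
            issue = json.loads(line)
            milestone = issue.get("milestone")
            yield {
                "title": issue.get("title", ""),
                "body": issue.get("body") or "",
                "milestone": milestone["title"] if milestone else None,
            }
```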

2. As DBs are stored on a server, how do I upload one when it's ready? I've also noticed that some are stored on Taskcluster (for example https://github.com/mozilla/bugbug/blob/master/bugbug/phabricator.py#L23) and some are stored in S3; I'm not sure what the difference is.

Usually they are stored as artifacts of Taskcluster tasks. Taskcluster automatically handles uploads to S3, and provides an indexed URL to access them. This is the preferred way of handling DBs.

Some are stored "manually" on S3 because they take a long time to be generated (multiple days), so we upload them while the task is still running to make sure that if it stops for some reason (e.g. AWS kills the instance) we haven't lost the work done by the task up to that point.

For simplicity, in the GitHub case I'd go with the first option. You barely need to do anything to make Taskcluster upload the DB; we'll just need to define the path to the DB in the task definition (similarly to https://github.com/mozilla/bugbug/blob/53a0bf62a0fb7b31f147d830417f52e7db49bb10/infra/data-pipeline.yml#L170-L176).
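To make the artifact story concrete, a minimal sketch of writing such a DB to a fixed path that the task definition can then declare as an artifact; the path and the compressed JSON-lines format are assumptions for illustration, not bugbug's actual db module:

```python
# Minimal sketch (not bugbug's actual db module): write the fetched issues
# to a fixed, compressed JSON-lines file. The Taskcluster task definition
# then only needs to declare this path as an artifact so it gets uploaded
# to S3 and indexed automatically.
import gzip
import json
import os

DB_PATH = "data/github_issues.json.gz"  # hypothetical artifact path


def write_db(issues, db_path=DB_PATH):
    os.makedirs(os.path.dirname(db_path), exist_ok=True)
    with gzip.open(db_path, "wt", encoding="utf-8") as db:
        for issue in issues:
            db.write(json.dumps(issue) + "\n")
```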

In the end, we'll have:

Once this is done, we can move to the training part.

karlcow commented Apr 5, 2021

from @marco-c

I've added a Taskcluster task to retrieve the webcompat issues (mozilla/bugbug@cbd43a0), here's an example run: https://community-tc.services.mozilla.com/tasks/F37Si0aYS-aGCDkahMdKTw. It will run at least once every two weeks from now on.

I've also added support for building models that operate on issues (mozilla/bugbug@7ed57c2 and mozilla/bugbug@5d1fcbd), and I've defined a basic model (a model that simply identifies issues that affect Firefox) to try out the new features: mozilla/bugbug@c45dd7c. The new model can be trained with python -m scripts.trainer browsername.

Here's what's left to do:

  • define a model to do what you want to do (we can either define it from scratch or take inspiration from the one you already had), in a similar way to the "browsername" model;
  • define a training task in the pipeline, so the model is trained biweekly;
  • (optional) add support for classifying issues to the bugbug HTTP service. I suggest using the HTTP service, as it solves versioning issues and deployments for you, but this step is optional because you can also use bugbug directly in your tools.

karlcow commented Apr 5, 2021

Ksenia @ksy36 has done very cool work on https://github.com/mozilla/bugbug/pull/2242/files

marco-c commented Apr 15, 2021

@ksy36 also implemented updating the issues DB by re-fetching issues that have been modified since the last run: mozilla/bugbug#2281.

karlcow commented Apr 15, 2021

Thanks @marco-c
Wonderful, @ksy36! All this work will make it possible to remove a big hurdle.

marco-c commented Apr 15, 2021

from @ksy36

I've started looking into creating the model for webcompat and have a couple of questions; I would appreciate it if you could share your thoughts.

With the current model we don't extract features from the issue body yet (it was marked as enhancement/experiment in mozilla/webcompat-ml#3) and I'm wondering what the optimal way of handling this case is. As far as I understand, all models in bugbug extract certain features. Perhaps in my case I could have an empty list in https://github.com/mozilla/bugbug/blob/master/bugbug/models/browsername.py#L24, or would you suggest extracting some features?

And another question I have is about the title change. In the current model we use the original titles (they're extracted from the events list, i.e. we're looking for the title that the issue had before the "renamed" event).
I wonder if the transform function of issue_features.py is the right place to change it (https://github.com/mozilla/bugbug/blob/master/bugbug/issue_features.py#L69), so that the right titles are used in the data set.

marco-c commented Apr 15, 2021

from myself:

I've started looking into creating the model for webcompat and have a couple of questions; I would appreciate it if you could share your thoughts.

With the current model we don't extract features from the issue body yet (it was marked as enhancement/experiment in mozilla/webcompat-ml#3) and I'm wondering what the optimal way of handling this case is. As far as I understand, all models in bugbug extract certain features. Perhaps in my case I could have an empty list in https://github.com/mozilla/bugbug/blob/master/bugbug/models/browsername.py#L24, or would you suggest extracting some features?

I'd suggest replicating the original model first, and then we can add more features later if we want to try improving it.

The BrowserName model was using three sources for the features: title, first comment, and generic characteristics (in this case, the total number of comments). I defined all three just to showcase them all, but we can remove them.

In general, the models have an extraction pipeline (e.g. https://github.com/mozilla/bugbug/blob/c7fe80ccc29fa5fd5de2b7c7ed59e8e6ce829c09/bugbug/models/browsername.py#L35-L54) where the first step parses the issues and extracts data from them (the input is a list of issues, the output is a list of dicts with "data", "title", "first_comment" keys https://github.com/mozilla/bugbug/blob/c7fe80ccc29fa5fd5de2b7c7ed59e8e6ce829c09/bugbug/issue_features.py#L77-L79), and the second step transforms the list of dicts into lists of numpy arrays that can be given as input to the actual ML algorithm.

To remove the generic characteristics, you can simply remove the feature_extractors list (from https://github.com/mozilla/bugbug/blob/c7fe80ccc29fa5fd5de2b7c7ed59e8e6ce829c09/bugbug/models/browsername.py#L23) or make it an empty list like you said, and remove https://github.com/mozilla/bugbug/blob/c7fe80ccc29fa5fd5de2b7c7ed59e8e6ce829c09/bugbug/models/browsername.py#L45 (the second step is actually the only one necessary, the first is more of a cleanup).

Then, if you want to remove the first comment (i.e. issue body) from consideration, you can remove these three lines https://github.com/mozilla/bugbug/blob/c7fe80ccc29fa5fd5de2b7c7ed59e8e6ce829c09/bugbug/models/browsername.py#L47-L51 (basically, the "first_comment" data from the list of dicts won't be turned into part of the numpy array that is given as input to the ML algorithm).
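To make the two-step shape concrete, here is a self-contained sketch using plain scikit-learn classes rather than bugbug's own (the class names, toy data, and classifier choice are illustrative assumptions): step 1 reduces each issue dict to its title, step 2 vectorizes the titles for the classifier.

```python
# Self-contained sketch (not bugbug's actual classes) of the two-step shape
# described above: step 1 turns raw issues into the data the model uses,
# step 2 turns that data into numeric arrays -- here using only the title.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def extract(issues):
    # Step 1: parse the issues and keep only what the model will use.
    return [issue["title"] or "" for issue in issues]


pipeline = Pipeline(
    [
        ("issue_extractor", FunctionTransformer(extract)),
        ("title_vectorizer", TfidfVectorizer(min_df=2)),  # step 2
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Hypothetical toy data: issues as dicts, labels as needsdiagnosis yes/no.
issues = [
    {"title": "Site layout broken in Firefox"},
    {"title": "testing, please ignore"},
] * 10
labels = [1, 0] * 10
pipeline.fit(issues, labels)
print(pipeline.predict([{"title": "page renders incorrectly"}]))
```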

And another question I have is about the title change. In the current model we use the original titles (they're extracted from the events list, i.e. we're looking for the title that the issue had before the "renamed" event).
I wonder if the transform function of issue_features.py is the right place to change it (https://github.com/mozilla/bugbug/blob/master/bugbug/issue_features.py#L69), so that the right titles are used in the data set.

We have something similar for Bugzilla bugs. For Bugzilla bugs it is super complex because we're handling a lot of fields and a lot of corner cases, but for GitHub issues it should be simpler (especially if you limit the "rollback" to the title).

You'd need to patch https://github.com/mozilla/bugbug/blob/c7fe80ccc29fa5fd5de2b7c7ed59e8e6ce829c09/bugbug/issue_features.py#L19 in a similar way to what I did here mozilla/bugbug@8bab68a, and define an issue_snapshot module with a rollback function (like https://github.com/mozilla/bugbug/blob/c7fe80ccc29fa5fd5de2b7c7ed59e8e6ce829c09/bugbug/bug_snapshot.py#L536, but ignore this function as it is super complex!) that simply takes the issue as a dict and returns the issue with the title changed back to the original title. There is a when argument that could be used to specify the point in the issue's history you want to roll back to, but for simplicity I suggest not implementing it for now (just assert it is None at the beginning of the function).

Then, you can use the rollback in your model like this: https://github.com/mozilla/bugbug/blob/c7fe80ccc29fa5fd5de2b7c7ed59e8e6ce829c09/bugbug/models/component.py#L94.
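A hedged sketch of such a rollback function, assuming the issue dict keeps its GitHub events under an "events" key (GitHub reports a title change as a "renamed" event carrying a "rename" object with "from" and "to"):

```python
# Hedged sketch of the rollback described above. It assumes the issue dict
# stores its GitHub events under an "events" key; a title change shows up
# as a "renamed" event carrying {"rename": {"from": ..., "to": ...}}.
def rollback(issue, when=None):
    """Return the issue with its title reset to the original one."""
    assert when is None, "rolling back to an arbitrary point is not implemented"

    issue = dict(issue)  # do not mutate the caller's copy
    for event in issue.get("events", []):
        if event.get("event") == "renamed":
            # The first rename recorded the title the reporter chose.
            issue["title"] = event["rename"]["from"]
            break
    return issue
```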

Sorry I sent you so many links, but the situation should be relatively simple in the end :)

ksy36 commented May 5, 2021

from @marco-c :

I've released a new version of bugbug with the new model. Unfortunately there is still a (very easy) problem to solve.
The task is failing as it runs out of memory: https://community-tc.services.mozilla.com/tasks/FXN9ARgUQGuwO_m1BZJwsg/runs/0/logs/public/logs/live.log. The logs are not really informative in OOM situations, but in all similar cases from the past (where the log just abruptly stops) it was an OOM. Indeed, I tried to rerun the task with a bigger worker ('compute-large' instead of 'compute-small'), and it worked: https://community-tc.services.mozilla.com/tasks/CnresoMPQXaNLFQ_-D2S7g/runs/0/logs/public/logs/live.log.

The OOM is occurring while calculating feature importances. There are two possible fixes:

  1. disable feature importance calculation by adding self.calculate_importance = False in the model's constructor (https://github.com/mozilla/bugbug/blob/f1adf21880b5af39383febe363cfd398d6acd695/bugbug/models/spambug.py#L21);
  2. change the task definition in infra/data-pipeline.yml to use compute-large instead of compute-small.

I'd lean towards the first (a minimal sketch follows below); you can re-enable feature importance calculation locally in case you want to debug the model or figure out possible improvements.
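A minimal sketch of option 1; the class names are placeholders patterned on the linked spambug example, not the real webcompat model definition:

```python
# Minimal sketch of option 1 above; the class names are placeholders
# patterned on the linked spambug example, not the real model definition.
class Model:
    """Stub standing in for bugbug's model base class."""

    def __init__(self, lemmatization=False):
        self.calculate_importance = True


class NeedsDiagnosisModel(Model):
    def __init__(self, lemmatization=False):
        super().__init__(lemmatization)
        # Skip feature importance calculation in CI to avoid the OOM;
        # re-enable it locally when debugging the model.
        self.calculate_importance = False
```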

Once this is done, the last remaining step is using the model.
You can either do that by installing the bugbug package and running the scripts/github_issue_classifier.py script (you might want to add it to setup.py so you can run it as bugbug-classify-issue, https://github.com/mozilla/bugbug/blob/f1adf21880b5af39383febe363cfd398d6acd695/setup.py#L65) in your environment, or we could add support to the bugbug HTTP service for classifying issues and then use a REST API from your environment to classify issues.

If you want to go the second way, we can add a new parameter to https://github.com/mozilla/bugbug/blob/f1adf21880b5af39383febe363cfd398d6acd695/http_service/bugbug_http/app.py#L345 and https://github.com/mozilla/bugbug/blob/f1adf21880b5af39383febe363cfd398d6acd695/http_service/bugbug_http/app.py#L412 to select the bug tracking system ("bugzilla" or "github") and an equivalent to https://github.com/mozilla/bugbug/blob/f1adf21880b5af39383febe363cfd398d6acd695/http_service/bugbug_http/models.py#L55 for classifying issues.

Installing bugbug in your environment is easier to implement, but you might need to periodically change its version to sync up with changes in bugbug. E.g. if you make an improvement in the bugbug model and ship it, you'll need to bump the local bugbug version too right after the training is done. If the new model is somehow broken and we roll back to an earlier version, you need to change your local bugbug version back too. And so on and so forth.

With the HTTP service, instead, you don't suffer from any of these problems. If the model is broken, I simply roll back the HTTP service to an earlier version and your REST API calls will still work exactly as before.

To summarize, I suggest the HTTP service way to have a more stable and more maintainable usage of bugbug, but going the local bugbug way is also acceptable as a starting point. This is what we did in our Bugzilla bot: we started with bugbug integrated directly in the bot, and later changed to using the HTTP service.

ksy36 commented May 5, 2021

from me:

As for using the model, I'll look into modifying the HTTP service to classify GitHub issues as you suggested. I've noticed that classification happens in the background and that to get the results I need to make a second request. I have found an example of batch processing in autonag https://github.com/mozilla/relman-auto-nag/blob/master/auto_nag/bugbug_utils.py#L25 where the function continues to retry getting bugs that are "ready" until a certain limit is reached, so it looks like I could use a similar approach.
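A rough sketch of that poll-until-ready approach; the base URL, endpoint shape, and response fields are assumptions for illustration, not the documented bugbug HTTP API:

```python
# Hedged sketch of the poll-until-ready approach mentioned above. The URL
# shape and the "ready" field are assumptions for the sake of illustration,
# not the documented bugbug HTTP API.
import time

import requests

BUGBUG_HTTP_SERVICE = "https://bugbug.herokuapp.com"  # assumed base URL


def classify_issue(model, issue_number, retries=10, delay=3):
    url = f"{BUGBUG_HTTP_SERVICE}/{model}/predict/github/{issue_number}"
    for _ in range(retries):
        response = requests.get(url)
        response.raise_for_status()
        result = response.json()
        # The service classifies in the background; poll until it is done.
        if result.get("ready", True):
            return result
        time.sleep(delay)
    raise TimeoutError(f"classification of issue {issue_number} timed out")
```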

Also, I wonder whether it's preferable to classify issues in batches or individually. With the current model we classify on a per-issue basis (a GitHub webhook event triggers the classification script). We could instead create a cron job for classifying the issues in batches and run it a couple of times a day; however, in this case it's possible that a human will get to some issues before they're classified. We currently get around 700-800 reports a week, though, and I suppose sending a request to the bugbug service for each individual issue is not ideal.

ksy36 commented May 5, 2021

from @marco-c :

Yes, the service runs the classification in the background, as it can be slower than the maximum 30 seconds allowed by Heroku for a response (e.g. in the case of batch classification; it won't be that slow for single classifications).

For your use case, single classifications should work well. In autonag we were requesting thousands of bugs at a time (we are using a poll mechanism, so we query Bugzilla for e.g. 7 days' worth of bugs and classify all of them in one go), which is why we really needed batch classification (the classification itself is actually pretty quick; the slow part is retrieving data from Bugzilla).

ksy36 commented May 31, 2021

webcompat/webcompat.com#3574 has been deployed, and issues that are classified as needsdiagnosis=False with high probability by bugbug are now closed automatically in the private and public repositories.

Public issues that are closed that way receive placeholder text instead of the actual content (we don't want to accidentally post abusive content, because it has not been moderated).

I have also disabled webhook events for the old ML bot.

What's left to do:

  • adjust the data fetching step in bugbug to make sure we still get the actual content of closed public issues (we will need to make a request to fetch the original issue content from the private repository); a rough sketch follows below
  • check whether it makes sense to adjust the model to include the issues that were marked as duplicates in addition to the needsdiagnosis criteria, as the content of duplicate issues might actually be useful (they're technically valid)
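A very rough sketch of the first item above. How a closed public issue maps to its counterpart in the private repository is left out here, since that mapping lives in webcompat's own tooling, and the private repository name below is an assumption:

```python
# Rough sketch for the first follow-up item above. Resolving which
# private-repo issue corresponds to a closed public issue is out of scope
# here (that mapping lives in webcompat's own tooling); the repository name
# below is an assumption.
import requests

PRIVATE_REPO = "webcompat/web-bugs-private"  # assumed repository name


def fetch_original_content(private_issue_number, token):
    """Fetch the unredacted body of an issue from the private repository."""
    response = requests.get(
        f"https://api.github.com/repos/{PRIVATE_REPO}/issues/{private_issue_number}",
        headers={"Authorization": f"token {token}"},
    )
    response.raise_for_status()
    return response.json()["body"]
```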

marco-c commented Jun 3, 2021

check whether it makes sense to adjust the model to include the issues that were marked as duplicates in addition to the needsdiagnosis criteria, as the content of duplicate issues might actually be useful (they're technically valid)

Duplicate detection for Bugzilla is something that a contributor was working on but never finished. It is different from a classification problem, as for duplicate detection you want to compare an issue with already existing issues.

karlcow commented Jun 3, 2021

There are two types of duplication in the context of webcompat:

  1. A mix of full URL/domain name and issue description: we do not duplicate the same flexbox issue reported on different domains. This is often done at triage time, and sometimes at diagnosis.
  2. The same flexbox issue will be duplicated against the same Bugzilla core issue after diagnosis.

I wonder if @ksy36 here is talking about the 1st type of duplication.

ksy36 commented Jun 3, 2021

Yeah, it's the 1st type of duplication, I should have clarified that :)

It's for the case when an issue is closed as a duplicate of another GitHub issue, bypassing the diagnosis step. So the content of the issue is valid and should contribute to the same class as the issues that have been moved to needsdiagnosis. An example of such an issue is webcompat/web-bugs#74902.

I have noticed that our model marks certain issues as needsdiagnosis=False even though the content of the description seems valid, for example webcompat/web-bugs#75721.
I recall that we received lots of duplicates for it, but only one of them was moved to needsdiagnosis and the rest are duplicates, so it makes sense that the model marks them needsdiagnosis=False.

So I think it's worth trying to compare the metrics of the current model and the model with added duplicates.

ksy36 commented Jun 10, 2021

So I think it's worth trying to compare the metrics of the current model and the model with added duplicates.

I've compared the two models, and the original one has better cross-validation scores, so I'm going to keep the original model.

Original:

Cross Validation scores:
Accuracy: 0.9254523066897928 (+/- 0.0027432841231565144)
Precision: 0.9306239883158366 (+/- 0.002419008133168852)
Recall: 0.9925023393546839 (+/- 0.0017654638927441645)
X_train: (67648, 11441), y_train: (67648,)
X_test: (7517, 11441), y_test: (7517,)

Confidence threshold > 0.9 - 6047 classified
                          pre       rec       spe        f1       geo       iba       sup

                 1       0.96      0.83      0.63      0.89      0.72      0.53      6898
                 0       0.96      0.10      1.00      0.19      0.32      0.09       619

With duplicates:

Cross Validation scores:
Accuracy: 0.8311405676028013 (+/- 0.0031624898984296593)
Precision: 0.8407995062319105 (+/- 0.0018327855100581617)
Recall: 0.9713621138816524 (+/- 0.003017499337644095)
X_train: (67648, 11441), y_train: (67648,)
X_test: (7517, 11441), y_test: (7517,)

Confidence threshold > 0.9 - 2642 classified

                          pre       rec       spe        f1       geo       iba       sup

                 1       0.95      0.40      0.91      0.56      0.60      0.35      5998
                 0       0.98      0.06      1.00      0.12      0.25      0.06      1519

karlcow commented Jun 11, 2021

Thanks @ksy36
It's cool to have the numbers for making rational decisions.
Super nice work.

karlcow commented Jul 7, 2021

This was successful, and we can switch to additional phases. Let's focus on https://github.com/mozilla/webcompat-team-okrs/projects/14
