2021H1 - Improve Machine Learning Webcompat Bot #194
We should probably move this discussion to a public forum of some sort: https://groups.google.com/g/mozilla.compatibility/c/4yQxX9EZGQ8/m/ARNK5Da7AQAJ. See also https://github.com/marco-c/autowebcompat
Since then, the Open Innovation team started a project to do auto-triage on webcompat web-bugs. Then the Open Innovation team disappeared, and the knowledge transfer was minimal, as was the documentation of the webcompat ML project. We inherited a useful project that we are not mastering, and this is a big weakness. Ksenia started to get an understanding of the project. I'm pretty sure there is a lot more we could do with these tools, but first we need to have a better hold on them. And it's cool if we can work together on making things useful for everyone.
Some pieces of a message by Marco Castelluccio (@marco-c):

Some context about bugbug

bugbug is currently being used for a few purposes. It started as a project to automatically assign a type to bugs (defect vs. enhancement vs. task; back when we introduced the "type" field we needed a way to fill it for already existing bugs), and then evolved into a platform to build ML models on bug reports: we now have many models, some of which are being used on Bugzilla, e.g. to assign a type, to assign a component (https://hacks.mozilla.org/2019/04/teaching-machines-to-triage-firefox-bugs/), to close bugs detected as spam, to detect "regression" bugs, and so on. Then, it evolved into a platform to build ML models for generic software engineering purposes: we no longer only have models that operate on bug reports, but also on test data and patches/commits (e.g. to choose which tests to run for a given patch, https://hacks.mozilla.org/2020/07/testing-firefox-more-efficiently-with-machine-learning/, and to evaluate the regression riskiness associated with a patch), and so on. bugbug also has an infrastructure (built to make it a good fit for ML projects) for Continuous Integration and Continuous Deployment using Taskcluster, and an HTTP service that other services can use for classification (e.g. Firefox CI uses it to choose which tests to run on try and on autoland, and the autonag bot uses it to classify bugs).

Deprecating webcompat-ml in favor of a bugbug-based model?

I know you don't use Bugzilla to track web-compat reports, so if you wanted to replace webcompat-ml with a bugbug-based classifier you would first need to add support to bugbug for retrieving GitHub issues (mozilla/bugbug#259), then define a new model similar to an already existing one (it is super easy to do; many external contributors have done it). I could guide you if you are interested. The main benefit of this move would be to have a proven ML pipeline and service, with very limited maintenance (as I'm maintaining the infrastructure and the ML models part, your side of the maintenance would be limited to pinging me when you notice a problem :P) and no duplication. The main drawback would be that you have to reimplement some parts of webcompat-ml in bugbug (though it should be a really small part, just the GitHub issue retrieval).

autowebcompat

Regarding autowebcompat, I think it would be an interesting project. It is a totally different kind of project (to summarize for Joe and Sylvestre: a crawler that navigates the web to automatically find web compatibility issues by comparing screenshots from different browsers using an ML model). It is more of a research project though, as it is a shot in the dark: it might work and find lots of issues, but it might not work, or it might work and find few issues. We have no way to tell how likely it is to work without investing some time in it. This kind of project could also be useful for other similar purposes: e.g. instead of looking for web compat issues, it could find issues by comparing a normal Firefox run with a tracking-protection-enabled Firefox run. Moreover, even if it didn't work, parts of it could be reused for other purposes (which might not be interesting for you, but could be interesting for us), e.g. navigating the web with an ASAN build of Firefox in search of crashes.
From @ksy36: I have a couple of questions on what would be the best way to handle the data.
From @marco-c
I'd suggest storing everything (unless it makes the DB huge, but I imagine it doesn't), and then we can just filter what we need later as part of the model training. That's what I do for Bugzilla data too, you never know what you might need in the future. And it's simpler as you don't need special cases.
Usually they are stored as artifacts of Taskcluster tasks. Taskcluster automatically handles uploads to S3 and provides an indexed URL to access them. This is the preferred way of handling DBs. Some are stored "manually" on S3 because they take a long time to be generated (multiple days), and so we upload them while the task is still running to make sure that if it stops for some reason (e.g. AWS kills the instance) we haven't lost the work done by the task up to that point. For simplicity, in the GitHub case I'd go with the first option. You hardly need to do anything to make Taskcluster upload the DB; we'll just need to define the path to the DB in the task definition (similarly to https://github.com/mozilla/bugbug/blob/53a0bf62a0fb7b31f147d830417f52e7db49bb10/infra/data-pipeline.yml#L170-L176). In the end, we'll have:
Once this is done, we can move on to the training part.
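For concreteness, here is a rough sketch of what the retrieval side discussed above could look like. It assumes bugbug's db helpers (register/append) can be reused for GitHub issues the same way they are for Bugzilla data; the DB path, artifact URL and pagination handling are placeholders, not the actual bugbug code.

```python
# Rough sketch (not the actual bugbug code) of how the webcompat issues DB
# could be built, assuming bugbug's db helpers are reused for GitHub issues
# the same way they are for Bugzilla data. Paths and URLs are placeholders.
import requests

from bugbug import db

ISSUES_DB = "data/github_issues.json"  # placeholder path
db.register(
    ISSUES_DB,
    "https://community-tc.services.mozilla.com/api/index/.../github_issues.json.zst",  # placeholder indexed artifact URL
    1,
)


def fetch_webcompat_issues(page=1):
    """Fetch one page of issues from the web-bugs repository."""
    response = requests.get(
        "https://api.github.com/repos/webcompat/web-bugs/issues",
        params={"state": "all", "per_page": 100, "page": page},
    )
    response.raise_for_status()
    return response.json()


def update_db():
    # Store everything we get back; filtering happens later, at training time.
    db.append(ISSUES_DB, fetch_webcompat_issues())
```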
From @marco-c: I've added a Taskcluster task to retrieve the webcompat issues (mozilla/bugbug@cbd43a0); here's an example run: https://community-tc.services.mozilla.com/tasks/F37Si0aYS-aGCDkahMdKTw. It will run at least once every two weeks from now on. I've also added support for building models that operate on issues (mozilla/bugbug@7ed57c2 and mozilla/bugbug@5d1fcbd), and I've defined a basic model (a model that simply identifies issues that affect Firefox) to try out the new features: mozilla/bugbug@c45dd7c. The new model can be trained with the new infrastructure. Here's what's left to do:
Ksenia (@ksy36) has done very cool work on https://github.com/mozilla/bugbug/pull/2242/files
@ksy36 also implemented updating the issues DB by re-fetching issues that have been modified since the last run: mozilla/bugbug#2281.
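For reference, here is a rough sketch of that incremental-update idea. The `since` query parameter is part of the GitHub issues API; the helper names and the way the DB is merged are illustrative rather than a description of the actual patch.

```python
# Rough sketch of incrementally updating a local issues DB using the GitHub
# API's `since` parameter; the real implementation is in mozilla/bugbug#2281.
# Helper names and the storage format are illustrative.
from datetime import datetime, timedelta, timezone

import requests


def fetch_modified_since(since):
    """Return all web-bugs issues updated since the given UTC datetime."""
    issues, page = [], 1
    while True:
        response = requests.get(
            "https://api.github.com/repos/webcompat/web-bugs/issues",
            params={
                "state": "all",
                "since": since.strftime("%Y-%m-%dT%H:%M:%SZ"),
                "per_page": 100,
                "page": page,
            },
        )
        response.raise_for_status()
        batch = response.json()
        if not batch:
            return issues
        issues.extend(batch)
        page += 1


def merge_into_db(existing, updated):
    """Replace stale copies of modified issues, keyed by issue number."""
    for issue in updated:
        existing[issue["number"]] = issue
    return existing


# Example: refresh everything touched since the last run two weeks ago.
recent = fetch_modified_since(datetime.now(timezone.utc) - timedelta(days=14))
```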
From @ksy36: I've started looking into creating the model for webcompat and have a couple of questions; I would appreciate it if you could share your thoughts. With the current model we don't extract features from the issue body yet (it was marked as enhancement/experiment in mozilla/webcompat-ml#3), and I'm wondering what the optimal way of handling this case is. As far as I understand, all models in bugbug extract certain features. Perhaps in my case I could have an empty list in https://github.com/mozilla/bugbug/blob/master/bugbug/models/browsername.py#L24, or would you suggest extracting some features? Another question I have is about the title change. In the current model we use the original titles (they are extracted from the events list, i.e. we look for the title the issue had before the "renamed" event).
From myself:
I'd suggest replicating the original model first; then we can add more features later if we want to try improving it. The BrowserName model was using three sources for the features: the title, the first comment, and generic characteristics (in this case, the total number of comments). I defined all three just to showcase them, but we can remove any of them. In general, the models have an extraction pipeline (e.g. https://github.com/mozilla/bugbug/blob/c7fe80ccc29fa5fd5de2b7c7ed59e8e6ce829c09/bugbug/models/browsername.py#L35-L54) where the first step parses the issues and extracts data from them (the input is a list of issues, the output is a list of dicts with "data", "title" and "first_comment" keys, https://github.com/mozilla/bugbug/blob/c7fe80ccc29fa5fd5de2b7c7ed59e8e6ce829c09/bugbug/issue_features.py#L77-L79), and the second step transforms the list of dicts into numpy arrays that can be given as input to the actual ML algorithm. To remove the generic characteristics, you can simply remove the corresponding step from the extraction pipeline. Then, if you want to remove the first comment (i.e. the issue body) from consideration, you can remove these three lines: https://github.com/mozilla/bugbug/blob/c7fe80ccc29fa5fd5de2b7c7ed59e8e6ce829c09/bugbug/models/browsername.py#L47-L51 (basically, the "first_comment" data from the list of dicts won't be turned into part of the numpy array that is given as input to the ML algorithm).
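To make the two-step pipeline above more concrete, here is a minimal sketch built with plain scikit-learn rather than bugbug's own helper classes. The dict keys mirror the ones mentioned above ("data", "title", "first_comment"); the class names, vectorizer choices and the XGBoost estimator are illustrative.

```python
# Minimal sketch of the two-step extraction described above, using plain
# scikit-learn instead of bugbug's own helper classes.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier


def extract(issue):
    """Step 1: turn a raw GitHub issue into a dict of feature sources."""
    return {
        "data": {"comment_count": issue.get("comments", 0)},  # generic characteristics
        "title": issue.get("title") or "",
        "first_comment": issue.get("body") or "",  # drop this key (and its transformer) to ignore the body
    }


def build_pipeline():
    """Step 2: vectorize each feature source and feed the result to the classifier.

    Removing the "first_comment" (or "data") entry below is the equivalent of
    deleting the corresponding lines from browsername.py.
    """
    vectorizer = ColumnTransformer(
        [
            ("data", DictVectorizer(), "data"),
            ("title", TfidfVectorizer(), "title"),
            ("first_comment", TfidfVectorizer(), "first_comment"),
        ]
    )
    return Pipeline([("features", vectorizer), ("estimator", XGBClassifier())])


# Usage: `issues` is a list of raw issue dicts, `labels` a list of 0/1 classes.
# frame = pd.DataFrame([extract(issue) for issue in issues])
# build_pipeline().fit(frame, labels)
```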
We have something similar for Bugzilla bugs. For Bugzilla bugs it is super complex because we're handling a lot of fields and a lot of corner cases, but for GitHub issues it should be simpler (especially if you limit the "rollback" to the title). You'd need to patch https://github.com/mozilla/bugbug/blob/c7fe80ccc29fa5fd5de2b7c7ed59e8e6ce829c09/bugbug/issue_features.py#L19 in a similar way to what I did here (mozilla/bugbug@8bab68a) and define a rollback function for issues. Then, you can use the rollback in your model like this: https://github.com/mozilla/bugbug/blob/c7fe80ccc29fa5fd5de2b7c7ed59e8e6ce829c09/bugbug/models/component.py#L94. Sorry I sent you so many links, but the situation should be relatively simple in the end :)
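Here is a hedged sketch of what such a title rollback could look like. The "renamed" event with its rename.from / rename.to payload is part of the GitHub issues events API; storing the events list under an "events" key on the issue dict is an assumption made for this example.

```python
# Sketch of rolling back an issue title using the GitHub issues events API:
# "renamed" events carry a "rename" object with "from" and "to" fields.
# Keeping the events under an "events" key on the issue dict is an assumption.
def original_title(issue):
    """Return the title the issue had when it was filed."""
    renames = [e for e in issue.get("events", []) if e.get("event") == "renamed"]
    if renames:
        # The first rename records the title the issue was created with.
        return renames[0]["rename"]["from"]
    return issue["title"]


def rollback_title(issue):
    """Return a copy of the issue with its title rolled back, everything else untouched."""
    rolled_back = dict(issue)
    rolled_back["title"] = original_title(issue)
    return rolled_back
```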
From @marco-c: I've released a new version of bugbug with the new model. Unfortunately there is still a (very easy) problem to solve: the OOM is occurring while calculating feature importances. There are two possible fixes:
Once this is done, the last remaining step is using the model. If you want to go the second way (using the HTTP service), we can add a new parameter to https://github.com/mozilla/bugbug/blob/f1adf21880b5af39383febe363cfd398d6acd695/http_service/bugbug_http/app.py#L345 and https://github.com/mozilla/bugbug/blob/f1adf21880b5af39383febe363cfd398d6acd695/http_service/bugbug_http/app.py#L412 to select the bug tracking system ("bugzilla" or "github"), and an equivalent to https://github.com/mozilla/bugbug/blob/f1adf21880b5af39383febe363cfd398d6acd695/http_service/bugbug_http/models.py#L55 for classifying issues. Installing bugbug in your environment is easier to implement, but you might need to periodically bump its version to stay in sync with changes in bugbug. E.g. if you make an improvement to the bugbug model and ship it, you'll need to bump the local bugbug version right after the training is done; if the new model is somehow broken and we roll back to an earlier version, you need to change your local bugbug version back too; and so on and so forth. With the HTTP service, instead, you don't suffer from any of these problems: if the model is broken, I simply roll back the HTTP service to an earlier version and your REST API calls will keep working exactly as before. To summarize, I suggest the HTTP service way for a more stable and more maintainable usage of bugbug, but going the local bugbug way is also acceptable as a starting point. This is what we did in our Bugzilla bot: we started with bugbug integrated directly in the bot, and later switched to using the HTTP service.
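As an illustration of the HTTP-service route, here is a hypothetical sketch of what the call from the webcompat bot could look like once issue classification is exposed; the endpoint path, model name and response shape are assumptions based on the proposal above, not the current bugbug API.

```python
# Hypothetical sketch of how the webcompat bot could query the bugbug HTTP
# service once GitHub issue classification is exposed. Endpoint path, model
# name and response shape are assumptions based on the proposal above.
import requests

BUGBUG_HTTP_SERVICE = "https://bugbug.herokuapp.com"  # placeholder base URL


def request_classification(model, issue_number):
    """Ask the service to classify one webcompat issue.

    The service classifies in the background, so a 202 response means
    "not ready yet, ask again later".
    """
    response = requests.get(
        f"{BUGBUG_HTTP_SERVICE}/{model}/predict/github/webcompat/web-bugs/{issue_number}",
        headers={"X-Api-Key": "webcompat-bot"},
    )
    response.raise_for_status()
    if response.status_code == 202:
        return {"ready": False}
    result = response.json()
    result["ready"] = True
    return result
```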
From me: As for using the model, I'll look into modifying the HTTP service to classify GitHub issues as you suggested. I've noticed that classification happens in the background and that to get the results I need to make a second request. I have found an example of batch processing in autonag (https://github.com/mozilla/relman-auto-nag/blob/master/auto_nag/bugbug_utils.py#L25) where the function keeps retrying to get bugs that are "ready" until a certain limit is reached, so it looks like I could use a similar approach. Also, I wonder whether it's preferable to classify issues in batches or individually. With the current model we classify on a per-issue basis (a GitHub webhook event triggers the classification script). We could instead create a cron job for classifying the issues in batches and run it a couple of times a day; however, in this case it's possible that a human will get to some issues before they're classified. We currently get around 700-800 reports a week, though, and I suppose sending a request to the bugbug service for each individual issue is not ideal.
From @marco-c: Yes, the service runs the classification in the background as it can be slower than the maximum 30 seconds allowed by Heroku for a response (e.g. in the case of batch classification; it won't be that slow for single classifications). For your use case, single classifications should work well. In autonag we were requesting thousands of bugs at a time (we use a poll mechanism, so we query Bugzilla for e.g. 7 days' worth of bugs and classify all of them in one go); that's why we really needed batch classification (the classification itself is actually pretty quick, the slow part is retrieving data from Bugzilla).
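A single webhook-triggered classification with a small retry loop (mirroring the autonag approach mentioned above) could then look roughly like this, reusing the hypothetical request_classification() helper from the previous sketch; the model name and retry settings are placeholders.

```python
# Sketch of the retry loop discussed above: ask for a classification when the
# webhook fires, then poll until the background job finishes or a retry limit
# is hit. Reuses the hypothetical request_classification() helper above.
import time


def classify_with_retries(issue_number, retries=10, delay=3.0):
    """Poll the service until the result for this issue is ready, or give up."""
    for _ in range(retries):
        result = request_classification("needsdiagnosis", issue_number)
        if result.get("ready"):
            return result
        time.sleep(delay)
    return None  # give up; a human will triage the issue instead
```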
webcompat/webcompat.com#3574 has been deployed, and issues that are labelled needsdiagnosis=False with high probability by bugbug are now closed automatically in the private and public repositories. Public issues that are closed that way receive placeholder text instead of the actual content (we don't want to accidentally post abusive content, because it is not moderated). I have also disabled webhook events for the old ML bot. What's left to do:
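Schematically, the auto-close behaviour described above boils down to something like the following; the probability threshold, prediction shape and GitHub helper calls are illustrative, not the actual webcompat.com code.

```python
# Illustrative sketch of the auto-close logic described above; the threshold,
# prediction format and github_client helpers are placeholders.
PLACEHOLDER_BODY = "Original content removed pending moderation."
CONFIDENCE_THRESHOLD = 0.97  # illustrative value, not the deployed one


def maybe_autoclose(issue, prediction, github_client, public_repo=True):
    """Close the issue only if bugbug is confident it does not need diagnosis."""
    if prediction["needsdiagnosis"] or prediction["probability"] < CONFIDENCE_THRESHOLD:
        return False
    if public_repo:
        # Never expose unmoderated content: replace the body with a placeholder.
        github_client.update_issue(issue["number"], body=PLACEHOLDER_BODY)
    github_client.close_issue(issue["number"])
    return True
```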
Duplicate detection for Bugzilla is something that a contributor was working on but never finished. It is different from a classification problem, as for duplicate detection you want to compare an issue with already existing issues.
There are two types of duplication in the context of webcompat.
I wonder if @ksy36 is talking about the first type of duplication here.
Yeah, it's the first type of duplication; I should have clarified that :) It's for the case when an issue is closed as a duplicate of another GitHub issue, bypassing the diagnosis event. So the content of the issue is valid and should contribute to the same class as the issues that have been moved to needsdiagnosis. An example of such an issue is webcompat/web-bugs#74902. I have noticed that our model marks certain issues as needsdiagnosis=False even though the content of the description seems valid, for example webcompat/web-bugs#75721. So I think it's worth comparing the metrics of the current model and the model with added duplicates.
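A generic way to run that comparison is to cross-validate the same pipeline on both dataset variants; the metric and helper names below are illustrative (extract() and build_pipeline() refer to the earlier pipeline sketch), not the ones used in the actual evaluation.

```python
# Illustrative comparison of the two dataset variants via cross-validation.
# extract() and build_pipeline() come from the earlier pipeline sketch; the
# F1 metric and 5-fold setup are arbitrary choices for the example.
import pandas as pd
from sklearn.model_selection import cross_val_score


def evaluate(pipeline, issues, labels):
    """Return the mean cross-validated F1 score for one dataset variant."""
    frame = pd.DataFrame([extract(issue) for issue in issues])
    return cross_val_score(pipeline, frame, labels, cv=5, scoring="f1").mean()


# score_original = evaluate(build_pipeline(), original_issues, original_labels)
# score_with_dupes = evaluate(build_pipeline(), issues_plus_duplicates, labels_plus_duplicates)
```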
I've compared the two models and the original one has better cross-validation scores, so I'm going to keep the original model. Original:
With duplicates:
Thanks @ksy36!
This was successful, and we can switch to additional phases. Let's focus on https://github.com/mozilla/webcompat-team-okrs/projects/14
We had two choices: either mastering the current infrastructure or re-using the infrastructure of bugbug.
Here is a summary of the previous discussions: