PyTorch CI Stability Action Tracker #65439
Comments
Using pretrained weights from torchvision causes a network call to download weights. Example: https://github.com/pytorch/pytorch/blob/master/test/onnx/test_pytorch_onnx_onnxruntime.py#L354
Found from failure: https://circleci.com/gh/pytorch/pytorch/16139241
A cursory search shows quite a few tests where we are doing this. Fixing is probably as easy as just switching.
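The comment doesn't spell out the exact switch, but presumably it means constructing the model without downloading pretrained weights. A minimal sketch, assuming the test only needs the model architecture and not the trained weights:

```python
import torch
import torchvision

# Hypothetical fix: build the model with randomly initialized weights so the
# test never hits the network. pretrained=False skips the weight download
# (newer torchvision versions spell this weights=None).
model = torchvision.models.resnet18(pretrained=False)
model.eval()

# The rest of such a test (e.g. an ONNX export smoke check) works the same way.
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "resnet18.onnx")
```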
Looks like CMake binary is not available? Example: https://github.com/pytorch/pytorch/runs/3677706168 I'm going to chalk this one up as "probably fixed by ephemeral runners".
Downloading NVIDIA dependencies in tests failed. Would be resolved by baking them into the AMI.
sccache on Windows has had network failures 25 times in the last week: https://fburl.com/scuba/opensource_ci_jobs/q3nf3g6n Googling the error message turns up an issue that @ezyang commented on: mozilla/sccache#256. Suggested fixes in that issue are 1) reducing parallelism and 2) increasing the timeout.
Problems noted by @mruberry:
- https://github.com/pytorch/pytorch/runs/3682204176
- https://github.com/pytorch/pytorch/runs/3683674360
- https://github.com/pytorch/pytorch/runs/3682134416
- https://app.circleci.com/pipelines/github/pytorch/pytorch/384098/workflows/c9f78d92-487c-4a85-b97e-10b222e5c3a9/jobs/16158997 (an ASAN test job)
This is not a real failure, it's a build failure. We should mark this step as continue-on-error (or similar) because it's not a user-facing failure.
Still something we can fix, though; we should have guards on those steps instead of.
I'll take this. I've modified the big summary at the top to include this action item and assigned it to me.
Summary: This should help alleviate workflows failing due to docker pull timing out, which doesn't happen often, but did happen once in the past day. Was also reported in #65439.
Pull Request resolved: #65103
Reviewed By: driazati
Differential Revision: D31157772
Pulled By: janeyx99
fbshipit-source-id: 7bf556f849b41eeb6dea69d73e5a8e1a40dec514
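The PR itself changes the CI workflow scripts; as a rough illustration of the retry-on-transient-failure idea (not the PR's actual implementation, and the image name below is made up), a Python sketch might look like this:

```python
import subprocess
import time

def pull_with_retries(image: str, attempts: int = 3, backoff_s: float = 10.0) -> None:
    """Retry `docker pull` a few times to ride out transient network failures."""
    for attempt in range(1, attempts + 1):
        try:
            subprocess.run(["docker", "pull", image], check=True)
            return
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(backoff_s * attempt)  # simple linear backoff between attempts

# Example usage in a CI helper script (hypothetical image name):
# pull_with_retries("pytorch/pytorch-linux-xenial-py3.6-gcc5.4")
```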
Raft of issues from spot check today:
Summary: Fixes one of the flakiness concerns mentioned in #65439 (comment).
Pull Request resolved: #66159
Reviewed By: ngimel
Differential Revision: D31406485
Pulled By: janeyx99
fbshipit-source-id: cf7834cdab58360ecef1748075d52969de2e0778
Summary: Addresses the network risk mitigation mentioned in #65439 (comment). I didn't include any mobile app/benchmarking changes because I think the pretrained weights matter there. I ended up removing the changes in test_utils because those were sensitive to the pretrained variable. I am saving the quantization test changes for another PR because they are currently disabled.
Pull Request resolved: #66312
Reviewed By: ejguan
Differential Revision: D31542992
Pulled By: janeyx99
fbshipit-source-id: 57b4f70247af25cc96c57abd9e689c34641672ff
PyTorch docs builds failing on installing dependencies: https://app.circleci.com/pipelines/github/pytorch/pytorch/393635/workflows/ba76eff9-428a-4e3d-8023-626b7413ed1f/jobs/16432372
Access denied on Windows build: https://github.com/pytorch/pytorch/runs/3897754598 (cc @seemethere)
Windows dependency installation failure: https://github.com/pytorch/pytorch/runs/3898324416?check_suite_focus=true
Would be resolved by baking into the AMI, I guess. Here's another:
Is data about CI stability collected somewhere? Before making changes it would be nice to see and set some metrics. https://observablehq.com/@observablehq/integration-test-flakiness
This looks super cool! We do have data, but right now it's dumped to an internal system at FB (Meta??) for analysis. We'd love to make something that everyone can look at and help out with; I'll play with that notebook a bit. For current master health, we have hud.pytorch.org, which presents a nice view of current jobs running on master. |
@suo what is the process, then, to make these stats live as open data? Should I join Meta (FB??) to champion the change? :D
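Not from the thread, but as a tiny illustration of the kind of per-job metric the linked flakiness notebook computes (the job names and data layout here are made up), one could aggregate CI run conclusions like this:

```python
from collections import Counter

# Hypothetical records: (job_name, conclusion) pairs pulled from CI, e.g. via the GitHub API.
runs = [
    ("linux-xenial-py3.6-gcc5.4 / test", "success"),
    ("linux-xenial-py3.6-gcc5.4 / test", "failure"),
    ("win-vs2019-cuda11.1-py3 / build", "failure"),
    ("win-vs2019-cuda11.1-py3 / build", "success"),
    ("win-vs2019-cuda11.1-py3 / build", "success"),
]

totals, failures = Counter(), Counter()
for job, conclusion in runs:
    totals[job] += 1
    if conclusion == "failure":
        failures[job] += 1

# Red percentage per job: a crude stand-in for a real flakiness metric, which
# would also need rerun information to separate flakes from real breakages.
for job in totals:
    print(f"{job}: {100 * failures[job] / totals[job]:.0f}% failed")
```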
Persistent dirty checkout, 27 failed jobs on master in the last week: https://fburl.com/scuba/opensource_ci_jobs/vs1j5mvq
Completed as of today (1.7.21):
No longer relevant/De-prioritized
Completed between 4.11 and 6.17
Closing the meta tracker since we aren't using it anymore.
This will be a meta-issue/tracker for directly actionable issues regarding stability and reliability in PyTorch CI.
Motivation
CI stability is crucial to developer velocity and user experience. Without it, developers may waste time debugging CI failures that are unrelated to their changes, and the on-call in charge of keeping trunk green gets saddled with extra triage work. When trunk is red, PyTorch users who build from source can be affected as well.
Immediately Actionable Issues
Windows Specific
none so far!
Rolled into bigger projects (please comment relevant suggestions in their respective issues)