PyTorch CI Stability Action Tracker #65439


Closed · 3 of 4 tasks
janeyx99 opened this issue Sep 22, 2021 · 20 comments
Labels: module: ci (Related to continuous integration) · tracker (A tracking issue) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

janeyx99 (Contributor) commented Sep 22, 2021

This will be a meta-issue/tracker for directly actionable issues regarding stability and reliability in PyTorch CI.

Motivation

CI stability is crucial to developer velocity and user experience. When CI is flaky, developers waste time debugging failures that are unrelated to their changes, and the on-call responsible for keeping trunk green is burdened with extra triage work. When trunk is red, PyTorch users who build from source can be affected as well.

Immediately Actionable Issues

Windows Specific

None so far!

Rolled into bigger projects (please comment relevant suggestions in their respective issues)

suo (Member) commented Sep 22, 2021

Using pretrained weights from torchvision causes a network call to download weights. Example: https://github.com/pytorch/pytorch/blob/master/test/onnx/test_pytorch_onnx_onnxruntime.py#L354

Found from failure: https://circleci.com/gh/pytorch/pytorch/16139241

A cursory search shows quite a few tests where we are doing this.

Fixing is probably as easy as switching pretrained to False; I highly doubt any of those tests are sensitive to whether the weights are pretrained or not.
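
A minimal sketch of the kind of fix suggested here, assuming a pytest-style test; the model choice and export target are illustrative, not the actual test in test_pytorch_onnx_onnxruntime.py:

```python
import io

import torch
import torchvision

def test_resnet_export_no_download():
    # pretrained=False constructs the model with randomly initialized
    # weights, so CI makes no network call to fetch a checkpoint.
    model = torchvision.models.resnet18(pretrained=False).eval()
    dummy_input = torch.randn(1, 3, 224, 224)
    # Export to an in-memory buffer; the test only cares that export works.
    torch.onnx.export(model, dummy_input, io.BytesIO())
```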

@soulitzer added the module: ci, tracker, and triaged labels Sep 22, 2021
suo (Member) commented Sep 22, 2021

Looks like the CMake binary is not available?

Example: https://github.com/pytorch/pytorch/runs/3677706168

I'm going to chalk this one up as "probably fixed by ephemeral runners"

suo (Member) commented Sep 22, 2021

Downloading NVIDIA packages during the test job failed:
https://github.com/pytorch/pytorch/runs/3681335684

Would be resolved by baking them into the AMI.

suo (Member) commented Sep 23, 2021

sccache on Windows has had network failures 25 times in the last week: https://fburl.com/scuba/opensource_ci_jobs/q3nf3g6n

Googling the error message turns up an issue that @ezyang commented on: mozilla/sccache#256. The fixes suggested in that issue are 1) reducing parallelism and 2) increasing the timeout.
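
A minimal sketch of the first suggestion, assuming the build goes through PyTorch's setup.py, which honors the MAX_JOBS environment variable; the value is illustrative:

```bash
# Fewer parallel compile jobs means fewer concurrent sccache network
# requests to the cache backend; 4 is an illustrative value.
export MAX_JOBS=4
python setup.py develop
```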

janeyx99 (Contributor, Author) commented
Problems noted by @mruberry:

Run rm -rf "${PYTORCH_FINAL_PACKAGE_DIR}"
rm: cannot remove './build': Device or resource busy

in https://github.com/pytorch/pytorch/runs/3682204176 and

Run seemethere/upload-artifact-s3@v3
Error: No files were found with the provided path: C:\1264644579\build-results. No artifacts will be uploaded.

in https://github.com/pytorch/pytorch/runs/3683674360 and

Run docker pull "${DOCKER_IMAGE}"
Error response from daemon: Get "https://308535385114.dkr.ecr.us-east-1.amazonaws.com/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

in https://github.com/pytorch/pytorch/runs/3682134416 and

Sep 23 02:20:31   test_error_in_init (__main__.TestDataLoader) ... ok (0.164s)

Too long with no output (exceeded 1h30m0s): context deadline exceeded

in https://app.circleci.com/pipelines/github/pytorch/pytorch/384098/workflows/c9f78d92-487c-4a85-b97e-10b222e5c3a9/jobs/16158997 (an ASAN test job).

suo (Member) commented Sep 23, 2021

Run seemethere/upload-artifact-s3@v3
Error: No files were found with the provided path: C:\1264644579\build-results. No artifacts will be uploaded.

This is not a real upload failure; it's downstream of a build failure. We should mark this step as continue-on-error (or similar) because it's not a user-facing failure.

driazati (Contributor) commented
> Run seemethere/upload-artifact-s3@v3
> Error: No files were found with the provided path: C:\1264644579\build-results. No artifacts will be uploaded.
>
> This is not a real upload failure; it's downstream of a build failure. We should mark this step as continue-on-error (or similar) because it's not a user-facing failure.

Still something we can fix, though: instead of if: always(), those steps should have guards that check whether the build step succeeded (see the sketch below).
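
A minimal sketch of such a guard, with an illustrative step id, build command, and artifact path (not the actual pytorch workflow):

```yaml
- name: Build
  id: build
  run: python setup.py bdist_wheel

- name: Upload artifacts to S3
  uses: seemethere/upload-artifact-s3@v3
  # Still run after later failures, but only when the build step
  # actually produced something to upload.
  if: always() && steps.build.outcome == 'success'
  with:
    path: build-results
```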

janeyx99 (Contributor, Author) commented
> Still something we can fix, though: instead of if: always(), those steps should have guards that check whether the build step succeeded.

I'll take this; I've modified the big summary at the top to include this action item and assigned it to myself.

facebook-github-bot pushed a commit that referenced this issue Sep 24, 2021
Summary:
Fixes a task in #65439

And removes the Upload to GitHub step as it's redundant with the S3 step.

Pull Request resolved: #65561

Reviewed By: seemethere

Differential Revision: D31157685

Pulled By: janeyx99

fbshipit-source-id: cd23113a981eb4467fea3af14d916f6f2445a02b
facebook-github-bot pushed a commit that referenced this issue Sep 24, 2021
Summary:
This should help alleviate workflows failing due to docker pull timing out, which doesn't happen often, but did happen once in the past day.

Was also reported in #65439

Pull Request resolved: #65103

Reviewed By: driazati

Differential Revision: D31157772

Pulled By: janeyx99

fbshipit-source-id: 7bf556f849b41eeb6dea69d73e5a8e1a40dec514
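
The PR above adds retrying around the pull; a minimal sketch of that pattern (the retry counts and delays are illustrative, not necessarily what the PR does):

```bash
# Retry a flaky network command with increasing backoff before giving up.
retry () {
  "$@" || (sleep 2 && "$@") || (sleep 8 && "$@")
}

retry docker pull "${DOCKER_IMAGE}"
```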
suo (Member) commented Oct 5, 2021

A raft of issues from today's spot check:

facebook-github-bot pushed a commit that referenced this issue Oct 5, 2021
Summary:
Fixes one of the flakiness concerns mentioned in #65439 (comment)

Pull Request resolved: #66159

Reviewed By: ngimel

Differential Revision: D31406485

Pulled By: janeyx99

fbshipit-source-id: cf7834cdab58360ecef1748075d52969de2e0778
facebook-github-bot pushed a commit that referenced this issue Oct 11, 2021
Summary:
Addresses the network-risk mitigation mentioned in #65439 (comment).

I didn't include any mobile app/benchmarking changes because I think the pretrained weights matter there.

I ended up removing the changes in test_utils because those were sensitive to the pretrained variable.

I am saving the quantization test changes for another PR because they are currently disabled.

Pull Request resolved: #66312

Reviewed By: ejguan

Differential Revision: D31542992

Pulled By: janeyx99

fbshipit-source-id: 57b4f70247af25cc96c57abd9e689c34641672ff
suo (Member) commented Oct 13, 2021

PyTorch docs builds failing on installing dependencies: https://app.circleci.com/pipelines/github/pytorch/pytorch/393635/workflows/ba76eff9-428a-4e3d-8023-626b7413ed1f/jobs/16432372

Oct 13 06:43:37 ++ pip -q install -r requirements.txt
Oct 13 06:53:38   ERROR: Command errored out with exit status 128:
Oct 13 06:53:38    command: git clone -q https://github.com/pytorch/pytorch_sphinx_theme.git /var/lib/jenkins/workspace/docs/src/pytorch-sphinx-theme
Oct 13 06:53:38        cwd: None
Oct 13 06:53:38   Complete output (3 lines):
Oct 13 06:53:38   error: RPC failed; curl 18 transfer closed with outstanding read data remaining
Oct 13 06:53:38   fatal: The remote end hung up unexpectedly
Oct 13 06:53:38   fatal: protocol error: bad pack header
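
One possible mitigation (an assumption on my part, not necessarily what was done): have requirements.txt fetch the theme as a tarball over plain HTTPS instead of a git clone, avoiding the git smart protocol that failed here:

```bash
# Illustrative: pip can install straight from a repository archive URL,
# which sidesteps "git clone" entirely.
pip install https://github.com/pytorch/pytorch_sphinx_theme/archive/refs/heads/master.zip
```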

suo (Member) commented Oct 14, 2021

Access denied on Windows build: https://github.com/pytorch/pytorch/runs/3897754598 (cc @seemethere)

suo (Member) commented Oct 14, 2021

Windows dependency installation failure: https://github.com/pytorch/pytorch/runs/3898324416?check_suite_focus=true

CondaError: Downloaded bytes did not match Content-Length
  url: https://repo.anaconda.com/pkgs/main/win-64/intel-openmp-2021.3.0-haa95532_3372.conda

Would be resolved by baking it into the AMI, I guess.

Here's another:
https://github.com/pytorch/pytorch/runs/3900764593
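
A possible stopgap until the packages are baked into the AMI: raise conda's retry and timeout settings before installing (these config keys exist in conda; the values are illustrative):

```bash
# Make conda more tolerant of flaky mirrors before installing.
conda config --set remote_max_retries 5
conda config --set remote_read_timeout_secs 120.0
conda install -y intel-openmp
```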

facebook-github-bot pushed a commit that referenced this issue Oct 19, 2021
Summary:
Helps resolve a bit of #65439

Pull Request resolved: #66795

Reviewed By: suo, jerryzh168

Differential Revision: D31732043

Pulled By: janeyx99

fbshipit-source-id: 10b71865fc937f9d72f2b1c04cbf3ea9a68c8818
abitrolly commented
Is data about CI stability collected anywhere? Before making changes, it would be nice to see and set some metrics.

https://observablehq.com/@observablehq/integration-test-flakiness

suo (Member) commented Nov 7, 2021

This looks super cool! We do have data, but right now it's dumped to an internal system at FB (Meta??) for analysis. We'd love to make something that everyone can look at and help out with; I'll play with that notebook a bit.

For current master health, we have hud.pytorch.org, which presents a nice view of current jobs running on master.

abitrolly commented
@suo what is the process, then, to make these stats live open data? Should I join Meta (FB??) to champion the change? :D

suo (Member) commented Dec 6, 2021

Persistent dirty checkouts: 27 failed jobs on master in the last week: https://fburl.com/scuba/opensource_ci_jobs/vs1j5mvq
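
On persistent (non-ephemeral) runners, a common mitigation is to scrub the workspace before checkout; a minimal sketch, assuming the job can run shell in the existing checkout first:

```bash
# Reset tracked files and delete everything untracked (including
# ignored build artifacts) left over from a previous job.
git reset --hard HEAD
git clean -ffdx
```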

janeyx99 (Contributor, Author) commented Jan 7, 2022

Completed as of today (1.7.22):

No longer relevant/De-prioritized

janeyx99 (Contributor, Author) commented

Completed between 4.11 and 6.17

ZainRizvi (Contributor) commented
Closing the meta tracker since we aren't using it anymore.

Repository owner moved this from In Progress to Done in PyTorch OSS Dev Infra Oct 28, 2022