PyTorch CI Stability Action Tracker #65439


Closed · 3 of 4 tasks
janeyx99 opened this issue Sep 22, 2021 · 20 comments
Labels: module: ci (Related to continuous integration) · tracker (A tracking issue) · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

janeyx99 (Contributor) commented Sep 22, 2021

This will be a meta-issue/tracker for directly actionable issues regarding stability and reliability in PyTorch CI.

Motivation

CI stability is crucial to developer velocity and user experience. When CI is flaky, developers waste time debugging failures that are unrelated to their changes, and the on-call responsible for keeping trunk green is burdened with extra triage work. When trunk is red, PyTorch users who build from source can be affected as well.

Immediately Actionable Issues

Windows Specific

None so far!

Rolled into bigger projects (please comment relevant suggestions in their respective issues)

suo (Member) commented Sep 22, 2021

Using pretrained weights from torchvision causes a network call to download weights. Example: https://github.com/pytorch/pytorch/blob/master/test/onnx/test_pytorch_onnx_onnxruntime.py#L354

Found from failure: https://circleci.com/gh/pytorch/pytorch/16139241

A cursory search shows quite a few tests where we are doing this.

Fixing is probably as easy as switching pretrained to False; I highly doubt any of those tests are sensitive to whether the weights are pretrained or not.
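
A minimal sketch of the kind of fix suggested here, assuming a pytest-style test; the model choice and export target are illustrative, not the actual test in test_pytorch_onnx_onnxruntime.py:

```python
import io

import torch
import torchvision

def test_resnet_export_no_download():
    # pretrained=False constructs the model with randomly initialized
    # weights, so CI makes no network call to fetch a checkpoint.
    model = torchvision.models.resnet18(pretrained=False).eval()
    dummy_input = torch.randn(1, 3, 224, 224)
    # Export to an in-memory buffer; the test only cares that export works.
    torch.onnx.export(model, dummy_input, io.BytesIO())
```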

@soulitzer added the module: ci, tracker, and triaged labels Sep 22, 2021
suo (Member) commented Sep 22, 2021

Looks like the CMake binary is not available?

Example: https://github.com/pytorch/pytorch/runs/3677706168

I'm going to chalk this one up as "probably fixed by ephemeral runners"

suo (Member) commented Sep 22, 2021

Downloading NVIDIA packages during the test job failed:
https://github.com/pytorch/pytorch/runs/3681335684

Would be resolved by baking them into the AMI.

suo (Member) commented Sep 23, 2021

sccache on Windows has had network failures 25 times in the last week: https://fburl.com/scuba/opensource_ci_jobs/q3nf3g6n

Googling the error message turns up an issue that @ezyang commented on: mozilla/sccache#256. The fixes suggested in that issue are 1) reducing parallelism and 2) increasing the timeout.
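
A minimal sketch of the first suggestion, assuming the build goes through PyTorch's setup.py, which honors the MAX_JOBS environment variable; the value is illustrative:

```bash
# Fewer parallel compile jobs means fewer concurrent sccache network
# requests to the cache backend; 4 is an illustrative value.
export MAX_JOBS=4
python setup.py develop
```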

janeyx99 (Contributor, Author) commented
Problems noted by @mruberry:

Run rm -rf "${PYTORCH_FINAL_PACKAGE_DIR}"
rm: cannot remove './build': Device or resource busy

in https://github.com/pytorch/pytorch/runs/3682204176 and

Run seemethere/upload-artifact-s3@v3
Error: No files were found with the provided path: C:\1264644579\build-results. No artifacts will be uploaded.

in https://github.com/pytorch/pytorch/runs/3683674360 and

Run docker pull "${DOCKER_IMAGE}"
Error response from daemon: Get "https://308535385114.dkr.ecr.us-east-1.amazonaws.com/v2/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

in https://github.com/pytorch/pytorch/runs/3682134416 and

Sep 23 02:20:31   test_error_in_init (__main__.TestDataLoader) ... ok (0.164s)

Too long with no output (exceeded 1h30m0s): context deadline exceeded

in https://app.circleci.com/pipelines/github/pytorch/pytorch/384098/workflows/c9f78d92-487c-4a85-b97e-10b222e5c3a9/jobs/16158997 (an ASAN test job).

suo (Member) commented Sep 23, 2021

Run seemethere/upload-artifact-s3@v3
Error: No files were found with the provided path: C:\1264644579\build-results. No artifacts will be uploaded.

This is not a real upload failure; it's downstream of a build failure. We should mark this step as continue-on-error (or similar) because it's not a user-facing failure.

driazati (Contributor) commented
> Run seemethere/upload-artifact-s3@v3
> Error: No files were found with the provided path: C:\1264644579\build-results. No artifacts will be uploaded.
>
> This is not a real upload failure; it's downstream of a build failure. We should mark this step as continue-on-error (or similar) because it's not a user-facing failure.

Still something we can fix, though: instead of if: always(), those steps should have guards that check whether the build step succeeded (see the sketch below).
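
A minimal sketch of such a guard, with an illustrative step id, build command, and artifact path (not the actual pytorch workflow):

```yaml
- name: Build
  id: build
  run: python setup.py bdist_wheel

- name: Upload artifacts to S3
  uses: seemethere/upload-artifact-s3@v3
  # Still run after later failures, but only when the build step
  # actually produced something to upload.
  if: always() && steps.build.outcome == 'success'
  with:
    path: build-results
```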

janeyx99 (Contributor, Author) commented
> Still something we can fix, though: instead of if: always(), those steps should have guards that check whether the build step succeeded.

I'll take this; I've modified the big summary at the top to include this action item and assigned it to myself.

facebook-github-bot pushed a commit that referenced this issue Sep 24, 2021
Summary:
Fixes a task in #65439

And removes the Upload to GitHub step as it's redundant with the S3 step.

Pull Request resolved: #65561

Reviewed By: seemethere

Differential Revision: D31157685

Pulled By: janeyx99

fbshipit-source-id: cd23113a981eb4467fea3af14d916f6f2445a02b
facebook-github-bot pushed a commit that referenced this issue Sep 24, 2021
Summary:
This should help alleviate workflows failing due to docker pull timing out, which doesn't happen often, but did happen once in the past day.

Was also reported in #65439

Pull Request resolved: #65103

Reviewed By: driazati

Differential Revision: D31157772

Pulled By: janeyx99

fbshipit-source-id: 7bf556f849b41eeb6dea69d73e5a8e1a40dec514
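
The PR above adds retrying around the pull; a minimal sketch of that pattern (the retry counts and delays are illustrative, not necessarily what the PR does):

```bash
# Retry a flaky network command with increasing backoff before giving up.
retry () {
  "$@" || (sleep 2 && "$@") || (sleep 8 && "$@")
}

retry docker pull "${DOCKER_IMAGE}"
```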
suo (Member) commented Oct 5, 2021

A raft of issues from today's spot check:

facebook-github-bot pushed a commit that referenced this issue Oct 5, 2021
Summary:
Fixes one of the flakiness concerns mentioned in #65439 (comment)

Pull Request resolved: #66159

Reviewed By: ngimel

Differential Revision: D31406485

Pulled By: janeyx99

fbshipit-source-id: cf7834cdab58360ecef1748075d52969de2e0778
facebook-github-bot pushed a commit that referenced this issue Oct 11, 2021
Summary:
Addresses the network-risk mitigation mentioned in #65439 (comment).

I didn't include any mobile app/benchmarking changes because I think the pretrained weights matter there.

I ended up removing the changes in test_utils because those were sensitive to the pretrained variable.

I am saving the quantization test changes for another PR because they are currently disabled.

Pull Request resolved: #66312

Reviewed By: ejguan

Differential Revision: D31542992

Pulled By: janeyx99

fbshipit-source-id: 57b4f70247af25cc96c57abd9e689c34641672ff
suo (Member) commented Oct 13, 2021

PyTorch docs builds failing on installing dependencies: https://app.circleci.com/pipelines/github/pytorch/pytorch/393635/workflows/ba76eff9-428a-4e3d-8023-626b7413ed1f/jobs/16432372

Oct 13 06:43:37 ++ pip -q install -r requirements.txt
Oct 13 06:53:38   ERROR: Command errored out with exit status 128:
Oct 13 06:53:38    command: git clone -q https://github.com/pytorch/pytorch_sphinx_theme.git /var/lib/jenkins/workspace/docs/src/pytorch-sphinx-theme
Oct 13 06:53:38        cwd: None
Oct 13 06:53:38   Complete output (3 lines):
Oct 13 06:53:38   error: RPC failed; curl 18 transfer closed with outstanding read data remaining
Oct 13 06:53:38   fatal: The remote end hung up unexpectedly
Oct 13 06:53:38   fatal: protocol error: bad pack header
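
One possible mitigation (an assumption on my part, not necessarily what was done): have requirements.txt fetch the theme as a tarball over plain HTTPS instead of a git clone, avoiding the git smart protocol that failed here:

```bash
# Illustrative: pip can install straight from a repository archive URL,
# which sidesteps "git clone" entirely.
pip install https://github.com/pytorch/pytorch_sphinx_theme/archive/refs/heads/master.zip
```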

suo (Member) commented Oct 14, 2021

Access denied on Windows build: https://github.com/pytorch/pytorch/runs/3897754598 (cc @seemethere)

suo (Member) commented Oct 14, 2021

Windows dependency installation failure: https://github.com/pytorch/pytorch/runs/3898324416?check_suite_focus=true

CondaError: Downloaded bytes did not match Content-Length
  url: https://repo.anaconda.com/pkgs/main/win-64/intel-openmp-2021.3.0-haa95532_3372.conda

Would be resolved by baking it into the AMI, I guess.

Here's another:
https://github.com/pytorch/pytorch/runs/3900764593
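
A possible stopgap until the packages are baked into the AMI: raise conda's retry and timeout settings before installing (these config keys exist in conda; the values are illustrative):

```bash
# Make conda more tolerant of flaky mirrors before installing.
conda config --set remote_max_retries 5
conda config --set remote_read_timeout_secs 120.0
conda install -y intel-openmp
```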

facebook-github-bot pushed a commit that referenced this issue Oct 19, 2021
Summary:
Helps resolve a bit of #65439

Pull Request resolved: #66795

Reviewed By: suo, jerryzh168

Differential Revision: D31732043

Pulled By: janeyx99

fbshipit-source-id: 10b71865fc937f9d72f2b1c04cbf3ea9a68c8818
abitrolly commented
Is data about CI stability collected anywhere? Before making changes, it would be nice to see and set some metrics.

https://observablehq.com/@observablehq/integration-test-flakiness

suo (Member) commented Nov 7, 2021

This looks super cool! We do have data, but right now it's dumped to an internal system at FB (Meta??) for analysis. We'd love to make something that everyone can look at and help out with; I'll play with that notebook a bit.

For current master health, we have hud.pytorch.org, which presents a nice view of current jobs running on master.

abitrolly commented
@suo what is the process, then, to make these stats live open data? Should I join Meta (FB??) to champion the change? :D

suo (Member) commented Dec 6, 2021

Persistent dirty checkouts: 27 failed jobs on master in the last week: https://fburl.com/scuba/opensource_ci_jobs/vs1j5mvq
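
On persistent (non-ephemeral) runners, a common mitigation is to scrub the workspace before checkout; a minimal sketch, assuming the job can run shell in the existing checkout first:

```bash
# Reset tracked files and delete everything untracked (including
# ignored build artifacts) left over from a previous job.
git reset --hard HEAD
git clean -ffdx
```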

janeyx99 (Contributor, Author) commented Jan 7, 2022

Completed as of today (1.7.22):

No longer relevant/De-prioritized

janeyx99 (Contributor, Author) commented

Completed between 4.11 and 6.17

ZainRizvi (Contributor) commented
Closing the meta tracker since we aren't using it anymore.

Repository owner moved this from In Progress to Done in PyTorch OSS Dev Infra Oct 28, 2022