Skip to content

Fix missing attachment stuck actions #5355

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

bdoyle0182
Copy link
Contributor

@bdoyle0182 bdoyle0182 commented Nov 29, 2022

Description

This is a critical bug that's gone unsolved for several years. Since attachments are considered secondary files to the main action document, if the attachment is lost for whatever reason (upload timeout, concurrent upload, bad database replication, etc.) the action document is therefore corrupted in a stuck state because the document factory is set up to do a get of the document first when performing a putEntity or deleteEntity. The WhiskAction type overrides the get operation to include getting the attachment as a part of the get operation. However once the root document is retrieved, the attachment is then tried to be read and returns a not found making the get return a DocumentNotFoundException to the putEntity or deleteEntity. You can therefore never delete or update an action (that is broken since the attachment was lost) once in this state through the openwhisk api and must go through an openwhisk admin to delete the document directly from the artifact db.

Worse, for updating an action it incorrectly returns a 409 to the user when a conflict isn't what's actually happening in this case. The get of the WhiskAction correctly returns a DocumentNotFoundException when the attachment is not found, however the putEntity function then catches this and tries to put the document without a revision thinking it is a brand new document for the artifact db when the root document does in fact exist so the db returns a 409 when a put request is sent to it without a revision; which is what is then fed back to the user on any update action attempt.

I have implemented a simple fix that finally resolves this issue for updating and deleting an action in this missing attachment state without impacting any other behavior of the system outside of the update and delete action api. The change is simple in that we recover in the get of the WhiskAction type if getting the attachment fails for not being found only if an ignore boolean is set. This ignore boolean is only set when get is called within putEntity and getEntity, and only has an effect on the WhiskAction type.

I've reproduced and verified in my environment that this change resolves the issue. Pre change put action returns 409, delete action returns 404, and get action returns 404 if code=true otherwise 200. Post change put action succeeds, delete action succeeds, and get action keeps the same expected behavior of 404 if code=true otherwise 200.

Related issue and scope

My changes affect the following components

  • API
  • Controller
  • Message Bus (e.g., Kafka)
  • Loadbalancer
  • Scheduler
  • Invoker
  • Intrinsic actions (e.g., sequences, conductors)
  • Data stores (e.g., CouchDB)
  • Tests
  • Deployment
  • CLI
  • General tooling
  • Documentation

Types of changes

  • Bug fix (generally a non-breaking change which closes an issue).
  • Enhancement or new feature (adds new functionality).
  • Breaking change (a bug fix or enhancement which changes existing behavior).

Checklist:

  • I signed an Apache CLA.
  • I reviewed the style guides and followed the recommendations (Travis CI will check :).
  • I added tests to cover my changes.
  • My changes require further changes to the documentation.
  • I updated the documentation where necessary.

Brendan Doyle added 2 commits November 28, 2022 23:44
@bdoyle0182
Copy link
Contributor Author

bdoyle0182 commented Nov 29, 2022

Line 336 of ApiUtils is the culprit of the 409 where it will catch and try to write the document as if it were new without revision. https://github.com/apache/openwhisk/pull/5355/files#diff-4e53d29523662c987931da7fe7bbc4f559f9c8971e85a4a777f180c97f1e81a6R336

case _: NoDocumentException => logging.debug(this, s"[PUT] entity does not exist, will try to create it") create().map(newDoc => (None, newDoc))

@bdoyle0182
Copy link
Contributor Author

@rabbah you might be interested in this bug since you were on the original issue

@bdoyle0182 bdoyle0182 requested a review from style95 November 29, 2022 08:50
@codecov-commenter
Copy link

codecov-commenter commented Dec 7, 2022

Codecov Report

Merging #5355 (7337d78) into master (8578887) will decrease coverage by 4.76%.
The diff coverage is 85.71%.

@@            Coverage Diff             @@
##           master    #5355      +/-   ##
==========================================
- Coverage   81.46%   76.69%   -4.77%     
==========================================
  Files         240      240              
  Lines       14407    14407              
  Branches      618      601      -17     
==========================================
- Hits        11736    11050     -686     
- Misses       2671     3357     +686     
Impacted Files Coverage Δ
...ache/openwhisk/core/database/DocumentFactory.scala 100.00% <ø> (ø)
...ntainerpool/v2/FunctionPullingContainerProxy.scala 78.50% <50.00%> (ø)
...org/apache/openwhisk/core/entity/WhiskAction.scala 93.44% <100.00%> (ø)
...rg/apache/openwhisk/core/controller/ApiUtils.scala 64.95% <100.00%> (ø)
.../openwhisk/core/scheduler/queue/QueueManager.scala 81.50% <100.00%> (ø)
...core/database/cosmosdb/RxObservableImplicits.scala 0.00% <0.00%> (-100.00%) ⬇️
...ore/database/cosmosdb/cache/CacheInvalidator.scala 0.00% <0.00%> (-100.00%) ⬇️
...e/database/cosmosdb/cache/ChangeFeedConsumer.scala 0.00% <0.00%> (-100.00%) ⬇️
...core/database/cosmosdb/CosmosDBArtifactStore.scala 0.00% <0.00%> (-95.85%) ⬇️
...sk/core/database/cosmosdb/CosmosDBViewMapper.scala 0.00% <0.00%> (-93.90%) ⬇️
... and 22 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Member

@style95 style95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@bdoyle0182 bdoyle0182 closed this Dec 7, 2022
@bdoyle0182 bdoyle0182 reopened this Dec 7, 2022
@bdoyle0182 bdoyle0182 closed this Dec 8, 2022
@bdoyle0182 bdoyle0182 reopened this Dec 8, 2022
@bdoyle0182 bdoyle0182 closed this Dec 8, 2022
@bdoyle0182 bdoyle0182 reopened this Dec 8, 2022
@bdoyle0182 bdoyle0182 closed this Dec 8, 2022
@bdoyle0182 bdoyle0182 reopened this Dec 8, 2022
@bdoyle0182 bdoyle0182 merged commit daeadbf into apache:master Dec 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants