Async jobs add endtime #15
Closed
Description
Overview:
CloudStack executes certain API calls asynchronously when they take a long time to complete. Such a call immediately returns the id of the job responsible for executing the command. This job id can then be used to query the status of the job through the queryAsyncJobResult API call.
Among other fields, the result of this API query returns the 'created' field, which is the timestamp of when the asynchronous job started. There is currently no mechanism that captures or persists the time at which the job finished. As a result, the queryAsyncJobResult API does not return an 'end_time' field.
QueryAsyncJobResult API changes:
The requirement outlined here is for CloudStack to capture the timestamp of when an asynchronous job finishes and to store it in the existing 'removed' column of the async_job table. Please note that the 'removed' field is not currently used anywhere else in the CloudStack code, and the 'removed' database column is not currently populated by any other process. A new response tag called 'end_time' should be added to the queryAsyncJobResult API. When a queryAsyncJobResult API request is made, the value of the 'removed' database column should be mapped to this 'end_time' response tag. The response tag is called 'end_time' rather than 'removed' because that name is more descriptive to an API user.
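The mapping described above could look roughly like the following. This is a minimal sketch, not the actual CloudStack code: `AsyncJobRow` and `AsyncJobResponseMapper` are hypothetical names, and the choice to omit 'end_time' while a job is unfinished is an assumption made for illustration.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for one row of the async_job table.
final class AsyncJobRow {
    final long id;
    final Instant created;
    final Instant removed; // null while the job has not finished

    AsyncJobRow(long id, Instant created, Instant removed) {
        this.id = id;
        this.created = created;
        this.removed = removed;
    }
}

// Hypothetical mapper: exposes the 'removed' column as the 'end_time' tag.
final class AsyncJobResponseMapper {
    static Map<String, Object> toResponse(AsyncJobRow row) {
        Map<String, Object> response = new LinkedHashMap<>();
        response.put("jobid", row.id);
        response.put("created", row.created.toString());
        // Assumption: the tag is left out until the job has an end time.
        if (row.removed != null) {
            response.put("end_time", row.removed.toString());
        }
        return response;
    }
}
```

With this shape, the database column keeps its existing name while API users only ever see 'end_time'.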
Management server process changes:
When an asynchronous job completes, it is marked as complete by updating its finished status in the database. New functionality should be added to also update the 'removed' field with the timestamp of when the asynchronous job completed.
In addition, when a CloudStack management server is stopped and started again (gracefully or ungracefully), neither this management server nor the other running management servers have any knowledge of the true status of the asynchronous jobs it owned while it was down. The statuses of these jobs are not tracked or updated in the database by any management server during this time, regardless of whether the jobs are still running, finished successfully or finished with an error status. Currently, there is no mechanism to notify other running management servers that a specific management server is stopping, or to tell a specific running management server to start monitoring the currently running asynchronous jobs that belong to the management server being stopped. When a CloudStack management server starts up, it does not do a blanket delete of the asynchronous jobs it owns. Instead, it finds in the database all the asynchronous jobs it owns whose status is 'in_progress' and updates them to a 'failed' status. At the same time, it should now also mark them as 'removed' by updating the 'removed' field with the current timestamp.
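The startup behaviour described above can be sketched as follows. This is an in-memory illustration under stated assumptions, not the management server's real implementation: `StartupJob`, `StartupJobCleanup`, `ownerMsid` and the status names are hypothetical, and a real version would issue a database update instead of looping over a list.

```java
import java.time.Instant;
import java.util.List;

// Hypothetical in-memory model of an async_job row.
final class StartupJob {
    enum Status { IN_PROGRESS, SUCCEEDED, FAILED }

    final long id;
    final long ownerMsid; // id of the management server that owns the job
    Status status;
    Instant removed;      // end time; null until the job finishes or is failed

    StartupJob(long id, long ownerMsid, Status status) {
        this.id = id;
        this.ownerMsid = ownerMsid;
        this.status = status;
    }
}

final class StartupJobCleanup {
    // Fails every in-progress job owned by 'msid' and stamps 'removed' with 'now'.
    // Returns the number of jobs that were updated.
    static int failOrphanedJobs(List<StartupJob> jobs, long msid, Instant now) {
        int updated = 0;
        for (StartupJob job : jobs) {
            if (job.ownerMsid == msid && job.status == StartupJob.Status.IN_PROGRESS) {
                job.status = StartupJob.Status.FAILED;
                job.removed = now; // new behaviour: record the end time as well
                updated++;
            }
        }
        return updated;
    }
}
```

Jobs owned by other management servers, and jobs that already finished, are left untouched, which matches the "no blanket delete" behaviour described above.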
Garbage collection:
CloudStack also has a garbage collector that cleans up asynchronous jobs in the database by periodically deleting unfinished and completed job records from the async_job table. It uses the configurable global setting 'job.cancel.threshold.minutes' to cancel jobs that are still in the queue. It also uses the configurable global setting 'job.expire.minutes', which specifies how long, in minutes, to keep both asynchronous jobs that have not been processed yet and completed asynchronous job records before deleting them. Unfinished jobs that have not been processed yet and that are older than this expiry time will be expired and deleted by the garbage collector.
Asynchronous jobs that completed before the timeout threshold will also be deleted. When these jobs complete, they are marked as complete by updating their finished status in the database. Currently, the garbage collector uses the 'created' database column to find completed asynchronous job records that are older than the cutoff derived from the 'job.expire.minutes' global setting. It should no longer use the 'created' column for this. It should use the 'removed' column instead.
Because the garbage collector deletes these records, any reporting on asynchronous jobs should be done before the garbage collector starts its clean-up task.
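The effect of switching the garbage collector from 'created' to 'removed' can be sketched as below. This is an illustrative in-memory version under assumptions: `GcJob` and `AsyncJobGc` are hypothetical names, `expireMinutes` stands in for the 'job.expire.minutes' setting, and the real collector would run a database query rather than filter a list.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory model of an async_job row.
final class GcJob {
    final long id;
    final Instant created;
    final Instant removed; // end time; null while the job is unfinished

    GcJob(long id, Instant created, Instant removed) {
        this.id = id;
        this.created = created;
        this.removed = removed;
    }
}

final class AsyncJobGc {
    // Returns the ids of completed jobs whose 'removed' timestamp is older
    // than now minus expireMinutes. 'created' is deliberately not consulted.
    static List<Long> expiredCompletedJobIds(List<GcJob> jobs, Instant now, long expireMinutes) {
        Instant cutoff = now.minus(Duration.ofMinutes(expireMinutes));
        return jobs.stream()
                .filter(job -> job.removed != null && job.removed.isBefore(cutoff))
                .map(job -> job.id)
                .collect(Collectors.toList());
    }
}
```

One consequence of keying on 'removed' is that a long-running job that finished only recently is kept for the full expiry window, even if it was created long before the cutoff.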
Types of changes
GitHub Issue/PRs
Screenshots (if appropriate):
How Has This Been Tested?
Checklist:
Testing