Skip to content

Fix race between executor insert and advanceTime #1554

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jan 24, 2017
Merged

Fix race between executor insert and advanceTime #1554

merged 5 commits into from
Jan 24, 2017

Conversation

pongad
Copy link
Contributor

@pongad pongad commented Jan 20, 2017

This is a follow up to #1552, which should be merged first.
cc @davidtorres

Fix race between executor insert and advanceTime

This is a follow up to
https://github.com/GoogleCloudPlatform/google-cloud-java/pull/1552 .

The linked PR has a race condition:
If the ExtensionJobs is added to the queue after the alarm is set up,
it is possible that the alarm would fire before the jobs are added.
Fixing this is easy, just set up the alarm after.
However, doing this consistently deadlocks the tests. Why?

The tests uses a fake executor.
In tests, time does not flow naturally, but is forced to increment,
usually by the executor's `advanceTime` method.
There is a race between the test thread advancing the time and
the mock server thread inserting more tasks to the fake executor.
If the tasks get inserted first, all is well.
Otherwise, `advanceTime` doesn't see the tasks,
and they do not get executed.
The fix is to check the "current time" every time a task is inserted.
If the task is inserted "in the past", we run the task immediately.
Doing this still deadlocks the tests. Why?

The fake executor needs to lock the task queue when registering a task.
If the task is inserted in the past, it also needs to run the task.
Running the task might involve sending a requst to the mock server.
A GRPC thread on the mock server might handle the request by adding more
tasks to the executor.
The executor's queue is locked by the first thread,
resulting in a deadlock.
The fix is to lock the queue just long enough to retrieve a task,
then execute the task without the lock.

The previous implementation modifies a HashMap while iterating through
its key set, causeing unpredictable iteration behavior.
This might be the reason our tests intermittently deadlock.

The new implementation uses a PriorityQueue.
The time complexity is O(M * log N),
where M is the number of expirations before the cut over time
and N is the total number of expirations.
I am not sure how this compares to O(N) intended in the previous
implementation.
If required, O(N) is also possible using an ArrayList.

Unfortunately, a new failure has emerged.
Instead of deadlocking, testModifyAckDeadline intermittently fails.
Maybe
- I have fixed the old bug and created a new one,
- I have fixed the old bug that was masking another one,
- The deadlock wasn't caused by the iteration. Now the tests just fail
  before they could deadlock,
or some combination thereof.
The incorrect iteration should be fixed regardless.
don't process the same job twice

record next expiration after the loop
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Jan 20, 2017
@pongad pongad closed this Jan 20, 2017
@pongad pongad reopened this Jan 20, 2017
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.007%) to 83.144% when pulling daa24e5 on pongad:tick into 0550871 on GoogleCloudPlatform:pubsub-hp.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.007%) to 83.144% when pulling daa24e5 on pongad:tick into 0550871 on GoogleCloudPlatform:pubsub-hp.

This is a follow up to
#1552 .

The linked PR has a race condition:
If the ExtensionJobs is added to the queue after the alarm is set up,
it is possible that the alarm would fire before the jobs are added.
Fixing this is easy, just set up the alarm after.
However, doing this consistently deadlocks the tests. Why?

The tests uses a fake executor.
In tests, time does not flow naturally, but is forced to increment,
usually by the executor's `advanceTime` method.
There is a race between the test thread advancing the time and
the mock server thread inserting more tasks to the fake executor.
If the tasks get inserted first, all is well.
Otherwise, `advanceTime` doesn't see the tasks,
and they do not get executed.
The fix is to check the "current time" every time a task is inserted.
If the task is inserted "in the past", we run the task immediately.
Doing this still deadlocks the tests. Why?

The fake executor needs to lock the task queue when registering a task.
If the task is inserted in the past, it also needs to run the task.
Running the task might involve sending a requst to the mock server.
A GRPC thread on the mock server might handle the request by adding more
tasks to the executor.
The executor's queue is locked by the first thread,
resulting in a deadlock.
The fix is to lock the queue just long enough to retrieve a task,
then execute the task without the lock.
@pongad pongad changed the title DO NOT COMMIT; Tick; experimental code Fix race between executor insert and advanceTime Jan 23, 2017
@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 021b5ad on pongad:tick into ** on GoogleCloudPlatform:pubsub-hp**.

@pongad pongad merged commit 2c858ee into googleapis:pubsub-hp Jan 24, 2017
@pongad pongad deleted the tick branch January 24, 2017 01:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes This human has signed the Contributor License Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants