Give sufficient time to get inference statistics for sequence tests #5670

Merged
merged 7 commits on Apr 21, 2023

Conversation

dyastremsky
Contributor

Since responses are sent before inference statistics are reported, there is the risk that statistics are checked before they are ready. This may be the cause of L0_sequence_batcher failing infrequently with statistics missing one request, with messages like:

Traceback (most recent call last):
  File "sequence_batcher_test.py", line 577, in test_no_sequence_end
    self.check_status(model_name, {1: 4 * (idx + 1)},
AssertionError: 7 != 8 : expected model-execution-count 8 for batch size 1, got 7

Only in the case where the test would otherwise fail because the expected requests are not yet reflected in the statistics, this change waits up to 10 seconds, in 0.5-second increments, for the statistics to be ready.
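
As a rough sketch (not the exact code merged in this PR), the retry described above could look like the helper below. The get_inference_statistics call and the model_stats[0].execution_count field are the same ones used in the test's check_status and quoted in the diffs further down; the helper name, timeout, and poll-interval parameters are illustrative.

import time

def wait_for_exec_count(triton_client, model_name, expected_exec_cnt,
                        timeout_s=10.0, poll_interval_s=0.5):
    # Poll the inference statistics until the execution count matches the
    # expectation or the timeout expires, then return the last value seen.
    deadline = time.time() + timeout_s
    while True:
        stats = triton_client.get_inference_statistics(model_name, "1")
        actual_exec_cnt = stats.model_stats[0].execution_count
        if actual_exec_cnt == expected_exec_cnt or time.time() >= deadline:
            return actual_exec_cnt
        print("WARNING: expect {} executions, got {}".format(
            expected_exec_cnt, actual_exec_cnt))
        time.sleep(poll_interval_s)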

@dyastremsky dyastremsky requested review from rmccorm4 and GuanLuo April 20, 2023 13:41
@rmccorm4
Contributor

I'd be surprised if the window between returning response and reporting statistics is really small enough for this to happen often.

On top of that, if it is just a timing issue, why is it so consistently off by one? e.g. 0 != 1, 7 != 8, 15 != 16. That feels suspicious to me, more like a logic issue or internal race condition. Is there any reason why it's not off by 2 or 3 sometimes?

However, if it was in fact a logic/race condition issue (ex: the stats are just getting set incorrectly) and not a timing issue, then I suppose the test would still periodically fail even with this PR's change.

  1. I'd rather have some more definitive proof via some backtraces or prints and timestamps of what the backend is reporting vs. what core is receiving, etc. -- but if you can't reproduce locally then I'm not sure we can do much. Maybe we can add a LOG_VERBOSE(3) around some suspected areas and run the test with that log level until it fails again? What do you think @GuanLuo

  2. Separately, if this sleep/retry change goes through, can you add some printing/logging from the Python test indicating each retry @dyastremsky? That would be additional information in the log that may help in the future, e.g. whether it is actually passing because of a retry or not.

@dyastremsky
Contributor Author

dyastremsky commented Apr 20, 2023

Even if it does not resolve this issue, it seems prudent to wait for the statistics to be ready before checking them. I added the logging here. It's true that in the runs I've seen it's always an off-by-one error, and I wouldn't expect it to be that consistent if this were purely a timing issue, but this seemed like a good change to start with.

It's possible this would reproduce locally and just hasn't been hit after a dozen runs yet. The test is a bit challenging to iterate on, as its runtime is just over an hour and it hits this case in fewer than 5% of runs. An additional option is enabling verbose logging so that we have more information the next time it fails.

@GuanLuo
Contributor

GuanLuo commented Apr 20, 2023

I agree that having the timestamp can be useful to understand the root cause of the issue.

Without further information, however, I think the off-by-one is more an indicator of a timing issue than a logic error. The client (test) proceeds as soon as all responses are received, while the batch stats are only updated after all responses have been sent on the server side, which is why an off-by-one may be observed. That said, it is unlikely that the network time for fetching the stats is shorter than the time needed to update them, unless the server is running on a resource-constrained system and its thread is preempted.
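
As a toy illustration of that window (not Triton code, all names invented): a server thread that sends its response before bumping an execution counter can briefly expose a stale count to a client that queries immediately after receiving the response.

import threading
import time

execution_count = 0
response_sent = threading.Event()

def handle_request():
    global execution_count
    # Send the response to the client first...
    response_sent.set()
    # ...then update the statistics a moment later (the window in question).
    time.sleep(0.01)
    execution_count += 1

threading.Thread(target=handle_request).start()
response_sent.wait()  # client: all responses received
print("count right after response:", execution_count)  # may still print 0 (off by one)
time.sleep(0.05)
print("count after a short wait:", execution_count)  # prints 1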

Contributor

@rmccorm4 rmccorm4 left a comment

LGTM, minor comments for future bookkeeping

actual_exec_cnt = stats.model_stats[0].execution_count
if actual_exec_cnt == exec_cnt:
    break
print("Waiting: expect {} executions, got {}".format(exec_cnt,
                                                      actual_exec_cnt))
Contributor

Can you add an attempt number and some kind of "ERROR/WARNING" or similar so it's easier to spot in the logs?

Contributor Author

Good idea. Updated.

@@ -112,7 +112,7 @@ else
 fi
 fi
 
-SERVER_ARGS_EXTRA="--backend-directory=${BACKEND_DIR} --backend-config=tensorflow,version=${TF_VERSION}"
+SERVER_ARGS_EXTRA="--backend-directory=${BACKEND_DIR} --backend-config=tensorflow,version=${TF_VERSION} --log-verbose=1"
Contributor Author

Adding verbose logging so that we have more information if this fails again. SERVER_ARGS_EXTRA is used in every run, so putting the flag here seemed to make more sense than adding it in five places.

@dyastremsky dyastremsky requested a review from rmccorm4 April 20, 2023 22:23
print("Waiting: expect {} executions, got {}".format(exec_cnt,
actual_exec_cnt))
print("WARNING: expect {} executions, got {} (attempt {})".format(
exec_cnt, actual_exec_cnt, loop_count))
Contributor

@rmccorm4 rmccorm4 Apr 20, 2023

increment loop count? Alternatively just use a fixed number of iterations:

num_tries = 10
for i in range(num_tries):
  ...
  print(f"WARNING: ... (attempt {i})")
  sleep(1)

Contributor Author

@dyastremsky dyastremsky Apr 20, 2023

Nice catch, switched to incrementing the loop count.

@@ -984,8 +984,22 @@
                               _max_sequence_idle_ms * 1000)  # 5 secs
 
     def check_status(self, model_name, batch_exec, exec_cnt, infer_cnt):
-        stats = self.triton_client_.get_inference_statistics(model_name, "1")
-        self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats")
+        start_time = time.time()

Check notice (Code scanning / CodeQL): Unused local variable

Variable start_time is not used.
@@ -986,18 +986,18 @@ def check_setup(self, model_name):
     def check_status(self, model_name, batch_exec, exec_cnt, infer_cnt):
         start_time = time.time()
         # There is a time window between when responses are returned and statistics are updated.
-        # To prevent intermittent test failure during that window, wait up to 10 seconds for the
+        # To prevent intermittent test failure during that window, wait up to 5 seconds for the
Contributor

You can keep it as 10 seconds by increasing sleep time. Might as well have more time for it to pass if we're doing the workaround.

Contributor Author

Sure, updated.

@dyastremsky dyastremsky requested a review from rmccorm4 April 21, 2023 19:43