Give sufficient time to get inference statistics for sequence tests #5670

Merged
merged 7 commits on Apr 21, 2023

Conversation

dyastremsky
Contributor

Since responses are sent before inference statistics are reported, there is the risk that statistics are checked before they are ready. This may be the cause of L0_sequence_batcher failing infrequently with statistics missing one request, with messages like:

Traceback (most recent call last):
  File "sequence_batcher_test.py", line 577, in test_no_sequence_end
    self.check_status(model_name, {1: 4 * (idx + 1)},
AssertionError: 7 != 8 : expected model-execution-count 8 for batch size 1, got 7

Only in the case where the test would otherwise fail because the expected requests are not yet reflected in the statistics, this change waits up to 10 seconds, in 0.5-second increments, for the statistics to be ready.
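
As a rough sketch (not the exact code merged in this PR), the retry described above could look like the helper below. The get_inference_statistics call and the model_stats[0].execution_count field are the same ones used in the test's check_status and quoted in the diffs further down; the helper name, timeout, and poll-interval parameters are illustrative.

import time

def wait_for_exec_count(triton_client, model_name, expected_exec_cnt,
                        timeout_s=10.0, poll_interval_s=0.5):
    # Poll the inference statistics until the execution count matches the
    # expectation or the timeout expires, then return the last value seen.
    deadline = time.time() + timeout_s
    while True:
        stats = triton_client.get_inference_statistics(model_name, "1")
        actual_exec_cnt = stats.model_stats[0].execution_count
        if actual_exec_cnt == expected_exec_cnt or time.time() >= deadline:
            return actual_exec_cnt
        print("WARNING: expect {} executions, got {}".format(
            expected_exec_cnt, actual_exec_cnt))
        time.sleep(poll_interval_s)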

@dyastremsky dyastremsky requested review from rmccorm4 and GuanLuo April 20, 2023 13:41
@rmccorm4
Contributor

I'd be surprised if the window between returning response and reporting statistics is really small enough for this to happen often.

On top of that, if it is just a timing issue, why is it so consistently off by one? e.g. 0 != 1, 7 != 8, 15 != 16. That feels suspicious to me, more like a logic issue or internal race condition. Is there any reason why it's not off by 2 or 3 sometimes?

However, if it was in fact a logic/race condition issue (ex: the stats are just getting set incorrectly) and not a timing issue, then I suppose the test would still periodically fail even with this PR's change.

  1. I'd rather have some more definitive proof via some backtraces or prints and timestamps of what the backend is reporting vs. what core is receiving, etc. -- but if you can't reproduce locally then I'm not sure we can do much. Maybe we can add a LOG_VERBOSE(3) around some suspected areas and run the test with that log level until it fails again? What do you think @GuanLuo

  2. Separately, if this sleep/retry change goes through, can you add some printing/logging from the Python test indicating each retry @dyastremsky? That would be additional information in the log that may help in the future, e.g. whether it is actually passing because of a retry or not.

@dyastremsky
Contributor Author

dyastremsky commented Apr 20, 2023

Even if it does not resolve this issue, it seems prudent to wait for the statistics to be ready before checking them. I added the logging here. It's true that in the runs I've seen it's always an off-by-one error, and I wouldn't expect it to be that consistent if this were purely a timing issue, but this seemed like a good change to start with.

It's possible this would reproduce locally and just hasn't been hit after a dozen runs yet. The test is a bit challenging to iterate on, as its runtime is just over an hour and it hits this case in fewer than 5% of runs. An additional option is enabling verbose logging so that we have more information the next time it fails.

@GuanLuo
Contributor

GuanLuo commented Apr 20, 2023

I agree that having the timestamp can be useful to understand the root cause of the issue.

Without further information, however, I think the off-by-one is more an indicator of a timing issue than a logic error. The client (test) proceeds as soon as all responses are received, while the batch stats are only updated after all responses have been sent on the server side, which is why an off-by-one may be observed. That said, it is unlikely that the network time for fetching the stats is shorter than the time needed to update them, unless the server is running on a resource-constrained system and its thread is preempted.
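
As a toy illustration of that window (not Triton code, all names invented): a server thread that sends its response before bumping an execution counter can briefly expose a stale count to a client that queries immediately after receiving the response.

import threading
import time

execution_count = 0
response_sent = threading.Event()

def handle_request():
    global execution_count
    # Send the response to the client first...
    response_sent.set()
    # ...then update the statistics a moment later (the window in question).
    time.sleep(0.01)
    execution_count += 1

threading.Thread(target=handle_request).start()
response_sent.wait()  # client: all responses received
print("count right after response:", execution_count)  # may still print 0 (off by one)
time.sleep(0.05)
print("count after a short wait:", execution_count)  # prints 1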

Contributor

@rmccorm4 rmccorm4 left a comment

LGTM, minor comments for future bookkeeping

actual_exec_cnt = stats.model_stats[0].execution_count
if actual_exec_cnt == exec_cnt:
    break
print("Waiting: expect {} executions, got {}".format(exec_cnt,
                                                      actual_exec_cnt))
Contributor

Can you add an attempt number and some kind of "ERROR/WARNING" or similar so it's easier to spot in the logs?

Contributor Author

Good idea. Updated.

@@ -112,7 +112,7 @@ else
 fi
 fi
 
-SERVER_ARGS_EXTRA="--backend-directory=${BACKEND_DIR} --backend-config=tensorflow,version=${TF_VERSION}"
+SERVER_ARGS_EXTRA="--backend-directory=${BACKEND_DIR} --backend-config=tensorflow,version=${TF_VERSION} --log-verbose=1"
Contributor Author

Adding verbose logging so that we have more information if this fails again. SERVER_ARGS_EXTRA is used in every run, so putting the flag here seemed to make more sense than adding it in five places.

@dyastremsky dyastremsky requested a review from rmccorm4 April 20, 2023 22:23
print("Waiting: expect {} executions, got {}".format(exec_cnt,
actual_exec_cnt))
print("WARNING: expect {} executions, got {} (attempt {})".format(
exec_cnt, actual_exec_cnt, loop_count))
Contributor

@rmccorm4 rmccorm4 Apr 20, 2023

increment loop count? Alternatively just use a fixed number of iterations:

num_tries = 10
for i in range(num_tries):
  ...
  print(f"WARNING: ... (attempt {i})")
  sleep(1)

Contributor Author

@dyastremsky dyastremsky Apr 20, 2023

Nice catch, switched to incrementing the loop count.

@@ -984,8 +984,22 @@
                               _max_sequence_idle_ms * 1000)  # 5 secs
 
     def check_status(self, model_name, batch_exec, exec_cnt, infer_cnt):
-        stats = self.triton_client_.get_inference_statistics(model_name, "1")
-        self.assertEqual(len(stats.model_stats), 1, "expect 1 model stats")
+        start_time = time.time()

Check notice (Code scanning / CodeQL): Unused local variable

Variable start_time is not used.
@@ -986,18 +986,18 @@ def check_setup(self, model_name):
     def check_status(self, model_name, batch_exec, exec_cnt, infer_cnt):
         start_time = time.time()
         # There is a time window between when responses are returned and statistics are updated.
-        # To prevent intermittent test failure during that window, wait up to 10 seconds for the
+        # To prevent intermittent test failure during that window, wait up to 5 seconds for the
Contributor

You can keep it as 10 seconds by increasing sleep time. Might as well have more time for it to pass if we're doing the workaround.

Contributor Author

Sure, updated.

@dyastremsky dyastremsky requested a review from rmccorm4 April 21, 2023 19:43