Skip to content

Fix bug where client "forgets" it received CONNECT_ACK #60

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Feb 25, 2021

Conversation

graebm
Copy link
Contributor

@graebm graebm commented Feb 25, 2021

Client bug

Bug would manifest as a AWS_ERROR_EVENT_STREAM_RPC_PROTOCOL_ERROR if an APPLICATION_MESSAGE was sent immediately upon receiving a CONNECT_ACK

Turns out the client was setting state=CONNECT_PROCESSED in two places:

  1. When writing the CONNECT message
  2. When the CONNECT message's write-completion callback fired.

Fix is to ONLY set state=CONNECT_PROCESSED in place 1.

Bug would occur if the CONNECT_ACK was received before the CONNECT's write-complete callback fired. The state would go from CONNECT_ACK_PROCESSED back to CONNECT_PROCESSED.

This bug began manifesting when aws-c-io changed (PR# awslabs/aws-c-io#353) so that the write-completion callback never fired synchronously. This made it possible to receive a CONNECT_ACK before the CONNECT's write-completion callback fired. The write-completion callback always worked this way on windows though, so it is still righteous to fix the bug in aws-c-event-stream, rather than make further changes to aws-c-io.

Server bug

Server was only setting state=CONNECT_ACK_PROCESSED in one place, but it was setting it from the write-completion callback. Seems like a similar thing could happen:

  1. server sends CONNECT_ACK
  2. client receives CONNECT_ACK
  3. client send APPLICATION_MESSAGE
  4. server receives APPLICATION_MESSAGE. This causes error because state is not yet CONNECT_ACK_PROCESSED
  5. CONNECT_ACK write-completion fires, we set STATE=CONNECT_ACK_PROCESSED, but too late connection is already shutting down in error

Fix is to set state=CONNECT_ACK_PROCESSED when writing the CONNECT_ACK message, not when the write completes.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

The client was setting `state=CONNECT_PROCESSED` in two places:
1) Right writing the CONNECT message
2) When the CONNECT message's write-completion callback fired.

Fix is to ONLY set `state=CONNECT_PROCESSED` in place 1.

Bug would occur if the CONNECT_ACK was received before the CONNECT's write-complete callback fired. The state would go from CONNECT_ACK_PROCESSED back to CONNECT_PROCESSED.
@DavidOgunsAWS
Copy link

I'm seeing the failure observed here in CI intermittently as well (Ubuntu20, clang10) and this seems to be the core of it though I don't know how to eyeball the preconditions here:

[TRACE] [2021-02-25T02:13:40Z] [00007ffff7c403c0] [event-loop] - id=0x8755f0: Scheduling task 0x7fffe80026f0 cross-thread for timestamp 0
aws-c-event-stream-tests: ../nptl/pthread_mutex_lock.c:81: __pthread_mutex_lock: Assertion `mutex->__data.__owner == 0' failed.

Thread 3 "aws-c-event-str" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff7422700 (LWP 30915)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
1: s_rpc_client_message_flush = {void (int, void *)} 0x459100 <s_rpc_client_message_flush>
(gdb) list aws_mutex_lock
35
36          mutex->initialized = (return_code == AWS_OP_SUCCESS);
37          return return_code;
38      }
39
40      int aws_mutex_lock(struct aws_mutex *mutex) {
41          AWS_PRECONDITION(mutex && mutex->initialized);
42          return aws_private_convert_and_raise_error_code(pthread_mutex_lock(&mutex->mutex_handle));
43      }
44
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7c66859 in __GI_abort () at abort.c:79
#2  0x00007ffff7c66729 in __assert_fail_base (fmt=0x7ffff7dfc588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=0x7ffff7e4b62d "mutex->__data.__owner == 0", file=0x7ffff7e4b5fa "../nptl/pthread_mutex_lock.c", line=81, 
    function=<optimized out>) at assert.c:92
#3  0x00007ffff7c77f36 in __GI___assert_fail (assertion=assertion@entry=0x7ffff7e4b62d "mutex->__data.__owner == 0", 
    file=file@entry=0x7ffff7e4b5fa "../nptl/pthread_mutex_lock.c", line=line@entry=81, 
    function=function@entry=0x7ffff7e4b790 <__PRETTY_FUNCTION__.10174> "__pthread_mutex_lock") at assert.c:101
#4  0x00007ffff7e3f164 in __GI___pthread_mutex_lock (mutex=<optimized out>) at ../nptl/pthread_mutex_lock.c:81
#5  0x00000000006d99c6 in aws_mutex_lock (mutex=0x7fffffffdbb0) at /home/ubuntu/crt/aws-c-common/source/posix/mutex.c:42
#6  0x0000000000459129 in s_rpc_client_message_flush (error_code=0, user_data=0x7fffffffdba0)
    at /home/ubuntu/crt/aws-c-event-stream/tests/event_stream_rpc_client_connection_test.c:400
#7  0x000000000048df1a in s_on_protocol_message_written_fn (message=0x87a6f8, error_code=0, user_data=0x87a6f0)
    at /home/ubuntu/crt/aws-c-event-stream/source/event_stream_rpc_server.c:505
#8  0x0000000000483330 in s_on_message_write_completed_fn (channel=0x7fffe8002530, message=0x7fffe8013f00, err_code=0, 
    user_data=0x87ac00) at /home/ubuntu/crt/aws-c-event-stream/source/event_stream_channel_handler.c:247
#9  0x00000000004b2484 in s_on_socket_write_complete (socket=0x7fffe8001ef0, error_code=0, amount_written=107, 
    user_data=0x7fffe8013f00) at /home/ubuntu/crt/aws-c-io/source/socket_channel_handler.c:67
#10 0x00000000004ac534 in s_written_task (task=0x7fffe8001050, arg=0x7fffe8001ef0, status=AWS_TASK_STATUS_RUN_READY)
    at /home/ubuntu/crt/aws-c-io/source/posix/socket.c:1428
#11 0x00000000006dfa06 in aws_task_run (task=0x7fffe8001050, status=AWS_TASK_STATUS_RUN_READY)
    at /home/ubuntu/crt/aws-c-common/source/task_scheduler.c:44
#12 0x00000000006e00cc in s_run_all (scheduler=0x875b70, current_time=103441144724479, status=AWS_TASK_STATUS_RUN_READY)
    at /home/ubuntu/crt/aws-c-common/source/task_scheduler.c:249
#13 0x00000000006e0b12 in aws_task_scheduler_run_all (scheduler=0x875b70, current_time=103441144724479)
    at /home/ubuntu/crt/aws-c-common/source/task_scheduler.c:188
#14 0x00000000004a22c7 in s_main_loop (args=0x8755f0) at /home/ubuntu/crt/aws-c-io/source/linux/epoll_event_loop.c:635
#15 0x00000000006db762 in thread_fn (arg=0x876100) at /home/ubuntu/crt/aws-c-common/source/posix/thread.c:111
#16 0x00007ffff7e3c609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#17 0x00007ffff7d63293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@DavidOgunsAWS
Copy link

DavidOgunsAWS commented Feb 25, 2021

I've narrowed the mutex lock issue to this call: https://github.com/awslabs/aws-c-event-stream/blob/main/tests/event_stream_rpc_client_connection_test.c#L1174 via changing the callback handler to ensure they were unique.

At the same time of this crash, the test function has returned already and is waiting for the server connection shutdown to be completed. (gdb info threads; gdb thread 1; gdb bt)

@graebm graebm merged commit e87537b into main Feb 25, 2021
@graebm graebm deleted the client-state-fix branch February 25, 2021 23:34
Yamakuzure added a commit to Yamakuzure/gentoo that referenced this pull request Mar 14, 2021
The bump fixes two errors in aws-c-event-stream:

* awslabs/aws-c-event-stream#55
  Release lock before invoking callbacks. (Deadlocks seen)
* awslabs/aws-c-event-stream#60
  Fix bug where client "forgets" it received CONNECT_ACK

Package-Manager: Portage-3.0.17, Repoman-3.0.2
Signed-off-by: Sven Eden <[email protected]>
Yamakuzure added a commit to Yamakuzure/gentoo that referenced this pull request Mar 26, 2021
The bump fixes two errors in aws-c-event-stream:

* awslabs/aws-c-event-stream#55
  Release lock before invoking callbacks. (Deadlocks seen)
* awslabs/aws-c-event-stream#60
  Fix bug where client "forgets" it received CONNECT_ACK

Package-Manager: Portage-3.0.17, Repoman-3.0.2
Signed-off-by: Sven Eden <[email protected]>
gentoo-bot pushed a commit to gentoo/gentoo that referenced this pull request Mar 27, 2021
The bump fixes two errors in aws-c-event-stream:

* awslabs/aws-c-event-stream#55
  Release lock before invoking callbacks. (Deadlocks seen)
* awslabs/aws-c-event-stream#60
  Fix bug where client "forgets" it received CONNECT_ACK

Package-Manager: Portage-3.0.17, Repoman-3.0.2
Signed-off-by: Sven Eden <[email protected]>
Signed-off-by: Joonas Niilola <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants