ssl.wrap_socket() hangs on reconnecting #171

Open
SteveLTN opened this issue Apr 30, 2025 · 10 comments
@SteveLTN

When the internet connection is poor, ssl.wrap_socket() sometimes blocks during reconnection. This usually causes other coroutines to stop responding. In my case, it triggers the watchdog timer, which restarts the MCU.

I found setting do_handshake = False in ssl_params solves the issue, as wrap_socket() doesn't try to do the handshake anymore.
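For illustration, a minimal sketch of that setting. The dictionary below is a stand-in for the `config` dictionary exported by mqtt_as, reduced to its two TLS-related keys; `wrap_kwargs` is a hypothetical helper that only shows which keyword arguments would reach ssl.wrap_socket().

```python
# Stand-in for the dictionary exported by mqtt_as (`from mqtt_as import config`);
# only the two TLS-related keys are shown here.
config = {"ssl": False, "ssl_params": {}}

config["ssl"] = True
# MicroPython-specific keyword: ask wrap_socket() not to perform the
# TLS handshake itself.
config["ssl_params"] = {"do_handshake": False}


def wrap_kwargs(cfg):
    """Hypothetical helper: keyword arguments that would go to ssl.wrap_socket()."""
    return dict(cfg["ssl_params"]) if cfg["ssl"] else {}


print(wrap_kwargs(config))
```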

My question is:

  1. Is this expected behavior?
  2. If so, would you kindly add it to the documentation, or make do_handshake = False the default behavior?

Many thanks for the great project!

@peterhinch
Owner

The mqtt_as module uses the ssl library to create an encrypted socket - see the code. Once this is done, the socket is used in exactly the same way as a normal one. Consequently where the behaviour of an SSL connection differs from that of a normal one, the cause is traceable to the ssl library. For example, as you point out, the connection process of an ssl socket can block.

I don't claim a detailed knowledge of the ssl library. In particular I don't know the implications of setting do_handshake = False. If you could enlighten me, or point me to a reference, I'd be grateful.

In terms of actions the best option (once I understand it) is for me to document this. I think specifying ssl_params should remain the responsibility of the user.

@SteveLTN
Author

Unfortunately I don't have much knowledge of the SSL module either. I only did some quick research in the MicroPython and CPython documentation and browsed the mqtt_as code.

Here is briefly what I found:

In the MicroPython docs, the following statement in particular made me try setting do_handshake, as I found that mqtt_as uses non-blocking sockets.

For non-blocking sockets (i.e. when the sock passed into wrap_socket is in non-blocking mode) the handshake should generally be deferred because otherwise wrap_socket blocks until it completes.

That's why I tried to set do_handshake = False.

The MicroPython doc doesn't say much more about this flag, though it refers to CPython's ssl module doc. CPython's doc says one should manually call do_handshake() if do_handshake_on_connect is False. But I don't see the equivalent in MicroPython, and the MicroPython doc seems to imply that the handshake is done automatically on the first write or read.

Note: MicroPython's ssl.wrap_socket(do_handshake=False) is equivalent to CPython's ssl.wrap_socket(do_handshake_on_connect=False). The differing keyword name seems to be a discrepancy between the two implementations.
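For reference, the manual handshake described in the CPython documentation can be sketched roughly as below. This is CPython only; whether MicroPython offers an equivalent is exactly the open question here. `finish_handshake` is a hypothetical helper name, and the socket is assumed to have been wrapped with do_handshake_on_connect=False while in non-blocking mode.

```python
import select
import ssl
import time


def finish_handshake(tls_sock, timeout_s=10.0):
    """Drive a deferred TLS handshake on a non-blocking CPython SSLSocket."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            tls_sock.do_handshake()
            return
        except ssl.SSLWantReadError:
            readers, writers = [tls_sock], []
        except ssl.SSLWantWriteError:
            readers, writers = [], [tls_sock]
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("TLS handshake timed out")
        # Wait until the socket is ready in the direction the TLS layer asked for.
        select.select(readers, writers, [], remaining)


# Typical CPython usage (not executed here, it needs a live connection):
#   ctx = ssl.create_default_context()
#   tls = ctx.wrap_socket(sock, server_hostname=host,
#                         do_handshake_on_connect=False)
#   finish_handshake(tls)
```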

@peterhinch
Owner

The MP doc states

Note that in AXTLS the handshake can be deferred until the first read or write but it then blocks until completion.

This seems to imply:

  1. The handshake always blocks until complete.
  2. If not done on connect it will be done when the socket is first used.

Under your conditions of poor connectivity, and with do_handshake=False, do you have evidence of blocking after the initial connection is complete?

The reason I ask is that, in ._connect() the code performs the following steps:

  1. Create a nonblocking socket and connect to broker address.
  2. Wrap in SSL.
  3. Write to the broker (clean session status, last will, user logon etc).

I'd therefore expect setting do_handshake=False to have little effect, merely deferring the blocking for a very brief period until that first write is performed.

@SteveLTN
Author

SteveLTN commented May 1, 2025

I agree that the MP doc implies setting do_handshake=False simply defers the blocking. However, in my testing, this isn't the case. What I observe is:

  1. _connect() calls self._sock.setblocking(False)
  2. _connect() calls ssl.wrap_socket() if SSL is enabled. This call doesn't block with do_handshake=False when connection isn't good.
  3. _connect() calls _as_write(), which in turn calls sock.write().
  4. sock.write() doesn't block at all. When the connection isn't good, it simply returns 0 immediately. Then await asyncio.sleep_ms(0) yields, allowing other coroutines to execute.
  5. Eventually the timeout is triggered and _as_write() fails.
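Steps 3 to 5 can be illustrated with a simplified write loop. This is a sketch written for CPython's asyncio, not the mqtt_as source: `as_write`, `StalledSocket` and `demo` are hypothetical names, and the stalled socket simply mimics a write() that returns 0 immediately.

```python
import asyncio
import time


class StalledSocket:
    """Hypothetical non-blocking socket whose peer never accepts data."""

    def write(self, data):
        return 0  # nothing accepted, but the call returns immediately


async def as_write(sock, data, timeout_s=0.2):
    """Retry until all bytes are written or no progress is made for timeout_s."""
    view = memoryview(data)
    last_progress = time.monotonic()
    while view:
        if time.monotonic() - last_progress > timeout_s:
            raise OSError(-1, "Timeout on socket write")
        n = sock.write(view)
        if n:
            view = view[n:]
            last_progress = time.monotonic()
        await asyncio.sleep(0)  # yield so other coroutines can run


async def demo():
    ticks = 0

    async def other_task():
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    task = asyncio.create_task(other_task())
    try:
        await as_write(StalledSocket(), b"CONNECT")
        timed_out = False
    except OSError:
        timed_out = True
    task.cancel()
    return timed_out, ticks
```

Running `asyncio.run(demo())` should report a timeout while the other task has kept ticking, which matches the non-blocking behaviour observed with do_handshake=False.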

I tried the above scenario on an ESP32S3. I simulated a bad connection by setting the MQTT server to youtube.com, which obviously doesn't support MQTT. With do_handshake=True, the setup blocks at ssl.wrap_socket(); simply setting the flag to False makes it non-blocking all the way through.

I'm out this week. I'll come up with some example code to demonstrate the behavior and do some further testing next week. For instance, is the blocking behavior as documented if self._sock isn't set to non-blocking? I'll also try it on an RP2040 chip to see if the behavior differs from port to port.

@peterhinch
Owner

I'll be very interested to see your results. I gather some ports use axtls and others mbedtls. The behaviour of non-blocking sockets under TLS is poorly documented. I will document your observations (with attribution).

A factor to bear in mind is that mqtt_as handles initial connection in a different way to reconnection. If the initial connection fails, the exception is thrown to the application. Reconnection is transparent. The reason for this is that failure of initial connection is typically a result of a condition needing human intervention e.g. wrong IP or credentials.

I don't think this has any bearing on your observations but the behaviour has caused confusion in the past.
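A minimal sketch of what this means for application code, assuming a client whose connect() coroutine raises OSError when the initial connection fails. `FakeClient` and `start` are hypothetical stand-ins so the example runs without hardware.

```python
import asyncio


class FakeClient:
    """Hypothetical stand-in for a client whose first connect() fails."""

    async def connect(self):
        raise OSError(-1, "Connect failed")


async def start(client):
    try:
        await client.connect()  # a failure here propagates to the application
    except OSError as exc:
        return f"initial connection failed: {exc}"
    return "connected"


print(asyncio.run(start(FakeClient())))
```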

@peterhinch
Owner

This doc may be relevant. I wonder if do_handshake=False changes the handshake mode to 1?

@SteveLTN
Author

SteveLTN commented May 7, 2025

In this branch I added some test code to show the difference, along with a pull request to show the diff.

Here are the logs when toggling do_handshake:

# do_handshake = True
❯ mpr -m . run mqtt_as/test_ssl_blocking.py
Local directory . is mounted at /remote
Connecting to MQTT broker, Current ticks_ms: 862177
Performing another task. tick_ms:  862180
Performing another task. tick_ms:  865190
Performing another task. tick_ms:  868200 # These prints are printed before entering `ssl.wrap_socket`. Then it blocks for 20+ seconds.
Connect() finished/aborted. Current ticks_ms: 886740 Time taken: 24563 ms
Traceback (most recent call last):
  File "<stdin>", line 42, in <module>
  File "asyncio/core.py", line 1, in run
  File "asyncio/core.py", line 1, in run_until_complete
  File "asyncio/core.py", line 1, in run_until_complete
  File "<stdin>", line 36, in main
  File "mqtt_as/__init__.py", line 800, in connect
  File "mqtt_as/__init__.py", line 315, in _connect
  File "ssl.py", line 1, in wrap_socket
  File "ssl.py", line 1, in wrap_socket
OSError: [Errno 113] ECONNABORTED # This is the error raised by the SSL library

# do_handshake = False
❯ mpr -m . run mqtt_as/test_ssl_blocking.py
Local directory . is mounted at /remote
Connecting to MQTT broker, Current ticks_ms: 924911
Performing another task. tick_ms:  924913
Performing another task. tick_ms:  927920
Performing another task. tick_ms:  930930
Performing another task. tick_ms:  933930
Performing another task. tick_ms:  936930
Performing another task. tick_ms:  939930
Connect() finished/aborted. Current ticks_ms: 941156 Time taken: 16245 ms
Traceback (most recent call last):
  File "<stdin>", line 42, in <module>
  File "asyncio/core.py", line 1, in run
  File "asyncio/core.py", line 1, in run_until_complete
  File "asyncio/core.py", line 1, in run_until_complete
  File "<stdin>", line 36, in main
  File "mqtt_as/__init__.py", line 800, in connect
  File "mqtt_as/__init__.py", line 341, in _connect
  File "mqtt_as/__init__.py", line 276, in _as_write
OSError: (-1, 'Timeout on socket write') # Timeout raised by mqtt_as code

I ran the code on ESP32S3.
I wanted to try on RP2040 as well. But I forgot it doesn't have WiFi capability 😂
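The idea behind a test of this kind can be sketched as follows: run a heartbeat task next to the connect call and look at the longest gap between heartbeats. This is a CPython asyncio sketch under stated assumptions, not the test script from the branch; all names are hypothetical, and the two `*_connect` coroutines merely simulate a blocking and a yielding connect.

```python
import asyncio
import time


async def heartbeat(gaps, period_s=0.02):
    """Record the wall-clock gap between successive wake-ups."""
    last = time.monotonic()
    while True:
        await asyncio.sleep(period_s)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


async def longest_gap(connect):
    """Run connect() next to a heartbeat and return the worst heartbeat gap."""
    gaps = []
    task = asyncio.create_task(heartbeat(gaps))
    await asyncio.sleep(0)  # let the heartbeat start before connect() runs
    try:
        await connect()
    finally:
        await asyncio.sleep(0.05)  # let the heartbeat record the final gap
        task.cancel()
    return max(gaps)


async def blocking_connect():
    time.sleep(0.3)  # stands in for a handshake that blocks the scheduler


async def yielding_connect():
    await asyncio.sleep(0.3)  # stands in for a connect that yields while waiting
```

A large value from `asyncio.run(longest_gap(blocking_connect))` compared with a small one from `yielding_connect` corresponds to the missing versus regular "Performing another task" lines in the two logs.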

@SteveLTN
Author

SteveLTN commented May 7, 2025

After reading your linked doc regarding handshake mode, I thought it might be interesting to try a fake CA cert and see what happens (against my real MQTT server on AWS IoT Core).

The result is surprising.
Regardless of whether I use do_handshake = True or False, the client connects and subscribes without issues. If my understanding is correct, does that mean we are using handshake mode 1 in both cases?

@peterhinch
Copy link
Owner

Thanks for that interesting result. Setting do_handshake=False evidently fixes the blocking and speeds connection, but I still have some queries.

  • Does it reduce the level of security by changing the handshake mode?
  • Does it defer the handshake until the first transfer?

I'm unclear whether the test clarifies the second point. The MQTT client attempts a socket write immediately after connecting, but that write fails: whether a handshake was attempted is moot.

You might try the test on a public MQTT broker which supports TLS: with do_handshake=False see if any blocking occurs after the initial socket connect when the initial write is performed. There are some public brokers listed in my docs.

@SteveLTN
Author

SteveLTN commented May 7, 2025

  • Does it reduce the level of security by changing the handshake mode?
    I fully understand your concern. I'll look into it a bit as well.
  • Does it defer the handshake until the first transfer?

In my particular case, the handshake is usually fast enough when the connection quality is good.

My concern is rather the case where the server fails to respond. With do_handshake = True, it always blocks for 20+ seconds until it fails, and my hardware is unresponsive during that time. My test shows that with do_handshake = False there is no such blocking.

My application actually uses AWS IoT Core. When the connection quality is good, I don't notice any blocking in normal operation with do_handshake = False. But I'll try to test later to identify any "shorter" blocks.
