Skip to content

Using bytes options for TLS rather than path does not appear to work #52

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
leonardb opened this issue Jan 26, 2022 · 10 comments
Open

Comments

@leonardb
Copy link

FDB allows for:

tls_ca_path tls_ca_bytes
tls_key_path tls_key_bytes
tls_cert_path tls_cert_bytes

When erlfdb inits using the path env values I can connect to the cluster and use the DB.
If I use the bytes options erlfdb 'seems' to connect and returns a database when erlfdb:open(Cluster)., but all further operations just hang. A simple transactional set ends up stuck in erlfdb:wait and never returns.

The only error I see in the fdb trace logs is a 1026 network error which is of course helpfully undocumented.

Rationale:
I currently have a branch which does a lazy init of erlfdb_nif and passes in the network options as they are pulled from a key server on init of the parent application. Currently I'm having to store keys to disk and pass in the path variants, but would definitely prefer that keys do no get saved to disk at all.

@kocolosk
Copy link
Member

I don't know why 1026 doesn't show up in the documentation, but the description in the code is just "Network connection failed" so that's not really any help either.

The _bytes variants don't seem to be documented, but glancing through the C client code it does seem like they're still available. What version of FDB are you using? Back in 6.2.x I believe there was a rewrite of the TLS stack.

@leonardb
Copy link
Author

leonardb commented Jan 27, 2022

Hi @kocolosk

This was testing with 6.2.28-1 (and now with 6.3.23 as well)

Noticed that in fact tls_ca_path is currently not supported, only tls_ca_bytes. erlfdb will error when trying to use path.

In erlfdb_nif the following tls ca/key/cert options are types for network_option()

    | tls_cert_bytes
    | tls_cert_path
    | tls_key_bytes
    | tls_key_path
    | tls_ca_bytes
    | tls_ca_path

But, tls_ca_path does not exist in erlfdb_network_set_option in main.c, while all the others are.
^^ This is probably an oversight. The following patch adds it to network options handling

index 8cb03b0..fc27e00 100644
--- a/c_src/main.c
+++ b/c_src/main.c
@@ -488,6 +488,8 @@ erlfdb_network_set_option(ErlNifEnv* env, int argc, const ERL_NIF_TERM argv[])
         option = FDB_NET_OPTION_CLIENT_BUGGIFY_SECTION_FIRED_PROBABILITY;
     } else if(IS_ATOM(argv[0], tls_ca_bytes)) {
         option = FDB_NET_OPTION_TLS_CA_BYTES;
+    } else if(IS_ATOM(argv[0], tls_ca_path)) {
+        option = FDB_NET_OPTION_TLS_CA_PATH;
     } else if(IS_ATOM(argv[0], tls_password)) {
         option = FDB_NET_OPTION_TLS_PASSWORD;

I've also had a look at a few of the 'blessed' language bindings at https://github.com/apple/foundationdb/tree/main/bindings and they all seem to support the bytes options. Also seems they're all there in the native api code. https://github.com/apple/foundationdb/blob/1800259d7cb35be79541b848ddb83fd62d1ddd15/fdbclient/NativeAPI.actor.cpp#L2031-L2063

@nickva
Copy link
Contributor

nickva commented Jan 27, 2022

Good catch @leonardb. I think we just need to add the option. It looks like we validate the network options so it's surprising it didn't blow up at https://github.com/apache/couchdb-erlfdb/blob/main/c_src/main.c#L514 from return enif_make_badarg(env) if you try to set that option.

https://github.com/apache/couchdb-erlfdb/blob/main/src/erlfdb_nif.erl#L105-L106

@kocolosk is right that we may have to wrangle the correct API versions here if it's not documented or was available only earlier/later we could resort to ifdefs possibly.

@kocolosk
Copy link
Member

BTW did not mean to close the issue, sorry.

@kocolosk kocolosk reopened this Jan 27, 2022
@leonardb
Copy link
Author

I'm really just worried about the tls_cert_bytes and tls_key_bytes options not working at all.

They're accepted when setting the network options (no errors) but obviously 'something' is not working as erlfdb is not connecting properly to the cluster. Maybe it's a bug in the C lib itself, but that's beyond my capabilities to track down.

Enabling trace on the connection gives very limited info. All I can see is a connection failure.

<?xml version="1.0"?>
<Trace>
<Event Severity="10" Time="1643300975.803927" DateTime="2022-01-27T16:29:35Z" Type="SetNetworkOption" ID="0000000000000000" Option="TRACE_ENABLE" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.804052" DateTime="2022-01-27T16:29:35Z" Type="SetNetworkOption" ID="0000000000000000" Option="TLS_CA_BYTES" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.804202" DateTime="2022-01-27T16:29:35Z" Type="SetNetworkOption" ID="0000000000000000" Option="TLS_KEY_BYTES" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.804275" DateTime="2022-01-27T16:29:35Z" Type="SetNetworkOption" ID="0000000000000000" Option="TLS_CERT_BYTES" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.805842" DateTime="2022-01-27T16:29:35Z" Type="Net2Starting" ID="0000000000000000" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.806136" DateTime="2022-01-27T16:29:35Z" Type="SetNetworkOption" ID="0000000000000000" Option="SUPPORTED_CLIENT_VERSIONS" ThreadID="579746076798814255" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.805868" DateTime="2022-01-27T16:29:35Z" Type="Net2Running" ID="0000000000000000" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.805868" DateTime="2022-01-27T16:29:35Z" Type="ProcessTimeOffset" ID="0000000000000000" ProcessTime="1643300975.807528" SystemTime="1643300975.807528" OffsetFromSystemTime="-0.000001" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.808202" DateTime="2022-01-27T16:29:35Z" Type="ClientStart" ID="0000000000000000" SourceVersion="47b9a81d1c10897c863098fe2d66e827fed0d239" Version="6.3.23" PackageName="6.3" ClusterFile="/opt/mfdb/test/fdb.local.cluster" ConnectionString="foo:[email protected]:4500:tls" ActualTime="1643300975" ApiVersion="630" ImageOffset="0x7ff1b2c6b000" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" TrackLatestType="Original" />
<Event Severity="10" Time="1643300975.808202" DateTime="2022-01-27T16:29:35Z" Type="MachineLoadDetail" ID="0000000000000000" User="31710408" Nice="169636" System="8478089" Idle="475535396" IOWait="3923774" IRQ="0" SoftIRQ="1135099" Steal="0" Guest="0" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.808202" DateTime="2022-01-27T16:29:35Z" Type="Net2TLSConfig" ID="0000000000000000" CAPath="" CertificatePath="/etc/foundationdb/fdb.pem" KeyPath="/etc/foundationdb/fdb.pem" HasPassword="0" VerifyPeers="Check.Valid=1" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.808202" DateTime="2022-01-27T16:29:35Z" Type="NotifyAddressHealthy" ID="0000000000000000" SuppressedEventCount="0" Address="10.0.0.205:4500:tls" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.812328" DateTime="2022-01-27T16:29:35Z" Type="CodeCoverage" ID="0000000000000000" File="/root/src/oOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOoOo/foundationdb/fdbclient/ReadYourWrites.actor.cpp" Line="1357" Condition="true" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300975.812701" DateTime="2022-01-27T16:29:35Z" Type="ConnectingTo" ID="0000000000000000" SuppressedEventCount="0" PeerAddr="10.0.0.205:4500:tls" PeerReferences="2" FailureStatus="OK" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="20" Time="1643300977.812787" DateTime="2022-01-27T16:29:37Z" Type="N2_ConnectHandshakeError" ID="aac42de923f58eed" SuppressedEventCount="0" ErrorCode="104" Message="Connection reset by peer" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300977.812787" DateTime="2022-01-27T16:29:37Z" Type="ConnectionTimedOut" ID="aac42de923f58eed" SuppressedEventCount="0" PeerAddr="10.0.0.205:4500:tls" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />
<Event Severity="10" Time="1643300977.812787" DateTime="2022-01-27T16:29:37Z" Type="ConnectionClosed" ID="aac42de923f58eed" Error="connection_failed" ErrorDescription="Network connection failed" ErrorCode="1026" SuppressedEventCount="0" PeerAddr="10.0.0.205:4500:tls" ThreadID="18086592241478077095" Machine="10.0.0.205:16186" LogGroup="default" />

@nickva
Copy link
Contributor

nickva commented Jan 27, 2022

I see the logs are referring to paths Type="Net2TLSConfig" ID="0000000000000000" CAPath="" CertificatePath="/etc/foundationdb/fdb.pem" KeyPath="/etc/foundationdb/fdb.pem" .../> did it find both or are those just default values?

To separate the values being mismanaged in the erlfdb erlang/c code vs fdb client not handling it properly see if you can pass the exact values via the environment variables https://apple.github.io/foundationdb/tls.html

@leonardb
Copy link
Author

leonardb commented Jan 27, 2022

This did not help so I decided to just nuke my entire local DB and start from scratch.

After setting up a fresh install of 6.3.23 with new certs it appears that passing the _bytes variants works fine.

So, not a bug and apologies for the distraction.

I do still have an open question about the ability to defer loading and initialization of the NIF though.

Currently I have a branch which:

  • checks for application:env(erlfdb, init, auto) and if it's anything but manual if continues the init as usual.
  • exports a init_manual/1 function from erlfdb_nif which allows me to manually init with all the cert details

Would this be considered as a contribution? It does not change or alter default behavior but it does give consumers the ability to control initialization and network settings outside of settings in sys.config

@kocolosk
Copy link
Member

Fascinating. I was just going to comment that I cannot seem to get the _bytes variants to work in either Erlang or Python, so now I wonder what I was doing wrong!

@leonardb
Copy link
Author

leonardb commented Jan 27, 2022

"I wish I knew....." what I was doing wrong previously. All I can add is that when I added {trace_enable, <<"/tmp/">>} to network options to have a separate trace file for the client connection I was seeing 'tls' errors and the KeyPath="/etc/foundationdb/fdb.pem" @nickva pointed out seemed very off since I don't even have that file on my machine

That sent me down a rabbit-hole of checking with different settings and keys and finally re-installing/initializing with newly generated keys, at which point I discovered you now have to manually add all the key settings to the backup_agent command in foundationdb.conf to stop that throwing a ton of errors.

Once I had a 6.3.23 fresh cluster up and could connect, erlfdb just started working with all the _bytes options.

I honestly wish I knew why though
Also added {api_version, 630} to the erlfdb env in my config

@kocolosk
Copy link
Member

OK I figured out the issue on my side, just some whitespace issues with the way I was producing the byte strings for the key and cert to pass to the application environment.

Agreed that tls_ca_path was just an oversight, clearly that should exist.

Regarding the loading and initialization behavior for the NIF ... I'm wondering if we ought to just always defer the part that sets up the network thread until the first call to erlfdb:open/1. That seems to be more in line with e.g. the behavior of the Python bindings. Might be a little work around the error handling for the sentinel that ensures we only attempt to set up the network thread once, but not a big deal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants