Error: Proposal failed to bind to state #2247

Closed
@lynshi

Describe the bug
In a 3-node CCF network, submitting a proposal returns the following error:

{'error': {'code': 'InternalError', 'message': 'Proposal failed to bind to state.'}}

To debug, we sent proposals to each CCF node individually instead of going through the load balancer. Every secondary node returned the error above, while the primary returned a response like the following:

{'proposal_id': '56c1acb4247462a9201b254bdcd84958accf1bb22ac942e26d8a545fa9dffa20', 'proposer_id': 0, 'state': 'OPEN'}
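
The per-node check described above can be sketched as a small classifier over the response bodies. This is only an illustration: the node labels (`node-0` and so on) and the choice of which node is primary are hypothetical, and the function inspects only the fields visible in the two responses quoted in this issue.

```python
from typing import Dict

# Error message exactly as it appears in the response quoted in this issue.
BIND_FAILURE = "Proposal failed to bind to state."


def classify_response(body: dict) -> str:
    """Classify one node's reply to a proposal submission."""
    error = body.get("error")
    if error is not None:
        if error.get("message") == BIND_FAILURE:
            return "bind_failure"
        return "other_error"
    if "proposal_id" in body and body.get("state") == "OPEN":
        return "accepted"
    return "unknown"


def summarise(responses: Dict[str, dict]) -> Dict[str, str]:
    """Map each node label to the classification of its reply."""
    return {node: classify_response(body) for node, body in responses.items()}


# Hypothetical node labels; the bodies are the ones reported in this issue.
responses = {
    "node-0": {
        "proposal_id": "56c1acb4247462a9201b254bdcd84958accf1bb22ac942e26d8a545fa9dffa20",
        "proposer_id": 0,
        "state": "OPEN",
    },
    "node-1": {"error": {"code": "InternalError", "message": BIND_FAILURE}},
    "node-2": {"error": {"code": "InternalError", "message": BIND_FAILURE}},
}

for node, outcome in summarise(responses).items():
    print(node, outcome)
```

A summary with exactly one `accepted` entry and `bind_failure` everywhere else matches the pattern we saw both before and after the change of primary.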

The commit advanced from 9.22 to 9.24 on all nodes, so replication from the primary appears to be working.
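
For anyone repeating this check, here is a minimal sketch of comparing commit values across nodes. It assumes the values have the form `view.seqno` (so `9.22` means view 9, sequence number 22), and the node labels are hypothetical. Comparing as integer pairs matters because string or float comparison would order `9.100` before `9.22`.

```python
from typing import Dict, Tuple


def parse_txid(txid: str) -> Tuple[int, int]:
    """Split a 'view.seqno' string into a pair of integers (assumed format)."""
    view, seqno = txid.split(".")
    return int(view), int(seqno)


def all_advanced(before: Dict[str, str], after: Dict[str, str]) -> bool:
    """True if every node's commit moved strictly forward between two samples."""
    return all(parse_txid(after[node]) > parse_txid(before[node]) for node in before)


# Hypothetical node labels; the values are the ones observed in this issue.
before = {"node-0": "9.22", "node-1": "9.22", "node-2": "9.22"}
after = {"node-0": "9.24", "node-1": "9.24", "node-2": "9.24"}

print(all_advanced(before, after))
```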

Some time later, a different node became primary. Sending the proposal to each node individually again showed the same pattern: only the (new) primary could "bind to state".

To Reproduce
Sadly, this appears to be a transient error that we cannot reproduce consistently; it happens occasionally after we start up the network. I can't point to a possible cause yet.

In this case we know earlier proposals succeeded, because /node/network shows the service status as OPEN. We may simply have been lucky and hit the primary through the load balancer.

We are submitting the proposals using signature authentication, with the signing occurring in Azure Key Vault.

Expected behavior
The proposal can be submitted on the secondary nodes.

Environment information
CCF version: 0.18.2
Start node config:

consensus = cft
enclave-file = 
enclave-type = release
ledger-chunk-bytes = 104857600
ledger-dir = 
log-format-json = true
node-address = 10.240.0.112:16384
public-rpc-address = 10.240.0.112:16385
read-only-ledger-dir = 
rpc-address = 10.240.0.112:16385
snapshot-dir = 
snapshot-tx-interval = 10000

[start]
gov-script = 
network-cert-file =
member-info = 

Joining node config:

consensus = cft
enclave-file =
enclave-type = release
ledger-chunk-bytes = 104857600
ledger-dir = 
log-format-json = true
node-address = 10.240.0.9:16384
public-rpc-address = 10.240.0.9:16385
read-only-ledger-dir = 
rpc-address = 10.240.0.9:16385
snapshot-dir =
snapshot-tx-interval = 10000

[join]
network-cert-file = 
target-rpc-address = 10.240.0.112:16385

oe_sign.conf:

# Enclave settings:
Debug=0
NumHeapPages=70000
NumStackPages=1024
NumTCS=8
ProductID=1
SecurityVersion=1
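
To rule out a simple misconfiguration, the two node configs above can be cross-checked with a short script. This is a sketch under assumptions: the files are treated as INI-style text, the `[node]` section name is a hypothetical wrapper added so that Python's `configparser` accepts the top-level keys, and only a few of the keys listed above are included.

```python
import configparser

START_CONFIG = """\
consensus = cft
node-address = 10.240.0.112:16384
rpc-address = 10.240.0.112:16385

[start]
gov-script =
"""

JOIN_CONFIG = """\
consensus = cft
node-address = 10.240.0.9:16384
rpc-address = 10.240.0.9:16385

[join]
network-cert-file =
target-rpc-address = 10.240.0.112:16385
"""


def load_node_config(text: str) -> configparser.ConfigParser:
    """Parse a node config, wrapping top-level keys in a hypothetical [node] section."""
    parser = configparser.ConfigParser()
    parser.read_string("[node]\n" + text)
    return parser


start = load_node_config(START_CONFIG)
join = load_node_config(JOIN_CONFIG)

# The joining node should target the start node's RPC address,
# and both nodes should use the same consensus setting.
target_matches = join["join"]["target-rpc-address"] == start["node"]["rpc-address"]
consensus_matches = join["node"]["consensus"] == start["node"]["consensus"]
print(target_matches, consensus_matches)
```

Both checks pass for the configs in this report, which is consistent with the nodes having joined and replicated successfully.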

Additional context
We are running the network in Kubernetes. Previously, we've gotten opaque errors where the root cause was running out of IP addresses, but that doesn't seem to be the cause here because the CCF Pods all have assigned IP addresses, and we can communicate with each individually.

The logs are not useful. Every node's log ends with something like the following, which appears to have been emitted shortly after startup rather than at the time of our debugging:

{"h_ts":"2021-02-26T14:48:05.969839Z","thread_id":"100","level":"fail","file":"../src/host/main.cpp","number":"765","msg":"No snapshot found: Node will request all historical transactions\n"}Azure Quote Provider: libdcap_quoteprov.so [ERROR]: Could not retreive environment variable for 'AZDCAP_DEBUG_LOG_LEVEL'
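
The log line above is a JSON record followed directly by plain-text output from the Azure quote provider. For anyone filtering these logs by timestamp or level, here is a minimal sketch of separating the two parts with the standard library's `json.JSONDecoder.raw_decode`; the function name `split_log_line` is ours, not part of CCF.

```python
import json
from typing import Tuple


def split_log_line(line: str) -> Tuple[dict, str]:
    """Return the leading JSON record and whatever text trails it on the same line."""
    line = line.lstrip()
    # raw_decode parses one JSON document and reports the index where it ended.
    record, end = json.JSONDecoder().raw_decode(line)
    return record, line[end:].strip()


# The line quoted in this issue, as a raw string so the JSON escape \n is preserved.
sample = (
    r'{"h_ts":"2021-02-26T14:48:05.969839Z","thread_id":"100","level":"fail",'
    r'"file":"../src/host/main.cpp","number":"765",'
    r'"msg":"No snapshot found: Node will request all historical transactions\n"}'
    "Azure Quote Provider: libdcap_quoteprov.so [ERROR]: "
    "Could not retreive environment variable for 'AZDCAP_DEBUG_LOG_LEVEL'"
)

record, trailing = split_log_line(sample)
print(record["h_ts"], record["level"])
print(trailing)
```

Comparing `h_ts` against the time of the failed proposal makes it easy to confirm that this record predates the debugging session.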
