-
Notifications
You must be signed in to change notification settings - Fork 1.2k
fix: read and write leader time out of lock #1192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…ut when appendentries took a long time
Hi @fedtye, welcome to SOFAStack community, Please sign Contributor License Agreement! After you signed CLA, we will automatically sync the status of this pull request in 3 minutes. |
WalkthroughThe changes reorganize the control flow in the Changes
Sequence Diagram(s)sequenceDiagram
participant Client
participant NodeImpl
participant Lock
Client->>NodeImpl: handleAppendEntriesRequest(request)
alt Node inactive or invalid server ID or stale term
NodeImpl-->>Client: Early error response (no lock)
else
NodeImpl->>Lock: Acquire write lock
NodeImpl->>NodeImpl: Check term/state, step down if needed
NodeImpl->>NodeImpl: Verify leader identity, handle snapshot/log
NodeImpl->>Lock: Release write lock
NodeImpl-->>Client: Response
end
Poem
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ast-grep (0.38.1)jraft-core/src/main/java/com/alipay/sofa/jraft/core/NodeImpl.java✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🔭 Outside diff range comments (2)
jraft-core/src/main/java/com/alipay/sofa/jraft/core/NodeImpl.java (2)
177-177
: 🛠️ Refactor suggestionUse AtomicLong for thread-safe timestamp operations
The
volatile
keyword alone is insufficient for the concurrent read/write pattern introduced by this PR. While it ensures visibility, it doesn't prevent race conditions whenhandleElectionTimeout
reads the timestamp whilehandleAppendEntriesRequest
writes to it.-private volatile long lastLeaderTimestamp; +private final AtomicLong lastLeaderTimestamp = new AtomicLong();And update the accessor methods:
private void updateLastLeaderTimestamp(final long lastLeaderTimestamp) { - this.lastLeaderTimestamp = lastLeaderTimestamp; + this.lastLeaderTimestamp.set(lastLeaderTimestamp); }Also update reads in
isCurrentLeaderValid()
andcheckLeaderLease()
:private boolean isCurrentLeaderValid() { - return Utils.monotonicMs() - this.lastLeaderTimestamp < this.options.getElectionTimeoutMs(); + return Utils.monotonicMs() - this.lastLeaderTimestamp.get() < this.options.getElectionTimeoutMs(); }Also applies to: 1784-1786
552-552
: 💡 Verification agent🧩 Analysis chain
Ensure consistent usage of atomic timestamp operations
After converting
lastLeaderTimestamp
toAtomicLong
, ensure all usages are updated:Also applies to: 1316-1316, 1776-1778, 2293-2293
🏁 Script executed:
#!/bin/bash # Find all references to lastLeaderTimestamp to ensure complete migration rg -n "lastLeaderTimestamp" --type javaLength of output: 822
Migrate lastLeaderTimestamp to AtomicLong
It looks like
lastLeaderTimestamp
is still declared as avolatile long
and is being accessed/updated directly. To complete the migration toAtomicLong
, update the following in NodeImpl.java:• Declaration (around line 177):
// before private volatile long lastLeaderTimestamp; // after private final AtomicLong lastLeaderTimestamp = new AtomicLong();• Constructor (where you call updateLastLeaderTimestamp):
- updateLastLeaderTimestamp(Utils.monotonicMs()); + this.lastLeaderTimestamp.set(Utils.monotonicMs());• updateLastLeaderTimestamp method (lines 1784–1785):
- private void updateLastLeaderTimestamp(final long lastLeaderTimestamp) { - this.lastLeaderTimestamp = lastLeaderTimestamp; - } + private void updateLastLeaderTimestamp(final long ts) { + this.lastLeaderTimestamp.set(ts); + }• All direct reads (lines 1777 and 1781):
- return monotonicNowMs - this.lastLeaderTimestamp < ... + return monotonicNowMs - this.lastLeaderTimestamp.get() < ...Ensuring these changes will guarantee atomic, thread-safe timestamp updates and reads.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
jraft-core/src/main/java/com/alipay/sofa/jraft/core/NodeImpl.java
(2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
jraft-core/src/main/java/com/alipay/sofa/jraft/core/NodeImpl.java (3)
jraft-core/src/main/java/com/alipay/sofa/jraft/util/RpcFactoryHelper.java (1)
RpcFactoryHelper
(25-37)jraft-core/src/main/java/com/alipay/sofa/jraft/rpc/RpcRequests.java (1)
AppendEntriesResponse
(10068-10915)jraft-core/src/main/java/com/alipay/sofa/jraft/util/Utils.java (1)
Utils
(59-527)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: sca
🔇 Additional comments (1)
jraft-core/src/main/java/com/alipay/sofa/jraft/core/NodeImpl.java (1)
620-625
:⚠️ Potential issueCritical: Thread safety violation - reading shared state without synchronization
Reading
this.state
and callingisCurrentLeaderValid()
(which readslastLeaderTimestamp
) without holding any lock creates race conditions. The state could change between the check and when the write lock is acquired, andlastLeaderTimestamp
could be read while being written by another thread.Consider using a read lock for these checks:
private void handleElectionTimeout() { - if (this.state != State.STATE_FOLLOWER) { - return; - } - if (isCurrentLeaderValid()) { - return; - } + this.readLock.lock(); + try { + if (this.state != State.STATE_FOLLOWER) { + return; + } + if (isCurrentLeaderValid()) { + return; + } + } finally { + this.readLock.unlock(); + } boolean doUnlock = true; this.writeLock.lock();Likely an incorrect or invalid review comment.
if (!this.state.isActive()) { | ||
LOG.warn("Node {} is not in active state, currTerm={}.", getNodeId(), this.currTerm); | ||
return RpcFactoryHelper // | ||
.responseFactory() // | ||
.newResponse(AppendEntriesResponse.getDefaultInstance(), RaftError.EINVAL, | ||
"Node %s is not in active state, state %s.", getNodeId(), this.state.name()); | ||
} | ||
|
||
final PeerId serverId = new PeerId(); | ||
if (!serverId.parse(request.getServerId())) { | ||
LOG.warn("Node {} received AppendEntriesRequest from {} serverId bad format.", getNodeId(), | ||
request.getServerId()); | ||
return RpcFactoryHelper // | ||
.responseFactory() // | ||
.newResponse(AppendEntriesResponse.getDefaultInstance(), RaftError.EINVAL, | ||
"Parse serverId failed: %s.", request.getServerId()); | ||
} | ||
|
||
// Check stale term | ||
if (request.getTerm() < this.currTerm) { | ||
LOG.warn("Node {} ignore stale AppendEntriesRequest from {}, term={}, currTerm={}.", getNodeId(), | ||
request.getServerId(), request.getTerm(), this.currTerm); | ||
return AppendEntriesResponse.newBuilder() // | ||
.setSuccess(false) // | ||
.setTerm(this.currTerm) // | ||
.build(); | ||
} | ||
updateLastLeaderTimestamp(Utils.monotonicMs()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Critical: Multiple thread safety violations in lock-free validation
Moving these checks outside the write lock introduces several race conditions:
- Reading
this.state
without synchronization (line 1938) - Reading
this.currTerm
without synchronization (line 1957) - Writing to
lastLeaderTimestamp
without synchronization (line 1965)
These fields are modified under write locks elsewhere, creating data races. The state and term could change between validation and lock acquisition, leading to incorrect behavior.
To fix this while still addressing the original timeout issue, consider one of these approaches:
Option 1: Use atomic operations for timestamp
// Declare as AtomicLong
private final AtomicLong lastLeaderTimestamp = new AtomicLong();
// In handleAppendEntriesRequest (before lock)
lastLeaderTimestamp.set(Utils.monotonicMs());
// In isCurrentLeaderValid
return Utils.monotonicMs() - lastLeaderTimestamp.get() < this.options.getElectionTimeoutMs();
Option 2: Use read lock for validation
+ this.readLock.lock();
+ try {
if (!this.state.isActive()) {
LOG.warn("Node {} is not in active state, currTerm={}.", getNodeId(), this.currTerm);
return RpcFactoryHelper //
.responseFactory() //
.newResponse(AppendEntriesResponse.getDefaultInstance(), RaftError.EINVAL,
"Node %s is not in active state, state %s.", getNodeId(), this.state.name());
}
+ } finally {
+ this.readLock.unlock();
+ }
final PeerId serverId = new PeerId();
if (!serverId.parse(request.getServerId())) {
// ... parsing validation (safe without lock) ...
}
+ this.readLock.lock();
+ try {
// Check stale term
if (request.getTerm() < this.currTerm) {
// ... stale term handling ...
}
+ } finally {
+ this.readLock.unlock();
+ }
updateLastLeaderTimestamp(Utils.monotonicMs());
🤖 Prompt for AI Agents
In jraft-core/src/main/java/com/alipay/sofa/jraft/core/NodeImpl.java around
lines 1938 to 1965, the code reads and writes shared fields this.state,
this.currTerm, and lastLeaderTimestamp without synchronization, causing thread
safety issues and race conditions. To fix this, either convert
lastLeaderTimestamp to an AtomicLong and update it atomically outside the write
lock, or perform all validation reads under a read lock to ensure consistent and
thread-safe access to these fields. Choose one approach and apply it
consistently to prevent data races while maintaining the intended timeout
behavior.
Motivation:
When appendEntry takes time more than electiontimeout, election-timer would block in writer lock and then dectect leader timeout.
Modification:
It's no matter to write and read leader timestamp out of lock
Result:
Summary by CodeRabbit