Skip to content

Fix resume issues [JENKINS-49686] and [JENKINS-50199] and [JENKINS-50407] #93

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Apr 8, 2018

Conversation

svanoort
Copy link
Member

@svanoort svanoort commented Mar 28, 2018

Aims to resolve (in conjunction with workflow-cps changes as well that help harden the logic):

Includes lazy-load for FlowExecutions (enhancement): JENKINS-45585 -- which if merged will replace 49086 because it will defer a ton of the load-time logic for completed Pipelines.

TODOs:

  • Cross-test with workflow-cps durability tests
  • Fix issue with unitialized listener field for unable-to-resume flows (may mandate changes in wf-cps too)
  • Hardening: consider forcing atomic write of XML file when re-saving with the completed field persisted -- even when normally non-atomic (guarantees we won't mangle original build record if an error occurs upon save).

@abayer abayer requested review from abayer and jglick March 30, 2018 15:03
Copy link
Member

@abayer abayer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to come back and do a second pass on the logic around completed, but feels good so far.

@@ -171,6 +171,9 @@ public String url() {

private transient boolean allowKill;

/** Controls whether or not our execution has been initialized via its {@link FlowExecution#onLoad(FlowExecutionOwner)} method yet if.*/
transient boolean executionLoaded = false; // NonPrivate for tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually wonder if that might be something we want to expose outside of just tests - I don't have a specific scenario in mind, but I know I've stumbled across cases where this could be useful information. I think. I may be getting confused with something else, though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can always expose it with a getter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

…if and when there is a downstream PR demonstrating the need yada yada.


/** map from node IDs to log positions from which we should copy text */
private Map<String,Long> logsToCopy;
Map<String,Long> logsToCopy;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth a // Non-private for testing comment, but not strictly necessary since that's basically the only reason anything we do ever is package-private.

return logCopyGuard;
}

/** Avoids creating new instances, analogous to {@link TaskListener} but as full StreamBuildListener. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think you meant TaskListener.NULL here?

@svanoort svanoort changed the title [WIP] Fix resume issues [JENKINS-49686] and [JENKINS-50199] and [JENKINS-50407] [WIP] Fix resume issues [JENKINS-46986] and [JENKINS-50199] and [JENKINS-50407] Apr 3, 2018
@svanoort svanoort changed the title [WIP] Fix resume issues [JENKINS-46986] and [JENKINS-50199] and [JENKINS-50407] [WIP] Fix resume issues [JENKINS-49686] and [JENKINS-50199] and [JENKINS-50407] Apr 3, 2018
@jglick jglick changed the title [WIP] Fix resume issues [JENKINS-49686] and [JENKINS-50199] and [JENKINS-50407] Fix resume issues [JENKINS-49686] and [JENKINS-50199] and [JENKINS-50407] Apr 3, 2018
Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Serial format of completed should be changed.

As to the implementation, sounds OK, though the changes are too extensive for me to really review seriously (in part due to reindentation perhaps?), so I hope tests will catch any mistakes.

@@ -68,6 +68,7 @@
<workflow-support-plugin.version>2.17</workflow-support-plugin.version>
<scm-api-plugin.version>2.1.1</scm-api-plugin.version>
<git-plugin.version>3.2.0</git-plugin.version>
<jenkins-test-harness.version>2.33</jenkins-test-harness.version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just update parent POM to 3.4 or newer.

@@ -171,6 +171,9 @@ public String url() {

private transient boolean allowKill;

/** Controls whether or not our execution has been initialized via its {@link FlowExecution#onLoad(FlowExecutionOwner)} method yet if.*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stray if?

@@ -171,6 +171,9 @@ public String url() {

private transient boolean allowKill;

/** Controls whether or not our execution has been initialized via its {@link FlowExecution#onLoad(FlowExecutionOwner)} method yet if.*/
transient boolean executionLoaded = false; // NonPrivate for tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

…if and when there is a downstream PR demonstrating the need yada yada.

*/
private transient AtomicBoolean completed;
AtomicBoolean completed; // Non-private for testing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, you really do not want this to be a serialized field of the build.xml:

  <completed>
    <value>1</value>
  </completed>

Ugh! Use boolean completed (or Boolean if null is a meaningful state, or an enum) and if you need to synchronize access to it against something, I guess use logCopyGuard.

private transient AtomicBoolean completed;
AtomicBoolean completed; // Non-private for testing

private transient Object logCopyGuard = new Object();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given getLogCopyGuard, could drop the field initializer.

@@ -506,9 +535,9 @@ private void copyLogs() {

String prefix = getLogPrefix(node);
if (prefix != null) {
logger = new LogLinePrefixOutputFilter(listener.getLogger(), "[" + prefix + "] ");
logger = new LogLinePrefixOutputFilter(getListener().getLogger(), "[" + prefix + "] ");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tons of merge conflicts with #27 coming up, sigh. Was the introduction of getListener() necessary to solve one of these bugs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I didn't want to introduce any more conflicts than needed but unfortunately this was the cleanest way to guard against NPEs with the listener that were seen in tests. FWIW I'll be happy to deal with the conflict resolution as part of the log rewrite integration since I'm introducing most of the conflicts -- seems only fair.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, person creating branch resolves merge conflicts. I will deal with it soon.

Timer.get().submit(() -> {
LOGGER.log(Level.INFO, "{0} completed: {1}", new Object[]{toString(), getResult()});
if (listener == null) {
// Can we actually rely on this, now that we use an initializing getter?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a TODO?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something like that, more of a "I wonder what @jglick will say in review"

FlowExecutionListener.fireResumed(fetchedExecution);

getListener().getLogger().println("Resuming build at " + new Date() + " after Jenkins restart");
Timer.get().submit(new Runnable() { // JENKINS-31614
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as you are shuffling code around, might as well:

Timer.get().submit(() -> Queue.getInstance().schedule(new AfterRestartTask(this), 0)); // JENKINS-31614

});
}
}
} finally { // Ensure the run is ALWAYS removed from loading even if something failed, so threads awaken.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/me runs away in fear

@Test public void lazyLoadExecution() {
story.thenWithHardShutdown(r -> {
WorkflowJob p = r.jenkins.createProject(WorkflowJob.class, "p");
p.setDefinition(new CpsFlowDefinition("echo 'dosomething'", true));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to set the durability mode too? Best to not rely on defaults if it is significant here.

@svanoort
Copy link
Member Author

svanoort commented Apr 3, 2018

@svanoort svanoort requested a review from jglick April 6, 2018 13:22
/** null until started, or after serious failures or hard kill */
private @CheckForNull FlowExecution execution;
/** Null until started, or after serious failures or hard kill. */
@CheckForNull FlowExecution execution; // Not private for test use only
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then I guess should be @Restricted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nah, it's package-private, so it's already more restricted than if I used @Restricted on it

static final StreamBuildListener NULL_LISTENER = new StreamBuildListener(new NullStream());

/** Used internally to ensure listener has been initialized correctly. */
StreamBuildListener getListener() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

synchronized?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I believe there was a specific reason why it wasn't based on testing but... darned if I can remember the issue/exact chain of reasoning since it was rather complex.

Give me a few, I'll either synchronize or provide a detailed comment to explain the rationale why not to.

LOGGER.log(Level.WARNING, "failed to save " + this, x);
}
Timer.get().submit(() -> {
LOGGER.log(Level.INFO, "{0} completed: {1}", new Object[]{toString(), getResult()});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Level.INFO really wise here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, that's what it was at before -- it should only run once per build (or at most twice in rare circumstances if something fails catastrophically right in the middle of the finalization).

}
Timer.get().submit(() -> {
LOGGER.log(Level.INFO, "{0} completed: {1}", new Object[]{toString(), getResult()});
if (listener == null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about getListener()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the comment on synchronizing getListener() -- give me a bit and I'll either have it changed or better documented in code comments.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a shallow inspection looks OK.

FlowExecution fetchedExecution = getExecution(); // Triggers execution.onLoad so we can resume running if not done

if (fetchedExecution != null) {
fetchedExecution.addListener(new GraphL());
executionPromise.set(fetchedExecution);

if (completed == null) {
completed = new AtomicBoolean(fetchedExecution.isComplete());
completed = Boolean.valueOf(fetchedExecution.isComplete());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: can just drop the valueOf call as this is done by autoboxing anyway.

@svanoort
Copy link
Member Author

svanoort commented Apr 8, 2018

Merging now that I've done one more round of integration testing locally using the related workflow-cps PR and an updated snapshot.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants