Skip to content

[JENKINS-43194] Use merge_commit_sha for master #209

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 27, 2019

Conversation

bitwiseman
Copy link
Contributor

Work-in-progress.

@jglick jglick requested review from stephenc and jglick March 1, 2019 16:30
@jglick
Copy link
Member

jglick commented Mar 1, 2019

FTR: goes back to an approach abandoned in #60.

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what you are missing is the basic promise of PullRequestSCMRevision: that it includes the minimal state needed to represent a particular, deterministic snapshot of the code to be checked out. For a merge head, that would include two commit SHAs: one taken from the PR at the time of indexing or event (not necessarily the current head of the PR!); and one taken from the base branch—again not necessarily the current head, and possibly not even a recent one (see discussion in JENKINS-37491).

The plugin deliberately implements its own merging (in JGit) of the exact two selected revisions, which in the general case requires a clone and local files. The GitHub API only offers some semi-documented convenience APIs (IIRC dropped from v4 and at best tentative in v3) to obtain an otherwise unadvertised merge of some recent revisions of the PR branch with the base branch.

If they had a first-class API whereby you could call methods like getFileContent and pass two SHAs, with the service performing a transient merge or the logical equivalent, then that would be perfect. But they do not. My suggestion in JIRA (not prototyped!) is to use something like the /repos/<owner>/<repo>/compare/<branch>...<hash> API to find the common ancestor of the base and pull commits; then check whether there are three distinct file contents in those three commits. If so, abort; otherwise you can pick one as the effective merge contents. This would preserve the existing behavior (to a reasonable degree of confidence), minus the heavyweight checkout.

@@ -981,19 +982,22 @@ public GHPermissionType fetch(String username) throws IOException, InterruptedEx
public SCMRevision create(@NonNull PullRequestSCMHead head,
@Nullable Void ignored)
throws IOException, InterruptedException {
String baseHash = pr.getBase().getSha();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessarily a meaningful value, as noted here. It is in particular not necessarily the current head commit of the base branch (what is being calculated in the deleted code).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a change from the existing behavior - if you read the code, all I did was move the assignments/return statements around. If you like, I can put them back so they don't show up in the diff.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all I did was move the assignments/return statements around

Unclear to me. The original code called GhRepository.getRef on the base branch, which would retrieve its current head. The new code is extracting .base.sha from the PR API response object, which can be a completely different value. (IIRC it is typically, but not always, the last common ancestor between the two branches.)

I can put them back so they don't show up in the diff.

If some part of a patch can be reverted so that the real changes can be seen to have more localized effects than of course this is preferable.

mergeRef.getObject().getSha(),
pr.getHead().getSha());
baseHash = ghRepository.getRef("heads/" + head.getTarget().getName()).getObject().getSha();
mergeHash = pr.getMergeCommitSha();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is going to produce a synthetic commit which may or may not have as parents the baseHash and the .head.sha. Depends on exactly when you call it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I'm not sure what you're pointing out here or why.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That the new behavior will be nondeterministic, and the PullRequestSCMRevision may wind up with three fields that are not in fact consistent.

assert !pr.isMerge(); // TODO see below
this.ref = ((PullRequestSCMRevision) rev).getPullHash();
if (pr.isMerge()) {
this.ref = ((PullRequestSCMRevision) rev).getMergeHash();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will in general produce different results from what GitHubSCMBuilder.build does when you actually checkout scm.

Copy link
Contributor Author

@bitwiseman bitwiseman Mar 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it? Git merge is in one place isn't git merge in another?
Could you explain or point me to documentation that will help me understand your point?

The merge_commit_sha is guaranteed to be be a valid automatic merge.
How significant can the differences be?

It will be a different commit sha, but that is already true for each agent.
I understand that what you are saying is factually true, but to my knowledge the difference is theoretical rather than practical.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Git merge is in one place isn't git merge in another?

The resulting trees would be identical if

  • both merges started with the same parent commits
  • both merges used the same algorithm and other options

The second point is perhaps only theoretical. The first point seems to be a real possibility.

How significant can the differences be?

Who knows? Its behavior is undocumented.

At a minimum, you would have to replace the existing behavior of GitHubSCMBuilder.build to ensure that it was using the same merge commit rather than doing its own local merge.

*
* @return The commit hash of the head of the pull request branch
*/
@CheckForNull
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will be null if you load an old serialized object, and yet the one caller is not checking for null.

Copy link
Contributor Author

@bitwiseman bitwiseman Mar 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is once again, POC. The scenario for the caller is guaranteed to never have null.
In final version, this will need to be safely migrated.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I was just mentioning a potential problem while I was here, in case you had not noticed it yet.

@bitwiseman
Copy link
Contributor Author

bitwiseman commented Mar 1, 2019

@jglick

FTR: goes back to an approach abandoned in #60.

I don't think it does. At minimum, this actually uses the GHPullRequest.getMergeCommitSha() whereas that does not.

I think what you are missing is the basic promise of PullRequestSCMRevision: that it includes the minimal state needed to represent a particular, deterministic snapshot of the code to be checked out. For a merge head, that would include two commit SHAs: one taken from the PR at the time of indexing or event (not necessarily the current head of the PR!); and one taken from the base branch—again not necessarily the current head, and possibly not even a recent one (see discussion in JENKINS-37491).

I'm not getting your point. Once the PullRequestSCMRevision is created, it is just as deterministic as it was before this change. When it is created, the Base and Pull hashes (and if applicable the Merge Hash) are set and aren't changed. I'm not getting where the problem is.

The plugin deliberately implements its own merging (in JGit) of the exact two selected revisions, which in the general case requires a clone and local files. The GitHub API only offers some semi-documented convenience APIs (IIRC dropped from v4 and at best tentative in v3) to obtain an otherwise unadvertised merge of some recent revisions of the PR branch with the base branch.

I don't know what APIs you're referring to, but I'm not sure what that have to do with this change.
The API's used here are available in v3 and v4 and are well documented.

My suggestion in JIRA (not prototyped!) is to use something like the /repos/<owner>/<repo>/compare/<branch>...<hash> API to find the common ancestor of the base and pull commits; then check whether there are three distinct file contents in those three commits. If so, abort; otherwise you can pick one as the effective merge contents. This would preserve the existing behavior (to a reasonable degree of confidence), minus the heavyweight checkout.

That could work, too. We might need to use it for the bitbucket plugin. But in this plugin it would be a much larger change than querying and using merge_commit_sha.

I'm trying to make sense of your points here. I get that you're saying there are some cases where this change might not work in a provably correct and deterministic way, but I can't make sense when you think those cases will occur. You're not saying in which customer scenario those failures would occur. How would a customer produce/reproduce the problem(s) you're describing? What is the severity of the problem(s)?

Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would a customer produce/reproduce the problem(s) you're describing?

I do not have the time to work out test cases for this I am afraid. You would have to consider various scenarios involving commits appearing on the base and/or PR branches, including merge commits of base into PR branch appearing on the PR branch, at different times interleaved with the scan or event hook vs. loading of Jenkinsfile contents vs. checkout scm, and introducing artificial delays between API calls to discover race conditions.

What is the severity of the problem(s)?

Again TBD. My broader point is that it is simpler to reason about the correctness of the current behavior, which relies only on straightforward calls to reasonably well documented APIs (that also would be likely to apply to other Git hosting as you mentioned): it picks up the two branch head references at the time of event/scan, saves them, and when requested merges them exactly as CLI git merge would.

@@ -981,19 +982,22 @@ public GHPermissionType fetch(String username) throws IOException, InterruptedEx
public SCMRevision create(@NonNull PullRequestSCMHead head,
@Nullable Void ignored)
throws IOException, InterruptedException {
String baseHash = pr.getBase().getSha();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all I did was move the assignments/return statements around

Unclear to me. The original code called GhRepository.getRef on the base branch, which would retrieve its current head. The new code is extracting .base.sha from the PR API response object, which can be a completely different value. (IIRC it is typically, but not always, the last common ancestor between the two branches.)

I can put them back so they don't show up in the diff.

If some part of a patch can be reverted so that the real changes can be seen to have more localized effects than of course this is preferable.

mergeRef.getObject().getSha(),
pr.getHead().getSha());
baseHash = ghRepository.getRef("heads/" + head.getTarget().getName()).getObject().getSha();
mergeHash = pr.getMergeCommitSha();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That the new behavior will be nondeterministic, and the PullRequestSCMRevision may wind up with three fields that are not in fact consistent.

assert !pr.isMerge(); // TODO see below
this.ref = ((PullRequestSCMRevision) rev).getPullHash();
if (pr.isMerge()) {
this.ref = ((PullRequestSCMRevision) rev).getMergeHash();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Git merge is in one place isn't git merge in another?

The resulting trees would be identical if

  • both merges started with the same parent commits
  • both merges used the same algorithm and other options

The second point is perhaps only theoretical. The first point seems to be a real possibility.

How significant can the differences be?

Who knows? Its behavior is undocumented.

At a minimum, you would have to replace the existing behavior of GitHubSCMBuilder.build to ensure that it was using the same merge commit rather than doing its own local merge.

*
* @return The commit hash of the head of the pull request branch
*/
@CheckForNull
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I was just mentioning a potential problem while I was here, in case you had not noticed it yet.

@jglick
Copy link
Member

jglick commented Mar 6, 2019

Another risk of merge_commit_sha is that it is undocumented what its lifecycle is. While the documentation does discuss what new values of this field would be after getting the PR metadata after it is merged, there is no discussion of what happens to a previously obtained merge commit beyond

This test commit is not added to the base branch or the head branch.

It will not be an ancestor of any visible ref, so by typical Git rules it would subject to garbage collection. Does the highly specialized backend used by GitHub in fact collect it? Hard to say. The value of the “test merge commit” (that returned for a currently open PR) is only discussed in the context of checking mergeability and the mergeable attribute. Is such a commit still valid after new changes are pushed to the PR and/or base branches? After the PR is merged? After the PR branch is deleted?

These questions are relevant for Jenkins because it is expected that a (deterministic) SCMRevision is valid indefinitely; for example the Replay action will run a fresh build which uses it (for both Jenkinsfile via SCMFileSystem and checkout scm via SCMBuilder.build). The current implementation of PullRequestSCMRevision based on a pair of pullHash and baseHash is safe because (at least assuming users refrain from destructive history operations) both of those commits are visible in the repository’s permanent record, and you can freely remerge them whenever you need to with identical results. It is hard to make a similar promise for merge_commit_sha.


As an interesting aside: Prow, used by JX “serverless”, bypasses all these issues because regular PR builds are only ever done on the PR head. Whereas Jenkins in github-branch-source’s merge mode tries to keep the head commit status up to date in response to base branch changes, under the assumption that some human will be clicking the Merge button and relying on the green check mark to know when it is safe to do so, Prow (in its “Tide” component) takes a different approach: the commit status is not trusted except to determine whether the PR should be added to a “merge pool”, and the physical merge when approved is performed by Tide after rerunning a final validation build on the proposed merge result (possibly including multiple PRs in the same pool).

We could try to emulate this logic, perhaps: trusted users could use a label or ChatOps to request that Jenkins perform a merge when it can. (Guiding users to use Prow/Tide in conjunction with Jenkins is also an option, though it requires a Kubernetes cluster and a bunch of configuration that would need to be synchronized with Jenkins’ own somehow.) People do frequently request such a feature. From a security perspective, full operation of github-branch-source already requires granting the Jenkins access token write access to the repository (both to create commit statuses, and to verify which PR authors are trusted), and I am not sure whether GitHub apps are capable of finer discrimination.

@bitwiseman
Copy link
Contributor Author

How would a customer produce/reproduce the problem(s) you're describing?

I do not have the time to work out test cases for this I am afraid.

Well, the scenario matters. If the non-determinism you're talking about is only encountered under rare conditions and has minor consequences, then I would say this change is still valid. My analysis indicates that is exactly the case here.

Assumption: Most (if not all) customers run PRs with the merge strategy because they want to know if their latest change is mergeable.

Scenario 1: When changes appear on the base or PR branch after the build is triggered but before the PR merge commit is queried. In that case, the build could pick up a later commit. However, all that results in two builds of the same newer commit, which in the PR scenario is a non-issue. Customers run PRs with the merge strategy because they want to know if their latest change is mergeable.

(Scenario 2 below)

Another risk of merge_commit_sha is that it is undocumented what its lifecycle is. While the documentation does discuss what new values of this field would be after getting the PR metadata after it is merged, there is no discussion of what happens to a previously obtained merge commit beyond

This test commit is not added to the base branch or the head branch.

It will not be an ancestor of any visible ref, so by typical Git rules it would subject to garbage collection. Does the highly specialized backend used by GitHub in fact collect it? Hard to say.

No, I tested this and the commit remains available via the api for at least several days - I haven't seen one become unavailable yet. But this behavior is undocumented, let's assume they are garbage collected immediately, what is the result? A build might fail due to asking for file from a merge commit that doesn't exist. But that isn't particularly bad - the build failed because the base or PR have changed so the user probably wants to run on the new changes not the old ones.

The value of the “test merge commit” (that returned for a currently open PR) is only discussed in the context of checking mergeability and the mergeable attribute. Is such a commit still valid after new changes are pushed to the PR and/or base branches? After the PR is merged? After the PR branch is deleted?

According to my testing, yes on all three counts.

These questions are relevant for Jenkins because it is expected that a (deterministic) SCMRevision is valid indefinitely; for example the Replay action will run a fresh build which uses it (for both Jenkinsfile via SCMFileSystem and checkout scm via SCMBuilder.build). The current implementation of PullRequestSCMRevision based on a pair of pullHash and baseHash is safe because (at least assuming users refrain from destructive history operations) both of those commits are visible in the repository’s permanent record, and you can freely remerge them whenever you need to with identical results. It is hard to make a similar promise for merge_commit_sha.

Scenario 2: Customer goes back replay an old PR build. Funny thing - this was already broken before this change. When I click "replay" the commits that are run by replay for merge PRs are the latest from both base and PR branch. This change doesn't modify how base and PR heads are determined, so ... This doesn't mean that this is okay, but fixing it a separate issue and also supports my base assumption as no one has filed this as bug.

As an interesting aside: Prow, used by JX “serverless”, bypasses all these issues because regular PR builds are only ever done on the PR head.

This also an interesting idea but out of scope in relation to fixing this plugin such that merge PRs do not need to be cloned on master.

From what I can see, using diffs will take significantly greater investment of resources and will provide little to no value in core customer scenarios.

@bitwiseman bitwiseman force-pushed the github-api-poc branch 2 times, most recently from 7c76b1f to 1ff3ff2 Compare March 26, 2019 00:15
It is possible for the base and source branches to change between when
a PR is seen and we look on the the commits and generate the merge commit.

This change adds checks that the three commits in the PullRequestSCMRevision
are consistent with each other.
@bitwiseman
Copy link
Contributor Author

bitwiseman commented Mar 26, 2019

After discussion with Jesse, I've updated this to check the consistency of the base, source, and merge commits. This should result in builds failing early when inconsistent commits are found.

@bitwiseman bitwiseman marked this pull request as ready for review March 27, 2019 18:24
@bitwiseman
Copy link
Contributor Author

I will not have time to do automated tests before I go on vacation. I don't feel this is a blocker for merging (there is plenty of this plugin that is doesn't have unit tests) and I have manually tested this sufficiently for it to go out for in an alpha release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants