[JENKINS-43194] Use merge_commit_sha for master #209

bitwiseman · 2019-03-01T15:51:24Z

Work-in-progress.

jglick · 2019-03-01T16:30:48Z

FTR: goes back to an approach abandoned in #60.

jglick

I think what you are missing is the basic promise of PullRequestSCMRevision: that it includes the minimal state needed to represent a particular, deterministic snapshot of the code to be checked out. For a merge head, that would include two commit SHAs: one taken from the PR at the time of indexing or event (not necessarily the current head of the PR!); and one taken from the base branch—again not necessarily the current head, and possibly not even a recent one (see discussion in JENKINS-37491).

The plugin deliberately implements its own merging (in JGit) of the exact two selected revisions, which in the general case requires a clone and local files. The GitHub API only offers some semi-documented convenience APIs (IIRC dropped from v4 and at best tentative in v3) to obtain an otherwise unadvertised merge of some recent revisions of the PR branch with the base branch.

If they had a first-class API whereby you could call methods like getFileContent and pass two SHAs, with the service performing a transient merge or the logical equivalent, then that would be perfect. But they do not. My suggestion in JIRA (not prototyped!) is to use something like the /repos/<owner>/<repo>/compare/<branch>...<hash> API to find the common ancestor of the base and pull commits; then check whether there are three distinct file contents in those three commits. If so, abort; otherwise you can pick one as the effective merge contents. This would preserve the existing behavior (to a reasonable degree of confidence), minus the heavyweight checkout.

jglick · 2019-03-01T21:01:30Z

src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubSCMSource.java

@@ -981,19 +982,22 @@ public GHPermissionType fetch(String username) throws IOException, InterruptedEx
                                            public SCMRevision create(@NonNull PullRequestSCMHead head,
                                                                      @Nullable Void ignored)
                                                    throws IOException, InterruptedException {
+                                                String baseHash = pr.getBase().getSha();


This is not necessarily a meaningful value, as noted here. It is in particular not necessarily the current head commit of the base branch (what is being calculated in the deleted code).

This is not a change from the existing behavior - if you read the code, all I did was move the assignments/return statements around. If you like, I can put them back so they don't show up in the diff.

all I did was move the assignments/return statements around

Unclear to me. The original code called GhRepository.getRef on the base branch, which would retrieve its current head. The new code is extracting .base.sha from the PR API response object, which can be a completely different value. (IIRC it is typically, but not always, the last common ancestor between the two branches.)

I can put them back so they don't show up in the diff.

If some part of a patch can be reverted so that the real changes can be seen to have more localized effects than of course this is preferable.

jglick · 2019-03-01T21:02:57Z

src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubSCMSource.java

-                                                                mergeRef.getObject().getSha(),
-                                                                pr.getHead().getSha());
+                                                        baseHash = ghRepository.getRef("heads/" + head.getTarget().getName()).getObject().getSha();
+                                                        mergeHash = pr.getMergeCommitSha();


So this is going to produce a synthetic commit which may or may not have as parents the baseHash and the .head.sha. Depends on exactly when you call it.

Again, I'm not sure what you're pointing out here or why.

That the new behavior will be nondeterministic, and the PullRequestSCMRevision may wind up with three fields that are not in fact consistent.

jglick · 2019-03-01T21:05:17Z

src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubSCMFileSystem.java

-                assert !pr.isMerge(); // TODO see below
-                this.ref = ((PullRequestSCMRevision) rev).getPullHash();
+                if (pr.isMerge()) {
+                    this.ref = ((PullRequestSCMRevision) rev).getMergeHash();


This will in general produce different results from what GitHubSCMBuilder.build does when you actually checkout scm.

Will it? Git merge is in one place isn't git merge in another?
Could you explain or point me to documentation that will help me understand your point?

The merge_commit_sha is guaranteed to be be a valid automatic merge.
How significant can the differences be?

It will be a different commit sha, but that is already true for each agent.
I understand that what you are saying is factually true, but to my knowledge the difference is theoretical rather than practical.

Git merge is in one place isn't git merge in another?

The resulting trees would be identical if

both merges started with the same parent commits

both merges used the same algorithm and other options

The second point is perhaps only theoretical. The first point seems to be a real possibility.

How significant can the differences be?

Who knows? Its behavior is undocumented.

At a minimum, you would have to replace the existing behavior of GitHubSCMBuilder.build to ensure that it was using the same merge commit rather than doing its own local merge.

jglick · 2019-03-01T21:10:09Z

src/main/java/org/jenkinsci/plugins/github_branch_source/PullRequestSCMRevision.java

+     *
+     * @return The commit hash of the head of the pull request branch
+     */
+    @CheckForNull


It will be null if you load an old serialized object, and yet the one caller is not checking for null.

This is once again, POC. The scenario for the caller is guaranteed to never have null.
In final version, this will need to be safely migrated.

Sure, I was just mentioning a potential problem while I was here, in case you had not noticed it yet.

bitwiseman · 2019-03-01T23:31:59Z

@jglick

FTR: goes back to an approach abandoned in #60.

I don't think it does. At minimum, this actually uses the GHPullRequest.getMergeCommitSha() whereas that does not.

I think what you are missing is the basic promise of PullRequestSCMRevision: that it includes the minimal state needed to represent a particular, deterministic snapshot of the code to be checked out. For a merge head, that would include two commit SHAs: one taken from the PR at the time of indexing or event (not necessarily the current head of the PR!); and one taken from the base branch—again not necessarily the current head, and possibly not even a recent one (see discussion in JENKINS-37491).

I'm not getting your point. Once the PullRequestSCMRevision is created, it is just as deterministic as it was before this change. When it is created, the Base and Pull hashes (and if applicable the Merge Hash) are set and aren't changed. I'm not getting where the problem is.

The plugin deliberately implements its own merging (in JGit) of the exact two selected revisions, which in the general case requires a clone and local files. The GitHub API only offers some semi-documented convenience APIs (IIRC dropped from v4 and at best tentative in v3) to obtain an otherwise unadvertised merge of some recent revisions of the PR branch with the base branch.

I don't know what APIs you're referring to, but I'm not sure what that have to do with this change.
The API's used here are available in v3 and v4 and are well documented.

My suggestion in JIRA (not prototyped!) is to use something like the /repos/<owner>/<repo>/compare/<branch>...<hash> API to find the common ancestor of the base and pull commits; then check whether there are three distinct file contents in those three commits. If so, abort; otherwise you can pick one as the effective merge contents. This would preserve the existing behavior (to a reasonable degree of confidence), minus the heavyweight checkout.

That could work, too. We might need to use it for the bitbucket plugin. But in this plugin it would be a much larger change than querying and using merge_commit_sha.

I'm trying to make sense of your points here. I get that you're saying there are some cases where this change might not work in a provably correct and deterministic way, but I can't make sense when you think those cases will occur. You're not saying in which customer scenario those failures would occur. How would a customer produce/reproduce the problem(s) you're describing? What is the severity of the problem(s)?

jglick

How would a customer produce/reproduce the problem(s) you're describing?

I do not have the time to work out test cases for this I am afraid. You would have to consider various scenarios involving commits appearing on the base and/or PR branches, including merge commits of base into PR branch appearing on the PR branch, at different times interleaved with the scan or event hook vs. loading of Jenkinsfile contents vs. checkout scm, and introducing artificial delays between API calls to discover race conditions.

What is the severity of the problem(s)?

Again TBD. My broader point is that it is simpler to reason about the correctness of the current behavior, which relies only on straightforward calls to reasonably well documented APIs (that also would be likely to apply to other Git hosting as you mentioned): it picks up the two branch head references at the time of event/scan, saves them, and when requested merges them exactly as CLI git merge would.

jglick · 2019-03-04T17:46:35Z

src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubSCMSource.java

@@ -981,19 +982,22 @@ public GHPermissionType fetch(String username) throws IOException, InterruptedEx
                                            public SCMRevision create(@NonNull PullRequestSCMHead head,
                                                                      @Nullable Void ignored)
                                                    throws IOException, InterruptedException {
+                                                String baseHash = pr.getBase().getSha();


all I did was move the assignments/return statements around

Unclear to me. The original code called GhRepository.getRef on the base branch, which would retrieve its current head. The new code is extracting .base.sha from the PR API response object, which can be a completely different value. (IIRC it is typically, but not always, the last common ancestor between the two branches.)

I can put them back so they don't show up in the diff.

If some part of a patch can be reverted so that the real changes can be seen to have more localized effects than of course this is preferable.

jglick · 2019-03-04T17:47:43Z

src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubSCMSource.java

-                                                                mergeRef.getObject().getSha(),
-                                                                pr.getHead().getSha());
+                                                        baseHash = ghRepository.getRef("heads/" + head.getTarget().getName()).getObject().getSha();
+                                                        mergeHash = pr.getMergeCommitSha();


That the new behavior will be nondeterministic, and the PullRequestSCMRevision may wind up with three fields that are not in fact consistent.

jglick · 2019-03-04T17:51:29Z

src/main/java/org/jenkinsci/plugins/github_branch_source/GitHubSCMFileSystem.java

-                assert !pr.isMerge(); // TODO see below
-                this.ref = ((PullRequestSCMRevision) rev).getPullHash();
+                if (pr.isMerge()) {
+                    this.ref = ((PullRequestSCMRevision) rev).getMergeHash();


Git merge is in one place isn't git merge in another?

The resulting trees would be identical if

both merges started with the same parent commits

both merges used the same algorithm and other options

The second point is perhaps only theoretical. The first point seems to be a real possibility.

How significant can the differences be?

Who knows? Its behavior is undocumented.

At a minimum, you would have to replace the existing behavior of GitHubSCMBuilder.build to ensure that it was using the same merge commit rather than doing its own local merge.

jglick · 2019-03-04T17:52:15Z

src/main/java/org/jenkinsci/plugins/github_branch_source/PullRequestSCMRevision.java

+     *
+     * @return The commit hash of the head of the pull request branch
+     */
+    @CheckForNull


Sure, I was just mentioning a potential problem while I was here, in case you had not noticed it yet.

jglick · 2019-03-06T13:28:32Z

Another risk of merge_commit_sha is that it is undocumented what its lifecycle is. While the documentation does discuss what new values of this field would be after getting the PR metadata after it is merged, there is no discussion of what happens to a previously obtained merge commit beyond

This test commit is not added to the base branch or the head branch.

It will not be an ancestor of any visible ref, so by typical Git rules it would subject to garbage collection. Does the highly specialized backend used by GitHub in fact collect it? Hard to say. The value of the “test merge commit” (that returned for a currently open PR) is only discussed in the context of checking mergeability and the mergeable attribute. Is such a commit still valid after new changes are pushed to the PR and/or base branches? After the PR is merged? After the PR branch is deleted?

These questions are relevant for Jenkins because it is expected that a (deterministic) SCMRevision is valid indefinitely; for example the Replay action will run a fresh build which uses it (for both Jenkinsfile via SCMFileSystem and checkout scm via SCMBuilder.build). The current implementation of PullRequestSCMRevision based on a pair of pullHash and baseHash is safe because (at least assuming users refrain from destructive history operations) both of those commits are visible in the repository’s permanent record, and you can freely remerge them whenever you need to with identical results. It is hard to make a similar promise for merge_commit_sha.

As an interesting aside: Prow, used by JX “serverless”, bypasses all these issues because regular PR builds are only ever done on the PR head. Whereas Jenkins in github-branch-source’s merge mode tries to keep the head commit status up to date in response to base branch changes, under the assumption that some human will be clicking the Merge button and relying on the green check mark to know when it is safe to do so, Prow (in its “Tide” component) takes a different approach: the commit status is not trusted except to determine whether the PR should be added to a “merge pool”, and the physical merge when approved is performed by Tide after rerunning a final validation build on the proposed merge result (possibly including multiple PRs in the same pool).

We could try to emulate this logic, perhaps: trusted users could use a label or ChatOps to request that Jenkins perform a merge when it can. (Guiding users to use Prow/Tide in conjunction with Jenkins is also an option, though it requires a Kubernetes cluster and a bunch of configuration that would need to be synchronized with Jenkins’ own somehow.) People do frequently request such a feature. From a security perspective, full operation of github-branch-source already requires granting the Jenkins access token write access to the repository (both to create commit statuses, and to verify which PR authors are trusted), and I am not sure whether GitHub apps are capable of finer discrimination.

bitwiseman · 2019-03-07T02:03:37Z

How would a customer produce/reproduce the problem(s) you're describing?

I do not have the time to work out test cases for this I am afraid.

Well, the scenario matters. If the non-determinism you're talking about is only encountered under rare conditions and has minor consequences, then I would say this change is still valid. My analysis indicates that is exactly the case here.

Assumption: Most (if not all) customers run PRs with the merge strategy because they want to know if their latest change is mergeable.

Scenario 1: When changes appear on the base or PR branch after the build is triggered but before the PR merge commit is queried. In that case, the build could pick up a later commit. However, all that results in two builds of the same newer commit, which in the PR scenario is a non-issue. Customers run PRs with the merge strategy because they want to know if their latest change is mergeable.

(Scenario 2 below)

Another risk of merge_commit_sha is that it is undocumented what its lifecycle is. While the documentation does discuss what new values of this field would be after getting the PR metadata after it is merged, there is no discussion of what happens to a previously obtained merge commit beyond

This test commit is not added to the base branch or the head branch.

It will not be an ancestor of any visible ref, so by typical Git rules it would subject to garbage collection. Does the highly specialized backend used by GitHub in fact collect it? Hard to say.

No, I tested this and the commit remains available via the api for at least several days - I haven't seen one become unavailable yet. But this behavior is undocumented, let's assume they are garbage collected immediately, what is the result? A build might fail due to asking for file from a merge commit that doesn't exist. But that isn't particularly bad - the build failed because the base or PR have changed so the user probably wants to run on the new changes not the old ones.

The value of the “test merge commit” (that returned for a currently open PR) is only discussed in the context of checking mergeability and the mergeable attribute. Is such a commit still valid after new changes are pushed to the PR and/or base branches? After the PR is merged? After the PR branch is deleted?

According to my testing, yes on all three counts.

These questions are relevant for Jenkins because it is expected that a (deterministic) SCMRevision is valid indefinitely; for example the Replay action will run a fresh build which uses it (for both Jenkinsfile via SCMFileSystem and checkout scm via SCMBuilder.build). The current implementation of PullRequestSCMRevision based on a pair of pullHash and baseHash is safe because (at least assuming users refrain from destructive history operations) both of those commits are visible in the repository’s permanent record, and you can freely remerge them whenever you need to with identical results. It is hard to make a similar promise for merge_commit_sha.

Scenario 2: Customer goes back replay an old PR build. Funny thing - this was already broken before this change. When I click "replay" the commits that are run by replay for merge PRs are the latest from both base and PR branch. This change doesn't modify how base and PR heads are determined, so ... This doesn't mean that this is okay, but fixing it a separate issue and also supports my base assumption as no one has filed this as bug.

As an interesting aside: Prow, used by JX “serverless”, bypasses all these issues because regular PR builds are only ever done on the PR head.

This also an interesting idea but out of scope in relation to fixing this plugin such that merge PRs do not need to be cloned on master.

From what I can see, using diffs will take significantly greater investment of resources and will provide little to no value in core customer scenarios.

Work-in-progress.

It is possible for the base and source branches to change between when a PR is seen and we look on the the commits and generate the merge commit. This change adds checks that the three commits in the PullRequestSCMRevision are consistent with each other.

bitwiseman · 2019-03-26T00:34:00Z

After discussion with Jesse, I've updated this to check the consistency of the base, source, and merge commits. This should result in builds failing early when inconsistent commits are found.

bitwiseman · 2019-03-27T18:26:55Z

I will not have time to do automated tests before I go on vacation. I don't feel this is a blocker for merging (there is plenty of this plugin that is doesn't have unit tests) and I have manually tested this sufficiently for it to go out for in an alpha release.

jglick requested review from stephenc and jglick March 1, 2019 16:30

jglick reviewed Mar 1, 2019

View reviewed changes

jglick reviewed Mar 4, 2019

View reviewed changes

bitwiseman force-pushed the github-api-poc branch from 30b0e98 to 5be2e88 Compare March 7, 2019 00:10

bitwiseman added 2 commits March 25, 2019 17:11

[JENKINS-43194] Use merge_commit_sha for master

6addac1

Work-in-progress.

Revert minor code cleanup to highlight only merge_commit_sha change

dbd2786

bitwiseman force-pushed the github-api-poc branch 2 times, most recently from 7c76b1f to 1ff3ff2 Compare March 26, 2019 00:15

bitwiseman force-pushed the github-api-poc branch from 1ff3ff2 to fe37f64 Compare March 26, 2019 00:21

bitwiseman marked this pull request as ready for review March 27, 2019 18:24

bitwiseman mentioned this pull request Mar 27, 2019

JENKINS-48737 Fixing lightweight pull request checkouts, also for files in subdirectories jenkinsci/bitbucket-branch-source-plugin#174

Merged

bitwiseman merged commit 497b3c3 into jenkinsci:master Mar 27, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[JENKINS-43194] Use merge_commit_sha for master #209

[JENKINS-43194] Use merge_commit_sha for master #209

bitwiseman commented Mar 1, 2019

jglick commented Mar 1, 2019

jglick left a comment

jglick Mar 1, 2019

bitwiseman Mar 1, 2019

jglick Mar 4, 2019

jglick Mar 1, 2019

bitwiseman Mar 1, 2019

jglick Mar 4, 2019

jglick Mar 1, 2019

bitwiseman Mar 1, 2019 •

edited

Loading

jglick Mar 4, 2019

jglick Mar 1, 2019

bitwiseman Mar 1, 2019 •

edited

Loading

jglick Mar 4, 2019

bitwiseman commented Mar 1, 2019 •

edited

Loading

jglick left a comment

jglick Mar 4, 2019

jglick Mar 4, 2019

jglick Mar 4, 2019

jglick Mar 4, 2019

jglick commented Mar 6, 2019

bitwiseman commented Mar 7, 2019

bitwiseman commented Mar 26, 2019 •

edited

Loading

bitwiseman commented Mar 27, 2019

[JENKINS-43194] Use merge_commit_sha for master #209

[JENKINS-43194] Use merge_commit_sha for master #209

Conversation

bitwiseman commented Mar 1, 2019

jglick commented Mar 1, 2019

jglick left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bitwiseman Mar 1, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bitwiseman Mar 1, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bitwiseman commented Mar 1, 2019 • edited Loading

jglick left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jglick commented Mar 6, 2019

bitwiseman commented Mar 7, 2019

bitwiseman commented Mar 26, 2019 • edited Loading

bitwiseman commented Mar 27, 2019

bitwiseman Mar 1, 2019 •

edited

Loading

bitwiseman Mar 1, 2019 •

edited

Loading

bitwiseman commented Mar 1, 2019 •

edited

Loading

bitwiseman commented Mar 26, 2019 •

edited

Loading