Skip to content

Fix deduplication of virtual packages installed under aliases #6735

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 28, 2025

Conversation

clemyan
Copy link
Member

@clemyan clemyan commented Mar 22, 2025

What's the problem this PR addresses?

Fixes #6573

In order to explain why the bug happens, let's do a refresher on virtual packages and deduplication, and how we end up here

Virtuals

Let's say we have the following packages:

  • pkg-z has a peer dependency on pkg-p@*
  • pkg-a has a dependency on pkg-z@^1.0.0 and pkg-p@^1.0.0
  • pkg-b has a dependency on pkg-z@^1.0.0 and pkg-p@^2.0.0
pkg-a --[ pkg-z@^1.0.0 ]--> pkg-z@npm:1.0.0 ==> pkg-p@*
      --[ pkg-p@^1.0.0 ]--> pkg-p@npm:1.0.0
pkg-b --[ pkg-z@^1.0.0 ]--> pkg-z@npm:1.0.0 ==> pkg-p@*
      --[ pkg-p@^2.0.0 ]--> pkg-p@npm:2.0.0

When we resolve the peer dependencies, we resolve which instance of pkg-p each pkg-z instance should get, then we add that instance of pkg-p as a regular dependency to the pkg-z instance in our internal package data

pkg-a --[ pkg-z@^1.0.0 ]--> pkg-z@npm:1.0.0 --[...]--> pkg-p@npm:1.0.0
pkg-b --[ pkg-z@^1.0.0 ]--> pkg-z@npm:1.0.0 --[...]--> pkg-p@npm:2.0.0

This is a problem because we end up with the same package having two different sets of dependencies. To avoid that, we create virtual packages for each instance of peer requesters (i.e. packages with peer dependencies) in the tree. We also replace descriptors that resolve to those packages with virtual descriptors.

pkg-a --[ pkg-z@virtual:<id-a> ]--> pkg-z@virtual:<id-a> --[...]--> pkg-p@npm:1.0.0
pkg-b --[ pkg-z@virtual:<id-b> ]--> pkg-z@virtual:<id-b> --[...]--> pkg-p@npm:2.0.0

The two instance of pkg-z that have different dependency sets are now literally two different packages in our internals. The virtual id is based on the parent package and the pre-virtualization package.

Deduplication

However, just virtualizing every peer requester naively can lead to some problems. Take this example:

pkg-a --[ pkg-z@^1.0.0 ]--> pkg-z@npm:1.0.0 ==> pkg-p@*
      --[ pkg-p@^1.0.0 ]--> pkg-p@npm:1.0.0
pkg-b --[ pkg-z@^1.0.0 ]--> pkg-z@npm:1.0.0 ==> pkg-p@*
      --[ pkg-p@^1.0.0 ]--> pkg-p@npm:1.0.0
pkg-a --[ pkg-z@virtual:<id-a> ]--> pkg-z@virtual:<id-a> --[...]-> pkg-p@npm:1.0.0
pkg-b --[ pkg-z@virtual:<id-b> ]--> pkg-z@virtual:<id-b> --[...]-> pkg-p@npm:1.0.0

We have the reverse "problem" -- we end up with two different virtual pkg-z with the same dependency sets. That's not incorrect per se, but it'd be nice if we can use the same package instance for both.

That's why we also have a deduplication process where we find virtual descriptors that resolve to virtualizations of the same package with the same dependency set. Then we replace all those virtual descriptors and packages with a single "master" descriptor/package among them

pkg-a --[ pkg-z@virtual:<id-a> ]--> pkg-z@virtual:<id-a> --[...]-> pkg-p@npm:1.0.0
pkg-b --[ pkg-z@virtual:<id-a> ]--> pkg-z@virtual:<id-a> --[...]-> pkg-p@npm:1.0.0

Now we go back to having one pkg-z instance, and we can remove the discarded virtual descriptor/package (pkg-z@virtual:<id-b>) from our internals.

Bug with aliases

This deduplication, however, in turn leads to a bug (#1352) in some edge cases. (In this and subsequent examples, I'll omit pkg-p. Assume that all instances of pkg-z has their peer requests satisfied the same way)

{
  "name": "pkg-a",
  "dependencies": {
    "pkg-x": "npm:pkg-z@^1.0.0",
    "pkg-y": "npm:pkg-z@^1.0.0"
  }
}
pkg-a --[ pkg-x@npm:pkg-z@^1.0.0 ]--> pkg-z@npm:1.0.0
      --[ pkg-y@npm:pkg-z@^1.0.0 ]--> pkg-z@npm:1.0.0
pkg-a --[ pkg-x@virtual:<id-a> ]--> pkg-z@virtual:<id-a>
      --[ pkg-y@virtual:<id-a> ]--> pkg-z@virtual:<id-a>

We have used aliases to make two different virtual descriptors resolve to the same virtual package. When deduping, we will dedupe one of the virtual descriptors to the other descriptor, and discard the package it resolves to. Now we end up with zero instances of pkg-z...

#1726 fixes that, by having the deduplication key include the descriptor ident (pkg-x vs pkg-y), so those two are not deduped against each other.

This is where we are currently at.

More edge cases !!!

This fix, however, once again causes more edge cases.

In one edge case, it will cause virtual packages to not dedupe when they should

pkg-a --[ pkg-x@npm:pkg-z@^1.0.0 ]--> pkg-z@npm:1.0.0
pkg-b --[ pkg-y@npm:pkg-z@^1.0.0 ]--> pkg-z@npm:1.0.0
pkg-a --[ pkg-x@virtual:<id-a> ]--> pkg-z@virtual:<id-a>
pkg-b --[ pkg-y@virtual:<id-b> ]--> pkg-z@virtual:<id-b>

#1726 fixes specifically for two aliased dependencies of the same package. What if we have two packages each using an alias? We don't dedupe those because the virtual descriptors have different idents, and end up with two virtual instances because the parent package is part of the virtual id calculation.


That case in itself is also not a problem per se, just inefficient. However, using that we can construct a situation where the same package can either dedupe or not dedupe depending on which descriptor is being checked.

pkg-a --[ pkg-x@npm:pkg-z@^1.0.0 ]--> pkg-z@npm:1.0.0
pkg-b --[ pkg-x@npm:pkg-z@^1.0.0 ]--> pkg-z@npm:1.0.0
      --[ pkg-y@npm:pkg-z@^1.0.0 ]--> pkg-z@npm:1.0.0
pkg-a --[ pkg-x@virtual:<id-a> ]--> pkg-z@virtual:<id-a>
pkg-b --[ pkg-x@virtual:<id-b> ]--> pkg-z@virtual:<id-b>
      --[ pkg-y@virtual:<id-b> ]--> pkg-z@virtual:<id-b>

Here, the second descriptor pkg-x@virtual:<id-b> will be deduped into the first: they have the same ident and resolve to the same "physical" package. Once again we will discard pkg-z@virtual:<id-b> in the process.

However, the third descriptor pkg-x@virtual:<id-b> is not deduped, leaving a dangling reference to the discarded package.

This is the case reported in #6573

How did you fix it?

At its root, #1352 is caused by the same virtual package being checked for dedupe twice against two different virtual descriptors that resolve to it. What if we just... not do that? Instead of deduping virtual descriptors, just dedupe the virtual packages.

We would have virtual descriptors resolving to virtual packages with a different virtual id, but that would not cause problems. In fact, the entire point of resolution is to link descriptors and locators.

This ends up simplifying the implementation a lot because we don't need to keep track of which packages have which virtual descriptors as dependencies. We just need a reverse resolution map from virtual locators to virtual descriptors. Deduping is as simple as updating the project resolution map.

The only downside is we have a few more descriptors in the internals, but that's just a very tiny cost compared to the benefits.

Checklist

  • I have set the packages that need to be released for my changes to be effective.
  • I will check that all automated PR checks pass before the PR gets reviewed.

@clemyan clemyan force-pushed the clemyan/fix-virtual-dedupe branch from c0b0d1d to 0bf3eef Compare March 22, 2025 16:24
@arcanis arcanis merged commit 1ef45ae into master Mar 28, 2025
26 checks passed
@arcanis arcanis deleted the clemyan/fix-virtual-dedupe branch March 28, 2025 09:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug?]: Error: Assertion failed: The locator should have been registered
2 participants