-
Notifications
You must be signed in to change notification settings - Fork 4.2k
TensorFlow is broken with Bazel@HEAD #16296
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
/cc @c-parsons @oquenchil Can you look into this? This is blocking 6.0 cut since it's breaking TensorFlow. My guess is dadc49e somehow caused the command line to be too long? If dadc49e is valid, maybe the fix should be from the TensorFlow side? |
It's unclear to me how dadc49e could be causing this failure, both by looking at the failure and looking at the surrounding builds.
I'm unsure how this could be related...
I'm rerunning the culprit finder, just in case this was a flake. If I get a similar result pointing to that commit, I'll need to dig much deeper. |
Yes, that's why we didn't discover this earlier. TensorFlow started to use cc_shared_library recently. I believe it's newer commits in TensorFlow that are broken with this change. |
I can probably help you bisect TensorFlow to find out which change is incompatible with dadc49e |
A known good TF commit is a3182a1eb28b5b6d081391669db3546325b5f1f1, and a known bad TF commit is 322c0a555914006635603c51f39d073ce83cd368 |
Bisecting on my gLinux machine with:
|
Indeed, the second culprit finder still points to that commit. Would be interested in the results of your bisect. |
The bisect is taking longer than I expected, it was interrupted by unrelated build failures. |
I can confirm I can reproduce this issue at tensorflow/tensorflow@007f31c with Bazel at dadc49e, but not at the parent commit of tensorflow/tensorflow@007f31c |
I'm having issues getting even a clean build passing with tensorflow, either at HEAD or at tensorflow/tensorflow@007f31c (with bazel 5.3.1) :\ In both cases:
Error at tensorflow/tensorflow@007f31c:
Error at HEAD:
Tensorflow seems green at HEAD, so I'm not sure what's up. |
Ah, I was also having this strange error. Temporary hack is to
Then comment out the
I guess you'll have to run |
Thanks a bunch! Have reproed the error successfully. Still don't have any leads on the underlying failure -- it may take me a bit. |
This has been exceedingly hard to debug, both because Tensorflow's setup is very complicated, and that debug tools for tracking cc_shared_library dependencies could use some work. I've made a little progress, and have a fix, but I'm not sure if it's the most principled solution. Effectively, it seems that, with dadc49e, we collect static dependencies of cc_shared_library across any deps-like edges. This results in a change to By doing some print debugging and diffing, I tried to track down which link-static dependencies were new to this shared library as a result of the Bazel change. Several filegroup dependencies via Amending the check in the offending change to:
seems to make tensorflow's build pass successfully. Still, I'm not sure if this is really a principled change. I feel like I still haven't found the true root cause of tensorflow's failure. While the proposed fix might work to fix tensorflow's build, it feels like a hacky and unprincipled solution. |
@c-parsons Thanks! /cc @oquenchil Do you have any insight of this issue? How the aspect should work for cc_shared_library? /cc @learning-to-play, who is working on TF's cc_shared_library migration. |
Hi Chris, I agree with you that if we add more attribute names in the check, it would be getting a bit out of hand. Even hard coding "_cc_toolchain" does not seem ideal when looking at it again, I can imagine someone's custom C++ rule in Starlark placing the toolchain in an attribute called "_toolchain". From the change description and attached issues I can't really tell what prompted this change in the first place. Maybe if we go back to the original problem, we can find a different way to do this that doesn't break TF. |
Looking at this a bit more I can imagine why you sent the change in the first place. First just to document what the problem is, let me rephrase the problem that Chris found during debugging. The cc_shared_library implementation decides what to link statically based on the structure of the build graph, it uses the aspect to figure out that structure. It tries to link statically every library in the transitive closure of the direct deps of the shared library (i.e. roots) that is reachable without going through dynamic libraries. That's why we need the structure of the graph. Chris' original change is correct and wrong at the same time. It makes libraries be linked statically that should be linked statically but weren't linked before (because his project has custom C++ rules with different attribute names which is a valid use case) but it also links libraries statically that shouldn't be linked statically. The proposed fix with hard-coding the attribute names "srcs" and "hdrs" cuts the chain that builds the structure of the graph that propagates via genrules, filegroups, etc.. I think that chain can be cut earlier by using required_providers on the aspect definition. This will skip the genrule and filegroup altogether, it also forces the custom C++ rules to advertise the CcInfo provider, so Chris you might want to check that this works for your project too, not just for Tensorflow. If it works, I think the whole line Note: |
Thanks for the summary! You are indeed correct on my project's motivations. Using required_providers of CcInfo sounds like a good idea, and a more principled solution. Let me give this a try on our project and i'll get back to you. |
I've run into some issues using the required_providers approach. My understanding is that this will only propagate the aspect to targets which advertise their provider using For example, the native implementation of We can get around this by assessing However, this seems to not fix the Tensorflow issue. Briefly, the proposed solution is to change bazel/src/main/starlark/builtins_bzl/common/cc/experimental_cc_shared_library.bzl Lines 592 to 603 in f6a99af
|
I guess the reason is that the |
I think one unsolved mystery is why linking more dependencies will cause a crash. |
Indeed, can we get any help from TF here to demystify their build? I'm inclined to go with the hacky solution I outlined in #16296 (comment) for the time being. We can follow up with a more principled solution after we unblock the release. WDYT? |
It feels like we are missing a way to express this in Starlark. Chris how are your custom attribute names called? If we are going to hardcode hdrs and srcs, I'd prefer that we go back to the way it was before your change and include your attribute names in a list apart from "deps". Unless they don't sound very generic and are very custom to your project. In any case, to unblock this it will definitely require one of those two hacky solutions while we come up with a better way. |
We use a number of custom Starlark rules in our graph (generally hidden away by macros). Attribute names include: "root", "roots", "shared", "includes_deps", "system_libraries", "stub_target", "library_target", "root_target", "source_target" I'm frankly not sure without further investigation whether we'd need to allowlist all of these attribute names -- it's possible we only need "root", "roots", and "shared". If it's worth investigating the viability of this path (that a hacky allowlist is better than a hacky denylist), I'm happy to give this a whirl in my project. Let me know. |
So I was thinking about this yesterday some more and I think the right way and the way to do it from the very beginning is with the concept of aspect hints introduced by @scentini and @hlopko. I said "concept" because we don't need to use the built-in support for aspect_hints. There is no need for individual targets of your custom rule to set the aspect hints. It would just be the rule definition that does this. So cc_shared_library can have a new provider PropagationHints that is public just like CcSharedLibraryInfo. In there you would put "root", "roots" and "shared" which we would read from the aspect implementation to decide whether we should merge the results. If the PropagationHints provider is not present, then propagation would happen just like it happened before your original change, via the "deps" attribute. I should probably come back later to this and make sure that in the default case it only propagates via the "deps" attribute if it has CcInfo but that's a change in behavior that may break things and it's unrelated to your change, so I'll take care of this. |
@c-parsons Does Pedro's proposal work for your use case? |
It probably does, but I haven't had the cycles to verify. I hope to get you an answer tomorrow. Sorry about this! |
@c-parsons No worries, we pushed the release cut another 2 weeks back (2022-10-24), so we'll have some extra time for this. That said, this is still one of the few hard release blockers for 6.0. |
OK, I have confirmed the approach described above fixes my project. Furthermore, after about another day of diagnosing and debugging, I've determined we would only need to add In light of this finding, we technically have a workaround: we can rename So this begs the question: Do we want to fixforward with your PropagationHints suggestion, or rollback the original offending commit dadc49e ? Issues with the rollforward suggestion This requires more thought, IMO. Issues with the rollback suggestion My suggestion What do you think? |
Thanks @c-parsons for such detailed analysing, I agree with your suggestion! I think we can even cherry pick the solution in 6.x minor releases if necessary, ideally it should be made backwards compatible, but since cc_shared_library is still under the experimental flag, I guess it's even fine to make some small incompatible change? |
Relatedly, it looks like 21904a9 somewhat contradicts the philosophy we've discussed here. |
/cc @fmeum |
@c-parsons @meteorcloudy I created this commit over half a year ago, feel free to change the logic in any way you see fit - including rolling it back. It only affects the non-Starlark implementation anyway. |
@oquenchil @c-parsons I guess we should rollback 21904a9 as well. |
I agree with rolling back both changes. The logic for 21904a9 is also in the Starlark implementation now and I think cc_library.implementation_deps should indeed follow the same philosophy but we can cut cc_library some slack for now since I cannot imagine a future where cc_library and cc_shared_library are not living next to each other maintained by the same owners. If we need to rollback 21904a9 too in order to cleanly rollback dadc49e then as Fabian says that should not be an issue. |
*** Reason for rollback *** Due to #16296 (comment) *** Original change description *** Collect implementation_deps in graph node aspect This is needed for implementation_deps of cc_library targets to be linked into cc_binary targets with dynamic_deps and cc_shared_library targets. Fixes #14731 Closes #14730. PiperOrigin-RevId: 480360695 Change-Id: Ic1b9c17ce3af730fa0131f5d87e5ca99ae2740c5
*** Reason for rollback *** Breaks Tensorflow. See #16296 *** Original change description *** Apply graph_structure aspect across all edges RELNOTES: None. PiperOrigin-RevId: 480381529 Change-Id: Ied2f6b14a4b80e3c77691cf716c199b9c8a63f85
*** Reason for rollback *** Due to bazelbuild#16296 (comment) *** Original change description *** Collect implementation_deps in graph node aspect This is needed for implementation_deps of cc_library targets to be linked into cc_binary targets with dynamic_deps and cc_shared_library targets. Fixes bazelbuild#14731 Closes bazelbuild#14730. PiperOrigin-RevId: 480360695 Change-Id: Ic1b9c17ce3af730fa0131f5d87e5ca99ae2740c5
*** Reason for rollback *** Breaks Tensorflow. See bazelbuild#16296 *** Original change description *** Apply graph_structure aspect across all edges RELNOTES: None. PiperOrigin-RevId: 480381529 Change-Id: Ied2f6b14a4b80e3c77691cf716c199b9c8a63f85
This is strange, after rolling back both commits, TF still failed with the same error: |
indicates some library was probably linked twice. |
I found there are still some differences of @oquenchil My guess is some change caused the failure by making some libraries linked twice. Can you help debug this to see if this is a bug in cc_shared_library or TF? |
*** Reason for rollback *** Due to #16296 (comment) *** Original change description *** Collect implementation_deps in graph node aspect This is needed for implementation_deps of cc_library targets to be linked into cc_binary targets with dynamic_deps and cc_shared_library targets. Fixes #14731 Closes #14730. PiperOrigin-RevId: 480360695 Change-Id: Ic1b9c17ce3af730fa0131f5d87e5ca99ae2740c5
*** Reason for rollback *** Breaks Tensorflow. See #16296 *** Original change description *** Apply graph_structure aspect across all edges RELNOTES: None. PiperOrigin-RevId: 480381529 Change-Id: Ied2f6b14a4b80e3c77691cf716c199b9c8a63f85
This hack allows us to support native cc_shared_library aspect propagation along the "deps" attribute. See bazelbuild/bazel#16296 (comment) for details. Test: mixed_droid, both at HEAD and after a rollback of the referenced Bazel commit Change-Id: I2247b974a5cd7cc105ec1d674569d9eacdbc302b
This is rolling forward ff927f7 which I mistakenly confused as the root cause for #16296. The implementation does have an issue with too much propagation which is fixed on this change with required_providers and required_aspect_providers restrictions on the aspect. Fixes #17091. RELNOTES:none PiperOrigin-RevId: 507457624 Change-Id: I7317c4532ef20cd7584c55aacc3eca82c50a618b
This is rolling forward ff927f7 which I mistakenly confused as the root cause for #16296. The implementation does have an issue with too much propagation which is fixed on this change with required_providers and required_aspect_providers restrictions on the aspect. Fixes #17091. RELNOTES:none PiperOrigin-RevId: 507457624 Change-Id: I7317c4532ef20cd7584c55aacc3eca82c50a618b
One reason why cc_shared_library wasn't working for upb_proto_library was because indirect cc_shared_library.deps (i.e. linker_inputs not provided by the immediate dependency) were being dropped, that was fixed in f9008f6. A second reason why it wasn't working was because until now cc_shared_library has relied on transitive deps providing CcInfo or ProtoInfo while the upb_proto_library aspect provides a custom wrapped CcInfo. This change fixes that via aspect hints. (It would still require a change in the upb rule implementation). A third reason why it didn't work for upb was because to avoid a collision with the other aspects applied on proto_libraries, the upb_proto_library implementation added a suffix to the name when calling cc_common.create_linking_context(). This in turn caused the `owner` used in the `linker_inputs` to not match the labels of the targets seen by the cc_shared_library. To fix this, the hints provider accepts a list of owners. Apart from those features, the hints provider also allows specifying a list of attributes for the cases where we don't want the result of every dependency to be used (this hasn't come up yet and it doesn't affect upb). This idea was more or less described [here](#16296 (comment)). Note: I am using the words "aspect hints" because conceptually they are. However, none of this is using the aspect_hints mechanism supported by Bazel rules. Here we are simply using providers. Fixes #17676. Closes #17725. PiperOrigin-RevId: 516488095 Change-Id: I603ed8df90fef0636525d60707692930cd23fa5a
One reason why cc_shared_library wasn't working for upb_proto_library was because indirect cc_shared_library.deps (i.e. linker_inputs not provided by the immediate dependency) were being dropped, that was fixed in bazelbuild@f9008f6. A second reason why it wasn't working was because until now cc_shared_library has relied on transitive deps providing CcInfo or ProtoInfo while the upb_proto_library aspect provides a custom wrapped CcInfo. This change fixes that via aspect hints. (It would still require a change in the upb rule implementation). A third reason why it didn't work for upb was because to avoid a collision with the other aspects applied on proto_libraries, the upb_proto_library implementation added a suffix to the name when calling cc_common.create_linking_context(). This in turn caused the `owner` used in the `linker_inputs` to not match the labels of the targets seen by the cc_shared_library. To fix this, the hints provider accepts a list of owners. Apart from those features, the hints provider also allows specifying a list of attributes for the cases where we don't want the result of every dependency to be used (this hasn't come up yet and it doesn't affect upb). This idea was more or less described [here](bazelbuild#16296 (comment)). Note: I am using the words "aspect hints" because conceptually they are. However, none of this is using the aspect_hints mechanism supported by Bazel rules. Here we are simply using providers. Fixes bazelbuild#17676. Closes bazelbuild#17725. PiperOrigin-RevId: 516488095 Change-Id: I603ed8df90fef0636525d60707692930cd23fa5a
https://buildkite.com/bazel/bazel-at-head-plus-downstream/builds/2640#01835384-a5a8-4cef-839a-92041e93f6fb
Culprit finder: https://buildkite.com/bazel/culprit-finder/builds/3607#01834b10-64ec-4e64-812f-2357b4ccd39a
Bisect shows dadc49e is the culprit
The text was updated successfully, but these errors were encountered: