Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preliminary soft reboot support #3390

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

Mstrodl
Copy link
Contributor

@Mstrodl Mstrodl commented Mar 10, 2025

Most of the easy stuff from #3242

One big thing not handled here is detecting a difference in kernels and mounting in the old kernel's /lib/modules/$(uname -r). Probably we shouldn't even be doing that much by default, and instead error out without the user explicitly asking for their kernel to have weird problems.

One issue I ran into is that the upgrade process creates a mount namespace so /sysroot can be written to, but we need to escape from it when preparing /run/nextroot. To work around this, I keep an FD of /proc/self/ns/mnt so we can setns to it. Unfortunately, we have to do this in a separate fork()'d process, because setns will fail if we have any threads (I couldn't figure out how to stop GIO's main loop...).

I'm not thrilled about the setns() stuff or needing to parse through /proc/self/mounts, so if anyone has any ideas on how to do it more elegantly, I'm all ears. One option is to set up /run/nextroot and switch namespaces in the setup part of g_spawn_sync, which would save a fork, but I'm not sure if it's worth it.

Copy link

openshift-ci bot commented Mar 10, 2025

Hi @Mstrodl. Thanks for your PR.

I'm waiting for a ostreedev member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@github-actions github-actions bot added the area/prepare-root Issue relates to ostree-prepare-root label Mar 10, 2025
@Mstrodl Mstrodl force-pushed the feature/mstrodl/soft-reboot branch from 79852cc to 45f2550 Compare March 10, 2025 17:52
Copy link
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for working on this!! Just a preliminary review pass so far


/* Read in mounted FSes */
g_autofree gchar *mounts_contents = NULL;
if (!g_file_get_contents ("/proc/self/mounts", &mounts_contents, NULL, error))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's libmount for parsing this stuff

Comment on lines +655 to +656
/* Need to fork because setns will fail if we have threads (GIO loop) */
pid_t child_pid = fork ();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, but the problem here is that in the general case now we can't call any function which may have relied on global state - especially like the glib worker thread.

I used to help maintain glib upstream and I probably would have been sympathetic to this situation but just closed as wontfix any bugs arising in this situation.

In bootc I did something fairly similar to this in https://github.com/bootc-dev/bootc/blob/f818b37a57a9cdaf657e55c84de8d6e773f64e97/lib/src/mount.rs#L171 and I was pretty careful about what APIs I called there.

And even there I was pretty tempted to rework to do a "re-exec self" dance (ref bootc-dev/bootc#799 (comment) )


OTOH this won't be our first instance of similar code...the fork() we also do for fsfreeze is related.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... I'm not thrilled about this. I couldn't find a good way to stop the GIO loop for this. Can you think of any better ways to do it? One option is to have ostree-prepare-root nsenter and set up the /run/nextroot mount. This might be doable by looking at /proc/1/ns/mnt, but I'm not sure. Or we could nsenter as part of the glib spawn (we're forking there anyways)

Copy link
Contributor Author

@Mstrodl Mstrodl Mar 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know what you think is best and we can do it :)

  • Obviously keeping things the way they are is an option (forking isn't great though...)
  • I have a suspicion the /proc/1/ns/mnt option could cause issues/weird behavior, so I'd prefer not to do that.
  • ostree-prepare-root mounting /run/nextroot feels like it's a little out of scope for the tool (But also then we can nsenter in the fork created for the exec...)
    • Actually now that I just sent this, we could also maybe set up /run/nextroot as part of the spawn setup? Maybe that's a dumb idea though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ultimately the replacement for fork() I think will be https://lwn.net/Articles/908268/
But Josh went back to working on Rust upstream I think and it doesn't have an active driver right now.

Anyways, the problem is there's a lot of APIs invoked here after fork() in ostree_sysroot_deployment_prepare_next_root() like e.g. ot_journal_print() and such.

I mentioned this above and we can try to prove this out but the most reliable thing is to do a re-exec of the current process. Except the massive complication is that libostree is a library...so we'd need to require all callers to give us the chance to intercept re-invocations in main() (doable, just ugly) or alternatively split out a little /usr/libexec/libostree-helper binary.

But probably the best alternative in this specific case is as you say

One option is to have ostree-prepare-root nsenter and set up the /run/nextroot mount.

If we can do as much parsing as possible before forking, and then just hand off to it the operations to do that seems likely to be tractable, right?

Copy link
Contributor Author

@Mstrodl Mstrodl Mar 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing that worries me is that the setns call might not work from ostree-prepare-root, but I guess I can't really be certain until I try it :) Definitely the FD I keep open for the mountns won't get passed on, so it'd need to look at pid 1's mountns, which might be okay?

Ideally it would be nice to have a setns thing in io_uring (in the io_uring_spawn scenario). I suspect a lot of the container folks would like that too.

I agree reexec is really not the solution.

Having ostree-prepare-root do the mounting shouldn't be too crazy.

end_loop:

/* At this point, /sysroot is mounted at /run/nextroot, do the pivot from there: */
g_ptr_array_add (args, "/usr/lib/ostree/ostree-prepare-root");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Though maybe one alternative is to move this handling directly into ostree-prepare-root? Uncertain.

}
else
{
kernel_cmdline = argv[2];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome! This looks like a really nice change and should crucially really help us for testing ostree-prepare-root outside of the initramfs in general (should be totally doable to run in a container)

@openshift-merge-robot
Copy link
Collaborator

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/prepare-root Issue relates to ostree-prepare-root needs-ok-to-test needs-rebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants