Skip to content
This repository was archived by the owner on Dec 13, 2018. It is now read-only.

Add checkpoint/restore functionality to nsinit and libcontainer #204

Closed
wants to merge 2 commits into from
Closed

Conversation

SaiedKazemi
Copy link

New code was added to nsinit and libcontainer to use the criu(8) utility
to checkpoint and restore a container.

While there are no hard-coded assumptions about Docker, the new C/R
functionality has only been tested with Docker containers.

The nsinit command line usage for checkpoint and restore is consistent
with the rest of nsinit commands. However, for easier C/R usage, the
user can specify the container ID on the command line. The best way
to get started is to run "nsinit checkpoint --help" and look at the
"EXAMPLE:" section.

Signed-off-by: Saied Kazemi [email protected]

@rjnagal
Copy link
Contributor

rjnagal commented Sep 25, 2014

Is it possible to use criu (xemul/criu) as a library with a go wrapper? That way we don't have to manage dependency on another binary.

@mrunalp
Copy link
Contributor

mrunalp commented Sep 25, 2014

+1 @rjnagal I had the same thought.

@SaiedKazemi
Copy link
Author

If CRIU is launched as a service, there is a C API (libcriu) which is a simple wrapper around CRIU RPC. I am not sure this is what we want. Also, the page at http://criu.org/C_API says that "the library is not thread- and fork-safe". I don't have any experience with CRIU as a service or libcriu, but will look into it as soon as I get a chance.

@filbranden
Copy link

@filbranden so that I get cc'd.

func mountRootFs(containerId string, verbose bool) error {

// Check if already mounted.
aufsDir := filepath.Clean(libcontainerDir + "/../../aufs")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and elsewhere where you're handling paths, why not just use filepath.Join ?

aufsDir := filepath.Join(libcontainerDir, "../../aufs")
rootFs := filepath.Join(aufsDir, "mnt", containerId)
...
f, err := os.Open(filepath.Join(aufsDir, "layers", containerId))
...
aufsBranches := filepath.join(aufsDir, "diff", containerId)
...
for scanner.Scan() {
    aufsBranches = aufsBranches + ":" + filepath.Join(aufsDir, "diff", scanner.Text())
}

The paths get an implicit filepath.Clean when you use filepath.Join but looking at it I think that would be appropriate in all of these cases (but I might be mistaken about that, so make sure that's OK to you.)

@filbranden
Copy link

Regarding CRIU as a library, that really doesn't work this way, the libcriu is really just a thin wrapper around spawning the criu binary with the appropriate options to checkpoint or restore the process.

Unfortunately doing checkpoint/restore from a library is not really trivial as it requires manipulating the process trees in not very trivial ways, so I think removing the dependency on the extra criu binary would be too complicated to be even worth the effort...

I think long-term having docker checkpoint and docker restore rely on the availability of an external criu binary (with a possible path override from environment variables) is the way to go, and if the external binary is not found the commands should just fail gracefully with an error indicating the binary can not be found.

@rjnagal
Copy link
Contributor

rjnagal commented Sep 25, 2014

The main problem is with versioning conflicts. Docker had lot of trouble managing lxc versions.

@xemul for his thought about library and versioning.

import (
"log"
"strings"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove empty line

@SaiedKazemi
Copy link
Author

As you've noticed, I am making the changes as individual commits, so don't merge them yet. I will squash them at the end after review.

// Mount the container's filesystem if it's not already mounted.
func mountRootFs(containerId string, verbose bool) error {
// Check if already mounted.
aufsDir := filepath.Join(libcontainerDir, "../../aufs")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a plan to handle devicemapper backend?

@xemul
Copy link

xemul commented Sep 26, 2014

@filbranden , criu library is a wrapper, yes, but not for spawning criu. It only spawns one for restoring a sub-process. In other cases (e.g. dump) it connects to criu service.

And also -- what tree manipulations do you mean?

@rjnagal , can you tell more about versioning problems? What can be done in criu side to make this easier?

EXAMPLE:
# export LIBCONTAINER_DIR=/var/lib/docker/execdriver/native
# export CRIU_IMG_HOME_DIR=/var/lib/docker/containers
# docker ps -lq --no-trunch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: --no-trunc

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@SaiedKazemi
Copy link
Author

I pushed reworked code based on the review comments above and the meeting on Friday. Please DO NOT merge any of the commits yet because --once the changes are final-- I will squash them into two commits: one for libcontainer and one for nsinit.

}
for _, mountpoint := range container.MountConfig.Mounts {
mntmap := []string{mountpoint.Destination, ":", mountpoint.Source}
args = append(args, "--ext-mount-map", strings.Join(mntmap, ""))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fmt.Sprintf("%s:%s") for the mounts here too

"github.com/docker/libcontainer"
)

func Checkpoint(criuBinary string, container *libcontainer.Config, imageDir string, initPid int, verbose bool) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add some comments to the exported functions?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Saied Kazemi added 2 commits October 1, 2014 18:12
Two new functions, Checkpoint() and Restore(), were added to use
the criu(8) utility for checkpointing and restoring a container.
These functions have no assumptions about Docker.

Signed-off-by: Saied Kazemi <[email protected]>
Using the newly added API in libcontainer, support for checkpointing
and restoring containers was added.

The nsinit command line usage for checkpoint and restore is consistent
with the rest of the nsinit commands.  However, for easier C/R usage,
the user can specify the container ID on the command line.  The best
way to get started is to run "nsinit checkpoint --help" and look at the
"EXAMPLE:" section.

While there are no hard-coded assumptions about Docker, the new C/R
functionality has only been tested with Docker containers.

Signed-off-by: Saied Kazemi <[email protected]>
@vmarmol
Copy link
Contributor

vmarmol commented Oct 2, 2014

LGTM

@SaiedKazemi
Copy link
Author

@crosbymichael Any other comments? As I'll be off next week, it'd be great to merge by tomorrow in case we run into any issues. Thanks.

@crosbymichael
Copy link
Contributor

Ok, thanks for the update. I'll test this ASAP. Thanks for doing this.

Michael Crosby

On Oct 2, 2014, at 6:09 PM, Saied Kazemi [email protected] wrote:

@crosbymichael Any other comments? As I'll be off next week, it'd be great to merge by tomorrow in case we run into any issues. Thanks.


Reply to this email directly or view it on GitHub.

@xemul
Copy link

xemul commented Oct 3, 2014

@rjnagal , I see, thanks for the explanation. Maybe it's worth using CRIU via RPC then?

@SaiedKazemi
Copy link
Author

@crosbymichael Michael, I was wondering if you had a chance to test checkpoint and restore. If there are no additional comments, it'd be nice to have it merged before LPC next week. Thanks.

@crosbymichael
Copy link
Contributor

@rjnagal @vmarmol @SaiedKazemi what do you think about using it via RPC and not having to depend on a binary somewhere? It's much easier for our packagers and deployment that we don't have to provide extra binaries.

@vmarmol
Copy link
Contributor

vmarmol commented Oct 13, 2014

That sounds fine and not too different from fork/execing the CLI from the code's point of view. @SaiedKazemi WDYT and how hard would it be to interact with CRIU's RPC interface?

@filbranden
Copy link

Even if we use the RPC service for dump we still need to spawn the criu binary for restore since otherwise we can't have the restored process become a child of the docker daemon.

I think the criu binary is the way to go.

Dependency should not be a concern, the "checkpoint" and "restore" commands of libcontainer (and docker) should simply try to run the criu binary and complain if it cannot be found... That kind of a "loose" dependency would probably be the most appropriate in this case. For distro packages (e.g. Debian/Ubuntu) they could use a tag such as "suggested" for this dependency (installing docker suggests you might also want to install criu, but only if you want to.)

WDYT?

@SaiedKazemi
Copy link
Author

@vmarmol @crosbymichael If criu is not installed, nsinit will produce the following error message and exit:

2014/10/13 02:26:35 exec: "criu": executable file not found in $PATH

Re using RPC, it makes things more complicated because now not only you must already have criu installed on the system but you must also have it running as a service (for details see http://criu.org/RPC).

I suggest that we keep things simple (i.e., as is). What I can do is to add another line to the error message and direct the user to http://criu.org for downloading criu if it's not already installed on the system.

@SaiedKazemi SaiedKazemi reopened this Oct 13, 2014
@SaiedKazemi
Copy link
Author

@filbranden You're right, and I agree that running criu as a separate binary is the best way to go.

@crosbymichael
Copy link
Contributor

ping @tianon

On the packaging side what do you think about having a dependency like this? Is it just like some type of system dependency, if you don't have it then you just don't use it and we don't enforce any type of dependency on the packaging side of things.

@SaiedKazemi
Copy link
Author

Yup, that's actually how it's working right now. If criu isn't installed, nsinit prints 'exec: "criu": executable file not found in $PATH' and exits. We don't enforce any type of dependency on packaging.

@tianon
Copy link
Contributor

tianon commented Oct 16, 2014

Yeah, I think that's reasonable. It's just like our dependency on Git:
it's required if you want to "docker build git://..." but is otherwise
optional.

@SaiedKazemi
Copy link
Author

@crosbymichael just a quick note to check on the status of this PR.

@crosbymichael
Copy link
Contributor

One question.

Why do you think that this should be in libcontainer? I'm not really seeing anything in here that could not be done with another tool as everything just shells out to another binary. We don't shell out to anything in libcontainer currently.

@vmarmol
Copy link
Contributor

vmarmol commented Oct 23, 2014

I think this makes sense in libcontainer. It uses libcontainer's internal state extensively which an external tool probably shouldn't use.

I think the fork/execing is done since it's easier than calling out to a library. It'll be the first time we add an external dependency (that isn't a simple library) which is why it is weird.

@crosbymichael
Copy link
Contributor

@vmarmol sorry, i did not see this comment until now.

Do you know of any other pkgs that shell out like this? I don't think it's common and I know that doing this is easier I'm not sure sure if it should be in this pkg with the current implementation of shelling out to external dependencies.

It looks like the only internal state that is is using from libcontainer is the rootfs path and mounts so the dependency on the libcontainer state is not that large. Sorry that this has been open for so long but I really am unsure about what to do with the current shell out behavior within a pkg. @rjnagal @mrunalp what do you think? It's cool and awesome and I think everyone wants this but do we keep the shell out?

@mrunalp
Copy link
Contributor

mrunalp commented Nov 12, 2014

+1 to adding the feature but not fan of shelling out. It will be difficult to maintain just like with iptables where some flags are available on some distributions and some aren't.

@SaiedKazemi
Copy link
Author

@mrunalp although calling an external binary from libcontainer isn't the ideal solution, the alternatives of (1) duplicating all of criu logic in libcontainer or (2) linking libcontainer with criu are far worse in terms of complexity and dependency. To address the concern about criu version, we can easily check and print an error (if it's an old version) when user does checkpoint or restore.

That said, we don't have to merge the current PR as I have made some changes to support native docker checkpoint and restore comands. I will send out a new PR once the changes are final.

@bridiver
Copy link

what is the current status of this?

@SaiedKazemi
Copy link
Author

@bridiver
C/R is already integrated into the criu branch of the new libcontainer (#479), waiting to be merged into the master branch.

@bridiver
Copy link

Great! We have a use case where we want to repeatedly restore a container to a "pristine" state during a series of tests.

@crosbymichael
Copy link
Contributor

Closing this as we merged the criu branch into master

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.