
Support additional layer store (patch for containers/image) #1109

Merged (1 commit) on Apr 28, 2021

Conversation

ktock
Contributor

@ktock ktock commented Dec 26, 2020

containers/podman#4739

Reconsidered the design on 2021/1/12

This enables podman to create containers using layers stored in a specified directory instead of pulling them from the registry. Leveraging this feature with remotely-mountable layers provided by stargz/zstd:chunked or CVMFS, podman can achieve lazy pulling.

Changes in containers/storage: containers/storage#795

That directory is named the "additional layer store" (called ALS in this doc) and has the following structure.

<ALS root>/base64("key1=value1")/base64("key2=value2")/.../
|-- diff
`-- info

The diff directory contains the extracted layer diff contents identified by the key-value pairs.
The info file contains a *c/storage.Layer struct describing this layer (*c/storage.Layer.ID, *c/storage.Layer.Parent and *c/storage.Layer.Created can be empty, as they will be filled in by c/storage.Store).

On each pull, c/storage.Store searches for the layer diff contents in the ALS using pre-configured key-value pairs.
Each key-value pair is base64 encoded.
By default, the following key=value pairs can be used as elements of the path in the ALS:

  • reference=<image reference>
  • layerdigest=<digest of the compressed layer contents>

Additionally, layer annotations (defined in the image manifest) prefixed by containers/image/target. can be used as well.
The prefix containers/image/target. will be trimmed from the key when it's used in the path on ALS.

The overlay driver supports an option that specifies which key-value pairs are used, and in what order, when c/storage.Store searches for layers in the ALS.

layerstore=<ALS root directory>:<key1>:<key2>:...:<keyX>

In the above case, on each pull, c/storage.Store searches the following path in the ALS:

<ALS root>/base64("key1=value1")/base64("key2=value2")/.../base64("keyX=valueX")/
|-- diff
`-- info

The underlying filesystem (e.g. a stargz/zstd:chunked-based filesystem or CVMFS) should show the exploded view of the target layer diff and its information at these locations.
An example filesystem implementation (currently stargz-based) is https://github.com/ktock/stargz-snapshotter/tree/als-pool-example (this must be mounted on )

If the layer content is found in the ALS, c/storage.Store creates the layer using <ALS root>/.../info as the *c/storage.Layer and <ALS root>/.../diff as its diff directory.
So c/image's copier doesn't need to pull this layer from the registry.
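As an illustration of the lookup, here is a minimal Go sketch of how such a path could be assembled. The base64 variant and the helper names are assumptions for illustration, not the actual containers/storage implementation:

```go
package main

import (
	"encoding/base64"
	"fmt"
	"path/filepath"
	"strings"
)

// pathElem encodes one "key=value" pair as a single ALS path element.
// The "containers/image/target." prefix is trimmed from annotation keys
// before they are used in the path, as described above.
func pathElem(key, value string) string {
	key = strings.TrimPrefix(key, "containers/image/target.")
	return base64.StdEncoding.EncodeToString([]byte(key + "=" + value))
}

// alsPath joins the configured key-value pairs, in order, under the ALS root.
// c/storage.Store would then look for <result>/diff and <result>/info there.
func alsPath(root string, pairs [][2]string) string {
	elems := []string{root}
	for _, kv := range pairs {
		elems = append(elems, pathElem(kv[0], kv[1]))
	}
	return filepath.Join(elems...)
}

func main() {
	p := alsPath("/tmp/storage", [][2]string{
		{"containers/image/target.reference", "ghcr.io/stargz-containers/rethinkdb:2.3.6-esgz"},
		{"layerdigest", "sha256:deadbeef"}, // hypothetical digest value
	})
	fmt.Println(p)
}
```

With the `layerstore=/tmp/storage:reference:layerdigest` option from the example below, the two configured keys map to the two path elements in that order.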

Changes in containers/image: #1109

Now c/image's copier leverages this store.
Every time it pulls an image, it first tries to reuse blobs from the ALS.
The copier passes each layer's OCI annotations (key-value pairs) plus the following key-value pair to c/storage.Store.

  • containers/image/target.reference: The image reference of an image that contains the target layer. This is needed for supporting registry-backed filesystems (e.g. estargz, zstd:chunked).

*c/image.storageImageDestination.TryReusingBlob() cannot pass the image reference to c/storage.Store, so this commit adds a new API, *c/image.storageImageDestination.TryReusingBlobWithRef(), to achieve this.
When the copier successfully acquires that layer, it reuses it without pulling.

Changes in containers/podman: none

Command example

podman --storage-opt "layerstore=/tmp/storage:reference:layerdigest" pull ghcr.io/stargz-containers/rethinkdb:2.3.6-esgz
podman --storage-opt "layerstore=/tmp/storage:reference:layerdigest" run --rm -it ghcr.io/stargz-containers/rethinkdb:2.3.6-esgz /bin/echo 'Hello, World!'

In the above cases, c/storage.Store looks up /tmp/storage/base64("reference=ghcr.io/stargz-containers/rethinkdb:2.3.6-esgz")/base64(<layer digest>)/{diff, info} in the ALS.
The example filesystem implementation (https://github.com/ktock/stargz-snapshotter/tree/als-pool-example), mounted at /tmp/storage, shows the extracted layer at that location.
Then rethinkdb:2.3.6-esgz can run without pulling it from the registry.

Known limitation and request for comments

Some operations (e.g. save) require a correct value in c/storage.Layer.UncompressedSize. This field is the size of the layer without compression (i.e. the size of the tar archive of that layer). For a registry-backed ALS, getting this information is difficult because neither the OCI/Docker image format nor the registry API provides a way to get the uncompressed size of layers. We cannot get this information without actually pulling and decompressing the layer, which defeats the lazy pulling this PR aims for.

I'll dig deeper into the codebase to find a way to get this information from somewhere, or a way to safely allow c/storage.Layer.UncompressedSize to be unknown during operations. But if someone has a good idea for managing this, please let me know.

cc: @siscia @giuseppe

Collaborator

@mtrmac mtrmac left a comment


Just a very quick initial skim to highlight concerns. I have literally not read the PR description in full yet.

How does this relate to #1084 ?

If this extends the image format, there should eventually be formal documentation for interoperability — maybe after the code settles, sure.

copy/copy.go Outdated
// Add default labels containing information of the target image and layers.
srcLayer.Annotations[layerTargetDigest] = srcLayer.Digest.String()
if dref := ic.c.rawSource.Reference().DockerReference(); dref != nil {
srcLayer.Annotations[layerTargetReference] = dref.String()
Collaborator


  • This exposes private information; it’s not expected that the destination of a copy is told about the name used at the source.
  • Why the full Docker reference, anyway? c/storage keys things off an image ID, and a Docker reference can refer to many different images over time.

Contributor Author


A registry-backed additional layer store filesystem (e.g. eStargz, zstd:chunked) needs to query the registry for layer data. The registry API requires an image reference, so this needs to be passed to c/storage's Store and the underlying filesystem somehow.

Collaborator


Having now read a bit more, containers/storage#795 (comment) : this is just data that should be passed as parameters, not by modifying the manifest, which affects every single skopeo copy including things like inter-registry copies exposing locations of private staging registries when making a product release.

(Even given that, it’s not at all clear to me how this is going to work with authenticated registries; the credentials used by copy.Image are not made available to the on-demand additional store implementations by this. But that’s a secondary concern and probably solvable at least for the simple cases.)

Member


could we use a lookaside cache for storing such information (something implemented with: containers/storage#787)?

Collaborator


What if 3 different images that share a layer are pulled, one after another; 2 from a single registry, the third from a completely different registry; the three pulls are using different credentials, and later some of the credentials are invalidated? The credentials are conceptually a property of (end user, repository), not a layer. (And the concept of “end user” gets rather tricky once CRI-O is involved.)

I suppose just starting with unauthenticated repositories would still be useful. Getting the complex cases right might be more interesting.

Contributor Author


this is just data that should be passed as parameters, not by modifying the manifest, which affects every single skopeo copy including things like inter-registry copies exposing locations of private staging registries when making a product release.

The reference is now passed as a parameter to c/storage.Store, through a new API, *c/image.storageImageDestination.TryReusingBlobWithRef().

Contributor Author


Even given that, it’s not at all clear to to me how this is going to work with authenticated registries

As an initial implementation, the underlying filesystem's authentication logic (creds management) is separated from copy.Image.
We can come up with a way to sync creds between them, but that can be done in follow-up PRs.

Collaborator


We can come up with a way to sync creds between them, but that can be done in follow-up PRs.

I can’t really imagine what that would look like; so is this “will never happen”? The tradeoffs are relevant for choosing an approach/implementation.

copy/copy.go (outdated; resolved)
@ktock
Contributor Author

ktock commented Jan 4, 2021

How does this relate to #1084 ?

The underlying filesystem (additional layer store) accessed from c/storage can support mounting layers from the registry leveraging eStargz/zstd:chunked.

storage/storage_image.go (outdated; resolved)
@@ -486,6 +489,16 @@ func (s *storageImageDestination) TryReusingBlob(ctx context.Context, blobinfo t
}, nil
}

// Check if the layer can be used from the additional layer store with fresh ID.
if l, err := s.imageRef.transport.store.Layer(stringid.GenerateRandomID(), storage.WithAnnotations(blobinfo.Annotations)); err == nil {
s.blobLayerIDs[blobinfo.Digest] = l.ID
Collaborator


With blobDiffIDs not set, this will cause computeID to fail.

Contributor Author


With blobDiffIDs not set, this will cause computeID to fail.

Fixed this. Now the filesystem (ALS) provides the diffID.

Collaborator


How does the filesystem obtain the diffID? By reading all of the layer? If so, wouldn’t it be more efficient to avoid it and just read all of the layer ourselves? (Or is that somehow cached across machines?)

storage/storage_image.go (three outdated review threads; resolved)
@mtrmac
Collaborator

mtrmac commented Jan 4, 2021

How does this relate to #1084 ?

The underlying filesystem (additional layer store) accessed from c/storage can support mounting layers from the registry leveraging eStargz/zstd:chunked.

So should we have only one of the two features? If so, which one? Or if both, why have two so different implementations working with the same format? (I don’t have any opinion yet, I assume you have thought about these things much more than me.)

@ktock
Contributor Author

ktock commented Jan 5, 2021

How does this relate to #1084 ?

The underlying filesystem (additional layer store) accessed from c/storage can support mounting layers from the registry leveraging eStargz/zstd:chunked.

So should we have only one of the two features? If so, which one? Or if both, why have two so different implementations working with the same format? (I don’t have any opinion yet, I assume you have thought about these things much more than me.)

We should have both: eStargz (fully compatible with legacy gzip-based images) for backward compatibility with the current ecosystem, and zstd:chunked (zstd-based) for "skippable frame" support and possibly better performance.

@mtrmac
Collaborator

mtrmac commented Jan 5, 2021

I was not asking about the formats, but about the two structurally different implementations of zstd:chunked.

@ktock
Contributor Author

ktock commented Jan 5, 2021

This patch enables "lazy pulling" (skipping the pulling of layers and fetching necessary chunks on demand). But #1084 is for deduplication of pulling, so the scope is different.

cc: @giuseppe

copy/copy.go (two outdated review threads; resolved)

storage/storage_image.go (outdated; resolved)
types/types.go (three outdated review threads; resolved)
@ktock
Contributor Author

ktock commented Feb 9, 2021

We can come up with a way to sync creds between them, but that can be done in follow-up PRs.

I can’t really imagine what that would look like; so is this “will never happen”? The tradeoffs are relevant for choosing an approach/implementation.

I think the filesystem API isn't enough to achieve this. Another dedicated communication channel (e.g. a unix socket) would need to be exposed by the filesystem (FUSE daemon) for communicating creds with the runtime (Podman, CRI-O, ...).

@AkihiroSuda
Contributor

needs rebase

@rhatdan
Member

rhatdan commented Mar 23, 2021

@ktock Could you rebase this PR?

@ktock
Contributor Author

ktock commented Mar 24, 2021

Rebased.

@rhatdan
Member

rhatdan commented Mar 24, 2021

And sadly it needs to be rebased again.

@ktock
Contributor Author

ktock commented Apr 8, 2021

@giuseppe Do we need a new tag for containers/storage#795 to make CI of this PR pass?

@ktock
Contributor Author

ktock commented Apr 21, 2021

@vrothberg Thanks for the review.
Added comments and a commit message.

@ktock
Contributor Author

ktock commented Apr 21, 2021

The CI failure (Skopeo) seems unrelated to this PR:

time="2021-04-21T07:52:31Z" level=fatal msg="Error reading blob sha256:f7794a8687064c222edec1eacac34186676299758221ad448828a138c313432c: error fetching external blob from \"https://mcr.microsoft.com/v2/windows/nanoserver/blobs/sha256:f7794a8687064c222edec1eacac34186676299758221ad448828a138c313432c\": 500 (Internal Server Error)"

@vrothberg
Member

The CI failure (Skopeo) seems unrelated to this PR:

time="2021-04-21T07:52:31Z" level=fatal msg="Error reading blob sha256:f7794a8687064c222edec1eacac34186676299758221ad448828a138c313432c: error fetching external blob from \"https://mcr.microsoft.com/v2/windows/nanoserver/blobs/sha256:f7794a8687064c222edec1eacac34186676299758221ad448828a138c313432c\": 500 (Internal Server Error)"

I concur and restarted the CI job.

Member

@vrothberg vrothberg left a comment


LGTM, thank you!

@mtrmac, if you have time to skim the changes, I'd feel more comfortable with your blessing.

@ktock
Contributor Author

ktock commented Apr 28, 2021

Can we move this forward?

@vrothberg
Member

Can we move this forward?

Thanks for the ping. Yes, let's get it merged. Can you rebase?

This commit adds support for "Additional Layer Store".

Pull is one of the time-consuming steps in the container lifecycle. The
Additional Layer Store enables runtimes (e.g. Podman, CRI-O, etc.) to start
containers using layers stored in a specified directory, instead of pulling
them from the registry.

One of the expected use cases of this feature is "lazy pulling". This enables
the runtimes to start containers without waiting for the entire image contents
to be locally available; the necessary chunks of the contents are fetched
on demand (lazily).

There are several image formats and filesystems in the community that enable
lazy pulling, including stargz/eStargz, zstd:chunked, and CVMFS. The Additional
Layer Store makes it easy to integrate with these filesystems to perform lazy
pulling.

Signed-off-by: Kohei Tokunaga <[email protected]>
@ktock
Contributor Author

ktock commented Apr 28, 2021

Rebased :)

@vrothberg vrothberg merged commit 4be5dd3 into containers:master Apr 28, 2021
@giuseppe
Member

@ktock do you have some instructions how I can play with the stargz-snapshotter and additional layer stores?

How do I run it, and what setup should I use for storage.conf?

@ktock
Contributor Author

ktock commented Aug 18, 2021

@giuseppe
Member

Thanks, that is helpful! I've followed the installation tutorial and I think it is running now. Do you have any image ready that I can use for lazy pulling?

@ktock
Contributor Author

ktock commented Aug 18, 2021

@giuseppe There are pre-converted estargz images on ghcr.io/stargz-containers : https://github.com/containerd/stargz-snapshotter/blob/main/docs/pre-converted-images.md

Other methods to get estargz images: https://github.com/containerd/stargz-snapshotter#getting-estargz-images

@giuseppe
Member

Thanks! That worked well

@giuseppe
Member

is there also support for zstd:chunked? I've tried with docker.io/gscrivano/zstd-chunked:fedora but it failed

@ktock
Contributor Author

ktock commented Aug 18, 2021

That is tracked in containerd/stargz-snapshotter#293 but hasn't been merged yet. I'll work on it soon.

@rhatdan
Member

rhatdan commented Aug 20, 2021

@ktock We are looking to package stargz-snapshotter as a separate package in Fedora. @lsm5 is taking the lead on this; are you interested in working on it? If we are going to truly allow users to use this method of pulling containers, we need to make it as easy as possible for them to do it.

This includes building and pushing images to registries, and then configuring containers/storage to use the FUSE snapshotter filesystem out of the box.

@ktock
Contributor Author

ktock commented Aug 20, 2021

@rhatdan

We are looking to packge stargz-snapshotter as a separate package in Fedora. @lsm5 is taking the lead on this, are you interested in working on it?

Yes, I am. Thank you!

@giuseppe
Member

I am adding support for estargz to the chunked package in containers/storage.

@ktock what is the easiest way to create estargz images?

Is it worth supporting the original stargz format as well, or can we safely assume estargz is the format used by containerd?

@ktock
Contributor Author

ktock commented Aug 20, 2021

@giuseppe

I am adding support for estargz to the chunked package in containers/storage.

Thank you!

what is the easiest way to create estargz images?

On command line:

You can use nerdctl image convert (doc) to convert an arbitrary image into eStargz.
You can perform the conversion using containerized nerdctl:

docker build -t nerdctl https://github.com/containerd/nerdctl.git
docker run -v ~/.docker/config.json:/root/.docker/config.json -it --rm --privileged nerdctl /bin/bash
# nerdctl pull ghcr.io/stargz-containers/python:3.9-org
# nerdctl image convert --estargz --oci ghcr.io/stargz-containers/python:3.9-org ghcr.io/ktock/python:3.9-esgz
# nerdctl push ghcr.io/ktock/python:3.9-esgz

In Go code:

You can use the estargz package of the stargz-snapshotter project. It provides estargz.Build(), which converts a blob (gzip, zstd, or plain tar) into an eStargz blob.

Stargz Snapshotter requires the following OCI annotations (source) in the OCI descriptor of the layer.

  • containerd.io/snapshot/stargz/toc.digest : the digest of the uncompressed TOC JSON of the layer.
  • io.containers.estargz.uncompressed-size : the uncompressed size of the layer.
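For illustration, populating these annotations on a layer descriptor's annotation map could look like the following; the digest and size values are placeholders, not real ones:

```go
package main

import "fmt"

// Annotation keys Stargz Snapshotter expects on an eStargz layer descriptor,
// as quoted in the discussion above.
const (
	tocDigestAnnotation        = "containerd.io/snapshot/stargz/toc.digest"
	uncompressedSizeAnnotation = "io.containers.estargz.uncompressed-size"
)

func main() {
	// A minimal stand-in for an OCI descriptor's Annotations field;
	// the values here are placeholders, not real digests/sizes.
	annotations := map[string]string{
		tocDigestAnnotation:        "sha256:0123456789abcdef",
		uncompressedSizeAnnotation: "1048576",
	}
	fmt.Println(tocDigestAnnotation + "=" + annotations[tocDigestAnnotation])
	fmt.Println(uncompressedSizeAnnotation + "=" + annotations[uncompressedSizeAnnotation])
}
```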

can we safely assume estargz is the format used by containerd?

Yes, we can.

@lsm5
Member

lsm5 commented Aug 20, 2021

@ktock if you would like to do the actual packaging for Fedora, let's talk via email (lsm5 AT redhat DOT com) or irc/matrix. My libera irc is lsm5 and matrix id is @lsm5:lsm5.ems.host . Thanks a lot!

@ktock
Contributor Author

ktock commented Aug 20, 2021

@lsm5 Thank you. I sent an email.

@giuseppe
Member

@ktock opened a PR for containers/image: #1351

With that in place, you can convert an image more easily as:

skopeo copy --dest-compress-format gzip:estargz docker://foo/bar docker://baz/bar

@ktock
Contributor Author

ktock commented Aug 31, 2021

@giuseppe stargz-snapshotter v0.8.0 comes with support for lazy pulling of zstd:chunked.

@giuseppe
Member

giuseppe commented Sep 1, 2021

@ktock thanks for the update. That is great news :)
