Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support additional layer store (patch for containers/storage) #795

Merged
merged 1 commit into from
Apr 8, 2021

Conversation

ktock
Copy link
Contributor

@ktock ktock commented Dec 26, 2020

containers/podman#4739

Reconsidered the design on 2021/1/12

This enables podman to create containers using layers stored in a specified directory instead of pulling them from the registry. Leveraging this feature with remotely-mountable layers provided by stargz/zstd:chunked or CVMFS, podman can achieve lazy pulling.

Changes in contianers/storage: #795

That directory is named "additional layer store" (call ALS in this doc) and has the following structure.

<ALS root>/base64("key1=value1")/base64("key2=value2")/.../
`-- diff
`-- info

diff directory contains the extracted layer diff contents specified by the key-value pairs.
info file contains *c/storage.Layer struct that indicates the information of this layer contents (*c/storage.Layer.ID, *c/storage.Layer.Parent and *c/storage.Layer.Created can be empty as it'll be filled by c/storage.Store).

On each pull, c/storage.Store searches the layer diff contents from ALS using pre-configured key-value pairs.
Each key-value pair is base64 encoded.
By default, the following key=value pairs can be used as elements of the path in ALS,

  • reference=<image reference>
  • layerdigest=<digest of the compressed layer contents>

Additionally, layer annotations (defined in the image manifest) prefixed by containers/image/target. can be used as well.
The prefix containers/image/target. will be trimmed from the key when it's used in the path on ALS.

Overlay driver supports an option to specify which key-value pair to be used and how the order should they be when c/storage.Store searches layers in the ALS.

layerstore=<ALS root directory>:<key1>:<key2>:...:<keyX>

In the above case, on each pull, c/storage.Store searches the following path in ALS,

<ALS root>/base64("key1=value1")/base64("key2=value2")/.../base64("keyX=valueX")/
`-- diff
`-- info

The underlying filesystem (e.g. stargz/zstd:chunked-based filesystem or CVMFS) should show the exploded view of the target layer diff and its information at these locations.
Example filesystem implementation (currently stargz-based) is https://github.com/ktock/stargz-snapshotter/tree/als-pool-example (this must be mounted on )

If the layer content is found in ALS, c/storage.Store creates layer using <ALS root>/.../info as *c/storage.Layer and using <ALS root>/.../diff as its diff directory.
So c/image's copier doesn't need to pull this layer from the registry.

Changes in containers/image: containers/image#1109

Now c/image's copier leverages this store.
Every time this pulls an image, it first tries to reuse blobs from ALS.
That copier passes each layer's OCI annotations (key-value pairs) + the following key-value to c/storage.Store.

  • containers/image/target.reference: The image reference of an image that contains the target layer. This is needed for supporting registry-backed filesystems (e.g. estargz, zstd:chunked).

*c/image.storageImageDestination.TryReusingBlob() cannot pass image reference to c/storage.Store so this commit adds a new API `*c/image.storageImageDestination.TryReusingBlobWithRef() for achieving this.
When this copier successfully acquires that layer, this reuses this layer without pulling.

Changes in containers/podman: none

Command exapmle

podman --storage-opt "layerstore=/tmp/storage:reference:layerdigest" pull ghcr.io/stargz-containers/rethinkdb:2.3.6-esgz
podman --storage-opt "layerstore=/tmp/storage:reference:layerdigest" run --rm -it ghcr.io/stargz-containers/rethinkdb:2.3.6-esgz /bin/echo 'Hello, World!'

In the above cases, c/storage.Store looks up /tmp/storage/base64("reference=ghcr.io/stargz-containers/rethinkdb:2.3.6-esgz")/base64(<layer digests>)/{diff, info} in ALS.
The example filesystem implementation (https://github.com/ktock/stargz-snapshotter/tree/als-pool-example) is mounted at /tmp/storage shows the extracted layer at that location.
Then rethinkdb:2.3.6-esgz can run without pulling it from registry.

Known limitation and request for comments

Some operations (e.g. save) requires correct value to be set to c/storage.Layer.UncompressedSize. This field seems to be the size of the layer but without compression (i.e. the size of the tar-archived format of that layer). For registry-backed ALS, getting this information is difficult because neither of OCI/Docker image nor registry API provides the way to get the uncompressed size of layers. We cannot get this information without actually pull and decompress the layer, which is not lazy pulling that this PR aims to.

I'll check the codebase deeper to come up with the way to get this information from somewhere or the way to safely allow c/storage.Layer.UncompressedSize to be unknown during operations. But if someone has good idea for solving this, please let me know.

cc: @siscia @giuseppe

metadata := make(map[string]string)
found := false
if additionalLayer := d.additionalLayer(id); additionalLayer != "" {
metadata = make(map[string]string)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is not needed, since it happens two lines above.

}
rootUID = int(st.UID())
rootGID = int(st.GID())
} // TODO: rootUID & rootGID of additional layer?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are ignoring the error, Should you at least log it?

for _, p := range d.AdditionalLayerStores() {
dirPath = path.Join(p, id)
if _, err := os.Stat(dirPath); err == nil {
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should do a return dirpath here, making the code easier to read.

@ktock ktock force-pushed the additional-layer-store branch from 273afd2 to f8d8659 Compare December 28, 2020 05:21
Copy link
Collaborator

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a drive-by comment; I don’t yet understand how this works at all.

layers.go Outdated Show resolved Hide resolved
@ktock ktock force-pushed the additional-layer-store branch from f8d8659 to dfb7332 Compare January 12, 2021 14:37
@ktock
Copy link
Contributor Author

ktock commented Jan 12, 2021

Request for comments (in the above PR description)

Known limitation and request for comments

Some operations (e.g. save) requires correct value to be set to c/storage.Layer.UncompressedSize. This field seems to be the size of the layer but without compression (i.e. the size of the tar-archived format of that layer). For registry-backed ALS, getting this information is difficult because neither of OCI/Docker image nor registry API provides the way to get the uncompressed size of layers. We cannot get this information without actually pull and decompress the layer, which is not lazy pulling that this PR aims to.

I'll check the codebase deeper to come up with the way to get this information from somewhere or the way to safely allow c/storage.Layer.UncompressedSize to be unknown during operations. But if someone has good idea for solving this, please let me know.

@ktock ktock force-pushed the additional-layer-store branch from dfb7332 to 01176fb Compare January 12, 2021 14:48
@ktock
Copy link
Contributor Author

ktock commented Jan 18, 2021

Can we move this forward?

store.go Outdated
return nil, "", ErrLayerUnknown
}
name := base64.StdEncoding.EncodeToString([]byte(k + "=" + v))
if name == "diff" || name == "info" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name is the base64 encoding of the concatenation of a string + "=" + another string.

How could that ever be equal to "diff" or "info"?

drivers/overlay/overlay.go Outdated Show resolved Hide resolved
store.go Outdated Show resolved Hide resolved
store.go Outdated Show resolved Hide resolved
drivers/driver.go Outdated Show resolved Hide resolved
@ktock ktock force-pushed the additional-layer-store branch 3 times, most recently from cb3527d to eed23f2 Compare January 25, 2021 15:23
@ktock
Copy link
Contributor Author

ktock commented Jan 25, 2021

@giuseppe Fixed the path design based on #795 (comment).

The default layer search path on ALS:

<ALS root>/<layer-digest>/
-- diff
-- info
  • diff shows the extracted layer diff.
  • info contains *c/storage.Layer struct.

When ref option is specified for ALS through layerstore option (e.g. "/tmp/storage:ref"), the image reference is contained to that path as well. This is needed for registry-backed file systems (eStargz/zstd:chunked) that require image references for querying the layer contents.

<ALS root>/<image-reference>/<layer-digest>/
-- diff
-- info

Now we can pull and run the image with reusing layer from ALS. The example implementation of the filesystem for ALS is https://github.com/ktock/stargz-snapshotter/tree/als-pool-example/cmd/registry-storage.

The remaining limitation is that ALS layers currently cannot be exported (i.e. saved or pushed) because:

  • It's difficult to reproduce the tar blob that has the exact same digest written in the manifest (".tar-split.gz" files aren't provided by ALS) from the diff directory.
  • Registry-backed filesystem cannot fill c/storage.Layer.UncompressedSize of info file (as described in "Known limitation and request for comments" section in the above PR description).
    • Possible workaround is to contain UncompressedSize in the eStargz/zstd:chunked image as OCI Annotations.

The current workaround for this limitation:

  • Podman: When we export layers using podman, we need to remove the image and re-pull this without --layerstore option.
  • CRI-O: This won't be an issue for CRI-O because we don't export layers on Kubernetes.

I think we can work on this issue on the following PRs but please tell me if we should work on them in this PR as well.

@ktock
Copy link
Contributor Author

ktock commented Feb 1, 2021

Can we move this forward?

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually experiment with new features touching containers/image and containers/storage in skopeo.

Do you have any example on how the new API is going to be used?

EDIT: we don't need to support any existing file system, it is enough to create the remote directory manually, as long as we have a clear idea of how it can be used

drivers/overlay/overlay.go Outdated Show resolved Hide resolved
drivers/overlay/overlay.go Outdated Show resolved Hide resolved
@ktock ktock force-pushed the additional-layer-store branch from eed23f2 to 83ce465 Compare February 2, 2021 03:36
@ktock
Copy link
Contributor Author

ktock commented Feb 2, 2021

@giuseppe

Do you have any example on how the new API is going to be used?

I list examples of the current patch with c/storage-related commands.

Manually preparing additional layer stores

We can manually create an additional layers store using something like the following script:

#!/bin/bash

set -euo pipefail

ORG="${1}"
STORE="${2}"

if [ "${1}" == "--ref" ] ; then
  ORG="${2}"
  STORE="${3}/$(echo -n ${2} | base64)"
fi

OCI="$(mktemp -d)"
skopeo copy docker://${ORG} oci://${OCI}
cat ${OCI}/blobs/sha256/$(cat ${OCI}/index.json | jq -r '.manifests[0].digest' | sed 's/sha256://') \
  | jq -r '.layers[].digest' | sed 's/sha256://' | while read DGST ; do
      mkdir -p ${STORE}/${DGST}/diff && \
        tar -xf ${OCI}/blobs/sha256/${DGST} -C ${STORE}/${DGST}/diff && \
        cat <<EOF > ${STORE}/${DGST}/info
{
  "compressed-diff-digest": "sha256:${DGST}",
  "compressed-size": $(stat --printf="%s" ${OCI}/blobs/sha256/${DGST}),
  "diff-digest": "sha256:$(cat ${OCI}/blobs/sha256/${DGST} | gunzip | sha256sum | sed -E 's/([^ ]*).*/\1/g')",
  "diff-size": $(cat ${OCI}/blobs/sha256/${DGST} | gunzip | wc -c),
  "compression": 2
}
EOF
done

Here we prepare two stores.
The first one is ALS with default path (<ALSroot>/<digest>) and the second one contains the image reference in the path (<ALSroot>/base64(<reference>)/<digest>).

/create_store.sh ghcr.io/stargz-containers/postgres:13.1-esgz /tmp/teststore
/create_store.sh --ref ghcr.io/stargz-containers/postgres:13.1-esgz /tmp/teststore_ref

config for default layer store:

cat <<EOF > /etc/containers/storage.conf
[storage]
driver = "overlay"

[storage.options]
additionallayerstores = ["/tmp/teststore"]
EOF

config for the layer store with "ref" option:

cat <<EOF > /etc/containers/storage.conf
[storage]
driver = "overlay"

[storage.options]
additionallayerstores = ["/tmp/teststore_ref:ref"]
EOF

Running c/storage, c/image -based tools with the additional layer stores

For both of above configurations, we can use these stores with cri-o, podman, skopeo, etc.

CRI-O

# crio
# time crictl pull ghcr.io/stargz-containers/postgres:13.1-esgz
...
real	0m5.502s
user	0m0.021s
sys	0m0.004s

Podman

# podman system prune -a
# time podman pull ghcr.io/stargz-containers/postgres:13.1-esgz
...
real	0m4.433s
user	0m0.222s
sys	0m0.137s
# podman run --rm -it ghcr.io/stargz-containers/postgres:13.1-esgz /bin/echo 'Hello, World!'
Hello, World!

NOTE: Export-related commands (e.g. podman save) are not supported.

# podman save ghcr.io/stargz-containers/postgres:13.1-esgz > /tmp/test.tar
Error: unable to save "ghcr.io/stargz-containers/postgres:13.1-esgz": error copying image to the remote destination: Error writing blob: error happened during read: Digest did not match, expected sha256:0167446b81cb115e90d668017ed409b449aee3a8fade44c7a96d36ebb0937ca5, got sha256:0a45da8cdff8388828644ad2052e91595733e558067b60b9f4bd290035964193

Skopeo

# skopeo copy docker://ghcr.io/stargz-containers/postgres:13.1-esgz containers-storage:myimage
...
real	0m4.335s
user	0m0.216s
sys	0m0.138s
# skopeo inspect containers-storage:myimage
# skopeo inspect --config containers-storage:myimage

NOTE: Export-related commands (e.g. scopeo copy from storage to somewhere) are not supported.

# skopeo copy containers-storage:myimage oci:///tmp/ocidump
FATA[0004] Error writing blob: error happened during read: Digest did not match, expected sha256:0167446b81cb115e90d668017ed409b449aee3a8fade44c7a96d36ebb0937ca5, got sha256:0a45da8cdff8388828644ad2052e91595733e558067b60b9f4bd290035964193 

Mounting and using PoC filesystem

https://github.com/ktock/stargz-snapshotter/tree/als-pool-example/cmd/registry-storage

The following mounts current PoC filesystem (esgz-based) to /tmp/teststore_ref.
NOTE: the implementation is highly experimental

# git clone -b als-pool-example https://github.com/ktock/stargz-snapshotter /tmp/registry-storage
# cd /tmp/registry-storage
# make registry-storage
# mkdir /tmp/registry
# ./out/registry-storage /tmp/registry

This mounted store can be used in the same ways as manually-created stores.
Registry-backed stores (esgz/zstd:chunked-based ones) require image reference in the path (thus ref option is used here) for querying diff contents to the registry API that requires image ref string in addition to the digest.

# cat <<EOF > /etc/containers/storage.conf
[storage]
driver = "overlay"

[storage.options]
additionallayerstores = ["/tmp/registry:ref"]
EOF
# crio
# crictl pull ghcr.io/stargz-containers/python:3.7-esgz

@giuseppe
Copy link
Member

giuseppe commented Feb 2, 2021

thanks!

Do I need any patch for Podman? I've tried the suggestion above but "podman pull" still pulls the entire image from the registry

@ktock
Copy link
Contributor Author

ktock commented Feb 2, 2021

Do I need any patch for Podman?

No patch is needed to c/podman. It just need to be compiled with the patched c/image (containers/image#1109) and the patched c/storage (#795) so go.mod and Makefile need to be modified a bit.

@giuseppe
Copy link
Member

giuseppe commented Feb 2, 2021

No patch is needed to c/podman. It just need to be compiled with the patched c/image (containers/image#1109) and the patched c/storage (#795) so go.mod and Makefile need to be modified a bit.

thanks, so it seems the :ref version doesn't work for me, while the version without :ref works

$ ls /tmp/teststore_ref/
'Z2hjci5pby9zdGFyZ3otY29udGFpbmVycy9wb3N0Z3JlczoxMy4xLWVzZ3o='

$ grep -B1 additionallayerstores ~/.config/containers/storage.conf 
  [storage.options]
  additionallayerstores = ["/tmp/teststore_ref:ref"]

EDIT:

also the version without ref seems to not work sometimes, if I repeat the same command multiple times "podman system reset && podman pull ...` sometimes the ALS is ignored

@giuseppe
Copy link
Member

giuseppe commented Feb 2, 2021

I'll play more with it and update with more details, it is probably a mistake on my side

@giuseppe
Copy link
Member

giuseppe commented Feb 2, 2021

sorry for the noise, the issue was on my side. It works fine now.

Do you think containers/storage should signal somehow that the layer is not used anymore? A rmdir would not cause the directory to go away when it is empty, but a FUSE file system could catch the syscall and decide whether to delete or not the layer.

What do you think?

@ktock
Copy link
Contributor Author

ktock commented Feb 2, 2021

@giuseppe I agree with the notification can be done using rmdir. c/storage/drivers/overlay.Remove() will be a good place to do this.

However, the layer directory possibly is shared (symlinked) by multiple storages. So isn't it hard to determine whether it's safe to cleanup this only with "cleanup" notification?
I think another type of notification is also needed when the driver starts to use a layer (e.g. on c/storage/drivers/overlay.CreateFromDirectory()). Or, xattr might be used as a reference counter.

WDYT?

@mtrmac
Copy link
Collaborator

mtrmac commented Feb 3, 2021

It's difficult to reproduce the tar blob that has the exact same digest written in the manifest (".tar-split.gz" files aren't provided by ALS) from the diff directory.

If the magic backing filesystem can read blobs on-demand, can’t it just read all of it and provide that stream?


The lack of UncompressedSize is somewhat of an issue anyway, if TryReusingBlob returns incorrect data, it could show up in that incorrect data recorded in c/storage image manifests.

Copy link
Collaborator

@mtrmac mtrmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most importantly, how does digest verification happen?

(Reminder: I know very little about c/storage .)

drivers/overlay/overlay.go Outdated Show resolved Hide resolved
drivers/overlay/overlay.go Show resolved Hide resolved
drivers/overlay/overlay.go Outdated Show resolved Hide resolved
layers.go Outdated Show resolved Hide resolved
layers.go Outdated
Comment on lines 636 to 661
compressedsums := make(map[digest.Digest][]string)
uncompressedsums := make(map[digest.Digest][]string)
names := make(map[string]*Layer)
layer = copyLayer(layerInfo)
layer.ID = id
layer.Parent = parent
layer.Created = time.Now().UTC()
if layer.CompressedDigest != "" {
compressedsums[layer.CompressedDigest] = append(compressedsums[layer.CompressedDigest], layer.ID)
}
if layer.UncompressedDigest != "" {
uncompressedsums[layer.UncompressedDigest] = append(uncompressedsums[layer.UncompressedDigest], layer.ID)
}
for _, name := range layer.Names {
if conflict, ok := names[name]; ok {
r.removeName(conflict, name)
}
names[name] = layer
}
// TODO: check if necessary fields are filled
r.layers = append(r.layers, layer)
r.idindex.Add(id)
r.byid[id] = layer
r.byname = names
r.bycompressedsum = compressedsums
r.byuncompressedsum = uncompressedsums
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICS this discards a lot of data about all the other layers. How does that work??!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks better now (but needs a review by someone who knows c/storage, i.e. not me).

store.go Outdated Show resolved Hide resolved
store.go Outdated Show resolved Hide resolved
@ktock ktock force-pushed the additional-layer-store branch from 83ce465 to 781e49b Compare February 4, 2021 11:59
@ktock
Copy link
Contributor Author

ktock commented Apr 7, 2021

How is this going?
Are there anything should be fixed?

Signed-off-by: Kohei Tokunaga <[email protected]>
@ktock ktock force-pushed the additional-layer-store branch from d02abea to 64f0181 Compare April 7, 2021 08:53
@rhatdan
Copy link
Member

rhatdan commented Apr 7, 2021

LGTM

Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rhatdan rhatdan merged commit 5dc7256 into containers:master Apr 8, 2021
@giuseppe
Copy link
Member

giuseppe commented Apr 8, 2021

@ktock merged!

@ktock
Copy link
Contributor Author

ktock commented Apr 8, 2021

Thanks! 🎉

@ktock ktock deleted the additional-layer-store branch April 8, 2021 11:02
@chvish
Copy link

chvish commented May 6, 2021

Hey @ktock, this is super cool. I have been looking for something like this for a while now.

Quick question: Does the demo from above still work? I have been trying to get podman to pull an image from the ALS, but it stills pulls from the remote registry even when the ALS is properly configured.

$ ./bin/podman info -f json | jq .store.graphOptions
{
  "overlay.additionallayerstore": "/tmp/teststore",
  "overlay.mount_program": {
    "Executable": "/usr/bin/fuse-overlayfs",
    "Package": "fuse-overlayfs: /usr/bin/fuse-overlayfs",
    "Version": "fusermount3 version: 3.9.0\nfuse-overlayfs: version 0.7.6\nFUSE library version 3.9.0\nusing FUSE kernel interface version 7.31"
  }
}

@ktock
Copy link
Contributor Author

ktock commented May 7, 2021

@chvish Are you talking about #795 (comment) ? The layout has been updated since then. Could you try the following script instead:

#!/bin/bash

set -euo pipefail

ORG="${1}"
STORE="${2}"

if [ "${1}" == "--ref" ] ; then
    ORG="${2}"
    STORE="${3}/$(echo -n ${2} | base64)"
fi

OCI="$(mktemp -d)"
skopeo copy docker://${ORG} oci://${OCI}
cat ${OCI}/blobs/sha256/$(cat ${OCI}/index.json | jq -r '.manifests[0].digest' | sed 's/sha256://') \
    | jq -r '.layers[].digest' | while read DGST ; do
    ENCODED_DGST=$(echo -n ${DGST} | sed 's/sha256://')
    mkdir -p ${STORE}/${DGST}/diff
    tar -xf ${OCI}/blobs/sha256/${ENCODED_DGST} -C ${STORE}/${DGST}/diff
    touch ${STORE}/${DGST}/use
    cat <<EOF > ${STORE}/${DGST}/info
{
  "compressed-diff-digest": "${DGST}",
  "compressed-size": $(stat --printf="%s" ${OCI}/blobs/sha256/${ENCODED_DGST}),
  "diff-digest": "sha256:$(cat ${OCI}/blobs/sha256/${ENCODED_DGST} | gunzip | sha256sum | sed -E 's/([^ ]*).*/\1/g')",
  "diff-size": $(cat ${OCI}/blobs/sha256/${ENCODED_DGST} | gunzip | wc -c),
  "compression": 2
}
EOF
done

@chvish
Copy link

chvish commented May 7, 2021

@ktock Works like a charm thanks.
Do we have any ETA on when Podman gets a release with the changes in container/storage and containers/image?

@rhatdan
Copy link
Member

rhatdan commented May 7, 2021

Podman 3.2 which is starting it release cycle now, should have this.

@ktock
Copy link
Contributor Author

ktock commented May 7, 2021

@chvish Thank you for trying. CRI-O will also support this after cri-o/cri-o#4850 is merged. So please try it too if you're interested in. The necessary configuration should be the same as Podman.

@rhatdan Thank you for starting the new release cycle!

@giuseppe
Copy link
Member

Runtime cannot verify lazy-pulled chunks

  • Though runtime (Podman/CRI-O) currently passes layer digest to Additional Layer Store for querying a layer, registry-backed store requires manifest digest or TOC digest for verifying each chunk.

  • Possible solution A: Passing manifest digest from c/storage to Additional Layer Store via path element.

    • Adding manifestdigest option to additionallayerstore config. This makes c/storage search the following path on Additional Layer Store.

      • /<store>/<manifestdigest>/<layerdigest>/diff
    • Additional Layer Store can use this digest for verifying each chunk.

  • Possible solution B: Passing manifest digest from Additional Layer Store to c/storage via big data file or something.

    • c/storage makes sure that the passed manifest digest is the expected one.

I think what we can do is to change the LookupAdditionalLayer API to accept also the TOC Digest, so that the c/image change looks like:

diff --git a/storage/storage_dest.go b/storage/storage_dest.go
index b15c9c3b..7a56fb55 100644
--- a/storage/storage_dest.go
+++ b/storage/storage_dest.go
@@ -367,7 +367,7 @@ func (s *storageImageDestination) tryReusingBlobAsPending(blobDigest digest.Dige
 
 	if options.SrcRef != nil {
 		// Check if we have the layer in the underlying additional layer store.
-		aLayer, err := s.imageRef.transport.store.LookupAdditionalLayer(blobDigest, options.SrcRef.String())
+		aLayer, err := s.imageRef.transport.store.LookupAdditionalLayer(blobDigest, options.SrcRef.String(), options.TOCDigest)
 		if err != nil && !errors.Is(err, storage.ErrLayerUnknown) {
 			return false, private.ReusedBlob{}, fmt.Errorf(`looking for compressed layers with digest %q and labels: %w`, blobDigest, err)
 		} else if err == nil {

and then propagate it down, until we can do something like, e.g.:

$ git diff
diff --git a/drivers/overlay/overlay.go b/drivers/overlay/overlay.go
index b2f0e7a94..95a67bc15 100644
--- a/drivers/overlay/overlay.go
+++ b/drivers/overlay/overlay.go
@@ -2475,6 +2475,9 @@ func (al *additionalLayer) CreateAs(id, parent string) error {
        if err := os.WriteFile(path.Join(dir, "additionallayer"), []byte(al.path), 0o644); err != nil {
                return err
        }
+       if err := os.WriteFile(path.Join(dir, "toc-digest"), []byte(al.tocDigest), 0o644); err != nil {
+               return err
+       }
        notifyUseAdditionalLayer(al.path)
        return os.Symlink(filepath.Join(al.path, "diff"), diffDir)
 }

what do you think?

@mtrmac
Copy link
Collaborator

mtrmac commented Mar 18, 2024

AFAICS nothing ever documented that this is related to TOC digests. So, from the c/image side, I’d prefer passing down the full raw value of annotations, which is both:

  • compatible with non-TOC users of ALS
  • forward-compatible with new TOC / compression formats (and with formats c/image is otherwise uninterested in handling, where using the TOCDigest field might be unclear or ambiguous)

And I wonder about compatibility with ALS backend which don’t support this toc-digest file.

But I’ll leave both design aspects to c/storage experts.

@ktock
Copy link
Contributor Author

ktock commented Mar 19, 2024

@giuseppe SGTM, thanks for your suggestion!

In addition to the suggested change, Additional Layer Store implementation needs be fixed to expose the "actual" TOC digest (acquired from the layer data, etc.) via info file of the layer (c/storage accesses to it via *additionalLayer.Info()). Then c/storage should check if the actual TOC digest (passed from Additional Layer Store) == the expected digest (passed from c/image) to make sure that Additional Layer Store uses the expected TOC.
Or, rather than directly using TOC digest, it's also ok to use the manifest digests that contains the TOC digest as a manifest annotation.

If either of the above design is ok to you, I'm willing to implement these changes.

@giuseppe
Copy link
Member

Then c/storage should check if the actual TOC digest (passed from Additional Layer Store) == the expected digest (passed from c/image) to make sure that Additional Layer Store uses the expected TOC.

when would c/storage check for that though? Before using the layer?

I think that the FUSE file system must fail at runtime with EIO if the provided TOC is different than what could be validated.

@ktock
Copy link
Contributor Author

ktock commented Mar 19, 2024

I think that the FUSE file system must fail at runtime with EIO if the provided TOC is different than what could be validated.

How about the following design:

Introduce a new FUSE file path <mountpoint>/base64(imageref)/<layerdigest>/toc-digest for each layer on Additional Layer Store FUSE.

  1. (drivers/overlay).*Driver.LookupAdditionalLayer() writes TOC Digest (expected digest passed from c/image) to toc-digest after confirmed the layer exists in the store (after getAdditionalLayerPath() invocation).
  2. Additional Layer Store checks the provied(expected) toc-digest against the actual one. If not matched, reading <mountpoint>/base64(imageref)/<layerdigest>/info will return EIO since this step.
  3. (drivers/overlay).*Driver.LookupAdditionalLayer() checks if reading on <mountpoint>/base64(imageref)/<layerdigest>/info returns EIO. If EIO, it propagates a proper error to the caller expecting fallback to the non-lazy behaviour. (possibly with a warning log)

Other changes to c/image (/storage/storage_dest.go) are the same as you suggested at #795 (comment)

@mtrmac
Copy link
Collaborator

mtrmac commented Mar 19, 2024

Thinking about this more… from the c/image + c/storage side, every layer only has one identifier; it’s either the compressed blob digest, or the layer digest.

And that identifier is used for deduplication, and it is security-relevant.

So, this needs to be a rather more invasive change.

Historically, ALS was always identifying / deduplicating the layer using the (compressed) blob digest.

If the ALS is actually constructing trust based on the TOC digest, and the compressed blob digest is not relevant / not verified, then the layer must not be identified / deduplicated based on the compressed digest (because an attacker could cause a TOC-based pull with a non-matching compressed digest X, causing future ordinary “victim” pulls of digest X to be deduplicated with the attacker’s layer).


So, I think we need the LookupAdditionalLayer API to

  • already have all lookup inputs (compressed digest, TOC digest if known)
  • look up independently based on one or the other digest
    • it must not look up by compressed digest and then “attach” the TOC digest to the compressed digest if the compressed digest is never validated. (Looking up by the pair of (compressed digest, TOC digest) would be acceptable if it never matched other layers where the TOC digest value is not set or does not match, but in that case using the compressed digest seems to have no benefit)
  • return to the caller an identification of whether the layer was identified by compressed digest / uncompressed digest / TOC digest (or possibly only 2 of the 3), so that AdditionalLayer.PutAs can be called with an appropriate layer ID.
    • In particular, AdditionalLayer.UncompressedDigest must not be returning unverified data; we rely on it being correct for security.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants