[rfc] OCIv2 implementation #256

Open
cyphar opened this issue Sep 8, 2018 · 29 comments
Labels
oci-spec (Issue directly related to OCI image-spec.), upstream
Milestone

Comments

@cyphar
Member

cyphar commented Sep 8, 2018

I have some proposal ideas for the OCIv2 image specification (it would actually be OCIv1.1, but that's a less cool name for the idea), and they primarily involve swapping out the lower levels of the archive format for something better designed (along the same lines as restic or borgbackup).

We need to implement this as a PoC in umoci before it's proposed to the image-spec proper so that we don't get stuck in debates over whether it has been tested "in the wild" -- which is something that I imagine any OCI extension is going to go through.

@cyphar added the upstream and oci-spec labels on Sep 8, 2018
@cyphar added this to the 0.5.0 milestone on Sep 8, 2018
@cyphar
Member Author

cyphar commented Sep 8, 2018

As an aside, it looks like copy_file_range(COPY_FR_DEDUP) wasn't merged. But you can use ioctl(FICLONERANGE) or ioctl(FIDEDUPERANGE) (depending on which is the more correct way of doing it -- I think FICLONERANGE is what we want). If that isn't enough we can always revive the patch, since one of the arguments against it was that nobody needed partial-file deduplication -- but we need it now for OCIv2 to have efficient deduplicated storage.
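
For reference, a rough sketch of what the FICLONERANGE call could look like from Go, assuming the golang.org/x/sys/unix wrappers (nothing like this exists in umoci yet, and the file names are made up):

```go
// Hypothetical sketch: reflink a block-aligned range of src into dest using
// FICLONERANGE via golang.org/x/sys/unix. Offsets and length must be aligned
// to the filesystem block size (a length of 0 means "clone to EOF").
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func cloneRange(src, dest *os.File, srcOff, length, destOff uint64) error {
	return unix.IoctlFileCloneRange(int(dest.Fd()), &unix.FileCloneRange{
		Src_fd:      int64(src.Fd()),
		Src_offset:  srcOff,
		Src_length:  length,
		Dest_offset: destOff,
	})
}

func main() {
	src, err := os.Open("chunk.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer src.Close()

	dest, err := os.OpenFile("rootfs-file.bin", os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		log.Fatal(err)
	}
	defer dest.Close()

	// Clone 1 MiB starting at offset 0 in the source to offset 4096 in the
	// destination -- both offsets and the length are multiples of 4 KiB.
	if err := cloneRange(src, dest, 0, 1<<20, 4096); err != nil {
		log.Fatalf("FICLONERANGE failed (wrong filesystem, misaligned range?): %v", err)
	}
}
```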

@cyphar
Member Author

cyphar commented Sep 8, 2018

FICLONERANGE needs to be block-aligned (unsurprisingly), but unfortunately the alignment requirement applies to both the source and the destination ranges. This means that if our chunks are not aligned to the filesystem block size, very few ranges will actually line up.

On the plus side, for small files we can just use reflinks.

@cyphar
Member Author

cyphar commented Sep 19, 2018

Some things that should be tested and discussed:

  • How bad is the Merkle tree hit? Should each individual file be linked from a map (or a packfile) of some kind to avoid really tall trees? How deep can a typical distribution's filesystem go? Each dereference can be quite expensive (especially if it involves a pull -- though I would hope that HTTP/2 server push would resolve this somewhat).

  • What sort of chunk size is optimal?

  • How should we implement canonical-representation checking? This should be a hard failure when trying to use an image, to prevent incompatible tools from doing something wrong.

  • As a point of comparison, looking at how much transfer-deduplication gain we can get from content-defined-chunking would be interesting.

  • Do we need to define a new rootfs type other than layered for this change? Layers are something we should potentially drop -- but maybe we should structure it as a "snapshot" concept in case people still want snapshots.

@vbatts
Member

vbatts commented Sep 19, 2018

Like your inspiration from https://github.com/restic/restic, I think there is a good argument that the chunks and content-addressable storage ought to be compatible with https://github.com/systemd/casync too.

@cyphar
Member Author

cyphar commented Sep 20, 2018

I will definitely look into this, though it should be noted (and I think we discussed this in person in London) that while it is very important for the standard to strongly recommend fixed chunking parameters (so that all image builders can create compatible chunks for inter-distribution deduplication), I think they should remain configurable so that we have the option to transition to different algorithms in the future.

Is there a paper or document that describes how casync's chunking algorithm works? I'm looking at the code and it uses Buzhash (which has a Go implementation apparently) but it's not clear to me what the chunk boundary condition is in shall_break (I can see that it's (v % c->discriminator) == (c->discriminator - 1) but I don't know what that means).
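
To make sure I understand the idea, here's a toy Go sketch of that kind of boundary test -- this is not Buzhash and not casync's actual discriminator derivation, just an illustration of cutting a chunk whenever the rolling hash modulo a discriminator hits a fixed value:

```go
// Toy content-defined chunker: a trivial rolling sum stands in for Buzhash,
// purely to illustrate the "hash % discriminator == discriminator-1" test.
package main

import (
	"bufio"
	"fmt"
	"os"
)

const (
	windowSize    = 48   // bytes of context fed into the rolling hash
	discriminator = 1024 // in casync this is (roughly) derived from the desired average chunk size
	minChunk      = 4 * 1024
	maxChunk      = 64 * 1024
)

func main() {
	r := bufio.NewReader(os.Stdin)
	var window [windowSize]byte
	var sum uint32 // rolling sum of the bytes currently in the window
	chunkLen, pos := 0, 0

	for {
		b, err := r.ReadByte()
		if err != nil {
			break
		}
		// Slide the window: drop the oldest byte, add the new one.
		idx := pos % windowSize
		sum -= uint32(window[idx])
		window[idx] = b
		sum += uint32(b)
		pos++
		chunkLen++

		boundary := sum%discriminator == discriminator-1
		if (boundary && chunkLen >= minChunk) || chunkLen >= maxChunk {
			fmt.Printf("chunk of %d bytes\n", chunkLen)
			chunkLen = 0
		}
	}
	if chunkLen > 0 {
		fmt.Printf("final chunk of %d bytes\n", chunkLen)
	}
}
```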

I'm also quite interested in the serialisation format. Lennart describes it as a kind of random-access tar that is also reproducible (and contains all filesystem information in a sane way). I will definitely take a look at it. While I personally like using a Merkle tree because it's what git does and is kind of what makes the most sense IMO (plus it is entirely transparent to the CAS), I do see that having a streamable system might be an improvement too.

@cyphar
Member Author

cyphar commented Sep 20, 2018

As an aside, since we are creating a new serialisation format (unless we reuse casync), we will need to implement several debugging tools, because you will no longer be able to use tar to debug layers.

@giuseppe
Member

giuseppe commented Oct 10, 2018

I've already talked with @cyphar about it, but I'll comment here as well so as not to lose track of it. The deduplication could also be done only locally (for example, on XFS with reflink support), so that network deduplication and local storage deduplication can be handled separately.

I've played a bit with FIDEDUPERANGE here: https://github.com/giuseppe/containers-dedup

@flx42
Contributor

flx42 commented Oct 10, 2018

@cyphar what was the argument against simply doing file-level deduplication? I don't claim to know the composition of all Docker images, but on our side (NVIDIA) we have a few large libraries (cuDNN, cuBLAS, cuFFT) which are currently duplicated across the multiple images we publish:

  • The files are duplicated if you redo a build even if nothing has changed, since it creates a new layer with the same content.
  • The files are duplicated across CUDA images with different distros: the same library is shipped for our CentOS 6/7, and Ubuntu 14.04/16.04/18.04 tags.

@giuseppe @cyphar it is my understanding that when deduplicating files/blocks at the storage level, we decrease storage space but the two files won't be able to share the same page cache entry. Is that accurate?
Is that an issue that can be solved at this level too? Or will users still need to layer carefully to achieve this sharing?

@vbatts
Member

vbatts commented Oct 10, 2018

@flx42 overlayfs has the best approach to reusing the page cache, since it's the same inode on the same maj/min device

@flx42
Contributor

flx42 commented Oct 10, 2018

@vbatts right, and that's what we use today combined with careful layering. I just wanted to clarify if there was a solution at this level, for the cases where you do have the same file but not from the same layer.

@AkihiroSuda
Member

The deduplication could also be done only locally (for example on: XFS with reflinks support). So that network deduplication and local storage deduplication could be done separately.

I think we (or at least I) have been focusing on registry-side storage & network deduplication.

Runtime-side local deduplication is likely to be specific to each runtime, and therefore out of scope for the OCI Image Spec & Dist Spec?

@cyphar
Member Author

cyphar commented Oct 11, 2018

@AkihiroSuda

A few things:

  1. It depends how you define "runtime". If you include everything about the machine that pulls the image, extracts the image, and then runs a container as the "runtime", then you're correct that it's a separate concern. But I would argue that most image users need to do both pulling and extraction -- so it's clearly an image-spec concern to at least consider it.

  2. Ignoring (or punting on) storage deduplication (when we have the chance to do it) would likely result in suboptimal storage deduplication -- which is something that people want! I would like OCIv2 images to actually replace OCIv1 and if the storage deduplication properties are worse or no better, then that might not happen.

Given that CDC (plus separating out metadata into a Merkle tree or some similar filesystem representation) already solves both "registry-side storage & network deduplication", I think it's reasonable to consider whether we can take advantage of the same features for local storage deduplication...

@cyphar
Member Author

cyphar commented Oct 11, 2018

@flx42

what was the argument against doing simply file-level deduplication?

Small modifications of large files, or files that are substantially similar but not identical (think man pages, shared libraries and binaries shipped by multiple distributions, and so on) would be entirely duplicated. So for the image format I think that using file-level deduplication is flawed, for the same reasons that file-level deduplication in backup systems is flawed.

But for storage deduplication this is a different story. My main reason for wanting to use reflinks is to be able to use less disk space. Unfortunately (as I discovered above) this is not possible for variable-size chunks (unless they are all multiples of the filesystem block size).

Using file-based deduplication for storage does make some sense (though it does naively double your storage requirement out of the gate). My idea for this would be that when you download all of the chunks and metadata into your OCI store, you set up a separate content-addressed store which has files that correspond to each file represented in your OCI store. Then, when constructing a rootfs, you can just reflink (or hardlink if you want) all of the files from the file store into the rootfs (overlayfs would have to be used to make sure you couldn't touch any of the underlying files). Of course, it might be necessary (for fast container "boot" times) to pre-generate the rootfs for any given image -- but benchmarks would have to be done to see if that's necessary.
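
To make the idea concrete, here is a hypothetical Go sketch of that construction step, assuming the golang.org/x/sys/unix wrappers -- the store layout, paths, and helper names are all made up for illustration:

```go
// Hypothetical sketch of the "file store" idea: materialise one file of a
// rootfs by reflinking it from a per-file content-addressed store, falling
// back to a hardlink if the filesystem doesn't support FICLONE.
package main

import (
	"log"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// linkFromStore places the file store entry for digest at target.
func linkFromStore(storeDir, digest, target string) error {
	src := filepath.Join(storeDir, digest)

	// Try a whole-file reflink first (shares extents, CoW on write).
	if err := tryReflink(src, target); err == nil {
		return nil
	}
	// Fall back to a hardlink (shares the inode, and therefore the page
	// cache, but then an overlay is needed to keep the store immutable).
	return os.Link(src, target)
}

func tryReflink(src, dest string) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()

	out, err := os.OpenFile(dest, os.O_WRONLY|os.O_CREATE|os.O_EXCL, 0o644)
	if err != nil {
		return err
	}
	if err := unix.IoctlFileClone(int(out.Fd()), int(in.Fd())); err != nil {
		out.Close()
		os.Remove(dest) // clean up so the hardlink fallback can create it
		return err
	}
	return out.Close()
}

func main() {
	if err := linkFromStore("/var/lib/oci/filestore", "sha256_deadbeef", "./rootfs/usr/bin/example"); err != nil {
		log.Fatal(err)
	}
}
```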

My main interest in reflinks was to see whether it was possible to use them to remove the need for the copies in the "file store", but given that you cannot easily map CDC chunks to filesystem blocks (the latter being fixed-size) I think we are pretty much required to make copies. You could play with a FUSE filesystem to do it, but that is still slow (though some recent proposals to use eBPF could make it natively fast).

As for the page-cache I'm not sure. Reflinks work by referencing the same extents in the filesystem, so it depends on how the page-cache interacts with extents or whether the page-cache is entirely tied to the particular inode.

@cyphar
Member Author

cyphar commented Oct 11, 2018

@flx42

It should be noted that with this proposal there would no longer be a need for layers (because the practical deduplication they provide is effectively zero), though I think that looking into how we can use existing layered filesystems would be very useful -- they are obviously quite efficient and it makes sense to take advantage of them.

Users having to manually finesse layers is something that doesn't make sense (in my view), because the design of the image format should not be such that it causes problems if you aren't careful about how images are layered. So I would hope that a new design would not repeat that problem.

@flx42
Contributor

flx42 commented Oct 11, 2018

@cyphar Thanks for the detailed explanation, I didn't have a clear picture of the full process, especially on how you were planning to assemble the rootfs, but now I understand.

As for the page-cache I'm not sure. Reflinks work by referencing the same extents in the filesystem, so it depends on how the page-cache interacts with extents or whether the page-cache is entirely tied to the particular inode.

I found the following discussion on this topic: https://www.spinics.net/lists/linux-btrfs/msg38800.html
I was able to reproduce their results with btrfs/xfs, indicating that the page cache was not shared. As you mentioned, the solution could be to hardlink files when assembling the final rootfs instead of reflinking. You would need an overlay obviously, but that means you won't be able to leverage the CoW mechanism from the underlying filesystem (which might be fine-grained) and instead rely on copy_up, which copies the full file AFAIK.

Not necessarily a big deal, but nevertheless an interesting benefit of layer sharing+overlay that would be nice to keep.

@flx42
Contributor

flx42 commented Oct 16, 2018

FWIW, I wanted to quantify the difference with block-level vs file-level deduplication on real data, so I wrote a few simple scripts here: https://github.com/flx42/layer-dedup-test

It pulls all the tags from this list (minus the Windows tags that will fail). This was the size of the layer directory after the pull:

+ du -sh /mnt/docker/overlay2
822G    /mnt/docker/overlay2

Using rmlint with hardlinks (file-level deduplication):

+ du -sh /mnt/docker/overlay2
301G	/mnt/docker/overlay2

Using restic with CDC (block-level deduplication):

+ du -sh /tmp/restic
244G    /tmp/restic

This is a quick test, so no guarantee that it worked correctly, but it's a good first approximation. File-level deduplication performed better than I expected; block-level with CDC is indeed better, but at the cost of extra complexity and possibly a two-level content store (block then file).

@cyphar
Member Author

cyphar commented Nov 7, 2018

Funnily enough, Go 1.11 has changed the default archive/tar output -- something that having a canonical representation would solve. See #269.
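
For illustration, a hypothetical example of the kind of normalisation a canonical representation would have to mandate for archive/tar output (this is not umoci's actual tar writer):

```go
// Illustration only: the sort of header normalisation a canonical tar
// representation would have to pin down (one fixed header format, cleared
// timestamps, numeric ids instead of user/group names).
package main

import (
	"archive/tar"
	"log"
	"os"
	"time"
)

func main() {
	tw := tar.NewWriter(os.Stdout)
	defer tw.Close()

	data := []byte("hello\n")
	hdr := &tar.Header{
		Name:    "etc/motd",
		Mode:    0o644,
		Size:    int64(len(data)),
		Format:  tar.FormatPAX,         // pin one header format explicitly
		ModTime: time.Unix(0, 0).UTC(), // clear non-reproducible fields
		Uname:   "",                    // numeric ids only
		Gname:   "",
	}
	if err := tw.WriteHeader(hdr); err != nil {
		log.Fatal(err)
	}
	if _, err := tw.Write(data); err != nil {
		log.Fatal(err)
	}
}
```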

@cyphar
Member Author

cyphar commented Nov 7, 2018

@flx42

You would need an overlay obviously, but that means you won't be able to leverage the CoW mechanism from the underlying filesystem (which might be fine-grained) and instead rely on copy_up, which copies the full file AFAIK.

Does overlay share the page cache? It was my understanding that it didn't, but that might be an outdated piece of information.

@flx42
Contributor

flx42 commented Nov 8, 2018

@cyphar yes it does:
https://docs.docker.com/storage/storagedriver/overlayfs-driver/#overlayfs-and-docker-performance

Page Caching. OverlayFS supports page cache sharing. Multiple containers accessing the same file share a single page cache entry for that file. This makes the overlay and overlay2 drivers efficient with memory and a good option for high-density use cases such as PaaS.

Also a while back I launched two containers, one pytorch and one tensorflow, using the same CUDA+cuDNN base layers. Then using /proc/<pid>/maps on both containers I was able to verify they loaded the same copy of one library (the same inode).

@cgwalters

My idea for this would be that when you download all of the chunks and metadata into your OCI store, you set up a separate content-addressed store which has files that correspond to each file represented in your OCI store. Then, when constructing a rootfs, you can just reflink (or hardlink if you want) all of the files from the file store into a rootfs (overlayfs would have to be used to make sure you couldn't touch any of the underlying files).

This is exactly what libostree is, though today we use a read-only bind mount since we don't want people trying to persist state in /usr. (It's still a mess today that in the Docker ecosystem / is writable by default, while best practice is to use Kubernetes PersistentVolumes or equivalent.) Though running containers as non-root helps, since that will at least deny writes to /usr.

@cyphar
Member Author

cyphar commented Jan 21, 2019

Blog post on the tar issues is up. https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar

@vbatts
Member

vbatts commented Jan 21, 2019 via email

@Toasterson

Interesting discussion on the file-based image proposal. If you want to see a production-grade example, have a look at the Image Packaging System (IPS) from illumos. It was originally designed to be used as a package manager inside a container, but one can easily leave dependencies out of a manifest and thus create an image layer, so to speak. Manifests are also merged ahead of time, so you only need to download what is needed. Additionally, since metadata is encoded in text files, one can simply encode any attributes needed later in the spec.
I was thinking of extending the server with a registry API so that one can download a dynamically generated tarfile while using the file-based storage in the background.

While it has a few Pythonisms in it, I made a port of the server and manifest code to Go some time ago. Let me know if any of this is interesting to you; I can give detailed insights into the challenges we have stumbled upon in the field over the last 10 years.

Original Python implementation (in use today on OpenIndiana and OmniOS): https://github.com/OpenIndiana/pkg5
Personal port to Go (server side only atm): https://git.wegmueller.it/Illumos/pkg6

@safinaskar

Here is my own simplistic parallel casync/desync alternative, written in Rust, which uses fixed-size chunking (which is great for VM images): borgbackup/borg#7674 (comment). You can also find a benchmark there comparing my tool to casync, desync, and other alternatives -- my tool is far faster than all of them (but I cheat by using fixed-size chunking). See the whole issue for context, and especially this comment, borgbackup/borg#7674 (comment), for a comparison between casync, desync, and other CDC-based tools.

@safinaskar

Okay, so here is a list of the GitHub issues I wrote in the last few days on this topic (i.e. fast fixed-size and CDC-based deduplication). I hope they provide useful insight to everyone interested in fast deduplicated storage.
borgbackup/borg#7674
systemd/casync#259
folbricht/desync#243
ipfs/specs#227
dpc/rdedup#222
#256

@ariel-miculas

I'm working on puzzlefs, which shares goals with the OCIv2 design draft. It's written in Rust and uses the FastCDC algorithm to chunk filesystems.
Here's a summary of the space saved compared to the traditional OCIv1 format.
I will also present it at the upcoming Open Source Summit Europe in September.

@safinaskar

@ariel-miculas, cool! Let me share some thoughts.

@ariel-miculas

Thanks for your feedback!

  • Interesting issue with rdedup; puzzlefs uses the fastcdc crate, which I've noticed is not used by rdedup. Some benchmarks with puzzlefs would certainly be useful.
  • I wasn't aware of the FUSE issue with suspend. However, I'm also working on a kernel driver for puzzlefs; see version 1, version 2, and also this GitHub issue.

@safinaskar

Hi, @cyphar and everybody else!
Here are a few of my ideas about implementing a proper format, i.e. on tar alternatives and on deduplication and compression.
Yes, I already wrote in this bug report, but I want to write again, because what I'm proposing is a nearly-ready solution.

First of all, there is a great alternative to tar called catar, by Poettering: https://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html .
catar is a plain tar replacement; it has no deduplication or compression. catar is able to store nearly all file metadata supported by the Linux kernel, in a deterministic way, and you can decide which metadata you want to store and which you don't (e.g. --with=xattrs --without=acl).
catar is the base for casync, which adds content-defined chunking (CDC) based deduplication and compression.
Unfortunately, casync seems to be unmaintained, but I think the catar format and the tools for creating/extracting it are stable and finished. catar also has a lot of forks and re-implementations. Beware: some of these forks/re-implementations don't sort file names when storing them in a catar, thus defeating reproducibility.

(Yes, everybody here is aware of casync/catar, but I still want to emphasize that casync/catar -- and especially catar -- is exactly what we need: it is deterministic, and you can choose what metadata to store.)

Second: I did a speed/size comparison of many deduplication solutions (borg, casync, etc.): borgbackup/borg#7674 . The whole bug report is a great read; it contains a lot of information about the optimizations needed to create a fast deduplication solution. I also created my own deduplication solution called azwyon, which beats everything else in terms of speed; here is the code: https://paste.gg/p/anonymous/c5b7b841a132453a92725b41272092ab (assume public domain). azwyon does deduplication and compression and nothing else (deduplication between different blobs is supported), i.e. it is a replacement for borg and casync, but without file-tree support and without remote support.

Here is a summary of what is wrong with the other solutions and why they are so slow: https://lobste.rs/s/0itosu/look_at_rapidcdc_quickcdc#c_ygqxsl .

Okay, so why is my solution so fast? The reasons are as follows:

(As far as I understand, puzzlefs is not parallel, so I suppose my solution is faster than it, but I didn't test.)

My solution (azwyon) is minimalist. It doesn't support storing file trees; it stores data blobs only (azwyon splits the blobs into chunks, deduplicates them, and compresses them). But of course, you can combine azwyon with tar/catar.

Also, azwyon doesn't do CDC; it splits blobs into fixed-size chunks instead, so comparing azwyon with CDC-based tools is unfair. azwyon's compression ratio is also 25% worse than the other solutions' because of the lack of CDC. But, of course, you could add CDC to azwyon.
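
To show what I mean by the fixed-size-chunking approach, here is a rough Go sketch -- this is not azwyon's actual code (which is Rust and uses blake3 + zstd); it uses stdlib sha256 + gzip purely to stay self-contained:

```go
// Sketch of fixed-size chunking with deduplication: each chunk is stored
// compressed exactly once, keyed by its content hash.
package main

import (
	"compress/gzip"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

const chunkSize = 4 << 20 // 4 MiB fixed-size chunks

// storeBlob splits r into fixed-size chunks, writes each previously unseen
// chunk compressed into storeDir, and returns the ordered list of chunk
// hashes (the "recipe" needed to reassemble the blob).
func storeBlob(r io.Reader, storeDir string) ([]string, error) {
	var recipe []string
	buf := make([]byte, chunkSize)
	for {
		n, err := io.ReadFull(r, buf)
		if n > 0 {
			sum := sha256.Sum256(buf[:n])
			key := hex.EncodeToString(sum[:])
			recipe = append(recipe, key)

			path := filepath.Join(storeDir, key+".gz")
			if _, statErr := os.Stat(path); os.IsNotExist(statErr) {
				if werr := writeCompressed(path, buf[:n]); werr != nil {
					return nil, werr
				}
			}
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			return recipe, nil
		}
		if err != nil {
			return nil, err
		}
	}
}

func writeCompressed(path string, data []byte) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	zw := gzip.NewWriter(f)
	if _, err := zw.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := zw.Close(); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

func main() {
	f, err := os.Open("disk.img")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	recipe, err := storeBlob(f, "/tmp/chunk-store")
	if err != nil {
		panic(err)
	}
	fmt.Printf("stored %d chunks\n", len(recipe))
}
```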

So, in short: add catar and CDC to azwyon and you will get a solution that is blazingly fast. In the current benchmarks azwyon extracts data 4.7 times faster (!) than CDC-based casync (again: I'm comparing fixed-size chunks with CDC, which is unfair), while having a 25% worse compression ratio.

And azwyon creates a snapshot 35 times faster (!!!) than CDC-based casync.

(Full benchmark is here: borgbackup/borg#7674 (comment) )

All this makes possible some use cases that were impossible before. My personal goal was to store 10 GiB VM images. Other solutions (borg, casync, etc.) were too slow for this, so they were simply impractical. azwyon made this use case possible: it stores a snapshot of a 10 GiB VM image in 2.7 s (with compression and deduplication) and extracts it back in 4.9 s. 12 different snapshots of a single 10 GiB VM (nearly 120 GiB in total) take 2.8 GB of storage (vs. 2.1 GB for casync, but casync is many times slower).

Again: if you add CDC to azwyon you will regain the compression-ratio benefits while keeping the speed benefits.

I already wrote about this in this thread, but I decided to write again because what I'm talking about is a nearly finished solution. Just go "Rust + blake3 + zstd + CDC + pariter + catar" and you will get a solution that is deterministic and blazingly fast. It will be so fast that it will enable use cases not possible before -- like storing a 10 GiB image into deduplicating storage in 2.7 s and extracting it back in 4.9 s.

(Everywhere in this post, when I say "compression ratio" I mean the ratio between the source uncompressed data and the deduplicated compressed data. When I say "storing time", "compressing time", "snapshot creation time", etc., I mean the time to both deduplicate and compress the data. When I say "extracting time", "uncompressing time", etc., I mean the time to extract the data from the deduplicated compressed storage.)
