[rfc] OCIv2 implementation #256
As an aside, it looks like
On the plus side, for small files we can just use reflinks.
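(A minimal sketch of the whole-file reflink idea, assuming Linux and the golang.org/x/sys/unix wrapper for the FICLONE ioctl; illustrative only, not umoci code.)

```go
// reflink.go: create dst as a reflink (shared-extent copy) of src.
// Works only on filesystems with reflink support, e.g. XFS or btrfs.
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

func reflink(srcPath, dstPath string) error {
	src, err := os.Open(srcPath)
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.OpenFile(dstPath, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, 0o644)
	if err != nil {
		return err
	}
	defer dst.Close()

	// FICLONE shares all of src's extents with dst; no data is copied,
	// and the filesystem does copy-on-write if either file is modified.
	return unix.IoctlFileClone(int(dst.Fd()), int(src.Fd()))
}

func main() {
	if err := reflink(os.Args[1], os.Args[2]); err != nil {
		panic(err)
	}
}
```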
Some things that should be tested and discussed:
Like your inspiration from https://github.com/restic/restic, I think there is a good argument that the chunks and content-addressable storage ought to be compatible with https://github.com/systemd/casync too.
I will definitely look into this, though it should be noted (and I think we discussed this in person in London) that while it is very important for fixed chunking parameters to be strongly recommended in the standard (so that all image builders can create compatible chunks for inter-distribution chunking), I think they should be configurable so that we have the option to transition to different algorithms in the future.

Is there a paper or document that describes how casync's chunking algorithm works? I'm looking at the code and it uses Buzhash (which apparently has a Go implementation), but it's not clear to me what the chunk boundary condition is.

I'm also quite interested in the serialisation format. Lennart describes it as a kind of random-access tar that is also reproducible (and contains all filesystem information in a sane way). I will definitely take a look at it. While I personally like using a Merkle tree because it's what
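For illustration, here is a rough Go sketch of a Buzhash-style content-defined chunker. The window size, min/max/average chunk sizes, table seed, and the all-ones boundary condition are assumptions chosen for the example and are not casync's actual parameters; it only shows the general shape of "roll a hash over the last N bytes and cut when it hits a boundary value".

```go
package chunker

import (
	"math/bits"
	"math/rand"
)

const (
	windowSize = 48            // bytes in the rolling-hash window
	minChunk   = 16 << 10      // never cut a chunk smaller than 16 KiB
	maxChunk   = 256 << 10     // always cut by 256 KiB
	boundary   = (1 << 16) - 1 // low 16 bits all set => ~64 KiB average chunks
)

// table maps each byte value to a fixed 32-bit value. It is seeded
// deterministically so that independent builders agree on chunk
// boundaries for identical input.
var table [256]uint32

func init() {
	rng := rand.New(rand.NewSource(0x0C1A))
	for i := range table {
		table[i] = rng.Uint32()
	}
}

// Split cuts data into content-defined chunks. A boundary is declared
// when the low bits of the rolling hash are all ones (this condition is
// illustrative; casync's real condition may differ in its details).
func Split(data []byte) [][]byte {
	var chunks [][]byte
	start := 0
	var h uint32
	for i := 0; i < len(data); i++ {
		if i-start >= windowSize {
			// Slide the window: drop the oldest byte, add the newest.
			h = bits.RotateLeft32(h, 1) ^
				bits.RotateLeft32(table[data[i-windowSize]], windowSize) ^
				table[data[i]]
		} else {
			// Window not yet full for this chunk: just accumulate.
			h = bits.RotateLeft32(h, 1) ^ table[data[i]]
		}
		size := i - start + 1
		if (size >= minChunk && h&boundary == boundary) || size >= maxChunk {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
			h = 0 // restart the rolling hash for the next chunk
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}
```

The important property is that a boundary depends only on the bytes inside the window, so inserting or removing data early in a file only shifts chunk boundaries near the edit rather than re-cutting the whole file.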
As an aside, since we are creating a new serialisation format (unless we reuse
I've already talked with @cyphar about it, but I'll comment here as well so as not to lose track of it. The deduplication could also be done only locally (for example, on XFS with reflink support), so that network deduplication and local storage deduplication can be handled separately. I've played a bit with FIDEDUPERANGE here: https://github.com/giuseppe/containers-dedup
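For reference, here is a hedged sketch of what driving FIDEDUPERANGE from Go could look like. This is not containers-dedup; it assumes the golang.org/x/sys/unix wrappers IoctlFileDedupeRange and FileDedupeRange, so the exact names should be checked against the current package docs.

```go
// dedup.go: ask the kernel to share extents between two files that
// already contain identical data, without copying anything.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	src, err := os.Open(os.Args[1]) // file whose extents will be shared
	if err != nil {
		panic(err)
	}
	defer src.Close()

	dst, err := os.OpenFile(os.Args[2], os.O_RDWR, 0) // file to dedupe against src
	if err != nil {
		panic(err)
	}
	defer dst.Close()

	fi, err := src.Stat()
	if err != nil {
		panic(err)
	}

	// FIDEDUPERANGE makes the kernel verify the byte ranges really are
	// identical before sharing extents, so it is safe even if the files
	// change concurrently. Very large files may need to be deduplicated
	// in slices, since the kernel caps the length handled per call.
	arg := unix.FileDedupeRange{
		Src_offset: 0,
		Src_length: uint64(fi.Size()),
		Info: []unix.FileDedupeRangeInfo{{
			Dest_fd:     int64(dst.Fd()),
			Dest_offset: 0,
		}},
	}
	if err := unix.IoctlFileDedupeRange(int(src.Fd()), &arg); err != nil {
		panic(err)
	}
	info := arg.Info[0]
	fmt.Printf("deduplicated %d bytes (status %d)\n", info.Bytes_deduped, info.Status)
}
```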
@cyphar what was the argument against simply doing file-level deduplication? I don't claim to know the typology of all Docker images, but on our side (NVIDIA) we have a few large libraries (cuDNN, cuBLAS, cuFFT) which are currently duplicated across multiple images we publish:
@giuseppe @cyphar it is my understanding that when deduplicating files/blocks at the storage level, we decrease storage space but the two files won't be able to share the same page cache entry. Is that accurate?
@flx42 overlayfs has the best approach for reusing the page cache, since it's the same inode on the same maj/min device.
@vbatts right, and that's what we use today combined with careful layering. I just wanted to clarify if there was a solution at this level, for the cases where you do have the same file but not from the same layer.
I think we (at least I) have put the focus on registry-side storage and network deduplication. Runtime-side local deduplication is likely to be specific to runtimes and out of scope for the OCI Image Spec & Distribution Spec?
A few things:
Given that CDC (and the separation of metadata into a Merkle tree or some similar filesystem representation) already solves both registry-side storage and network deduplication, I think it is reasonable to consider whether it's possible to take advantage of the same features for local storage deduplication...
Small modifications of large files, or files that are substantially similar but not identical (think man pages, shared libraries, and binaries shipped by multiple distributions, and so on), would be entirely duplicated. So for the image format I think that using file-level deduplication is flawed, for the same reasons that file-level deduplication in backup systems is flawed. But for storage deduplication this is a different story.

My main reason for wanting to use reflinks is to be able to use less disk space. Unfortunately (as I discovered above) this is not possible for variable-size chunks (unless they are all multiples of the filesystem block size). Using file-based deduplication for storage does make some sense (though it does naively double your storage requirement out of the gate). My idea for this would be that when you download all of the chunks and metadata into your OCI store, you set up a separate content-addressed store which has files that correspond to each file represented in your OCI store. Then, when constructing a rootfs, you can just reflink (or hardlink if you want) all of the files from the file store into the rootfs (overlayfs would have to be used to make sure you couldn't touch any of the underlying files). Of course, it might be necessary (for fast container "boot" times) to pre-generate the rootfs for any given image -- but benchmarks would have to be done to see if it's necessary.

My main interest in reflinks was to see whether it was possible to use them to remove the need for the copies in the "file store", but given that you cannot easily map CDC chunks to filesystem blocks (the latter being fixed-size) we are pretty much required to make copies, I think. You could play with a FUSE filesystem to do it, but that is still slow (though some recent proposals to use eBPF could make it natively fast). As for the page cache, I'm not sure: reflinks work by referencing the same extents in the filesystem, so it depends on how the page cache interacts with extents or whether the page cache is entirely tied to the particular inode.
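To make the file-store idea above a bit more concrete, here is a rough sketch under stated assumptions: a hypothetical local store laid out as one file per digest, a hypothetical Entry metadata type, and hardlinks with a plain-copy fallback standing in for reflinks (overlayfs is assumed to protect the shared files from modification).

```go
package main

import (
	"io"
	"os"
	"path/filepath"
)

// Entry is a hypothetical piece of image metadata: a path inside the
// rootfs plus the digest of the file content it should contain.
type Entry struct {
	Path   string // e.g. "usr/bin/bash"
	Digest string // e.g. "sha256:abcd..."
	Mode   os.FileMode
}

// populate builds the rootfs from the file store: each file is hardlinked
// from the store (cheap, and it shares the inode and hence the page cache)
// and copied only if linking fails (e.g. the store is on another filesystem).
func populate(storeDir, rootfs string, entries []Entry) error {
	for _, e := range entries {
		stored := filepath.Join(storeDir, e.Digest)
		target := filepath.Join(rootfs, e.Path)
		if err := os.MkdirAll(filepath.Dir(target), 0o755); err != nil {
			return err
		}
		if err := os.Link(stored, target); err == nil {
			continue // hardlink worked, nothing to copy
		}
		if err := copyFile(stored, target, e.Mode); err != nil {
			return err
		}
	}
	return nil
}

func copyFile(src, dst string, mode os.FileMode) error {
	in, err := os.Open(src)
	if err != nil {
		return err
	}
	defer in.Close()
	out, err := os.OpenFile(dst, os.O_WRONLY|os.O_CREATE|os.O_TRUNC, mode)
	if err != nil {
		return err
	}
	defer out.Close()
	_, err = io.Copy(out, in)
	return err
}
```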
It should be noted that with this proposal there would no longer be a need for layers (because the practical deduplication they provide is effectively zero), though I think that looking into how we can use existing layered filesystems would be very useful -- because they are obviously quite efficient and it makes sense to take advantage of them. Users having to manually finesse layers is something that doesn't make sense (in my view), because the design of the image format should not be such that it causes problems if you aren't careful about how images are layered. So I would hope that a new design would not repeat that problem.
@cyphar Thanks for the detailed explanation. I didn't have a clear picture of the full process, especially how you were planning to assemble the rootfs, but now I understand.
I found the following discussion on this topic: https://www.spinics.net/lists/linux-btrfs/msg38800.html
Not necessarily a big deal, but nevertheless an interesting benefit of layer sharing + overlay that would be nice to keep.
FWIW, I wanted to quantify the difference between block-level and file-level deduplication on real data, so I wrote a few simple scripts here: https://github.com/flx42/layer-dedup-test
It pulls all the tags from this list (minus the Windows tags, which will fail). This was the size of the layer directory after the pull:
Using
Using
This is a quick test, so there's no guarantee that it worked correctly, but it's a good first approximation. File-level deduplication performed better than I expected; block-level with CDC is indeed better, but at the cost of extra complexity and possibly a two-level content store (block then file).
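(These are not the layer-dedup-test scripts themselves; just a stdlib-only sketch of how the file-level number can be approximated: hash every regular file under the layer directory and compare the total size against the size of the unique digests.)

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

func main() {
	seen := map[string]bool{}
	var total, unique int64

	err := filepath.WalkDir(os.Args[1], func(path string, d fs.DirEntry, err error) error {
		if err != nil || !d.Type().IsRegular() {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()

		// Content digest of the whole file: two files with the same
		// digest would be deduplicated into a single stored copy.
		h := sha256.New()
		if _, err := io.Copy(h, f); err != nil {
			return err
		}
		digest := hex.EncodeToString(h.Sum(nil))

		total += info.Size()
		if !seen[digest] {
			seen[digest] = true
			unique += info.Size() // only the first copy costs space
		}
		return nil
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("total %d bytes, after file-level dedup %d bytes\n", total, unique)
}
```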
Funnily enough, Go 1.11 has changed the default
Does overlay share the page cache? It was my understanding that it didn't, but that might be an outdated piece of information.
@cyphar yes it does:
Also, a while back I launched two containers, one PyTorch and one TensorFlow, using the same CUDA+cuDNN base layers. Then using
This is exactly what libostree is, though today we use a read-only bind mount since we don't want people trying to persist state in
Blog post on the tar issues is up: https://www.cyphar.com/blog/post/20190121-ociv2-images-i-tar
And so much conversation on The Twitter
Interesting discussion on the image proposal based on files. If you want to see a production-grade example, have a look at the Image Packaging System (IPS) from illumos. It was originally based on the concept of being used as a package manager inside a container, but one can easily leave dependencies out of a manifest and thus create an image layer, so to speak. Manifests are also merged ahead of time, so you only need to download what is needed. Additionally, because all metadata is encoded in text files, one can simply encode any attributes needed later in the spec. While it has a few pythonisms in it, I made a port of the server and manifest code to golang some time ago. Let me know if any of this is interesting to you; I can give detailed insights and information about the challenges we stumbled upon in the field over the last 10 years.

Original Python implementation (in use today on OpenIndiana and OmniOS): https://github.com/OpenIndiana/pkg5
So here is my own simplistic parallel casync/desync alternative, written in Rust, which uses fixed-size chunking (which is great for VM images): borgbackup/borg#7674 (comment). You can also find a benchmark there comparing my tool to casync, desync, and other alternatives; my tool is much faster than all of them (but I cheat by using fixed-size chunking). See the whole issue for context, and especially this comment, borgbackup/borg#7674 (comment), for a comparison between casync, desync, and other CDC-based tools.
Okay, so here is a list of GitHub issues I
I'm working on puzzlefs, which shares goals with the OCIv2 design draft. It's written in Rust and uses the FastCDC algorithm to chunk filesystems.
@ariel-miculas, cool! Let me share some thoughts.
Thanks for your feedback!
Hi, @cyphar and everybody else! First of all, there is a great alternative to tar called catar, by Poettering: https://0pointer.net/blog/casync-a-tool-for-distributing-file-system-images.html (Yes, everybody here is aware of casync/catar, but I still want to emphasize that casync/catar, and especially catar, is exactly what we need. It is deterministic, and you can choose what metadata to store.)

Second: I did a speed/size comparison of many deduplication solutions (borg, casync, etc.): borgbackup/borg#7674. The whole bug report is a great read; it contains a lot of information about the optimizations needed to create a fast deduplication solution. Also, I created my own deduplication solution called azwyon, which beats everything else in terms of speed; here is the code: https://paste.gg/p/anonymous/c5b7b841a132453a92725b41272092ab (assume public domain). Azwyon does deduplication and compression and nothing else (deduplication between different blobs is supported), i.e. it is a replacement for borg and casync, but without file-tree support and without remote support.

Here is a summary of what is wrong with other solutions and why they are so slow: https://lobste.rs/s/0itosu/look_at_rapidcdc_quickcdc#c_ygqxsl

Okay, why is my solution so fast? The reasons are:
(As far as I understand, puzzlefs is not parallel, so I suppose my solution is faster than it, but I didn't test.) My solution (azwyon) is minimalist. It doesn't support storing file trees; it stores data blobs only (azwyon splits the blobs into chunks, deduplicates them, and compresses them). But of course, you can combine azwyon with tar/catar. Also, azwyon doesn't do CDC; it splits blobs into fixed-size chunks instead, so comparing azwyon with CDC-based tools is unfair. Also, azwyon's compression ratio is 25% worse than other solutions because of the lack of CDC. But, of course, you can add CDC to azwyon.

So, in short: add catar and CDC to azwyon and you will get a solution that is blazingly fast. In current benchmarks azwyon extracts data 4.7 times faster than CDC-based casync (again: I am comparing fixed-size chunks with CDC, which is unfair) while having a 25% worse compression ratio, and azwyon creates a snapshot 35 times faster than CDC-based casync. (The full benchmark is here: borgbackup/borg#7674 (comment).)

All of this makes possible some use cases that were impossible before. My personal goal was to store 10 GiB VM images. Other solutions (borg, casync, etc.) were too slow for this, so they were simply impractical. Azwyon made this use case possible: it stores a snapshot of a 10 GiB VM image in 2.7 s (with compression and deduplication) and extracts it back in 4.9 s. Twelve different snapshots of a single 10 GiB VM (nearly 120 GiB in total) take 2.8 GB of storage (vs 2.1 GB for casync, but casync is many times slower). Again: if you add CDC to azwyon you will regain the compression-ratio benefits while keeping the speed benefits.

I already wrote about this in this thread, but I decided to write again because what I'm talking about is a nearly finished solution. Just go "Rust + BLAKE3 + zstd + CDC + pariter + catar" and you will get a solution that is deterministic and blazingly fast -- so fast that it will enable use cases not possible before, like storing a 10 GiB image into a deduplicating data store in 2.7 s and extracting it back in 4.9 s.

(Everywhere in this post, "compression ratio" means the ratio between the source uncompressed data and the deduplicated compressed data; "storing time", "compressing time", "snapshot creation time", etc. mean the time to deduplicate and compress the data combined; and "extracting time", "uncompressing time", etc. mean the time to extract data from the deduplicated compressed storage.)
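(Not azwyon itself -- that code is behind the paste link above -- but a stdlib-only sketch of the core recipe it describes: fixed-size chunks, parallel hashing, and storing each digest once. SHA-256 stands in for BLAKE3 and compression is omitted; a real store would compress each unique chunk, e.g. with zstd, before writing it out under its digest.)

```go
// Sketch: split a blob into fixed-size chunks, hash them in parallel,
// and keep only chunks whose digest has not been seen before.
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
	"runtime"
	"sync"
)

const chunkSize = 4 << 20 // 4 MiB fixed-size chunks (no CDC)

type chunk struct {
	index  int // position in the blob (a real store needs it to reassemble)
	digest [sha256.Size]byte
	data   []byte
}

func main() {
	blob, err := os.ReadFile(os.Args[1])
	if err != nil {
		panic(err)
	}

	// Fan the chunks out to one hashing goroutine per CPU.
	jobs := make(chan chunk)
	results := make(chan chunk)
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for c := range jobs {
				c.digest = sha256.Sum256(c.data)
				results <- c
			}
		}()
	}
	go func() {
		for off, idx := 0, 0; off < len(blob); off, idx = off+chunkSize, idx+1 {
			end := off + chunkSize
			if end > len(blob) {
				end = len(blob)
			}
			jobs <- chunk{index: idx, data: blob[off:end]}
		}
		close(jobs)
		wg.Wait()
		close(results)
	}()

	// Deduplicate: the "store" here is just an in-memory set of digests.
	seen := map[[sha256.Size]byte]bool{}
	var stored, total int
	for c := range results {
		total += len(c.data)
		if !seen[c.digest] {
			seen[c.digest] = true
			stored += len(c.data)
			// A real store would now compress the chunk and write it
			// to disk under its digest.
		}
	}
	fmt.Printf("input %d bytes, unique chunks %d bytes\n", total, stored)
}
```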
I have some proposal ideas for the OCIv2 image specification (it would actually be OCIv1.1, but that is a less cool name for the idea), and they primarily involve swapping out the lower levels of the archive format for something better designed (along the same lines as restic or borgbackup).
We need to implement this as a PoC in umoci before it's proposed to the image-spec proper so that we don't get stuck in debates over whether it has been tested "in the wild" -- which is something that I imagine any OCI extension is going to go through.
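To make the proposal slightly more concrete, here is one hypothetical shape the metadata could take (purely illustrative, not a spec proposal): filesystem metadata lives in its own reproducible structure, and regular-file contents are referenced as lists of content-defined chunk digests, so identical chunks are transferred and stored only once.

```go
// Package ociv2sketch holds illustrative types only; the field names
// and the exact metadata captured would be decided by the real spec.
package ociv2sketch

// Chunk identifies one content-defined chunk by its digest.
type Chunk struct {
	Digest string `json:"digest"` // e.g. "sha256:..."
	Size   int64  `json:"size"`
}

// File is one entry in the (reproducibly ordered) filesystem metadata.
type File struct {
	Path   string            `json:"path"`
	Mode   uint32            `json:"mode"`
	UID    int               `json:"uid"`
	GID    int               `json:"gid"`
	Xattrs map[string]string `json:"xattrs,omitempty"`
	Link   string            `json:"link,omitempty"`   // symlink target, if any
	Chunks []Chunk           `json:"chunks,omitempty"` // empty for dirs, symlinks, etc.
}

// Image ties the metadata to the chunk store: every chunk referenced
// here is fetched by digest, so chunks shared between images (or between
// versions of one image) are downloaded and stored once.
type Image struct {
	Files []File `json:"files"`
}
```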