
split into workspace/multiple crates #32

cgwalters opened this issue Nov 11, 2024 · 16 comments
@cgwalters (Collaborator)

In the original composefs crate I tried to keep it to literally just Rust bindings to the core functionality; there was no concept of a global repository, garbage collection etc.

Say a project like rauc.io (or what ostree/bootc do today, or maybe rpm/dpkg) wants to do things with composefs. I think it makes sense to keep that "core" split, because the APIs and functionality there should be 100% stable. The rest of the things here would be in e.g. a composefs-storage crate or so?

@allisonkarlitskaya (Collaborator)

I think what's missing is to define what the "things" in question are, and what the value-add would be over just shelling out to mkcomposefs, composefs-info, or mount.composefs yourself. I kinda tend to think that it's better when a well-defined CLI is the API, without the need to add a wrapping layer that may or may not be what the person consuming it wants to use.

Aside from those wrappers (which I'd really prefer not to expose as APIs), I'm not sure what the "core" of this repository is, if not the "storage" part.

This is another reason that I'm struggling with the parser side of the dumpfile code from the original crate. I'm not sure why it would make sense to have that unless we're looking to implement our own version of mkcomposefs (which is not planned for the time being). If you have a dumpfile and you want to create a composefs out of it, then you don't need us to help you. That's literally mkcomposefs.
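The "well-defined CLI as the API" approach mentioned above can be sketched in a few lines with `std::process::Command`. This is an illustrative sketch, not code from this repository; it assumes `mkcomposefs`'s documented `--from-file` mode for building an image from a dumpfile.

```rust
use std::path::Path;
use std::process::Command;

/// Build (but don't run) an `mkcomposefs` invocation that turns a dumpfile
/// into a composefs image. This is the whole "wrapper": anything more
/// opinionated is exactly what the comment argues against shipping as API.
fn mkcomposefs_command(dumpfile: &Path, image: &Path) -> Command {
    let mut cmd = Command::new("mkcomposefs");
    cmd.arg("--from-file").arg(dumpfile).arg(image);
    cmd
}

fn main() {
    let cmd = mkcomposefs_command(Path::new("layer.dump"), Path::new("layer.cfs"));
    // In real use you'd call `cmd.status()` and check the exit code; here we
    // only inspect the constructed invocation so the sketch runs anywhere.
    let args: Vec<String> = cmd
        .get_args()
        .map(|a| a.to_string_lossy().into_owned())
        .collect();
    assert_eq!(cmd.get_program().to_str(), Some("mkcomposefs"));
    assert_eq!(args, ["--from-file", "layer.dump", "layer.cfs"]);
    println!("ok");
}
```

The caller, not the wrapper, decides error handling, logging, and sandboxing, which is the point being made.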

@cgwalters (Collaborator, Author) commented Nov 12, 2024

> This is another reason that I'm struggling with the parser side of the dumpfile code from the original crate. I'm not sure why it would make sense to have that unless we're looking to implement our own version of mkcomposefs

Ah, I think that's rooted in the other disagreement. The primary use case here is parsing the output of composefs-info dump (i.e. vs splitstream as source of truth). The secondary use case is accepting untrusted dump files for a system service to run mkcomposefs + mount.composefs so we can do composefs unprivileged.

@allisonkarlitskaya (Collaborator)

I don't consider splitstream to be a source of truth in any way. It's strictly an implementation detail, and doesn't ever appear "on the wire", and definitely doesn't need to be specified in any way. The source of truth is the layer tarball. Splitstream is just a space-saving way to store it which also happens to have very nice performance characteristics for building images.

@cgwalters (Collaborator, Author)

Yes, we can debate the splitstream one in the other issue. Did the two rationales for having a Rust parser for the dump file format make sense or no?

@allisonkarlitskaya (Collaborator)

Not really. Would we parse the dumpfile just to print it out again when running mkcomposefs?

@cgwalters (Collaborator, Author) commented Nov 12, 2024

For the "mkcomposefs from untrusted input" indeed that's what I'd like to do because it adds a whole layer of defense before passing potentially malicious input to C code that until recently had never seen fuzzing.

But the other case of parsing composefs-info dump into an in-memory filesystem metadata tree, we need a Rust parser right?
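The "layer of defense" idea above amounts to: parse untrusted dumpfile input into typed Rust data, reject anything malformed, and re-serialize a normalized form before handing it to the C tooling. Here is a deliberately minimal sketch of that pattern; the two fields and their layout are simplified assumptions (the real dump format has many more fields and escapes whitespace in paths), so this illustrates the validation pattern, not the actual format.

```rust
/// A toy dumpfile entry: just a path and a size. The real format carries
/// mode, ownership, mtime, xattrs, etc. -- this struct is a placeholder.
#[derive(Debug, PartialEq)]
struct DumpEntry {
    path: String,
    size: u64,
}

/// Parse one untrusted line, rejecting anything suspicious up front.
fn parse_entry(line: &str) -> Result<DumpEntry, String> {
    let mut fields = line.split_whitespace();
    let path = fields.next().ok_or("missing path")?.to_string();
    // Refuse relative paths and traversal components before the C code
    // ever sees them.
    if !path.starts_with('/') || path.split('/').any(|c| c == "..") {
        return Err(format!("suspicious path: {path}"));
    }
    let size: u64 = fields
        .next()
        .ok_or("missing size")?
        .parse()
        .map_err(|e| format!("bad size: {e}"))?;
    Ok(DumpEntry { path, size })
}

/// Re-emit a normalized line: any quirks of the attacker-controlled input
/// (odd whitespace, etc.) are gone, and only validated data goes downstream.
fn unparse_entry(e: &DumpEntry) -> String {
    format!("{} {}", e.path, e.size)
}

fn main() {
    let entry = parse_entry("/usr/bin/true   12345").unwrap();
    assert_eq!(unparse_entry(&entry), "/usr/bin/true 12345");
    assert!(parse_entry("../etc/shadow 1").is_err());
    println!("ok");
}
```

The same parser then serves the second use case: building an in-memory metadata tree from `composefs-info dump` output.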

@jeckersb

(Just inserting myself here to note that I've written a bunch of rust parsers in the past using nom, if that's something you'd be interested in me knocking out for the dump file format let me know)

@cgwalters (Collaborator, Author)

In this case we're talking about the existing parser code in containers/composefs@7432e8c#diff-f8019245093e26a2f17e5b661cb9040f56327a3a29ac9fe79a3b4c5a5eb7050b that was dropped and that I think has use cases; we're not writing new code currently.

@allisonkarlitskaya (Collaborator)

I still don't believe that composefs-info dump makes sense at all here. I'd actually have an easier time accepting tarfile-turned-into-dumpfile itself as an intermediate format and save us the roundtrip through composefs. But even this, I'd still consider to be inferior to the current code which is capable of reproducing the original tar stream bit-for-bit. In fact, I'm not even sure what creating an image would look like in that case, since the first thing we do is to verify that the entire image matches the sha256 checksum we're expecting (which involves checking the sha256 of the layers, which we'd need to be able to reliably reproduce).

@cgwalters (Collaborator, Author)

OK let's try to spend the 30 mins tomorrow going through the splitstream bits.

But: AFAIK rauc.io today is using the composefs blobs as source-of-truth for GC, and it is parsing them via the libcomposefs C code. Remember, not every use case for composefs needs splitstream; they may not have the dual constraints/features imposed by OCI around checksums of tarballs and the desire to be able to re-push content. So for people who (reasonably) want to use composefs blobs as truth, a dump file parser is helpful. (Of course it'd be quite a bit more efficient to directly read the erofs, either by binding the C library or by a rewrite, but that's its own heavy lift...)

So I think it makes sense to have the dump file parser available.

Are you getting hung up on using it in the existing code or merely how it should be structured/placed?

@cgwalters (Collaborator, Author)

> and the desire to be able to re-push content.

(And if they do have that desire, for projects that chose not-tar they may already have a "normal form" that they expect to use anyways; e.g. ostree is very much designed to have such a normal form; though there's different complexities in trying to have a composefs blob as source-of-truth for ostree, but that's tangential)

@allisonkarlitskaya (Collaborator)

> But: AFAIK rauc.io today is using the composefs blobs as source-of-truth for GC and it is parsing them via the libcomposefs C code.

composefs-rs is doing that as well, except via parsing the (very simple) output of composefs-info objects.

@cgwalters (Collaborator, Author)

> composefs-rs is doing that as well, except via parsing the (very simple) output of composefs-info objects.

Yeah, that's a fair point. But then what I'd still say is I think we should have an API like `get_composefs_objects(f: AsFd) -> Result<HashSet<Sha256>>` or so, right? That's "core functionality" as distinct from code that operates on an object directory.

That said, it'd make total sense to me to have a `struct ObjectsDirectory` that is a thin abstraction over `objects/<digest>` etc. and may have a high level function like `fn gc(&self, roots: HashSet<Sha256>)` or so.

That leaves the higher level Repository and OCI type stuff in a distinct composefs-storage crate or so that would have way more dependencies (e.g. I think the core crate probably shouldn't necessarily pull in tokio) and functionality.
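The `ObjectsDirectory` idea from the comment above could look roughly like this. Everything here is a hypothetical sketch of the proposed API shape: `Sha256` is a stand-in type alias, and the directory contents are held in memory so the sketch stays testable without a real `objects/` tree (a real implementation would enumerate `objects/xx/yyyy...` entries via a dirfd).

```rust
use std::collections::HashSet;

/// Stand-in for a real fixed-size digest type.
type Sha256 = String;

/// Thin abstraction over an `objects/<digest>` directory.
struct ObjectsDirectory {
    // Hypothetical: a real version would hold a dirfd, not a set.
    objects: HashSet<Sha256>,
}

impl ObjectsDirectory {
    /// GC as pure set arithmetic: everything not reachable from `roots`
    /// is garbage. Returns what was collected, for logging or dry runs.
    fn gc(&mut self, roots: &HashSet<Sha256>) -> HashSet<Sha256> {
        let garbage: HashSet<Sha256> =
            self.objects.difference(roots).cloned().collect();
        self.objects.retain(|o| roots.contains(o));
        garbage
    }
}

fn main() {
    let mut dir = ObjectsDirectory {
        objects: HashSet::from([
            "aaaa".to_string(),
            "bbbb".to_string(),
            "cccc".to_string(),
        ]),
    };
    let roots = HashSet::from(["aaaa".to_string(), "cccc".to_string()]);
    let garbage = dir.gc(&roots);
    assert_eq!(garbage, HashSet::from(["bbbb".to_string()]));
    assert_eq!(dir.objects.len(), 2);
    println!("ok");
}
```

The split being argued for is visible in the dependencies: this needs nothing beyond std, while the Repository/OCI layer pulls in tokio and friends.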

@cgwalters (Collaborator, Author)

In the very short term it's fine; we can basically publish mostly as-is after some cleanups, since no one that I know of is using the previous crate version. This is more about setting expectations for the medium-term direction.

@allisonkarlitskaya (Collaborator)

The one thing that I think absolutely makes sense to split out is the oci functionality. It's been kept in a separate subdirectory and it could (and should) be its own crate. Then we'd implement cfsctl as a separate crate that depends on composefs and composefs-oci.

The example you give — tokio — is a dependency only of the oci stuff, and only because the containers-image-proxy is async. There are a tonne more examples like that.

@allisonkarlitskaya (Collaborator)

> Yeah, that's a fair point. But then what I'd still say is I think we should have an API like `get_composefs_objects(f: AsFd) -> Result<HashSet<Sha256>>` or so right? That's "core functionality" as distinct from code that operates on an object directory.

I still disagree about the usefulness of these kinds of APIs. This is basically a very thin wrapper on top of composefs-info objects. You can write that for yourself in a dozen or so lines of code, and it might be that the decisions we make about how to format that output (returning a HashSet instead of calling a callback, a parsed Sha256 instead of a hex digest string, or even split-out `xx/yyyy` format strings, etc.) are not the ones that the eventual users of the API would want.

In short: if people want these core "composefs" functionalities, they should just use the composefs CLI as an API. That's what it's there for.
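For scale, the "dozen or so lines" version of such a wrapper might look like this. The sketch assumes `composefs-info objects` prints one object per line; whether that output is plain hex or split `xx/yyyy...` paths is exactly the kind of formatting decision the comment is about, so this version normalizes both by stripping the slash. Hypothetical code, not this repository's API.

```rust
use std::collections::HashSet;

/// Turn line-per-object output (as from `composefs-info objects`) into a
/// set of hex digest strings, accepting either `abcdef...` or `ab/cdef...`.
fn parse_objects(output: &str) -> HashSet<String> {
    output
        .lines()
        .map(|l| l.trim().replace('/', ""))
        .filter(|l| !l.is_empty())
        .collect()
}

fn main() {
    // In real use this string would come from running the CLI, e.g. via
    // `Command::new("composefs-info").arg("objects")...` and capturing stdout.
    let out = "ab/cdef0123\n12/34567890\n";
    let digests = parse_objects(out);
    assert!(digests.contains("abcdef0123"));
    assert_eq!(digests.len(), 2);
    println!("ok");
}
```

Which is the argument in a nutshell: the wrapper is short enough that each consumer can write the variant they actually want.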
