Automatic sharding of big directories #7022

Closed
2 tasks
Stebalien opened this issue Mar 20, 2020 · 20 comments
@Stebalien
Member

This is a tracking issue for the sharded directory feature. This feature can be enabled with:

> ipfs config --json Experimental.ShardingEnabled true

TODOs:

  • Make sure we don't shard too soon. At the moment, the cutoff to start sharding is very low.
  • Ideally, avoid creating sharded directories that are too "deep". However, this may be tricky to do without breaking changes.

Ideally, we'd fix this issue by switching to unixfs 2.0 and a generalized sharded map implementation. However, that's probably not going to happen in the near future and this feature is used widely enough that we'll likely have to stabilize it in its current form.

@Stebalien Stebalien added kind/enhancement A net-new feature or improvement to an existing feature topic/tracking Topic tracking labels Mar 20, 2020
@achingbrain
Member

I'd really like to enable sharding by default in js-IPFS to reduce the number of code paths and A/B tests.

ipfs.add in js-IPFS takes a shardingSplitThreshold argument that sounds similar to the sharding cutoff above - it's set to 1000 by default when sharding is enabled.

Can go-IPFS take the same option? That should tick off the first item.

cc @aschmahmann

@lidel
Member

lidel commented Feb 15, 2021

We really should fix this. js-ipfs shards dirs with more than 1k items out of the box, while go-ipfs falls flat.

On top of that, the biggest problem in go-ipfs right now is that when you enable sharding, it shards EVERYTHING, so even a small dir with under 1k items is still sharded, which produces different CIDs.

I believe a lot of work has happened in the IPLD team since we implemented HAMT sharding in go-ipfs; it would be useful to leverage that work when we make sharding enabled by default in go-ipfs.

@ribasushi @warpfork are there any libs / specs related to improved sharding in IPLD/Filecoin which we should be aware of in planning here?

@lidel lidel added the topic/sharding Topic about Sharding (HAMT etc) label Feb 15, 2021
@warpfork
Member

Yes!

  • We shifted to trying to make sharding "in scope" for IPLD by defining it as something you do with an ADL ("Advanced Data Layout"), which is a plugin-like concept.

  • A lot of work was done on HAMTs as a map sharding mechanism. There's a spec in https://github.com/ipld/specs/blob/master/data-structures/hashmap.md ; a golang implementation in https://github.com/ipld/go-ipld-adl-hamt ; and a JS implementation in ((hm, @rvagg, give me the latest link?)).

  • We don't have particular work done on the "don't shard below $X entries" problem. The idea was that we'd more or less ignore this during the implementation of the "guts" of something like a HAMT, and then handle it with a small piece of logic at the top level of an ADL. The net effect would be that the plugin/versioning story of an ADL would thus also cover the "when do we start sharding" decision (which is good: the answers all live in one place).

  • All in all there's a lot of pieces here. Some assembly still required.

I'll re-tag to @rvagg and @mvdan for more info.

@mvdan
Contributor

mvdan commented Feb 15, 2021

Doesn't a HAMT already "shard" or split the data among multiple blocks, depending on its parameters like bitWidth and bucketSize? If one were to add sharding at the ADL layer on top of go-ipld-adl-hamt, would you not end up with two layers of sharding?

@warpfork
Member

warpfork commented Feb 15, 2021

iiuc I think the point of the "shardingSplitThreshold" mentioned above is that there's some value below which any sharding should be skipped, and a simple inline map (potentially, not even a single new block) should be used.

So what I meant by "don't shard below $X entries" is like: less than 1000 (or $X) entries, don't invoke a hamt at all and just use a regular plain map in the same block. Or in other words, I'm treating "sharding" as roughly an alias for using a HAMT (or some other ADL that serves a similar purpose to HAMTs).

My statement about this and how it relates to ADLs is: whatever that condition is for whether to invoke the full HAMT (and all its sharding logic) vs just do an inline plain map... is something that I think should be placed within the boundary of what we call ADLs. Doing so means we have a place to identify what that $X condition is: it's the same place we identify the ADL as a whole. This doesn't mean the problem goes away, but it means two problems can be solved at once in one place: we already have the "identify the sharding logic needed here" problem; we might as well combine that with the "what is the shard vs inline threshold and how do we identify that logical condition" problem.
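
As a rough illustration of the split being described here (purely a sketch: the type names, the 1000-entry threshold, and the string "encodings" are hypothetical stand-ins, not go-ipld-adl-hamt's actual API), an entry-count gate in front of a HAMT could look like this in Go:

```go
package main

import "fmt"

// entry is a single directory entry: a name plus the CID it links to.
type entry struct {
	Name string
	CID  string
}

// shardThreshold is the hypothetical $X below which we skip the HAMT entirely.
const shardThreshold = 1000

// encodeDirectory chooses a representation based purely on entry count:
// a plain inline map below the threshold, a HAMT shard above it.
func encodeDirectory(entries []entry) string {
	if len(entries) <= shardThreshold {
		// Small directory: a plain map in a single block, no ADL machinery needed.
		return fmt.Sprintf("plain map with %d entries", len(entries))
	}
	// Large directory: hand the entries to the HAMT (ADL) implementation.
	return fmt.Sprintf("HAMT shard with %d entries", len(entries))
}

func main() {
	fmt.Println(encodeDirectory(make([]entry, 10)))   // plain map with 10 entries
	fmt.Println(encodeDirectory(make([]entry, 5000))) // HAMT shard with 5000 entries
}
```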

@mvdan
Contributor

mvdan commented Feb 16, 2021

Ah, gotcha, that makes sense. I wonder if we could expose that as a generic ADL layer, e.g. "use a basic map if size < N, otherwise use this fancy multi-block map prototype".

@rvagg
Member

rvagg commented Feb 17, 2021

JS implementation is https://github.com/rvagg/js-ipld-hashmap which is algorithmically backed by https://github.com/rvagg/iamap, but we're a little out of sync across our HAMT implementations (and spec) as we've tried to (a) integrate Filecoin's formats into our specs and ideally enable our implementations to speak that format, and (b) improve based on some learnings. It wouldn't be hard, but if we want to put the pedal down on UnixFS v2 and a better directory map->hamt version/implementation then we would need to budget the work and get it all sorted out. Otherwise, if we're not intending to reach beyond UnixFS v1, it seems like we should be using the HAMT stuff that's supposed to come with MFS. I'm not really across that work, how it might differ from the HAMT work we've done outside of MFS, how standardised it is, or whether we can break the layout to do it betterer.
@mvdan - there might be yet another HAMT layout to consider for the ipld-prime ADL work you've done; we haven't even talked about MFS in this context before.

@achingbrain
Member

> we're a little out of sync across our HAMT implementations

Yes - js-IPFS uses hamt-sharding* and not ipld-hashmap (I didn't know about this new package; if it generates the same HAMTs as go-IPFS, maybe you could open a PR refactoring the unixfs-importer and mfs internals to use it instead).

> We don't have particular work done on the "don't shard below $X entries" problem

Neither did js-IPFS, TBH. I think the 1k figure for $X may have been plucked out of thin air, but it seems to work well enough and the user can override it if they choose.

> if we want to put the pedal down on UnixFS v2

I want UnixFS v2, but I think tying the resolution of this issue to v2 would be a mistake as it's a much bigger job with an unclear end point, as mentioned in the OP.

A more expedient path would be for go-IPFS to just add the shardingSplitThreshold option to ipfs.add to let the user specify $X and shard conditionally based on that.

The js-IPFS import logic uses a regular UnixFS directory until it hits the threshold, then converts it to a shard.

The MFS implementation then uses this mechanism to convert existing directories to HAMT shards when adding members takes you over the threshold, and also turns shards back into regular directories when removing members takes you below it.


* hamt-sharding was extracted from ipfs-unixfs-importer with the aim of reuse but I don't think that's been terribly successful 🤷
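
A rough sketch of that convert-on-threshold behaviour, written in Go purely for illustration (the dirNode type, the string map standing in for real UnixFS nodes, and the hard-coded 1000 are illustrative assumptions, not js-IPFS's or go-IPFS's actual code):

```go
package main

import "fmt"

// dirNode is a toy stand-in for a UnixFS directory that can be either a
// regular (flat) directory or a HAMT shard.
type dirNode struct {
	sharded bool
	entries map[string]string // name -> CID
}

// threshold mirrors the 1000-entry default discussed in this thread.
const threshold = 1000

// add inserts an entry and converts the directory to a shard once it crosses
// the threshold, as the importer/MFS logic described above does.
func (d *dirNode) add(name, cid string) {
	d.entries[name] = cid
	if !d.sharded && len(d.entries) > threshold {
		d.sharded = true // re-encode as a HAMT shard
	}
}

// remove deletes an entry and converts a shard back to a regular directory
// once it drops to the threshold or below.
func (d *dirNode) remove(name string) {
	delete(d.entries, name)
	if d.sharded && len(d.entries) <= threshold {
		d.sharded = false // re-encode as a flat directory
	}
}

func main() {
	d := &dirNode{entries: map[string]string{}}
	for i := 0; i <= threshold; i++ {
		d.add(fmt.Sprintf("file-%d", i), "bafy-placeholder")
	}
	fmt.Println("sharded after adds:", d.sharded) // true
	d.remove("file-0")
	fmt.Println("sharded after remove:", d.sharded) // false
}
```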

@rvagg
Member

rvagg commented Feb 18, 2021

Excuse my ignorance - I'm not quite grokking it from this thread, and this thread is already breaking my assumptions about how it currently works - but don't go-ipfs and js-ipfs both implement the same HAMT algorithm and block structure for large directories? Or is this issue actually saying that go-ipfs doesn't have such support and only js-ipfs does, and therefore the use of the HAMT is not actually standardised or specified?

@lidel lidel changed the title Sharded Directories Automatic sharding of big directories May 2, 2021
@warpfork
Member

warpfork commented May 3, 2021

Quick comment on this that's on my mind right now --

Whatever thresholds are used here... well, there's an interesting tension in what metric to use as an input to the decision.

  • Option 1: decide based on the number of entries in a map.

    • Virtue: simple and predictable.
    • Vice: is not directly deciding based on the thing that really matters -- which is that we make sure we don't exceed the block size.
  • Option 2: decide based on the serial size of map entries.

    • Virtue: is a more direct association with the hard limit we actually care about.
    • Vice: significantly more expensive to compute this: requires actually doing the serialization work... and potentially, repeatedly; in the worst case, it's a guess-and-check thing which could require multiple guesses.

I think we should steer towards Option 1:

  • Simplicity and predictability should be the dominant concern.
  • The guess-and-check costs that Option 2 would require are... I would move to pretty much mark that nonviable. (If it was serialize once: okay, maybe we can swallow that. Repeatedly? No.)
  • While we don't have a concrete interface for ADLs yet, I think it seems reasonable to bet that: having an ADL make decisions based on entry count is easy. Having the ADL have an API that lets it repeatedly serialize things (without storing them) is... an interesting ask.
  • The need to pre-calculate estimated sizes of things, and/or make limits to key sizes in order to make that predictable, etc... that's definitely unfortunate.
    • but it's also something that filesystems as we know them elsewhere have always ended up doing! (e.g. your linux kernel has some filename length limits, and it has them so things fit in inodes in predictable ways... same problem, same solution.)
    • as unfortunate as it may be, it's less unfortunate than having not-really-bounded costs to repeatedly probe how big serial sizes are.
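
To make the tension concrete, a small Go sketch of the two decision metrics (the link type, the crude byte-sum standing in for real serialization, and the constants are illustrative assumptions, not any real implementation):

```go
package main

import "fmt"

// link is a toy directory link: a name plus the raw bytes of a CID.
type link struct {
	Name string
	CID  []byte
}

// Option 1: decide purely on entry count. Cheap and predictable, but only an
// indirect proxy for the block-size limit we actually care about.
func shouldShardByCount(links []link, maxEntries int) bool {
	return len(links) > maxEntries
}

// Option 2: decide on serialized size. More faithful to the hard limit, but
// the true size is only known after encoding the node; the byte sum below is
// a cheap estimate, and the point above is that getting the exact answer
// means doing the serialization work, potentially more than once.
func shouldShardBySize(links []link, maxBytes int) bool {
	size := 0
	for _, l := range links {
		size += len(l.Name) + len(l.CID)
	}
	return size > maxBytes
}

func main() {
	links := make([]link, 2000)
	for i := range links {
		links[i] = link{Name: fmt.Sprintf("file-%04d", i), CID: make([]byte, 36)}
	}
	fmt.Println(shouldShardByCount(links, 1000))   // true
	fmt.Println(shouldShardBySize(links, 256<<10)) // false: ~2000*(9+36) bytes ≈ 88 KiB
}
```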

@BigLep
Contributor

BigLep commented May 3, 2021

@iand: We want to focus on "Make sure we don't shard too soon. At the moment, the cutoff to start sharding is very low." above.

@aschmahmann will provide pointers on where that code is.

@BigLep BigLep assigned BigLep and unassigned BigLep May 3, 2021
@aschmahmann
Contributor

Maybe option 1 is a little nicer in the longer run, but going with option 2 (<=1MiB -> Regular Directory, >1MiB -> Sharded Directory) is likely to save us some headaches by limiting the number of people using UnixFSv1 Sharded Directories to the people who actually need to use them and leaving everyone else alone.

There are still a few bugs around sharded directories (e.g. #8072), so given that we want to flip this on by default let's only expose the issues to the people who need to be exposed to them.


Note: If we did option 1, we'd either have to reject file/folder names over a certain size (a potentially breaking change) or still have to serialize to check the block size, because if our estimates are wrong then someone could create a block that's too big for Bitswap to deal with.

@Stebalien
Member Author

One unavoidable (I think) issue is that, given a sharded directory, "guess and check" to see if we should shrink to a non-sharded directory might require downloading a large portion of the directory. This could be... annoying.

However, other than that, guess and check isn't actually too bad, performance wise. We can implement the guess-and-check approach as follows:

  1. Enumerate the directory (sharded or not) summing filename/CID size (ignoring everything else).
  2. If we hit 1MiB (or 256KiB, whatever we pick), we use sharded directories.
  3. If we run through all links and don't hit this number, we try encoding it as a non-sharded node.
  4. If we go over the limit, shard.

Most directories will either be small or large, so this should be pretty performant.


There's also a third option: two limits:

  1. If name/cid bytes exceed some threshold (e.g., 256KiB), switch to sharded.
  2. If name/cid bytes are under the threshold, use non-sharded.

The non-sharded node could grow above 256KiB, but won't grow above 1MiB.
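
A sketch of that enumeration in Go (names and constants are illustrative; the real go-ipfs code differs), showing the early exit that keeps the check cheap for large directories:

```go
package main

import "fmt"

// link is a toy directory link: a name plus the raw bytes of a CID.
type link struct {
	Name string
	CID  []byte
}

const (
	switchThreshold = 256 << 10 // estimate cutoff: above this, use a sharded directory
	hardBlockLimit  = 1 << 20   // serialized blocks must stay under ~1MiB regardless
)

// shouldShard walks the links once, summing only name+CID bytes, and bails
// out as soon as the running total crosses the cutoff, so deciding for a huge
// (possibly already sharded) directory doesn't require visiting every entry.
func shouldShard(links []link) bool {
	sum := 0
	for _, l := range links {
		sum += len(l.Name) + len(l.CID)
		if sum > switchThreshold {
			return true // early exit: definitely big enough to shard
		}
	}
	// Under the cutoff: the flat node may serialize somewhat larger than
	// `sum` (protobuf framing etc.), but a 256KiB cutoff leaves ample
	// headroom below the ~1MiB hard limit (hardBlockLimit).
	return false
}

func main() {
	links := make([]link, 1800)
	for i := range links {
		links[i] = link{Name: fmt.Sprintf("%04d.png", i), CID: make([]byte, 36)}
	}
	fmt.Println(shouldShard(links)) // false: ~1800*(8+36) bytes ≈ 77 KiB
}
```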

@aschmahmann
Contributor

There was a brief sync discussion around the three options listed above. IIRC, here were the main points:

  1. Each of these options has some tradeoffs that aren't going to be optimal in every case, but we still need to pick something anyway to enable sharding by default.
  2. If we decide to change the "cutoff" parameter at a later date, that's ok because we'll always be able to read the data. That makes this particular decision not very high stakes. Changing the parameter will mean that inputs to functions like ipfs add could result in different CIDs, but that is already the case with any default we change to ipfs add. Additionally, people should not be relying on ipfs add <file> always returning the same results (especially if not every detail is specified). If people want to export + preserve DAG structure they should use something like .car files.
    • Editorial Note (i.e. my opinion): If we really felt the need to support something like ipfs-pack in the future, it could potentially be done by embedding the cutoff logic and supporting configuring that in go-ipfs, but this does not IMO seem like something we should be supporting/encouraging.
  3. We decided on option 3 since it was more accurate than option 1 but fairly cheap and lower cost than option 2.
    • We were deciding between 256KiB and 512KiB as we wanted the total blocksize to be strictly less than 1MiB
    • 256KiB seemed like plenty of room for large directories (e.g. the XKCD example in explore.ipld.io with 1800+ links is around 100KiB). It also means that reading a single link doesn't require reading quite as much data (at the tradeoff of an extra link resolution), and we have plenty of space for the estimate to be off and still be under 1MiB.
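
(For a back-of-the-envelope sense of scale, using the figures quoted above: if the ~1800-link XKCD directory serializes to roughly 100KiB, that's on the order of 55 bytes of name + CID per link, so a 256KiB cutoff corresponds to very roughly 4,500-4,700 links of similar shape, while still leaving about 4x headroom below the 1MiB limit.)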

@BigLep
Contributor

BigLep commented May 5, 2021

This is being done in #8106

@BigLep BigLep added the status/in-progress In progress label May 5, 2021
@ianopolous
Member

Commenting here at the request of @warpfork to discuss how we implemented this in Peergos.

We have a structure to publish capabilities which is semantically a map<path, cap>. This is stored in a champ that forms an inode-based file system (inspired by chats with @whyrusleeping). So the keys in this outer champ are inodes (an integer + path element), and the values are DirectoryInodes. A DirectoryInode can either be an inline list of child inode caps, or a nested champ of them. In our case, the logic we went with is a maximum number of children before switching to a champ, decided by the maximum size of an inode cap. Our caps are < 256 bytes and our path elements are < 256 bytes (to cover compatibility with all common file systems). So in our case this leads to a fixed max number of inlined children of max block size / 512.

In our case this is further constrained by the outer champ which could itself inline multiple directories into a single champ node. So we have to divide by the branching ratio of the outer champ.
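
To put rough numbers on that (the 1MiB block size and the branching ratio below are illustrative assumptions, not Peergos's actual constants): with a 1MiB maximum block size and each inlined child costing at most 512 bytes (a ≤256-byte cap plus a ≤256-byte path element), that gives 1MiB / 512B = 2048 inlined children; dividing by an outer-champ branching ratio of, say, 32 would leave 64 children per directory before switching to a nested champ.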

@obo20

obo20 commented Jul 19, 2021

While we wait for automatic sharding / unsharding to get figured out, @Stebalien @aschmahmann is there anything that would prevent an API / CLI parameter for something like this? That way users could simply tell us whether or not they wanted to shard things out on a per-upload basis and IPFS wouldn't need to do any additional lifting (or we could do it automatically if we notice the directory has a certain number of files in it before uploading to IPFS).

@aschmahmann
Contributor

Unfortunately, go-unixfs had sharding at a global level (as surfaced in the config https://github.com/ipfs/go-ipfs/blob/master/docs/experimental-features.md#directory-sharding--hamt).

This issue is currently slotted for go-ipfs v0.10.0 anyway, so while we could build a workaround in the meantime, it's not something we currently have a good reason to invest in.

@obo20

obo20 commented Jul 19, 2021

Ahh if this is coming out in 0.10.0, then we'll definitely just wait for that. Thanks for the info!

@aschmahmann
Contributor

This is in v0.11.0
