How to define whether a CID is retrievable? #9
Cross-posting @rvagg's comment from Slack (with his permission): Re fetchability of CIDs - there’s an interesting question here about what it means for the CID to be fetchable from a Filecoin SP. Does it mean that just the block for that CID is fetchable? Or does it mean the entire DAG under that block is fetchable? If you’re testing based on what the indexer claims, then a per-block fetch would be reasonable: ask for one block, get it, ✅. If you’re testing based on chain or some other deal data (the SP has provably made a deal with a particular root CID and you’re testing whether you can retrieve it), then it’s a bit more nuanced than this:
Wikipedia is always my favourite example when talking about this stuff - it ends up being over 300G worth of IPLD blocks. If you ask lassie to fetch Filecoin deals are a maximum of ~32G (or 64G for some SPs), so you can’t fit all of Wikipedia into a single deal. But you could spread it across many deals. But what are the chances that an SP has all of those deals? Quite likely anyone storing that much data is using many SPs to spread their data around. Perhaps they’ve given every SP they’re using the entirety of their data and they’re just duplicating, or perhaps they bundled it into a heap of CARs and shunted it off to web3.storage, Estuary, Spade, or something else to deal with the dealmaking and SP selection, and the pieces are scattered to the wind. An SP that claims to have stored a deal with bafybeiaysi4s6lnjev27ln5icwm6tueaw2vdykrtjkwiphwekaywqhcjze at the root may have ~32G of the start of the Wikipedia DAG, but they don’t have the rest. You might be able to retrieve ~32G from that SP with lassie starting at that root, but if you limit it to a single SP and they don’t have deals for all of the Wikipedia pieces, then it’s going to fail. What does that mean?
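To make the sizes above concrete, here is a rough back-of-the-envelope sketch. It assumes exactly 300G of data and ignores CAR and padding overhead, so the results are lower bounds only; `min_deals` is a hypothetical helper name, not part of any Filecoin tooling.

```python
import math

def min_deals(dag_size_gib: float, max_deal_gib: float) -> int:
    """Lower bound on the number of deals needed to hold a DAG of the given size.

    Assumption: data packs perfectly into deals, with no CAR or padding overhead.
    """
    if dag_size_gib <= 0 or max_deal_gib <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(dag_size_gib / max_deal_gib)

# Wikipedia at roughly 300G, with the two deal-size limits mentioned above.
deals_32 = min_deals(300, 32)
deals_64 = min_deals(300, 64)
```

Even in the best case the DAG spans several deals, so a retrieval limited to one SP only succeeds if that SP holds every one of them.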
We could also split up a CID (say 2M blocks) into 1k samples, where over the period of a week 1k Stations each try retrieving a small number of blocks (assuming we can seek to arbitrary offsets), and we get a probabilistic answer that is good enough.
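A minimal sketch of that sampling idea, under the same assumption that blocks can be addressed by offset (which may not hold for real retrieval protocols). The function names, the `blocks_per_sample` parameter, and the even spacing of samples are all illustrative choices, not an existing SPARK design.

```python
from typing import List, Tuple

def plan_samples(total_blocks: int, num_samples: int,
                 blocks_per_sample: int) -> List[Tuple[int, int]]:
    """Return (offset, length) ranges spread evenly over the block sequence.

    Each range is meant to be handed to one Station as its retrieval task.
    """
    if total_blocks <= 0 or num_samples <= 0 or blocks_per_sample <= 0:
        raise ValueError("all arguments must be positive")
    ranges = []
    for i in range(num_samples):
        # Integer arithmetic keeps offsets exact for large block counts.
        offset = i * total_blocks // num_samples
        length = min(blocks_per_sample, total_blocks - offset)
        ranges.append((offset, length))
    return ranges

def estimate_retrievability(outcomes: List[bool]) -> float:
    """Fraction of sampled ranges that were retrieved successfully."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)

# 2M blocks, 1k Stations, 10 blocks each (the 10 is an arbitrary example value).
plan = plan_samples(2_000_000, 1000, 10)
score = estimate_retrievability([True, True, False, True])
```

The resulting score is only an estimate: a sample that fails could mean the blocks are missing, or simply that they live with a different provider.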
The big missing piece from all of this - a TODO item from early Filecoin that never got solved - was being able to record an intent descriptor of some kind with stored data. We went live with: root CID + everything under it; which works for a lot of data, but not all, especially not big data because you blow the 32G maximum for deals. That world was a bit of a graphsync world, where parties agree on a root + a selector descriptor of how to get all the data, it's just that we were using Where we're really at is in an in-between place, where we have both worlds. The bitswap world works, but it's awesome for assembling large DAGs. Graphsync and now HTTP are in the mix, but they still almost exclusively operate around the notion that the single provider you're talking to has everything under that CID in order to be able to assemble it back together. Graphsync, I think, is a bit better, with affordances for missing pieces of the DAG (we don't have anything like that for the HTTP Trustless Gateway protocol we're using - you have it all or you fail). There's a big retrieval problem to be solved here, and we'll end up solving it, but it's not going to be straightforward: how do you reassemble a large DAG spread across many providers? Even with bitswap, which in theory is good at doing this, we've not built retrieval tools that can figure out that they need to go asking an entirely new set of providers once they exhaust the ones they're talking to for the DAG they're working on. But the good news is that this isn't a majority case (yet). Most data being stored is relatively small DAGs (<32G), so most "do you have this CID and the entire DAG under it" queries should be 👍. That's just not universally true, and over time it will probably erode.
We had a discussion with @rvagg where he pointed out that there are different ways to consider a CID retrievable.
The simplest (but also least useful?) option is to fetch only the root block of the CID.
For end users fetching data from IPFS (e.g. via Saturn), it's most helpful to know that the entire DAG rooted at the CID is retrievable. That may be expensive to verify, though, because the DAG can be spread across multiple storage providers.
For building the Reputation score of Storage Providers, we want to check that they are honouring the storage deals. In other words, we want to check that all blocks included in the deal are retrievable. These blocks do not necessarily have to form a single DAG.
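The three definitions above can give different answers for the same provider. The sketch below illustrates this with a toy in-memory model; the `Check` enum, the `is_retrievable` function, and the representation of a DAG as a dict of links are all hypothetical and stand in for real retrieval requests.

```python
from enum import Enum
from typing import Dict, List, Set

class Check(Enum):
    ROOT_BLOCK = "root-block"    # only the block for the root CID
    FULL_DAG = "full-dag"        # every block reachable from the root
    DEAL_BLOCKS = "deal-blocks"  # every block included in the deal

def is_retrievable(check: Check, root: str, links: Dict[str, List[str]],
                   provider_blocks: Set[str], deal_blocks: Set[str]) -> bool:
    """Evaluate one retrievability definition against a single provider."""
    if check is Check.ROOT_BLOCK:
        return root in provider_blocks
    if check is Check.FULL_DAG:
        seen: Set[str] = set()
        stack = [root]
        while stack:
            cid = stack.pop()
            if cid in seen:
                continue
            if cid not in provider_blocks:
                return False  # a missing block fails the whole DAG
            seen.add(cid)
            stack.extend(links.get(cid, []))
        return True
    if check is Check.DEAL_BLOCKS:
        # Deal blocks need not form a single DAG, so this is a plain subset test.
        return deal_blocks <= provider_blocks
    raise ValueError(f"unknown check: {check}")

# Toy DAG: the provider's deal covers root, a and b, but not c.
links = {"root": ["a", "b"], "a": [], "b": ["c"], "c": []}
provider = {"root", "a", "b"}
deal = {"root", "a", "b"}
results = {c: is_retrievable(c, "root", links, provider, deal) for c in Check}
```

In this toy case the provider passes the root-block and deal-blocks checks while failing the full-DAG check, which mirrors the Wikipedia example: the SP honours its deal even though the DAG under the root is incomplete.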
We should explore this topic, discuss the matter with other groups interested in SPARK (@willscott & Bedrock, Reputation WG), and decide which approach SPARK should implement to determine whether a CID can be retrieved.