Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support DAG-CBOR encoded entries #119

Open
bajtos opened this issue Feb 13, 2025 · 6 comments · May be fixed by #139
Open

Support DAG-CBOR encoded entries #119

bajtos opened this issue Feb 13, 2025 · 6 comments · May be fixed by #139
Assignees

Comments

@bajtos
Copy link
Member

bajtos commented Feb 13, 2025

Curio serialises advertised entries using DAG-CBOR encoding. (Boost/Venus use DAG-JSON.)

We need to improve piece-indexer to support this encoding too.

@bajtos
Copy link
Member Author

bajtos commented Feb 13, 2025

Example advertisement from Curio:
https://f03303347-market.duckdns.org/ipni-provider/12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5/ipni/v1/ad/baguqeerakqnjugtecgj5hhfmnh46pthk5k3vy6vf4bw3vpmwmudknlatun5a

Example Entries in DAG-CBOR format:
https://f03303347-market.duckdns.org/ipni-provider/12D3KooWJ91c6xQshrNe7QAXPFAaeRrHWq2UrgXGPf8UmMZMwyZ5/ipni/v1/ad/bafyreictdikh363qfxsmjp63i6kup6aukjqfpd4r6wbhbiz2ctuji4bofm

(See #117 for more details about this SP)

Decoded output as produced by @ipld/dag-cbor:

{
  Next: CID(bafyreibpxkmu65ezxy7rynxotbghfz3ktiapjisntepd67hghfn4hde3na),
  Entries: [
    Uint8Array(34) [
       18,  32, 255, 226, 210, 133, 130, 124,
      212, 223,  53, 226, 203, 109,   8, 146,
      171,   8,  53,  27, 224,  29, 209, 233,
      152, 171, 145, 144,  85, 106,  50, 204,
       18,  55
    ],
    Uint8Array(34) [
      18,  32, 255, 226, 211, 176, 137,  50,
      73,  29, 159, 178, 135,  87, 139, 171,
      88, 119,  99, 148, 117, 153, 227, 199,
      79, 239,  42,  10,  74,  96, 165, 160,
      11, 169
    ],
  // etc.

Code dealing with entries in piece-indexer:

let entriesChunk
try {
entriesChunk = await pRetry(
() =>
/** @type {Promise<{
Entries: { '/' : { bytes: string } }[]
}>} */(
fetchCid(providerAddress, entriesCid, { fetchTimeout })
),
{
shouldRetry: (err) =>
err && 'statusCode' in err && typeof err.statusCode === 'number' && err.statusCode >= 500
}
)
} catch (err) {
// We are not able to fetch the advertised entries. Skip this advertisement so that we can
// continue the ingestion of other advertisements.
const errorDescription = describeFetchError(err, providerAddress)
const log = /** @type {any} */(err)?.statusCode === 404 ? debug : console.warn
log(
'Cannot fetch ad %s entries %s: %s',
advertisementCid,
entriesCid,
errorDescription ?? err
)
return {
error: /** @type {const} */('ENTRIES_NOT_RETRIEVABLE'),
previousAdvertisementCid
}
}
debug('entriesChunk %s %j', entriesCid, entriesChunk.Entries.slice(0, 5))
const entryHash = entriesChunk.Entries[0]['/'].bytes
const payloadCid = CID.create(1, 0x55 /* raw */, multihash.decode(Buffer.from(entryHash, 'base64'))).toString()

Notes:

  • CIDs are self-describing, you can tell whether a CID contains DAG-CBOR or DAG-JSON based on the prefix - DAG-JSON CIDs start with bagu, DAG-CBOR CIDs start with bafy
  • After we parse DAG-CBOR, the entries already contain the byte array we need to construct the payload CID. (In DAG-JSON, an entry is represented as {"/": "base64-encoded-bytes"} - we need to convert it to a byte array ourselves.)
  • Example Entries in DAG-JSON format: http://yablufc.ddns.net:3104/ipni/v1/ad/baguqeeras6knuygaeifxgfij6fs5jskr2dafxiz7adf3gie7wcsxcuefkdxa

@bajtos
Copy link
Member Author

bajtos commented Feb 14, 2025

CIDs are self-describing, you can tell whether a CID contains DAG-CBOR or DAG-JSON based on the prefix - DAG-JSON CIDs start with bagu, DAG-CBOR CIDs start with bafy

Rather than relying on a string prefix, we should parse the CID string and look at the multicodec value.

import { CID } from 'multiformats/cid'

const cid = CID.parse('bagu...')
switch (cid.code) {
  // 
}

Is there any library we can use to parse payload based on CID's codec, so that we don't have to implement this ourselves?

@bajtos bajtos moved this to 📥 next in Space Meridian Feb 21, 2025
@bajtos
Copy link
Member Author

bajtos commented Feb 21, 2025

Inside fetchCid(), we assume the response is JSON and call res.json() to read the response body & parse the JSON.

return await res.json()

    return await res.json()

To support DAG-CBOR, we need to change this line to support other codecs.

A high-level proposal:

// at the top of the file
import { decode as decodeDagCbor } from '@ipld/dag-cbor'

// in `fetchCid()`
const codec = CID.parse(cid).code
switch (codec) {
  case 0x0129 /* DAG-JSON */: 
    return await res.json()

  case 0x71 /* DAG-CBOR */:
    const bytes = await res.bytes()
    return decodeDagCbor(bytes)
  
  default:
    throw new Error(`Unsupported CID codec ${codec}`)
}

References:

@NikolasHaimerl
Copy link
Contributor

Implementation Plan for DAG-CBOR Support in Piece Indexer

Context

The piece-indexer currently only processes DAG-JSON encoded entries (CIDs starting with "bagu"). However, some providers like Curio use DAG-CBOR encoding (CIDs starting with "bafy"). We need to support both formats to ensure complete indexing of all providers.

Key Format Differences

  • DAG-JSON format (codec 0x0129/297): Entry structure is { "/": { "bytes": "base64-encoded-multihash" } }
  • DAG-CBOR format (codec 0x71/113): Entry structure is direct Uint8Array objects containing the multihash bytes

Implementation Approach

1. Add a new processEntries function

This function will handle the encoding and decoding of both DAG-JSON and DAG-CBOR formats.

function processEntries(entriesCid, entriesChunk) {
  if (!entriesChunk.Entries || !entriesChunk.Entries.length) {
    throw new Error('No entries found in response')
  }
  
  const parsedCid = CID.parse(entriesCid)
  const codec = parsedCid.code
  let entryBytes
  
  switch (codec) {
    case 297: { // DAG-JSON (0x0129)
      const entry = entriesChunk.Entries[0]
      if (!entry || typeof entry !== 'object' || !('/' in entry)) {
        throw new Error('DAG-JSON entry must have a "/" property')
      }
      if (!entry['/'] || typeof entry['/'] !== 'object' || !('bytes' in entry['/'])) {
        throw new Error('DAG-JSON entry\'s "/" property must be a CID object with a bytes property')
      }
      
      entryBytes = Buffer.from(entry['/'].bytes, 'base64')
      break
    }
    case 113: // DAG-CBOR (0x71)
      entryBytes = entriesChunk.Entries[0]
      break
    default:
      throw new Error(`Unsupported codec ${codec}`)
  }
  
  assert(entryBytes, 'Entry bytes must be set')
  return CID.create(1, 0x55 /* raw */, multihash.decode(entryBytes)).toString()
}

2. Modify fetchCid function

We will need to modify the fetchCid function to handle the cbor encoding.

async function fetchCid(providerBaseUrl, cid, { fetchTimeout } = {}) {
  const url = new URL(cid, new URL('/ipni/v1/ad/_cid_placeholder_', providerBaseUrl))
  
  try {
    const res = await fetch(url, { signal: AbortSignal.timeout(fetchTimeout ?? 30_000) })
    await assertOkResponse(res)
    
    const parsedCid = CID.parse(cid)
    const codec = parsedCid.code
    
    if (codec === 113) { // DAG-CBOR (0x71)
      const buffer = await res.arrayBuffer()
      return cbor.decode(new Uint8Array(buffer))
    } else {
      // Default to JSON (includes DAG-JSON, codec 0x0129)
      return await res.json()
    }
  } catch (err) {
    if (err && typeof err === 'object') {
      Object.assign(err, { url })
    }
    throw err
  }
}

3. Update fetchAdvertisedPayload

// Replace existing entriesChunk processing with:
let payloadCid
try {
  payloadCid = processEntries(entriesCid, entriesChunk)
} catch (err) {
  debug('Error processing entries: %s', err)
  return {
    error: 'ENTRIES_NOT_RETRIEVABLE',
    previousAdvertisementCid
  }
}

return {
  previousAdvertisementCid,
  entry: { pieceCid, payloadCid }
}

The implementation includes new test suites for both functions:

  1. New processEntries Tests
    These tests verify:
  • Correct processing of DAG-JSON entries (codec 297)
  • Correct processing of DAG-CBOR entries (codec 113)
  • Error handling for missing or empty entries
  • Error handling for unsupported codecs
  • Proper CID creation from entry data
  • Handling of data from CBOR encoding/decoding
  • Graceful handling of malformed base64 in DAG-JSON
  • Graceful handling of invalid multihash in DAG-CBOR
  1. New fetchCid Tests
    These tests verify:
  • Correct parsing of DAG-JSON responses (using real CID with codec 297)
  • Correct parsing of DAG-CBOR responses (using real CID with codec 113)
  • Appropriate error handling for unknown codecs
  • Correct method selection (json() vs arrayBuffer()) based on codec
  • Error propagation with URL attachment
  1. Integration Testing
    Note: We cannot yet comprehensively test processNextAdvertisement with DAG-CBOR specifically because we need the changes from issue Extract PieceCID from ContextID #118 first, which will provide access to a Curio test provider.
    We will also need to change fetchCID to call the correct URL.
    The current implementation:
  const url = new URL(cid, new URL('/ipni/v1/ad/_cid_placeholder_', providerBaseUrl))

does not account for http-path. In that case the entire path of the URL is already included in providerBaseUrl, see #117
Once that's available, we can add tests that:

Verify end-to-end integration with a Curio provider
Test with mixed chains containing both DAG-JSON and DAG-CBOR entries
Confirm proper error handling for real-world DAG-CBOR data

Summary
This implementation enables the processing of both DAG-JSON and DAG-CBOR encoded entries with minimal changes to the existing codebase. The code detects the codec type from the CID (0x0129/297 for DAG-JSON, 0x71/113 for DAG-CBOR) and applies the appropriate processing logic. The new test suite provides comprehensive coverage for both formats, ensuring compatibility with all provider types while maintaining robust error handling.

@bajtos bajtos moved this from 📥 next to 📋 planned in Space Meridian Mar 3, 2025
@NikolasHaimerl NikolasHaimerl linked a pull request Mar 3, 2025 that will close this issue
@bajtos
Copy link
Member Author

bajtos commented Mar 3, 2025

The high-level plan looks good to me. 👍🏻

Re: Integration Testing is dependent on support for http-path. What plan do you propose to ensure we don't forget to implement the integration tests?

@NikolasHaimerl
Copy link
Contributor

NikolasHaimerl commented Mar 3, 2025

The high-level plan looks good to me. 👍🏻

Re: Integration Testing is dependent on support for http-path. What plan do you propose to ensure we don't forget to implement the integration tests?

I included the http-path implementation in the PR for this issue. Separating it into its own PR would be very tedious to test as we cannot yet decode the data that is returned even upon successfully calling the URL that stems from the http-path.

The integration test will be done once this PR is merged and a new PR for the second issue for the piece indexer is created. Only then can we test the entirety of the curio functionality in the piece-indexer.

To answer your question: The integration test which tests the function processNextAdvertisement will be tested in the PR for the issue which adds support for context IDs processing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: 📋 planned
Development

Successfully merging a pull request may close this issue.

2 participants