From 80268349d927b0781ccd4202de3cd5e15feac7f3 Mon Sep 17 00:00:00 2001
From: bumblefudge
Date: Mon, 5 Aug 2024 17:39:17 +0200
Subject: [PATCH 1/6] add Prior Art and Translation section, update deprecation FAQ entry

---
 README.md | 68 ++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 60 insertions(+), 8 deletions(-)

diff --git a/README.md b/README.md
index 25089c2..b11b92f 100644
--- a/README.md
+++ b/README.md
@@ -18,14 +18,24 @@ It is useful to write applications that future-proof their use of hashes, and al
 - [Format](#format)
 - [Implementations](#implementations)
 - [Table for Multihash](#table-for-multihash)
-  - [Other Tables](#other-tables)
+- [Prior Art And Translation](#prior-art-and-translation)
+  - [Named Information Hash](#named-information-hash)
+    - [Translation from multihash to named-information hash](#translation-from-multihash-to-named-information-hash)
+  - [Namespaced UUIDs](#namespaced-uuids)
 - [Notes](#notes)
   - [Multihash and randomness](#multihash-and-randomness)
   - [Insecure / obsolete hash functions](#insecure--obsolete-hash-functions)
   - [Non-cryptographic hash functions](#non-cryptographic-hash-functions)
 - [Visual Examples](#visual-examples)
-- [Maintainers](#maintainers)
+  - [Consider these 4 different hashes of same input](#consider-these-4-different-hashes-of-same-input)
+  - [Same length: 256 bits](#same-length-256-bits)
+  - [Different hash functions](#different-hash-functions)
+  - [Idea: self-describe the values to distinguish](#idea-self-describe-the-values-to-distinguish)
+  - [Multihash: fn code + length prefix](#multihash-fn-code--length-prefix)
+  - [Multihash: a pretty good multiformat](#multihash-a-pretty-good-multiformat)
+  - [Multihash: has a bunch of implementations already](#multihash-has-a-bunch-of-implementations-already)
 - [Contribute](#contribute)
+- [References](#references)
 - [License](#license)
 
 ## Example
@@ -126,18 +136,49 @@ Yes, but we already have to agree on functions, so this is not hard. The table e
 
 ## Table for Multihash
 
-We use a single [Multicodec](https://github.com/multiformats/multicodec) table across all of our multiformat projects. The shared namespace reduces the chances of accidentally interpreting a code in the wrong context. Multihash entries are identified with a `multihash` value in the `tag` column.
+We use a single [Multicodec][] table across all of our multiformat projects. The shared namespace reduces the chances of accidentally interpreting a code in the wrong context. Multihash entries are identified with a `multihash` value in the `tag` column.
 
 The current table lives [here](https://github.com/multiformats/multicodec/blob/master/table.csv)
 
-### Other Tables
+## Prior Art And Translation
 
-Cannot find a good standard on this. Found some _different_ IANA ones:
+In IETF's corpus of normative protocols, there are two partial overlaps worth knowing about to ensure a safe implementation:
 
-- https://www.iana.org/assignments/tls-parameters/tls-parameters.xhtml#tls-parameters-18
-- http://tools.ietf.org/html/rfc6920#section-9.4
+* "Named Information Hash", a.k.a. [RFC-6920](https://datatracker.ietf.org/doc/html/rfc6920), defines an heirarchical URI scheme for content-identifiers, partitioned by enumerated hash functions. The [NIH registry][] at IANA contains all of these.
+* UUIDv5, aka "Namespaced UUIDs", defined in [RFC-9562](https://datatracker.ietf.org/doc/html/rfc9562#uuidv5), does the inverse, defining a universal namespace for one hash function, partitioned by the application of that function to multiple URI schemes (i.e. DNS names, valid URLs, etc.)
+* The IANA [NIH registry][] has a similar shape and governance mode to the IANA [hashAlgorithm registry][] that TLS 1.2 implementations use to compactly signal supported hash+signature combinations. Since the former has different entries for some hash functions based on output length and the latter does not, the two registries are not alignable. However, given their different contexts, collisions between the two would not be a practical concern for users of either.
 
-They disagree. :(
+### Named Information Hash
+
+The "Named Information Hash" URI scheme allows for minimally self-describing hash strings to serve as content-identifiers for arbitrary binary inputs.
+This lightweight identifier scheme is defined in [RFC-6920](https://datatracker.ietf.org/doc/html/rfc6920) and the supported hash-context prefixes live in an IANA registry named ["https://www.iana.org/assignments/named-information/named-information.xhtml#hash-alg"](https://www.iana.org/assignments/named-information/named-information.xhtml#hash-alg).
+Its syntactic similarity to HTTP headers and [support for](https://datatracker.ietf.org/doc/html/rfc6920#section-3.1), MIME content-types makes it potentially useful for web use-cases, but use-cases are not constrained by URI scheme, only hinted at by the specification in sections 3 through 7.
+
+#### Translation from multihash to named-information hash
+
+Translating from a bare, binary multihash (i.e., a hash value in `unsigned_varint`, a.k.a. ULEB128 format) to a named-information hash in binary format is fairly easy to do insofar as a generic tag for self-describing multihashes was proposed to the [NIH registry][] by [Appendix B](https://www.ietf.org/archive/id/draft-multiformats-multihash-03.html#appendix-D.2) in the 2021 [multihash internet draft](https://www.ietf.org/archive/id/draft-multiformats-multihash-03.html):
+
+1. Strip the prefix bytes from the hash value and use the prefix bytes to identity the hash function used from the [Multicodec][] table
+2. If multihash prefix corresponds to any tags in the [NIH registry][]:
+    1. translate multicodec tag to NIH tag, i.e., if `0x12` (`sha2-256`) in `multicodec` registry, then `0x01` (`sha256`) in `named-information` registry
+    2. transcode the hash value from ULEB128 to standard MSB binary
+    3. (for binary form:) reattach new prefix to transcoded hash value
+    4. (for ASCII form:) convert prefix to URL format, i.e., `ni:///sha-256;` for `0x01`, and reattach to base64-encoded transcoded hash value
+3. If multihash prefix does NOT map cleanly to a registered value in [NIH registry][]:
+    1. (for binary form:) prefix existing binary multihash with `0x42` to designate that what follows is a multicodec prefix followed by an ULEB128 hash value.
+    2. (for ASCII form:) convert the `0x42` prefix to URL format, i.e., `ni:///mh;` and then append a base64url, no-padding encoding of the entire binary multihash with prefix (and _without_ adding the additional base-64-url-no-padding prefix, `u`, if using a [multibase][] library for this base-encoding).
+
+Note that raw multihashes (i.e. multihashes directly taken from hashing inputs) are not commonly used in IPFS implementations, since inputs are usually broken up into an intermediary form before being hashed.
+Only "single-block" CIDs, which are directly produced from inputs without file-system conversion, can be converted as described above; these are usually used for blobs below a certain size, typically using `raw` or `json` or other non-IPLD tags to mark their referents as only one-layer deep.
+To translate between CIDs that dereference to an IPLD graph or other recursive structure, you must first reconstruct the inputs and re-encode a new CID using `raw` codec and no chunking structure, indirection, recursion, or outer envelope.
+
+### Namespaced UUIDs
+
+Since the "Named Information Hash" URI scheme conforms to URL syntax (with or without an authority), each valid Named Information Hash URI can be assumed to be unique within the namespace of all valid URLs.
+As such, any `ni://` URL (with or without an authority) can be hashed and used as a [UUIDv5](https://datatracker.ietf.org/doc/html/rfc9562#uuidv5) in the URL namespace, i.e. `6ba7b811-9dad-11d1-80b4-00c04fd430c8` (See [section 6.6](https://datatracker.ietf.org/doc/html/rfc9562#namespaces)).
+
+Since this approach relies on SHA-1, and discards all but the most significant 128 bits of the hash output, its security may not be adequate for all applications, as noted in the specification.
+Alternative ways of using a bounded namespace could include a novel namespace registration for UUIDv5, or a UUIDv8 approach, to content-address arbitrary information with namespaced UUID variants.
 
 ## Notes
 
@@ -149,6 +190,9 @@ They disagree. :(
 
 **Obsolete and deprecated hash functions are included** in this list. [MD4](https://en.wikipedia.org/wiki/MD4), [MD5](https://en.wikipedia.org/wiki/MD5) and [SHA-1](https://en.wikipedia.org/wiki/SHA-1) should no longer be used for cryptographic purposes, but since many such hashes already exist they are included in this specification and may be implemented in multihash libraries.
 
+MD5 and SHA-1 were previously used in TLS and DTLS protocols version 1.2, as defined in [RFC5246](https://www.rfc-editor.org/rfc/rfc5246#section-1.2), but were later deprecated by [RFC9155](https://www.rfc-editor.org/rfc/rfc9155.html).
+MD4 seems to have gone out of favor even before TLS 1.2 was finalized at IETF, and was officially deprecated by [RFC-6150](https://www.rfc-editor.org/rfc/rfc6150).
+
 ### Non-cryptographic hash functions
 
 Multihash is intended for *"well-established cryptographic hash functions"* as **non-cryptographic hash functions are not suitable for content addressing systems**. However, there may be use-cases where it is desireable to identify non-cryptographic hash functions or their digests by use of a multihash. Non-cryptographic hash functions are identified in the [Multicodec table](https://github.com/multiformats/multicodec/blob/master/table.csv) with a tag `hash` value in the `tag` column.
@@ -195,6 +239,14 @@ Check out our [contributing document](https://github.com/multiformats/multiforma
 
 Small note: If editing the README, please conform to the [standard-readme](https://github.com/RichardLitt/standard-readme) specification.
 
+## References
+
+The [Prior Art and Translation](#prior-art-and-translation) section is heavily indebted to an earlier 2024 blog post, ["The Secret of NIMHs: Naming Things with Multihashes](https://bengo.is/blogging/the-secret-of-nimhs/), by github user @gobengo .
+
+[multicodec]: https://github.com/multiformats/multicodec
+[NIH registry]: https://www.iana.org/assignments/named-information/named-information.xhtml#hash-alg
+[hashAlgorith registry]: https://www.iana.org/assignments/tls-parameters/tls-parameters.xhtml#tls-parameters-18
+
 ## License
 
 This repository is only for documents. All of these are licensed under the [CC-BY-SA 3.0](https://ipfs.io/ipfs/QmVreNvKsQmQZ83T86cWSjPu2vR3yZHGPm5jnxFuunEB9u) license © 2016 Protocol Labs Inc. Any code is under a [MIT](LICENSE) © 2016 Protocol Labs Inc.

From 25352b588d20a55c3f68e8999bdc53836d6c63a0 Mon Sep 17 00:00:00 2001
From: Bumblefudge
Date: Mon, 12 Aug 2024 11:25:53 +0200
Subject: [PATCH 2/6] typo (thanks vmx)

Co-authored-by: Volker Mische
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index b11b92f..d807605 100644
--- a/README.md
+++ b/README.md
@@ -144,7 +144,7 @@ The current table lives [here](https://github.com/multiformats/multicodec/blob/m
 
 In IETF's corpus of normative protocols, there are two partial overlaps worth knowing about to ensure a safe implementation:
 
-* "Named Information Hash", a.k.a. [RFC-6920](https://datatracker.ietf.org/doc/html/rfc6920), defines an heirarchical URI scheme for content-identifiers, partitioned by enumerated hash functions. The [NIH registry][] at IANA contains all of these.
+* "Named Information Hash", a.k.a. [RFC-6920](https://datatracker.ietf.org/doc/html/rfc6920), defines an hierarchical URI scheme for content-identifiers, partitioned by enumerated hash functions. The [NIH registry][] at IANA contains all of these.
 * UUIDv5, aka "Namespaced UUIDs", defined in [RFC-9562](https://datatracker.ietf.org/doc/html/rfc9562#uuidv5), does the inverse, defining a universal namespace for one hash function, partitioned by the application of that function to multiple URI schemes (i.e. DNS names, valid URLs, etc.)
 * The IANA [NIH registry][] has a similar shape and governance mode to the IANA [hashAlgorithm registry][] that TLS 1.2 implementations use to compactly signal supported hash+signature combinations. Since the former has different entries for some hash functions based on output length and the latter does not, the two registries are not alignable. However, given their different contexts, collisions between the two would not be a practical concern for users of either.
 

From 155e4461317d6cc4fc2f51ac477c7dda79d74481 Mon Sep 17 00:00:00 2001
From: Bumblefudge
Date: Mon, 12 Aug 2024 11:26:56 +0200
Subject: [PATCH 3/6] typo (thanks rvagg)

Co-authored-by: Rod Vagg
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index d807605..eae9d65 100644
--- a/README.md
+++ b/README.md
@@ -152,7 +152,7 @@ In IETF's corpus of normative protocols, there are two partial overlaps worth kn
 
 The "Named Information Hash" URI scheme allows for minimally self-describing hash strings to serve as content-identifiers for arbitrary binary inputs.
 This lightweight identifier scheme is defined in [RFC-6920](https://datatracker.ietf.org/doc/html/rfc6920) and the supported hash-context prefixes live in an IANA registry named ["https://www.iana.org/assignments/named-information/named-information.xhtml#hash-alg"](https://www.iana.org/assignments/named-information/named-information.xhtml#hash-alg).
-Its syntactic similarity to HTTP headers and [support for](https://datatracker.ietf.org/doc/html/rfc6920#section-3.1), MIME content-types makes it potentially useful for web use-cases, but use-cases are not constrained by URI scheme, only hinted at by the specification in sections 3 through 7.
+Its syntactic similarity to HTTP headers and [support for MIME content-types](https://datatracker.ietf.org/doc/html/rfc6920#section-3.1) makes it potentially useful for web use-cases, but use-cases are not constrained by URI scheme, only hinted at by the specification in sections 3 through 7.
 
 #### Translation from multihash to named-information hash
 

From 4dde9c47e300aec760ba08032c2ed23839147d27 Mon Sep 17 00:00:00 2001
From: bumblefudge
Date: Mon, 12 Aug 2024 11:39:59 +0200
Subject: [PATCH 4/6] Specify unsigned_varint (not just ULEB128) throughout (thanks rvagg)

---
 README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index b11b92f..eaf29f3 100644
--- a/README.md
+++ b/README.md
@@ -156,12 +156,12 @@ Its syntactic similarity to HTTP headers and [support for](https://datatracker.i
 
 #### Translation from multihash to named-information hash
 
-Translating from a bare, binary multihash (i.e., a hash value in `unsigned_varint`, a.k.a. ULEB128 format) to a named-information hash in binary format is fairly easy to do insofar as a generic tag for self-describing multihashes was proposed to the [NIH registry][] by [Appendix B](https://www.ietf.org/archive/id/draft-multiformats-multihash-03.html#appendix-D.2) in the 2021 [multihash internet draft](https://www.ietf.org/archive/id/draft-multiformats-multihash-03.html):
+Translating from a bare, binary multihash (i.e., a hash value in [`unsigned_varint`](https://github.com/multiformats/unsigned-varint) format, i.e. a minimally-encoded ULEB128 under 64 bits in total length) to a named-information hash in binary format is fairly easy to do insofar as a generic tag for self-describing multihashes was proposed to the [NIH registry][] by [Appendix B](https://www.ietf.org/archive/id/draft-multiformats-multihash-03.html#appendix-D.2) in the 2021 [multihash internet draft](https://www.ietf.org/archive/id/draft-multiformats-multihash-03.html):
 
 1. Strip the prefix bytes from the hash value and use the prefix bytes to identity the hash function used from the [Multicodec][] table
 2. If multihash prefix corresponds to any tags in the [NIH registry][]:
     1. translate multicodec tag to NIH tag, i.e., if `0x12` (`sha2-256`) in `multicodec` registry, then `0x01` (`sha256`) in `named-information` registry
-    2. transcode the hash value from ULEB128 to standard MSB binary
+    2. transcode the hash value from [`unsigned varint`](https://github.com/multiformats/unsigned-varint) to standard MSB binary
     3. (for binary form:) reattach new prefix to transcoded hash value
     4. (for ASCII form:) convert prefix to URL format, i.e., `ni:///sha-256;` for `0x01`, and reattach to base64-encoded transcoded hash value
 3. If multihash prefix does NOT map cleanly to a registered value in [NIH registry][]:

From a0b23da567d931724bca2018bd6255314a6651db Mon Sep 17 00:00:00 2001
From: bumblefudge
Date: Wed, 14 Aug 2024 18:53:48 +0200
Subject: [PATCH 5/6] remove section about IPLD compatibility

---
 README.md | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/README.md b/README.md
index ab935d1..92c3f44 100644
--- a/README.md
+++ b/README.md
@@ -168,10 +168,6 @@ Translating from a bare, binary multihash (i.e., a hash value in [`unsigned_vari
     1. (for binary form:) prefix existing binary multihash with `0x42` to designate that what follows is a multicodec prefix followed by an ULEB128 hash value.
     2. (for ASCII form:) convert the `0x42` prefix to URL format, i.e., `ni:///mh;` and then append a base64url, no-padding encoding of the entire binary multihash with prefix (and _without_ adding the additional base-64-url-no-padding prefix, `u`, if using a [multibase][] library for this base-encoding).
 
-Note that raw multihashes (i.e. multihashes directly taken from hashing inputs) are not commonly used in IPFS implementations, since inputs are usually broken up into an intermediary form before being hashed.
-Only "single-block" CIDs, which are directly produced from inputs without file-system conversion, can be converted as described above; these are usually used for blobs below a certain size, typically using `raw` or `json` or other non-IPLD tags to mark their referents as only one-layer deep.
-To translate between CIDs that dereference to an IPLD graph or other recursive structure, you must first reconstruct the inputs and re-encode a new CID using `raw` codec and no chunking structure, indirection, recursion, or outer envelope.
-
 ### Namespaced UUIDs
 
 Since the "Named Information Hash" URI scheme conforms to URL syntax (with or without an authority), each valid Named Information Hash URI can be assumed to be unique within the namespace of all valid URLs.

From 51c9607717dd471c7a4e502846c63af640cbd28b Mon Sep 17 00:00:00 2001
From: bumblefudge
Date: Thu, 3 Oct 2024 17:29:10 +0200
Subject: [PATCH 6/6] typo

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 92c3f44..50f7680 100644
--- a/README.md
+++ b/README.md
@@ -241,7 +241,7 @@ The [Prior Art and Translation](#prior-art-and-translation) section is heavily i
 
 [multicodec]: https://github.com/multiformats/multicodec
 [NIH registry]: https://www.iana.org/assignments/named-information/named-information.xhtml#hash-alg
-[hashAlgorith registry]: https://www.iana.org/assignments/tls-parameters/tls-parameters.xhtml#tls-parameters-18
+[hashAlgorithm registry]: https://www.iana.org/assignments/tls-parameters/tls-parameters.xhtml#tls-parameters-18
 
 ## License
 