nip45: add hyperloglog relay response #1561

Open: wants to merge 4 commits into base: master

44 changes: 40 additions & 4 deletions 45.md
@@ -29,29 +29,65 @@ In case a relay uses probabilistic counts, it MAY indicate it in the response wi

Whenever the relay decides to refuse to fulfill the `COUNT` request, it MUST return a `CLOSED` message.

-## Examples
+## HyperLogLog

-### Followers count
+Relays may return a HyperLogLog value together with the count, hex-encoded.

```
-["COUNT", <subscription_id>, {"kinds": [3], "#p": [<pubkey>]}]
-["COUNT", <subscription_id>, {"count": 238}]
+["COUNT", <subscription_id>, {"count": <integer>, "hll": "<hex>"}]
```

This allows clients to merge results from multiple relays and obtain a reasonable estimate of reaction counts, comment counts and follower counts, while saving everybody many millions of bytes of bandwidth.

### Algorithm

This section describes the steps a relay should take in order to return HLL values to clients.

1. Upon receiving a filter, if it has a single `#e`, `#p`, `#a` or `#q` item, read its 32nd ASCII character as a nibble (a half-byte, a number between 0 and 15) and add `8` to it to obtain an `offset`. In the unlikely case that the filter doesn't meet these conditions, set `offset` to the number `16`;
2. Initialize 256 registers to `0` for the HLL value;
3. For each event that is to be counted according to the filter, do this:
   1. Read the byte at position `offset` of the event `pubkey`; its value will be the register index `ri`;
   2. Count the number of leading zero bits starting at position `offset+1` of the event `pubkey` and add `1`;
   3. Compare that number with the value stored at register `ri`; if the new number is bigger, store it.
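
The relay-side steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation. The function names (`hll_offset`, `leading_zero_bits`, `hll_registers`) are hypothetical. It assumes that "a single item" means exactly one of the four tag filters holding exactly one value, that the "32nd character" sits at zero-based index 31, that a non-hex character there falls back to `16`, and that positions in `pubkey` are byte positions in the 32-byte decoded key.

```python
TAG_KEYS = ("#e", "#p", "#a", "#q")


def hll_offset(filter_: dict) -> int:
    """Derive the byte offset from a COUNT filter (step 1)."""
    present = [key for key in TAG_KEYS if key in filter_]
    if len(present) == 1 and len(filter_[present[0]]) == 1:
        value = filter_[present[0]][0]
        if len(value) >= 32:
            try:
                # Assumption: "32nd character" is zero-based index 31.
                return int(value[31], 16) + 8
            except ValueError:
                pass  # Assumption: a non-hex character falls back to 16.
    return 16


def leading_zero_bits(data: bytes) -> int:
    """Count zero bits before the first set bit in `data`."""
    count = 0
    for byte in data:
        if byte == 0:
            count += 8
        else:
            count += 8 - byte.bit_length()
            break
    return count


def hll_registers(pubkeys_hex, offset: int) -> bytearray:
    """Build the 256 registers from the pubkeys of the counted events (steps 2-3)."""
    registers = bytearray(256)
    for pubkey_hex in pubkeys_hex:
        pubkey = bytes.fromhex(pubkey_hex)
        ri = pubkey[offset]  # register index
        rank = leading_zero_bits(pubkey[offset + 1:]) + 1
        if rank > registers[ri]:
            registers[ri] = rank
    return registers
```

Since `offset` ranges from 8 to 23 under these assumptions, at most 23 bytes follow it, so the stored value never exceeds 185 and fits in one byte.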
Comment on lines +44 to +51
@Semisol (Collaborator), Nov 13, 2024

Why all this complexity?

SipHash is a really fast and simple algorithm for use cases like this: collision resistant when the key is not known.

A new extension can be added to the COUNT filter that allows specifying a key for SipHash. Then, you could calculate SipHash(id, key) and use the bits for selecting the registers and counting.

This also means you could send different REQ filters to a relay and still be able to aggregate the results, or handle any other case where you need to aggregate across multiple filters.

Member Author

"Why all this complexity?"

"A new extension can be added..."

Collaborator

Except one is unnecessary (processing filters to find an index) and one is easy to implement and reduces inaccuracies at lower counts.

Contributor

The way I would do it, if the user provides some random data in the new or modified COUNT command, is to hash (or just xor) that data with the resultant event pubkey and start from bit 0 (no offset). BUT this eliminates the possibility of the relay caching the result.

Collaborator

I believe that relay caching is of pretty insignificant value, and adds complexity. Caching of certain "note clusters" derived from REQs in general (such as, all reactions/renotes to notes, all notes from who someone follows, etc) is a better strategy than caching COUNTs, as some clients will want all reactions either way.

There's also the likelihood that clients may want to query only reactions by their follows, for example, which further diminishes the benefit of caching COUNTs.


That is all that has to be done on the relay side, and therefore the only part needed for interoperability.

On the client side, these HLL values received from different relays can be merged (by simply going through all the registers in HLL values from each relay and picking the highest value for each register, regardless of the relay).
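
A minimal Python sketch of this merge step, assuming each relay returned its `hll` as the 512-character hex string described in this document (the function name `merge_hll` is hypothetical):

```python
def merge_hll(hll_hex_values) -> str:
    """Merge HLL values from several relays by taking the per-register maximum."""
    merged = bytearray(256)
    for hll_hex in hll_hex_values:
        registers = bytes.fromhex(hll_hex)
        if len(registers) != 256:
            raise ValueError("expected 256 registers (512 hex characters)")
        for index, value in enumerate(registers):
            if value > merged[index]:
                merged[index] = value
    return merged.hex()
```

Because the merge is a per-register maximum, the order in which relay responses arrive does not matter, and merging the same response twice changes nothing.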

And finally the absolute count can be estimated by running some methods I don't dare to describe here in English, it's better to check some implementation source code (also, there can be different ways of performing the estimation, with different quirks applied on top of the raw registers).
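
For orientation only, here is a sketch of one common estimator: the textbook HyperLogLog formula with the usual small-range (linear counting) correction. The text above deliberately leaves the estimation method open, so this is an assumption about one reasonable choice, not the method any particular Nostr implementation uses; the function name `estimate_count` is hypothetical.

```python
import math


def estimate_count(hll_hex: str) -> int:
    """Estimate the number of distinct pubkeys from a (possibly merged) HLL value."""
    registers = bytes.fromhex(hll_hex)
    m = len(registers)
    if m != 256:
        raise ValueError("expected 256 registers (512 hex characters)")

    # Bias-correction constant from the standard HyperLogLog formulation.
    alpha = 0.7213 / (1 + 1.079 / m)
    # Raw estimate: normalized harmonic mean of 2**register.
    raw = alpha * m * m / sum(2.0 ** -value for value in registers)

    # Small-range correction: fall back to linear counting while
    # some registers are still empty and the raw estimate is low.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros > 0:
        return round(m * math.log(m / zeros))
    return round(raw)
```

Other implementations may apply different corrections on top of the raw registers, so two clients can report slightly different numbers from the same `hll` value.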

### Attack vectors

One could mine a pubkey with a certain number of zero bits in the exact place where the HLL algorithm described above would look for them, in order to artificially make its reaction or follow "count more" than others. For this to work, a different pubkey would have to be created for each different target (event id, followed profile etc). This approach is not very different from creating tons of new pubkeys and using them all to send likes or follow someone in order to inflate their number of followers. The solution is the same in both cases: clients should not fetch these reaction counts from open relays that accept everything. Instead, they should base their counts on relays that perform some form of filtering that makes it more likely that only real humans are able to publish there, and not bots or artificially-generated pubkeys.

### `hll` encoding

The `hll` value must be the concatenation of the 256 registers, each being a uint8 value (i.e. a byte). Therefore `hll` will be a 512-character hex string.
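
As an illustration of this encoding, a small Python sketch with hypothetical helper names (`encode_hll`, `decode_hll`):

```python
def encode_hll(registers) -> str:
    """Concatenate 256 uint8 registers into a 512-character hex string."""
    raw = bytes(registers)  # raises ValueError if a value is outside 0..255
    if len(raw) != 256:
        raise ValueError("expected exactly 256 registers")
    return raw.hex()


def decode_hll(hll_hex: str) -> bytes:
    """Parse a 512-character hex string back into 256 register bytes."""
    if len(hll_hex) != 512:
        raise ValueError("expected a 512-character hex string")
    return bytes.fromhex(hll_hex)
```

Register `i` then occupies hex characters `2*i` and `2*i + 1` of the string.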
fiatjaf marked this conversation as resolved.

## Examples

### Count posts and reactions

```
["COUNT", <subscription_id>, {"kinds": [1, 7], "authors": [<pubkey>]}]
["COUNT", <subscription_id>, {"count": 5}]
```


### Count posts approximately

```
["COUNT", <subscription_id>, {"kinds": [1]}]
["COUNT", <subscription_id>, {"count": 93412452, "approximate": true}]
```

### Followers count with HyperLogLog

```
["COUNT", <subscription_id>, {"kinds": [3], "#p": [<pubkey>]}]
["COUNT", <subscription_id>, {"count": 16578, "hll": "0607070505060806050508060707070706090d080b0605090607070b07090606060b0705070709050807080805080407060906080707080507070805060509040a0b06060704060405070706080607050907070b08060808080b080607090a06060805060604070908050607060805050d05060906090809080807050e0705070507060907060606070708080b0807070708080706060609080705060604060409070a0808050a0506050b0810060a0908070709080b0a07050806060508060607080606080707050806080c0a0707070a080808050608080f070506070706070a0908090c080708080806090508060606090906060d07050708080405070708"}]
```

### Relay refuses to count

```