nip45: add hyperloglog relay response #1561
Conversation
Well, that's pretty mind-blowing. 🤯 🤔 It weirdly incentivizes PoW on the middle of the ID. The relay is responsible for preventing that. Existing measures (WoT) might be enough for some relays. But the default behavior is to trust not just the relay operator but also the users of that relay to not PoW the middle of the ID. |
Why shouldn't the relay just generate a random number every time instead of inspecting the event itself? Then it's not idempotent but it prevents abuse. |
You can't generate a random number because each distinct item must yield the same number, therefore you need a hash of the item. The event id is already that hash, but yes, people can do PoW to make their reaction count more or something like that. I considered letting each relay pick a different random offset of the id to count from, which would mitigate this, but in my tests that often overshoots the results by a lot when merging from multiple different sources that have used different offsets. One thing we can do is make all relays use the same offset for each query by making the offset be given by a hash of the subscription id. Another idea, maybe better, is to use a fixed offset like proposed here, but instead of using the event id, use the author pubkey. Although this makes it so some pubkeys will always have more weight than others when liking posts or following people, which is like creating a funny caste system within Nostr. |
OK, I think I got the solution: make the offset dependent on the filter, not the subscription id. Something like
Just checking for the first "#e", "#p", "#a" or "#q" tag (in this order) will cover 99% of use cases, and if there is no unambiguous way to determine the offset, use a predefined hardcoded value. |
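A minimal sketch of what that could look like (in Go, with a simplified `Filter` type; the exact character/modulo rule was still being iterated on at this point, so treat the details as illustrative):

```go
package main

import "fmt"

// Filter is a simplified stand-in for a Nostr filter: tag name -> values.
type Filter map[string][]string

// offsetFromFilter checks "#e", "#p", "#a" and "#q" in this order and, if one
// of them has a single value, derives a deterministic offset from a character
// in the middle of it; otherwise it falls back to a hardcoded value.
func offsetFromFilter(f Filter) int {
	for _, tag := range []string{"#e", "#p", "#a", "#q"} {
		vals, ok := f[tag]
		if !ok || len(vals) != 1 || len(vals[0]) < 33 {
			continue
		}
		// illustrative rule: one mid-string character taken modulo 24
		return int(vals[0][32]) % 24
	}
	return 16 // predefined hardcoded fallback
}

func main() {
	f := Filter{
		"kinds": {"7"},
		"#e":    {"21f1b4c1c06fca9e4ab2bd5b4204e5dd6ebf2c0f8cf66308bbd722e64de64e61"},
	}
	fmt.Println(offsetFromFilter(f))
}
```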
wow, nice! |
Very cool! But I do think this is highly gameable because Nostr is so open and the ID is not random at all. Even with the latest filter-based offsets, it's possible to run a PoW procedure on expected filters and significantly change the count for that filter. The procedure must use the same offset for all relays for the merging math to work. How about a sha256(id + subID) with the instruction that the subscription ID must be random and the same for all counting relays? |
If you're using the subid there is no need to hash anything, you can just assume it will be random enough. If it's not, that's the client's choice and their problem. But I would still like something deterministic to allow for the storageless counter relay. I think the latest procedure plus using the event `pubkey` should be enough. |
This is very cool. I'm not sure the mod 24 part is doing what you intended. The hex characters are lowercase so 'a'-'f' gives you 1-6 overlapping with what '1'-'6' gives you... which doesn't hurt but is less effective I think. |
Maybe something like this instead:
EDIT: oops, I guess you can't have an offset that far in, or there are no zeroes to count. Anyhow, you get the idea.
True. But if someone manages to get a pubkey with a hell of a lot of zeroes, many things that they react to will seem to be highly reacted to. I think one more step fixes this. Instead of counting zeroes in the pubkey directly, you count zeroes in the XOR of the pubkey and some hash that is generated from the filter query. EDIT: and in this case we no longer need an offset. The first 8 bits of the hash are the bucket, the next 248 bits could be a count of zeroes but we really probably shouldn't bother counting past the next 32 bits... or 64 if we are bold. The hash could be sha256() of the filter somehow made reproducable (e.g. not just JSON), maybe the '#e' tag contents, but maybe other parts of the filter matter? limit should probably be discarded. |
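A rough sketch of that XOR idea (in Go; assuming sha256 of the referenced '#e' value as the filter-derived hash, which is my assumption, and capping the zero count at 32 bits as suggested):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"math/bits"
)

// bucketAndCount XORs the 32-byte pubkey with a filter-derived hash, then uses
// the first byte of the result as the bucket and counts the zero bits that
// follow it (capped at 32 bits).
func bucketAndCount(pubkey [32]byte, filterKey string) (bucket uint8, zeroes int) {
	mask := sha256.Sum256([]byte(filterKey)) // assumption: hash of the "#e" value

	var x [32]byte
	for i := range x {
		x[i] = pubkey[i] ^ mask[i]
	}

	bucket = x[0]
	for i := 1; i <= 4; i++ { // bytes 1..4 = the next 32 bits
		lz := bits.LeadingZeros8(x[i])
		zeroes += lz
		if lz < 8 {
			break
		}
	}
	return bucket, zeroes
}

func main() {
	raw, _ := hex.DecodeString("3bf0c63fcb93463407af97a5e5ee64fa883d107ef9e558472c4eb9aaaefa459d")
	var pubkey [32]byte
	copy(pubkey[:], raw)
	fmt.Println(bucketAndCount(pubkey, "21f1b4c1c06fca9e4ab2bd5b4204e5dd6ebf2c0f8cf66308bbd722e64de64e61"))
}
```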
That's why I think the hash is needed (subid may not be random). It would get rid of any PoW made for the event ID and whatever comes in as subID creates enough of a variance that the whole ID changes. |
@vitorpamplona If we take a hash they will just PoW the hash. |
Take a look at HLL-RB (HyperLogLog with random buckets). I think relays may be able to use a random number... even for the same query twice. EDIT: I think I have been fooled by A.I. I can't find any such thing except as the output of AI. ;-( I was thinking we could use randomness, but that the harmonic mean might no longer be the right way to combine. |
We can get the offset from something like an XOR between the pubkey and some part of the filter (I'd rather not take the filter JSON verbatim because that would have been preprocessed already in most cases, and it would complicate implementations), then read the stuff from the event id based on that offset. |
I think it's impossible to create a deterministic identifier that can't be mined. |
Anonymous Bitcoiners would offer Nostr SEO packages where you zap them 5000 sats and they do proof of work on hyperloglog reactions they publish so you always have thousands of likes and reposts on everything. I suppose they can do that a lot already by just generating thousands of events. |
Yep, that's why I suggested hash(event id + subid). Sub id just needs to be random enough. |
We are talking about mining fresh pubkeys that would like your post. Relays can use their spam prevention logic to reject such fresh pubkeys. I think that adding randomness would not mess up the count (on a given relay) but it would make it impossible to combine counts between multiple relays (if you naively did, you would overshoot). |
Exactly, so we are not adding any new weaknesses. Relying on reaction counts from relays that will accept anything is already a very bad idea.
Agreed. |
@mikedilger I don't fully get this, but the mod 24 was in relation to the bytes, not to the hex characters. It was maximum 24 in order to leave room for 8 bytes at the end of the id/pubkey from which we would read. I guess we should also skip the first 8 characters since they're so often used for generic PoW; that leaves us 16 possible values for the `offset`. |
Ok then I think it is worded in a confusing way. It sounded like you were taking the hex characters as ascii, and doing a mod 24 on that. |
I couldn't get this to work when trying to translate the algorithm from https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf (see the top of page 140). The original 2007 algorithm uses the position of the leftmost 1-bit. Relays could each use their own different estimation algorithms, but the 256 counts will have to be normalized to either be a count of zeroes, or the position of the first 1 (a count of zeroes +1), in order to be interoperable. And it seems to me that a count of zeroes loses data. |
There are different implementations out there; I tried to do the simplest possible thing that would work and wrote it in the NIP. We will have to agree on something, we can't have each relay implementing it in a different way.
This is what my implementation does too, it counts zeroes and adds 1. |
This is just a restatement of my prior comments as edits
45.md (outdated):
This section describes the steps a relay should take in order to return HLL values to clients.
1. Upon receiving a filter, if it has a single `#e`, `#p`, `#a` or `#q` item, read its 32nd ASCII character as a byte and take its modulo over 24 to obtain an `offset` -- in the unlikely case that the filter doesn't meet these conditions, set `offset` to the number 16;
- Upon receiving a filter, if it has a `#e`, `#p`, `#a` or `#q` item with a single value (checked in that order in case there are multiple of these), then convert that value into 32 bytes (for the 'a' tag use the pubkey part only) and take the 16th byte modulo 24 as an `offset`. In the unlikely case that the filter doesn't meet these conditions, set `offset` to the number 16;
eh, for the 'a' tag I guess it doesn't matter, the kind number isn't too long so chars 32 and 33 are still inside the hex part.
So instead of fixing this I doubled down on my error and took the opportunity to exclude the first 8 bytes of all pubkeys. If you say this is stupid I'll do as you suggested.
Fine by me. The offset isn't that important. The original algorithm uses offset=0 on hash outputs.
Because NIP-45 COUNT counts the number of events, but hll counts the number of distinct pubkeys, I think perhaps the key should be "hll_pubkeys" instead of just "hll" to at least give a hint that it isn't counting the events that match the filter, and to leave room for an "hll" that does. |
I think in all cases so far what we actually want is distinct pubkeys, so we can be pragmatic about it. If we ever make a different thing for the number of distinct ids we can come up with a different name for that later. I would prefer a JSON array over an object with keys because then we wouldn't have these discussions. |
Can't it work for filters with |
What is the expected general relay behavior? Is the relay, upon receiving a reaction, supposed to immediately update cached HLL registers for the reacted-to event? Then it would do the same for kind:1 replies, kind:1111 comments, reposts, zaps and follow lists? Or should it calculate up to the 256 registers using stored events and start caching just after someone asks for a specific count? Or should it recalculate every time and maybe cache just recent ones? |
The straightforward solution is to just compute the registers at query time, from the events that match the filter. Caching is hard because the same count can be requested with many slightly different filters.
IMHO counting should already be very fast since it uses indexes and isn't copying data or allocating memory, especially if the data being counted is mmapped. |
This section describes the steps a relay should take in order to return HLL values to clients.
1. Upon receiving a filter, if it has a single `#e`, `#p`, `#a` or `#q` item, read its 32nd ASCII character as a nibble (a half-byte, a number between 0 and 15) and add `8` to it to obtain an `offset` -- in the unlikely case that the filter doesn't meet these conditions, set `offset` to the number `16`;
2. Initialize 256 registers to `0` for the HLL value;
3. For all the events that are to be counted according to the filter, do this:
   1. Read the byte at position `offset` of the event `pubkey`, its value will be the register index `ri`;
   2. Count the number of leading zero bits starting at position `offset+1` of the event `pubkey` and add `1`;
   3. Compare that with the value stored at register `ri`, if the new number is bigger, store it.
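A sketch of how a relay might implement steps 2 and 3 above (in Go; the `offset` is assumed to come from step 1 and the matched events' author keys are passed in as 32-byte arrays):

```go
package main

import (
	"fmt"
	"math/bits"
)

// computeRegisters implements steps 2 and 3: authors are the 32-byte pubkeys
// of the events matched by the filter, offset comes from step 1.
func computeRegisters(authors [][32]byte, offset int) [256]uint8 {
	var registers [256]uint8 // step 2: 256 registers initialized to 0

	for _, pk := range authors {
		ri := pk[offset] // step 3.1: byte at `offset` is the register index

		// step 3.2: leading zero bits starting at byte `offset+1`, plus 1
		count := uint8(1)
		for i := offset + 1; i < 32; i++ {
			lz := bits.LeadingZeros8(pk[i])
			count += uint8(lz)
			if lz < 8 {
				break
			}
		}

		// step 3.3: keep the biggest value seen for each register
		if count > registers[ri] {
			registers[ri] = count
		}
	}

	return registers
}

func main() {
	authors := [][32]byte{{0x01, 0x80}, {0x01, 0x20}, {0x02, 0xff}}
	regs := computeRegisters(authors, 0)
	fmt.Println(regs[1], regs[2]) // prints 3 1
}
```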
Why all this complexity?
SipHash is a really fast and simple algorithm for use cases like this: collision resistant when the key is not known.
A new extension can be added to the COUNT filter that allows specifying a key for SipHash. Then, you could calculate SipHash(id, key) and use the bits for selecting the registers and counting.
This also means you could send different REQ filters to a relay and still be able to aggregate, or other things that you may need to aggregate multiple filters for.
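A rough sketch of that keyed-hash idea (in Go, using HMAC-SHA256 from the standard library as a stand-in for SipHash just to show the register/zero-count derivation; the client-supplied key is hypothetical):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"math/bits"
)

// registerFor hashes the event id (or pubkey) with a client-supplied key and
// derives the register index and the zero-count-plus-one from the first 64
// bits of the digest.
func registerFor(key, id []byte) (ri uint8, count uint8) {
	mac := hmac.New(sha256.New, key) // stand-in for SipHash(id, key)
	mac.Write(id)
	h := binary.BigEndian.Uint64(mac.Sum(nil)[:8])

	ri = uint8(h >> 56)                          // top 8 bits pick the register
	count = uint8(bits.LeadingZeros64(h<<8)) + 1 // zeroes in the remaining bits, plus one
	return ri, count
}

func main() {
	fmt.Println(registerFor([]byte("random-key-chosen-by-the-client"), []byte("some-event-id")))
}
```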
"Why all this complexity?"
"A new extension can be added..."
Except one is unnecessary (processing filters to find an index) and one is easy to implement and reduces inaccuracies at lower counts.
The way I would do it, if the user provides some random data in the new or modified COUNT command, is to hash (or just xor) that data with the resultant event pubkey and start from bit 0 (no offset). BUT this eliminates the possibility of the relay caching the result.
I believe that relay caching is of pretty insignificant value, and adds complexity. Caching of certain "note clusters" derived from REQs in general (such as all reactions/renotes to notes, all notes from the people someone follows, etc.) is a better strategy than caching COUNTs, as some clients will want all reactions either way.
There's also the likelihood that clients may want to only query reactions by their follows, for example, which diminishes the benefit of caching COUNTs even more.
Caching is harder, but I think it can be done, especially if you want to store counts for a huge number of events using almost no disk space. Storing millions of reactions is incredibly wasteful. |
For a free public relay, we could assume event posting would simply be rate-limited by IP. Enabling HLL caching comes with the downside that a malicious user can game the count. The conclusion is that with 256 registers, someone controlling 256 IPs could instantaneously skyrocket any event counter (the cacheable counters, i.e. those that aren't filtered by user follows). With just one IP, in the timeframe in which a regular count could be gamed up by +256, with HLL it could be gamed up by millions. Hope I'm wrong. |
You aren't wrong. |
The usage of registers has nothing to do with IP addresses. Any user will be able to skyrocket the number if they can generate an infinite number of pubkeys and issue likes/follows/etc with all these pubkeys. If you can generate 3000 pubkeys you can make the follow count go up by ~3000. That is true today, will be true with this change, and will always be true. We have no way around it except relying on relays to reject these spammy freshly-created pubkeys (WoT, rate limits and so on).
Whether HLL values are cached on the relay side or counted at request time is irrelevant for the client, and the results will be exactly the same. |
Edit: got triggered by a cat avatar and wrote too much. Don't want to flood the PR discussion. |
In order to game this system, you have to produce close to 256 pubkeys, each with effective PoW of about 24 (to reach 8 million likes, for example), and these newly mined pubkeys would have to be accepted by the target relay and not seen as spam-pubkeys. IMHO this is not too hard to game. It just takes a little effort and ingenuity. But how severe is that? So what if it looks like millions of people upvoted you. Reactions are stupid anyways. It's not like they are forging your events or reading your DMs. And if clients look at the first results from the COUNT commands they can quite easily detect the HLL abuse. Still, this is quite easy to repair, as I've said, by having some kind of randomness (randomness supplied by the client with the COUNT is probably best, to make it consistent across relays without being known to the attacker gaming things). But again the downside is that relays can't cache things anymore (or they can, but probably won't get cache hits anymore) and they have to keep the original reaction events. But at least clients don't need those events anymore. I don't have a strong opinion either way, but I lean towards having the client-supplied random thing, because I have a penchant for reliability and trustability more so than efficiency. I think I've said all I can say about this so I'll stop commenting on it until or unless we start debating particulars after this choice is made. |
Why can't the relay count and ALSO run HLL to make sure the values match? Relays MUST guarantee that the HLL is within x% of the actual number and not return an HLL if it goes over. Relays can do whatever they want to do to fix the HLL estimation. This seems like a better protection than these shenanigans to try to block gaming. |
Here's a nice colorful video explanation of HyperLogLog: https://www.youtube.com/watch?v=lJYufx0bfpw
And here's a very interesting article with explanations, graphs and other stuff: http://antirez.com/news/75
If relays implement this we can finally get follower counts that do not suck, without having to use a single relay (i.e. relay.nostr.band) as the global source of truth for the entire network -- while at the same time saving the world by consuming an incomparably smaller fraction of the bandwidth.
Even if one were to download just 2 reaction events in order to display a silly reaction count number in a UI, that would already use more bytes than this HLL value does (actually, considering deflate compression, the COUNT response with the HLL value is already smaller than a single reaction EVENT response).
This requires trusting relays to not lie about the counts and the HLL values, but this NIP always required that anyway, so no change there.
HyperLogLog can be implemented in multiple ways, with different parameters and whatnot. Luckily most of the customizations (for example, the differences between HyperLogLog++ and HyperLogLog) can be applied at the final step, so they are a client choice. This NIP only describes the part that is needed for interoperability, which is how relays should compute the values and then return them to clients.
Because implementations would have to agree on parameters such as the number of registers to use, this NIP also fixes that number at 256 for simplicity's sake (it makes implementations simpler since a register index fits exactly in one byte) and also because it is a reasonable amount.
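On the client side, merging the registers returned by multiple relays and turning them into a number could look roughly like this (a sketch of the standard raw HLL estimate for m = 256 with the small-range correction from the original paper; further bias corrections remain a client choice as said above):

```go
package main

import (
	"fmt"
	"math"
)

// merge combines HLL registers returned by different relays for the same
// filter: the element-wise maximum corresponds to the union of the sets.
func merge(responses ...[256]uint8) (out [256]uint8) {
	for _, regs := range responses {
		for i, v := range regs {
			if v > out[i] {
				out[i] = v
			}
		}
	}
	return out
}

// estimate computes the raw HyperLogLog estimate for m = 256 registers.
func estimate(regs [256]uint8) float64 {
	m := 256.0
	alpha := 0.7213 / (1 + 1.079/m) // standard constant for m >= 128

	sum := 0.0
	zeros := 0
	for _, v := range regs {
		sum += math.Pow(2, -float64(v))
		if v == 0 {
			zeros++
		}
	}
	e := alpha * m * m / sum

	// small-range correction (linear counting) from the original paper
	if e <= 2.5*m && zeros > 0 {
		e = m * math.Log(m/float64(zeros))
	}
	return e
}

func main() {
	var a, b [256]uint8
	a[10], b[10], b[77] = 3, 5, 2
	fmt.Printf("%.0f\n", estimate(merge(a, b)))
}
```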
These are some random estimations I did, to showcase how efficient those 256 bytes can be:
As you can see they are almost perfect for small counts, but still pretty good for giant counts.