Allow querying for a partial number #263

Closed
MagiX13 opened this issue Jan 31, 2025 · 14 comments

@MagiX13

MagiX13 commented Jan 31, 2025

The feature

Hello,

I discovered SpamBlocker today and via that also Phoneblock. I wish I had discovered both at an earlier time to save me some sanity and time.

With the Phoneblock integration, I noticed that the full phone number is sent to the service. This has privacy implications for the caller (and potentially some associated GDPR issues).

Would it be possible to allow for a partial request and then filter locally (see, for example, the Have I Been Pwned password check implementation)?

I could imagine that passing {domestic:X} to only pass along the first X digits of the domestic number and then filtering for the full {domestic} in the following ParseQueryResult call would be a reasonable approach. That way, the full number would not be passed to the respective service - preserving the privacy of the caller - while still being able to block spammers.

I have also opened an issue on the PhoneBlock side to ask for querying with a prefix: haumacher/phoneblock#139

@aj3423
Owner

aj3423 commented Jan 31, 2025

Interesting, thank you for the suggestion.

... passing {domestic:X} to only pass along the first X digits of the domestic number ...

There will be a problem if we only use the number prefix. Spammers usually "get" their numbers in the same number range (same prefix). For example, if you search 123456***, the server will probably return 1000 numbers, which is slower and consumes more bandwidth (for both server and user).

From your link, they use the prefix of the SHA1 hash, and

hashes are fairly uniformly distributed

I think using a hash prefix would be better than a number prefix. And since phone numbers and passwords are similar in that both are about 8~12 characters long, that algorithm should also work for us.

I just came up with another solution: unlike passwords, in our particular case there are only digits, so we can use both prefix + suffix. For example, for the number 1234567890, we query 123*890. I don't think too many numbers will share the same prefix and suffix, and maybe this is easier for the server to implement. But there is a problem with short numbers.

So I think the best solution is hash-prefix.

To support this, I just need to add some new tags, something like: {k_anonymity({domestic})} or {sha1_prefix_5({domestic})}. And I'd expect the API to return something like:

{
  "hash1": { "votes": 10, "rating": "C_POLL", ... },
  "hash2": { "votes": 20, "rating": "A_LEGITIMATE", ... },
}

It can be parsed using JsonPath in the ParseQueryResult.
I'll add this when the upstream API is ready.
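
A minimal sketch of the hash-prefix idea as proposed at this point in the thread, assuming a response keyed by full SHA-1 hash like the JSON above. The helper names (`sha1_hex`, `query_prefix`, `filter_locally`), the 5-character prefix length, and the simulated response are illustrative assumptions, not SpamBlocker or PhoneBlock code:

```python
import hashlib
from typing import Optional


def sha1_hex(number: str) -> str:
    """Return the lowercase hex SHA-1 digest of a phone number string."""
    return hashlib.sha1(number.encode("utf-8")).hexdigest()


def query_prefix(number: str, length: int = 5) -> str:
    """The only value that would leave the device: the first `length` hex chars."""
    return sha1_hex(number)[:length]


def filter_locally(number: str, response: dict) -> Optional[dict]:
    """Pick the entry for our own full hash out of the server's candidate set."""
    return response.get(sha1_hex(number))


# Simulated server response (hypothetical shape, keyed by full hash).
number = "1234567890"
response = {
    sha1_hex(number): {"votes": 10, "rating": "C_POLL"},
    sha1_hex("5550001111"): {"votes": 20, "rating": "A_LEGITIMATE"},
}

print(query_prefix(number))              # 5 hex characters sent to the server
print(filter_locally(number, response))  # entry matched on the device
```

The server only ever sees the short prefix; the match against the full hash happens on the device.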

@haumacher

I also prefer hashes over prefixes; the DB load and bandwidth of a prefix search are not acceptable.

But from PhoneBlock's perspective, measuring SPAM number activity is essential. Therefore, I'll only offer an API for querying with full hashes. From those, it is sufficiently hard to guess the number, if it is not yet in the DB. But it is possible to measure activity if the number is already listed (which means that there is a SPAM suspicion against that number).
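
To illustrate the trade-off described here, the following is a rough sketch of a full-hash lookup that can still count activity for listed numbers. The `HashIndex` class and its fields are hypothetical and only model the idea; they are not PhoneBlock's implementation:

```python
import hashlib
from collections import Counter


def sha1_hex(number: str) -> str:
    """Return the lowercase hex SHA-1 digest of a phone number string."""
    return hashlib.sha1(number.encode("utf-8")).hexdigest()


class HashIndex:
    """Maps full hashes of listed numbers back to the numbers (hypothetical)."""

    def __init__(self, listed_numbers):
        self._by_hash = {sha1_hex(n): n for n in listed_numbers}
        self.activity = Counter()  # queries seen per listed number

    def check(self, full_hash: str) -> bool:
        """Return True if the hash belongs to a listed number, and count the hit."""
        number = self._by_hash.get(full_hash)
        if number is None:
            # Unlisted number: the server learns only an opaque hash.
            return False
        self.activity[number] += 1
        return True


index = HashIndex(["1234567890"])
print(index.check(sha1_hex("1234567890")))  # listed: hit is counted
print(index.check(sha1_hex("5550001111")))  # unlisted: nothing recorded
print(dict(index.activity))
```

With a prefix query the server could not tell which of the candidate numbers was actually called, which is why activity measurement needs the full hash.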

@aj3423
Owner

aj3423 commented Jan 31, 2025

But from PhoneBlock's perspective, measuring SPAM number activity is essential. Therefore, I'll only offer an API for querying with full hashes.

Got it, I agree.

it is sufficiently hard to guess the number, if it is not yet in the DB.

Maybe we can move from "sufficiently hard" to "almost impossible" by hashing more than once. I mean, they hash the password only once because they send only part of the hash. If we send the entire hash, we should hash it something like 100 times; many websites or hash databases can easily reverse single-round hashes, but not 100-round hashes.

@haumacher

Multiple hash rounds and algorithms requiring massive amounts of memory, e.g. Argon2, are a good choice when storing password hashes. But in my opinion this is overkill for phone numbers. PhoneBlock was designed to serve 50000 users from a Raspberry Pi. Even if it's no longer running on my desk, I'm not willing to waste resources for almost no benefit.

SHA1 is a good choice - resource efficient and does the job for providing some privacy assuming no maliciousness.

@aj3423
Owner

aj3423 commented Jan 31, 2025

Maybe 3 rounds would suffice: the difference between 3 and 100 rounds is negligible, but the difference between 1 and 3 is significant, and it won't impact performance.

Just discussing the possibilities here; I'm fine with single-round SHA1.
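
For reference, a minimal sketch of what multi-round hashing could look like. The function name `sha1_rounds` and the convention of feeding the raw digest bytes of one round into the next are assumptions made for illustration; the thread does not define them:

```python
import hashlib


def sha1_rounds(number: str, rounds: int = 1) -> str:
    """Apply SHA-1 `rounds` times and return the final digest as hex.

    Assumption: each round hashes the raw 20-byte digest of the previous
    round (not its hex string). Every client would have to agree on this.
    """
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    data = number.encode("utf-8")
    for _ in range(rounds):
        data = hashlib.sha1(data).digest()
    return data.hex()


print(sha1_rounds("1234567890", rounds=1))  # same as a plain SHA-1 hex digest
print(sha1_rounds("1234567890", rounds=3))  # not found in single-round tables
```

Each extra round multiplies the cost of a brute-force search by the same factor it costs the client, so a handful of rounds mainly defeats precomputed single-round lookup tables.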

@haumacher

I think there is no benefit with a small number of rounds, but it makes the required hash computation harder to describe and implement, since it is non-standard then. This hash computation must be implemented by all API users.

I've got a test version up and running: https://phoneblock.net/pb-test/api/

The new API is /check and there is also a /hash API for testing and debugging the hashing algorithm.

@aj3423
Owner

aj3423 commented Jan 31, 2025

I think there is no benefit with a small number of rounds, but it makes the required hash computation harder to describe and implement, since it is non-standard then. This hash computation must be implemented by all API users.

I agree. Actually, the + you added before the number is awesome; it reduces the chance of number recovery.

I've got a test version up and running: https://phoneblock.net/pb-test/api/

That was quick... wasn't expecting that, I'll test it tomorrow.

@MagiX13
Author

MagiX13 commented Jan 31, 2025

That was indeed super quick! Many thanks.

On the activity info: I somewhat understand and hope at some point the collection/project gets big enough that this will no longer be needed. In the end, some users will never be able to make use of such an approach (e.g. Fritzbox/ab) but those that rely on other tools could benefit.

On the DB load: The full blocklist is not that large, is it? I downloaded it from /api/blocklist today and it had ~20k values and is just around 1MB. This could easily be kept in memory and updated with all relevant incoming requests (e.g. new additions/votes/removals), so that the DB is only queried once on restart of the application and kept in sync with all further updates. Caching on each /api/blocklist call might also be an option. Even if the project grew significantly overnight, the memory footprint of the full blocklist wouldn't be that large or grow that fast.

If this is stored in memory, the query time for the /api/blocklist, the /api/check or even the prefixed call would be super quick as it wouldn't need to query the database itself... I played around on a Raspberry Pi 4 and got ~150k individual items/attempts per second when querying a sparsely populated python dict with 1M key-values. I think that sort of performance would suffice for quite some time.
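
A rough way to reproduce this kind of measurement, assuming randomly generated 10-digit numbers as stand-ins for real blocklist entries. The function names and sizes are illustrative, and the rate will vary with hardware:

```python
import random
import time


def build_blocklist(size: int, seed: int = 0) -> dict:
    """Build an in-memory blocklist of random 10-digit numbers (synthetic data)."""
    rng = random.Random(seed)
    return {
        str(rng.randrange(10**9, 10**10)): {"votes": rng.randint(1, 50)}
        for _ in range(size)
    }


def lookups_per_second(blocklist: dict, attempts: int, seed: int = 1):
    """Time `attempts` membership checks; return (rate, number of hits)."""
    rng = random.Random(seed)
    queries = [str(rng.randrange(10**9, 10**10)) for _ in range(attempts)]
    start = time.perf_counter()
    hits = sum(1 for q in queries if q in blocklist)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    return attempts / elapsed, hits


blocklist = build_blocklist(20_000)
rate, hits = lookups_per_second(blocklist, 100_000)
print(f"{rate:,.0f} lookups/s, {hits} hits")
```

Only the dictionary lookups are timed; generating the queries and any network or serialization overhead are left out, so real request throughput would be lower.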

On the bandwidth: you could require a minimum prefix length to be requested. I took a quick look and with 4 (phone number) digits around 25% of all requests would give just one result and for 5 digits this would be 40%. With hash-prefixes, requiring just the first three digits should have ~5 phone numbers on average, with 4 you would already have unique/no results on average.
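
The hash-prefix averages follow from simple arithmetic if the hashes are assumed to be uniformly distributed: a hex prefix of length k has 16^k possible values, so N listed numbers give N / 16^k matches per prefix on average. A small sketch using the ~20k blocklist size mentioned above (the function name is illustrative):

```python
def expected_matches(listed: int, prefix_len: int) -> float:
    """Average number of listed numbers per hex prefix, assuming uniform hashes."""
    return listed / 16 ** prefix_len


for k in (3, 4, 5):
    print(k, round(expected_matches(20_000, k), 2))
# 3 4.88
# 4 0.31
# 5 0.02
```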

@aj3423
Owner

aj3423 commented Feb 1, 2025

It's done in the action build: https://github.com/aj3423/SpamBlocker/actions/runs/13086130495
The PhoneBlock API preset now uses the SHA1 hash by default.

@haumacher The test API works fine; the action APK will work once the production API is ready.

with 4 (phone number) digits around 25% of all requests would give just one result and for 5 digits this would be 40%.

The average result for the number-prefix is pointless, they are not uniformly distributed, usually crowded with same prefix. Not sure if it's a DDoS vulnerability: one could "report" lots of fake numbers with the same prefix (with long comments), then run a massive query.

@MagiX13
Author

MagiX13 commented Feb 1, 2025

The average result for the number-prefix is pointless, they are not uniformly distributed, usually crowded with same prefix.

Hashes make whatever non-random distribution phone numbers have become basically uniformly distributed.
So let's take a look at the maximum number of results for different hash prefix lengths. Across the hash prefixes present in the database, the largest number of phone numbers sharing a 5-character hash prefix is 3; for 4 characters it is 5, and for 3 characters it is 14. Even the medians over the prefixes that actually occur in the database are just 1, 1 and 5 respectively (over all possible prefixes the longer lengths are very sparse, so the actual medians are probably 0, 0 and 5).
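
A sketch of how such bucket statistics could be computed, using 20,000 consecutive synthetic numbers as a stand-in for a crowded number range. The data and function names are assumptions; the figures above come from the real blocklist and will differ:

```python
import hashlib
from collections import Counter
from statistics import median


def bucket_sizes(numbers, prefix_len: int) -> Counter:
    """Count how many numbers share each SHA-1 hex prefix of the given length."""
    return Counter(
        hashlib.sha1(n.encode("utf-8")).hexdigest()[:prefix_len] for n in numbers
    )


def summarize(numbers, prefix_len: int):
    """Return (max, median) bucket size over the prefixes that actually occur."""
    sizes = list(bucket_sizes(numbers, prefix_len).values())
    return max(sizes), median(sizes)


# Synthetic worst case for number prefixes: one dense, consecutive range.
numbers = [str(1_000_000_000 + i) for i in range(20_000)]
for k in (5, 4, 3):
    print(k, summarize(numbers, k))
```

The median here is taken over non-empty buckets only, matching the "hashes that I get from the database" caveat; empty prefixes would pull it toward zero.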

I however understand that the activity information is more relevant for now and will shut up 😄

@aj3423
Owner

aj3423 commented Feb 1, 2025

Hashes make whatever non-random distribution phone numbers have become basically uniformly distributed.

It will also have the DDoS issue. Their password solution only allows querying; people can't submit new passwords. In our case, we allow reporting new numbers, so one can report lots of numbers that share the same hash prefix. Such numbers can be easily generated with a Python script:

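import hashlib

prefix = "abcde"  # target hex prefix of the SHA-1 digest
number = 1000000000
# Runs until interrupted, printing every number whose hash starts with the prefix.
while True:
    s = str(number)
    hash_value = hashlib.sha1(s.encode()).hexdigest()

    # Print before incrementing so the printed number matches its hash.
    if hash_value.startswith(prefix):
        print(number, hash_value)
    number += 1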

It generates 1 number per second with only 1 CPU core.

The full hash solution seems to be our best bet.

I however understand that the activity information is more relevant for now and will shut up 😄

I also forgot about that, I'll also shut up 😄

@haumacher

The hash-lookup change is live.

For the SpamBlocker-PhoneBlock integration, I've got another suggestion, to make things safer and easier:

API keys can now be generated from the PhoneBlock settings page. These API keys can be used for API calls as Bearer tokens. This means users no longer have to enter their PhoneBlock user name and password into other apps, and it avoids transmitting these credentials in HTTP basic auth requests.

For SpamBlocker, using an API key instead of the username/password combination also makes the setup process easier, since only a single piece of information must be copied from the website to the app. Please consider updating your setup helper to request an API key instead of a username/password combination.
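
For illustration, the difference between the two schemes at the HTTP header level, sketched with Python's standard library. The URL and key below are placeholders, not real PhoneBlock values, and no request is actually sent:

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """HTTP basic auth: credentials are only base64-encoded, not hidden."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return f"Basic {token.decode('ascii')}"


def bearer_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a request that carries the API key as a Bearer token."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})


# Placeholder URL and key, for illustration only.
req = bearer_request("https://example.invalid/api/check", "my-api-key")
print(req.get_header("Authorization"))  # Bearer my-api-key
print(basic_auth_header("u", "p"))      # Basic dTpw
```

With a Bearer token, the account password never has to be stored in the third-party app, and a key can be handled separately from the account credentials.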

@aj3423
Owner

aj3423 commented Feb 2, 2025

@haumacher Glad you've made that change, tomorrow I'll apply it to the PhoneBlock preset.

@aj3423
Owner

aj3423 commented Feb 3, 2025

@haumacher Done, now it uses API Key instead of username/password. https://github.com/aj3423/SpamBlocker/actions/runs/13112930951

aj3423 closed this as completed Feb 9, 2025