feat: Added simple, configurable rate limit for lightpush and store-query #2390

NagyZoltanPeter · 2024-02-01T14:44:32Z

Description

As a first step for rate and bandwidth caping of non relay protocol, hereby a very simple request rate / period limitation is added to light-push and store protocols.

Changes

with use of chronos TokenBucket type of rate limit check is added for lightpush
with use of chronos TokenBucket type of rate limit check is added for store query
global cli configuration is added to define request limit and the associated time period (same applied for both protocols) - defaults now set to disable rate limitation in wakunode
more tests from REST side
add metrics for total requests and rejected counts.
add new dashboard for wakunode grafana
add status code response for light push (might be better cover it in separate PR.
waku docs update on new config.

issue

#2032

github-actions · 2024-02-01T14:44:49Z

This PR may contain changes to configuration options of one of the apps.

If you are introducing a breaking change (i.e. the set of options in latest release would no longer be applicable) make sure the original option is preserved with a deprecation note for 2 following releases before it is actually removed.

Please also make sure the label release-notes is added to make sure any changes to the user interface are properly announced in changelog and release notes.

github-actions · 2024-02-01T14:49:16Z

You can find the image built from this PR at

quay.io/wakuorg/nwaku-pr:2390-rln-v2-false

Built from 18992b2

NagyZoltanPeter · 2024-02-02T15:30:00Z

Notice:
I issued a bug for nim-chronos as I found an opinionated bug in TokenBucket implementation that is used for rate limiting.
It prevented testing rate limit capping through REST API, due to the more natural timing than in other tests.
I think TokenBucket implementation is wrong with updating timestamp for calculating budget that allows to overrun the required budget.

cc: @chair28980 Even I will be able to implement the feature, it may cause some delays as we need to fix nim-chronos either by ourselves or by Jacek or Eugen or someone else.
I'm happy to create a PR for it as follows, but first I would like someone to review my bug report.

gabrielmer

Looks amazing! 🔥 🔥 🔥 Thanks so much!

Added some small questions :))

waku/waku_lightpush/common.nim

apps/wakunode2/external_config.nim

waku/common/ratelimit.nim

NagyZoltanPeter · 2024-04-10T09:32:10Z

waku-org/js-waku#1952 covers one issue caused by changing protocol interface in js-waku tests.

…uery Adjust lightpush rest response to rate limit, added tests ann some fixes Add rest store query test for rate limit checks and proper error response Update apps/wakunode2/external_config.nim Co-authored-by: gabrielmer <[email protected]> Fixing typos and PR comments Fix tests Move chronos/tokenbucket to nwaku codebasee with limited and fixed feature set Add meterics to lightpush rate limits Fixes for code and test + test reliability with timing

gabrielmer

Thanks so much! Added some small questions about points that I still don't understand completely

waku/common/tokenbucket.nim

waku/waku_store/protocol.nim

waku/waku_store/rpc.nim

gabrielmer

Amaaaazing! Thanks so much!

Ivansete-status

Masterpiece PR! Thanks for it! 💯
Just added few comments that I hope you find useful.
Something that we might need to consider in the future is to expose byte rate metrics.

For now, we expose all the libp2p in/out transmission metrics from:
https://github.com/vacp2p/nim-libp2p/blob/09b3e11956459ad4c364b353b7f1067f42267997/libp2p/stream/chronosstream.nim#L112
and
https://github.com/vacp2p/nim-libp2p/blob/09b3e11956459ad4c364b353b7f1067f42267997/libp2p/stream/chronosstream.nim#L133

It will be very interesting to know the percentage (and absolute values) of the bandwidth (bytes/sec) consumed by each protocol. Something to consider in upcoming PRs :P

Cheers

tests/waku_lightpush/test_ratelimit.nim

Ivansete-status · 2024-04-13T08:46:25Z

tests/waku_lightpush/test_ratelimit.nim

+    await sleepAsync(1000.millis)
+


Beautiful test!
nitpick: I wonder if we could make the test a bit faster, i.e. 500ms

Yes, it was in my mind. But found that the whole test is very sensitive to timing. The base of this problem that there is a quite large gap between the test case starts a timer and make assumption on it and when the exact timing will occur on the protocol an token bucket side.
This is normally not a problem as user of the protocol will not make assumptions normally on timing the request. This is mostly a testing problem.
But agree to avoid such sleeps as much as possible, but this rate limit tests are time sensitive ones.

tests/waku_store/test_wakunode_store.nim

tests/wakunode_rest/test_rest_store.nim

waku/common/tokenbucket.nim

Ivansete-status · 2024-04-13T09:10:37Z

waku/factory/external_config.nim

+    requestRateLimit* {.
+      desc:
+        "Number of requests to serve by each service in the specified period. Set it to 0 for unlimited",
+      defaultValue: 0,
+      name: "request-rate-limit"
+    .}: int
+
+    ## Currently default to switch of rate limit until become official
+    requestRatePeriod* {.
+      desc: "Period of request rate limitation in seconds. Set it to 0 for unlimited",
+      defaultValue: 0,
+      name: "request-rate-period"
+    .}: int64


Maybe you already commented and I might be missing something but these params seem not to be used :)

Seems washed out during rebase in relation with refactoring the node initialization.

waku/common/tokenbucket.nim

NagyZoltanPeter · 2024-04-15T07:13:30Z

Masterpiece PR! Thanks for it! 💯 Just added few comments that I hope you find useful. Something that we might need to consider in the future is to expose byte rate metrics.

For now, we expose all the libp2p in/out transmission metrics from: https://github.com/vacp2p/nim-libp2p/blob/09b3e11956459ad4c364b353b7f1067f42267997/libp2p/stream/chronosstream.nim#L112 and https://github.com/vacp2p/nim-libp2p/blob/09b3e11956459ad4c364b353b7f1067f42267997/libp2p/stream/chronosstream.nim#L133

It will be very interesting to know the percentage (and absolute values) of the bandwidth (bytes/sec) consumed by each protocol. Something to consider in upcoming PRs :P

Cheers

Yes, we have a separate issue tracking that: #1945

alrevuelta

Some comments:

We are missing some DoS attack vectors. For example this allows an attacker to DoS the node, since it will naively read from the connection and return. We should also count read bytes, not just requests made. And not only on valid requests but also when RPC decode is not successful. This can be used for inspiration. I would suggest something similar, where if a given peer has sent more than x bytes/unit of time, readLp is not called + we disconnect from that peer without reading from the connection. In short: This code allows an attacker to flood with random bytes the peer without any limits.

let buf = await conn.readLp(MaxRpcSize.int)

let decodeRes = HistoryRPC.decode(buf)
if decodeRes.isErr():
  error "failed to decode rpc", peerId = $conn.peerId
  waku_store_errors.inc(labelValues = [decodeRpcFailure])
  # TODO: Return (BAD_REQUEST, cause: "decode rpc failed")
  return

Unless I'm missing something, the rate limit is applied globally (eg to all peers), which means that a given peer can trigger this rate limit and won't allow other nodes to get any service from the node. imho the rate limit should be per peerId (or per IP address) so that we allow x amount of usage (bytes or requests) per peer instead of globally. Without this, we are still exposed to DoS attacks.
Can we split this PR into different ones? I see 3 clear different PRs here: i) introduce tocken bucket, ii) rate limit lightpush, iii) rate limit store.
Do we really need rate limits at the rest API level? afaik we can assume that the REST API is exposed to trusted parties so not sure if we need to rate limit there. imho we should focus on the libp2p protocol, which is the service exposed globally.

alrevuelta · 2024-04-15T08:34:05Z

waku/common/tokenbucket.nim

+
+import chronos
+
+## This is an extract from chronos/ratelimit.nim due to the found bug in the original implementation.


Mind pointing to the bug and elaborating more? Especially why it can't be fixed upstream.

In libp2p chronos' TockenBucket implementation is used but a bit differently than our expectation. It has a kind of time statistics replenish method.
It heavily depends on the usage and can ruin our usage. Unfortunately I could not - yet - fix it that may securely not affect libp2p usage. That fix needs a bit more wide testing for that I did not want to block this feature in nwaku.
I have an idea how to fix that and already agreed the aproach with Jacek... he encouraged us to take actions and co-work with other projects more easy.

NagyZoltanPeter · 2024-04-15T12:17:04Z

Some comments:

We are missing some DoS attack vectors. For example this allows an attacker to DoS the node, since it will naively read from the connection and return. We should also count read bytes, not just requests made. And not only on valid requests but also when RPC decode is not successful. This can be used for inspiration. I would suggest something similar, where if a given peer has sent more than x bytes/unit of time, readLp is not called + we disconnect from that peer without reading from the connection. In short: This code allows an attacker to flood with random bytes the peer without any limits.
let buf = await conn.readLp(MaxRpcSize.int)

let decodeRes = HistoryRPC.decode(buf)
if decodeRes.isErr():
  error "failed to decode rpc", peerId = $conn.peerId
  waku_store_errors.inc(labelValues = [decodeRpcFailure])
  # TODO: Return (BAD_REQUEST, cause: "decode rpc failed")
  return
Unless I'm missing something, the rate limit is applied globally (eg to all peers), which means that a given peer can trigger this rate limit and won't allow other nodes to get any service from the node. imho the rate limit should be per peerId (or per IP address) so that we allow x amount of usage (bytes or requests) per peer instead of globally. Without this, we are still exposed to DoS attacks.

Can we split this PR into different ones? I see 3 clear different PRs here: i) introduce tocken bucket, ii) rate limit lightpush, iii) rate limit store.

Do we really need rate limits at the rest API level? afaik we can assume that the REST API is exposed to trusted parties so not sure if we need to rate limit there. imho we should focus on the libp2p protocol, which is the service exposed globally.

Thanks for these comments, all really valuable!
The first item is a real catch. To explain a bit more detailed with this first PR we only intend to introduce a basic simple rate limit support, first for two non-relay protocols to be testable.
Other tasks will follow in different ways.

extend rate limitation to other non-relay protocols.
fine tune rate limiting with better scalable thresholds.
per peer tracking of request
bandwidth limit to be introduced (as you linked like libp2p has it) per shard.
- Tracking bw on each shard may allow a node to dynamically decide to support a shard or not based upon its bandwidth profiles.

Of course there are some open questions along this road as rate and bandwidth limits may contradict each other, at least in priority level.
Within all these tasks - as I see - we try to hit two target, giving control over resource usage for node operators and protect nodes from different kind of flood attacks.
If you would agree this PR should go as is for now as a initial step that will be followed by more fine tuned solution.
Your fist point is an absolute threat to be handled.
The item 4 is something to be considered, but in this current solution there is no different from where the request is coming as these non-relay protocol level rate limiting is on full node protocol level.
The only question is if we want to support to run a full node and yet let the owner to address rest calls (store/lightpush/...) directly without using a lightnode?

@alrevuelta : WDYT?

NagyZoltanPeter · 2024-04-15T13:25:27Z

Some comments:

We are missing some DoS attack vectors. For example this allows an attacker to DoS the node, since it will naively read from the connection and return. We should also count read bytes, not just requests made. And not only on valid requests but also when RPC decode is not successful. This can be used for inspiration. I would suggest something similar, where if a given peer has sent more than x bytes/unit of time, readLp is not called + we disconnect from that peer without reading from the connection. In short: This code allows an attacker to flood with random bytes the peer without any limits.
let buf = await conn.readLp(MaxRpcSize.int)

let decodeRes = HistoryRPC.decode(buf)
if decodeRes.isErr():
  error "failed to decode rpc", peerId = $conn.peerId
  waku_store_errors.inc(labelValues = [decodeRpcFailure])
  # TODO: Return (BAD_REQUEST, cause: "decode rpc failed")
  return
Unless I'm missing something, the rate limit is applied globally (eg to all peers), which means that a given peer can trigger this rate limit and won't allow other nodes to get any service from the node. imho the rate limit should be per peerId (or per IP address) so that we allow x amount of usage (bytes or requests) per peer instead of globally. Without this, we are still exposed to DoS attacks.

Can we split this PR into different ones? I see 3 clear different PRs here: i) introduce tocken bucket, ii) rate limit lightpush, iii) rate limit store.

Do we really need rate limits at the rest API level? afaik we can assume that the REST API is exposed to trusted parties so not sure if we need to rate limit there. imho we should focus on the libp2p protocol, which is the service exposed globally.
Thanks for these comments, all really valuable! The first item is a real catch. To explain a bit more detailed with this first PR we only intend to introduce a basic simple rate limit support, first for two non-relay protocols to be testable. Other tasks will follow in different ways.

extend rate limitation to other non-relay protocols.

fine tune rate limiting with better scalable thresholds.

per peer tracking of request

bandwidth limit to be introduced (as you linked like libp2p has it) per shard.

Tracking bw on each shard may allow a node to dynamically decide to support a shard or not based upon its bandwidth profiles.

Of course there are some open questions along this road as rate and bandwidth limits may contradict each other, at least in priority level. Within all these tasks - as I see - we try to hit two target, giving control over resource usage for node operators and protect nodes from different kind of flood attacks. If you would agree this PR should go as is for now as a initial step that will be followed by more fine tuned solution. Your fist point is an absolute threat to be handled. The item 4 is something to be considered, but in this current solution there is no different from where the request is coming as these non-relay protocol level rate limiting is on full node protocol level. The only question is if we want to support to run a full node and yet let the owner to address rest calls (store/lightpush/...) directly without using a lightnode?

@alrevuelta : WDYT?

Findings are now address in #2589 for future handle them

Agreed to track those findings in following PRs + track them in #2589

… store-query (#2390)" This reverts commit a00f350.

NagyZoltanPeter requested review from chaitanyaprem, gabrielmer and Ivansete-status February 1, 2024 14:44

NagyZoltanPeter self-assigned this Feb 1, 2024

gabrielmer reviewed Feb 5, 2024

View reviewed changes

waku/waku_lightpush/common.nim Outdated Show resolved Hide resolved

apps/wakunode2/external_config.nim Outdated Show resolved Hide resolved

waku/common/ratelimit.nim Outdated Show resolved Hide resolved

waku/common/ratelimit.nim Show resolved Hide resolved

NagyZoltanPeter force-pushed the feat-rate-limit-lightpush-store-phase1 branch 2 times, most recently from b96ab38 to 67339fd Compare March 25, 2024 09:05

NagyZoltanPeter force-pushed the feat-rate-limit-lightpush-store-phase1 branch from 76cf038 to 7dd66d0 Compare April 10, 2024 09:35

NagyZoltanPeter marked this pull request as ready for review April 10, 2024 09:36

NagyZoltanPeter requested a review from gabrielmer April 10, 2024 09:36

gabrielmer reviewed Apr 10, 2024

View reviewed changes

waku/common/tokenbucket.nim Show resolved Hide resolved

waku/waku_store/protocol.nim Outdated Show resolved Hide resolved

waku/waku_store/protocol.nim Show resolved Hide resolved

waku/waku_store/rpc.nim Show resolved Hide resolved

gabrielmer self-requested a review April 10, 2024 15:06

gabrielmer approved these changes Apr 10, 2024

View reviewed changes

Fix counter bucket naming

c0c1981

Ivansete-status approved these changes Apr 13, 2024

View reviewed changes

alrevuelta previously requested changes Apr 15, 2024

View reviewed changes

alrevuelta reviewed Apr 15, 2024

View reviewed changes

Fix PR findings

8e680f6

Merge branch 'master' into feat-rate-limit-lightpush-store-phase1

994251a

NagyZoltanPeter merged commit a00f350 into master Apr 15, 2024
13 checks passed

NagyZoltanPeter deleted the feat-rate-limit-lightpush-store-phase1 branch April 15, 2024 13:28

gabrielmer added a commit that referenced this pull request Apr 25, 2024

Revert "feat: Added simple, configurable rate limit for lightpush and…

10fea69

… store-query (#2390)" This reverts commit a00f350.

NagyZoltanPeter mentioned this pull request Apr 26, 2024

feat: set limit on store responses #1809

Closed

chair28980 mentioned this pull request Apr 27, 2024

feat: Enforce service specific rate limits #2032

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Added simple, configurable rate limit for lightpush and store-query #2390

feat: Added simple, configurable rate limit for lightpush and store-query #2390

NagyZoltanPeter commented Feb 1, 2024 •

edited

Loading

github-actions bot commented Feb 1, 2024

github-actions bot commented Feb 1, 2024 •

edited

Loading

NagyZoltanPeter commented Feb 2, 2024 •

edited

Loading

gabrielmer left a comment

NagyZoltanPeter commented Apr 10, 2024

gabrielmer left a comment

gabrielmer left a comment

Ivansete-status left a comment

Ivansete-status Apr 13, 2024

NagyZoltanPeter Apr 15, 2024

Ivansete-status Apr 13, 2024

NagyZoltanPeter Apr 15, 2024

NagyZoltanPeter commented Apr 15, 2024

alrevuelta left a comment

alrevuelta Apr 15, 2024

NagyZoltanPeter Apr 15, 2024

NagyZoltanPeter commented Apr 15, 2024 •

edited

Loading

NagyZoltanPeter commented Apr 15, 2024


		import chronos

		## This is an extract from chronos/ratelimit.nim due to the found bug in the original implementation.

feat: Added simple, configurable rate limit for lightpush and store-query #2390

feat: Added simple, configurable rate limit for lightpush and store-query #2390

Conversation

NagyZoltanPeter commented Feb 1, 2024 • edited Loading

Description

Changes

issue

github-actions bot commented Feb 1, 2024

github-actions bot commented Feb 1, 2024 • edited Loading

NagyZoltanPeter commented Feb 2, 2024 • edited Loading

gabrielmer left a comment

Choose a reason for hiding this comment

NagyZoltanPeter commented Apr 10, 2024

gabrielmer left a comment

Choose a reason for hiding this comment

gabrielmer left a comment

Choose a reason for hiding this comment

Ivansete-status left a comment

Choose a reason for hiding this comment

Ivansete-status Apr 13, 2024

Choose a reason for hiding this comment

NagyZoltanPeter Apr 15, 2024

Choose a reason for hiding this comment

Ivansete-status Apr 13, 2024

Choose a reason for hiding this comment

NagyZoltanPeter Apr 15, 2024

Choose a reason for hiding this comment

NagyZoltanPeter commented Apr 15, 2024

alrevuelta left a comment

Choose a reason for hiding this comment

alrevuelta Apr 15, 2024

Choose a reason for hiding this comment

NagyZoltanPeter Apr 15, 2024

Choose a reason for hiding this comment

NagyZoltanPeter commented Apr 15, 2024 • edited Loading

NagyZoltanPeter commented Apr 15, 2024

NagyZoltanPeter commented Feb 1, 2024 •

edited

Loading

github-actions bot commented Feb 1, 2024 •

edited

Loading

NagyZoltanPeter commented Feb 2, 2024 •

edited

Loading

NagyZoltanPeter commented Apr 15, 2024 •

edited

Loading