Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Added simple, configurable rate limit for lightpush and store-query #2390

Merged
merged 4 commits into from
Apr 15, 2024

Conversation

NagyZoltanPeter
Copy link
Contributor

@NagyZoltanPeter NagyZoltanPeter commented Feb 1, 2024

Description

As a first step for rate and bandwidth caping of non relay protocol, hereby a very simple request rate / period limitation is added to light-push and store protocols.

Changes

  • with use of chronos TokenBucket type of rate limit check is added for lightpush
  • with use of chronos TokenBucket type of rate limit check is added for store query
  • global cli configuration is added to define request limit and the associated time period (same applied for both protocols) - defaults now set to disable rate limitation in wakunode
  • more tests from REST side
  • add metrics for total requests and rejected counts.
  • add new dashboard for wakunode grafana
  • add status code response for light push (might be better cover it in separate PR.
  • waku docs update on new config.

issue

#2032

Copy link

github-actions bot commented Feb 1, 2024

This PR may contain changes to configuration options of one of the apps.

If you are introducing a breaking change (i.e. the set of options in latest release would no longer be applicable) make sure the original option is preserved with a deprecation note for 2 following releases before it is actually removed.

Please also make sure the label release-notes is added to make sure any changes to the user interface are properly announced in changelog and release notes.

Copy link

github-actions bot commented Feb 1, 2024

You can find the image built from this PR at

quay.io/wakuorg/nwaku-pr:2390-rln-v2-false

Built from 18992b2

@NagyZoltanPeter NagyZoltanPeter added release-notes Issue/PR needs to be evaluated for inclusion in release notes highlights or upgrade instructions E:1.3: Node bandwidth management mechanism See https://github.com/waku-org/pm/issues/66 for details E:3.2: Basic DoS protection in production See https://github.com/waku-org/pm/issues/70 for details and removed E:1.3: Node bandwidth management mechanism See https://github.com/waku-org/pm/issues/66 for details labels Feb 1, 2024
@NagyZoltanPeter
Copy link
Contributor Author

NagyZoltanPeter commented Feb 2, 2024

Notice:
I issued a bug for nim-chronos as I found an opinionated bug in TokenBucket implementation that is used for rate limiting.
It prevented testing rate limit capping through REST API, due to the more natural timing than in other tests.
I think TokenBucket implementation is wrong with updating timestamp for calculating budget that allows to overrun the required budget.

cc: @chair28980 Even I will be able to implement the feature, it may cause some delays as we need to fix nim-chronos either by ourselves or by Jacek or Eugen or someone else.
I'm happy to create a PR for it as follows, but first I would like someone to review my bug report.

Copy link
Contributor

@gabrielmer gabrielmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks amazing! 🔥 🔥 🔥 Thanks so much!

Added some small questions :))

waku/waku_lightpush/common.nim Outdated Show resolved Hide resolved
apps/wakunode2/external_config.nim Outdated Show resolved Hide resolved
waku/common/ratelimit.nim Outdated Show resolved Hide resolved
waku/common/ratelimit.nim Show resolved Hide resolved
@NagyZoltanPeter NagyZoltanPeter force-pushed the feat-rate-limit-lightpush-store-phase1 branch 2 times, most recently from b96ab38 to 67339fd Compare March 25, 2024 09:05
@NagyZoltanPeter
Copy link
Contributor Author

waku-org/js-waku#1952 covers one issue caused by changing protocol interface in js-waku tests.

…uery

Adjust lightpush rest response to rate limit, added tests ann some fixes

Add rest store query test for rate limit checks and proper error response

Update apps/wakunode2/external_config.nim

Co-authored-by: gabrielmer <[email protected]>

Fixing typos and PR comments

Fix tests

Move chronos/tokenbucket to nwaku codebasee with limited and fixed feature set

Add meterics to lightpush rate limits

Fixes for code and test + test reliability with timing
@NagyZoltanPeter NagyZoltanPeter force-pushed the feat-rate-limit-lightpush-store-phase1 branch from 76cf038 to 7dd66d0 Compare April 10, 2024 09:35
@NagyZoltanPeter NagyZoltanPeter marked this pull request as ready for review April 10, 2024 09:36
Copy link
Contributor

@gabrielmer gabrielmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much! Added some small questions about points that I still don't understand completely

waku/common/tokenbucket.nim Show resolved Hide resolved
waku/waku_store/protocol.nim Outdated Show resolved Hide resolved
waku/waku_store/protocol.nim Show resolved Hide resolved
waku/waku_store/rpc.nim Show resolved Hide resolved
@gabrielmer gabrielmer self-requested a review April 10, 2024 15:06
Copy link
Contributor

@gabrielmer gabrielmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amaaaazing! Thanks so much!

Copy link
Collaborator

@Ivansete-status Ivansete-status left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Masterpiece PR! Thanks for it! 💯
Just added few comments that I hope you find useful.
Something that we might need to consider in the future is to expose byte rate metrics.

For now, we expose all the libp2p in/out transmission metrics from:
https://github.com/vacp2p/nim-libp2p/blob/09b3e11956459ad4c364b353b7f1067f42267997/libp2p/stream/chronosstream.nim#L112
and
https://github.com/vacp2p/nim-libp2p/blob/09b3e11956459ad4c364b353b7f1067f42267997/libp2p/stream/chronosstream.nim#L133

It will be very interesting to know the percentage (and absolute values) of the bandwidth (bytes/sec) consumed by each protocol. Something to consider in upcoming PRs :P

Cheers

tests/waku_lightpush/test_ratelimit.nim Outdated Show resolved Hide resolved
Comment on lines 146 to 147
await sleepAsync(1000.millis)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful test!
nitpick: I wonder if we could make the test a bit faster, i.e. 500ms

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it was in my mind. But found that the whole test is very sensitive to timing. The base of this problem that there is a quite large gap between the test case starts a timer and make assumption on it and when the exact timing will occur on the protocol an token bucket side.
This is normally not a problem as user of the protocol will not make assumptions normally on timing the request. This is mostly a testing problem.
But agree to avoid such sleeps as much as possible, but this rate limit tests are time sensitive ones.

tests/waku_store/test_wakunode_store.nim Outdated Show resolved Hide resolved
tests/waku_store/test_wakunode_store.nim Outdated Show resolved Hide resolved
tests/wakunode_rest/test_rest_store.nim Outdated Show resolved Hide resolved
waku/common/tokenbucket.nim Outdated Show resolved Hide resolved
waku/common/tokenbucket.nim Show resolved Hide resolved
waku/common/tokenbucket.nim Show resolved Hide resolved
Comment on lines +594 to +606
requestRateLimit* {.
desc:
"Number of requests to serve by each service in the specified period. Set it to 0 for unlimited",
defaultValue: 0,
name: "request-rate-limit"
.}: int

## Currently default to switch of rate limit until become official
requestRatePeriod* {.
desc: "Period of request rate limitation in seconds. Set it to 0 for unlimited",
defaultValue: 0,
name: "request-rate-period"
.}: int64
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you already commented and I might be missing something but these params seem not to be used :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems washed out during rebase in relation with refactoring the node initialization.

waku/common/tokenbucket.nim Show resolved Hide resolved
@NagyZoltanPeter
Copy link
Contributor Author

Masterpiece PR! Thanks for it! 💯 Just added few comments that I hope you find useful. Something that we might need to consider in the future is to expose byte rate metrics.

For now, we expose all the libp2p in/out transmission metrics from: https://github.com/vacp2p/nim-libp2p/blob/09b3e11956459ad4c364b353b7f1067f42267997/libp2p/stream/chronosstream.nim#L112 and https://github.com/vacp2p/nim-libp2p/blob/09b3e11956459ad4c364b353b7f1067f42267997/libp2p/stream/chronosstream.nim#L133

It will be very interesting to know the percentage (and absolute values) of the bandwidth (bytes/sec) consumed by each protocol. Something to consider in upcoming PRs :P

Cheers

Yes, we have a separate issue tracking that: #1945

Copy link
Contributor

@alrevuelta alrevuelta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments:

  1. We are missing some DoS attack vectors. For example this allows an attacker to DoS the node, since it will naively read from the connection and return. We should also count read bytes, not just requests made. And not only on valid requests but also when RPC decode is not successful. This can be used for inspiration. I would suggest something similar, where if a given peer has sent more than x bytes/unit of time, readLp is not called + we disconnect from that peer without reading from the connection. In short: This code allows an attacker to flood with random bytes the peer without any limits.
let buf = await conn.readLp(MaxRpcSize.int)

let decodeRes = HistoryRPC.decode(buf)
if decodeRes.isErr():
  error "failed to decode rpc", peerId = $conn.peerId
  waku_store_errors.inc(labelValues = [decodeRpcFailure])
  # TODO: Return (BAD_REQUEST, cause: "decode rpc failed")
  return
  1. Unless I'm missing something, the rate limit is applied globally (eg to all peers), which means that a given peer can trigger this rate limit and won't allow other nodes to get any service from the node. imho the rate limit should be per peerId (or per IP address) so that we allow x amount of usage (bytes or requests) per peer instead of globally. Without this, we are still exposed to DoS attacks.

  2. Can we split this PR into different ones? I see 3 clear different PRs here: i) introduce tocken bucket, ii) rate limit lightpush, iii) rate limit store.

  3. Do we really need rate limits at the rest API level? afaik we can assume that the REST API is exposed to trusted parties so not sure if we need to rate limit there. imho we should focus on the libp2p protocol, which is the service exposed globally.


import chronos

## This is an extract from chronos/ratelimit.nim due to the found bug in the original implementation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind pointing to the bug and elaborating more? Especially why it can't be fixed upstream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In libp2p chronos' TockenBucket implementation is used but a bit differently than our expectation. It has a kind of time statistics replenish method.
It heavily depends on the usage and can ruin our usage. Unfortunately I could not - yet - fix it that may securely not affect libp2p usage. That fix needs a bit more wide testing for that I did not want to block this feature in nwaku.
I have an idea how to fix that and already agreed the aproach with Jacek... he encouraged us to take actions and co-work with other projects more easy.

@NagyZoltanPeter
Copy link
Contributor Author

NagyZoltanPeter commented Apr 15, 2024

Some comments:

  1. We are missing some DoS attack vectors. For example this allows an attacker to DoS the node, since it will naively read from the connection and return. We should also count read bytes, not just requests made. And not only on valid requests but also when RPC decode is not successful. This can be used for inspiration. I would suggest something similar, where if a given peer has sent more than x bytes/unit of time, readLp is not called + we disconnect from that peer without reading from the connection. In short: This code allows an attacker to flood with random bytes the peer without any limits.
let buf = await conn.readLp(MaxRpcSize.int)

let decodeRes = HistoryRPC.decode(buf)
if decodeRes.isErr():
  error "failed to decode rpc", peerId = $conn.peerId
  waku_store_errors.inc(labelValues = [decodeRpcFailure])
  # TODO: Return (BAD_REQUEST, cause: "decode rpc failed")
  return
  1. Unless I'm missing something, the rate limit is applied globally (eg to all peers), which means that a given peer can trigger this rate limit and won't allow other nodes to get any service from the node. imho the rate limit should be per peerId (or per IP address) so that we allow x amount of usage (bytes or requests) per peer instead of globally. Without this, we are still exposed to DoS attacks.
  2. Can we split this PR into different ones? I see 3 clear different PRs here: i) introduce tocken bucket, ii) rate limit lightpush, iii) rate limit store.
  3. Do we really need rate limits at the rest API level? afaik we can assume that the REST API is exposed to trusted parties so not sure if we need to rate limit there. imho we should focus on the libp2p protocol, which is the service exposed globally.

Thanks for these comments, all really valuable!
The first item is a real catch. To explain a bit more detailed with this first PR we only intend to introduce a basic simple rate limit support, first for two non-relay protocols to be testable.
Other tasks will follow in different ways.

  • extend rate limitation to other non-relay protocols.
  • fine tune rate limiting with better scalable thresholds.
  • per peer tracking of request
  • bandwidth limit to be introduced (as you linked like libp2p has it) per shard.
    • Tracking bw on each shard may allow a node to dynamically decide to support a shard or not based upon its bandwidth profiles.

Of course there are some open questions along this road as rate and bandwidth limits may contradict each other, at least in priority level.
Within all these tasks - as I see - we try to hit two target, giving control over resource usage for node operators and protect nodes from different kind of flood attacks.
If you would agree this PR should go as is for now as a initial step that will be followed by more fine tuned solution.
Your fist point is an absolute threat to be handled.
The item 4 is something to be considered, but in this current solution there is no different from where the request is coming as these non-relay protocol level rate limiting is on full node protocol level.
The only question is if we want to support to run a full node and yet let the owner to address rest calls (store/lightpush/...) directly without using a lightnode?

@alrevuelta : WDYT?

@NagyZoltanPeter
Copy link
Contributor Author

Some comments:

  1. We are missing some DoS attack vectors. For example this allows an attacker to DoS the node, since it will naively read from the connection and return. We should also count read bytes, not just requests made. And not only on valid requests but also when RPC decode is not successful. This can be used for inspiration. I would suggest something similar, where if a given peer has sent more than x bytes/unit of time, readLp is not called + we disconnect from that peer without reading from the connection. In short: This code allows an attacker to flood with random bytes the peer without any limits.
let buf = await conn.readLp(MaxRpcSize.int)

let decodeRes = HistoryRPC.decode(buf)
if decodeRes.isErr():
  error "failed to decode rpc", peerId = $conn.peerId
  waku_store_errors.inc(labelValues = [decodeRpcFailure])
  # TODO: Return (BAD_REQUEST, cause: "decode rpc failed")
  return
  1. Unless I'm missing something, the rate limit is applied globally (eg to all peers), which means that a given peer can trigger this rate limit and won't allow other nodes to get any service from the node. imho the rate limit should be per peerId (or per IP address) so that we allow x amount of usage (bytes or requests) per peer instead of globally. Without this, we are still exposed to DoS attacks.
  2. Can we split this PR into different ones? I see 3 clear different PRs here: i) introduce tocken bucket, ii) rate limit lightpush, iii) rate limit store.
  3. Do we really need rate limits at the rest API level? afaik we can assume that the REST API is exposed to trusted parties so not sure if we need to rate limit there. imho we should focus on the libp2p protocol, which is the service exposed globally.

Thanks for these comments, all really valuable! The first item is a real catch. To explain a bit more detailed with this first PR we only intend to introduce a basic simple rate limit support, first for two non-relay protocols to be testable. Other tasks will follow in different ways.

  • extend rate limitation to other non-relay protocols.

  • fine tune rate limiting with better scalable thresholds.

  • per peer tracking of request

  • bandwidth limit to be introduced (as you linked like libp2p has it) per shard.

    • Tracking bw on each shard may allow a node to dynamically decide to support a shard or not based upon its bandwidth profiles.

Of course there are some open questions along this road as rate and bandwidth limits may contradict each other, at least in priority level. Within all these tasks - as I see - we try to hit two target, giving control over resource usage for node operators and protect nodes from different kind of flood attacks. If you would agree this PR should go as is for now as a initial step that will be followed by more fine tuned solution. Your fist point is an absolute threat to be handled. The item 4 is something to be considered, but in this current solution there is no different from where the request is coming as these non-relay protocol level rate limiting is on full node protocol level. The only question is if we want to support to run a full node and yet let the owner to address rest calls (store/lightpush/...) directly without using a lightnode?

@alrevuelta : WDYT?

Findings are now address in #2589 for future handle them

@NagyZoltanPeter NagyZoltanPeter dismissed alrevuelta’s stale review April 15, 2024 13:26

Agreed to track those findings in following PRs + track them in #2589

@NagyZoltanPeter NagyZoltanPeter merged commit a00f350 into master Apr 15, 2024
13 checks passed
@NagyZoltanPeter NagyZoltanPeter deleted the feat-rate-limit-lightpush-store-phase1 branch April 15, 2024 13:28
gabrielmer added a commit that referenced this pull request Apr 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
E:3.2: Basic DoS protection in production See https://github.com/waku-org/pm/issues/70 for details release-notes Issue/PR needs to be evaluated for inclusion in release notes highlights or upgrade instructions
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

4 participants