Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NRG (2.11): Add ability to move cluster traffic into asset accounts #5466

Merged
merged 1 commit into from
Sep 13, 2024

Conversation

neilalexander
Copy link
Member

@neilalexander neilalexander commented May 23, 2024

This PR adds the ability to move NRG traffic out of the system account and into the asset accounts. Particularly in heavily multi-tenanted systems, this can help cases where head-of-line blocking may happen in the system account due to large amounts of replication traffic.

This needs to be enabled on a per-account basis on each participating server. Servers advertise their capability to support account NRG to each other. If a Raft group detects a peer coming up without support, they will revert back to using the system account automatically.

Requires nats-io/jwt#223 and then go.mod updating.

Signed-off-by: Neil Twigg [email protected]

Copy link
Contributor

@ripienaar ripienaar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also expose this in $SYS.REQ.SERVER.PING.VARZ? Be nice if nats server info could easily show this.

Also in JSZ.

Might also be worth to ask @mprimi to get into audit a check that all machines have this enabled and alert if not symetrical

server/opts.go Outdated Show resolved Hide resolved
@mprimi
Copy link
Contributor

mprimi commented May 23, 2024

@ripienaar what's a good way to track requested/suggested checks and other improvements for audit?
Should i open an issue in that project?

@neilalexander I'm not familiar with this feature or the subsystems around it, however just from your description i have a comment/concern:

enabled on all servers participating in the cluster.
You cannot know if you know all members of the cluster, there may be some unknown unknowns.

How about instead having the the cluster meta-leader is turning this on or off based on the same (ephemeral, possibly outdated) view?

@ripienaar
Copy link
Contributor

@mprimi you are welcome to open them here if you wish, add a label (or ask me) so you can find them easily

@ripienaar
Copy link
Contributor

IF a new server joins a cluster that is in this mode and actively all happy but that new node is not configured in this manner what happens?

Can we make that node disable JetStream with a error in the logs?

Is there any concerns with Leafnodes?

@derekcollison
Copy link
Member

I will try to take a look at this next week when I am back.. Will need to take my time.

Do we have a plan for visibility into $SYS via nats traffic when they is opted in?

@derekcollison
Copy link
Member

@neilalexander let's sync this week on this one and get it merged. Agree we should be able to place in any account.

@neilalexander neilalexander force-pushed the neil/accountnrg branch 6 times, most recently from 5a8c26e to 1f50821 Compare August 1, 2024 13:10
@neilalexander neilalexander changed the title Add ability to move NRG traffic into asset accounts NRG: Add ability to move cluster traffic into asset accounts Aug 1, 2024
@neilalexander neilalexander changed the title NRG: Add ability to move cluster traffic into asset accounts NRG (2.11): Add ability to move cluster traffic into asset accounts Aug 1, 2024
@neilalexander neilalexander force-pushed the neil/accountnrg branch 2 times, most recently from 3611289 to 76d570f Compare August 7, 2024 13:21
@neilalexander neilalexander force-pushed the neil/accountnrg branch 3 times, most recently from 81bbac3 to f4b966a Compare September 9, 2024 12:19
neilalexander added a commit to nats-io/jwt that referenced this pull request Sep 10, 2024
For supporting nats-io/nats-server#5466.

For now expected values would be:
* `"system"` - use the system account (default, as today)
* `"owner"` - use the same account that owns the asset
* `"account:ACCNAME"` - use a third specified `ACCNAME`

Signed-off-by: Neil Twigg <[email protected]>
@neilalexander neilalexander marked this pull request as ready for review September 10, 2024 14:00
@neilalexander neilalexander requested a review from a team as a code owner September 10, 2024 14:00
go.mod Outdated Show resolved Hide resolved
server/accounts.go Outdated Show resolved Hide resolved
server/opts.go Outdated Show resolved Hide resolved
server/server.go Show resolved Hide resolved
@wallyqs
Copy link
Member

wallyqs commented Sep 11, 2024

got a panic in one of the runs:

=== RUN   TestJetStreamClusterSnapshotBeforePurgeAndCatchup
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xe90e12]
goroutine 67601 [running]:
github.com/nats-io/nats-server/v2/server.(*raft).recreateInternalSubsLocked(0xc00bfff808)
	/home/travis/build/nats-io/nats-server/server/raft.go:553 +0x92
github.com/nats-io/nats-server/v2/server.(*raft).RecreateInternalSubs(0xc00bfff808)
	/home/travis/build/nats-io/nats-server/server/raft.go:548 +0x89
github.com/nats-io/nats-server/v2/server.(*Server).updateNRGAccountStatus(0xc001944008)
	/home/travis/build/nats-io/nats-server/server/events.go:1725 +0x2f5
github.com/nats-io/nats-server/v2/server.(*Server).remoteServerUpdate(0xc001944008, 0x0?, 0xc00d5a4138?, 0xc001b1c000?, {0xc00082de40?, 0x446685?}, {0xc009036868?, 0xc009036830?}, {0x0, 0x0, ...}, ...)
	/home/travis/build/nats-io/nats-server/server/events.go:1654 +0x736
github.com/nats-io/nats-server/v2/server.(*Server).internalReceiveLoop(0xc001944008, 0xc009036800)
	/home/travis/build/nats-io/nats-server/server/events.go:436 +0x464
created by github.com/nats-io/nats-server/v2/server.(*Server).setSystemAccount in goroutine 67583
	/home/travis/build/nats-io/nats-server/server/server.go:1785 +0x945
FAIL	github.com/nats-io/nats-server/v2/server	373.110s

@neilalexander
Copy link
Member Author

Panic fixed BTW.

Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are very close. Just a general question when we see a change, do we want to redo internal state of existing node or should we delete old node and create a new one with same name?

server/raft.go Show resolved Hide resolved
@neilalexander
Copy link
Member Author

For now have disabled being able to select a third account, so now limited to system and owner.

@neilalexander neilalexander force-pushed the neil/accountnrg branch 2 times, most recently from 2070574 to 3b9d3b2 Compare September 13, 2024 10:42
Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - one nit but not a blocker.

server/events.go Outdated Show resolved Hide resolved
This adds a new account NRG capability into statsz so that we can
detect when all servers in the cluster support moving traffic into
the asset account, instead of all being in the system account.

Signed-off-by: Neil Twigg <[email protected]>
Copy link
Member

@derekcollison derekcollison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekcollison derekcollison dismissed ripienaar’s stale review September 13, 2024 16:01

Can pick up as needed in separate PR.

@derekcollison derekcollison merged commit 1d8f32c into main Sep 13, 2024
5 checks passed
@derekcollison derekcollison deleted the neil/accountnrg branch September 13, 2024 16:01
case "owner":
a.js.nrgAccount = a.Name
default:
s.Errorf("Account claim for %q has invalid value %q for cluster traffic account", a.Name, ac.ClusterTraffic)
Copy link
Member

@aricart aricart Sep 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a PR for the validation of the JWT, if this is the case, the JWT can be validated after you decode. See nats-io/jwt#224

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants