stream mgmt: expose CANCEL_MOVE nicely #724

Open
philpennock opened this issue Mar 2, 2023 · 8 comments
@philpennock
Member

Request from you-know-who via me: give nats stream a subcommand, or an option thereof, to syntactically sugar cancelling a move:

nats req '$JS.API.ACCOUNT.STREAM.CANCEL_MOVE.${ACCT}.${STREAM}' ''

Tested and confirmed this works in both system and stream-owner accounts.
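
For anyone scripting this outside the CLI, a minimal sketch of the same request via nats.go; the URL, credentials, account and stream names below are placeholders, and the connection needs permission to call this API:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Placeholder URL; connect with system- or stream-owner-account
	// credentials that are authorized for the JS API.
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Subject layout from above:
	// $JS.API.ACCOUNT.STREAM.CANCEL_MOVE.<account>.<stream>
	// "ONE" and "ORDERS" are placeholder account and stream names.
	subj := "$JS.API.ACCOUNT.STREAM.CANCEL_MOVE.ONE.ORDERS"
	resp, err := nc.Request(subj, nil, 2*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(resp.Data)) // raw JSON API response
}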

@jarretlavallee

We have hit a use case for this in some instances. This would be nice to have.

@ripienaar
Collaborator

Question: how are you initiating moves? The CLI also does not allow initiating moves using the move API.

The reason these 2 APIs are not in the nats CLI is that they are just really badly designed and reuse data structures they should not; generally I just don't like them.

So I'm asking a bit more about how it's used so I can see what I can do.

@ripienaar
Collaborator

Ah, a stream update in the account can initiate the move. This API is really bad; I might need to look at getting some fixes into the server before I am happy to add it to the CLI. Things like: when the server isn't clustered it will just time out, etc.

@jarretlavallee

The common way that I have seen to move a stream is to update the placement tags.

nats stream edit testStream --tags cloud:was
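
For reference, a sketch of the same tag change through the nats.go JetStream management API; the stream name, subjects and other config values are placeholders and must match the existing stream definition, since UpdateStream takes the full config:

package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL) // placeholder URL/creds
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Per the thread, changing the placement (tags or cluster) on an
	// otherwise unchanged config is what triggers the move. All values
	// here are placeholders; keep the rest of the existing config.
	_, err = js.UpdateStream(&nats.StreamConfig{
		Name:      "testStream",
		Subjects:  []string{"test.>"},
		Replicas:  3,
		Storage:   nats.FileStorage,
		Placement: &nats.Placement{Tags: []string{"cloud:was"}},
	})
	if err != nil {
		log.Fatal(err)
	}
}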

@ripienaar
Collaborator

I would simply not use this feature tbh.

If you initiate a stream move via other means, like changing the placement cluster, and you then cancel the move, the stream config reflects the new cluster config but the stream gets put back in the original cluster, leaving the config inconsistent with reality.

So I am not adding this to the CLI till we revisit how this works, sorry. @jarretlavallee, could you please open an issue in the server?

$ nats s edit --cluster sfo LON_MIRROR -f
$ nats req '$JS.API.ACCOUNT.STREAM.CANCEL_MOVE.one.LON_MIRROR' ''
$ nats s info LON_MIRROR
Information for Stream LON_MIRROR created 2023-02-02 07:53:52

              Replicas: 3
               Storage: File
     Placement Cluster: sfo
....
Cluster Information:

                  Name: nyc
         Cluster Group: S-R3F-uw46mpks
                Leader: n3-nyc
               Replica: n1-nyc, current, seen 88ms ago
               Replica: n2-nyc, current, seen 88ms ago

I really would not advise using this :)

@jarretlavallee

That is fair. The only times I have had to use it were in response to stalled moves and other failures. It has been a recovery option in those scenarios. I understand that if it is in the CLI, it could be used operationally and cause the issues above.

@ripienaar
Collaborator

Some other observed problems with this: after initiating a move from lon to sfo and cancelling it, some sfo servers still had the stream on disk and recovered it at start:

Jan 30 13:54:12 n2-sfo nats-server[2500521]: [1] [INF]   Starting restore for stream 'ADM6CMOXUMFKRJTPGLFY5DGYUJNLQV5SGFZMXCMTWV3CKB6Z43GQ3L6C > BIG'
Jan 30 13:54:12 n2-sfo nats-server[2500521]: [1] [INF]   Restored 61,440 messages for stream 'ADM6CMOXUMFKRJTPGLFY5DGYUJNLQV5SGFZMXCMTWV3CKB6Z43GQ3L6C > BIG' in 1ms

But then appeared to delete it.

I also saw a lot of read loop warnings AFTER cancelling on the origin cluster, which suggests something was still happening there after the cancel:

Jan 30 13:50:35 n2-lon nats-server[2273043]: [1] [WRN] 157.245.38.117:56350 - cid:759 - Readloop processing time: 2.001583862s
Jan 30 13:50:37 n2-lon nats-server[2273043]: [1] [WRN] 157.245.38.117:56350 - cid:759 - Readloop processing time: 2.329004291s
Jan 30 13:50:41 n2-lon nats-server[2273043]: [1] [WRN] 157.245.38.117:56350 - cid:759 - Readloop processing time: 4.070602157s
Jan 30 13:50:47 n2-lon nats-server[2273043]: [1] [WRN] 157.245.38.117:56350 - cid:759 - Readloop processing time: 5.041428719s
Jan 30 13:50:50 n2-lon nats-server[2273043]: [1] [WRN] 157.245.38.117:56350 - cid:759 - Readloop processing time: 3.518233028s

@ripienaar
Collaborator

ripienaar commented Jan 30, 2025

The origin cluster logged a lot of this AFTER the cancel:

Jan 30 14:00:24 n2-lon nats-server[2273043]: [1] [WRN] Catchup for stream 'ADM6CMOXUMFKRJTPGLFY5DGYUJNLQV5SGFZMXCMTWV3CKB6Z43GQ3L6C > BIG' stalled
Jan 30 14:00:27 n2-lon nats-server[2273043]: [1] [WRN] Catchup for stream 'ADM6CMOXUMFKRJTPGLFY5DGYUJNLQV5SGFZMXCMTWV3CKB6Z43GQ3L6C > BIG' stalled
Jan 30 14:00:27 n2-lon nats-server[2273043]: [1] [WRN] Catchup for stream 'ADM6CMOXUMFKRJTPGLFY5DGYUJNLQV5SGFZMXCMTWV3CKB6Z43GQ3L6C > BIG' stalled

but

Cluster Information:

                    Name: lon
           Cluster Group: S-R3F-7ELRENun
                  Leader: n2-lon
                 Replica: n1-lon, current, seen 20ms ago
                 Replica: n3-lon, current, seen 19ms ago

Stream info showed the cluster as up to date, so I think internally the stream isn't fully aware the move is cancelled.
