noobaa-db-pg pod, when migrated, doesn't allow new users or new buckets to be created #6869
Comments
During a live debug session with Alex, the following work-around steps were used to get out of the situation where new buckets or new users could not be created on the system:
@nimrod-becker, would like someone to investigate, as the restart of the noobaa-core pod resulted in a working environment. This is a priority defect IMO.
thank you!
@baum
@rkomandu can you please edit the issue description to better follow the issue template?
@dannyzaken, this defect was moved from the previous one, as you can see in #6853. Will try to see if I can update it again.
@rkomandu, this issue cannot be reproduced on our side: in the dev env., using a kind cluster, DB failover could not be tested due to kind storage system limitations. DB failover is tested in RH QE, though, with CEPH. To troubleshoot this issue and rule out network issues, I will provide you with a 'toolbox' image that performs RPC using the cluster-internal management address. The idea is to reproduce this issue and see whether NooBaa RPC, such as the list accounts API, works through the toolbox after noobaa-db-pg pod migration.
@baum, I am not sure what network issue you are referring to. This is related to the noobaa-core restart: as we saw live on the system, things started working, with Accounts responding and new users and mb create succeeding.
noobaa-db had come into a working state; however, the noobaa api calls did not respond. So we restarted noobaa-core after looking at it from different angles, and then the system started working.
@rkomandu, could you try the following procedure once noobaa-db comes into a working state but the noobaa RPC API calls do not respond? Please see the attached archive; it includes a simple RPC test calling account.list_accounts() using the internal cluster address of noobaa-core. The archive includes
Sample run:

```
➜ oc create -f toolbox.yaml
pod/toolbox created
➜ oc exec -ti toolbox -- bash
bash-4.4$ cd /root/node_modules/noobaa-core/src/test/rpc/
bash-4.4$ node rpc.js
load_config_local: NO LOCAL CONFIG
OpenSSL 1.1.1l 24 Aug 2021 setting up
init_rand_seed: starting ...
read_rand_seed: opening /dev/random ...
Feb-7 13:02:21.185 [/15] [LOG] CONSOLE:: auth params { email: '[email protected]', password: 'LAHsRAwCRRDIrlgC4q3f0w==', system: 'noobaa' }
Feb-7 13:02:21.187 [/15] [LOG] CONSOLE:: rpc url http://10.98.31.25:8080
(node:15) Warning: Accessing non-existent property 'RpcError' of module exports inside circular dependency
(Use `node --trace-warnings ...` to show where the warning was created)
Feb-7 13:02:21.196 [/15] [LOG] CONSOLE:: read_rand_seed: reading 32 bytes from /dev/random ...
(node:15) [DEP0066] DeprecationWarning: OutgoingMessage.prototype._headers is deprecated
Feb-7 13:02:21.208 [/15] [LOG] CONSOLE:: read_rand_seed: got 32 bytes from /dev/random, total 32 ...
Feb-7 13:02:21.208 [/15] [LOG] CONSOLE:: read_rand_seed: closing fd ...
Feb-7 13:02:21.209 [/15] [LOG] CONSOLE:: init_rand_seed: seeding with 32 bytes
rand_seed: OpenSSL 1.1.1l 24 Aug 2021 seeding randomness
Feb-7 13:02:21.210 [/15] [LOG] CONSOLE:: init_rand_seed: done
Feb-7 13:02:21.825 [/15] [LOG] CONSOLE::
Feb-7 13:02:21.922 [/15] [LOG] CONSOLE:: accounts {
accounts: [ { name: SENSITIVE-b8e720282050fed7, email: SENSITIVE-d4bc82999e444a8c, access_keys: [ { access_key: SENSITIVE-3bcd74c0d5fb9444, secret_key: SENSITIVE-6fce8fe68278e012 } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'system-internal-storage-pool-61f2797668568e002a078531', can_create_buckets: true, systems: [ { name: 'noobaa', roles: [ 'operator' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'DARK' } }, { name: SENSITIVE-a2dee06b3ad853df, email: SENSITIVE-a2dee06b3ad853df, access_keys: [ { access_key: SENSITIVE-8eb41064e84395a9, secret_key: SENSITIVE-57fac7497d77e859 } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: false, permission_list: [ SENSITIVE-6f3b29f1f7bb9970 ] }, default_resource: 'noobaa-default-backing-store', can_create_buckets: false, bucket_claim_owner: SENSITIVE-6f3b29f1f7bb9970, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'DARK' } }, { name: SENSITIVE-b8e720282050fed7, email: SENSITIVE-9cf0fd89409efef8, access_keys: [ { access_key: SENSITIVE-72c9bd3c1b81e8e0, secret_key: SENSITIVE-593f1fadf1c863ee } ], has_login: true, has_s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'noobaa-default-backing-store', can_create_buckets: true, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'DARK' } }, { name: SENSITIVE-fdc12e472642f831, email: SENSITIVE-fdc12e472642f831, access_keys: [ { access_key: SENSITIVE-79f98fc256854ccf, secret_key: SENSITIVE-c729e136639d9471 } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: false, permission_list: [ SENSITIVE-7338c355a2d6ab1b ] }, default_resource: 'noobaa-default-backing-store', can_create_buckets: false, bucket_claim_owner: SENSITIVE-7338c355a2d6ab1b, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'DARK' } }, { name: SENSITIVE-732bc7447032b841, email: SENSITIVE-732bc7447032b841, access_keys: [ { access_key: SENSITIVE-b3786c6c15bf6932, secret_key: SENSITIVE-80d4b2c36d0ea7cb } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: false, permission_list: [ SENSITIVE-bdfbaaeb3c14fe04 ] }, default_resource: 'noobaa-default-backing-store', can_create_buckets: false, bucket_claim_owner: SENSITIVE-bdfbaaeb3c14fe04, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'DARK' } } ]
}
```
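For reference, a minimal sketch of what such a toolbox pod manifest could look like; this is not the contents of the attached archive, and it assumes the cluster's existing noobaa-core image (taken from the environment info below) can be reused:

```sh
# Hypothetical toolbox pod: reuse the cluster's noobaa-core image and keep it
# idle so the bundled RPC test can be run with `oc exec`.
cat <<'EOF' | oc apply -n openshift-storage -f -
apiVersion: v1
kind: Pod
metadata:
  name: toolbox
spec:
  restartPolicy: Never
  containers:
  - name: toolbox
    # Assumed image: the same core image reported by `noobaa status` below.
    image: quay.io/rhceph-dev/mcg-core@sha256:6ce2ddee7aff6a0e768fce523a77c998e1e48e25d227f93843d195d65ebb81b9
    command: ["sleep", "infinity"]
EOF

oc exec -ti toolbox -n openshift-storage -- bash
```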
I am trying to recreate it today, but there is another unrelated issue with CSI and GUI communication that is being debugged. Hopefully I will give it a try once the current situation is fixed on the system. This didn't happen to me earlier, hence the IBM GUI/CSI team is looking into my system.
Tried the above toolbox setup instructions once the noobaa-db pod came into Running state. Please see below (ODF-4.9.2-9 build). On restart, noobaa-db was in Init state and then got to Running state once the worker2 node got into Ready state.
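For reference, one way to watch the noobaa-db pod move through Init to Running during the node-down window (namespace assumed to be openshift-storage, as in the environment info below):

```sh
# Follow the noobaa-db pod during the failover window
oc get pods -n openshift-storage -o wide -w | grep noobaa-db-pg
```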
Tried the toolbox @baum mentioned.
However, the new account, when created using the noobaa api call, hangs, while it does show up when listed in the DB:

```
[[email protected] ODF-4.9.2]# cat account-create-s3user54.sh
```

Checked from another terminal, it shows that it is created. How did this create an entry in the DB?

```
[[email protected] ODF-4.9.2]# noobaa api account_api list_accounts {} | grep s3user54
noobaa api account_api read_account '{"email":"[email protected]"}'
```
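A rough sketch of checking the entry directly in postgres as well; this assumes the default NooBaa database name (nbcore) and an accounts table with a jsonb data column, so verify against your deployment before relying on it:

```sh
# Hypothetical direct DB check: list account emails stored in the accounts table
oc exec -ti noobaa-db-pg-0 -n openshift-storage -- \
  psql -d nbcore -c "SELECT data->>'email' AS email FROM accounts;"
```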
Even the new bucket creation hangs, as shown below.
In summary, the system is live; I can show you this one.
@rkomandu, thank you for the reply. The toolbox test might indicate an issue with the MetalLB layer.
The main difference is that the toolbox test calls the RPC over the cluster-internal management address rather than through the load balancer. For a better comparison, list_accounts() should also be exercised over the externally routed address. The bottom line: it sounds like a good idea to take a closer look at the MetalLB load balancer layer. WDYT?
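To compare the cluster-internal path with the load-balanced one, the standard service objects can be inspected; a minimal sketch (service name taken from the noobaa status output quoted later in this issue):

```sh
# Cluster-internal management service (the address the toolbox RPC test uses)
oc get svc noobaa-mgmt -n openshift-storage -o wide
oc get endpoints noobaa-mgmt -n openshift-storage

# Any LoadBalancer-type services that MetalLB would be serving
oc get svc -n openshift-storage | grep LoadBalancer
```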
@baum, when we are doing RPC, just like account create, how is MetalLB coming into the picture from the noobaa API?
Where is MetalLB coming into the picture here?
MetalLB affects the routing and networking of the system. This seems like it is not a bug in NooBaa. If you can reproduce without any LB deployed, using only what NooBaa has out of the box, then it would/could be a NooBaa issue.
In live debugging with @baum, I showed him that the new account create doesn't happen. Here is the new toolbox he provided; when tried, it showed the following error.
Posting the toolbox-account rpc.js output
noobaa-core pod logs
@baum and I did a live debug session; I showed him that the noobaa account api hangs with the same toolbox that was provided, and the noobaa-core logs are uploaded in the above comment. However, the noobaa db has the entry, as shown below.
@baum is looking into the noobaa core logs and will come back. Summary: the noobaa-db node was brought down; once the noobaa-db pod got back to Running state, the noobaa api calls used to create new user accounts hang, yet the database has the entry. Bottom line: the system cannot have new accounts or new buckets created.
@rkomandu thank you for the additional info!
As I mentioned earlier, the same work-around that we tried before was applied to get out of this situation. Now, as you can see, the noobaa-core pod was restarted and then account creation can happen.
It is now up to the NooBaa team to RCA and fix the problem.
I have reviewed the noobaa core logs with Liran. There are no errors there for the create_account RPC; according to the logs, the call completes successfully, which also matches the list_accounts output. Re RCA, the issue might be either with (a) the noobaa RPC code or (b) a ClusterIP service issue. @rkomandu could you reproduce with RPC tracing (increased debug level) to check option (a)?
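A possible way to raise the RPC/debug verbosity for option (a), assuming this NooBaa version exposes a debugLevel field on the NooBaa CR (check `oc explain noobaa.spec` first):

```sh
# Assumed CR field: spec.debugLevel -- raises noobaa-core log verbosity
oc patch noobaa noobaa -n openshift-storage --type merge -p '{"spec":{"debugLevel":5}}'

# Then watch the core logs while reproducing the hang (container name assumed to be "core")
oc logs -f noobaa-core-0 -c core -n openshift-storage
```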
@baum, I will create it in the regular flow as mentioned above, using the noobaa api call. If you want to do it with the toolbox, let us do a live session and get this collected.
@baum, currently my setup is at the noobaa-core (nsfs) log level only.
As per @romayalon, NSFS would benefit from the noobaa-endpoint logs. As we are trying to create accounts using the noobaa api, I am not sure how that would help with the noobaa-core logs.
@baum, let me know how this nsfs debug setting would help at the noobaa-core log level, as this was already the case on my system, as mentioned in the above update (for the last recreate itself I did a noobaa-core pod restart once the noobaa-db interactions hung).
@nimrod-becker, @jeniawhite, @baum: the current work-around in my 3M+3W node environment was to restart the noobaa-core pod.
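For clarity, the work-around amounts to deleting the noobaa-core pod so the statefulset recreates it; a minimal sketch:

```sh
# Restart noobaa-core by deleting the pod; the statefulset brings it back
oc delete pod noobaa-core-0 -n openshift-storage
oc get pods -n openshift-storage | grep noobaa-core
```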
I disagree that a simple node-down is the scenario; that specific scenario is being tested and it works. There is something else going on in this issue.
@nimrod-becker, when you say it works, the noobaa-db migration happens and the pod gets to Running state. However, the creation of accounts / new buckets shouldn't hang / error out once the noobaa-db pod is Running.
@baum, please let me know how to patch the above noobaa-core image.
Hello @rkomandu
For a fresh install, use the --noobaa-image install option:
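For example, a hedged sketch of such an install, with the target image as a placeholder (exact flag syntax may differ between noobaa CLI versions):

```sh
# Fresh install pointing the operator at a specific noobaa-core image
noobaa install -n openshift-storage \
  --noobaa-image='<patched-noobaa-core-image>'
```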
Best regards |
@baum, let me try editing the noobaa CR, as we don't use the noobaa CLI downstream.
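A minimal sketch of what editing the CR might look like, assuming the NooBaa CR exposes a spec.image field as in the upstream operator; as the next comment shows, an upper-level operator may reconcile it back:

```sh
# Assumed CR field: spec.image -- may be reverted by the managing (ODF) operator
oc patch noobaa noobaa -n openshift-storage --type merge \
  -p '{"spec":{"image":"<patched-noobaa-core-image>"}}'
```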
@baum, tried editing the noobaa CR; it doesn't seem to allow me, it keeps resetting to the actual odf4-mcg-core image. Tried editing it to the above noobaa-core@sha value; noobaa-core doesn't restart.
@rkomandu, sounds like the upper-level operator overrides it. Could you try editing the
This process didn't help. Found a different method to patch noobaa-core, and the patching worked. Interestingly, with this new noobaa-core image the new users can be created and the new buckets can be created, which wasn't the case earlier. I am not sure what changes were done here. Posting a step-wise write-up of the steps executed on this new image for the defect recreate. Uploading the steps of the account creation: noobaa-bug-create-db-hung-6869.txt.txt
Uploading the MG (must-gather) when worker2 is down, as was done for the above re-create. MG gave an error at the end, which is expected since worker2 is not in Ready state while MG is running.
@rkomandu, this is the codebase for the change. Let me know if you need any further assistance.
@romayalon, posted the steps performed above. Nothing changed on my system.
There are updates from the CSI team on the known limitations, and the CSI attacher replicaset is being kept for now. It has been tested with CNSA 513, CSI 2.5.0 + ODF 4.9.5-4 d/s builds + the latest DAS operator. When the node running the noobaa-db pod goes down, recovery takes approximately 6m 30s - 7m, and in that interim no new accounts, exports, or buckets can be created. Once the noobaa-db pod comes back into Running state, business is as usual. Closing this defect for now. There is an enhancement planned by the CSI team for future releases.
Environment info
NooBaa version is the RC code of ODF 4.9.0
noobaa status:

```
INFO[0000] CLI version: 5.9.0
INFO[0000] noobaa-image: quay.io/rhceph-dev/mcg-core@sha256:6ce2ddee7aff6a0e768fce523a77c998e1e48e25d227f93843d195d65ebb81b9
INFO[0000] operator-image: quay.io/rhceph-dev/mcg-operator@sha256:cc293c7fe0fdfe3812f9d1af30b6f9c59e97d00c4727c4463a5b9d3429f4278e
INFO[0000] noobaa-db-image: registry.redhat.io/rhel8/postgresql-12@sha256:b3e5b7bc6acd6422f928242d026171bcbed40ab644a2524c84e8ccb4b1ac48ff
INFO[0000] Namespace: openshift-storage
```

oc version:

```
Client Version: 4.9.5
Server Version: 4.9.5
Kubernetes Version: v1.22.0-rc.0+a44d0f0
```
Actual behavior
Note: this defect was created from the comments in #6853.
Node-down scenario: the noobaa-db pod is running on a worker node; when that node is shut down, the noobaa-db pod has to be migrated, after which new IO users should be creatable and new IO should be able to be spawned. That doesn't seem to be the case.
Expected behavior
Steps to reproduce
Basic Q here: when the node that is currently running noobaa-db-pg-0 is made down, noobaa-db-pg-0 is moved to another worker node and gets into Running state after around a 6 min delay; however, after that we can't create (a sketch of the bucket-creation check follows this list):
-- new users
-- new buckets (using s3mb)
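A rough sketch of the bucket-creation check referred to as s3mb above, assuming the AWS CLI pointed at the NooBaa S3 endpoint; the endpoint URL and credentials are placeholders:

```sh
# "s3 mb" style check against the NooBaa S3 service (hangs during the window)
AWS_ACCESS_KEY_ID=<access-key> AWS_SECRET_ACCESS_KEY=<secret-key> \
  aws --endpoint-url https://<noobaa-s3-endpoint> s3 mb s3://newbucket-check
```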
Step 1:
```
NAME             READY   STATUS    RESTARTS   AGE   IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0    1/1     Running   0          20h   10.254.23.179   worker2.rkomandu-ta.cp.fyre.ibm.com
noobaa-db-pg-0   1/1     Running   0          31m   10.254.12.12    worker1.rkomandu-ta.cp.fyre.ibm.com
```
Step 2: Made worker1 down where noobaa-db-pg-0 is running
```
worker0.rkomandu-ta.cp.fyre.ibm.com   Ready      worker   53d   v1.22.0-rc.0+a44d0f0
worker1.rkomandu-ta.cp.fyre.ibm.com   NotReady   worker   53d   v1.22.0-rc.0+a44d0f0
worker2.rkomandu-ta.cp.fyre.ibm.com   Ready      worker   53d   v1.22.0-rc.0+a44d0f0
```
Step 3: noobaa-db-pg-0 moved to worker2 from worker1
```
noobaa-db-pg-0   0/1   Init:0/2   0   15s   worker2.rkomandu-ta.cp.fyre.ibm.com
```
```
INFO[0000] ✅ Exists: NooBaa "noobaa"
✈️  RPC: account.list_accounts() Request: map[]
INFO[0000] ✅ Exists: Service "noobaa-mgmt"
INFO[0000] ✅ Exists: Secret "noobaa-operator"
INFO[0000] ✅ Exists: Secret "noobaa-admin"
INFO[0000]
WARN[0000] RPC: GetConnection creating connection to wss://localhost:42325/rpc/ 0xc000a996d0
INFO[0000] RPC: Connecting websocket (0xc000a996d0) &{RPC:0xc0004bd130 Address:wss://localhost:42325/rpc/ State:init WS: PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
INFO[0000] RPC: Connected websocket (0xc000a996d0) &{RPC:0xc0004bd130 Address:wss://localhost:42325/rpc/ State:init WS: PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
```
This is a bigger problem: when we do any failover testing, new users can't be created and no new buckets can be created either.
Attaching must-gather logs
must-gather.local-noobaa-db-pg-0.tar.gz