noobaa-db-pg pod, when migrated, doesn't allow new users or new buckets to be created #6869

Closed
rkomandu opened this issue Feb 3, 2022 · 39 comments

@rkomandu (Collaborator)

rkomandu commented Feb 3, 2022

Environment info

  • NooBaa Version: 5.9.0 (RC code of ODF 4.9.0)
  • Platform: OpenShift 4.9.5 (Kubernetes v1.22.0-rc.0)

noobaa status
INFO[0000] CLI version: 5.9.0
INFO[0000] noobaa-image: quay.io/rhceph-dev/mcg-core@sha256:6ce2ddee7aff6a0e768fce523a77c998e1e48e25d227f93843d195d65ebb81b9
INFO[0000] operator-image: quay.io/rhceph-dev/mcg-operator@sha256:cc293c7fe0fdfe3812f9d1af30b6f9c59e97d00c4727c4463a5b9d3429f4278e
INFO[0000] noobaa-db-image: registry.redhat.io/rhel8/postgresql-12@sha256:b3e5b7bc6acd6422f928242d026171bcbed40ab644a2524c84e8ccb4b1ac48ff
INFO[0000] Namespace: openshift-storage

oc version
Client Version: 4.9.5
Server Version: 4.9.5
Kubernetes Version: v1.22.0-rc.0+a44d0f0

Actual behavior
Note: This defect was created from the comments in #6853.

Node-down scenario: when the worker node running the noobaa-db pod is shut down, the pod has to be migrated to another node. After the migration, it should still be possible to create new users and spawn new I/O, but that does not seem to be the case.
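For reference, a minimal sketch of the node-down reproduction flow (assuming the openshift-storage namespace and a cluster where the worker node can be shut down out-of-band):

```
# Find the node currently hosting the DB pod
oc get pod noobaa-db-pg-0 -n openshift-storage -o wide

# Shut down that worker node out-of-band (hypervisor/BMC), then watch the pod reschedule
oc get pod noobaa-db-pg-0 -n openshift-storage -o wide -w

# Once the pod reports Running again, try a management RPC
noobaa api account_api list_accounts {}
```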

Expected behavior

Steps to reproduce

Basic question: when the node currently running noobaa-db-pg-0 is brought down, noobaa-db-pg-0 is moved to another worker node and reaches the Running state after a delay of about 6 minutes. However, after that we can't create:

-- new users
-- new buckets (using s3 mb)

Step 1:

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
noobaa-core-0 1/1 Running 0 20h 10.254.23.179 worker2.rkomandu-ta.cp.fyre.ibm.com

noobaa-db-pg-0 1/1 Running 0 31m 10.254.12.12 worker1.rkomandu-ta.cp.fyre.ibm.com

Step 2: Brought worker1 down, where noobaa-db-pg-0 was running

worker0.rkomandu-ta.cp.fyre.ibm.com Ready worker 53d v1.22.0-rc.0+a44d0f0
worker1.rkomandu-ta.cp.fyre.ibm.com NotReady worker 53d v1.22.0-rc.0+a44d0f0
worker2.rkomandu-ta.cp.fyre.ibm.com Ready worker 53d v1.22.0-rc.0+a44d0f0

Step 3: noobaa-db-pg-0 moved to worker2 from worker1
noobaa-db-pg-0 0/1 Init:0/2 0 15s worker2.rkomandu-ta.cp.fyre.ibm.com

Step 4: The pod is still trying to initialize
noobaa-db-pg-0                                     0/1     Init:0/2   0             3m56s   <none>          worker2.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>

Step 5: After about 6 minutes, it got into the Running state
noobaa-db-pg-0                                     1/1     Running   0              91m     10.254.23.217   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>

Step 6: The noobaa api call to list_accounts just hangs

noobaa api account_api list_accounts {}

INFO[0000] ✅ Exists: NooBaa "noobaa"
INFO[0000] ✅ Exists: Service "noobaa-mgmt"
INFO[0000] ✅ Exists: Secret "noobaa-operator"
INFO[0000] ✅ Exists: Secret "noobaa-admin"
INFO[0000] ✈️ RPC: account.list_accounts() Request: map[]
WARN[0000] RPC: GetConnection creating connection to wss://localhost:42325/rpc/ 0xc000a996d0
INFO[0000] RPC: Connecting websocket (0xc000a996d0) &{RPC:0xc0004bd130 Address:wss://localhost:42325/rpc/ State:init WS: PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
INFO[0000] RPC: Connected websocket (0xc000a996d0) &{RPC:0xc0004bd130 Address:wss://localhost:42325/rpc/ State:init WS: PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}


This is a bigger problem for any failover testing: new users can't be created, and no new buckets can be created either.

Attaching must-gather logs

must-gather.local-noobaa-db-pg-0.tar.gz


@rkomandu rkomandu added the NS-FS label Feb 3, 2022
@rkomandu (Collaborator, Author)

rkomandu commented Feb 3, 2022

In a live debug session with Alex, to get out of this situation of being unable to create new buckets or new users on the system, the work-around steps are as follows:

oc delete pod noobaa-core-0
pod "noobaa-core-0" deleted

NAME                                               READY   STATUS    RESTARTS         AGE     IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running   0                10m     10.254.14.158   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none> --> **This has been restarted** 
noobaa-db-pg-0                                     1/1     Running   0                2d      10.254.23.217   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-default-backing-store-noobaa-pod-77176233   1/1     Running   0                2d6h    10.254.18.15    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-5cdc86865c-9j654                   1/1     Running   0                9m34s   10.254.23.237   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-5cdc86865c-cfqls                   1/1     Running   0                9m44s   10.254.14.159   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-endpoint-5cdc86865c-mnd4m                   1/1     Running   0                9m23s   10.254.14.161   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-operator-54877b7dc9-zjsvl                   1/1     Running   0                2d2h    10.254.18.86    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
ocs-metrics-exporter-7955bfc785-cn2zl              1/1     Running   0                2d2h    10.254.18.84    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
ocs-operator-57d785c8c7-bqpfl                      1/1     Running   12 (4h32m ago)   2d2h    10.254.18.90    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
odf-console-756c9c8bc7-4jsfl                       1/1     Running   0                2d2h    10.254.18.88    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
odf-operator-controller-manager-89746b599-z64f6    2/2     Running   12 (4h32m ago)   2d2h    10.254.18.87    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
rook-ceph-operator-74864f7c6f-rlf6c                1/1     Running   0                2d2h    10.254.18.82    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>


-- noobaa api calls started working


-- Able to create new buckets 

 s3u5302 mb s3://newbucket-alex-03feb2022
urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host 's3-openshift-storage.apps.rkomandu-ta.cp.fyre.ibm.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
make_bucket: newbucket-alex-03feb2022
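Putting the work-around together, a rough sequence (assuming the openshift-storage namespace; the 300s timeout is an arbitrary choice):

```
# Restart the core pod; the StatefulSet recreates it
oc delete pod noobaa-core-0 -n openshift-storage

# Wait for the new pod to become Ready
oc wait --for=condition=Ready pod/noobaa-core-0 -n openshift-storage --timeout=300s

# Verify that management RPCs respond again
noobaa api account_api list_accounts {}
```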


@rkomandu (Collaborator, Author)

rkomandu commented Feb 3, 2022

@nimrod-becker, could someone investigate why the restart of the noobaa-core pod resulted in a working environment?

This is a priority defect IMO

@baum (Contributor)

baum commented Feb 6, 2022

@rkomandu

  • could you rephrase the bug condition, since DB migration is not necessarily related
  • could you confirm whether the issue is reproducible, and under which conditions

thank you!

@rkomandu (Collaborator, Author)

rkomandu commented Feb 7, 2022

@baum
I have provided the steps to reproduce the bug; could you try recreating it in-house? As we saw live, new I/O and new account creation resumed on the system only after the noobaa-core restart.

@dannyzaken (Contributor)

@rkomandu can you please edit the issue description to better follow the issue template?

@rkomandu (Collaborator, Author)

rkomandu commented Feb 7, 2022

@dannyzaken, the defect was moved over from the previous one, as you can see in #6853. I will try to update it again.

@baum (Contributor)

baum commented Feb 7, 2022

@rkomandu, this issue cannot be reproduced on our side: in a dev environment using a kind cluster, DB failover could not be tested due to kind storage limitations. DB failover is, however, tested by RH QE with Ceph.

To troubleshoot this issue and rule out network issues, I will provide you with a 'toolbox' image that performs RPC using the cluster-internal management address. The idea is to reproduce the issue and see whether a NooBaa RPC such as the list_accounts API works through the toolbox after the noobaa-db-pg pod migration.

@rkomandu (Collaborator, Author)

rkomandu commented Feb 7, 2022

@baum, I am not sure which network issue you are referring to. This is related to the noobaa-core restart: as we saw live on the system, things started working (accounts responding, new users, and mb create) only after it.

@rkomandu (Collaborator, Author)

rkomandu commented Feb 7, 2022

noobaa-db came into a working state, however the noobaa api calls still did not respond. So, after looking at it from different angles, we restarted noobaa-core and then the system started working.

@baum (Contributor)

baum commented Feb 7, 2022

Archive.zip

@rkomandu, could you try the following procedure once noobaa-db comes into a working state but the NooBaa RPC API calls do not respond.

Please see the attached archive; it includes a simple RPC test calling account.list_accounts() using the internal cluster address of noobaa-core.

The archive includes

  • rpc.js - test source
  • Dockerfile - building the toolbox image from the source above
  • toolbox.yaml - pod using the toolbox image

Sample run

➜  oc create -f toolbox.yaml
pod/toolbox created
➜  oc exec -ti toolbox -- bash
bash-4.4$ cd /root/node_modules/noobaa-core/src/test/rpc/
bash-4.4$ node rpc.js
load_config_local: NO LOCAL CONFIG
OpenSSL 1.1.1l  24 Aug 2021 setting up
init_rand_seed: starting ...
read_rand_seed: opening /dev/random ...
Feb-7 13:02:21.185 [/15]   [LOG] CONSOLE:: auth params  { email: '[email protected]', password: 'LAHsRAwCRRDIrlgC4q3f0w==', system: 'noobaa' }
Feb-7 13:02:21.187 [/15]   [LOG] CONSOLE:: rpc url  http://10.98.31.25:8080
(node:15) Warning: Accessing non-existent property 'RpcError' of module exports inside circular dependency
(Use `node --trace-warnings ...` to show where the warning was created)
Feb-7 13:02:21.196 [/15]   [LOG] CONSOLE:: read_rand_seed: reading 32 bytes from /dev/random ...
(node:15) [DEP0066] DeprecationWarning: OutgoingMessage.prototype._headers is deprecated
Feb-7 13:02:21.208 [/15]   [LOG] CONSOLE:: read_rand_seed: got 32 bytes from /dev/random, total 32 ...
Feb-7 13:02:21.208 [/15]   [LOG] CONSOLE:: read_rand_seed: closing fd ...
Feb-7 13:02:21.209 [/15]   [LOG] CONSOLE:: init_rand_seed: seeding with 32 bytes
rand_seed: OpenSSL 1.1.1l  24 Aug 2021 seeding randomness
Feb-7 13:02:21.210 [/15]   [LOG] CONSOLE:: init_rand_seed: done
Feb-7 13:02:21.825 [/15]   [LOG] CONSOLE::
Feb-7 13:02:21.922 [/15]   [LOG] CONSOLE:: accounts  {
  accounts: [ { name: SENSITIVE-b8e720282050fed7, email: SENSITIVE-d4bc82999e444a8c, access_keys: [ { access_key: SENSITIVE-3bcd74c0d5fb9444, secret_key: SENSITIVE-6fce8fe68278e012 } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'system-internal-storage-pool-61f2797668568e002a078531', can_create_buckets: true, systems: [ { name: 'noobaa', roles: [ 'operator' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'DARK' } }, { name: SENSITIVE-a2dee06b3ad853df, email: SENSITIVE-a2dee06b3ad853df, access_keys: [ { access_key: SENSITIVE-8eb41064e84395a9, secret_key: SENSITIVE-57fac7497d77e859 } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: false, permission_list: [ SENSITIVE-6f3b29f1f7bb9970 ] }, default_resource: 'noobaa-default-backing-store', can_create_buckets: false, bucket_claim_owner: SENSITIVE-6f3b29f1f7bb9970, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'DARK' } }, { name: SENSITIVE-b8e720282050fed7, email: SENSITIVE-9cf0fd89409efef8, access_keys: [ { access_key: SENSITIVE-72c9bd3c1b81e8e0, secret_key: SENSITIVE-593f1fadf1c863ee } ], has_login: true, has_s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'noobaa-default-backing-store', can_create_buckets: true, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'DARK' } }, { name: SENSITIVE-fdc12e472642f831, email: SENSITIVE-fdc12e472642f831, access_keys: [ { access_key: SENSITIVE-79f98fc256854ccf, secret_key: SENSITIVE-c729e136639d9471 } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: false, permission_list: [ SENSITIVE-7338c355a2d6ab1b ] }, default_resource: 'noobaa-default-backing-store', can_create_buckets: false, bucket_claim_owner: SENSITIVE-7338c355a2d6ab1b, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'DARK' } }, { name: SENSITIVE-732bc7447032b841, email: SENSITIVE-732bc7447032b841, access_keys: [ { access_key: SENSITIVE-b3786c6c15bf6932, secret_key: SENSITIVE-80d4b2c36d0ea7cb } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: false, permission_list: [ SENSITIVE-bdfbaaeb3c14fe04 ] }, default_resource: 'noobaa-default-backing-store', can_create_buckets: false, bucket_claim_owner: SENSITIVE-bdfbaaeb3c14fe04, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'DARK' } } ]
}

@rkomandu (Collaborator, Author)

rkomandu commented Feb 9, 2022

I am trying to recreate it today, but there is another, unrelated issue with CSI and GUI communication that is being debugged. I will give it a try once the current situation is fixed on the system. This didn't happen for me earlier, hence the IBM GUI/CSI team is looking into my system.

@rkomandu (Collaborator, Author)

@baum @dannyzaken

Tried the above toolbox setup instructions once the noobaa-db pod came into the Running state. Please see below.

ODF-4.9.2-9 build

The noobaa-db pod restart was stuck in the Init state and then got to the Running state once the worker2 node returned to Ready.

Initially, before worker2 was brought down, NooBaa was operational:

noobaa-db-pg-0                                     1/1     Running   0                7d23h   10.254.23.217   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>
noobaa-default-backing-store-noobaa-pod-77176233   1/1     Running   0                8d      10.254.18.15    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>
noobaa-endpoint-5b84fc4698-c8dqr                   1/1     Running   0                23h     10.254.20.173   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>
noobaa-endpoint-5b84fc4698-lv8tm                   1/1     Running   0                23h     10.254.17.224   worker0.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>
noobaa-endpoint-5b84fc4698-tsc4f                   1/1     Running   0                23h     10.254.14.19    worker1.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>

After a few minutes, checking the noobaa-db pod as it tries to move to worker1:

noobaa-db-pg-0                                     0/1     Init:0/2   0                14s     <none>          worker1.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>
noobaa-db-pg-0                                     0/1     Init:0/2   0                30s     <none>          worker1.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>
noobaa-db-pg-0                                     0/1     Init:0/2   0                55s     <none>          worker1.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>
noobaa-db-pg-0                                     0/1     Init:0/2   0                104s    <none>          worker1.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>
noobaa-db-pg-0                                     0/1     Init:0/2   0                5m59s   <none>          worker1.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>


And finally, once worker2 is brought back up, the noobaa-db pod is in the Running state:

NAME                                               READY   STATUS    RESTARTS      AGE   IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running   0             21h   10.254.15.221   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>
noobaa-db-pg-0                                     1/1     Running   0             21h   10.254.12.240   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>
noobaa-default-backing-store-noobaa-pod-77176233   1/1     Running   0             9d    10.254.18.15    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>
noobaa-endpoint-5b84fc4698-7bhb5                   1/1     Running   0             21h   10.254.20.7     worker2.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>
noobaa-endpoint-5b84fc4698-lv8tm                   1/1     Running   0             44h   10.254.17.224   worker0.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>
noobaa-endpoint-5b84fc4698-tsc4f                   1/1     Running   0             44h   10.254.14.19    worker1.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>
noobaa-operator-54877b7dc9-zjsvl                   1/1     Running   0             8d    10.254.18.86    worker0.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>


HA (MetalLB is functional) 

 oc get svc -n openshift-storage |grep Load
ibm-spectrum-scale-das-ip-worker0-rkomandu-ta-cp-fyre-ibm-com   LoadBalancer   172.30.95.89     10.17.127.178   80:31051/TCP,443:32733/TCP,8444:31078/TCP,7004:31005/TCP   14d
ibm-spectrum-scale-das-ip-worker1-rkomandu-ta-cp-fyre-ibm-com   LoadBalancer   172.30.160.49    10.17.127.179   80:30560/TCP,443:30484/TCP,8444:31076/TCP,7004:31342/TCP   14d
ibm-spectrum-scale-das-ip-worker2-rkomandu-ta-cp-fyre-ibm-com   LoadBalancer   172.30.51.82     10.17.127.180   80:31408/TCP,443:31184/TCP,8444:32501/TCP,7004:32399/TCP   14d


Tried the toolbox that @baum mentioned:


[[email protected] ODF-4.9.2]# oc create -f ./toolbox.yaml
pod/toolbox created
[[email protected] ODF-4.9.2]# oc exec -it toolbox --bash
Error: unknown flag: --bash
See 'oc exec --help' for usage.
[[email protected] ODF-4.9.2]# oc exec -it toolbox -- bash
error: unable to upgrade connection: container not found ("core")
[[email protected] ODF-4.9.2]# oc project openshift-storage
Already on project "openshift-storage" on server "https://api.rkomandu-ta.cp.fyre.ibm.com:6443".
[[email protected] ODF-4.9.2]# oc exec -it toolbox -- bash
bash-4.4$ cd /root/node_modules/
bash-4.4$ cd noobaa-core/
bash-4.4$ cd src
bash-4.4$ cd test
bash-4.4$ cd rpc
bash-4.4$ ls
rpc.js
bash-4.4$ uname -a
Linux toolbox 4.18.0-305.19.1.el8_4.x86_64 #1 SMP Tue Sep 7 07:07:31 EDT 2021 x86_64 x86_64 x86_64 GNU/Linux
bash-4.4$ pwd
/root/node_modules/noobaa-core/src/test/rpc
bash-4.4$ node rpc.js
load_config_local: NO LOCAL CONFIG
OpenSSL 1.1.1l  24 Aug 2021 setting up
init_rand_seed: starting ...
read_rand_seed: opening /dev/random ...
Feb-10 7:21:28.082 [/15]   [LOG] CONSOLE:: auth params  { email: '[email protected]', password: 'EgJzGPurcbNR0VyngfbicA==', system: 'noobaa' }
Feb-10 7:21:28.084 [/15]   [LOG] CONSOLE:: rpc url  http://172.30.137.215:80
(node:15) Warning: Accessing non-existent property 'RpcError' of module exports inside circular dependency
(Use `node --trace-warnings ...` to show where the warning was created)
Feb-10 7:21:28.090 [/15]   [LOG] CONSOLE:: read_rand_seed: reading 32 bytes from /dev/random ...
(node:15) [DEP0066] DeprecationWarning: OutgoingMessage.prototype._headers is deprecated
Feb-10 7:21:28.098 [/15]   [LOG] CONSOLE:: read_rand_seed: got 32 bytes from /dev/random, total 32 ...
Feb-10 7:21:28.098 [/15]   [LOG] CONSOLE:: read_rand_seed: closing fd ...
Feb-10 7:21:28.099 [/15]   [LOG] CONSOLE:: init_rand_seed: seeding with 32 bytes
rand_seed: OpenSSL 1.1.1l  24 Aug 2021 seeding randomness
Feb-10 7:21:28.099 [/15]   [LOG] CONSOLE:: init_rand_seed: done
Feb-10 7:21:28.183 [/15]   [LOG] CONSOLE::
Feb-10 7:21:28.196 [/15]   [LOG] CONSOLE:: accounts  {
  accounts: [ { name: SENSITIVE-658df9d055137abd, email: SENSITIVE-64577de2f7b1a9b0, access_keys: [ { access_key: SENSITIVE-7bc8345fe8af5ef5, secret_key: SENSITIVE-1fc3697230838203 } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'noobaa-s3res-4080029599', can_create_buckets: true, nsfs_account_config: { gid: 6000, uid: 6004, nsfs_only: true, new_buckets_path: '/' }, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'LIGHT' } }, { name: SENSITIVE-59387e6ae961fe37, email: SENSITIVE-78a31bbbd2a133fe, access_keys: [ { access_key: SENSITIVE-5058e68c5ecfcb36, secret_key: SENSITIVE-1bc07a6a6161c50c } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'noobaa-s3res-4080029599', can_create_buckets: true, nsfs_account_config: { gid: 6000, uid: 6003, nsfs_only: true, new_buckets_path: '/' }, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'LIGHT' } }, { name: SENSITIVE-cb994f21e619c111, email: SENSITIVE-cb994f21e619c111, access_keys: [ { access_key: SENSITIVE-6a57e11fdb0ff3f7, secret_key: SENSITIVE-e82f894c40204c68 } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'noobaa-s3res-4080029599', can_create_buckets: true, nsfs_account_config: { gid: 5555, uid: 5304, nsfs_only: true, new_buckets_path: '/' }, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'LIGHT' } }, { name: SENSITIVE-740dc8e97d24d6b7, email: SENSITIVE-740dc8e97d24d6b7, access_keys: [ { access_key: SENSITIVE-4f0788526af54536, secret_key: SENSITIVE-25e708aee8cf1631 } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'noobaa-s3res-4080029599', can_create_buckets: true, nsfs_account_config: { gid: 5555, uid: 5303, nsfs_only: true, new_buckets_path: '/' }, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'LIGHT' } }, { name: SENSITIVE-b8e720282050fed7, email: SENSITIVE-d4bc82999e444a8c, access_keys: [ { access_key: SENSITIVE-5c9b044d94a0166e, secret_key: SENSITIVE-ac4cbf4416fbce00 } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'system-internal-storage-pool-61f22b63543779002bee7190', can_create_buckets: true, systems: [ { name: 'noobaa', roles: [ 'operator' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'LIGHT' } }, { name: SENSITIVE-b8e720282050fed7, email: SENSITIVE-9cf0fd89409efef8, access_keys: [ { access_key: SENSITIVE-dcf5c099b1f544e2, secret_key: SENSITIVE-ff3c7a5d757610db } ], has_login: true, has_s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'noobaa-default-backing-store', can_create_buckets: true, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'LIGHT' } }, { name: SENSITIVE-5220acf980316ab1, email: SENSITIVE-5220acf980316ab1, access_keys: [ { access_key: SENSITIVE-9b4c8329a77bed1f, secret_key: SENSITIVE-e6434719b445bbdd } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'noobaa-s3res-4080029599', can_create_buckets: true, 
nsfs_account_config: { gid: 5555, uid: 5300, nsfs_only: true, new_buckets_path: 'user-5300-bucket-27jan/' }, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'LIGHT' } }, { name: SENSITIVE-1a884e40f577ab90, email: SENSITIVE-1a884e40f577ab90, access_keys: [ { access_key: SENSITIVE-177145be5df93b04, secret_key: SENSITIVE-491b61747047bfdc } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'noobaa-s3res-4080029599', can_create_buckets: true, nsfs_account_config: { gid: 5555, uid: 5301, nsfs_only: true, new_buckets_path: 'user-5301-bucket-27jan/' }, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'LIGHT' } }, { name: SENSITIVE-30e7dfc8b31583c4, email: SENSITIVE-30e7dfc8b31583c4, access_keys: [ { access_key: SENSITIVE-a4252c8f0c93ecc1, secret_key: SENSITIVE-f51bbb28eb0bc321 } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'noobaa-s3res-4080029599', can_create_buckets: true, nsfs_account_config: { gid: 5555, uid: 5302, nsfs_only: true, new_buckets_path: 'user-5302-bucket-27jan/' }, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'LIGHT' } } ]
}
bash-4.4$ quit
bash: quit: command not found
bash-4.4$ exit
exit
command terminated with exit code 127
toolbox is running on worker2
toolbox                                            1/1     Running   0             6m21s   10.254.20.49    worker2.rkomandu-ta.cp.fyre.ibm.com   <none>   <none>

However, a new account created using the noobaa api call hangs, although it does show up when listing accounts from the DB.
Please see the steps below.

[[email protected] ODF-4.9.2]# cat account-create-s3user54.sh
noobaa api account_api create_account '{
"email": "[email protected]",
"name" : "s3user54",
"has_login": false,
"s3_access": true,
"allowed_buckets": { "full_permission": true },
"default_resource": "noobaa-s3res-4080029599",
"nsfs_account_config": {
"uid": 6004,
"gid": 6000,
"new_buckets_path": "/",
"nsfs_only": true
}
}'
[[email protected] ODF-4.9.2]# ./account-create-s3user54.sh
INFO[0000] ✅ Exists: NooBaa "noobaa"
INFO[0000] ✅ Exists: Service "noobaa-mgmt"
INFO[0000] ✅ Exists: Secret "noobaa-operator"
INFO[0000] ✅ Exists: Secret "noobaa-admin"
INFO[0000] ✈️ RPC: account.create_account() Request: map[allowed_buckets:map[full_permission:true] default_resource:noobaa-s3res-4080029599 email:[email protected] has_login:false name:s3user54 nsfs_account_config:map[gid:6000 new_buckets_path:/ nsfs_only:true uid:6004] s3_access:true]
WARN[0000] RPC: GetConnection creating connection to wss://localhost:36581/rpc/ 0xc000a114a0
INFO[0000] RPC: Connecting websocket (0xc000a114a0) &{RPC:0xc00039eff0 Address:wss://localhost:36581/rpc/ State:init WS: PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
INFO[0000] RPC: Connected websocket (0xc000a114a0) &{RPC:0xc00039eff0 Address:wss://localhost:36581/rpc/ State:init WS: PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
E0209 23:16:28.633299 2075819 portforward.go:233] lost connection to pod
ERRO[1512] RPC: ReadMessages error: failed to get reader: failed to read frame header: EOF
ERRO[1512] RPC: closing connection (0xc000a114a0) &{RPC:0xc00039eff0 Address:wss://localhost:36581/rpc/ State:connected WS:0xc000976540 PendingRequests:map[wss://localhost:36581/rpc/-0:0xc0002997a0] NextRequestID:1 Lock:{state:1 sema:0} ReconnectDelay:0s}
ERRO[1512] RPC: could not close web socket wss://localhost:36581/rpc/
WARN[1512] RPC: RemoveConnection wss://localhost:36581/rpc/ current=0xc000a114a0 conn=0xc000a114a0
WARN[1512] RPC: GetConnection creating connection to wss://localhost:36581/rpc/ 0xc000118320
INFO[1512] RPC: Reconnect (0xc000118320) delay &{RPC:0xc00039eff0 Address:wss://localhost:36581/rpc/ State:init WS: PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:3s}
ERRO[1512] ⚠️ RPC: account.create_account() Call failed: RPC: connection closed while request is pending wss://localhost:36581/rpc/ wss://localhost:36581/rpc/-0
FATA[1512] ❌ RPC: connection closed while request is pending wss://localhost:36581/rpc/ wss://localhost:36581/rpc/-0

Checked from another terminal: it shows that the account is created. How did this create an entry in the DB when the call hung?

[[email protected] ODF-4.9.2]# noobaa api account_api list_accounts {} | grep s3user54
INFO[0000] ✅ Exists: NooBaa "noobaa"
INFO[0000] ✅ Exists: Service "noobaa-mgmt"
INFO[0000] ✅ Exists: Secret "noobaa-operator"
INFO[0000] ✅ Exists: Secret "noobaa-admin"
INFO[0000] ✈️ RPC: account.list_accounts() Request: map[]
WARN[0000] RPC: GetConnection creating connection to wss://localhost:45049/rpc/ 0xc000448e10
INFO[0000] RPC: Connecting websocket (0xc000448e10) &{RPC:0xc0004bd1d0 Address:wss://localhost:45049/rpc/ State:init WS: PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
INFO[0000] RPC: Connected websocket (0xc000448e10) &{RPC:0xc0004bd1d0 Address:wss://localhost:45049/rpc/ State:init WS: PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
INFO[0000] ✅ RPC: account.list_accounts() Response OK: took 0.5ms
email: [email protected]
name: s3user54

noobaa api account_api read_account '{"email":"[email protected]"}'
INFO[0000] ✅ Exists: NooBaa "noobaa"
INFO[0000] ✅ Exists: Service "noobaa-mgmt"
INFO[0000] ✅ Exists: Secret "noobaa-operator"
INFO[0000] ✅ Exists: Secret "noobaa-admin"
INFO[0000] ✈️ RPC: account.read_account() Request: map[email:[email protected]]
WARN[0000] RPC: GetConnection creating connection to wss://localhost:40577/rpc/ 0xc000d21860
INFO[0000] RPC: Connecting websocket (0xc000d21860) &{RPC:0xc0005171d0 Address:wss://localhost:40577/rpc/ State:init WS: PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
INFO[0000] RPC: Connected websocket (0xc000d21860) &{RPC:0xc0005171d0 Address:wss://localhost:40577/rpc/ State:init WS: PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
INFO[0000] ✅ RPC: account.read_account() Response OK: took 0.5ms
access_keys:

  • access_key: e70tPUJXVwM9Qt9m9q70
    secret_key: dgwc9H98Ey8czo6kemv7jQqdLdMxzk4nl8ZY/ynf
    allowed_buckets:
    full_permission: true
    can_create_buckets: true
    default_resource: noobaa-s3res-4080029599
    email: [email protected]
    external_connections:
    connections: []
    count: 0
    has_login: false
    has_s3_access: true
    name: s3user54
    nsfs_account_config:
    gid: 6000
    new_buckets_path: /
    nsfs_only: true
    uid: 6004
    preferences:
    ui_theme: LIGHT
    systems:
  • name: noobaa
    roles:
    • admin

Even new bucket creation hangs, as shown below:

Step 1: Could list an already existing bucket

[root@rkomandu-app-node1 ~]# s3u5302 ls s3://newbucket-09-feb-2022
urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.17.127.180'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
2022-02-09 02:18:22 10737397760 file_5G

Step 2: Trying to create a bucket does not succeed, even once the noobaa-db pod has come into the Running state

[root@rkomandu-app-node1 ~]# s3u5302 mb s3://newbucket-u5302-10-feb-2022
urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.17.127.180'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.17.127.180'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
urllib3/connectionpool.py:1013: InsecureRequestWarning: Unverified HTTPS request is being made to host '10.17.127.180'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
make_bucket failed: s3://newbucket-u5302-10-feb-2022 Read timeout on endpoint URL: "https://10.17.127.180/newbucket-u5302-10-feb-2022"

In summary:
-- new account creation with the noobaa api hangs, although the database entry is created
-- new buckets created via mmdas do not get created.

The system is live; I can show you this.

@baum (Contributor)

baum commented Feb 10, 2022

@rkomandu, thank you for the reply.

The toolbox test might indicate an issue with the MetalLB layer.

  1. The toolbox test succeeded in calling list_accounts() RPC, using the cluster internal IP: RPC URL http://172.30.137.215:80

  2. noobaa api account_api RPC call times out, while using MetalLB LoadBalancer

The main difference between the noobaa api and toolbox approaches is that the toolbox uses the ClusterIP, while noobaa api goes via the MetalLB LoadBalancer.

For a better comparison, list_accounts() should be used with noobaa api account_api. Alternatively, rpc.js from Archive.zip is easy to modify to do the job of ./account-create-s3user54.sh, i.e. to call the create_account() function. That would make an apples-to-apples comparison: the same RPC call working with the toolbox while timing out with noobaa api account_api.

Bottom line: it sounds like a good idea to take a closer look at the MetalLB load balancer layer.
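One hedged way to compare the two paths is to look at the addresses each side uses; the LoadBalancer names and the external IP below come from the listings earlier in this issue:

```
# Cluster-internal management service used by the toolbox (ClusterIP)
oc get svc noobaa-mgmt -n openshift-storage

# MetalLB-backed LoadBalancer services used by the S3 clients
oc get svc -n openshift-storage | grep LoadBalancer

# Quick reachability probe of an external LB address (expects some HTTP status back, not a hang)
curl -k -o /dev/null -w '%{http_code}\n' https://10.17.127.180/
```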

WDYT?

@rkomandu (Collaborator, Author)

@baum,
It is not related to MetalLB: as you can see in my post above, the s3uxxx ls and mb against the same endpoint had different results.

When we are doing an RPC such as account create, how does MetalLB come into the picture from the noobaa API?

[[email protected] ODF-4.9.2]# time sh ./account-create-s3user55.sh
INFO[0000] ✅ Exists: NooBaa "noobaa"
INFO[0000] ✅ Exists: Service "noobaa-mgmt"
INFO[0000] ✅ Exists: Secret "noobaa-operator"
INFO[0000] ✅ Exists: Secret "noobaa-admin"
INFO[0000] ✈️  RPC: account.create_account() Request: map[allowed_buckets:map[full_permission:true] default_resource:noobaa-s3res-4080029599 email:[email protected] has_login:false name:s3user55 nsfs_account_config:map[gid:6000 new_buckets_path:/ nsfs_only:true uid:6005] s3_access:true]
WARN[0000] RPC: GetConnection creating connection to wss://localhost:36685/rpc/ 0xc0001aae60
INFO[0000] RPC: Connecting websocket (0xc0001aae60) &{RPC:0xc0001aaff0 Address:wss://localhost:36685/rpc/ State:init WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
INFO[0000] RPC: Connected websocket (0xc0001aae60) &{RPC:0xc0001aaff0 Address:wss://localhost:36685/rpc/ State:init WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
<hangs> 

Where does MetalLB come into the picture here?

@nimrod-becker (Contributor)

MetalLB affects the routing and networking of the system.
Using internal IPs works.

This seems like it is not a bug in NooBaa. If you can reproduce it without any LB deployed, using only what NooBaa provides out of the box, then it would/could be a NooBaa issue.
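A possible way to test that, sketched under the assumption that the default NooBaa s3 Service exists in openshift-storage and that valid S3 credentials are at hand (the local port and bucket name are arbitrary):

```
# Bypass MetalLB by port-forwarding the in-cluster s3 service
oc -n openshift-storage port-forward service/s3 10443:443 &

# Point an S3 client at the forwarded port and retry bucket creation
AWS_ACCESS_KEY_ID=<access-key> AWS_SECRET_ACCESS_KEY=<secret-key> \
  aws --endpoint-url https://localhost:10443 --no-verify-ssl s3 mb s3://repro-no-lb
```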

@rkomandu (Collaborator, Author)

In live debugging with @baum, I showed him that the new account create doesn't happen. Here is the new toolbox he provided; when it was tried, it showed the following error:
rpc.account-create.js.log

@rkomandu (Collaborator, Author)

rkomandu commented Feb 10, 2022

@baum

Posting the toolbox-account rpc.js output

bash-4.4$ node rpc.js
load_config_local: NO LOCAL CONFIG
OpenSSL 1.1.1l  24 Aug 2021 setting up
init_rand_seed: starting ...
read_rand_seed: opening /dev/random ...
Feb-10 11:30:04.685 [/16]   [LOG] CONSOLE:: auth params  { email: '[email protected]', password: 'EgJzGPurcbNR0VyngfbicA==', system: 'noobaa' }
Feb-10 11:30:04.686 [/16]   [LOG] CONSOLE:: rpc url  http://172.30.137.215:80
(node:16) Warning: Accessing non-existent property 'RpcError' of module exports inside circular dependency
(Use `node --trace-warnings ...` to show where the warning was created)
Feb-10 11:30:04.690 [/16]   [LOG] CONSOLE:: read_rand_seed: reading 32 bytes from /dev/random ...
(node:16) [DEP0066] DeprecationWarning: OutgoingMessage.prototype._headers is deprecated
Feb-10 11:30:04.697 [/16]   [LOG] CONSOLE:: read_rand_seed: got 32 bytes from /dev/random, total 32 ...
Feb-10 11:30:04.697 [/16]   [LOG] CONSOLE:: read_rand_seed: closing fd ...
Feb-10 11:30:04.698 [/16]   [LOG] CONSOLE:: init_rand_seed: seeding with 32 bytes
rand_seed: OpenSSL 1.1.1l  24 Aug 2021 seeding randomness
Feb-10 11:30:04.698 [/16]   [LOG] CONSOLE:: init_rand_seed: done
Feb-10 11:30:04.785 [/16]   [LOG] CONSOLE::


Feb-10 11:32:04.787 [/16]    [L0] core.rpc.rpc_base_conn:: RPC CONNECTION CLOSED. got event from connection: http://172.30.137.215:80(100s7fr) Error: RPC SEND TIMEOUT
    at /root/node_modules/noobaa-core/src/rpc/rpc_base_conn.js:155:23
    at Timeout._onTimeout (/root/node_modules/noobaa-core/src/util/promise.js:152:20)
    at listOnTimeout (internal/timers.js:557:17)
    at processTimers (internal/timers.js:500:7)
Feb-10 11:32:04.788 [/16]  [WARN] core.rpc.rpc_http:: HTTP ABORT REQ undefined
Feb-10 11:32:04.790 [/16] [ERROR] core.rpc.rpc:: RPC._request: response ERROR srv account_api.create_account reqid 1@http://172.30.137.215:80(100s7fr) connid http://172.30.137.215:80(100s7fr) params { email: SENSITIVE-f56400886485bf66, name: SENSITIVE-5a2a8929f9369b01, has_login: false, s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'noobaa-s3res-4080029599', nsfs_account_config: { uid: 7777, gid: 6000, new_buckets_path: '/', nsfs_only: true } }  [RpcError: connection closed http://172.30.137.215:80(100s7fr) reqid 1@http://172.30.137.215:80(100s7fr)] { rpc_code: 'DISCONNECTED', rpc_data: { retryable: true } }
(node:16) UnhandledPromiseRejectionWarning
(node:16) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:16) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

noobaa-core pod logs

noobaa-core-account-hung.10feb2022.log

@rkomandu (Collaborator, Author)

@baum and I did a live debug session, and I showed him that the noobaa account API hangs with the same toolbox that was provided; the noobaa-core logs are uploaded in the comment above.

However, the noobaa DB has the entry, as shown below:

noobaa api account_api list_accounts {} | grep s3user20220210
INFO[0000] ✅ Exists: NooBaa "noobaa"
INFO[0000] ✅ Exists: Service "noobaa-mgmt"
INFO[0000] ✅ Exists: Secret "noobaa-operator"
INFO[0000] ✅ Exists: Secret "noobaa-admin"
INFO[0000] ✈️  RPC: account.list_accounts() Request: map[]
WARN[0000] RPC: GetConnection creating connection to wss://localhost:35127/rpc/ 0xc000af81e0
INFO[0000] RPC: Connecting websocket (0xc000af81e0) &{RPC:0xc000461180 Address:wss://localhost:35127/rpc/ State:init WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
INFO[0000] RPC: Connected websocket (0xc000af81e0) &{RPC:0xc000461180 Address:wss://localhost:35127/rpc/ State:init WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
INFO[0000] ✅ RPC: account.list_accounts() Response OK: took 0.6ms
  email: [email protected]
  name: s3user20220210

@baum is looking into the noobaa-core logs and will come back.

Summary: the noobaa-db node was brought down, and once the pod got back to the Running state, the noobaa api calls used to create new user accounts hang, even though the database gets the entry. The bottom line is that the system cannot have new accounts or new buckets created.

@baum (Contributor)

baum commented Feb 10, 2022

@rkomandu thank you for additional info!

@rkomandu (Collaborator, Author)

@baum @nimrod-becker

As I mentioned earlier, the same work-around that we tried before was applied to get out of this situation. As you can see, the noobaa-core pod was restarted and then account creation works again.

noobaa-core-0                                      1/1     Running   0               3m16s   10.254.20.221   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>

Account creation

[[email protected] ODF-4.9.2]# sh ./account-create-s3user56.sh
INFO[0000] ✅ Exists: NooBaa "noobaa"
INFO[0000] ✅ Exists: Service "noobaa-mgmt"
INFO[0000] ✅ Exists: Secret "noobaa-operator"
INFO[0000] ✅ Exists: Secret "noobaa-admin"
INFO[0000] ✈️  RPC: account.create_account() Request: map[allowed_buckets:map[full_permission:true] default_resource:noobaa-s3res-4080029599 email:[email protected] has_login:false name:s3user56 nsfs_account_config:map[gid:6000 new_buckets_path:/ nsfs_only:true uid:6006] s3_access:true]
WARN[0000] RPC: GetConnection creating connection to wss://localhost:44267/rpc/ 0xc000a7a3c0
INFO[0000] RPC: Connecting websocket (0xc000a7a3c0) &{RPC:0xc0004d3180 Address:wss://localhost:44267/rpc/ State:init WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
INFO[0000] RPC: Connected websocket (0xc000a7a3c0) &{RPC:0xc0004d3180 Address:wss://localhost:44267/rpc/ State:init WS:<nil> PendingRequests:map[] NextRequestID:0 Lock:{state:1 sema:0} ReconnectDelay:0s}
INFO[0000] ✅ RPC: account.create_account() Response OK: took 22.3ms
access_keys:
- access_key: nnRBILthJfPSqZveQaKF
  secret_key: OJmh7d4F/jnyOvtU/uPsFqZootlFhvF3wF6rfxTk
token: eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJhY2NvdW50X2lkIjoiNjIwNTM3Y2QwZmE4MWUwMDJhMWViNzdhIiwic3lzdGVtX2lkIjoiNjFmMjJiNjM1NDM3NzkwMDJiZWU3MTkwIiwicm9sZSI6ImFkbWluIiwiYXV0aG9yaXplZF9ieSI6Im5vb2JhYSIsImlhdCI6MTY0NDUwOTEzM30.pavl7mbRsfgMKDqg8H58TovkxeTlg0bIKw5YsLq8O78


It is now up to the NooBaa team to do the RCA and fix the problem.

@baum (Contributor)

baum commented Feb 14, 2022

I have reviewed the noobaa-core logs with Liran. There are no errors there for the create_account RPC; according to the logs the call completes successfully, which also matches the list_accounts output. Regarding the RCA, the issue might be either (a) in the noobaa RPC code or (b) a ClusterIP service issue.

@rkomandu, could you reproduce with RPC tracing (increased debug level) to check option (a)?
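One way to turn up RPC tracing, sketched under the assumption that the log level is controlled by the NOOBAA_LOG_LEVEL key in the operator-managed noobaa-config ConfigMap (the ConfigMap name and the 'all' value are assumptions; a later comment shows the key with the nsfs value):

```
# Raise the log level, then restart the core pod so it picks the value up
oc -n openshift-storage patch configmap noobaa-config --type merge -p '{"data":{"NOOBAA_LOG_LEVEL":"all"}}'
oc -n openshift-storage delete pod noobaa-core-0

# Reproduce the hang, then collect the traced logs
oc -n openshift-storage logs noobaa-core-0 > noobaa-core-rpc-trace.log
```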

@rkomandu (Collaborator, Author)

@baum, I will create it in the regular flow as mentioned above using the noobaa api call. If you want to do it with the toolbox, let's do a live session and get this collected.

@rkomandu (Collaborator, Author)

rkomandu commented Feb 14, 2022

@baum, currently my setup is at the noobaa-core nsfs log level only:

apiVersion: v1
data:
  DISABLE_DEV_RANDOM_SEED: "true"
  NOOBAA_DISABLE_COMPRESSION: "false"
  NOOBAA_LOG_LEVEL: nsfs
kind: ConfigMap


NAME                                               READY   STATUS    RESTARTS        AGE     IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running   0               3d19h   10.254.20.221   worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>

As per @romayalon, the nsfs level would mainly benefit the noobaa-endpoint logs. Since we are trying to create accounts using the noobaa api, I am not sure how that would help for the noobaa-core logs.

@rkomandu (Collaborator, Author)

@baum, let me know how the nsfs debug level would help at the noobaa-core log level, as this was already the case on my system, as mentioned in the update above (for the last recreate itself I did a noobaa-core pod restart once the noobaa-db interactions hung).

@rkomandu (Collaborator, Author)

@nimrod-becker , @jeniawhite @baum
If this defect is not triaged and fixed, it will affect new user creation and new bucket creation, as mentioned in the comments above. As I understand it, this would be a blocker for GA as well: if a node happens to go down and noobaa-db does not become functional again, that is a problem.

The current work-around in my 3M+3W node environment was to restart the noobaa-core pod.

@nimrod-becker (Contributor)

I disagree that a simple node-down is the scenario; that specific scenario is being tested and it works. There is something else going on in this issue.

@rkomandu (Collaborator, Author)

@nimrod-becker, when you say it works: yes, the noobaa-db migration happens and the pod gets to the Running state. However, the creation of accounts / new buckets shouldn't hang or error out once the noobaa-db pod is Running.

@baum (Contributor)

baum commented Feb 17, 2022

@rkomandu Could you try reproducing the RPC issue with the following noobaa-core image, which has increased RPC debug messages:
quay.io/baum/noobaa-core:34773ea250ced918e4603ae7a78da79b87266961

This image is based on the following codebase.

Thank you!

@rkomandu (Collaborator, Author)

@baum , please let me know how to patch the above noobaa-core image

@baum (Contributor)

baum commented Feb 17, 2022

Hello @rkomandu
For an existing system, edit noobaa cr ( oc edit noobaa noobaa ) and update

image: quay.io/baum/noobaa-core:34773ea250ced918e4603ae7a78da79b87266961
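If the interactive edit is inconvenient, the same change can be applied non-interactively; this assumes the image sits under spec.image in the NooBaa CR, matching the field shown above:

```
# Hypothetical one-liner equivalent of `oc edit noobaa noobaa`
oc -n openshift-storage patch noobaa noobaa --type merge \
  -p '{"spec":{"image":"quay.io/baum/noobaa-core:34773ea250ced918e4603ae7a78da79b87266961"}}'
```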

For a fresh install, use the --noobaa-image install option:

~ export CORE_IMAGE=quay.io/baum/noobaa-core:34773ea250ced918e4603ae7a78da79b87266961
~ noobaa .... --noobaa-image=$CORE_IMAGE install

Best regards

@rkomandu (Collaborator, Author)

@baum, let me try editing the noobaa cr, as we don't use the noobaa CLI downstream.

@rkomandu (Collaborator, Author)

rkomandu commented Feb 17, 2022

@baum, tried editing the noobaa cr; it doesn't seem to allow me to, it keeps resetting to the odf4-mcg-core image.

Actual one:
actualImage: quay.io/rhceph-dev/odf4-mcg-core-rhel8@sha256:5507f2c1074bfb023415f0fef16ec42fbe6e90c540fc45f1111c8c929e477910

Tried editing it to the above noobaa-core image (by sha value).

The noobaa-core pod doesn't restart:

NAME                                               READY   STATUS    RESTARTS       AGE     IP              NODE                                  NOMINATED NODE   READINESS GATES
noobaa-core-0                                      1/1     Running   0              22h     10.254.20.22    worker2.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>
noobaa-db-pg-0                                     1/1     Running   1 (3d4h ago)   8d      10.254.12.240   worker1.rkomandu-ta.cp.fyre.ibm.com   <none>           <none>

@baum (Contributor)

baum commented Feb 18, 2022

@rkomandu, sounds like the upper-level operator overrides it. Could you try editing the ClusterServiceVersion resource? It contains the noobaa-core image.
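A sketch of that approach, assuming the core image is referenced from the mcg/noobaa operator CSV (for example through a NOOBAA_CORE_IMAGE entry on the operator Deployment; the exact CSV name differs per install):

```
# Find the operator CSV in the namespace
oc -n openshift-storage get csv

# Edit it and replace the noobaa-core image reference with the debug image
oc -n openshift-storage edit csv <mcg-operator-csv-name>
```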

@rkomandu (Collaborator, Author)

@baum

> @rkomandu, sounds like the upper-level operator overrides it. Could you try editing the ClusterServiceVersion resource? It contains the noobaa-core image.

This process didn't help.

Found a different method to patch noobaa-core, and the patching worked.

Interestingly, with this new noobaa-core image the new users and the new buckets can be created, which wasn't the case earlier. I am not sure what changes were made here. Posting a step-wise record of the steps executed on this new image for the defect recreate.

Uploading the steps of the Account creation
Uploading the noobaa-core logs in gz format

noobaa-bug-create-db-hung-6869.txt.txt
noobaa-core-baum-21feb22.log.gz

@rkomandu (Collaborator, Author)

Uploading the must-gather (MG) output from when worker2 was down, as done for the re-create above. MG gave an error at the end, which is expected since worker2 was not in the Ready state while MG was running.
must-gather.local.baum-noobaa-core-image-when-worker2down.tar.gz

@baum (Contributor)

baum commented Feb 21, 2022

@rkomandu. This is the codebase for the change. Let me know if you need any further assistance.

@romayalon (Contributor)

That's interesting; the issue doesn't appear, yet @baum only added some log prints.
@rkomandu, is there something different in your env?

@rkomandu (Collaborator, Author)

@romayalon, I posted the steps performed above. Nothing has changed on my system.

@rkomandu (Collaborator, Author)

rkomandu commented Apr 7, 2022

There are updates from the CSI team on the known limitations and on having the CSI attacher ReplicaSet for now. This has been tested with CNSA 5.1.3, CSI 2.5.0 + ODF 4.9.5-4 downstream builds + the latest DAS operator.

When the node running the noobaa-db pod goes down, it takes approximately 6m 30s - 7m to recover, and in that interim no new accounts, exports, or buckets can be created. Once the noobaa-db pod comes back into the Running state, business is as usual.
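A rough way to measure that window during future failover tests (hypothetical helper, not part of the original report; the timeout and sleep values are arbitrary):

```
# Loop until the management RPC answers again, timestamping each failed attempt
while ! timeout 30 noobaa api account_api list_accounts {} > /dev/null 2>&1; do
  echo "$(date +%T) management RPC still unavailable"
  sleep 15
done
echo "$(date +%T) management RPC responding again"
```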

For now, closing this defect. There is an enhancement planned by the CSI team for future releases.

@rkomandu rkomandu closed this as completed Apr 7, 2022