Update kubeflow/katib manifests from v0.14.0 #2273

annajung
Member

Signed-off-by: Anna Jung (VMware) [email protected]

Description of your changes:

  • Sync Katib 0.14.0 (ref)

cc @kubeflow/release-team

@annajung
Member Author

GitHub Actions failed with the error: timed out waiting for the condition on pods/katib-mysql-5bf95ddfcc-587tt

I ran the commands locally against Kubernetes 1.22 and could not reproduce the issue. Is it possible to review the pod logs? cc @kimwnasptd

@annajung
Member Author

annajung commented Aug 30, 2022

After looking into this a bit more, I'm still not sure why it's failing. One approach we could take: instead of waiting for all pods in all namespaces to be "Ready", we could check only the kubeflow namespace:

kubectl wait --for=condition=Ready pods --all -n kubeflow --timeout 180s

If we take that approach, we could also add a check to the Istio installation script to make sure all of its pods are up:

kubectl wait --for=condition=Ready pods --all -n istio-system --timeout 180s

@kimwnasptd wdyt? Not sure if this will fix the issue tbh

Also, we need #2271 to merge first as it fixes the katib test script path
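
The per-namespace checks proposed above could be wrapped in a small helper, sketched here under stated assumptions. The function name wait_ns_ready and the KUBECTL and TIMEOUT variables are hypothetical and not part of the repository's test scripts; the kubectl wait invocation itself is the one quoted in this comment.

```shell
#!/usr/bin/env bash
# Sketch: wait for pods to become Ready one namespace at a time.
# KUBECTL and TIMEOUT are hypothetical knobs added for illustration;
# setting KUBECTL=echo prints the commands instead of running them.
KUBECTL="${KUBECTL:-kubectl}"
TIMEOUT="${TIMEOUT:-180s}"

wait_ns_ready() {
  local ns
  for ns in "$@"; do
    echo "Waiting for pods in namespace ${ns}..."
    # Same command as proposed in the comment, parameterized by namespace.
    if ! "$KUBECTL" wait --for=condition=Ready pods --all -n "$ns" --timeout "$TIMEOUT"; then
      echo "Pods in ${ns} did not become Ready within ${TIMEOUT}" >&2
      # List the pods so the failing one is visible in the CI output.
      "$KUBECTL" get pods -n "$ns" >&2 || true
      return 1
    fi
  done
}
```

Calling wait_ns_ready istio-system kubeflow would check Istio first and stop at the first namespace that times out, which narrows down where a failure comes from. It would not by itself explain why a pod such as katib-mysql never becomes Ready.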

@kimwnasptd
Member

kimwnasptd commented Aug 31, 2022

I'd like to understand why we see this error in the first place, since if the MySQL pod never becomes ready we could hit other problems with testing down the road.

Using this GH Action to ssh into the worker node in my forked repo, which runs this PR (kimwnasptd#2), I found that the mysql pod's container had started but never became ready.

Running kubectl describe on the pod showed the following interesting event:

Warning  Unhealthy         15s (x5 over 75s)  kubelet            Startup probe failed: mysqladmin: [Warning] Using a password on the command line interface can be insecure.
mysqladmin: connect to server at 'localhost' failed
error: 'Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)'
Check that mysqld is running and that the socket: '/var/run/mysqld/mysqld.sock' exists!

And I see the following in the pod logs:

Logs
2022-08-31 15:25:46+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 8.0.29-1.el8 started.
2022-08-31 15:25:46+00:00 [Note] [Entrypoint]: Switching to dedicated user 'mysql'
2022-08-31 15:25:46+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 8.0.29-1.el8 started.
2022-08-31 15:25:46+00:00 [Note] [Entrypoint]: Initializing database files
2022-08-31T15:25:46.194799Z 0 [System] [MY-013169] [Server] /usr/sbin/mysqld (mysqld 8.0.29) initializing of server in progress as process 48
2022-08-31T15:25:46.194840Z 0 [ERROR] [MY-010338] [Server] Can't find error-message file '/usr/share/mysql-8.0/errmsg.sys'. Check error-message file location and 'lc-messages-dir' configuration directive.
2022-08-31T15:25:46.199453Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2022-08-31T15:25:46.537790Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2022-08-31T15:25:47.566600Z 6 [Warning] [MY-010453] [Server] root@localhost is created with an empty password ! Please consider switching off the --initialize-insecure option.
2022-08-31 15:25:49+00:00 [Note] [Entrypoint]: Database files initialized
2022-08-31 15:25:49+00:00 [Note] [Entrypoint]: Starting temporary server
2022-08-31T15:25:49.661301Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.29) starting as process 95
2022-08-31T15:25:49.661360Z 0 [ERROR] [MY-010338] [Server] Can't find error-message file '/usr/share/mysql-8.0/errmsg.sys'. Check error-message file location and 'lc-messages-dir' configuration directive.
2022-08-31T15:25:49.672270Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2022-08-31T15:25:50.218488Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2022-08-31T15:25:50.420162Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2022-08-31T15:25:50.420198Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported for this channel.
2022-08-31T15:25:50.420905Z 0 [Warning] [MY-011810] [Server] Insecure configuration for --pid-file: Location '/var/lib/mysql' in the path is accessible to all OS users. Consider choosing a different directory.
2022-08-31T15:25:50.435929Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Socket: /var/run/mysqld/mysqlx.sock
2022-08-31T15:25:50.435992Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.29'  socket: '/var/lib/mysql/mysql.sock'  port: 0  MySQL Community Server - GPL.
2022-08-31 15:25:50+00:00 [Note] [Entrypoint]: Temporary server started.
Warning: Unable to load '/usr/share/zoneinfo/iso3166.tab' as time zone. Skipping it.
Warning: Unable to load '/usr/share/zoneinfo/leapseconds' as time zone. Skipping it.
Warning: Unable to load '/usr/share/zoneinfo/tzdata.zi' as time zone. Skipping it.
Warning: Unable to load '/usr/share/zoneinfo/zone.tab' as time zone. Skipping it.
Warning: Unable to load '/usr/share/zoneinfo/zone1970.tab' as time zone. Skipping it.
2022-08-31 15:25:52+00:00 [Note] [Entrypoint]: Creating database katib

2022-08-31 15:25:52+00:00 [Note] [Entrypoint]: Stopping temporary server
2022-08-31T15:25:52.353221Z 11 [System] [MY-013172] [Server] Received SHUTDOWN from user root. Shutting down mysqld (Version: 8.0.29).
2022-08-31T15:25:53.878715Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.29)  MySQL Community Server - GPL.
2022-08-31 15:25:54+00:00 [Note] [Entrypoint]: Temporary server stopped

2022-08-31 15:25:54+00:00 [Note] [Entrypoint]: MySQL init process done. Ready for start up.

2022-08-31T15:25:54.585100Z 0 [System] [MY-010116] [Server] /usr/sbin/mysqld (mysqld 8.0.29) starting as process 1
2022-08-31T15:25:54.585163Z 0 [ERROR] [MY-010338] [Server] Can't find error-message file '/usr/share/mysql-8.0/errmsg.sys'. Check error-message file location and 'lc-messages-dir' configuration directive.
2022-08-31T15:25:54.590410Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2022-08-31T15:25:54.773277Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2022-08-31T15:25:54.926548Z 0 [Warning] [MY-010068] [Server] CA certificate ca.pem is self signed.
2022-08-31T15:25:54.926585Z 0 [System] [MY-013602] [Server] Channel mysql_main configured to support TLS. Encrypted connections are now supported for this channel.
2022-08-31T15:25:54.927483Z 0 [Warning] [MY-011810] [Server] Insecure configuration for --pid-file: Location '/var/lib/mysql' in the path is accessible to all OS users. Consider choosing a different directory.
2022-08-31T15:25:54.943057Z 0 [System] [MY-010931] [Server] /usr/sbin/mysqld: ready for connections. Version: '8.0.29'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server - GPL.
2022-08-31T15:25:54.943373Z 0 [System] [MY-011323] [Server] X Plugin ready for connections. Bind-address: '::' port: 33060, socket: /var/run/mysqld/mysqlx.sock
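
One detail visible in the output above: the failing startup probe reports the socket /var/run/mysqld/mysqld.sock, while mysqld reports that it is ready for connections on /var/lib/mysql/mysql.sock. Whether that difference is the cause of the failure is not established in this thread. As a minimal sketch for comparing the two, the helper below lists every distinct socket path found in saved output; the function name socket_paths and the file names in the usage note are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the distinct Unix socket paths mentioned in saved output.
# Matches absolute paths ending in ".sock", which covers both the
# mysqladmin error text and the mysqld "ready for connections" lines.
# Reads the files given as arguments, or stdin when none are given.
socket_paths() {
  grep -ohE '/[A-Za-z0-9_./-]+\.sock' "$@" | LC_ALL=C sort -u
}
```

For example, after saving the kubectl describe output and the pod logs to describe.txt and logs.txt, running socket_paths describe.txt logs.txt shows at a glance whether the probe and the server refer to the same socket.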

@annajung annajung force-pushed the sync-kubeflow-katib-manifests-v0.14.0 branch from 75e399f to 3250ac6 Compare August 31, 2022 18:43
@annajung
Member Author

cc @johnugeorge to see if you have encountered this in the past

@johnugeorge
Member

Is this reproducible? We have been testing latest Katib in CI without any issues

@annajung
Member Author

annajung commented Sep 1, 2022

@johnugeorge It is reproducible through GitHub Actions, but not locally. We have been debugging over ssh to look through the logs and test against the GH environment.

However, even when running through the steps locally, I see the error logs that Kimonas mentioned above, yet the condition check passes:

mysqladmin: connect to server at 'localhost' failed
error: 'Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)'
Check that mysqld is running and that the socket: '/var/run/mysqld/mysqld.sock' exists!

@kimwnasptd
Member

After a discussion with @annajung we decided to comment out the last parts of the GH Action that wait for the Pods to become ready and apply a test Experiment.

This way we won't block the release with this test, and we can work on fixing this in parallel on master.

On a side note, we had seen this action work while developing with @NickLoukas. We are very confident that something changed in the GH Actions VMs, but we'll need to investigate this further.

Signed-off-by: Anna Jung (VMware) <[email protected]>
@kimwnasptd
Member

Thanks @annajung!

/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: annajung, kimwnasptd


@google-oss-prow google-oss-prow bot merged commit cfe27e0 into kubeflow:v1.6-branch Sep 1, 2022
apo-ger added a commit to apo-ger/manifests that referenced this pull request Nov 24, 2022
- Use the updated KinD configuration file
- Trigger the workflows when Kind configuration file changes

  * Exclude Katib workflow. See:
    kubeflow#2273 (comment)

  * Exclude Kserve workflow, until we update Knative version to v1.8. See:
    kubeflow#2325 (comment)

Signed-off-by: Apostolos Gerakaris <[email protected]>
google-oss-prow bot pushed a commit that referenced this pull request Nov 24, 2022
* Fix kind configuration file script

Signed-off-by: Apostolos Gerakaris <[email protected]>

* tests: Refactor kind configuration file

Have one file for the KinD configuration instead of
version specific ones.

Signed-off-by: Apostolos Gerakaris <[email protected]>

* workflows: Update GH Action workflows

- Use the updated KinD configuration file
- Trigger the workflows when Kind configuration file changes

  * Exclude Katib workflow. See:
    #2273 (comment)

  * Exclude Kserve workflow, until we update Knative version to v1.8. See:
    #2325 (comment)

Signed-off-by: Apostolos Gerakaris <[email protected]>

Signed-off-by: Apostolos Gerakaris <[email protected]>
kevin85421 pushed a commit to juliusvonkohout/manifests that referenced this pull request Feb 28, 2023
…#2331)

* Fix kind configuration file script

Signed-off-by: Apostolos Gerakaris <[email protected]>

* tests: Refactor kind configuration file

Have one file for the KinD configuration instead of
version specific ones.

Signed-off-by: Apostolos Gerakaris <[email protected]>

* workflows: Update GH Action workflows

- Use the updated KinD configuration file
- Trigger the workflows when Kind configuration file changes

  * Exclude Katib workflow. See:
    kubeflow#2273 (comment)

  * Exclude Kserve workflow, until we update Knative version to v1.8. See:
    kubeflow#2325 (comment)

Signed-off-by: Apostolos Gerakaris <[email protected]>

Signed-off-by: Apostolos Gerakaris <[email protected]>