Cannot seem to run ALTER command on replicas of the same shard #316

Open

sudhanvaghebbale opened this issue Jun 28, 2024 · 2 comments

@sudhanvaghebbale

Hi all, I wanted to reach out about an issue I have been facing while running ClickHouse schema migrations with alembic. It may not be related to alembic at all, but I wanted to know whether anyone else has experienced it. Here's what I have been trying to do:

ClickHouse setup

I have a cluster of two ClickHouse nodes that belong to the same shard, i.e. two replicas on one shard, coordinated by ZooKeeper. Both nodes are brought up with Docker.
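As a sanity check of the topology, both nodes should show up under the same shard in system.clusters. A minimal sketch of that check (the connection URL is a placeholder; I go through SQLAlchemy here since it is already in the stack):

import sqlalchemy as sa

# Placeholder URL; substitute your real host/credentials.
engine = sa.create_engine("clickhouse+native://default:@localhost:9000/test")

with engine.connect() as conn:
    # Both nodes should appear with the same shard_num and different replica_num.
    for row in conn.execute(sa.text(
        "SELECT cluster, shard_num, replica_num, host_name FROM system.clusters"
    )):
        print(row)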

These are the schema migrations I ran in the order mentioned:

Migration 1
def upgrade() -> None:
    op.execute(
        """
        CREATE TABLE IF NOT EXISTS table_local ON CLUSTER '{cluster}' (
            {fields here...}
        )
        ENGINE = ReplicatedMergeTree('/clickhouse/tables/{cluster}/{shard}/test/table', '{replica}')
        ORDER BY ({sorting key here...})
        PARTITION BY toYYYYMMDD(timestamp)
        TTL toDateTime(timestamp) + INTERVAL 3 MONTH DELETE;
        """
    )
Migration 2
def upgrade() -> None:
    op.execute(
        """
        CREATE TABLE IF NOT EXISTS table ON CLUSTER '{cluster}' AS table_local
        ENGINE = Distributed('{cluster}', test, table_local, rand());
        """
    )
Migration 3
import time

def upgrade() -> None:
    op.execute(
        """
        DROP TABLE IF EXISTS table ON CLUSTER '{cluster}' SYNC;
        """
    )

    time.sleep(30)

    op.execute(
        """
        ALTER TABLE table_local ON CLUSTER '{cluster}' ADD COLUMN IF NOT EXISTS new_column String;
        """
    )

    time.sleep(30)

    op.execute(
        """
        CREATE TABLE IF NOT EXISTS table ON CLUSTER '{cluster}' AS table_local
        ENGINE = Distributed('{cluster}', test, table_local, rand());
        """
    )
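A possible alternative to the fixed sleeps would be to poll the distributed DDL queue until every task reports Finished. A rough sketch of that idea (it assumes the system.distributed_ddl_queue table is available; wait_for_ddl is a hypothetical helper, not part of alembic):

import time
import sqlalchemy as sa
from alembic import op

def wait_for_ddl(timeout: float = 60.0) -> None:
    """Hypothetical helper: block until no distributed DDL task is pending."""
    conn = op.get_bind()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Count tasks on any host that have not reached the Finished state yet.
        pending = conn.execute(sa.text(
            "SELECT count() FROM system.distributed_ddl_queue WHERE status != 'Finished'"
        )).scalar()
        if pending == 0:
            return
        time.sleep(1)
    raise TimeoutError("distributed DDL tasks still pending after %.0fs" % timeout)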

After running these migrations, the alembic logs indicate that they completed, and I have verified that the alembic_version table contains the last/correct migration.

What I have observed is that only one replica in the shard has the new column; the other replica does not, even though the migration appears to have run successfully. I don't see any errors in clickhouse-server.err.log. In the clickhouse-server logs, I see this line corresponding to the ALTER query:

2024.06.25 15:23:46.767800 [ 700 ] {} <Debug> DDLWorker: Task query-0000000001 has already been executed by replica (clickhouse%2D02:9000) of the same shard.
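For debugging, the replication state of the lagging replica can be inspected along these lines (a minimal sketch; it assumes the test database from the Distributed definition and a connection that lands on the replica missing the column):

import sqlalchemy as sa
from alembic import op

conn = op.get_bind()

# Stuck entries here would explain why the metadata change never landed.
for row in conn.execute(sa.text(
    "SELECT type, create_time, last_exception "
    "FROM system.replication_queue WHERE table = 'table_local'"
)):
    print(row)

# Ask the replica to catch up with the shared replication log.
conn.execute(sa.text("SYSTEM SYNC REPLICA test.table_local"))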

I tried a few remedies to overcome this:

  1. Added a sleep between each DDL command to rule out a timing issue.
  2. Used the alter_sync setting in the query (see the sketch after this list) so the ALTER waits for the other replicas.
  3. Ran all of these commands with plain curl, without alembic, but hit the same issue.
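The alter_sync attempt from item 2 looked roughly like this (alter_sync = 2 tells the query to wait until all replicas have applied the ALTER; 0 = no wait, 1 = wait for the current replica only):

op.execute(
    """
    ALTER TABLE table_local ON CLUSTER '{cluster}'
    ADD COLUMN IF NOT EXISTS new_column String
    SETTINGS alter_sync = 2;
    """
)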

These are the versions I am running:

ClickHouse server version: 24.5.1
ClickHouse client version: 24.5.1.1763 (official build)
Alembic: 1.13.1
clickhouse-sqlalchemy: 0.3.2
sqlalchemy: 2.0.30

I want to know whether anyone else has faced this issue in the past and, if so, how they solved it. Any help will be greatly appreciated. Thanks!

@xzkostyan (Owner)

Hi.

Has alter_sync helped with this case?

@sudhanvaghebbale (Author)

No, I have had no luck with alter_sync either.
