Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/more reliable zookeeper #992

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

pvbouwel
Copy link
Contributor

This change will just create 1 KazooClient and keeps it running in the background. The KazooClient itself has logic to recover connectivity.

In order to deal with intermittent failure we also retry operations if they fail due to connectivity related issues. At this time the once that came from test scenario's were kazoo.exceptions.SessionExpiredError and kazoo.exceptions.ConnectionLoss but if other retryable exceptions are identified they could be easily added.

Peter Van Bouwel added 3 commits January 10, 2025 15:51
Rather than opening and closing connections for each Zookeeper interaction use the functionality in the KazooClient which maintains a connection with Zookeeper.

This lowers resource consumption and improves latency.
KazooClient was mocked for some tests make sure this mocking is maintained with the new way of using KazooClient.
The IDE will get confused because the client will create methods at runtime. By referencing the KazooClient methods in the class the IDE can use hints based on the downstream methods.
@pvbouwel pvbouwel requested a review from jdries January 10, 2025 17:18
Copy link
Member

@soxofaan soxofaan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(apparently I still had these comments from some time ago)

from time import sleep
from typing import TYPE_CHECKING

if TYPE_CHECKING:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are all stdlib/thirdparty (based) symbols, so I don't think it's necessary to guard this with if TYPE_CHECKING
(that's typically just a trick to dig you out of circular import holes)

from kazoo.protocol.states import ZnodeStat

# Helper type which corresponds to what Zookeeper for callbacks as used by watches
ZEvent = tuple[bytes, ZnodeStat]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we're still on python 3.8, I don't think we can already use type annotations like tuple[...] and still have to stick to Tuple[...]

@soxofaan
Copy link
Member

soxofaan commented Feb 3, 2025

FYI: I just merged latest master in this feature branch to make sure this PR doesn't get stale


@contextlib.contextmanager
def _zk_client(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can't keep this usage pattern for now:

 with self._zk_client() as zk:
     ...

but return the new wrapper instead there.

That would make the introduced change of this PR smaller (in LoC).
And it would even allow to gradually roll this out (e.g. with a feature flag or config)

exceptions=(NoActiveConnectionException,),
tries=600,
logger=None,
)(_wait_100ms_for_zookeeper_connection)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while reviewing this I got a bit lost in this _wait_100ms_for_zookeeper_connection/_wait_up_to_1minute_for_zookeeper_connection construction.

Can't it be simplified to a single helper like

def _wait_on_connected(self, timeout: float=60, sleep=0.1):
    end = time.time() + timeout
    while time.time() < end:
        if self._zk.connected:
            return
        if self._logger is not None:
            self._logger.debug(f"No active connection for zookeeper client ({self._zkhosts})")
        time.sleep(sleep)
    raise ZookeeperClient.NoActiveConnectionException()

(PS: I would avoid hardcoding implementation details like 100ms and 1minute in the function names to make it easier to fine-tune these when necessary)

@soxofaan
Copy link
Member

soxofaan commented Feb 4, 2025

while reviewing I also started wondering if it wouldn't be easier to just add an option for reuse to

@contextlib.contextmanager
def _zk_client(self):
kz_retry = KazooRetry(max_tries=10, delay=0.5, backoff=2)
zk = KazooClient(hosts=self._hosts,connection_retry=kz_retry,
command_retry=kz_retry, timeout=3.0)
zk.start(timeout=15.0)
try:
yield zk
finally:
zk.stop()
zk.close()

e.g. PoC:

that approach would also allow to roll client reuse out gradually (e.g. with config or feature flags), instead of a a complete KazooClient replacement as proposed in this PR #992

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants