MLflow token authentication #2

Merged: 41 commits into develop from feature/mlflow-auth on Jul 18, 2024

Conversation

@gmertes (Member) commented Jun 27, 2024

  • Add the TokenAuth class to handle authentication with the Keycloak server in our MLflow deployment.
  • Add the new command anemoi-training mlflow login

Wall of text below, mostly for documentation purposes.

About tokens

The ECMWF MLflow server is protected with token authentication, provided by a centrally managed Keycloak server. There are two kinds of tokens: refresh tokens and access tokens.

Refresh tokens are long-lived (30 days) and stand in for your credentials when requesting tokens. The first time you authenticate with your username and password, you receive a refresh token. Within its expiry time, this token can be used for all subsequent token requests, avoiding a username/password prompt every time. Using the refresh token, you request access tokens.

Access tokens are short-lived (on the order of minutes), and these are the tokens that actually authenticate you to the MLflow server (they are attached to the header of each HTTP request). Before each GET/POST to the MLflow server, we check that we have a valid access token; if not, we request a new one.

Requesting a new access token also gives you a new refresh token, which is again valid for 30 days. So as long as we hold a valid refresh token and make a request within its 30-day window, we can keep requesting tokens indefinitely. If the refresh token expires, you need to authenticate with username and password again.

A valid token in this context generally means one that is within its expiry time. Tokens can also be invalidated on the server side, e.g. to revoke someone's access.
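
To make the lifecycle concrete, here is a minimal sketch of the check-and-refresh logic described above. It is not the TokenAuth implementation from this PR: the /refreshtoken endpoint matches the benchmark logs below, but the request payload and response field names are assumptions.

import time

import requests


class TokenSketch:
    """Illustrative sketch of the token flow, not the real TokenAuth API."""

    def __init__(self, url):
        self.url = url
        self.refresh_token = None  # long-lived (30 days), persisted to disk
        self.access_token = None   # short-lived (minutes), kept in memory
        self.access_expires = 0    # unix time at which the access token expires

    def authenticate(self):
        # Fast path: a valid access token is already in memory, do nothing.
        if time.time() < self.access_expires:
            return
        # Otherwise trade the refresh token for a new access token. The server
        # also rotates the refresh token, restarting its 30-day expiry window.
        response = requests.post(
            f"{self.url}/refreshtoken",
            json={"refresh_token": self.refresh_token},  # payload is assumed
        ).json()
        self.refresh_token = response["refresh_token"]   # field names assumed
        self.access_token = response["access_token"]
        self.access_expires = time.time() + response["expires_in"]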

Logging in

Using the mlflow login command we acquire a new refresh token, which is saved to disk. For simplicity we call this action "logging in". As explained above, this action reuses an existing refresh token if one is on disk. If there is no refresh token, or the stored one is invalid, the command prompts the user interactively for their username and password.

Even if we have a refresh token on disk, we need to "log in" to make sure that this token is accepted by the server.

Therefore, this command should be called once before starting each new training run. Most importantly, it needs to be called in an interactive session so that the user can input their credentials if prompted. TBD how to integrate this into the training workflow: requiring a manual call before the run risks users forgetting, resulting in the job failing because of invalid tokens. We could make this action transparent to the user, called automatically before starting the run.
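
For illustration, the login flow could look roughly like this. The helper names (request_refresh_token, save) and the exception handling are hypothetical placeholders, not the API of this PR; only the overall flow follows the description above.

import getpass


def login(auth):
    try:
        # Reuse the refresh token on disk if the server still accepts it.
        auth.authenticate()
    except Exception:  # expired, revoked, or missing token (illustrative)
        username = input("Username: ")
        password = getpass.getpass("Password: ")
        auth.request_refresh_token(username, password)  # hypothetical helper
    auth.save()  # hypothetical helper: persist the rotated refresh token to disk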

During training

Once we have the new refresh token, we are guaranteed access to the MLflow server for at least 30 days. So even when running in a non-interactive batch job, we know we will never need to prompt the user for credentials.

At the beginning of the training job, the refresh token is loaded from disk and used to request access tokens for all subsequent calls to MLflow.
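
As background on how an access token typically reaches the server: the MLflow client can pick up a bearer token from the MLFLOW_TRACKING_TOKEN environment variable and attach it to each HTTP request. Whether this PR uses that route or sets the header itself is not shown here; the sketch below only illustrates the idea.

import os


def ensure_mlflow_auth(auth):
    # Refresh the access token if needed (a no-op while one is still valid),
    # then expose it to the MLflow client as a bearer token.
    auth.authenticate()
    os.environ["MLFLOW_TRACKING_TOKEN"] = auth.access_token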

How to integrate this

At the start of the training run, a TokenAuth object should be initialised, which loads the refresh token from disk. Before every call to MLflow, TokenAuth.authenticate() should be called to check for and, if needed, acquire an access token. The access token is kept in memory and reused for subsequent calls. If a valid token is already in memory, authenticate() does nothing and has minimal impact on performance.
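
A short usage sketch of that contract (the URL is a placeholder, and the metric call is just an example of "a call to MLflow"):

import mlflow

from anemoi.training.diagnostics.mlflow.auth import TokenAuth

# Initialisation loads the refresh token from disk.
auth = TokenAuth(url="https://mlflow.example")  # placeholder URL

auth.authenticate()  # first call: requests an access token from the server

# ... before every subsequent call to MLflow:
auth.authenticate()  # no-op while the in-memory access token is still valid
mlflow.log_metric("loss", 0.123)  # example MLflow call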


gmertes commented Jul 3, 2024

Some benchmarks. We benchmark the authenticate function, which will be called before every logging call during training and has the potential to impact performance.

Synthetic benchmark 1

We time the initialisation and the first call to authenticate, which does the actual token request, followed by 10 million repetitions. We expect to see only one call to the authentication server, with the repeats doing nothing since there is still a valid token in memory.

import logging
import timeit
from contextlib import contextmanager
from time import perf_counter

from anemoi.training.diagnostics.mlflow.auth import TokenAuth

logging.basicConfig(format="[%(asctime)s] %(levelname)s : %(message)s", level=logging.DEBUG)
log = logging.getLogger(__name__)


@contextmanager
def timer():
    # Yields a callable that returns the elapsed time since entering the block.
    start = perf_counter()
    yield lambda: perf_counter() - start


with timer() as t:
    auth = TokenAuth(url="***")

log.info(f"Init: {t():.4} sec")

with timer() as t:
    auth.authenticate()

log.info(f"Authenticate, request access token: {t():.4} sec")

repeat = 10_000_000
total = timeit.timeit(auth.authenticate, number=repeat)

log.info(f"Authenticate, subsequent calls: {total/repeat:.4} sec")

Result:

[2024-07-03 12:16:12,945] INFO : Init: 0.0007962 sec
[2024-07-03 12:16:12,947] DEBUG : Starting new HTTPS connection (1): ***:443
[2024-07-03 12:16:12,997] DEBUG : ***:443 "POST /refreshtoken HTTP/1.1" 200 None
[2024-07-03 12:16:12,997] INFO : Access token refreshed.
[2024-07-03 12:16:12,997] INFO : Authenticate, request access token: 0.05267 sec
[2024-07-03 12:16:16,218] INFO : Authenticate, subsequent calls: 3.22e-07 sec

Interpretation:
Output is as expected. Only the first call to authenticate made an actual token request to the server; all subsequent repeats did nothing.

Synthetic benchmark 2

We time how long it takes to obtain a new access token from Atos. We request 100 tokens from the server with no throttling, forcing authenticate to obtain a new token each time by clearing the access_expires attribute, which simulates an expired token.

def token_bench():
    auth.authenticate()
    # Mark the access token as expired so the next call hits the server again.
    auth.access_expires = 0

repeat = 100
total = timeit.timeit(token_bench, number=repeat)

log.info(f"Token request: {total/repeat:.4} sec")

Result:

[2024-07-03 12:19:24,993] INFO : Access token refreshed.
[2024-07-03 12:19:25,065] INFO : Access token refreshed.
[2024-07-03 12:19:25,132] INFO : Access token refreshed.
... 
[2024-07-03 12:19:59,876] INFO : Access token refreshed.
[2024-07-03 12:19:59,916] INFO : Access token refreshed.
[2024-07-03 12:20:00,971] INFO : Access token refreshed.
[2024-07-03 12:20:01,015] INFO : Access token refreshed.
[2024-07-03 12:20:01,260] INFO : Access token refreshed.

[2024-07-03 12:20:01,260] INFO : Token request: 0.3632 sec

Interpretation:
The result is as expected. A token request takes under 0.5 s on average and, with the current access token expiry time of 5 minutes, we only make one token request every ~4 minutes in practice (roughly 0.36 s per ~240 s of training, i.e. about 0.15% overhead). So there is minimal to no impact on training performance.

gmertes commented Jul 8, 2024

Describe your changes

Adds support for token authentication with the new MLflow server.

I tested single- and multi-GPU training extensively. I also benchmarked the authenticate function in a real training run, and the results are identical to the synthetic benchmarks above.

I have not written formal documentation, because it is TBD how we document this in a way that is not specific to ECMWF. The code itself is fully documented and tries to be as general as possible.

  • Adds a new command anemoi-training mlflow login. This command needs to be run the first time you use the new server, or if your last training run was more than 30 days ago.
  • Copies the AIFSMLflowLogger from aifs-mono.
  • Adds token authentication to the AIFSMLflowLogger.
  • Also copies get_code_logger from aifs-mono, since AIFSMLflowLogger uses it.

Checks fail because anemoi-models doesn't have a distribution yet and some of the tests (not related to this PR) are broken. I suggest fixing that in a separate PR since it doesn't impact the functionality of this one.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist before requesting a review

  • I have performed a self-review of my code
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation and docstrings to reflect the changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have ensured that the code is still pip-installable after the changes and runs
  • I have run this on single GPU
  • I have run this on multi-GPU or multi-node
  • I have run this to work on LUMI (or made sure the changes work independently)
  • I have run the Benchmark Profiler against the old version of the code

Tag possible reviewers

@JesperDramsch @mchantry

@gmertes gmertes marked this pull request as ready for review July 8, 2024 15:07
@gmertes gmertes requested a review from anaprietonem July 8, 2024 15:11
@anaprietonem (Contributor) left a comment

Great work Gert! I have added some comments, mostly related to readability, and some questions to clarify minor aspects.

Review comments on: pyproject.toml, src/anemoi/training/diagnostics/mlflow/auth.py, src/anemoi/training/diagnostics/callbacks.py, src/anemoi/training/commands/mlflow.py
@gmertes gmertes mentioned this pull request Jul 12, 2024
(cherry picked from commit ecmwf-lab/aifs-mono@b856ddd)

* add ability to continue run in mlflow logs and not create child run

add model init logic for weights only and all

bugfix: commented out synchronous arg in MLflow logger

fixed overwriting function with hidden property in AIFSMLflowLogger

* Update logging.py

Simplifying the if block for setting log_hyperparams

* removed synchronous arg from config, refined code

* Update logged message

* removed synchronous param from AIFSMLflowLogger

* Added plot async param back

* change default setting for on_resume_create_child to False to maintain default behaviour from before this PR

---------

Co-authored-by: Rilwan (Akanni) Adewoyin <[email protected]>
@gmertes gmertes merged commit 6f7fcf7 into develop Jul 18, 2024
4 of 8 checks passed
@gmertes gmertes deleted the feature/mlflow-auth branch July 18, 2024 10:42
@gmertes gmertes added this to the First Release milestone Jul 18, 2024