MLflow token authentication #2

Merged: 41 commits into develop from feature/mlflow-auth on Jul 18, 2024

Conversation

@gmertes (Member) commented Jun 27, 2024

  • Add the TokenAuth class to handle authentication with the Keycloak server in our MLflow deployment.
  • Add the new command anemoi-training mlflow login

Wall of text below, mostly for documentation purposes.

About tokens

The ECMWF MLflow server is protected with token authentication, provided by a centrally managed Keycloak server. There are two kinds of tokens: refresh tokens and access tokens.

Refresh tokens are long-lived (30 days) and stand in for your credentials when requesting tokens. The first time you authenticate with your username and password, you receive a refresh token. Within its expiry time, this token can be used for all subsequent token requests, avoiding a username/password prompt every time. Using the refresh token, you request access tokens.

Access tokens are short-lived (on the order of minutes), and these are the tokens that actually authenticate you to the MLflow server (they are attached to the header of each HTTP request). Before each GET/POST to the MLflow server, we check that we have a valid access token; if not, we request a new one.

Requesting a new access token also gives you a new refresh token, which is again valid for 30 days. So as long as we hold a valid refresh token and make a request within its 30-day window, we can keep requesting tokens indefinitely. If the refresh token expires, you need to authenticate with username and password again.

A valid token in this context generally means one that is within its expiry time. Tokens can also be invalidated on the server side, e.g. to revoke someone's access.
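
To make the lifecycle concrete, here is a minimal sketch of the check-and-refresh logic described above. It is not the TokenAuth implementation from this PR: the /refreshtoken endpoint matches the benchmark logs below, but the request payload and response field names are assumptions.

import time

import requests


class TokenSketch:
    """Illustrative sketch of the token flow, not the real TokenAuth API."""

    def __init__(self, url):
        self.url = url
        self.refresh_token = None  # long-lived (30 days), persisted to disk
        self.access_token = None   # short-lived (minutes), kept in memory
        self.access_expires = 0    # unix time at which the access token expires

    def authenticate(self):
        # Fast path: a valid access token is already in memory, do nothing.
        if time.time() < self.access_expires:
            return
        # Otherwise trade the refresh token for a new access token. The server
        # also rotates the refresh token, restarting its 30-day expiry window.
        response = requests.post(
            f"{self.url}/refreshtoken",
            json={"refresh_token": self.refresh_token},  # payload is assumed
        ).json()
        self.refresh_token = response["refresh_token"]   # field names assumed
        self.access_token = response["access_token"]
        self.access_expires = time.time() + response["expires_in"]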

Logging in

Using the mlflow login command we acquire a new refresh token, which is saved to disk. For simplicity we call this action "logging in". As explained above, this action reuses an existing refresh token if one is on disk. If there is no refresh token, or the stored one is invalid, the command prompts the user interactively for their username and password.

Even if we have a refresh token on disk, we need to "log in" to make sure that this token is accepted by the server.

Therefore, this command should be called once before starting each new training run. Most importantly, it needs to be called in an interactive session so that the user can input their credentials if prompted. TBD how to integrate this into the training workflow: requiring a manual call before the run risks users forgetting, resulting in the job failing because of invalid tokens. We could make this action transparent to the user, called automatically before starting the run.
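
For illustration, the login flow could look roughly like this. The helper names (request_refresh_token, save) and the exception handling are hypothetical placeholders, not the API of this PR; only the overall flow follows the description above.

import getpass


def login(auth):
    try:
        # Reuse the refresh token on disk if the server still accepts it.
        auth.authenticate()
    except Exception:  # expired, revoked, or missing token (illustrative)
        username = input("Username: ")
        password = getpass.getpass("Password: ")
        auth.request_refresh_token(username, password)  # hypothetical helper
    auth.save()  # hypothetical helper: persist the rotated refresh token to disk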

During training

Once we have the new refresh token, we are guaranteed access to the MLflow server for at least 30 days. So even when running in a non-interactive batch job, we know we will never need to prompt the user for credentials.

At the beginning of the training job, the refresh token is loaded from disk and used to request access tokens for all subsequent calls to MLflow.
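
As background on how an access token typically reaches the server: the MLflow client can pick up a bearer token from the MLFLOW_TRACKING_TOKEN environment variable and attach it to each HTTP request. Whether this PR uses that route or sets the header itself is not shown here; the sketch below only illustrates the idea.

import os


def ensure_mlflow_auth(auth):
    # Refresh the access token if needed (a no-op while one is still valid),
    # then expose it to the MLflow client as a bearer token.
    auth.authenticate()
    os.environ["MLFLOW_TRACKING_TOKEN"] = auth.access_token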

How to integrate this

At the start of the training run, a TokenAuth object should be initialised, which loads the refresh token from disk. Before every call to MLflow, TokenAuth.authenticate() should be called to check for and, if needed, acquire an access token. The access token is kept in memory and reused for subsequent calls. If a valid token is already in memory, authenticate() does nothing and has minimal impact on performance.
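
A short usage sketch of that contract (the URL is a placeholder, and the metric call is just an example of "a call to MLflow"):

import mlflow

from anemoi.training.diagnostics.mlflow.auth import TokenAuth

# Initialisation loads the refresh token from disk.
auth = TokenAuth(url="https://mlflow.example")  # placeholder URL

auth.authenticate()  # first call: requests an access token from the server

# ... before every subsequent call to MLflow:
auth.authenticate()  # no-op while the in-memory access token is still valid
mlflow.log_metric("loss", 0.123)  # example MLflow call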


gmertes commented Jul 3, 2024

Some benchmarks. We benchmark the authenticate function, which will be called before every logging call during training and has the potential to impact performance.

Synthetic benchmark 1

We time the initialisation and the first call to authenticate, which does the actual token request, followed by 10 million repetitions. We expect to see only one call to the authentication server, with the repeats doing nothing since there is still a valid token in memory.

import logging
import timeit
from contextlib import contextmanager
from time import perf_counter

from anemoi.training.diagnostics.mlflow.auth import TokenAuth

logging.basicConfig(format="[%(asctime)s] %(levelname)s : %(message)s", level=logging.DEBUG)
log = logging.getLogger(__name__)


@contextmanager
def timer():
    # Yields a callable that returns the elapsed time since entering the block.
    start = perf_counter()
    yield lambda: perf_counter() - start


with timer() as t:
    auth = TokenAuth(url="***")

log.info(f"Init: {t():.4} sec")

with timer() as t:
    auth.authenticate()

log.info(f"Authenticate, request access token: {t():.4} sec")

repeat = 10_000_000
total = timeit.timeit(auth.authenticate, number=repeat)

log.info(f"Authenticate, subsequent calls: {total/repeat:.4} sec")

Result:

[2024-07-03 12:16:12,945] INFO : Init: 0.0007962 sec
[2024-07-03 12:16:12,947] DEBUG : Starting new HTTPS connection (1): ***:443
[2024-07-03 12:16:12,997] DEBUG : ***:443 "POST /refreshtoken HTTP/1.1" 200 None
[2024-07-03 12:16:12,997] INFO : Access token refreshed.
[2024-07-03 12:16:12,997] INFO : Authenticate, request access token: 0.05267 sec
[2024-07-03 12:16:16,218] INFO : Authenticate, subsequent calls: 3.22e-07 sec

Interpretation:
Output is as expected. Only the first call to authenticate made an actual token request to the server; all subsequent repeats did nothing.

Synthetic benchmark 2

We time how long it takes to obtain a new access token from Atos. We request 100 tokens from the server with no throttling, forcing authenticate to obtain a new token each time by clearing the access_expires attribute, which simulates an expired token.

def token_bench():
    auth.authenticate()
    # Mark the access token as expired so the next call hits the server again.
    auth.access_expires = 0

repeat = 100
total = timeit.timeit(token_bench, number=repeat)

log.info(f"Token request: {total/repeat:.4} sec")

Result:

[2024-07-03 12:19:24,993] INFO : Access token refreshed.
[2024-07-03 12:19:25,065] INFO : Access token refreshed.
[2024-07-03 12:19:25,132] INFO : Access token refreshed.
... 
[2024-07-03 12:19:59,876] INFO : Access token refreshed.
[2024-07-03 12:19:59,916] INFO : Access token refreshed.
[2024-07-03 12:20:00,971] INFO : Access token refreshed.
[2024-07-03 12:20:01,015] INFO : Access token refreshed.
[2024-07-03 12:20:01,260] INFO : Access token refreshed.

[2024-07-03 12:20:01,260] INFO : Token request: 0.3632 sec

Interpretation:
The result is as expected. A token request takes under 0.5 s on average and, with the current access token expiry time of 5 minutes, we only make one token request every ~4 minutes in practice (roughly 0.36 s per ~240 s of training, i.e. about 0.15% overhead). So there is minimal to no impact on training performance.

gmertes commented Jul 8, 2024

Describe your changes

Adds support for token authentication with the new MLflow server.

I tested single- and multi-GPU training extensively. I also benchmarked the authenticate function in a real training run, and the results are identical to the synthetic benchmarks above.

I have not written formal documentation, because it is TBD how we document this in a way that is not specific to ECMWF. The code itself is fully documented and tries to be as general as possible.

  • Adds a new command anemoi-training mlflow login. This command needs to be run the first time you use the new server, or if your last training run was more than 30 days ago.
  • Copies the AIFSMLflowLogger from aifs-mono.
  • Adds token authentication to the AIFSMLflowLogger.
  • Also copies get_code_logger from aifs-mono, since AIFSMLflowLogger uses it.

Checks fail because anemoi-models doesn't have a distribution yet and some of the tests (not related to this PR) are broken. I suggest fixing that in a separate PR since it doesn't impact the functionality of this one.

Type of change

  • New feature (non-breaking change which adds functionality)

Checklist before requesting a review

  • I have performed a self-review of my code
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation and docstrings to reflect the changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have ensured that the code is still pip-installable after the changes and runs
  • I have run this on single GPU
  • I have run this on multi-GPU or multi-node
  • I have run this to work on LUMI (or made sure the changes work independently)
  • I have run the Benchmark Profiler against the old version of the code

Tag possible reviewers

@JesperDramsch @mchantry

@gmertes gmertes marked this pull request as ready for review July 8, 2024 15:07
@gmertes gmertes requested a review from anaprietonem July 8, 2024 15:11
@anaprietonem (Contributor) left a comment

Great work Gert! I have added some comments, mostly related to readability, and some questions to clarify minor aspects.

Review comments on: pyproject.toml, src/anemoi/training/diagnostics/mlflow/auth.py, src/anemoi/training/diagnostics/callbacks.py, src/anemoi/training/commands/mlflow.py
@gmertes gmertes mentioned this pull request Jul 12, 2024
(cherry picked from commit ecmwf-lab/aifs-mono@b856ddd)

* add ability to continue run in mlflow logs and not create child run

add model init logic for weights only and all

bugfix: commented out synchronous arg in MLflow logger

fixed overwriting function with hidden property in AIFSMLflowLogger

* Update logging.py

Simplifying the if block for setting log_hyperparams

* removed synchronous arg from config, refined code

* Update logged message

* removed synchronous param from AIFSMLflowLogger

* Added plot async param back

* change default setting for on_resume_create_child to False to maintain default behaviour from before this PR

---------

Co-authored-by: Rilwan (Akanni) Adewoyin <[email protected]>
@gmertes gmertes merged commit 6f7fcf7 into develop Jul 18, 2024
4 of 8 checks passed
@gmertes gmertes deleted the feature/mlflow-auth branch July 18, 2024 10:42
@gmertes gmertes added this to the First Release milestone Jul 18, 2024