MLflow token authentication #2
Some benchmarks. We benchmark the

Synthetic benchmark 1

We time the initialisation, the first call to

```python
import logging
import time
import timeit
from contextlib import contextmanager
from time import perf_counter

from anemoi.training.diagnostics.mlflow.auth import TokenAuth

logging.basicConfig(format="[%(asctime)s] %(levelname)s : %(message)s", level=logging.DEBUG)
log = logging.getLogger(__name__)


@contextmanager
def timer():
    # Yield a callable that returns the seconds elapsed since the block was entered.
    start = perf_counter()
    yield lambda: perf_counter() - start


with timer() as t:
    auth = TokenAuth(url="***")
log.info(f"Init: {t():.4} sec")

with timer() as t:
    auth.authenticate()
log.info(f"Authenticate, request access token: {t():.4} sec")

# Average cost of authenticate() once a valid access token is already in memory.
repeat = 10_000_000
total = timeit.timeit(auth.authenticate, number=repeat)
log.info(f"Authenticate, subsequent calls: {total/repeat:.4} sec")
```

Result:

```
[2024-07-03 12:16:12,945] INFO : Init: 0.0007962 sec
[2024-07-03 12:16:12,947] DEBUG : Starting new HTTPS connection (1): ***:443
[2024-07-03 12:16:12,997] DEBUG : ***:443 "POST /refreshtoken HTTP/1.1" 200 None
[2024-07-03 12:16:12,997] INFO : Access token refreshed.
[2024-07-03 12:16:12,997] INFO : Authenticate, request access token: 0.05267 sec
[2024-07-03 12:16:16,218] INFO : Authenticate, subsequent calls: 3.22e-07 sec
```

Interpretation:

Synthetic benchmark 2

We time how long it takes to obtain a new access token from Atos. We request 100 tokens from the server with no throttling. We force

```python
def token_bench():
    auth.authenticate()
    # Reset the expiry so that the next call has to request a new token.
    auth.access_expires = 0


repeat = 100
total = timeit.timeit(token_bench, number=repeat)
log.info(f"Token request: {total/repeat:.4} sec")
```

Result:

```
[2024-07-03 12:19:24,993] INFO : Access token refreshed.
[2024-07-03 12:19:25,065] INFO : Access token refreshed.
[2024-07-03 12:19:25,132] INFO : Access token refreshed.
...
[2024-07-03 12:19:59,876] INFO : Access token refreshed.
[2024-07-03 12:19:59,916] INFO : Access token refreshed.
[2024-07-03 12:20:00,971] INFO : Access token refreshed.
[2024-07-03 12:20:01,015] INFO : Access token refreshed.
[2024-07-03 12:20:01,260] INFO : Access token refreshed.
[2024-07-03 12:20:01,260] INFO : Token request: 0.3632 sec
```

Interpretation:
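As a rough cross-check of the numbers above, the gaps between the visible log timestamps can be computed directly. This is only a sketch: it uses the lines shown in the two result listings, the elided middle of the second log is not reconstructed, and the helper name `gaps_ms` is made up for illustration.

```python
from datetime import datetime, timedelta


def gaps_ms(stamps):
    """Return the time between consecutive log timestamps, in whole milliseconds."""
    times = [datetime.strptime(s, "%H:%M:%S,%f") for s in stamps]
    return [(b - a) // timedelta(milliseconds=1) for a, b in zip(times, times[1:])]


# Benchmark 1: time between the last two INFO lines, i.e. the 10,000,000 cached calls.
bench1_gap = gaps_ms(["12:16:12,997", "12:16:16,218"])

# Benchmark 2: visible "Access token refreshed." lines before and after the "...".
head_gaps = gaps_ms(["12:19:24,993", "12:19:25,065", "12:19:25,132"])
tail_gaps = gaps_ms(
    ["12:19:59,876", "12:19:59,916", "12:20:00,971", "12:20:01,015", "12:20:01,260"]
)

print(bench1_gap, head_gaps, tail_gaps)
```

On the visible lines, the 3221 ms gap in the first benchmark agrees with the reported 3.22e-07 sec per call over 10,000,000 calls, and the gaps in the second benchmark range from 40 ms to about 1 s.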
Describe your changes

Adds support for token authentication with the new MLflow server. I tested single and multi GPU training extensively. I also benchmarked the

I have not written formal documentation, because it is TBD how we document this in a way that is not specific to ECMWF. The code itself is fully documented, and tries to be as general as possible.

Checks fail because

Type of change

Please delete options that are not relevant.

Checklist before requesting a review

Tag possible reviewers
Great work Gert! I have added some comments mostly related to readability and some questions just to clarify minor aspects
e8d97b3 to 0fb6067
488e903 to ca7ce4d
(cherry picked from commit ecmwf-lab/aifs-mono@b856ddd)

* add ability to continue run in mlflow logs and not create child run
  add model init logic for weights only and all
  bugfix: commented out synchronous arg in MLFLOW logger
  fixed overwriting function with hidden property in AIFSMLflowLogger
* Update logging.py
  Simplifying the if block for setting log_hyperparams
* removed synchronous arg from config, refined code
* Update logged message
* removed synchronous param from AIFSMLflowLogger
* Added plot async param back
* change default setting for on_resume_create_child to False to maintain default behaviour from before this PR

---------

Co-authored-by: Rilwan (Akanni) Adewoyin <[email protected]>
ca7ce4d to 60a2928
`TokenAuth` class to handle authentication with the keycloak server in our MLflow deployment.
`anemoi-training mlflow login`
Wall of text below, mostly for documentation purposes.
About tokens
The ECMWF MLflow server is protected with token authentication, provided by a centrally managed keycloak server. There are two kinds of tokens: refresh and access tokens.
Refresh tokens are long lived (30 days) and they replace your credentials when requesting tokens. The first time you authenticate with your username and password, you get a refresh token. This token can be used for all subsequent token requests within its expiry time, to avoid having to prompt for user and password every time. Using the refresh token, you request access tokens.
Access tokens are short lived (order of minutes), and these are the tokens that actually authenticate you to the MLflow server (they are attached to the header of each HTTP request). Before each GET/POST to the MLflow server, we need to check that we have a valid access token to use. If not, we request a new one.
Requesting a new access token also gives you a new refresh token, which is again valid for 30 days. So as long as we hold a valid refresh token and make the next request within its 30 days, we can keep requesting tokens indefinitely. If your refresh token expires, you need to authenticate with user/pass again.
A token being valid in this context generally means one that is within the expiry time. Tokens can also be invalidated on the server side, e.g. to revoke someone's access.
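The token lifecycle described above can be sketched as follows. This is an illustration under assumptions, not the code in this PR: the `Token` class, the `request_access_token` function and the 5 minute access lifetime are made up for the example, and the server round trip is replaced by a local stand-in. Server-side revocation cannot be detected by a local expiry check like this.

```python
import time
from dataclasses import dataclass

REFRESH_LIFETIME = 30 * 24 * 3600  # 30 days, as stated above
ACCESS_LIFETIME = 5 * 60  # assumption: "order of minutes", exact value unknown


@dataclass
class Token:
    value: str
    expires: float  # Unix timestamp after which the token is no longer valid

    def is_valid(self, now=None):
        now = time.time() if now is None else now
        return now < self.expires


def request_access_token(refresh, now):
    """Hypothetical stand-in for the server: returns (access, new_refresh)."""
    if not refresh.is_valid(now):
        raise PermissionError("refresh token expired, authenticate with user/pass again")
    access = Token("dummy-access", now + ACCESS_LIFETIME)
    new_refresh = Token("dummy-refresh", now + REFRESH_LIFETIME)
    return access, new_refresh


# Simulated clock: a request made on day 29 succeeds and extends the refresh token.
day = 24 * 3600
refresh = Token("dummy-refresh", 30 * day)
access, refresh = request_access_token(refresh, now=29 * day)
still_valid_on_day_58 = refresh.is_valid(58 * day)
access_valid_after_an_hour = access.is_valid(29 * day + 3600)
```

In this simulation the refresh token obtained on day 29 is still valid on day 58, while the access token is no longer valid one hour after it was issued.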
Logging in
Using the `mlflow login` command we acquire a new refresh token, which is saved to disk. For simplicity we call this action "logging in". As explained above, this action uses an existing refresh token, if it's on disk. If there is no refresh token, or an invalid one, the command will prompt the user interactively to input their username and password.

Even if we have a refresh token on disk, we need to "log in" to make sure that this token is accepted by the server.

Therefore, this command should be called once every time before starting a new training run. Most importantly, it needs to be called in an interactive session so that the user can input their credentials if prompted. TBD how to integrate this into the training workflow. Requiring a manual call before the run carries the risk that users forget, and the job then fails because of invalid tokens. We could make this action transparent to the user by calling it automatically before starting the run.
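The login flow described above could look roughly like this. It is a minimal sketch, not the implementation in this PR: the `login` function, its four parameters and the JSON file layout are all assumptions made for illustration, and the server calls are replaced by fakes in the usage part.

```python
import json
import tempfile
import time
from pathlib import Path


def login(token_file, exchange_refresh, exchange_credentials, prompt):
    """Acquire a new refresh token and save it to disk.

    Hypothetical helpers (not part of the PR):
      exchange_refresh(token) -> dict, or None if the server rejects the token
      exchange_credentials(user, password) -> dict
      prompt() -> (user, password), asked interactively
    Each dict is assumed to look like {"refresh_token": str, "expires": float}.
    """
    token_file = Path(token_file)
    new_token = None

    if token_file.exists():
        saved = json.loads(token_file.read_text())
        if saved["expires"] > time.time():
            # Not expired locally, but the server still has the final say.
            new_token = exchange_refresh(saved["refresh_token"])

    if new_token is None:
        # No token on disk, or it is expired or rejected: fall back to credentials.
        user, password = prompt()
        new_token = exchange_credentials(user, password)

    token_file.write_text(json.dumps(new_token))
    return new_token


# Usage with fake server calls, to show when the prompt is triggered.
prompts = []


def fake_prompt():
    prompts.append("asked")
    return "user", "secret"


def fake_issue(*_):
    return {"refresh_token": "dummy", "expires": time.time() + 30 * 24 * 3600}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "token.json"
    login(path, fake_issue, fake_issue, fake_prompt)  # nothing on disk: prompts
    login(path, fake_issue, fake_issue, fake_prompt)  # reuses the saved token
    saved_after = json.loads(path.read_text())

print(len(prompts))
```

In this sketch only the first call prompts for credentials. The second call exchanges the saved refresh token without any interaction, which is what makes a login shortly before a batch job useful.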
During training
Once we have the new refresh token, we are guaranteed access to the MLflow server for at least 30 days. So even if we are running in a non-interactive batch job, we can guarantee that we don't need to prompt the user for credentials.
At the beginning of the training job, the refresh token is loaded from disk and used to request access tokens for all subsequent calls to MLflow.
How to integrate this
At the start of the training run, a `TokenAuth` object should be initialised, which loads the refresh token from disk. Before every call to MLflow, `TokenAuth.authenticate()` should be called to check and, if needed, acquire an access token. The access token is kept in memory and reused for subsequent calls. If a valid token is already in memory, `authenticate()` does nothing and has a minimal impact on performance.
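The call pattern described above can be sketched with a small stand-in class. This is not the real `TokenAuth`: `SketchTokenAuth`, `call_mlflow`, the injected `request_access_token` callable and the `Bearer` header format are assumptions for illustration. Only the `authenticate()` method and the `access_expires` attribute mirror names that appear in the benchmarks.

```python
import time


class SketchTokenAuth:
    """Illustrative stand-in showing the caching behaviour, not the real class."""

    def __init__(self, request_access_token):
        # Hypothetical callable returning (token, lifetime_in_seconds).
        self._request = request_access_token
        self.access_token = None
        self.access_expires = 0.0
        self.refresh_count = 0

    def authenticate(self):
        if self.access_token is not None and time.time() < self.access_expires:
            return  # fast path: a valid token is already in memory
        self.access_token, lifetime = self._request()
        self.access_expires = time.time() + lifetime
        self.refresh_count += 1


def call_mlflow(auth, send):
    """Authenticate first, then hand the request headers to `send`."""
    auth.authenticate()
    # Assumption: the access token travels as a bearer token in the header.
    return send({"Authorization": f"Bearer {auth.access_token}"})


auth = SketchTokenAuth(lambda: ("dummy-access-token", 300))
headers = [call_mlflow(auth, lambda h: h) for _ in range(3)]
count_after_three_calls = auth.refresh_count

auth.access_expires = 0  # force a refresh, as in the second benchmark
call_mlflow(auth, lambda h: h)
count_after_forced_refresh = auth.refresh_count
```

Three consecutive calls trigger a single token request in this sketch, and resetting `access_expires` forces exactly one more.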