Adding pilot registrations and authentication (Router) #421

Open
wants to merge 1 commit into
base: main
from

Conversation

Robin-Van-de-Merghel
Contributor

@Robin-Van-de-Merghel Robin-Van-de-Merghel commented Mar 27, 2025

Changes

Endpoints

Adding a pilot service with some endpoints:

  • POST / creates a pilot with (unless disabled) a secret
  • DELETE / deletes pilots by stamp
  • DELETE /interval deletes pilots that have lived more than n days
  • POST /token exchanges a pilot secret for a token
  • POST /refresh-token refreshes a pilot token
  • POST /fields/secrets creates secrets
  • PATCH /fields/secrets associates a pilot with a secret
  • PATCH /fields/jobs associates a pilot with jobs
  • PATCH /fields modifies pilot fields (benchmark, gridsite, ...)
  • GET /search searches for pilots with parameters

Note

DELETE /interval is there because we need it directly and because it is faster, but it could be replaced by GET /search followed by DELETE /.

Security Model

As the security model dictates, pilot secrets are strings and are stored hashed in the database.
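
A minimal sketch of what storing only a hash of the secret can look like. The hash algorithm (SHA-256) and the helper names `generate_pilot_secret`, `hash_secret` and `secret_matches` are assumptions made for illustration; the PR description does not say which algorithm or functions are used.

```python
import hashlib
import hmac
import secrets


def generate_pilot_secret(n_bytes: int = 32) -> str:
    """Return a random URL-safe secret string (hypothetical helper)."""
    return secrets.token_urlsafe(n_bytes)


def hash_secret(secret: str) -> str:
    """Hash a secret with SHA-256 (assumed algorithm) before storing it."""
    return hashlib.sha256(secret.encode("utf-8")).hexdigest()


def secret_matches(candidate: str, stored_hash: str) -> bool:
    """Compare hashes in constant time to avoid leaking timing information."""
    return hmac.compare_digest(hash_secret(candidate), stored_hash)


secret = generate_pilot_secret()  # handed to the pilot
stored = hash_secret(secret)  # only this value would be written to the database
```

With this shape, a leaked database row does not directly reveal a usable secret, and the plain secret only ever exists on the pilot side and in transit.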

Important

From the JWT perspective, we need to choose whether a pilot will need refresh tokens or not, and how long a token will live, in order to implement it.

These changes are mandatory for this PR.

After offline discussions: pilots will have their own tokens (refresh and access), with their own lifetimes.

@Robin-Van-de-Merghel Robin-Van-de-Merghel force-pushed the robin-pilot-registrations branch 3 times, most recently from e74fe72 to 9d1c062 Compare March 28, 2025 09:11
@Robin-Van-de-Merghel Robin-Van-de-Merghel marked this pull request as ready for review March 28, 2025 09:31
@Robin-Van-de-Merghel
Contributor Author

Robin-Van-de-Merghel commented Mar 28, 2025

About the failed CI: I'm not sure whether I have to regenerate the client manually.

@aldbr
Contributor

aldbr commented Mar 28, 2025

About the failed CI: I'm not sure whether I have to regenerate the client manually.

Yes, you need to regenerate the client manually, here is the documentation: https://github.com/DIRACGrid/diracx/blob/main/docs/CLIENT.md#updating-the-client

If you have any trouble, please let me know

@Robin-Van-de-Merghel Robin-Van-de-Merghel force-pushed the robin-pilot-registrations branch from a269416 to 8645c01 Compare March 28, 2025 13:09
Comment on lines 156 to 280
if "foreign key" in str(e.orig).lower():
    raise PilotNotFoundError(pilot_id=pilot_id) from e
if "duplicate entry" in str(e.orig).lower():
    raise PilotAlreadyExistsError(
Contributor

These look a bit fragile (e.g. at the moment we are effectively only supporting MySQL, but what if we add support also for e.g. PG?).
Maybe there's nothing different that can be done, but worth checking.

Contributor Author

I just looked at the SQLAlchemy code: there is indeed an IntegrityError, but nothing generic beneath it. We would have to handle database-specific errors: psycopg2.errors.ForeignKeyViolation for PostgreSQL, if error_code == 2291: for Oracle, ...

Contributor

Can't you rely on an error code instead of relying on a string at least?
Also, it seems you are not using and testing the case where PilotAlreadyExistsError is raised (or I possibly missed it)
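
As a hedged illustration of the error-code idea, the sketch below classifies a driver-level error by its numeric code instead of its message. It assumes a MySQL driver such as pymysql, which places the server error code in `args[0]` of the exception that SQLAlchemy exposes as `IntegrityError.orig`. The function name, the returned labels and the `FakeDriverError` stand-in are hypothetical; the approach remains MySQL-specific.

```python
# MySQL server error codes (from the MySQL error reference).
ER_DUP_ENTRY = 1062  # duplicate entry for a unique key
ER_NO_REFERENCED_ROW_2 = 1452  # foreign key constraint fails on insert/update


def classify_integrity_error(orig: Exception) -> str:
    """Classify a driver-level error by its numeric code (hypothetical helper).

    `orig` stands for the exception found in `IntegrityError.orig`.
    """
    code = orig.args[0] if orig.args else None
    if code == ER_NO_REFERENCED_ROW_2:
        return "pilot_not_found"
    if code == ER_DUP_ENTRY:
        return "pilot_already_exists"
    return "unknown"


class FakeDriverError(Exception):
    """Stand-in for a pymysql error so the sketch runs without a database."""


label = classify_integrity_error(
    FakeDriverError(1062, "Duplicate entry 'x' for key 'PRIMARY'")
)
```

Matching on codes avoids depending on the wording of server messages, but a PostgreSQL or Oracle backend would still need its own mapping.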

Contributor Author

If we check whether an error is an instance of a class from a driver module such as pymysql, we could match on error codes, but those are specific to one database. And even then, I saw cases where people had to catch both SQLAlchemy's IntegrityError and pymysql's IntegrityError because of inconsistent handling.

It is not pretty, and you can read this response: https://stackoverflow.com/a/70714697

Also, it seems you are not using and testing the case where PilotAlreadyExistsError is raised (or I possibly missed it)

This part, add_pilot_credentials, is not used yet, but it soon will be, when Dirac or another entity registers pilots on DiracX and adds credentials. I don't currently catch the error, because HTTPExceptions are meant to be raised in a router, and an error raised in the logic propagates automatically.
I don't know whether it is fine to raise an error from the logic and re-raise the same one in the router: on one hand, it documents which errors the logic can produce; on the other, it adds code...

Contributor Author

I'll open an issue to fix this later.

@Robin-Van-de-Merghel Robin-Van-de-Merghel force-pushed the robin-pilot-registrations branch 3 times, most recently from 5e80165 to b22d1dc Compare April 1, 2025 07:19
@Robin-Van-de-Merghel
Contributor Author

Changed the login request from (PilotID, secret) to (PilotRef, secret); see this issue I opened about it.

@Robin-Van-de-Merghel Robin-Van-de-Merghel force-pushed the robin-pilot-registrations branch from 536c2a5 to a38f6ea Compare April 2, 2025 08:03
@Robin-Van-de-Merghel
Contributor Author

Tested with this Pilot PR version and it worked: I could retrieve a DiracX token from a Pilot.

@Robin-Van-de-Merghel Robin-Van-de-Merghel force-pushed the robin-pilot-registrations branch from 252da7c to b3822cd Compare April 2, 2025 13:54
@Robin-Van-de-Merghel
Contributor Author

If someone has a solution for this CI, I'm all ears.

I moved a function as suggested above to diracx.logic, and it seems to have destroyed OSDB? (I don't use OpenSearch).

@Robin-Van-de-Merghel Robin-Van-de-Merghel force-pushed the robin-pilot-registrations branch 4 times, most recently from 8730f95 to 44310ed Compare April 4, 2025 07:56
@Robin-Van-de-Merghel
Contributor Author

[DB Specific bug:]

(pymysql.err.ProgrammingError) (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'RETURNING `PilotAgents`.`PilotID`' at line 1")
[SQL: INSERT INTO `PilotAgents` (`InitialJobID`, `CurrentJobID`, `PilotJobReference`, `PilotStamp`, `DestinationSite`, `Queue`, `GridSite`, `VO`, `GridType`, `BenchMark`, `SubmissionTime`, `LastUpdateTime`, `Status`, `StatusReason`, `AccountingSent`) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s) RETURNING `PilotAgents`.`PilotID`]
[parameters: (0, 0, 'aa', '', 'NotAssigned', 'Unknown', 'Unknown', 'diracAdmin', 'DIRAC', 0.0, datetime.datetime(2025, 4, 8, 8, 27, 35, 874664, tzinfo=datetime.timezone.utc), datetime.datetime(2025, 4, 8, 8, 27, 35, 874664, tzinfo=datetime.timezone.utc), 'Submitted', 'Unknown', 'False')]

insert(PilotAgents).values(values).returning(PilotAgents.pilot_id) is not supported in MySQL, yet the CI passes.
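
One common way to get the generated primary key without RETURNING is the DB-API `lastrowid` attribute of the cursor, which pymysql also provides; SQLAlchemy exposes a similar value as `inserted_primary_key` on the result of an insert. The sketch below only illustrates the idea with the standard-library `sqlite3` module as a stand-in for MySQL, and with a table reduced to two columns; it is not the code of this PR.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PilotAgents ("
    "PilotID INTEGER PRIMARY KEY AUTOINCREMENT, "
    "PilotStamp TEXT NOT NULL)"
)

# Insert without RETURNING and read the generated key from the cursor.
cursor = conn.execute(
    "INSERT INTO PilotAgents (PilotStamp) VALUES (?)", ("stamp-a",)
)
first_id = cursor.lastrowid

cursor = conn.execute(
    "INSERT INTO PilotAgents (PilotStamp) VALUES (?)", ("stamp-b",)
)
second_id = cursor.lastrowid
conn.commit()
```

Note that `lastrowid` refers to a single inserted row, so a bulk insert would need a different strategy, for example selecting the new rows back by PilotStamp.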

@Robin-Van-de-Merghel
Contributor Author

Robin-Van-de-Merghel commented May 5, 2025

Added support for pilots in this diracx-charts PR

@Robin-Van-de-Merghel
Contributor Author

Robin-Van-de-Merghel commented May 5, 2025

Could merge cli commands to have only dirac internal add-pilot*

@Robin-Van-de-Merghel Robin-Van-de-Merghel force-pushed the robin-pilot-registrations branch 2 times, most recently from 5d46d0f to cf2dd5b Compare May 7, 2025 12:12
@Robin-Van-de-Merghel Robin-Van-de-Merghel mentioned this pull request May 7, 2025
39 tasks
@Robin-Van-de-Merghel Robin-Van-de-Merghel force-pushed the robin-pilot-registrations branch from e38c107 to d882974 Compare May 8, 2025 09:06
@Robin-Van-de-Merghel Robin-Van-de-Merghel requested a review from aldbr May 8, 2025 09:15
@Robin-Van-de-Merghel Robin-Van-de-Merghel force-pushed the robin-pilot-registrations branch 2 times, most recently from 4f66b63 to 06442f4 Compare May 9, 2025 08:44
Comment on lines 104 to 196
pilots_credentials = await self.get_pilot_credentials_by_stamp([pilot_stamp])

# 2. Get the pilot secret itself
secrets = await self.get_secrets_by_hashed_secrets_bulk([pilot_hashed_secret])
secret = secrets[0]  # Semantic, assured by fetch_records_bulk_or_raises

matches = [
    pilot_credential
    for pilot_credential in pilots_credentials
    if secret["SecretID"] == pilot_credential["PilotSecretID"]
]

# 3. Compare the secret_id
if len(matches) == 0:

    raise BadPilotCredentialsError(
        data={
            "pilot_stamp": pilot_stamp,
            "pilot_hashed_secret": pilot_hashed_secret,
            "real_hashed_secret": secret["HashedSecret"],
            "pilot_secret_id[]": str(
                [
                    pilot_credential["PilotSecretID"]
                    for pilot_credential in pilots_credentials
                ]
            ),
            "secret_id": secret["SecretID"],
            "test": str(pilots_credentials),
        }
    )
elif len(matches) > 1:

    raise DBInBadStateError(
        detail="This should not happen. Duplicates in the database."
    )
pilot_credentials = matches[0]  # Semantic

# 4. Check if the secret is expired
now = datetime.now(tz=timezone.utc)
# Convert the timezone, TODO: Change with #454: https://github.com/DIRACGrid/diracx/pull/454
expiration = secret["SecretExpirationDate"].replace(tzinfo=timezone.utc)
if expiration < now:

    try:
        await self.delete_secrets_bulk([secret["SecretID"]])
    except SecretNotFoundError as e:
        await self.conn.rollback()

        raise DBInBadStateError(
            detail="This should not happen. Pilot should have a secret, but not found."
        ) from e

    raise SecretHasExpiredError(
        data={
            "pilot_hashed_secret": pilot_hashed_secret,
            "now": str(now),
            "expiration_date": secret["SecretExpirationDate"],
        }
    )

# 5. Now the pilot is authorized, increment the counters (globally and locally).
try:
    # 5.1 Increment the local count
    await self.increment_pilot_local_secret_and_last_time_use(
        pilot_secret_id=pilot_credentials["PilotSecretID"],
        pilot_stamp=pilot_credentials["PilotStamp"],
    )

    # 5.2 Increment the global count
    await self.increment_global_secret_use(
        secret_id=pilot_credentials["PilotSecretID"]
    )
except Exception as e:  # Generic, to catch it.
    # Should NOT happen
    # Wrapped in a try/catch to still catch in case of an error in the counters
    # Caught and raised here to avoid raising a 4XX error
    await self.conn.rollback()

    raise DBInBadStateError(
        detail="This should not happen. Pilot has credentials, but has a corrupted secret."
    ) from e

# 6. Delete all secrets if its count attained the secret_global_use_count_max
if secret["SecretGlobalUseCountMax"]:
    if secret["SecretGlobalUseCount"] + 1 == secret["SecretGlobalUseCountMax"]:
        try:
            await self.delete_secrets_bulk([secret["SecretID"]])
        except SecretNotFoundError as e:
            # Should NOT happen
            await self.conn.rollback()
            raise DBInBadStateError(
                detail="This should not happen. Pilot has credentials, but has corrupted secret."
            ) from e
Contributor

  • I have the feeling that this function should go in diracx.logic, it looks like it does not directly interact with sqlalchemy
  • Aren't you supposed to use the search method instead of get_pilot_credentials_by_stamp, get_secrets_by_hashed_secrets_bulk
  • I haven't checked what you do with DBInBadStateError, but if you don't catch it anywhere then the transaction should be automatically rolled back. See
    ### SQL Databases
    To depend on a SQL-backed database, use the classes in `diracx.routers.dependencies`. The connection is managed through a central pool, with transactions opened for the duration of a request. Successful requests commit the transaction, while requests with HTTP status code `>=400` roll back the transaction. Connections are returned to the pool for reuse.
    Example:
    ```python
    from diracx.routers.dependencies import JobDB, JobLoggingDB
    @router.delete("/{job_id}")
    async def delete_single_job(job_db: JobDB, job_logging_db: JobLoggingDB): ...
    ```
    There are advanced and uncommon scenarios where committing a transaction is necessary even when returning an error response (e.g., revoking tokens in the database and returning an error to a potentially malicious user). In such cases, explicitly committing the transaction before raising an exception is crucial. Without this explicit commit, the intended changes would be rolled back along with the transaction, leading to unintended consequences:
    ```python
    from diracx.routers.dependencies import AuthDB
    @router.post("/token")
    async def token(auth_db: AuthDB, ...):
        ...
        if refresh_token_attributes["status"] == RefreshTokenStatus.REVOKED:
            # Revoke all the user tokens associated with the subject
            await auth_db.revoke_user_refresh_tokens(sub)
            # Explicitly commit the transaction to ensure the revocation is saved,
            # even though an error will be returned to the user.
            await auth_db.conn.commit()
            # Raise an HTTP exception to signal the error
            raise HTTPException(status_code=401)
    ```
    Refer to the [SQLAlchemy documentation](https://docs.sqlalchemy.org/en/20/core/pooling.html) for more details.
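
To make the first point concrete, here is a rough sketch of how the decision part of the reviewed function could be written as a pure function that takes already-fetched records and performs no database access. The names `Outcome`, `Decision` and `check_pilot_secret` are hypothetical; the dictionary keys are the ones used in the reviewed code, and the counter updates and deletions are left to the caller.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    BAD_CREDENTIALS = "bad_credentials"
    DB_IN_BAD_STATE = "db_in_bad_state"
    SECRET_EXPIRED = "secret_expired"


@dataclass
class Decision:
    outcome: Outcome
    delete_secret: bool = False  # the caller should delete the secret row


def check_pilot_secret(
    secret: dict, pilots_credentials: list[dict], now: datetime
) -> Decision:
    """Decide what to do with a login attempt, without touching the database."""
    matches = [
        c for c in pilots_credentials if c["PilotSecretID"] == secret["SecretID"]
    ]
    if not matches:
        return Decision(Outcome.BAD_CREDENTIALS)
    if len(matches) > 1:
        return Decision(Outcome.DB_IN_BAD_STATE)

    # Stored dates are assumed to be naive UTC, as in the reviewed code.
    expiration = secret["SecretExpirationDate"].replace(tzinfo=timezone.utc)
    if expiration < now:
        return Decision(Outcome.SECRET_EXPIRED, delete_secret=True)

    # The secret is used up once this login reaches the maximum use count.
    use_max = secret["SecretGlobalUseCountMax"]
    exhausted = bool(use_max) and secret["SecretGlobalUseCount"] + 1 == use_max
    return Decision(Outcome.OK, delete_secret=exhausted)
```

A function of this shape can be unit-tested with plain dictionaries, while the database layer keeps only the fetch, increment and delete calls.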

Contributor Author

For this part:

  • Yes, maybe I need to move it to the logic 🤔
  • For the search, I'm waiting for it to become generic, because for now the search mechanism does not completely do what I want (the error handling, for example)
  • For the DBInBadStateError, I deliberately don't catch it, so that it results in a 500 error: catching it and raising a 4XX error would be wrong, because the problem comes from the server, not from the client.

For the last point, when there's an error inside the DB I prefer to roll back everything, because if I insert corrupted data, it will be a mess to find which record is corrupted.

status_code=status.HTTP_400_BAD_REQUEST,
detail="expiration_minutes must be strictly positive.",
)
if pilot_secret_use_count_max and pilot_secret_use_count_max <= 0:
Contributor

Same question as earlier here: don't you need these checks in create_credentials?

@aldbr
Contributor

aldbr commented May 12, 2025

@fstagni

It is true that:
  • a pilot is associated with at most 1 secret
  • a secret could be associated with more than one pilot, but with 1 by default.
It looks to me like this table is still needed.

Why do you think the PilotToSecretMapping is still needed if we have a 1-N relationship?
In PilotAgents we can have a secret_id foreign key that would point to PilotSecrets.secret_id.

Having a PilotToSecretMapping implies a N-N relationship I think.
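
The 1-N layout suggested here can be sketched as follows, using the standard-library `sqlite3` module purely for illustration. The column names and types are assumptions, not the actual DiracX schema: each pilot row carries a nullable foreign key to one secret, and several pilots may point to the same secret, with no mapping table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE PilotSecrets (
        SecretID INTEGER PRIMARY KEY,
        HashedSecret TEXT NOT NULL UNIQUE
    );
    CREATE TABLE PilotAgents (
        PilotID INTEGER PRIMARY KEY,
        PilotStamp TEXT NOT NULL UNIQUE,
        -- At most one secret per pilot; NULL means no secret assigned.
        SecretID INTEGER REFERENCES PilotSecrets (SecretID)
    );
    """
)

conn.execute("INSERT INTO PilotSecrets (SecretID, HashedSecret) VALUES (1, 'hash-1')")
conn.executemany(
    "INSERT INTO PilotAgents (PilotStamp, SecretID) VALUES (?, ?)",
    [("stamp-a", 1), ("stamp-b", 1), ("stamp-c", None)],
)

# Number of pilots sharing secret 1.
shared = conn.execute(
    "SELECT COUNT(*) FROM PilotAgents WHERE SecretID = 1"
).fetchone()[0]
```

A separate mapping table would only become necessary if a single pilot could hold several secrets at once, which is the N-N case.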

@Robin-Van-de-Merghel Robin-Van-de-Merghel force-pushed the robin-pilot-registrations branch from 65cd1cd to 9ed1a7b Compare May 12, 2025 12:51
@Robin-Van-de-Merghel
Contributor Author

Facing #417, so I set require_auth to False.

@Robin-Van-de-Merghel Robin-Van-de-Merghel force-pushed the robin-pilot-registrations branch from 8471d6c to f7f4c4a Compare May 14, 2025 08:57
@Robin-Van-de-Merghel Robin-Van-de-Merghel force-pushed the robin-pilot-registrations branch 3 times, most recently from 1a2e563 to a5d2788 Compare May 15, 2025 07:19
@Robin-Van-de-Merghel Robin-Van-de-Merghel marked this pull request as ready for review May 15, 2025 07:41
@Robin-Van-de-Merghel Robin-Van-de-Merghel force-pushed the robin-pilot-registrations branch from e8866db to cc8de96 Compare May 21, 2025 13:58
@Robin-Van-de-Merghel
Contributor Author

As discussed offline:

  • We won't use the pilot user anymore
  • We will split the pilot route into pilots/ and pilot_management/
  • Users and pilots will have different tokens, and we will have to tell one from the other
  • Pilot refresh tokens will last longer than users', and we will refresh tokens as we fetch data from the Pilot

@Robin-Van-de-Merghel Robin-Van-de-Merghel force-pushed the robin-pilot-registrations branch from 720f302 to a957480 Compare May 26, 2025 14:12

4 participants