add DeviceUsage and DeviceUsageKind for Instance.device_usage #628

Merged
merged 9 commits into from
Jul 26, 2023

Conversation

johnsonshih
Contributor

@johnsonshih johnsonshih commented Jul 11, 2023

What this PR does / why we need it:
This PR is part of the work for Configuration-level resource support. To keep the device usage state for each usage slot, it introduces two new data types, DeviceUsageKind and DeviceUsage, that track the owner of a usage slot.
The design of DeviceUsageKind and DeviceUsage is described in the doc PR: project-akri/akri-docs#76
Device usage design:
https://github.com/johnsonshih/akri-docs/blob/user/jshih/cl-implementation/proposals/configuration-level-resources.md#maintaining-device-usage

The complete PR for CL resource support is in #627

Special notes for your reviewer:

If applicable:

  • this PR has an associated PR with documentation in akri-docs
  • this PR contains unit tests
  • added code adheres to standard Rust formatting (cargo fmt)
  • code builds properly (cargo build)
  • code is free of common mistakes (cargo clippy)
  • all Akri tests succeed (cargo test)
  • inline documentation builds (cargo doc)
  • all commits pass the DCO bot check by being signed off -- see the failing DCO check for instructions on how to retroactively sign commits

Signed-off-by: Johnson Shih <[email protected]>
Contributor

@diconico07 diconico07 left a comment

This looks good to me. I'd like to take a second look this week, after going through the global CL resource PR, so I'm not clicking Approve yet (I will after the second look, or at the end of the week if I don't find the time for one).

agent/src/util/device_plugin_service.rs (review thread resolved)
Signed-off-by: Johnson Shih <[email protected]>
Signed-off-by: Johnson Shih <[email protected]>
@bfjelds
Collaborator

bfjelds commented Jul 20, 2023

let slot_usage =
    DeviceUsage::create(&DeviceUsageKind::Instance, &device_usage_id).unwrap();
akri_annotations.insert(
    format!("{}{}", AKRI_SLOT_ANNOTATION_NAME_PREFIX, &device_usage_id),
    slot_usage.to_string(),
);

i'm probably missing the reason, but it looks like slot_reconciliation is just using the name in slot_usage ... the original code just passed device_usage_id, why change it? is vdev_id used in reconciliation?

maybe it is needed for removal_slot_map and comparing with the usage pulled from Instance? that can't be it, the Instance stores node names in the deviceusage.

can the device_usage_id be stored as the annotation value and save some parsing?

Signed-off-by: Johnson Shih <[email protected]>
@johnsonshih johnsonshih requested a review from bfjelds July 20, 2023 22:49
@johnsonshih
Contributor Author

johnsonshih commented Jul 21, 2023

[bfjelds: Sorry @johnsonshih, I meant to quote-reply to this comment, but seem to have edited it instead. Restoring your original comment as best as I can]

This is for supporting Configuration level resources.

Since the Akri Instance is the source of truth for the Agent, we need to keep the device plugin type in the device usage. For the same reason, the reconciler needs the same information. Originally the annotation looked like

"akri.agent.slot-my-resource-00095f-3": "my-resource-00095f-3"

With DeviceUsage, the annotation can also hold the device plugin type information:

"akri.agent.slot-my-resource-00095f-2": "C:0:my-resource-00095f-2"

Please check the details from the CL resource proposal
https://github.com/project-akri/akri-docs/blob/main/proposals/configuration-level-resources.md#maintaining-device-usage
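
To make the two annotation values above concrete, here is a minimal sketch of a type that renders and parses them. It is not the code in this PR: the field names, the `FromStr` impl, and the reading of the middle field in "C:0:..." as a virtual device id (the vdev_id mentioned earlier in the thread) are assumptions made for illustration.

```rust
use std::fmt;
use std::str::FromStr;

/// Which kind of device plugin owns a usage slot.
/// Variant names follow the PR title; the payload type is an assumption.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum DeviceUsageKind {
    /// Slot claimed through an Instance-level device plugin.
    Instance,
    /// Slot claimed through a Configuration-level device plugin;
    /// the payload is assumed to be the virtual device id.
    Configuration(String),
}

/// A usage kind plus the name it applies to (hypothetical field names).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeviceUsage {
    pub kind: DeviceUsageKind,
    pub name: String,
}

impl fmt::Display for DeviceUsage {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match &self.kind {
            // Instance-level usage keeps the original, unprefixed value.
            DeviceUsageKind::Instance => write!(f, "{}", self.name),
            // Configuration-level usage is rendered as "C:<vdev_id>:<name>".
            DeviceUsageKind::Configuration(vdev_id) => {
                write!(f, "C:{}:{}", vdev_id, self.name)
            }
        }
    }
}

impl FromStr for DeviceUsage {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.strip_prefix("C:") {
            // "C:<vdev_id>:<name>" with both parts non-empty.
            Some(rest) => match rest.split_once(':') {
                Some((vdev_id, name)) if !vdev_id.is_empty() && !name.is_empty() => {
                    Ok(DeviceUsage {
                        kind: DeviceUsageKind::Configuration(vdev_id.to_string()),
                        name: name.to_string(),
                    })
                }
                _ => Err(format!("malformed configuration usage: {}", s)),
            },
            // A bare name without separators is treated as Instance-level usage.
            None if !s.is_empty() && !s.contains(':') => Ok(DeviceUsage {
                kind: DeviceUsageKind::Instance,
                name: s.to_string(),
            }),
            None => Err(format!("unrecognized device usage: {}", s)),
        }
    }
}
```

Under these assumptions, the old-style value "my-resource-00095f-3" parses as Instance-level usage, so existing annotations would still be readable.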

@bfjelds
Collaborator

bfjelds commented Jul 21, 2023

Sorry @johnsonshih, I meant to quote-reply to your last comment, but seem to have edited it instead. Restored your original comment as best as I can and replying here:

> This is for supporting Configuration level resources.
> ...
> Please check the details from the CL resource proposal
> https://github.com/project-akri/akri-docs/blob/main/proposals/configuration-level-resources.md#maintaining-device-usage

sorry, the doc definitely does a great job explaining what to expect, but didn't help my brain understand why kind is required in the reconciler. unfortunately, it has been a long time and i didn't remember the block of code in reconciler that adds missing usage to Instance.deviceUsage (which would need to reconstruct the DeviceUsage object).

i still feel like overloading DeviceUsage to accept node name or slot name is confusing. i would rather the annotation be changed to be something like "<slotname>=<deviceusage.tostring>", then the reconciler could just split it apart once into map<slotname, deviceusage> and use deviceusage to repopulate the missing Instance.deviceUsage. if that were the case, then DeviceUsage's name would be unambiguous.
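
The "split it apart once" idea can be sketched roughly as follows. This only illustrates the suggested "<slotname>=<deviceusage.tostring>" layout and is not code from the PR; the function name is hypothetical, and the usage part is kept as an opaque string rather than a parsed DeviceUsage.

```rust
use std::collections::HashMap;

/// Splits annotation values of the suggested form `<slot-name>=<usage>`
/// into a slot-name -> usage-string map. Values without `=` or with an
/// empty side are skipped. (Hypothetical helper, for illustration only.)
pub fn parse_slot_annotations<'a, I>(values: I) -> HashMap<String, String>
where
    I: IntoIterator<Item = &'a str>,
{
    values
        .into_iter()
        // Split at the first '=' only, so the usage string may contain ':'.
        .filter_map(|value| value.split_once('='))
        .filter(|(slot, usage)| !slot.is_empty() && !usage.is_empty())
        .map(|(slot, usage)| (slot.to_string(), usage.to_string()))
        .collect()
}
```

With a map like this, the reconciler would look up a slot by name and could write the stored usage string back into Instance.deviceUsage when an entry is missing.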

@johnsonshih
Contributor Author

> Sorry @johnsonshih, I meant to quote reply to your last comment, but seem to have edited it instead. Restored your original comment as best as I can and replying here:
>
> > This is for supporting Configuration level resources.
> > ...
> > Please check the details from the CL resource proposal
> > https://github.com/project-akri/akri-docs/blob/main/proposals/configuration-level-resources.md#maintaining-device-usage
>
> sorry, the doc definitely does a great job explaining what to expect, but didn't help my brain understand why kind is required in the reconciler. unfortunately, it has been a long time and i didn't remember the block of code in reconciler that adds missing usage to Instance.deviceUsage (which would need to reconstruct the DeviceUsage object).
>
> i still feel like overloading DeviceUsage to accept node name or slot name is confusing. i would rather the annotation be changed to be something like "<slotname>=<deviceusage.tostring>", then the reconciler could just split it apart once into map<slotname, deviceusage> and use deviceusage to repopulate the missing Instance.deviceUsage. if that were the case, then DeviceUsage's name would be unambiguous.

I'll change the code to avoid overloading DeviceUsage.

Here is a summary of what the reconciler does and why it needs usage_kind:
The deviceUsage in an Akri Instance keeps the information as slot_id: (usage_kind: node_name). Akri Instances can be written by multiple nodes in parallel, so a race condition can cause the deviceUsage information in an Akri Instance to be accidentally wiped or overwritten with incorrect data. The reconciler is responsible for ensuring that the deviceUsage status for the node it runs on is correct.

The reconciler needs slot_id, usage_kind and node_name to do the check. It gets that information from the name of the node it is running on and from the annotation (usage_kind: slot_id) on pods running on that node. Without usage_kind, the reconciler cannot restore deviceUsage to the correct state.
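
A rough sketch of that check, under stated assumptions: the types and the function below are hypothetical stand-ins, not the Agent's actual reconciler, and a free slot is modeled as `None`. It only covers repairing entries that were wiped or overwritten; releasing slots that no pod holds any more is left out.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the usage kind discussed in this PR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum UsageKind {
    Instance,
    Configuration(String),
}

/// One Instance.deviceUsage entry: (usage_kind: node_name).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SlotOwner {
    pub kind: UsageKind,
    pub node_name: String,
}

/// Restores the deviceUsage entries for the slots that pods on `node_name`
/// actually hold. `pod_slots` maps slot_id -> usage_kind, as read from the
/// pod annotations. Returns the repaired slot ids in sorted order.
pub fn reconcile_node_usage(
    node_name: &str,
    pod_slots: &HashMap<String, UsageKind>,
    instance_usage: &mut HashMap<String, Option<SlotOwner>>,
) -> Vec<String> {
    let mut repaired = Vec::new();
    for (slot_id, kind) in pod_slots {
        // What the Instance should say: this kind, on this node.
        let expected = SlotOwner {
            kind: kind.clone(),
            node_name: node_name.to_string(),
        };
        // Missing, free, or mismatched entries all need a repair.
        let needs_fix = instance_usage
            .get(slot_id)
            .map_or(true, |owner| owner.as_ref() != Some(&expected));
        if needs_fix {
            instance_usage.insert(slot_id.clone(), Some(expected));
            repaired.push(slot_id.clone());
        }
    }
    repaired.sort();
    repaired
}
```

Note that `expected` cannot be built without the kind from the annotation, which is the point made above: with only slot_id and node_name, the reconciler could not tell which kind of entry to write back.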

@bfjelds
Collaborator

bfjelds commented Jul 24, 2023

> I'll change the code to avoid overloading DeviceUsage.
>
> Here is the summary for what reconciler does and why it needs usage_kind: the deviceUsage in Akri Instance keep the information slot_id: (usage_kind: node_name). The Akri Instances can be written by multiple nodes in parallel, there is a race condition that can cause the deviceUsage information in Akri Instance be accidentally wiped or have incorrect data written to it. The reconciler is responsible to ensure the deviceUsage status for the node that reconciler runs on are correct.
>
> Reconciler needs slot_id, usage_kind and node_name to do the check. It gets that information by using the name of node it is running on and the annotation (usage_kind: slot_id) from pods running on the node. Without usage_kind, reconciler cannot correct the deviceUsage to correct state.

one last suggestion, if SlotUsage is write!(f, "{}:{}", <slot-id>, <nodeusage.tostring>), then you won't need to parse the nodeusage kind and recreate a NodeUsage object to fix the race condition (just use the string found in the annotation).

i'm approving regardless of whether you take the suggestion, just thought it might be easier to maintain.
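
For illustration, the suggested layout might look like the sketch below. SlotUsage here is a hypothetical struct that stores the node usage as an opaque string; it assumes slot ids never contain ':' and that the node usage string looks like "C:0:node-a", neither of which is confirmed in this thread.

```rust
use std::fmt;

/// Hypothetical annotation value: a slot id plus the node usage string,
/// which is stored verbatim and never parsed here.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SlotUsage {
    pub slot_id: String,
    pub node_usage: String,
}

impl fmt::Display for SlotUsage {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Same layout as the suggestion: "<slot-id>:<node usage string>".
        write!(f, "{}:{}", self.slot_id, self.node_usage)
    }
}

impl SlotUsage {
    /// Splits at the first ':' only, so the node usage part may itself
    /// contain ':' and can be written back to the Instance unchanged.
    pub fn from_annotation(value: &str) -> Option<SlotUsage> {
        let (slot_id, node_usage) = value.split_once(':')?;
        if slot_id.is_empty() || node_usage.is_empty() {
            return None;
        }
        Some(SlotUsage {
            slot_id: slot_id.to_string(),
            node_usage: node_usage.to_string(),
        })
    }
}
```

The trade-off is the one raised in the thread: the annotation value changes shape, but the reconciler can reuse `node_usage` as-is instead of re-parsing a kind and rebuilding a usage object.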

@johnsonshih
Contributor Author

> > I'll change the code to avoid overloading DeviceUsage.
> > Here is the summary for what reconciler does and why it needs usage_kind: the deviceUsage in Akri Instance keep the information slot_id: (usage_kind: node_name). The Akri Instances can be written by multiple nodes in parallel, there is a race condition that can cause the deviceUsage information in Akri Instance be accidentally wiped or have incorrect data written to it. The reconciler is responsible to ensure the deviceUsage status for the node that reconciler runs on are correct.
> > Reconciler needs slot_id, usage_kind and node_name to do the check. It gets that information by using the name of node it is running on and the annotation (usage_kind: slot_id) from pods running on the node. Without usage_kind, reconciler cannot correct the deviceUsage to correct state.
>
> one last suggestion, if SlotUsage is write!(f, "{}:{}", <slot-id>, <nodeusage.tostring>), then you won't need to parse the nodeusage kind and recreate a NodeUsage object to fix the race condition (just use the string found in the annotation).
>
> i'm approving regardless of whether you take the suggestion, just thought it might be easier to maintain.

I thought about that before but decided to keep using the slot_id to avoid a breaking change. On second thought, I think it's worth paying the price now to get a clean design. I'll update the PR to use NodeUsage in the annotation value to reduce the parsing.

@johnsonshih johnsonshih requested a review from bfjelds July 26, 2023 01:16
@johnsonshih johnsonshih merged commit 4e421d6 into project-akri:main Jul 26, 2023
50 checks passed
@johnsonshih johnsonshih deleted the user/jshih/device-usage branch July 26, 2023 16:43
3 participants