[nvidia] Skip SAI discovery on ports #1416
aa63de0 to 995d79b
This will lead to an inconsistency between ASIC_DB and what is on the device, which will later lead to a crash.
syncd/SaiSwitchInterface.h
Outdated
@@ -89,7 +89,8 @@ namespace syncd

             virtual void onPostPortCreate(
                     _In_ sai_object_id_t port_rid,
-                    _In_ sai_object_id_t port_vid) = 0;
+                    _In_ sai_object_id_t port_vid,
+                    _In_ bool discoverPortObjects = true) = 0;
This is very specific to ports; if we later decide to do something similar for other objects, this is not an optimal solution.
The function is meant to be used on ports. Given the current approach, I assume there will be onPostXCreate() functions for other object types. Then, if needed, they can accept a boolean flag in the same way. This is simple and gives the required granularity.
syncd/Syncd.cpp
Outdated
#ifdef SKIP_SAI_PORT_DISCOVERY_ON_FAST_BOOT
    const bool discoverPortObjectsInFastBoot = false;
#else
    const bool discoverPortObjectsInFastBoot = true;
#endif
Fast boot can be initiated after the code was compiled, in which case this check will be hardcoded.
Also, there are no tests for this code.
"fast boot can be initiated after code was compiled which then this check will be hardcoded"
This was the intention: for Nvidia, skip discovery on ports in fast boot. The runtime check for fast boot is done in the condition below.
This should be a runtime check.
@kcudnik What is the benefit of a runtime check here? Syncd is compiled per platform, and on Nvidia we do not want to run discovery. We know this at compile time.
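For readers following the compile-time versus runtime argument, here is a minimal sketch of how the two checks could combine. It is not the PR's code: the macro name is taken from the diff above, while `shouldDiscoverPortObjects` and its `isFastBoot` parameter are hypothetical names introduced only for illustration.

```cpp
// Build-time choice, mirroring the #ifdef in the diff. A platform build would
// define SKIP_SAI_PORT_DISCOVERY_ON_FAST_BOOT to turn discovery off in fast boot.
#ifdef SKIP_SAI_PORT_DISCOVERY_ON_FAST_BOOT
constexpr bool discoverPortObjectsInFastBoot = false;
#else
constexpr bool discoverPortObjectsInFastBoot = true;
#endif

// Hypothetical helper: isFastBoot is only known at runtime, so the final
// decision combines the runtime boot type with the compile-time constant.
constexpr bool shouldDiscoverPortObjects(bool isFastBoot)
{
    // Any boot other than fast boot always discovers port objects.
    return isFastBoot ? discoverPortObjectsInFastBoot : true;
}
```

With this shape, the platform decision is fixed when syncd is built, and only the boot type is evaluated at runtime, which is the split the author describes.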
@kcudnik Yes, the current design leads to performance problems on devices with lots of ports (could be 512, 1024 or more: tens of thousands of keys to insert into ASIC_DB on init).
Discovery is done only once, at switch create. How long does it take in your case? We need a full view of the device, since later on warm boot will try to delete objects that were not discovered. At the current stage it's hard to predict when the crash will happen, but 99% of the time it will be on warm boot.
@kcudnik I get your point. Please note that Nvidia does not use the standard warm-reboot flow in syncd; instead it uses fast-fast-boot mode, which is quite similar to fast-boot. No comparison logic is involved on Nvidia, so in my testing I did not observe crashes even after consecutive fast-reboots and warm-reboots. That is why the change is limited to Nvidia only. I attach the log of discovery during fast-reboot. On 202405 I measured 4.8 sec dedicated to discovery, which is 16% of the time budget for fast-reboot. That is only a 256-port system, and we expect the time to increase linearly with the number of ports, which is growing with new HWSKUs.
Logging is very time-consuming; please check without logging. Also, maybe there is room in your driver to optimize how quickly objects on ports are returned?
From the logs you pasted, it seems like you are creating those ports explicitly? Why do those ports not already exist when the switch is created?
Logs are set to CRITICAL level during discovery and there are not many logs from syncd itself. We optimized various SAI calls. However, that does not address the main issue, which is that on the Nvidia platform there is no reason to do discovery.
It's a matter of user configuration. SAI provides default ports or none, and the user can override or break out ports in CONFIG_DB. In the latter case, ports are added using the port bulk API. Either way, we would like not to have to discover objects on fast/warm boot, as the number of objects and attributes grows with new platforms.
Then we need a deeper discussion here. @lguohan, can you jump in? If no warm boot will be performed on this particular platform, then we can add a feature to exclude this specific platform via a switch from a command line option.
I am a little bit confused: the PR title is about fast-boot, @kcudnik, while you are talking about warm reboot. What does this code impact?
It does not matter whether it is warm boot or fast boot; this will cause an inconsistency in the DB and later a crash.
syncd/SaiSwitch.cpp
Outdated
@@ -948,46 +948,50 @@ void SaiSwitch::redisUpdatePortLaneMap(

 void SaiSwitch::onPostPortCreate(
         _In_ sai_object_id_t port_rid,
-        _In_ sai_object_id_t port_vid)
+        _In_ sai_object_id_t port_vid,
+        _In_ bool discoverPortObjects)
I feel like the entire change in this function is overcomplicated; it should be something like this:
if (object_type(oid) == SAI_OBJECT_TYPE_PORT && shouldSkipPorts)
continue;
@kcudnik Regarding object_type(oid) == SAI_OBJECT_TYPE_PORT: the function is called onPostPortCreate, so unless someone calls it on an object other than a port, I don't think this check is needed.
Do you mean an early return? Like:
redisUpdatePortLaneMap(port_rid);
if (!discoverPortObjects)
{
return;
}
...
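To make the early-return proposal concrete, here is a self-contained sketch. `MockSwitch` and its helper methods are hypothetical stand-ins, not the real `SaiSwitch`; only the control flow (refresh the lane map, then return before discovery when the flag is false) reflects the snippet above.

```cpp
#include <cstdint>
#include <vector>

using ObjectId = std::uint64_t; // stand-in for sai_object_id_t

// Hypothetical, simplified stand-in for SaiSwitch; not the real class.
class MockSwitch
{
public:
    void onPostPortCreate(ObjectId portRid, ObjectId portVid, bool discoverPortObjects)
    {
        // The lane map is always refreshed, even when discovery is skipped.
        updatePortLaneMap(portRid);

        if (!discoverPortObjects)
        {
            return; // early return: no per-port discovery
        }

        discoverObjectsOnPort(portRid, portVid);
    }

    int laneMapUpdates = 0;                // number of lane map refreshes
    std::vector<ObjectId> discoveredPorts; // ports whose objects were discovered

private:
    void updatePortLaneMap(ObjectId) { ++laneMapUpdates; }

    void discoverObjectsOnPort(ObjectId portRid, ObjectId)
    {
        discoveredPorts.push_back(portRid);
    }
};
```

Calling `onPostPortCreate(rid, vid, false)` in this sketch updates the lane map but records no discovery, which is the behaviour the early return is meant to give.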
Ah, I thought you also modified the discovery process, since it will also discover all objects on all ports. So I guess on cold boot you only need onPostPortCreate, but this could still crash on the next fast-boot.
Please do a couple of fast-boot to fast-boot reboots with your patch to see if this works.
Looking at that code, you only need to modify the SaiDiscovery process with a flag to ignore port discovery; no other code needs to change.
And if you look on master, in the SaiDiscovery.cpp file at line 34, you can actually pass a new flag (to not discover port objects) via the VendorSaiOptions class, so that you don't have to forward bool arguments all the way to port discovery, if you want.
And you can disable those ports in the init script for your platform only.
@kcudnik The platform is known at compile time; syncd is compiled differently for different platforms. My change as it is right now should not affect other platforms. The purpose of this change is to improve startup time, and with platform detection done in a script I would add some additional CPU cycles for that. Even though the cost is very small, on some lower-end systems the init script execution time is worse, which is why I am leaning towards moving everything to compile time if possible.
If it's known at compile time, put this:
#ifdef nvidia
if (object_type(oid) == SAI_OBJECT_TYPE_PORT)
continue;
#endif
in SAI discovery.
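As a rough illustration of the suggestion above, the sketch below skips port objects inside a recursive discovery walk when a platform macro is defined. Everything here is a stand-in: `ObjectType`, `Object`, `discover` and the `NVIDIA_PLATFORM` macro name are assumptions for illustration, and real code would use `sai_object_type_t` and `SAI_OBJECT_TYPE_PORT` from the SAI headers.

```cpp
#include <vector>

// Hypothetical stand-ins for SAI object types and discovered objects.
enum class ObjectType { Switch, Port, Queue };

struct Object
{
    ObjectType type;
    std::vector<Object> children; // vector of an incomplete type requires C++17
};

// NVIDIA_PLATFORM is an assumed macro name, used only for this sketch.
#ifdef NVIDIA_PLATFORM
constexpr bool kSkipPortDiscovery = true;
#else
constexpr bool kSkipPortDiscovery = false;
#endif

// Counts the objects reachable from root. When skipPorts is true, port objects
// are skipped and therefore nothing beneath them is visited either.
int discover(const Object& root, bool skipPorts = kSkipPortDiscovery)
{
    int count = 1; // the root itself
    for (const Object& child : root.children)
    {
        if (skipPorts && child.type == ObjectType::Port)
        {
            continue;
        }
        count += discover(child, skipPorts);
    }
    return count;
}
```

The `skipPorts` parameter is exposed only so that both paths can be exercised in a single build; with a pure `#ifdef`, as suggested, the choice would be fixed at compile time.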
Also add a warning log message that port discovery was disabled on the Nvidia platform.
@kcudnik I applied your suggestion and skipped discovery for any type of boot on Nvidia.
Tested the following reboot flow with a T0 topology on the Nvidia platform:
cold -> fast -> fast -> warm -> warm -> fast -> warm
I didn't add the log message, since it would be printed for every port.
Signed-off-by: Stepan Blyschak <[email protected]>
56c4012 to 6bdf9bf
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
You could add this message in the syncd constructor.
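A minimal sketch of that idea, logging once at construction rather than once per port. `Logger` and `MockSyncd` are hypothetical stand-ins and do not reflect the real syncd classes or its logging macros; the message text is also illustrative.

```cpp
#include <string>
#include <vector>

// Hypothetical logger that collects warnings so they can be inspected.
struct Logger
{
    std::vector<std::string> warnings;

    void warn(const std::string& message) { warnings.push_back(message); }
};

// Hypothetical stand-in for the Syncd class.
class MockSyncd
{
public:
    MockSyncd(Logger& log, bool portDiscoveryDisabled)
        : m_log(log), m_portDiscoveryDisabled(portDiscoveryDisabled)
    {
        // Emitted exactly once per process, not once per port.
        if (m_portDiscoveryDisabled)
        {
            m_log.warn("port object discovery is disabled on this platform");
        }
    }

    // Per-port hook: deliberately does no logging.
    void onPostPortCreate() { ++m_portsCreated; }

    int portsCreated() const { return m_portsCreated; }

private:
    Logger& m_log;
    bool m_portDiscoveryDisabled;
    int m_portsCreated = 0;
};
```

Creating 256 ports on such an object still yields a single warning, which addresses the concern about printing the message for every port.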
Given that modern systems have many ports, performing SAI discovery takes a very long time, e.g. 8 sec for a 256-port system. This has a big impact on fast-boot downtime, and the discovery itself is not required for Nvidia platform fast-boot.
The same applies to Nvidia fastfast-boot (aka warm-boot), but that needs to be tested separately.