Problem using SwitchML with PCI NIC virtual function #31

Open
kfertakis opened this issue Nov 16, 2022 · 0 comments

Hi,

I am running the SwitchML allreduce_benchmarks on a cluster in which some nodes have MLX5 NICs and others have Intel 82599 ES 10G NICs, so I am using DPDK as the communication backend. I need to share the NIC on each host with other traffic, so I virtualize it by creating a virtual function (VF) of the PCI device: the original device carries general-purpose traffic and the VF runs the SwitchML app. However, when I try to run SwitchML on the VF, I get the following error:

Submitting 5 warmup jobs.
EAL: Detected 20 lcore(s)
EAL: Detected 2 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Selected IOVA mode 'VA'
EAL: No available hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support...
EAL: VFIO support initialized
EAL: PCI device 0000:01:10.1 on NUMA socket 0
EAL:   probe driver: 8086:10ed net_ixgbe_vf
EAL:   using IOMMU type 1 (Type 1)
E1116 19:08:30.807562 74629 dpdk_master_thread_utils.inc:277] Flow isolated mode failed: 1 Function not implemented
F1116 19:08:31.361418 74629 dpdk_master_thread_utils.inc:154] Flow rule can't be added: 1Function not implemented
*** Check failure stack trace: ***
    @     0x7f2291e280cd  google::LogMessage::Fail()
    @     0x7f2291e29f33  google::LogMessage::SendToLog()
    @     0x7f2291e27c28  google::LogMessage::Flush()
    @     0x7f2291e2a999  google::LogMessageFatal::~LogMessageFatal()
    @     0x564504203fb2  switchml::InsertFlowRule()
    @     0x5645042048ca  switchml::InitPort()
    @     0x564504205ae6  switchml::DpdkMasterThread::operator()()
    @     0x7f2291ae34c0  (unknown)
    @     0x7f22915766db  start_thread
    @     0x7f229015761f  clone

This seems to be caused by the following check in the InsertFlowRule function: struct rte_flow_error error; LOG_IF(FATAL, rte_flow_validate(port_id, &attr, pattern, action, &error) != 0) << "Flow rule can't be added: " << error.type << (error.message ? error.message : "(no stated reason)"); Any ideas why this is happening and whether I can work around it? Much appreciated.
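
A note on reading the fatal line: the check prints error.type and error.message back to back, so "1Function not implemented" is type 1 followed directly by the message. The DPDK rte_flow documentation lists -ENOSYS from rte_flow_validate as "underlying device does not support this functionality", which would be consistent with the net_ixgbe_vf driver not offering rte_flow on the virtual function; that is an assumption here, not something confirmed in this issue. Below is a minimal sketch of a more readable diagnostic. FlowErrorView, DescribeFlowError and FlowApiUnsupported are hypothetical names, and FlowErrorView is only a stand-in for the two rte_flow_error fields used in the log line, so the snippet compiles without DPDK headers.

```cpp
#include <cerrno>
#include <string>

// Hypothetical stand-in for the two rte_flow_error fields that the log line
// prints. Real code would pass the struct from <rte_flow.h> instead.
struct FlowErrorView {
  int type;             // numeric value of the rte_flow_error_type enum
  const char* message;  // may be null when the driver gives no reason
};

// Formats the error with an explicit separator so that the type and the
// message cannot run together as in "1Function not implemented".
std::string DescribeFlowError(const FlowErrorView& err) {
  std::string out = "type=" + std::to_string(err.type) + ": ";
  out += err.message ? err.message : "(no stated reason)";
  return out;
}

// Returns true when the return code of an rte_flow call is -ENOSYS, which
// the rte_flow documentation describes as the device not supporting the
// functionality at all (as opposed to rejecting one particular rule).
bool FlowApiUnsupported(int rc) {
  return rc == -ENOSYS;
}
```

With a helper like this, the check in InsertFlowRule could store the return value of rte_flow_validate, log DescribeFlowError(...) for it, and use FlowApiUnsupported(rc) to tell "this port has no rte_flow support" apart from "this rule is invalid". Whether SwitchML can run correctly on such a port without the flow rule is a separate question that this sketch does not answer.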

Thank you.
