This repository has been archived by the owner on Aug 30, 2024. It is now read-only.
I'm wondering how much support Neural Speed has for NUMA systems. The Advanced Usage page suggests that all tensors should be allocated on the first NUMA node (`numactl -m 0 -C 0-<physic_cores-1>`). Is there any benefit to doing this?
I previously thought that this binds all memory allocations to the first NUMA node. However, that would increase inter-node traffic significantly, and each thread wouldn't be able to fully utilize the memory bandwidth when the topology gives different nodes different memory affinities. Is my understanding correct? Could you kindly explain a bit more why the memory allocations aren't interleaved instead?
Intel Xeon systems often have 2 sockets; `-m 0` binds the memory to the first socket.
There is communication overhead between the 2 sockets. If you want to reduce inter-node traffic, you can try our TP (tensor parallelism).
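For reference, the two memory policies being discussed can be expressed as `numactl` invocations like the following. This is a sketch, not from the Neural Speed docs: the binary name `./run_llm`, the model path, and the core range `0-55` (a hypothetical 56-core socket) are placeholders you would replace with your own.

```shell
# Policy suggested by the Advanced Usage page: bind all allocations to
# NUMA node 0 and pin threads to that socket's physical cores.
# Every thread then accesses local memory only (no cross-socket traffic),
# but total bandwidth is capped at one socket's memory controllers.
numactl -m 0 -C 0-55 ./run_llm -m model.bin

# Alternative discussed above: interleave pages round-robin across all
# nodes. Aggregate bandwidth improves, but roughly half of all accesses
# become remote, which is why it isn't the default recommendation here.
numactl --interleave=all ./run_llm -m model.bin
```

Whether binding or interleaving wins depends on whether the workload is bandwidth-bound across sockets or latency-sensitive on one; benchmarking both on your topology (`numactl --hardware` shows the node layout) is the safest way to decide.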