
XDMA: Low performance numbers on Gen3x16 #293

Open · DimanYLT opened this issue Aug 8, 2024 · 8 comments

@DimanYLT commented Aug 8, 2024

Hi friends! I am measuring the performance of XDMA on the Z19-P board over PCIe Gen4 x16 (the IP core only supports Gen3 x16), and I cannot reach the theoretical speed of at least 12 GB/s; I only get 5–6 GB/s.
My system is Fedora 39 with Linux kernel 6.5.2.
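
For reference, the raw PCIe Gen3 ceilings behind these numbers can be worked out from the line rate alone. This is a minimal sketch; it deliberately ignores TLP header, DLLP and flow-control overhead, which costs a further several percent depending on the maximum payload size:

/* Back-of-the-envelope PCIe Gen3 ceilings: 8 GT/s per lane, 128b/130b encoding.
 * Protocol overhead (TLP headers, DLLPs, flow control) is ignored here, so
 * sustainable rates are lower than these raw numbers. */
#include <stdio.h>

int main(void)
{
	const double gt_per_s = 8e9;            /* 8 GT/s per lane */
	const double encoding = 128.0 / 130.0;  /* 128b/130b line coding */
	const int lanes[] = { 4, 8, 16 };

	for (int i = 0; i < 3; i++) {
		double gbytes_per_s = gt_per_s * encoding * lanes[i] / 8.0 / 1e9;
		printf("Gen3 x%-2d: %.2f GB/s raw\n", lanes[i], gbytes_per_s);
	}
	return 0;
}

That puts the raw ceilings at roughly 3.94 GB/s (x4), 7.88 GB/s (x8) and 15.75 GB/s (x16); after protocol overhead, something in the 12–13 GB/s range is a realistic target for Gen3 x16, consistent with the 12 GB/s figure above, and 5–6 GB/s is well below it.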

After extensive work I have tried every debugging option available to me, but the speed stays the same. Here are my observations and assumptions:

  1. The processor is never fully utilized; it peaks at about 60%. I find this strange because, for example, dd of=/dev/null if=/dev/zero bs=1MB count=10000 loads a core to 100%, while dd of=/dev/null if=/dev/xdma0_c2h_0 bs=1MB count=10000, which really does transfer the bytes via XDMA (I checked), shows only 60% and the same bandwidth. This detail is one of my arguments that the programs I am using (dma_from_device.c, etc.) are not the limiting factor; rather, the driver is. Even the venerable dd cannot push the transfer any faster. Perhaps the current state of the driver is incompatible with some component of the system, for example the new Fedora 39, and XDMA simply does not deliver full performance because of a bug. (A sketch of such a measurement, including CPU-time accounting, follows the code listing after this list.)

  2. The Hardware Numbers program (the second graph) shows excellent results. Xilinx provides this program precisely so that we can see the potential performance of the PCIe interface without any software or driver involved. So the problem is definitely not in the hardware, but somewhere in the OS or in XDMA.

  3. When I reconfigured the XDMA IP core from Gen3 x16 to Gen3 x8, I expected to see the same 5–6 GB/s as with Gen3 x16, but I got 2.2 GB/s. In both configurations the result is roughly 30% of the theoretical maximum, so something cuts the speed by about 70% during operation.

  4. The speed also depends on which Linux kernel I build and insmod xdma against: on 6.5.2 it runs 10–20% faster than on 6.9.9.

  5. Trying different versions of the dma_ip_drivers repository does not change the results significantly.

  6. Previously, I had a problem with poll mode: it performed even worse. With the help of #define XDMA_DEBUG 1 (AR71435) I fixed that issue, but it did not help the overall performance. I did not find anything else strange in the debug log, except perhaps that a strange number of descriptors is allocated: for example, for a 1 MB transfer, 255 descriptors are allocated but only 16 are actually used (the log also prints "nents 16/256"). These messages appear in dmesg from the following function in libxdma.c, line 3040:

#ifdef __LIBXDMA_DEBUG__
static void sgt_dump(struct sg_table *sgt)
{
	int i;
	struct scatterlist *sg = sgt->sgl;

	pr_info("sgt 0x%p, sgl 0x%p, nents %u/%u.\n", sgt, sgt->sgl, sgt->nents,
		sgt->orig_nents);

	for (i = 0; i < sgt->orig_nents; i++, sg = sg_next(sg))
		pr_info("%d, 0x%p, pg 0x%p,%u+%u, dma 0x%llx,%u.\n", i, sg,
			sg_page(sg), sg->offset, sg->length, sg_dma_address(sg),
			sg_dma_len(sg));
}
#endif /* __LIBXDMA_DEBUG__ */
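
As a follow-up to observation 1, here is a minimal user-space sketch that measures both wall-clock throughput and CPU time for reads from the c2h character device, so that "CPU-bound" and "waiting on the engine" can be told apart. The device path /dev/xdma0_c2h_0 comes from the dd commands above; the 1 MiB buffer, the iteration count and the page alignment are arbitrary assumptions, not values taken from dma_from_device.c:

/* Sketch: time repeated reads from the XDMA c2h character device and report
 * both wall-clock throughput and CPU time consumed, to see whether the
 * transfer is CPU-bound or mostly spent waiting on the DMA engine.
 * Assumptions: /dev/xdma0_c2h_0 exists (as in the dd test above); the 1 MiB
 * buffer and 1024 iterations are arbitrary choices. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <time.h>
#include <unistd.h>

#define XFER_SIZE (1UL << 20)   /* 1 MiB per read, like bs=1MB in dd */
#define ITERATIONS 1024

static double ts_to_s(struct timespec t) { return t.tv_sec + t.tv_nsec / 1e9; }
static double tv_to_s(struct timeval t)  { return t.tv_sec + t.tv_usec / 1e6; }

int main(void)
{
	int fd = open("/dev/xdma0_c2h_0", O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	void *buf;
	if (posix_memalign(&buf, 4096, XFER_SIZE)) { perror("posix_memalign"); return 1; }

	struct timespec t0, t1;
	struct rusage r0, r1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	getrusage(RUSAGE_SELF, &r0);

	size_t total = 0;
	for (int i = 0; i < ITERATIONS; i++) {
		ssize_t n = read(fd, buf, XFER_SIZE);
		if (n < 0) { perror("read"); break; }
		total += (size_t)n;
	}

	clock_gettime(CLOCK_MONOTONIC, &t1);
	getrusage(RUSAGE_SELF, &r1);

	double wall = ts_to_s(t1) - ts_to_s(t0);
	double cpu  = (tv_to_s(r1.ru_utime) + tv_to_s(r1.ru_stime))
		    - (tv_to_s(r0.ru_utime) + tv_to_s(r0.ru_stime));

	printf("%zu bytes in %.3f s -> %.2f GB/s, CPU time %.3f s (%.0f%% of wall)\n",
	       total, wall, total / wall / 1e9, cpu, 100.0 * cpu / wall);

	close(fd);
	free(buf);
	return 0;
}

If the reported CPU time is much smaller than the wall-clock time (as the 60% figure suggests), the process spends most of the transfer sleeping in the driver waiting for completions rather than copying data, which would point at the driver/engine side rather than at the user-space tools.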

The first graph shows the final results of my measurements with honest figures.

I would be very grateful for any hints from you.

[Attached graphs: "Final" (measurement results) and "hwN_final" (Hardware Numbers)]

@MischaBaars

Dmitriy,

I have similar problems with XDMA PCIe3x8 data transfers into BRAM, but not as severe as yours. The maxima I found are 3,381.66 MB/s read speed and 3,399.44 MB/s write speed for a maximum packet size of 256 KB (limited by the number of BRAM blocks in the FPGA). That is a lot better than 2.2 GB/s, but still less than half of the theoretical maximum according to Wikipedia. Your graph shows similar values for 256 KB packets. Do you think it makes sense to make the packets larger? It looks like your curve is still picking up speed at small packet sizes, while mine is concave from the start. I have not touched any IRQ-related driver settings yet; independent of the packet size, there might be room for improvement there, but for now my setup is under-performing just like yours.

Mischa.

@DimanYLT (Author) commented Aug 9, 2024

Mischa,

You have created very interesting graphs; however, I have not worked with BRAM and am an absolute beginner with FPGAs; this project is my first. I am only reading data from a FIFO, and its size is limited to one megabyte, so for testing I simply stream meaningless data ('1, i.e. all ones) into the c2h channel.

Did I understand correctly that you are copying data from the computer and transferring it to BRAM on the FPGA?

Diman.

@martindubois

Hello,

I am a specialist in device driver development on Windows and Linux. I developed DrvDMA (https://www.kms-quebec.com/Cards/0041_en.pdf), a driver that supports the XDMA engine. I would be happy to find a way to test my driver with your specific hardware to see whether the performance is better.

So far I have tested it on a Gen3 x4 link and get ~20 Gb/s (~2.5 GB/s) from the card to PC memory, but in that setup the speed was limited by the FPGA-side data processing.

Do you use an FPGA dev kit? If yes, which one? Would you be willing to share your bitstream?

Regards,
Martin

@MischaBaars

Hi Diman,

I've been doing mostly C/C++ and assembly all these years. Same here: I am also inexperienced with FPGAs.

I am indeed copying data from the computer into BRAM and back into main memory. In fact, the graph shows the speed tests for all the different input combinations of the 'PG195: AXI4 Memory Mapped Default Example Design'. That's how far I've come. Not very far yet.

Mischa.

@DimanYLT (Author)

Hi Martin,

Thank you for your interest, but I don't think the bitstream will give you much to work with, because you would need our board anyway. I would recommend creating an AXI-Stream example design and testing with that.

Diman.

@DimanYLT (Author)

Hi Mischa,

Then I think this unusual graph is indeed related to your computer's memory limitations. Could you tell me what program you're using to measure bandwidth?

Diman.

@martindubois

Hi Diman,

Thanks for taking the time to answer. If you want to try DrvDMA with your board, we can discuss it.

Regards,
Martin

@MischaBaars commented Aug 17, 2024

Hi Diman,

No, it does not have anything to do with the computer's main memory. These are DDR4-3200 DIMMs and they are fast enough; I wouldn't have measured 17 Gb/s read speed from the Kingston Fury Renegade PCIe 4.0 x4 SSD otherwise. I think the FPGA transfer speed is limited by the BRAM, but I can only be sure once I've tried DRAM plus a memory controller instead of the BRAM. Another factor that could influence the transfer speed is the use of IRQs, which are disabled in the current driver, if I'm correct.

To answer your question, I've written my own software to measure the transfer speed (by placing timers around the read and write system calls).

Mischa.
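
For completeness, a minimal sketch of the measurement Mischa describes: timers placed directly around the write() into the h2c channel and the read() back from the c2h channel. The device names /dev/xdma0_h2c_0 and /dev/xdma0_c2h_0, the 256 KiB size and the absence of any address seek are illustrative assumptions, not Mischa's actual code; with the AXI4 Memory Mapped example design one would typically lseek() to the BRAM base address before each transfer.

/* Sketch: time a single h2c write and c2h read, as described above.
 * Device names and the 256 KiB transfer size are illustrative assumptions. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define SIZE (256UL * 1024)  /* 256 KiB, the maximum packet size mentioned above */

static double now_s(void)
{
	struct timespec t;
	clock_gettime(CLOCK_MONOTONIC, &t);
	return t.tv_sec + t.tv_nsec / 1e9;
}

int main(void)
{
	void *buf;
	if (posix_memalign(&buf, 4096, SIZE)) { perror("posix_memalign"); return 1; }

	int h2c = open("/dev/xdma0_h2c_0", O_WRONLY);
	int c2h = open("/dev/xdma0_c2h_0", O_RDONLY);
	if (h2c < 0 || c2h < 0) { perror("open"); return 1; }

	double t0 = now_s();
	ssize_t w = write(h2c, buf, SIZE);   /* host -> FPGA (e.g. BRAM) */
	double t1 = now_s();
	ssize_t r = read(c2h, buf, SIZE);    /* FPGA -> host */
	double t2 = now_s();

	if (w > 0) printf("write: %.2f MB/s\n", w / (t1 - t0) / 1e6);
	if (r > 0) printf("read : %.2f MB/s\n", r / (t2 - t1) / 1e6);

	close(h2c);
	close(c2h);
	free(buf);
	return 0;
}

Averaging over many iterations and several transfer sizes would smooth out the per-call overhead, which dominates for small transfers.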
