
Deployed to two servers, client reported an error during testing #10

Open
ligang2425 opened this issue Sep 22, 2023 · 22 comments

@ligang2425

Hi, please let me ask you a question: when I deploy the server side and client side of UDPST on different servers for a simulation test, I encounter the problem below.
UDP port 25000 is open and can be accessed.
The client runs a command like this: ./udpst -u <ip>
The server runs a command like this: ./udpst
[screenshot: UDPST output]
Is there something I'm missing or overlooking?

@lc9892

lc9892 commented Sep 22, 2023

Hello. The basic usage you show should be fine, but there are a couple of things to check. In addition to port 25000 being reachable on the server, the server must also be reachable on all UDP ephemeral ports (32768 - 60999 as of the Linux 2.4 kernel, available via cat /proc/sys/net/ipv4/ip_local_port_range). This is because after the setup request is received by the server on the control port (25000), it opens a UDP ephemeral port for the data traffic. The client needs to be able to finish the setup to that port and then use it for the data transfer. So, if there's a firewall in front of the server, it generally needs to open UDP 25000 and 32768 - 60999.
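As an illustration only (assuming a plain iptables firewall on the server and the default ephemeral port range; adapt to whatever firewall you actually use), the rules could look like:

iptables -A INPUT -p udp --dport 25000 -j ACCEPT        # udpst control port
iptables -A INPUT -p udp --dport 32768:60999 -j ACCEPT  # UDP ephemeral ports for the data traffic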

Another suggestion, in case your server has multiple routing options on more than one NIC, is to bind the server to the local NIC interface address. Just specify the local IP address on the server command line (e.g., ./udpst <local IP address>).

If neither helps, can you please provide the output (from both sides) with the "-v -D" options?

Thanks

@ligang2425
Author

Hello, I'm sorry for replying so late. I have confirmed that all UDP ephemeral ports were open when I created the server. As you suggested, I ran with -v and -D, and the results are as follows. Please take a look at what the problem might be.
[screenshot: client output]

@lc9892

lc9892 commented Oct 5, 2023

Your error of "LOCAL WARNING: Incoming traffic has completely stopped" when doing an upstream test seems to indicate that the status feedback messages from the server are not making it back to the client. Can you try these options...

  1. Also run the server with "-v -D" and include that output.
  2. Include the interface IP on the server command line to bind the process to the specific IP interface.
  3. Use the "-j" option on both client and server (to keep all datagrams within 1500 byte MTU).
  4. Disable GSO on client and server via "cmake -D HAVE_GSO=OFF ."
  5. Run a local client and server on both machines (using 127.0.0.1) to verify it functions when not going over a network.

@ligang2425
Author

Hello, thank you for your continued attention. Following your suggestions, I also added the -v, -D, and -j parameters on the server side. Unfortunately, I don't quite understand what the interface IP you mentioned refers to. Also, the two machines currently cannot reach each other, so I will try running client and server locally later on and will keep studying this issue. The following is the output of the client after it sends the request:
[screenshot: client output]
The following is the server output, omitting some duplicate information:
[screenshot: server output 1]
[screenshot: server output 2]
Please take a look at what the problem is.

@lc9892

lc9892 commented Oct 6, 2023

The one thing I mentioned is to provide the local address for the server to bind to (instead of "Awaiting setup requests on :::25000"). So, on the server provide the local IP address "./udpst -v -D -j 10.53.5.253" (or whatever the local IP address of the test interface is). This will make sure that the server only communicates over that specific interface (which can be an issue when a server has multiple interfaces).

Also, on both machines try testing locally as a sanity check. So, in one terminal window do "./udpst -v -D -j 127.0.0.1" for the server and in another terminal window (on the same machine) do "./udpst -v -D -j -u 127.0.0.1" for the client. Make sure this runs as expected.

From the output you provided it's difficult to see what the issue might be; it looks like traffic starts and then stops. Assuming the previous local testing works, there could be a problem with UDP between those devices. It may be worth trying another tool such as iPerf to test UDP between those machines. Make sure to test with UDP for a valid comparison.
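For example, with iperf3 (assuming it's installed on both machines; the 100M rate is just an arbitrary starting point):

iperf3 -s                            # on the server machine
iperf3 -c <server IP> -u -b 100M     # on the client machine (UDP, upstream direction)
iperf3 -c <server IP> -u -b 100M -R  # add -R to test the reverse (downstream) direction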

Lastly, I see you're only doing upstream tests. Have you tried downstream tests? Do they work?

@ligang2425
Author

Hello, I'm very sorry. I was busy with other work for a while, so I didn't follow up here.
Thank you very much for your attention. Regarding the question I asked before, I apologize: there was a problem with the access rights of my public IP, which is why access was unsuccessful.
After the IP issue was resolved, I ran the test between two EC2 servers, one client and one server, each with its own public IP. The client ran ./udpst -u <public IP> -v -D -j and the server ran ./udpst -v -D -j, and the test worked normally.
But if I put the client in an Ubuntu virtual machine on a computer behind my company's network and run it the same way, I see the situation below. By the way, running ./udpst -v -D -j 10.53.5.253 on the server side also produces the following results:
Upload test:
Client:
[screenshot: client output]

Server:
[screenshot: server output 1]
[screenshot: server output 2]

Download test:
Client:
[screenshot: client output 1]
[screenshot: client output 2]
Server:
[screenshot: server output]

Does this have anything to do with NAT? Were the two disconnected during the test, or can you see what prevented them from establishing a normal connection?
It must be mentioned that if I run the server and client on the same computer, the test succeeds!
Thanks again for your attention and I hope to find some inspiration from you!

@lc9892

lc9892 commented Oct 23, 2023

Thanks for the detailed info. I don't think the issue is NAT/PAT, since the setup handshake (setup request/response and test activation request/response) seems to be getting processed. Instead, I would guess that your issue has to do with your company's network security devices (firewall, intrusion detector, packet inspector, ...), since it appears that the load traffic starts but is then shut down. After that the software watchdog expires and the test is ended. This isn't very surprising with advanced security devices, since the load traffic we test with can easily be mistaken for a DoS/DDoS attack (i.e., a high-rate UDP stream).

@ligang2425
Author

Thank you for your concern. Can you tell from the output of the client as well as the server side that this is due to the client not receiving a response?
If I understand you correctly: after the client receives the activation response from the server in the upload test, it starts to send the load data, but due to the firewall on the client's side or some other security measure, the client doesn't receive the responses from the server, so it doesn't continue to send the load test data. Is that right?
Also, what does "skipping status transmission, awaiting initial load PDUs" mean?
Thanks again for your patience!

@lc9892

lc9892 commented Oct 26, 2023

You are basically correct. In the client output for the downstream test (DEBUG Status Feedback...) you can see that the Mbps values are greater than 0.00 at first and the RTTVar is not -1 for a few samples. That means there was some initial traffic in both directions. But quickly the Mbps goes to 0.00, and a little bit later you see the "traffic has completely stopped" message - so it looks like the load traffic got blocked. In the upstream direction it's a little harder to see, because it appears that no load PDUs ever make it to the server ("Skipping status transmission..."). Again, I can understand how a security device might see the downstream test as a DoS attack after the load starts to ramp up. And an upstream test may be seen as an infected company machine (or bot) that is sending out to the Internet. And since our port number and protocol are not well known (especially for UDP), a security device can't really validate it like it might a standard Ookla test. So just to summarize, the protocol is set up to allow only 3 seconds of no datagram reception (in either direction) before the software aborts the test.

@ligang2425
Author

Thanks for the answer.
I did a packet capture during the upload test and found that the client only sends the setup request and test activation request and receives the responses from the server; other than that, there is no related data transfer from the client side, as shown in the figure:
[screenshot: VM upload test packet capture]

What confuses me is why the client doesn't do anything after the test connection is established. Is this process initiated by the server? What is the process of interaction between them after the connection is established?
By the way, during the download test on the client side there is a lot of data transfer in the packet capture.
Thank you very much for your attention!

@ligang2425
Author

One thing to add: I used tcpdump on the server side to capture packets during the test, and found that the server also shows only the four packets mentioned above - the setup request and response and the test activation request and response - which is very confusing to me. As shown in the figure:
[screenshot: server-side packet capture]

Why do both of them remain silent after the connection is established?

@lc9892

lc9892 commented Oct 30, 2023

Well, for an upload test it makes sense that the server capture wouldn't show any transmissions because it is waiting for an initial load PDU before sending the first status feedback message. However, for an upload test, a capture on the client machine should show load PDUs start to go out after the Test Activation response is received (they would use the same port numbers as the Test Activation PDUs). The client wouldn't know that the network might block those packets. Is it possible that the capture is missing or filtering the load PDUs? Can you test the capture mechanism with an upload test running to another machine locally (to make sure that it sees the packets when a test is working)?
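When checking this, note that a filter on port 25000 alone would miss the load PDUs, since they run on an ephemeral server port. A capture along these lines (just an example; substitute your actual interface and addresses) should catch everything:

tcpdump -i <interface> -n udp and host <server IP>   # all UDP to/from the server, any port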

@twieskot

Hi,

I just want to mention that after updating from udpst 7.X.X to the new version, we've experienced the same problem ("Minimum connection required (1)" / "Incoming traffic has stopped completely") on a lot of different hardware platforms.
We are using udpst on several embedded device platforms.

We've tried the following things to get it working:

- We checked the system firewall; the test setup seems to be fine (same as for version 7.X.X).
- We tried building without GSO -> no effect.
- Disabling jumbo frames had an effect on some devices.

We also experienced that the upload test sometimes only reaches about 10 Mbit/s, or returns a totally wrong value (~4 billion Mbit/s). The exact same firmware with the old version works fine on all these devices. For now, we are telling all our customers to keep using the old (7.X.X) version (which works really well here).

tr471_udpst_measurements.pdf

udpst-server-logs.txt

At the moment we are busy with a lot of other features, so the minimum-effort option is to continue with 7.X.X - but for the future it would be nice to support the newest version of the protocol.

Thanks a lot for the good work

@lc9892

lc9892 commented Feb 1, 2024

Thank you for the feedback. We'd certainly like to resolve these if possible.

As for the large values being returned on the MaxLinear Seale grx550, it may be due to the introduction of a 64-bit value ("uint64_t rxBytes; // Received bytes") in the "subIntStats" structure -- and specifically the conversion to/from network byte order. Can you confirm the endianness of that system via: lscpu | grep "Byte Order"

If not little endian, and you're willing to do a simple code change to test this, you could modify the ntohll(x)/htonll(x) macros in udpst_common.h to test for it (as it should). Something like...

#ifndef ntohll
#  if __BYTE_ORDER == __LITTLE_ENDIAN
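    /* on little-endian hosts: ntohl() each 32-bit half of x and swap the halves */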
#    define ntohll(x) (((uint64_t) (ntohl((int) ((x << 32) >> 32))) << 32) | (unsigned int) ntohl(((int) (x >> 32))))
#  else
#    define ntohll(x) (x)
#  endif
#endif
#ifndef htonll
#define htonll(x) ntohll(x)
#endif

Thanks

@lc9892

lc9892 commented Feb 9, 2024

I just wanted to follow up on that first issue (large values shown). I completely understand that you are very busy, but would you be able to do a quick check on the endianness of the MaxLinear Seale grx550 via: lscpu | grep "Byte Order"
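In case lscpu isn't available on the target, a generic alternative (just a shell trick, nothing udpst-specific) is:

printf '\001\000' | od -An -tx2   # prints 0001 on a little-endian host, 0100 on a big-endian host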

Thank you

@twieskot

Hi, sorry for the delay.

The fix in your comment above solved the issue with the "big result".
The byte order of the "MaxLinear Seale grx550" is big endian.

Now we are running into the same problem as on the "Hawkeye 2.2 GHz, qcaarmv8": we measure only up to about 10 Mbit/s (not more) in the upstream direction. A log file with the "broken" and the "corrected - but slow" tests is attached.

result_comparison.txt

Updated test table (still other open issues):
tr471_udpst_measurements.pdf

Let me know if I can help with further tests or data

@lc9892

lc9892 commented Feb 19, 2024

Thank you for the confirmation. We'll make sure that fix is part of the next point release.

As for the low upstream rate, the first thing is to make sure you disable jumbo frames via "-j" if they are not available. Also, if 1500-byte packets are supported end-to-end without fragmentation, you could use the "-T" option to slightly increase the max packet size from 1250 to 1500. This certainly helps... along with making sure the binary is compiled as 64-bit when the OS is 64-bit (particularly with ARM processors).
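A quick generic way to verify the last point (nothing udpst-specific) is to inspect the binary:

file ./udpst   # should report "ELF 64-bit ..." on a 64-bit OS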

However, in your case the most significant thing to try for a higher upstream rate is to disable GSO (via "cmake -D HAVE_GSO=OFF .")...on the assumption that it may not be fully supported. The interesting thing is that you are seeing loss in the very first sub-interval (when traffic is the lowest), along with reordering. If this does not resolve your issue, could you provide debug output (via "-vD") on both the client and server. This way we can see the socket buffer levels and loss progression during the test.
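For reference, the rebuild-and-retest sequence could look something like this (assuming the usual cmake/make workflow in the source directory):

cmake -D HAVE_GSO=OFF . && make   # rebuild with GSO disabled
./udpst -v -D <local IP>          # server, with debug output
./udpst -v -D -u <server IP>      # client, with debug output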

Thanks

@twieskot

Hi, I performed new test runs on the different platforms with the byte-order fix (see comment above) and with "cmake -D HAVE_GSO=OFF" on the client side (my embedded devices). All devices have a 32-bit architecture.

Here is the updated test-matrix:
tr471_udpst_measurements.pdf

It's looking better, but there are still remaining issues:

"Intel Puma 7" -> Downstream/Upstream test is not working at all
Hawkeye 2.2 GHz (qcaarmv8) -> Downstream-Test with JumboFrames is too slow

Here are the log files (server/client side with options -v -D):
udpst_test_log.zip

And the wireshark capture files (too big as attachment):
https://download.kotti.me/udpst/puma7_downstream_test_capture.pcap
https://download.kotti.me/udpst/puma7_upstream_test_capture.pcap
https://download.kotti.me/udpst/qcaarmv8_downstream_slow.pcap

Note: The exact same device firmware with udpst 7.XXX is working on the same test setup / device (see matrix pdf).

Thanks for your support...

@lc9892

lc9892 commented Feb 28, 2024

Thanks for the details - I'm still going through it. Some of the fields in the test output, as well as the pcap, show some pretty crazy values (corruption of some kind). So if you have a chance, I think it would be worth trying to disable the remaining optimizations on the Intel Puma 7 devices:
cmake -D HAVE_RECVMMSG=OFF .
cmake -D HAVE_SENDMMSG=OFF .
This would disable one of the biggest deltas from 7.5.1 to now.

And unless your network interface is configured with a 9000-byte MTU, it makes sense to stick with jumbo sizes disabled. We've seen a lot of issues on some devices with fragmentation (e.g., insufficient Fragment Reassembly Memory - see README).
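For what it's worth, on Linux the reassembly limits can be inspected with a generic sysctl query (not udpst-specific):

sysctl net.ipv4.ipfrag_high_thresh net.ipv4.ipfrag_low_thresh   # reassembly memory limits in bytes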

Thanks

@twieskot

twieskot commented Mar 5, 2024

OK - unfortunately this does not seem to help. New logs (built with -D HAVE_RECVMMSG=OFF -D HAVE_SENDMMSG=OFF):

new_testrun.zip

I will also try to debug this issue in the near future.

The same setup with the old udpst version (just the udpst binary exchanged) is working.
The log files are in the zip above as well.
In the old release we had udpst 7.4.0 on the device - maybe I should compare it with 7.5.1 as well.

@lc9892

lc9892 commented Mar 7, 2024

Well, that rules out a number of things - so it was a good experiment. And your idea to try 7.5.1 is a good one. In general, both directions are showing very strange values or some type of corruption of data fields.

And for completeness, can you include the output of an "lscpu" command?

@twieskot

lscpu is not available on the system, but here is the output of "cat /proc/cpuinfo":
cpuinfo.txt

I also tested with v7.5.1 now - the downstream/upstream test works fine (same results as for 7.4.0).
