How to solve Connection reset by remote peer? #985
Hi @sunkenQ, apologies for the late reply but I've been out of the office for the past couple of weeks. Could you elaborate on what you mean by "stress test"? Do you run the same code many times in sequence, many times all in parallel, or in some other situation? From the looks of this error it seems that you're hitting race conditions during the phase where connections are established. In that situation we require the use of the UCX stream API, which can cause such collisions, and unfortunately there's no good solution for that currently. Exchanging peer information is required so that endpoints know which tag to use to communicate both ways when using the tag API.
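[Editor's note] If collisions during connection establishment are the issue, one workaround is to retry endpoint creation with a backoff. This is a sketch, not an official UCX-Py recipe; it assumes the failure surfaces as `ucp.exceptions.UCXError` (adjust to whatever exception your setup actually raises):

```python
import asyncio
import ucp

async def create_endpoint_with_retry(host, port, attempts=5, backoff=0.1):
    """Retry ucp.create_endpoint when concurrent handshakes collide."""
    for attempt in range(attempts):
        try:
            return await ucp.create_endpoint(host, port)
        except ucp.exceptions.UCXError:
            if attempt == attempts - 1:
                raise
            # Exponential backoff spreads out colliding connection attempts.
            await asyncio.sleep(backoff * 2**attempt)
```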
@pentschev Thanks for your reply! In addition, I also tested a single client sending a request to services A and B every two seconds as a scheduled task. The log shows the first memory increase at line 167:

```
Line #    Mem usage       Increment  Occurences   Line Contents
============================================================
   154   303.3359 MiB   303.3359 MiB          1   @profile(precision=4,stream=open('send_data.log','w+'))
   155                                            async def send_data(self, keepalive_request: bool, request_type: str = "update"):
   156                                                """
   157                                                Sends the data to the server.
   158
   159                                                Args:
   160                                                    keepalive_request: Indicates if it's a keepalive request.
   161                                                    request_type: The type of inference ("update" or "infer").
   162                                                """
   163   303.3359 MiB     0.0000 MiB          1       if request_type == "update":
   164                                                    ep = await ucp.create_endpoint(self.server_host, self.update_port)
   165                                                    await ep.send(self.update_buffer)
   166                                                else:
   167   313.1328 MiB     9.7969 MiB          2           ep = await ucp.create_endpoint(self.server_host, self.infer_port)
   168   313.1328 MiB     0.0000 MiB          1           await ep.send(self.infer_buffer)
   169
   170   313.1328 MiB     0.0000 MiB          1       if keepalive_request is True:
   171                                                    ok = await ep.recv_obj()
   172                                                    ok = ok.decode("utf-8")
   173   313.3555 MiB     0.2227 MiB          2       await ep.close()
```

Below is the memory trend graph:

[memory trend graph not captured]

If you need any other information, please let me know at any time, thank you!
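[Editor's note] Since the profile attributes the growth to the `ucp.create_endpoint` line, one pattern that may be worth trying is reusing a long-lived endpoint per port instead of creating and closing one per request. This is a sketch under the assumption that the server keeps connections open; the `Client` class and attribute names are illustrative, loosely mirroring the profiled snippet above:

```python
import ucp

class Client:
    def __init__(self, server_host, update_port, infer_port):
        self.server_host = server_host
        self.ports = {"update": update_port, "infer": infer_port}
        self._eps = {}  # one cached endpoint per request type

    async def _get_endpoint(self, request_type):
        """Return a cached endpoint, creating a new one only if needed."""
        ep = self._eps.get(request_type)
        if ep is None or ep.closed():
            ep = await ucp.create_endpoint(
                self.server_host, self.ports[request_type]
            )
            self._eps[request_type] = ep
        return ep

    async def send_data(self, buffer, request_type="update"):
        ep = await self._get_endpoint(request_type)
        await ep.send(buffer)  # endpoint stays open for the next request
```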
This sounds a lot like the problem is related to the transport that gets used to transfer. Depending on your setup it's likely that …

From what I've heard from UCX developers, we should be OK with up to some 16K endpoints, but I'm not sure now if you're supposed to have at most only …

Would you be able to provide a minimal reproducer and/or logs for this? If you can't provide a minimal reproducer, then at least a log of the processes for fewer iterations (for now let's say …) would help.

Based on your description of your use case, I believe it may be somewhat similar to this test, therefore I would suggest having a look to see if they are similar, and if not, we would at least like to see what's different between the two cases.
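[Editor's note] For gathering such logs, something like the following may help. `UCX_LOG_LEVEL` controls UCX's own transport-level logging and `UCXPY_LOG_LEVEL` the Python layer; both must be set before UCX-Py is initialized, so the import comes last:

```python
import os

# Enable verbose logging; UCX reads these at initialization time.
os.environ["UCX_LOG_LEVEL"] = "debug"    # UCX core transport logs
os.environ["UCXPY_LOG_LEVEL"] = "debug"  # UCX-Py Python-level logs

import ucp  # noqa: E402  (imported after the env vars so UCX picks them up)
```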
I constructed a simple server using UCX-Py to continuously accept data. During a stress test, I found that some requests would be lost due to network fluctuations. I guess UCX-Py/UCX has some setting that limits how long it keeps trying to connect.

I don't want the server to send the client a return value to confirm whether the connection succeeded, because I found in my experiments that waiting for a return value greatly increases the time each request takes. Is there another way to avoid losing requests due to network fluctuations?
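[Editor's note] One possible client-side alternative to a server acknowledgement is to bound each send with a local timeout and retry on a fresh endpoint. This is a sketch; `send_with_timeout` is a hypothetical helper, not a UCX-Py API, and note that a local timeout cannot confirm delivery the way an acknowledgement can:

```python
import asyncio
import ucp

async def send_with_timeout(host, port, buffer, timeout=5.0, retries=3):
    """Send a buffer, retrying on a new endpoint if the network stalls."""
    for attempt in range(retries):
        ep = await ucp.create_endpoint(host, port)
        try:
            await asyncio.wait_for(ep.send(buffer), timeout=timeout)
            return True
        except asyncio.TimeoutError:
            pass  # network stall; fall through and retry on a fresh endpoint
        finally:
            if not ep.closed():
                await ep.close()
    return False
```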
Simple Server code:
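[Editor's note] The original snippet was not captured in this thread; the following is a minimal reconstruction of what such a server could look like, following the listener pattern from the UCX-Py documentation. The port, buffer size, and reply are assumptions, chosen to match the client code in the profile above (`ep.send` on the client paired with `ep.recv` here, and the client's `recv_obj()` paired with `send_obj`):

```python
import asyncio
import numpy as np
import ucp

PORT = 13337             # assumed port
BUFFER_SIZE = 2**20      # assumed fixed message size, in bytes

async def handle_client(ep):
    # Receive one request from the client into a preallocated host buffer.
    buf = np.empty(BUFFER_SIZE, dtype="u1")
    await ep.recv(buf)
    # Reply matching the client's recv_obj() in its keepalive path.
    await ep.send_obj(b"ok")
    await ep.close()

async def main():
    lf = ucp.create_listener(handle_client, PORT)
    # Keep the listener alive to continuously accept connections.
    while not lf.closed():
        await asyncio.sleep(0.1)

if __name__ == "__main__":
    asyncio.run(main())
```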
ERROR Message:

[error log not captured in this thread; per the issue title, the error is "Connection reset by remote peer"]