Storage node overloads and returns 504, Very urgent !!! #1175
@NewDund The error enotconn we can see in crash.log happens when a TCP connection is not established, so I suspect some network trouble occurred around that time and some connections were forcefully closed by network switches placed between LeoGateway and LeoStorage. I'd recommend you contact your infra team (or, if you run LeoFS on a public cloud, contact the provider) to confirm whether something went wrong around that time.
Just in case, please give us the error log files on LeoStorage if they exist.
Any network-related system metrics would also be helpful, if your system gathers that information regularly.
@mocchira
@mocchira
@NewDund I can see many error lines including the word sending_data_to_remote. Let me confirm: under normal circumstances, does this error line appear at such a high frequency? Since sending_data_to_remote means "LeoStorage is trying to do MDCR with remote clusters, so it cannot take on another replication task right now; try again after a certain period of time", MDCR seems to have become a bottleneck, and that might be what pushed LeoStorage into a high-load situation. One possible reason why MDCR becomes a bottleneck is your 150 Mbit bandwidth limit, so please check the bandwidth actually used at the time the incident happened. If the network bandwidth is saturated, you may have to consider raising the bandwidth to a larger value such as 200-300 Mbit.
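If no monitoring system is in place, something like the following could help check the bandwidth actually in use. This is a minimal sketch, assuming a Linux host and an interface name of eth0 (not part of LeoFS; adjust the interface to whichever NIC carries the MDCR traffic):

```python
#!/usr/bin/env python3
"""Sample NIC throughput by reading /proc/net/dev at a fixed interval.

Minimal sketch for a Linux host; the interface name "eth0" is an assumption.
"""
import time

IFACE = "eth0"       # assumed interface name
INTERVAL_SEC = 5     # sampling interval in seconds

def read_bytes(iface):
    """Return (rx_bytes, tx_bytes) for the given interface."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {iface} not found")

prev_rx, prev_tx = read_bytes(IFACE)
while True:
    time.sleep(INTERVAL_SEC)
    rx, tx = read_bytes(IFACE)
    rx_mbit = (rx - prev_rx) * 8 / INTERVAL_SEC / 1e6
    tx_mbit = (tx - prev_tx) * 8 / INTERVAL_SEC / 1e6
    print(f"rx: {rx_mbit:.1f} Mbit/s  tx: {tx_mbit:.1f} Mbit/s")
    prev_rx, prev_tx = rx, tx
```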
@mocchira
@mocchira The main point I want to make is that the local cluster and the remote cluster should, in theory, be independent of each other: nothing that happens in one cluster should affect the other. That is, even if a sudden failure of communication between the two clusters causes object replication or cluster recovery to fail, it should not affect normal access requests on the other cluster. I think this should be the focus of optimization. NEED FIX ~
Got it, then the cause is probably network-related.
Right. The retry mechanism we currently use for asynchronous tasks, including MDCR, isn't sufficiently optimized (there is room for improvement), so we are going to improve it in a future release.
Well, anyway, I hope to see the related issues addressed in the release notes of future versions.
Yes, we will. First, I will file an issue for MDCR to deal with situations in which the network bandwidth between the two clusters is unstable.
@NewDund I will also share a recommended configuration that might make this kind of incident less likely to happen than with the default settings. Please wait a while.
@NewDund The setting mdc_replication.size_of_stacked_objs in leo_storage.conf, explained at https://leo-project.net/leofs/docs/admin/settings/leo_storage/, controls how much data is transferred to the remote cluster at once. The default is around 32MB. That said, this is probably too large for your environment (capped at 150 Mbit), so I'd recommend lowering it to something like 16MB (16777216). This might prevent the same kind of incident from happening again.
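For illustration, the change would look roughly like the excerpt below; the setting name and the 16MB value come from the comment above, while the surrounding comments are only a sketch.

```
## leo_storage.conf (excerpt)
## Lower the amount of data stacked per MDCR transfer
## from the ~32MB default to 16MB
mdc_replication.size_of_stacked_objs = 16777216
```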
Okay, thank you, but we're going to stop MDCR first and use self-written programs to synchronize the two clusters manually. If the project later provides a more complete solution, we will test it again.
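For reference, a one-way manual synchronization over the S3 API that LeoGateway exposes could look roughly like the sketch below. This is not the reporter's actual program; the endpoints, credentials, and bucket name are placeholders.

```python
#!/usr/bin/env python3
"""Copy objects from a source cluster to a destination cluster via S3.

Minimal one-way sync sketch; endpoints, credentials, and the bucket
name are hypothetical placeholders.
"""
import boto3
from botocore.exceptions import ClientError

SRC = boto3.client(
    "s3",
    endpoint_url="http://leo-gateway-a.example.com:8080",  # assumed source gateway
    aws_access_key_id="SRC_KEY",
    aws_secret_access_key="SRC_SECRET",
)
DST = boto3.client(
    "s3",
    endpoint_url="http://leo-gateway-b.example.com:8080",  # assumed destination gateway
    aws_access_key_id="DST_KEY",
    aws_secret_access_key="DST_SECRET",
)
BUCKET = "my-bucket"  # assumed bucket name

paginator = SRC.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Skip keys that already exist on the destination with the same size.
        try:
            head = DST.head_object(Bucket=BUCKET, Key=key)
            if head["ContentLength"] == obj["Size"]:
                continue
        except ClientError:
            pass  # not present on the destination yet
        body = SRC.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        DST.put_object(Bucket=BUCKET, Key=key, Body=body)
        print(f"copied {key}")
```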
OK. A little bit sad for us, but your service stability should come before anything else.
Thanks! I'm sure we will be able to improve MDCR stability thanks to your feedback, and once we are confident it meets your requirements, we'll get back to you.
@mocchira Please take a look at my logs. The following is the result returned by leofs-adm mq-stats:
And these are the error logs of my network management node:
Finally, the error log of my storage node:
@mocchira
@mocchira
@mocchira For one of my PUT requests, haproxy returned 200, but my leo_gateway did not receive the PUT request. Is this possible? If so, how can I judge whether a PUT succeeded?
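One way to double-check whether a PUT actually landed, independent of what the load balancer reported, is to read the object back through the gateway. A minimal sketch, assuming the S3 API exposed by LeoGateway, with a placeholder endpoint, credentials, bucket, and key:

```python
#!/usr/bin/env python3
"""Verify a PUT by issuing a HEAD for the same key through LeoGateway.

Sketch only: the endpoint, credentials, bucket, and key are placeholders.
"""
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client(
    "s3",
    endpoint_url="http://leo-gateway.example.com:8080",  # assumed gateway endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

def put_succeeded(bucket, key, expected_size=None):
    """Return True if the key exists (and optionally matches the expected size)."""
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
    except ClientError:
        return False
    return expected_size is None or head["ContentLength"] == expected_size

print(put_succeeded("my-bucket", "path/to/object"))
```

On the server side, leofs-adm whereis <bucket/key> should also show whether replicas of the key exist on the storage nodes, which can help distinguish a lost PUT from a gateway logging gap.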
@NewDund Sorry for the long delay. I can spare time to look into this tonight. |
@mocchira
My cluster has been functioning normally, but today something weird happened.
Between 17:58 and 18:18, my storage node was very heavily loaded, and my requests returned a lot of 504s. When I ran 'leofs-adm mq-stats storage-node', I found that 'leo_sync_obj_with_dc_queue' had become very high, but I didn't do anything today, so I don't know why.
Here are my crash.log and error logs.
crash.log
error
If you need any information, please contact me.
Because this is a production environment, I hope for a reply as soon as possible. Thank you.