
DragonFly snapshots stopped for a week, and we lost data #4257

Open
wernermorgenstern opened this issue Dec 4, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@wernermorgenstern

wernermorgenstern commented Dec 4, 2024

Describe the bug
We have snapshots enabled every 5 minutes, saved to S3.
Here is our snapshotting configuration:

       --alsologtostderr 
       --cluster_mode=emulated
       --maxclients=1000000
       --maxmemory=0
       --dbnum=1
       --cache_mode
       --conn_use_incoming_cpu
       --slowlog_max_len=1024
       --slowlog_log_slower_than=10000
       --snapshot_cron=*/5 * * * *
       --s3_ec2_metadata 
       --dir s3://luna-dragonfly/locate/dragonfly-agent-device-0-primary/                                                                                                                                             
       --dbfilename=dragonfly-agent-device-0-{timestamp} 

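One way to catch this kind of silent failure early is to alert when the newest snapshot object in the bucket is older than the cron interval. A minimal sketch, assuming the `LastModified` timestamps have already been listed from the bucket (e.g. via boto3's `list_objects_v2`); the helper name and the slack factor are illustrative, not part of Dragonfly:

```python
from datetime import datetime, timedelta, timezone

def newest_snapshot_is_fresh(last_modified_times,
                             cron_period=timedelta(minutes=5),
                             slack=2.0, now=None):
    """Return True if the most recent snapshot object is newer than
    `slack` cron periods ago. `last_modified_times` is an iterable of
    timezone-aware datetimes (e.g. the LastModified fields from an S3
    list_objects_v2 response)."""
    now = now or datetime.now(timezone.utc)
    if not last_modified_times:
        return False  # no snapshots found at all
    newest = max(last_modified_times)
    return (now - newest) <= cron_period * slack
```

Wiring this into a periodic check (cron job or Prometheus exporter) would have flagged the gap on 11/22 rather than a week later.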
Our snapshots to S3 stopped on 11/22. The database crashed on 11/29 and then restored from the 11/22 snapshot, so we lost a week of data.

To Reproduce
Not sure how to make it reproducible.

Expected behavior
Snapshots should be taken every 5 minutes and added to the S3 bucket.

Environment (please complete the following information):

  • Kernel: Linux dragonfly-agent-device-0-primary-0 5.10.228-219.884.amzn2.aarch64 #1 SMP Wed Oct 23 17:17:31 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
  • Containerized?: EKS
  • Dragonfly Version: 1.24

Additional context
In the Grafana logs, around the time a snapshot should have been logged, I see these lines:

I20241122 11:25:02.677474    12 rdb_save.cc:1270] Channel write took 1112 ms while writing 35393/35393
I20241122 11:25:04.081243    13 rdb_save.cc:1270] Channel write took 1623 ms while writing 34972/34972
I20241122 11:25:05.775197    11 rdb_save.cc:1270] Channel write took 2414 ms while writing 32815/32815

Also, we have a primary and a replica. Do snapshots need to be taken on both the primary and the replica, or just the primary?

Regarding the crash, I don't see anything relevant in the logs.

Our EKS Resources are:

     Limits:
       cpu:     500m
       memory:  1500Mi
     Requests:
       cpu:     500m
       memory:  1500Mi
@wernermorgenstern wernermorgenstern added the bug Something isn't working label Dec 4, 2024
@romange
Collaborator

romange commented Dec 4, 2024

@wernermorgenstern I appreciate you filing the issue, but unfortunately it does not contain enough information to identify the root cause of the problem. The persistence section of the INFO response shows how recent the last save was; you may want to monitor this.
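As a hedged sketch of that monitoring idea: the persistence section of INFO in Redis-compatible servers typically reports a Unix timestamp for the last completed save (in Redis it is `rdb_last_save_time`; verify the exact field name Dragonfly returns before relying on it). A staleness check over that value might look like:

```python
import time

def last_save_is_stale(last_save_time, cron_period_s=300, slack=2.0, now=None):
    """Return True if the last completed save (a Unix timestamp from the
    persistence section of INFO) is older than `slack` cron periods --
    a signal that periodic snapshotting has silently stopped."""
    now = time.time() if now is None else now
    return (now - last_save_time) > cron_period_s * slack
```

With redis-py, the input could come from `client.info("persistence")`; alerting on this would surface a stalled snapshot loop within minutes instead of after a crash.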

@wernermorgenstern
Author

@romange, thank you.
I saw this issue:
#4244

I saw this comment in that thread:

In addition (and I do not know if it's related), there is a problem in the DNS resolve code, and that's why your periodic snapshotting stops: it just gets stuck there. For me to understand what happens, can you please run the dragonfly process with --vmodule=dns_resolve=1? It will print a bunch of logs that may help identify the issue.

Could it be a DNS issue?

@romange
Collaborator

romange commented Dec 4, 2024

It could be an issue with our code that handles DNS resolution, but I can't say for certain. If you see the snapshots stop working again, I can instruct you on how to gather more information. In any case, this comment #4244 (comment) provides advice on how to increase verbosity around the DNS resolution code.
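For reference, the verbosity flag quoted above can be added alongside the existing startup flags (this is a hedged illustration: it only increases logging around the DNS resolution path and does not change snapshot behavior):

       --alsologtostderr
       --vmodule=dns_resolve=1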
