
DragonFly snapshots stopped for a week, and we lost data #4257

Open
wernermorgenstern opened this issue Dec 4, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@wernermorgenstern

wernermorgenstern commented Dec 4, 2024

Describe the bug
We have snapshots enabled every 5 minutes, saved to S3.
Here is our snapshotting configuration:

       --alsologtostderr 
       --cluster_mode=emulated
       --maxclients=1000000
       --maxmemory=0
       --dbnum=1
       --cache_mode
       --conn_use_incoming_cpu
       --slowlog_max_len=1024
       --slowlog_log_slower_than=10000
       --snapshot_cron=*/5 * * * *
       --s3_ec2_metadata 
       --dir s3://luna-dragonfly/locate/dragonfly-agent-device-0-primary/                                                                                                                                             
       --dbfilename=dragonfly-agent-device-0-{timestamp} 

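One way to catch this kind of silent failure early is to alert when the newest snapshot object in the bucket is older than the cron interval. A minimal sketch, assuming the `LastModified` timestamps have already been listed from the bucket (e.g. via boto3's `list_objects_v2`); the helper name and the slack factor are illustrative, not part of Dragonfly:

```python
from datetime import datetime, timedelta, timezone

def newest_snapshot_is_fresh(last_modified_times,
                             cron_period=timedelta(minutes=5),
                             slack=2.0, now=None):
    """Return True if the most recent snapshot object is newer than
    `slack` cron periods ago. `last_modified_times` is an iterable of
    timezone-aware datetimes (e.g. the LastModified fields from an S3
    list_objects_v2 response)."""
    now = now or datetime.now(timezone.utc)
    if not last_modified_times:
        return False  # no snapshots found at all
    newest = max(last_modified_times)
    return (now - newest) <= cron_period * slack
```

Wiring this into a periodic check (cron job or Prometheus exporter) would have flagged the gap on 11/22 rather than a week later.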
Our snapshots to S3 stopped on 11/22. The database crashed on 11/29 and then restored from the 11/22 snapshot, so we lost a week of data.

To Reproduce
Not sure how to make it reproducible.

Expected behavior
Snapshots should be taken every 5 minutes and added to the S3 bucket.

Environment (please complete the following information):

  • Kernel: Linux dragonfly-agent-device-0-primary-0 5.10.228-219.884.amzn2.aarch64 #1 SMP Wed Oct 23 17:17:31 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
  • Containerized?: EKS
  • Dragonfly Version: 1.24

Additional context
In the Grafana logs, around the time a snapshot should have been logged, I see these lines:

I20241122 11:25:02.677474    12 rdb_save.cc:1270] Channel write took 1112 ms while writing 35393/35393
I20241122 11:25:04.081243    13 rdb_save.cc:1270] Channel write took 1623 ms while writing 34972/34972
I20241122 11:25:05.775197    11 rdb_save.cc:1270] Channel write took 2414 ms while writing 32815/32815

Also, we have a primary and a replica. Do snapshots need to be taken on both the primary and the replica, or just the primary?

Regarding the crash, I don't see anything relevant in the logs.

Our EKS Resources are:

     Limits:
       cpu:     500m
       memory:  1500Mi
     Requests:
       cpu:     500m
       memory:  1500Mi
@wernermorgenstern wernermorgenstern added the bug Something isn't working label Dec 4, 2024
@romange
Collaborator

romange commented Dec 4, 2024

@wernermorgenstern I appreciate you filing the issue, but unfortunately it does not contain enough information to identify the root cause of the problem. The persistence section of the INFO response shows how recent the last save was; you may want to monitor this.
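As a hedged sketch of that monitoring idea: the persistence section of INFO in Redis-compatible servers typically reports a Unix timestamp for the last completed save (in Redis it is `rdb_last_save_time`; verify the exact field name Dragonfly returns before relying on it). A staleness check over that value might look like:

```python
import time

def last_save_is_stale(last_save_time, cron_period_s=300, slack=2.0, now=None):
    """Return True if the last completed save (a Unix timestamp from the
    persistence section of INFO) is older than `slack` cron periods --
    a signal that periodic snapshotting has silently stopped."""
    now = time.time() if now is None else now
    return (now - last_save_time) > cron_period_s * slack
```

With redis-py, the input could come from `client.info("persistence")`; alerting on this would surface a stalled snapshot loop within minutes instead of after a crash.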

@wernermorgenstern
Author

@romange, thank you.
I saw this issue:
#4244

I saw this comment in that thread:

In addition (and I do not know if it's related), there is a problem in the DNS resolve code, and that's why your periodic snapshotting stops: it just gets stuck there. For me to understand what happens, can you please run the dragonfly process with --vmodule=dns_resolve=1? It will print a bunch of logs that may help identify the issue.

Could it be a DNS issue?

@romange
Collaborator

romange commented Dec 4, 2024

It could be an issue with our code that handles DNS resolution, but I can't say for certain. If you see the snapshots stop working again, I can instruct you on how to gather more information. In any case, this comment #4244 (comment) provides advice on how to increase verbosity around the DNS resolution code.
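For reference, the verbosity flag quoted above can be added alongside the existing startup flags (this is a hedged illustration: it only increases logging around the DNS resolution path and does not change snapshot behavior):

       --alsologtostderr
       --vmodule=dns_resolve=1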
