Error entry not found in expire table + Unsupported tag #4303

enzo-pauvy · 2024-12-12T14:22:52Z

Hello there,

I'm in the process of deploying Dragonfly 1.25.4 as a replacement for Redis and I'm encountering a few errors. I was wondering if anyone could help, or whether this is expected.
All the errors occur during the first indexing phase. Once the total number of items became stable (new keys balancing expiring ones), the master changed on its own, and since then it seems more stable.
I have 3 servers, each running 1 dfly service and 1 sentinel service.

OS: Ubuntu 24.04.1 LTS
Kernel: 6.8.0-45-generic
Containerized: Systemd
Dragonfly Version: 1.25.4

I can't figure out how to reproduce the error, but here's what I have in the dragonfly logs.

Error entry not found in expire table

Many errors in db_slice.cc

E20241204 13:46:07.394174 867126 db_slice.cc:1139] Internal error, entry ...:... not found in expire table, db_index: 0, expire table size: 11417, prime table size: 114690x59194f2bafa1  dfly::(anonymous namespace)::ScanCb()

Followed by a SIGSEGV signal in DbSlice::PostUpdate

*** SIGSEGV received at time=1733320149 on cpu 8 ***
PC: @     0x59194f5be1d7  (unknown)  dfly::DbSlice::PostUpdate()

Or

F20241205 05:18:09.021003 1063853 db_slice.cc:847] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second
	...
*** SIGABRT received at time=1733375889 on cpu 9 ***
PC: @     0x72b3a2e9eb1c  (unknown)  pthread_kill

Or

F20241205 02:49:47.255090 1054284 db_slice.cc:1239] Check failed: !prime_it.is_done()
	...
*** SIGABRT received at time=1733366987 on cpu 5 ***
PC: @     0x713361a9eb1c  (unknown)  pthread_kill
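The `time=` value in each signal banner is a Unix epoch in seconds, so it can be lined up with the glog timestamps quoted above. A minimal Python sketch, assuming the log timestamps are in UTC (the helper name `signal_time_utc` is illustrative only):

```python
from datetime import datetime, timezone

def signal_time_utc(epoch_seconds: int) -> str:
    """Format an epoch value the way glog prints dates and times (yyyymmdd hh:mm:ss)."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y%m%d %H:%M:%S")

# The three time= values from the signal banners above.
for epoch in (1733320149, 1733375889, 1733366987):
    print(epoch, "->", signal_time_utc(epoch))
# 1733320149 -> 20241204 13:49:09
# 1733375889 -> 20241205 05:18:09
# 1733366987 -> 20241205 02:49:47
```

Under that assumption, the two SIGABRT banners fall in the same second as their `Check failed` lines, and the SIGSEGV comes about three minutes after the quoted `Internal error` line.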

Error Unsupported tag

Unsupported tag xxx in compact_object.cc (68, 223, 254, ...)

F20241204 15:06:41.331504 889887 compact_object.cc:1129] Unsupported tag 254
	...
*** SIGABRT received at time=1733324801 on cpu 11 ***
PC: @     0x76b222c9eb1c  (unknown)  pthread_kill
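To see which source locations fire most often, the log lines can be tallied by severity and `file:line`. This is a rough Python sketch that relies on the standard glog prefix (severity letter, date, time, thread id, `file:line]`); the sample messages are shortened copies of the lines in this thread, and the names `GLOG_RE` and `tally` are illustrative only.

```python
import re
from collections import Counter

# glog prefix: <severity><yyyymmdd> <hh:mm:ss.uuuuuu> <thread id> <file>:<line>] <message>
GLOG_RE = re.compile(
    r"(?P<sev>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<file>[\w.]+):(?P<line>\d+)\] (?P<msg>.*)"
)

def tally(lines):
    """Count log lines per (severity, file, line); lines without a glog prefix are skipped."""
    counts = Counter()
    for raw in lines:
        # search() rather than match(), so a journald prefix before the glog prefix is tolerated.
        m = GLOG_RE.search(raw)
        if m:
            counts[(m["sev"], m["file"], int(m["line"]))] += 1
    return counts

sample = [
    "E20241204 13:46:07.394174 867126 db_slice.cc:1139] Internal error, entry ...:... not found in expire table",
    "F20241205 05:18:09.021003 1063853 db_slice.cc:847] Check failed: db.expire.Insert(...)",
    "F20241205 02:49:47.255090 1054284 db_slice.cc:1239] Check failed: !prime_it.is_done()",
    "F20241204 15:06:41.331504 889887 compact_object.cc:1129] Unsupported tag 254",
    "*** SIGABRT received at time=1733324801 on cpu 11 ***",
]
counts = tally(sample)
for (sev, source, line), n in sorted(counts.items()):
    print(sev, f"{source}:{line}", n)
```

In practice the `sample` list would be replaced by the lines of a file under the configured `--log_dir`, or by `journalctl` output for the service.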

Storage
Here, I back up to S3 storage every 5 minutes, on a cron schedule, to compensate for the crashes, because the backup is not uploaded after a crash. The errors are the same without the backup.


romange · 2024-12-12T15:50:10Z

@adiholden looks like another preemption bug.

romange · 2024-12-12T15:50:43Z

@enzo-pauvy what's an indexing phase?

enzo-pauvy · 2024-12-12T16:22:59Z

I call “indexing phase” the period (here from 15:00 to ~06:00) during which keys are added to an empty dragonfly database. After 06:00, the number of elements doesn't change much and there are no more crashes.
I haven't tested much more after that yet.
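For anyone trying to approximate this kind of load, here is a rough Python sketch of the pattern: filling an empty database with expiring keys while SCAN calls run in between. It is not a known reproducer. The key names, value size, TTL and scan cadence are all assumptions, and the link to SCAN is inferred only from the `ScanCb` frame in the log. The function takes any client exposing `set(name, value, ex=...)` and `scan(cursor=..., count=...)`, the shape of the redis-py client methods; `DryRunClient` is a made-up in-memory stand-in so the sketch runs without a server.

```python
def fill_and_scan(client, n_keys, ttl_seconds, scan_every=100, scan_count=50):
    """Write n_keys expiring keys, issuing one SCAN call after every scan_every writes.

    Returns (keys written, keys returned by SCAN).
    """
    cursor = 0
    scanned = 0
    for i in range(n_keys):
        # Hypothetical key name and payload; every key gets a TTL so it lands in the expire table.
        client.set(f"repro:{i}", "x" * 64, ex=ttl_seconds)
        if (i + 1) % scan_every == 0:
            cursor, keys = client.scan(cursor=cursor, count=scan_count)
            scanned += len(keys)
    return n_keys, scanned


class DryRunClient:
    """In-memory stand-in with only set() and scan(); it does not expire keys."""

    def __init__(self):
        self.data = {}

    def set(self, name, value, ex=None):
        self.data[name] = (value, ex)
        return True

    def scan(self, cursor=0, count=10):
        keys = sorted(self.data)
        chunk = keys[cursor:cursor + count]
        next_cursor = cursor + len(chunk)
        # Like the real command, a returned cursor of 0 means the iteration is complete.
        return (0 if next_cursor >= len(keys) else next_cursor), chunk


written, scanned = fill_and_scan(DryRunClient(), n_keys=1000, ttl_seconds=60)
print(written, scanned)
```

Against a real instance, a `redis.Redis(...)` client configured with the host, port, credentials and TLS settings of the deployment would be passed in place of `DryRunClient()`.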

romange · 2024-12-12T17:08:42Z

Is there a way for us to host the Dragonfly server so we could debug this? Are you on a public cloud? Could you DM me on Discord, please?

romange · 2024-12-13T06:30:16Z

@enzo-pauvy how do you run dragonfly? Can you please provide all its runtime flags?

enzo-pauvy · 2024-12-13T07:28:09Z

Here is my flags.txt file. I am running Dragonfly locally with systemd.

# https://www.dragonflydb.io/docs/managing-dragonfly/flags
--bind=0.0.0.0
--port=<dfly_port>
--log_dir=/var/log/dragonfly
--aclfile=/var/lib/dragonfly/preprod.acl
--requirepass=<pass>
--masteruser=<user>
--masterauth=<pass>
--maxmemory=4gb
--cache_mode=True
--dir=s3://<bucket>/...
--dbfilename=dfly_...
--snapshot_cron=*/5 * * * *

# https://github.com/dragonflydb/dragonfly/pull/3615
--replica_announce_ip=<dns>

# https://www.dragonflydb.io/docs/managing-dragonfly/using-tls
--tls
--tls_replication
--tls_key_file=<key>
--tls_cert_file=<cert>

And the sentinel.conf file

port <sentinel_port>

requirepass "<pass>"

sentinel deny-scripts-reconfig yes

sentinel monitor default <dns> <dfly_port> 2

sentinel down-after-milliseconds default 5000
sentinel parallel-syncs default 2
sentinel auth-user default <user>
sentinel auth-pass default <pass>
sentinel failover-timeout default 6000

sentinel resolve-hostnames yes
sentinel announce-hostnames yes

tls-replication yes
tls-auth-clients no
tls-port <sentinel_tls_port>
tls-cert-file "<cert>"
tls-key-file "<key>"
tls-ca-cert-file "<cert>"

# Generated by CONFIG REWRITE
...

romange · 2024-12-13T08:07:07Z

Is it possible for you to test with cache_mode=False and tell us if it still crashes?

enzo-pauvy · 2024-12-13T11:53:20Z

It looks like I still have the same errors with --cache_mode=False.
Several internal errors, followed by a SIGSEGV signal.

dragonfly_v1.25.4[3737700]: E20241213 09:51:46.731494 3737706 db_slice.cc:1139] Internal error, entry ...:... not found in expire table, db_index: 0, expire table size: 5646, prime table size: 57510x5fcf9282cfa1  dfly::(anonymous namespace)::ScanCb()
dragonfly_v1.25.4[3737700]: 0x5fcf9282d95a  dfly::(anonymous namespace)::OpScan()
dragonfly_v1.25.4[3737700]: 0x5fcf9282f52c  std::_Function_handler<>::_M_invoke()
dragonfly_v1.25.4[3737700]: 0x5fcf930c1d65  util::fb2::FiberQueue::Run()
dragonfly_v1.25.4[3737700]: *** SIGSEGV received at time=1734084495 on cpu 3 ***
dragonfly_v1.25.4[3737700]: PC: @     0x5fcf92b301d7  (unknown)  dfly::DbSlice::PostUpdate()
systemd[1]: dragonfly_preprod.service: Main process exited, code=dumped, status=11/SEGV
systemd[1]: dragonfly_preprod.service: Failed with result 'core-dump'.
systemd[1]: dragonfly_preprod.service: Consumed 21min 20.299s CPU time, 275.1M memory peak, 0B memory swap peak.
systemd[1]: dragonfly_preprod.service: Scheduled restart job, restart counter is at 1.
systemd[1]: Started dragonfly_preprod.service - Dragonfly Service.

enzo-pauvy added the bug label Dec 12, 2024

romange mentioned this issue Dec 13, 2024

Missed replication items in cache_mode full sync #4306

Closed

adiholden assigned chakaz Dec 19, 2024

romange assigned romange and unassigned chakaz Dec 22, 2024
