Error entry not found in expire table + Unsupported tag #4303

Open
enzo-pauvy opened this issue Dec 12, 2024 · 8 comments

@enzo-pauvy

Hello there,

I'm in the process of deploying Dragonfly 1.25.4 as a replacement for Redis and I'm encountering a few errors. I was wondering if anyone could help, or if this is expected.
All the errors occur during the first indexing phase. Once the total number of items became stable (a balance between expiring and new keys), the master changed on its own, and since then things seem more stable.
I have 3 servers, each running 1 dfly service and 1 sentinel service.

  • OS: Ubuntu 24.04.1 LTS
  • Kernel: 6.8.0-45-generic
  • Containerized: Systemd
  • Dragonfly Version: 1.25.4

I can't figure out how to reproduce the error, but here's what I have in the dragonfly logs.

Error entry not found in expire table

  • Many errors in db_slice.cc
E20241204 13:46:07.394174 867126 db_slice.cc:1139] Internal error, entry ...:... not found in expire table, db_index: 0, expire table size: 11417, prime table size: 114690x59194f2bafa1  dfly::(anonymous namespace)::ScanCb()
  • Followed by a SIGSEGV signal in DbSlice::PostUpdate
*** SIGSEGV received at time=1733320149 on cpu 8 ***
PC: @     0x59194f5be1d7  (unknown)  dfly::DbSlice::PostUpdate()
  • Or
F20241205 05:18:09.021003 1063853 db_slice.cc:847] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second
	...
*** SIGABRT received at time=1733375889 on cpu 9 ***
PC: @     0x72b3a2e9eb1c  (unknown)  pthread_kill
  • Or
F20241205 02:49:47.255090 1054284 db_slice.cc:1239] Check failed: !prime_it.is_done()
	...
*** SIGABRT received at time=1733366987 on cpu 5 ***
PC: @     0x713361a9eb1c  (unknown)  pthread_kill

Error Unsupported tag

  • Unsupported tag xxx in compact_object.cc (68, 223, 254, ...)
F20241204 15:06:41.331504 889887 compact_object.cc:1129] Unsupported tag 254
	...
*** SIGABRT received at time=1733324801 on cpu 11 ***
PC: @     0x76b222c9eb1c  (unknown)  pthread_kill

Storage
Here, I snapshot to S3 storage every 5 minutes to compensate for the crashes. I schedule it with cron because the backup is not uploaded after a crash. The errors are the same without the backup.
Screenshot from 2024-12-11 10-03-40
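
For anyone triaging similar logs, here is a minimal sketch of how the lines above could be grouped by call site. It assumes the standard glog prefix (severity letter, date, time, thread id, `file:line]`); the names `GLOG_RE` and `tally_sites` are made up for this example and are not part of Dragonfly.

```python
import re
from collections import Counter

# Assumed glog prefix: severity letter, yyyymmdd, hh:mm:ss.uuuuuu, thread id, file:line]
GLOG_RE = re.compile(
    r"(?P<sev>[IWEF])(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<file>[\w./-]+):(?P<line>\d+)\] (?P<msg>.*)"
)


def tally_sites(lines):
    """Count log lines per (severity, source file, source line)."""
    counts = Counter()
    for raw in lines:
        # search() rather than match(), so a journald prefix is tolerated
        m = GLOG_RE.search(raw)
        if m:
            counts[(m["sev"], m["file"], int(m["line"]))] += 1
    return counts


# Lines copied from the report above; the signal line has no glog prefix
# and is skipped by the tally.
sample = [
    "E20241204 13:46:07.394174 867126 db_slice.cc:1139] Internal error, entry ...:... not found in expire table",
    "F20241205 05:18:09.021003 1063853 db_slice.cc:847] Check failed: db.expire.Insert(main_it->first.AsRef(), ExpirePeriod(delta)).second",
    "F20241205 02:49:47.255090 1054284 db_slice.cc:1239] Check failed: !prime_it.is_done()",
    "F20241204 15:06:41.331504 889887 compact_object.cc:1129] Unsupported tag 254",
    "*** SIGABRT received at time=1733324801 on cpu 11 ***",
]

for site, n in sorted(tally_sites(sample).items()):
    print(site, n)
```

Run over a full log directory, a tally like this would show whether the `db_slice.cc` and `compact_object.cc` failures cluster in the same time window.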

@enzo-pauvy enzo-pauvy added the bug Something isn't working label Dec 12, 2024
@romange
Collaborator

romange commented Dec 12, 2024

@adiholden looks like another preemption bug.

@romange
Collaborator

romange commented Dec 12, 2024

@enzo-pauvy what's an indexing phase?

@enzo-pauvy
Author

By “indexing phase” I mean the period (here from 15:00 to ~06:00) during which keys are added to an empty Dragonfly database. After 06:00, the number of elements doesn't change much and there are no more crashes.
I haven't tested much more after that yet.
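
A workload of this shape (expiring keys written into an empty database while SCAN runs, since `ScanCb`/`OpScan` appear in the stack trace) could be approximated with a sketch like the one below. It is not a confirmed reproducer: the function name `fill_and_scan`, the `repro:` key prefix, the TTL range, and the connection settings are all assumptions for illustration. It uses the redis-py client's `set(..., ex=...)` and `scan(...)` calls.

```python
import random


def fill_and_scan(client, n_keys=10_000, ttl_range=(1, 30), scan_every=100, seed=0):
    """Insert expiring keys into `client` and interleave full SCAN passes.

    `client` only needs redis-py style `set(name, value, ex=...)` and
    `scan(cursor=..., match=..., count=...)` methods.
    Returns the total number of keys returned by all SCAN calls.
    """
    rng = random.Random(seed)
    seen = 0
    for i in range(n_keys):
        # Every key gets a short, random TTL so the expire table stays busy.
        client.set(f"repro:{i}", "x", ex=rng.randint(*ttl_range))
        if i % scan_every == 0:
            cursor = 0
            while True:
                cursor, keys = client.scan(cursor=cursor, match="repro:*", count=500)
                seen += len(keys)
                if cursor == 0:  # a zero cursor marks the end of a SCAN pass
                    break
    return seen


def main():
    import redis  # third-party package: pip install redis

    # Placeholder connection settings; TLS and auth would be needed
    # for a deployment configured like the one in this issue.
    client = redis.Redis(host="localhost", port=6379)
    print(fill_and_scan(client))
```

Calling `main()` against a test instance, ideally from several processes at once, would be the way to try it; the single-threaded loop alone may not be enough to hit the crash.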

@romange
Collaborator

romange commented Dec 12, 2024

Is there a way for us to host the Dragonfly server so we could debug this? Are you on a public cloud? Could you DM me on Discord, please?

@romange
Collaborator

romange commented Dec 13, 2024

@enzo-pauvy how do you run dragonfly? Can you please provide all its runtime flags?

@enzo-pauvy
Author

enzo-pauvy commented Dec 13, 2024

Here is my flags.txt file. I am running Dragonfly locally with systemd.

# https://www.dragonflydb.io/docs/managing-dragonfly/flags
--bind=0.0.0.0
--port=<dfly_port>
--log_dir=/var/log/dragonfly
--aclfile=/var/lib/dragonfly/preprod.acl
--requirepass=<pass>
--masteruser=<user>
--masterauth=<pass>
--maxmemory=4gb
--cache_mode=True
--dir=s3://<bucket>/...
--dbfilename=dfly_...
--snapshot_cron=*/5 * * * *

# https://github.com/dragonflydb/dragonfly/pull/3615
--replica_announce_ip=<dns>

# https://www.dragonflydb.io/docs/managing-dragonfly/using-tls
--tls
--tls_replication
--tls_key_file=<key>
--tls_cert_file=<cert>

And the sentinel.conf file

port <sentinel_port>

requirepass "<pass>"

sentinel deny-scripts-reconfig yes

sentinel monitor default <dns> <dfly_port> 2

sentinel down-after-milliseconds default 5000
sentinel parallel-syncs default 2
sentinel auth-user default <user>
sentinel auth-pass default <pass>
sentinel failover-timeout default 6000

sentinel resolve-hostnames yes
sentinel announce-hostnames yes

tls-replication yes
tls-auth-clients no
tls-port <sentinel_tls_port>
tls-cert-file "<cert>"
tls-key-file "<key>"
tls-ca-cert-file "<cert>"

# Generated by CONFIG REWRITE
...

@romange
Collaborator

romange commented Dec 13, 2024

Is it possible for you to test with cache_mode=False and tell us if it still crashes?

@enzo-pauvy
Author

It looks like I still get the same errors with --cache_mode=False:
several internal errors, followed by a SIGSEGV signal.

dragonfly_v1.25.4[3737700]: E20241213 09:51:46.731494 3737706 db_slice.cc:1139] Internal error, entry ...:... not found in expire table, db_index: 0, expire table size: 5646, prime table size: 57510x5fcf9282cfa1  dfly::(anonymous namespace)::ScanCb()
dragonfly_v1.25.4[3737700]: 0x5fcf9282d95a  dfly::(anonymous namespace)::OpScan()
dragonfly_v1.25.4[3737700]: 0x5fcf9282f52c  std::_Function_handler<>::_M_invoke()
dragonfly_v1.25.4[3737700]: 0x5fcf930c1d65  util::fb2::FiberQueue::Run()
dragonfly_v1.25.4[3737700]: *** SIGSEGV received at time=1734084495 on cpu 3 ***
dragonfly_v1.25.4[3737700]: PC: @     0x5fcf92b301d7  (unknown)  dfly::DbSlice::PostUpdate()
systemd[1]: dragonfly_preprod.service: Main process exited, code=dumped, status=11/SEGV
systemd[1]: dragonfly_preprod.service: Failed with result 'core-dump'.
systemd[1]: dragonfly_preprod.service: Consumed 21min 20.299s CPU time, 275.1M memory peak, 0B memory swap peak.
systemd[1]: dragonfly_preprod.service: Scheduled restart job, restart counter is at 1.
systemd[1]: Started dragonfly_preprod.service - Dragonfly Service.
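
To line up the `time=` epoch in the signal banners with the glog timestamps, a small conversion sketch may help. The helper name `parse_signal` is hypothetical, and the banner layout is assumed from the lines quoted in this thread only.

```python
import re
from datetime import datetime, timezone

# Banner layout as it appears in the logs quoted in this issue
SIGNAL_RE = re.compile(r"\*\*\* (SIG[A-Z]+) received at time=(\d+) on cpu (\d+) \*\*\*")


def parse_signal(line):
    """Return (signal name, UTC datetime, cpu) for a signal banner, else None."""
    m = SIGNAL_RE.search(line)
    if m is None:
        return None
    name, epoch, cpu = m.groups()
    return name, datetime.fromtimestamp(int(epoch), tz=timezone.utc), int(cpu)


banners = [
    "*** SIGABRT received at time=1733375889 on cpu 9 ***",
    "dragonfly_v1.25.4[3737700]: *** SIGSEGV received at time=1734084495 on cpu 3 ***",
]
for banner in banners:
    name, when, cpu = parse_signal(banner)
    print(name, when.isoformat(), "cpu", cpu)
```

The first banner converts to 2024-12-05 05:18:09 UTC, the same second as the `db_slice.cc:847` check failure, which suggests the host logs in UTC. Under that assumption, the SIGSEGV in the last log (10:08:15 UTC) came roughly 16 minutes after the quoted "not found in expire table" error at 09:51:46.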

@romange romange assigned romange and unassigned chakaz Dec 22, 2024