RocksDB causes Bookkeeper to crash #4558

Closed
lhotari opened this issue Mar 4, 2025 · 2 comments
Member

lhotari commented Mar 4, 2025

BUG REPORT

Describe the bug

While testing Pulsar 4.0.3 / BookKeeper 4.17.1, I noticed that BookKeeper crashes with the following fatal error:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000ffff7f2d5f48, pid=1, tid=237
#
# JRE version: OpenJDK Runtime Environment Corretto-21.0.6.7.1 (21.0.6+7) (build 21.0.6+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM Corretto-21.0.6.7.1 (21.0.6+7-LTS, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# C  [librocksdbjni14395278800560636484.so+0x2a9f48]  Java_org_rocksdb_RocksDB_getLongProperty+0x150
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid1.log
#
# If you would like to submit a bug report, please visit:
#   https://github.com/corretto/corretto-21/issues/
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

This occurs at

    @Override
    public long count() throws IOException {
        try {
            return db.getLongProperty("rocksdb.estimate-num-keys");
        } catch (RocksDBException e) {
            throw new IOException("Error in getting records count", e);
        }
    }

since that's the only call to getLongProperty.

This gets called from stats:

    this.stats = new EntryLocationIndexStats(
        stats,
        () -> {
            try {
                return locationsDb.count();
            } catch (IOException e) {
                return -1L;
            }
        });

To Reproduce

Steps are unclear. I upgraded a local Kubernetes cluster running Apache Pulsar Helm chart version 3.9.0 / Pulsar 4.0.2 to the current master branch version of the Helm chart / Pulsar 4.0.3.

Expected behavior

Crashes shouldn't happen.

Additional context

I checked the code in KeyValueStorageRocksDB and there doesn't seem to be a solution to prevent calling count() after the storage is closed.
When looking at the close implementation, I noticed that before closing, the RocksDB WAL isn't flushed with fsync. There seems to be another issue where a graceful shutdown isn't performed for RocksDB when running with BookKeeper.
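
One way to prevent a call to count() after close is sketched below. This is only an illustration under assumptions, not the change made in any BookKeeper PR: `NativeStore` and `GuardedStore` are hypothetical names, and `NativeStore` stands in for the native RocksDB handle. A read lock covers count(), a write lock covers close(), and a `closed` flag turns a late call into an IOException, which the stats gauge shown earlier already maps to -1.

```java
import java.io.IOException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for the native RocksDB handle; not a BookKeeper or RocksDB type. */
interface NativeStore {
    long estimateNumKeys() throws IOException;

    void close();
}

/** Sketch of a wrapper that refuses to touch the native handle once it is closed. */
final class GuardedStore implements AutoCloseable {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final NativeStore store;
    private boolean closed; // guarded by lock

    GuardedStore(NativeStore store) {
        this.store = store;
    }

    /** Readers share the read lock, so concurrent count() calls do not block each other. */
    long count() throws IOException {
        lock.readLock().lock();
        try {
            if (closed) {
                throw new IOException("Storage is closed");
            }
            return store.estimateNumKeys();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** The write lock waits for in-flight readers before releasing the native handle. */
    @Override
    public void close() {
        lock.writeLock().lock();
        try {
            if (!closed) {
                closed = true;
                store.close();
            }
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

The read/write lock is one possible choice; it keeps concurrent readers cheap while ensuring close() cannot free the native handle while a count() call is still inside it.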

Member Author

lhotari commented Mar 5, 2025

This was previously reported as #4238; the fix PR is #4243.

Contributor

merlimat commented Mar 5, 2025

> I checked the code in KeyValueStorageRocksDB and there doesn't seem to be a solution to prevent calling count() after the storage is closed.

Yes, I think we need to stop the gauge when the object is closed; otherwise it will continue to be sampled.
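
A rough sketch of what stopping the gauge could look like, assuming a plain `LongSupplier`-based gauge. `StoppableGauge` is a hypothetical name and BookKeeper's stats API is not modeled here. After stop(), samples return -1, the same fallback value the existing gauge lambda uses on IOException.

```java
import java.util.function.LongSupplier;

/** Hypothetical gauge wrapper; it does not use BookKeeper's stats classes. */
final class StoppableGauge implements LongSupplier {
    private static final long STOPPED = -1L;

    // volatile so a stop() from the closing thread is visible to the stats thread
    private volatile LongSupplier delegate;

    StoppableGauge(LongSupplier delegate) {
        this.delegate = delegate;
    }

    /** Drops the reference to the sampled object; later samples return -1. */
    void stop() {
        delegate = null;
    }

    @Override
    public long getAsLong() {
        LongSupplier current = delegate; // read once to avoid a null race
        return current == null ? STOPPED : current.getAsLong();
    }
}
```

Note that this alone does not cover a sample that is already in flight when stop() is called, so it would still need to be combined with some guard inside the storage itself.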

> When looking at the close implementation, I noticed that before closing, the RocksDB WAL isn't flushed with fsync. There seems to be another issue where a graceful shutdown isn't performed for RocksDB when running with BookKeeper.

We don't rely on flushing the WAL on close. We don't care about the WAL, because we already rely on the BK journal.
