RocksDB causes Bookkeeper to crash #4558

lhotari · 2025-03-04T14:13:48Z

BUG REPORT

Describe the bug

While testing Pulsar 4.0.3 / BookKeeper 4.17.1, I noticed that BookKeeper crashes with a fatal JVM error:

    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x0000ffff7f2d5f48, pid=1, tid=237
    #
    # JRE version: OpenJDK Runtime Environment Corretto-21.0.6.7.1 (21.0.6+7) (build 21.0.6+7-LTS)
    # Java VM: OpenJDK 64-Bit Server VM Corretto-21.0.6.7.1 (21.0.6+7-LTS, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
    # Problematic frame:
    # C  [librocksdbjni14395278800560636484.so+0x2a9f48]  Java_org_rocksdb_RocksDB_getLongProperty+0x150
    #
    # No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
    #
    # An error report file with more information is saved as:
    # /tmp/hs_err_pid1.log
    #
    # If you would like to submit a bug report, please visit:
    #   https://github.com/corretto/corretto-21/issues/
    # The crash happened outside the Java Virtual Machine in native code.
    # See problematic frame for where to report the bug.

This occurs at

bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/KeyValueStorageRocksDB.java

Lines 511 to 518 in 26da346

    
    @Override
    public long count() throws IOException {
        try {
            return db.getLongProperty("rocksdb.estimate-num-keys");
        } catch (RocksDBException e) {
            throw new IOException("Error in getting records count", e);
        }
    }

since that's the only call to getLongProperty.

This gets called from stats:

bookkeeper/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/EntryLocationIndex.java

Lines 57 to 65 in 26da346

    
    this.stats = new EntryLocationIndexStats(
        stats,
        () -> {
            try {
                return locationsDb.count();
            } catch (IOException e) {
                return -1L;
            }
        });

To Reproduce

The steps to reproduce are unclear. I upgraded a local Kubernetes cluster running Apache Pulsar Helm chart version 3.9.0 / Pulsar 4.0.2 to the current master branch version of the Helm chart / Pulsar 4.0.3.

Expected behavior

Crashes shouldn't happen.

Additional context

I checked the code in KeyValueStorageRocksDB and there doesn't seem to be any safeguard that prevents calling count() after the storage is closed.
When looking at the close implementation, I noticed that the RocksDB WAL isn't flushed with fsync before closing. There seems to be another issue where a graceful shutdown isn't performed for RocksDB when running with BookKeeper.
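
A minimal sketch of what such a safeguard could look like, assuming a read-write lock around the native handle. `ClosableCounter`, `nativeCount` and `nativeClose` are hypothetical names used for illustration and are not part of KeyValueStorageRocksDB; the two callbacks stand in for `db.getLongProperty(...)` and `db.close()`.

```java
import java.io.IOException;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.LongSupplier;

/** Hypothetical wrapper (not BookKeeper code): rejects reads once closed. */
final class ClosableCounter implements AutoCloseable {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final LongSupplier nativeCount; // stand-in for db.getLongProperty(...)
    private final Runnable nativeClose;     // stand-in for db.close()
    private boolean closed;                 // guarded by lock

    ClosableCounter(LongSupplier nativeCount, Runnable nativeClose) {
        this.nativeCount = nativeCount;
        this.nativeClose = nativeClose;
    }

    long count() throws IOException {
        // Readers share the lock, so concurrent count() calls do not block each other.
        lock.readLock().lock();
        try {
            if (closed) {
                throw new IOException("Storage is already closed");
            }
            return nativeCount.getAsLong();
        } finally {
            lock.readLock().unlock();
        }
    }

    @Override
    public void close() {
        // The write lock waits for in-flight count() calls before releasing the handle.
        lock.writeLock().lock();
        try {
            if (!closed) {
                closed = true;
                nativeClose.run();
            }
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Because a late call fails with an IOException, the existing stats lambda in EntryLocationIndex would report -1 instead of reaching native code.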


lhotari · 2025-03-05T14:43:55Z

This was previously reported as #4238; the fix PR is #4243.

merlimat · 2025-03-05T16:35:54Z

> I checked the code in KeyValueStorageRocksDB and there doesn't seem to be a solution to prevent calling count() after the storage is closed.

Yes, I think we need to stop the gauge when the object is closed; otherwise it will continue to be sampled.
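
One way to picture "stopping the gauge", as a sketch only: wrap the supplier passed to the stats object so that it stops delegating once the owner is closed. `StoppableGauge` is a hypothetical helper and not part of the BookKeeper stats API.

```java
import java.util.function.Supplier;

/** Hypothetical helper (not a BookKeeper class): a gauge supplier that can be switched off. */
final class StoppableGauge implements Supplier<Long> {
    private final Supplier<Long> delegate;
    private final long valueWhenStopped;
    private volatile boolean stopped;

    StoppableGauge(Supplier<Long> delegate, long valueWhenStopped) {
        this.delegate = delegate;
        this.valueWhenStopped = valueWhenStopped;
    }

    /** Call this before closing the underlying storage. */
    void stop() {
        stopped = true;
    }

    @Override
    public Long get() {
        // After stop(), the delegate (and the storage behind it) is no longer touched.
        return stopped ? valueWhenStopped : delegate.get();
    }
}
```

A sample that is already in flight when stop() is called can still reach the delegate, so this alone does not make close() safe against a concurrent read.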

> When looking at the close implementation, I noticed that before closing, the RocksDB WAL isn't flushed with fsync. There seems to be another issue where a graceful shutdown isn't performed for RocksDb when running with BookKeeper.

We don't rely on flushing the WAL on close. We don't care about the WAL, because we already rely on the BK journal.

lhotari added the type/bug label Mar 4, 2025

lhotari closed this as completed Mar 5, 2025

lhotari mentioned this issue Mar 5, 2025

Check rocksdb closed before operating #4243

Open
