Bug(tcmalloc):ALL nodes coredump when doing Bulkload #2006
Comments
The phenomenon described above is one of the issues triggered by the bulk load download stage.
Phenomenon
Reason
After execute the
Case 1 (with Phenomenon 1)
Operation: we restart one node. The ballot increases, function
**But at 15:17:56.873, replica 88.5 is still downloading an sst file, which causes the core dump.**
D2024-05-20 15:17:56.873 (1716189476873362400 146626) replica.default7.04010007000000ca: block_service_manager.cpp:181:download_file(): download file(/home/work/ssd2/pegasus/c3tst-performance2/replica/reps/88.5.pegasus/bulk_load/33.sst) succeed, file_size = 65930882, md5 = 7a4d3da9250f52b4e31095c1d7042c2f
D2024-05-20 15:17:58.348 (1716189478348326864 146626) replica.default7.04010007000000ca: replica_bulk_loader.cpp:479:download_sst_file(): [[email protected]:27101] download_sst_file remote_dir /user/s_pegasus/lpfsplit/c3tst-performance2/ingest_p32_10G/5 ,local_dir /home/work/ssd2/pegasus/c3tst-performance2/replica/reps/88.5.pegasus/bulk_load,f_meta.name 33.sst
Case 2 (with Phenomenon 2)
Operation: app ingest_p4_10G, partition 1; the bulk load files 88.sst, 89.sst, 90.sst, and 93.sst are missing.
Primary replica
The primary replica failed to download a file (88.sst) and stopped downloading all sst files.
But meta says to continue downloading.
The primary replica reports download progress to meta.
Meta says to stop downloading and to clear _metadata.files. However, not all download tasks were terminated successfully.
At 14:28:29, a download_sst_file task still exists and accesses _metadata.files, causing the core dump.
Secondary replica
The secondary receives the primary replica's message to cancel the bulk load task and clears _metadata.files, but it does not terminate all download tasks, which causes a core dump. Other replicas core dump for the same reason, so many replica servers core dump.
tcmalloc reports large alloc
_metadata.files is cleared, so the length of f_meta.name in the download_sst_file function is very large.
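Both cases come down to download tasks that keep running after the shared file list has been cleared. One possible mitigation, shown here as a minimal single-threaded sketch and not as the actual Pegasus code, is a cancellation flag that each pending task checks before it touches the list. The names `cancellable_loader_sketch`, `make_task`, and `count_downloads_after_clear` are hypothetical; only `_metadata.files` and the idea of clearing bulk load state come from the report above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the loader; `files` plays the role of _metadata.files.
struct cancellable_loader_sketch {
    std::vector<std::string> files;
    std::shared_ptr<std::atomic<bool>> cancelled =
        std::make_shared<std::atomic<bool>>(false);

    // Builds a deferred task for files[index]. The task shares the
    // cancellation flag and checks it before reading the file list.
    std::function<bool()> make_task(std::size_t index) {
        assert(index < files.size());
        auto flag = cancelled;
        return [this, flag, index]() {
            if (flag->load()) {
                return false; // cancelled: never touch the cleared list
            }
            const std::string &name = files[index];
            return !name.empty(); // pretend the download succeeded
        };
    }

    // Marks all pending tasks as cancelled, then clears the list.
    void clear_states() {
        cancelled->store(true);
        files.clear();
    }
};

// Runs one task before the clear and one after it; returns how many downloaded.
int count_downloads_after_clear() {
    cancellable_loader_sketch loader;
    loader.files = {"88.sst", "89.sst"};
    std::vector<std::function<bool()>> pending;
    for (std::size_t i = 0; i < loader.files.size(); ++i) {
        pending.push_back(loader.make_task(i));
    }
    bool first = pending[0]();  // runs while the list is still intact
    loader.clear_states();
    bool second = pending[1](); // skipped because the flag is set
    return (first ? 1 : 0) + (second ? 1 : 0);
}
```

In a multithreaded server the flag check and the list access would also need to happen under the same lock or on the same serialized thread; the sketch only shows the ordering.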
download_sst_file (apache#2006) The replica_bulk_loader::clear_bulk_load_states function cannot cancel sst download tasks that are already running, and those tasks access `_metadata.files` by reference. But clear_bulk_load_states clears `_metadata.files`, which causes a core dump. I use a copy of `_metadata.files` to solve this problem.
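The copy-based fix described in the commit message can be illustrated with a small sketch. It is not the actual patch: `file_meta`, `bulk_loader_sketch`, `make_download_task`, and `run_after_clear` are hypothetical names, and the sketch copies a single entry, whereas the real change may copy the whole list. The point is only that a task owning its own copy stays valid after the shared list is cleared.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for one entry of _metadata.files.
struct file_meta {
    std::string name;
    std::int64_t size;
};

struct bulk_loader_sketch {
    std::vector<file_meta> files; // plays the role of _metadata.files

    // The closure captures a copy of the entry, not a reference into
    // `files`, so clearing `files` later cannot invalidate it.
    std::function<std::string()> make_download_task(std::size_t index) const {
        assert(index < files.size());
        file_meta copy = files[index];
        return [copy]() { return "download " + copy.name; };
    }

    // Stand-in for clearing the bulk load state.
    void clear_states() { files.clear(); }
};

// Creates a task, clears the shared list, then runs the task.
std::string run_after_clear() {
    bulk_loader_sketch loader;
    loader.files = {{"33.sst", 65930882}};
    auto task = loader.make_download_task(0);
    loader.clear_states(); // the task still owns its copy of the entry
    return task();
}
```

Capturing `const file_meta &` instead would leave the closure pointing at freed vector storage after `clear_states()`, which matches the failure mode reported above.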
Bug Report
What did you do?
Doing a bulk load (download sst file stage) together with any action that requires restarting ONE node may cause ALL nodes to core dump.
What did you see?
There are three kinds of core dump on different nodes:
Type one:
Type two:
Type three:
v2.4