Storages: New serialization/deserialization for DataTypeString #9608

Open
wants to merge 7 commits into
base: master
from

Conversation

JinheLin
Contributor

@JinheLin JinheLin commented Nov 13, 2024

What problem does this PR solve?

Issue Number: close #9673

Problem Summary:

  • Currently, strings are serialized and deserialized in a size-prefix format. This can be very slow for short strings, e.g. those shorter than 32 bytes.
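
To make the problem concrete, here is a minimal sketch of a size-prefix layout. It is not the actual TiFlash code: the function names are hypothetical, and the fixed 8-byte size field is an assumption chosen for simplicity (the real encoding of the size may differ). The point is that every row needs its own size read and its own small copy, so per-row overhead dominates when strings are short.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: write every string as [size][bytes] into one buffer.
// The 8-byte fixed-width size is an assumption made for this sketch.
std::vector<char> serializeSizePrefix(const std::vector<std::string> & strs)
{
    std::vector<char> out;
    for (const auto & s : strs)
    {
        const uint64_t size = s.size();
        const char * p = reinterpret_cast<const char *>(&size);
        out.insert(out.end(), p, p + sizeof(size));
        out.insert(out.end(), s.begin(), s.end());
    }
    return out;
}

// Hypothetical helper: the reader must decode one size, then copy one
// (possibly tiny) string, then repeat. Nothing can be done in bulk because
// the position of row i+1 is unknown until row i has been decoded.
std::vector<std::string> deserializeSizePrefix(const std::vector<char> & buf)
{
    std::vector<std::string> strs;
    size_t pos = 0;
    while (pos < buf.size())
    {
        uint64_t size = 0;
        std::memcpy(&size, buf.data() + pos, sizeof(size));
        pos += sizeof(size);
        strs.emplace_back(buf.data() + pos, size);
        pos += size;
    }
    return strs;
}
```

For a column of 1-byte strings, this loop performs one size decode and one 1-byte copy per row, which is the case where the microbenchmark below shows the largest gap.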

What is changed and how it works?

  • This PR adds a new storage format for strings, which separates sizes and chars into different streams so that string objects can be deserialized in batches.

  • The new format is not enabled in this PR.

    • We can change the value of DefaultSerdesFormat to enable it.
  • For backward compatibility, this PR adds a new type name for the new format; the old name is kept for the old format. Check DataTypeString::getName() for details.

    • In storage, both ColumnFileTiny and DMFile store the data type name in their metadata and look up the data type by name when restoring, so legacy string data remains readable.
    • In the exchange operator, the logic is similar: the sender adds the type name to packets and the receiver resolves the data type by this name.
      • If the exchanged data contains no string columns, nothing changes.
      • A receiver running the new code recognizes the old name, so it can deserialize data as usual.
      • But a receiver running the old code cannot recognize the new name, so it throws an exception when it receives an unknown type name: ERROR 1105 (HY000): other error for mpp stream: Code: 49, e.displayText() = DB::Exception: CHBlockChunkCodecV1 schema mismatch at column 1, expected String, actual StringV1, e.what() = DB::Exception.
  • Microbenchmark

    • sizeX means the string size is about X bytes. To simulate variable-length strings, the size is not strictly equal to X.
    • none/lz4/lw are the compression algorithms.
      • none means no compression.
      • lz4 means the lz4 compression algorithm is used for both sizes and chars.
      • lw means a lightweight compression algorithm is used for sizes, while lz4 is still used for chars.
--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
serialize/size-prefix_size1_none          793468 ns       789944 ns          858
serialize/size-prefix_size2_none          914658 ns       910472 ns          752
serialize/size-prefix_size4_none          905957 ns       901785 ns          786
serialize/size-prefix_size8_none          869287 ns       865437 ns          829
serialize/size-prefix_size16_none         834400 ns       830599 ns          854
serialize/size-prefix_size32_none         878731 ns       874594 ns          794
serialize/size-prefix_size64_none         952452 ns       947889 ns          746
serialize/size-prefix_size128_none       1410488 ns      1404789 ns          507
serialize/size-prefix_size256_none       2802038 ns      2789040 ns          240
serialize/size-prefix_size512_none       5525736 ns      5505273 ns          120
serialize/size-prefix_size1024_none     11839227 ns     11782614 ns           59
serialize/seperate_size1_none              68823 ns        68513 ns         9920
serialize/seperate_size2_none              75622 ns        75265 ns         9072
serialize/seperate_size4_none              90008 ns        89602 ns         7714
serialize/seperate_size8_none             115553 ns       115003 ns         6189
serialize/seperate_size16_none            154620 ns       153878 ns         4375
serialize/seperate_size32_none            233807 ns       232861 ns         2869
serialize/seperate_size64_none            407312 ns       405665 ns         1726
serialize/seperate_size128_none           900914 ns       896124 ns          710
serialize/seperate_size256_none          2383541 ns      2372055 ns          286
serialize/seperate_size512_none          5273792 ns      5248727 ns          128
serialize/seperate_size1024_none        10923320 ns     10876060 ns           63
deserialize/size-prefix_size1_none        601878 ns       599202 ns         1142
deserialize/size-prefix_size2_none        402256 ns       400372 ns         1731
deserialize/size-prefix_size4_none        422841 ns       420891 ns         1697
deserialize/size-prefix_size8_none        445656 ns       443519 ns         1652
deserialize/size-prefix_size16_none       715033 ns       711734 ns         1000
deserialize/size-prefix_size32_none       894119 ns       889896 ns          782
deserialize/size-prefix_size64_none      2091559 ns      2081000 ns          321
deserialize/size-prefix_size128_none     4547851 ns      4524581 ns          156
deserialize/size-prefix_size256_none     9743899 ns      9695204 ns           76
deserialize/size-prefix_size512_none    19487930 ns     19391362 ns           36
deserialize/size-prefix_size1024_none   40468350 ns     40279164 ns           17
deserialize/seperate_size1_none            75242 ns        74891 ns         9284
deserialize/seperate_size2_none            81858 ns        81460 ns         9050
deserialize/seperate_size4_none            88652 ns        88196 ns         8318
deserialize/seperate_size8_none           133628 ns       133053 ns         5456
deserialize/seperate_size16_none          168714 ns       167897 ns         4102
deserialize/seperate_size32_none          251624 ns       250376 ns         2790
deserialize/seperate_size64_none         1544826 ns      1537021 ns          468
deserialize/seperate_size128_none        3757021 ns      3738742 ns          193
deserialize/seperate_size256_none        8622765 ns      8579676 ns           84
deserialize/seperate_size512_none       17967755 ns     17879780 ns           42
deserialize/seperate_size1024_none      37691631 ns     37515724 ns           19
serialize/size-prefix_size1_lz4          1128832 ns      1123317 ns          635
serialize/size-prefix_size2_lz4           964527 ns       960179 ns          736
serialize/size-prefix_size4_lz4           956027 ns       951513 ns          754
serialize/size-prefix_size8_lz4           973301 ns       968312 ns          727
serialize/size-prefix_size16_lz4         1042921 ns      1038808 ns          690
serialize/size-prefix_size32_lz4         1309401 ns      1303252 ns          551
serialize/size-prefix_size64_lz4         1818730 ns      1810253 ns          390
serialize/size-prefix_size128_lz4        3336196 ns      3319701 ns          208
serialize/size-prefix_size256_lz4        6159128 ns      6130865 ns          112
serialize/size-prefix_size512_lz4       11968288 ns     11912581 ns           55
serialize/size-prefix_size1024_lz4      24556571 ns     24437219 ns           30
serialize/seperate_size1_lz4              976425 ns       972050 ns          722
serialize/seperate_size2_lz4              757736 ns       754231 ns          880
serialize/seperate_size4_lz4              725690 ns       722232 ns          986
serialize/seperate_size8_lz4              858580 ns       854541 ns          788
serialize/seperate_size16_lz4            1007146 ns      1002031 ns          678
serialize/seperate_size32_lz4            1299244 ns      1292715 ns          543
serialize/seperate_size64_lz4            1943021 ns      1933078 ns          359
serialize/seperate_size128_lz4           3492425 ns      3475573 ns          196
serialize/seperate_size256_lz4           6451407 ns      6421007 ns          108
serialize/seperate_size512_lz4          12230224 ns     12174660 ns           56
serialize/seperate_size1024_lz4         24333582 ns     24225180 ns           28
deserialize/size-prefix_size1_lz4         629151 ns       626390 ns         1118
deserialize/size-prefix_size2_lz4         419287 ns       417270 ns         1687
deserialize/size-prefix_size4_lz4         450667 ns       448330 ns         1541
deserialize/size-prefix_size8_lz4         501499 ns       499147 ns         1264
deserialize/size-prefix_size16_lz4        874233 ns       870202 ns          815
deserialize/size-prefix_size32_lz4       1261213 ns      1256156 ns          588
deserialize/size-prefix_size64_lz4       2783732 ns      2770686 ns          239
deserialize/size-prefix_size128_lz4      6008026 ns      5978607 ns          100
deserialize/size-prefix_size256_lz4     12705194 ns     12643952 ns           56
deserialize/size-prefix_size512_lz4     26087113 ns     25976776 ns           27
deserialize/size-prefix_size1024_lz4    52783570 ns     52540836 ns           13
deserialize/seperate_size1_lz4            318244 ns       316765 ns         2190
deserialize/seperate_size2_lz4            285874 ns       284391 ns         2411
deserialize/seperate_size4_lz4            306898 ns       305442 ns         2298
deserialize/seperate_size8_lz4            308414 ns       306914 ns         2263
deserialize/seperate_size16_lz4           488924 ns       486779 ns         1301
deserialize/seperate_size32_lz4           704201 ns       701054 ns          990
deserialize/seperate_size64_lz4          2340763 ns      2329487 ns          273
deserialize/seperate_size128_lz4         5313285 ns      5288140 ns          129
deserialize/seperate_size256_lz4        11908472 ns     11846170 ns           61
deserialize/seperate_size512_lz4        24370922 ns     24259110 ns           29
deserialize/seperate_size1024_lz4       49147824 ns     48916845 ns           13
serialize/seperate_size1_lw              1031562 ns      1026655 ns          665
serialize/seperate_size2_lw               829748 ns       825534 ns          754
serialize/seperate_size4_lw               790176 ns       786429 ns          896
serialize/seperate_size8_lw               901711 ns       897577 ns          686
serialize/seperate_size16_lw             1102531 ns      1097022 ns          656
serialize/seperate_size32_lw             1373598 ns      1367408 ns          508
serialize/seperate_size64_lw             2043946 ns      2034246 ns          339
serialize/seperate_size128_lw            3564719 ns      3547529 ns          199
serialize/seperate_size256_lw            6528542 ns      6496751 ns          103
serialize/seperate_size512_lw           12334350 ns     12287682 ns           54
serialize/seperate_size1024_lw          24025008 ns     23910280 ns           29
deserialize/seperate_size1_lw             162378 ns       161569 ns         4481
deserialize/seperate_size2_lw             115540 ns       115108 ns         5261
deserialize/seperate_size4_lw             147741 ns       147171 ns         4692
deserialize/seperate_size8_lw             179458 ns       178758 ns         3889
deserialize/seperate_size16_lw            402493 ns       400512 ns         1945
deserialize/seperate_size32_lw            607923 ns       605556 ns         1167
deserialize/seperate_size64_lw           2336497 ns      2327052 ns          312
deserialize/seperate_size128_lw          5250200 ns      5224075 ns          134
deserialize/seperate_size256_lw         11679677 ns     11627112 ns           59
deserialize/seperate_size512_lw         25201488 ns     25069517 ns           29
deserialize/seperate_size1024_lw        49288346 ns     49040803 ns           14
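
For readers who want the idea behind the separate-stream numbers above, the following is a simplified sketch of the new layout, not the code in this PR. The struct and function names are hypothetical, the 32-bit size type is an assumption, and the in-memory column is modeled as concatenated chars plus end offsets without any per-row terminator. It shows why batching becomes possible: sizes turn into offsets with one tight prefix-sum loop, and all chars are copied in a single bulk operation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical on-disk/on-wire form: one stream of sizes, one stream of chars.
struct SeparateStreams
{
    std::vector<uint32_t> sizes; // size of each row (32-bit is an assumption)
    std::vector<char> chars;     // bytes of all rows, concatenated
};

// Hypothetical in-memory column: concatenated bytes plus the end offset of each row.
struct StringColumn
{
    std::vector<char> chars;
    std::vector<uint64_t> offsets; // offsets[i] is the end of row i in chars
};

SeparateStreams serializeSeparate(const std::vector<std::string> & strs)
{
    SeparateStreams out;
    out.sizes.reserve(strs.size());
    for (const auto & s : strs)
    {
        out.sizes.push_back(static_cast<uint32_t>(s.size()));
        out.chars.insert(out.chars.end(), s.begin(), s.end());
    }
    return out;
}

StringColumn deserializeSeparate(const SeparateStreams & in)
{
    StringColumn col;
    // Sizes become offsets through a simple prefix sum over a dense array.
    col.offsets.resize(in.sizes.size());
    uint64_t end = 0;
    for (size_t i = 0; i < in.sizes.size(); ++i)
    {
        end += in.sizes[i];
        col.offsets[i] = end;
    }
    // All character data is copied at once instead of row by row.
    col.chars = in.chars;
    return col;
}

// Read one row back out of the column, for checking the round trip.
std::string getRow(const StringColumn & col, size_t i)
{
    const uint64_t begin = (i == 0) ? 0 : col.offsets[i - 1];
    return std::string(col.chars.begin() + begin, col.chars.begin() + col.offsets[i]);
}
```

Keeping sizes in their own stream also means they can be compressed independently of the chars, which is what the lw variant in the benchmark relies on.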

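The name-based compatibility rule described earlier can be illustrated with a small lookup table. This is only a sketch under assumptions: the registry class and enum are hypothetical and are not TiFlash APIs, and the names "String" and "StringV1" are taken from the error message quoted above. A node with the new code registers both names, while a node with the old code knows only the old name and fails on the new one.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical enum naming the two formats discussed in this PR.
enum class StringSerdesFormat
{
    SizePrefix,
    SeparateSizeAndChars,
};

// Hypothetical registry that resolves a type name (as stored in metadata or
// sent in an exchange packet) to the format used to decode the data.
class TypeNameRegistry
{
public:
    void add(const std::string & name, StringSerdesFormat format) { formats[name] = format; }

    StringSerdesFormat get(const std::string & name) const
    {
        auto it = formats.find(name);
        if (it == formats.end())
            throw std::runtime_error("Unknown data type name: " + name);
        return it->second;
    }

private:
    std::map<std::string, StringSerdesFormat> formats;
};

// A node running the old code knows only the old name.
TypeNameRegistry makeOldCodeRegistry()
{
    TypeNameRegistry r;
    r.add("String", StringSerdesFormat::SizePrefix);
    return r;
}

// A node running the new code knows both names, so old data stays readable.
TypeNameRegistry makeNewCodeRegistry()
{
    TypeNameRegistry r;
    r.add("String", StringSerdesFormat::SizePrefix);
    r.add("StringV1", StringSerdesFormat::SeparateSizeAndChars);
    return r;
}
```

With this shape, upgrading receivers before senders start emitting the new name avoids the failure case, since new code accepts both names.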

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Nov 13, 2024
Contributor

ti-chi-bot bot commented Nov 13, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jinhelin, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Nov 13, 2024
@JinheLin
Contributor Author

/retest

@JinheLin JinheLin force-pushed the string_serde branch 2 times, most recently from ee13f63 to d925107 Compare November 26, 2024 05:26
@JinheLin JinheLin changed the title WIP: test Storages: New serialization/deserialization for DataTypeString Nov 26, 2024
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 26, 2024
Labels
release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enhance deserialization performance of short string.
3 participants