Migrate set retrieve to use the OA implementation #637

Draft
wants to merge 9 commits into
base: dev

PointKernel
Member

TBD

PointKernel added the "type: improvement" and "topic: static_set" labels on Nov 8, 2024

PointKernel commented Nov 8, 2024

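Compared to the current set algorithm, the OA retrieve achieves comparable performance after adding an early exit for a single hash set, though it is still slower in every configuration below: roughly 6% to 27% depending on the configuration, and about 20% or more at high occupancy or low matching rate.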

yunsongw@0c23fdd-lcedt:~/Work/nvbench/scripts$ ./nvbench_compare.py old_retrieve.json oa-items-block.json 
['old_retrieve.json', 'oa-items-block.json']
# static_set_retrieve_uniform_occupancy

## [0] Quadro RTX 8000

|  Key  |  Distribution  |  Occupancy  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|-------|----------------|-------------|------------|-------------|------------|-------------|-----------|---------|----------|
|  I32  |    UNIFORM     |     0.1     |  31.708 ms |       0.40% |  34.022 ms |       0.11% |  2.314 ms |   7.30% |   FAIL   |
|  I32  |    UNIFORM     |     0.2     |  31.706 ms |       0.32% |  34.080 ms |       0.06% |  2.375 ms |   7.49% |   FAIL   |
|  I32  |    UNIFORM     |     0.3     |  31.689 ms |       0.29% |  34.312 ms |       0.13% |  2.623 ms |   8.28% |   FAIL   |
|  I32  |    UNIFORM     |     0.4     |  32.002 ms |       1.16% |  34.864 ms |       0.04% |  2.862 ms |   8.94% |   FAIL   |
|  I32  |    UNIFORM     |     0.5     |  32.161 ms |       0.07% |  35.880 ms |       0.08% |  3.719 ms |  11.56% |   FAIL   |
|  I32  |    UNIFORM     |     0.6     |  32.711 ms |       0.10% |  37.520 ms |       0.26% |  4.808 ms |  14.70% |   FAIL   |
|  I32  |    UNIFORM     |     0.7     |  33.454 ms |       0.06% |  39.609 ms |       0.01% |  6.155 ms |  18.40% |   FAIL   |
|  I32  |    UNIFORM     |     0.8     |  34.625 ms |       0.17% |  42.471 ms |       0.03% |  7.846 ms |  22.66% |   FAIL   |
|  I32  |    UNIFORM     |     0.9     |  36.350 ms |       0.09% |  46.138 ms |       0.05% |  9.788 ms |  26.93% |   FAIL   |
|  I64  |    UNIFORM     |     0.1     |  33.798 ms |       0.04% |  36.332 ms |       0.24% |  2.534 ms |   7.50% |   FAIL   |
|  I64  |    UNIFORM     |     0.2     |  33.976 ms |       1.76% |  36.435 ms |       0.04% |  2.459 ms |   7.24% |   FAIL   |
|  I64  |    UNIFORM     |     0.3     |  33.978 ms |       0.14% |  36.717 ms |       0.05% |  2.739 ms |   8.06% |   FAIL   |
|  I64  |    UNIFORM     |     0.4     |  34.090 ms |       0.10% |  37.314 ms |       0.04% |  3.223 ms |   9.46% |   FAIL   |
|  I64  |    UNIFORM     |     0.5     |  34.412 ms |       0.06% |  38.389 ms |       0.10% |  3.977 ms |  11.56% |   FAIL   |
|  I64  |    UNIFORM     |     0.6     |  34.927 ms |       0.03% |  40.046 ms |       0.05% |  5.119 ms |  14.66% |   FAIL   |
|  I64  |    UNIFORM     |     0.7     |  35.790 ms |       0.28% |  42.317 ms |       0.04% |  6.527 ms |  18.24% |   FAIL   |
|  I64  |    UNIFORM     |     0.8     |  36.891 ms |       0.06% |  45.269 ms |       0.03% |  8.378 ms |  22.71% |   FAIL   |
|  I64  |    UNIFORM     |     0.9     |  38.686 ms |       0.04% |  49.043 ms |       0.02% | 10.357 ms |  26.77% |   FAIL   |

# static_set_retrieve_uniform_matching_rate

## [0] Quadro RTX 8000

|  Key  |  Distribution  |  MatchingRate  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |     Diff |   %Diff |  Status  |
|-------|----------------|----------------|------------|-------------|------------|-------------|----------|---------|----------|
|  I32  |    UNIFORM     |      0.1       |  32.647 ms |       0.06% |  40.862 ms |       0.12% | 8.214 ms |  25.16% |   FAIL   |
|  I32  |    UNIFORM     |      0.2       |  32.670 ms |       0.37% |  40.687 ms |       0.21% | 8.017 ms |  24.54% |   FAIL   |
|  I32  |    UNIFORM     |      0.3       |  32.532 ms |       0.10% |  40.397 ms |       0.02% | 7.865 ms |  24.18% |   FAIL   |
|  I32  |    UNIFORM     |      0.4       |  32.472 ms |       0.08% |  40.154 ms |       0.04% | 7.682 ms |  23.66% |   FAIL   |
|  I32  |    UNIFORM     |      0.5       |  32.498 ms |       0.35% |  39.601 ms |       0.06% | 7.103 ms |  21.86% |   FAIL   |
|  I32  |    UNIFORM     |      0.6       |  32.392 ms |       0.05% |  38.879 ms |       0.03% | 6.487 ms |  20.03% |   FAIL   |
|  I32  |    UNIFORM     |      0.7       |  32.340 ms |       0.09% |  37.965 ms |       0.04% | 5.625 ms |  17.39% |   FAIL   |
|  I32  |    UNIFORM     |      0.8       |  32.283 ms |       0.06% |  37.166 ms |       0.07% | 4.883 ms |  15.12% |   FAIL   |
|  I32  |    UNIFORM     |      0.9       |  32.241 ms |       0.06% |  36.486 ms |       0.04% | 4.245 ms |  13.17% |   FAIL   |
|  I32  |    UNIFORM     |       1        |  32.243 ms |       0.14% |  35.845 ms |       0.04% | 3.603 ms |  11.17% |   FAIL   |
|  I64  |    UNIFORM     |      0.1       |  34.905 ms |       0.04% |  43.508 ms |       0.06% | 8.603 ms |  24.65% |   FAIL   |
|  I64  |    UNIFORM     |      0.2       |  34.845 ms |       0.03% |  43.344 ms |       0.16% | 8.500 ms |  24.39% |   FAIL   |
|  I64  |    UNIFORM     |      0.3       |  34.806 ms |       0.05% |  43.120 ms |       0.04% | 8.314 ms |  23.89% |   FAIL   |
|  I64  |    UNIFORM     |      0.4       |  34.754 ms |       0.05% |  42.887 ms |       0.07% | 8.132 ms |  23.40% |   FAIL   |
|  I64  |    UNIFORM     |      0.5       |  34.727 ms |       0.18% |  42.269 ms |       0.04% | 7.543 ms |  21.72% |   FAIL   |
|  I64  |    UNIFORM     |      0.6       |  34.656 ms |       0.04% |  41.548 ms |       0.03% | 6.891 ms |  19.89% |   FAIL   |
|  I64  |    UNIFORM     |      0.7       |  34.599 ms |       0.05% |  40.584 ms |       0.02% | 5.985 ms |  17.30% |   FAIL   |
|  I64  |    UNIFORM     |      0.8       |  34.570 ms |       0.04% |  39.717 ms |       0.03% | 5.147 ms |  14.89% |   FAIL   |
|  I64  |    UNIFORM     |      0.9       |  34.533 ms |       0.06% |  38.977 ms |       0.02% | 4.444 ms |  12.87% |   FAIL   |
|  I64  |    UNIFORM     |       1        |  34.453 ms |       0.05% |  38.263 ms |       0.03% | 3.810 ms |  11.06% |   FAIL   |

# static_set_retrieve_uniform_multiplicity

## [0] Quadro RTX 8000

|  Key  |  Distribution  |  Multiplicity  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |     Diff |   %Diff |  Status  |
|-------|----------------|----------------|------------|-------------|------------|-------------|----------|---------|----------|
|  I32  |    UNIFORM     |       1        |  32.236 ms |       0.11% |  35.795 ms |       0.03% | 3.559 ms |  11.04% |   FAIL   |
|  I32  |    UNIFORM     |       2        |  31.634 ms |       0.12% |  34.570 ms |       0.06% | 2.937 ms |   9.28% |   FAIL   |
|  I32  |    UNIFORM     |       4        |  31.494 ms |       0.28% |  33.824 ms |       0.03% | 2.330 ms |   7.40% |   FAIL   |
|  I32  |    UNIFORM     |       8        |  31.412 ms |       0.05% |  33.685 ms |       0.04% | 2.272 ms |   7.23% |   FAIL   |
|  I32  |    UNIFORM     |       16       |  31.345 ms |       0.07% |  33.585 ms |       0.03% | 2.241 ms |   7.15% |   FAIL   |
|  I64  |    UNIFORM     |       1        |  34.478 ms |       0.05% |  38.306 ms |       0.12% | 3.828 ms |  11.10% |   FAIL   |
|  I64  |    UNIFORM     |       2        |  33.858 ms |       0.04% |  36.990 ms |       0.04% | 3.132 ms |   9.25% |   FAIL   |
|  I64  |    UNIFORM     |       4        |  34.050 ms |       3.52% |  36.166 ms |       0.04% | 2.116 ms |   6.22% |   FAIL   |
|  I64  |    UNIFORM     |       8        |  33.701 ms |       0.04% |  36.028 ms |       0.05% | 2.326 ms |   6.90% |   FAIL   |
|  I64  |    UNIFORM     |       16       |  33.645 ms |       0.11% |  35.898 ms |       0.03% | 2.252 ms |   6.69% |   FAIL   |
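
As a quick aid for reading the tables, the `%Diff` column is consistent with `(Cmp Time - Ref Time) / Ref Time`. The following is a minimal Python sketch that recomputes it for a few rows copied from the tables above. The helper name `percent_diff` is made up for illustration and is not part of nvbench; small deviations from the reported values are expected because the printed times are rounded.

```python
def percent_diff(ref_ms, cmp_ms):
    """Slowdown of cmp relative to ref, in percent (positive means cmp is slower)."""
    return (cmp_ms - ref_ms) / ref_ms * 100.0


# (label, Ref Time in ms, Cmp Time in ms, %Diff as reported in the tables)
rows = [
    ("occupancy 0.1, I32", 31.708, 34.022, 7.30),
    ("occupancy 0.9, I32", 36.350, 46.138, 26.93),
    ("matching rate 0.1, I32", 32.647, 40.862, 25.16),
    ("multiplicity 16, I64", 33.645, 35.898, 6.69),
]

for label, ref_ms, cmp_ms, reported in rows:
    computed = percent_diff(ref_ms, cmp_ms)
    print(f"{label}: computed {computed:.2f}%, reported {reported:.2f}%")
```

Read this way, the slowdown grows with occupancy and shrinks as the matching rate or multiplicity goes up.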

On the bright side, I realized that cudf’s distinct inner join can use `find` instead of the costly atomic-bounded `retrieve` operation, which gives noticeable speedups (rapidsai/cudf#17278). With that case covered by `find`, we can concentrate specifically on multiset use cases for `retrieve`.
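
To illustrate why `find` can stand in for `retrieve` when the build keys are distinct, here is a host-side Python model. It is only a sketch under assumptions: it is not the cuco or cudf API, all names (`retrieve_pairs`, `find_rows`, `NOT_FOUND`) are hypothetical, and the plain integer `counter` stands in for the atomic counter a GPU kernel would use to reserve output slots.

```python
NOT_FOUND = -1  # hypothetical sentinel meaning "no matching build row"


def build_index(build_keys):
    """Map each distinct build key to its row index (stand-in for the hash set)."""
    return {key: row for row, key in enumerate(build_keys)}


def retrieve_pairs(build_keys, probe_keys):
    """Retrieve-style join: matches are compacted through a shared counter."""
    index = build_index(build_keys)
    out = [None] * len(probe_keys)  # distinct keys: at most one match per probe
    counter = 0  # on a GPU this would be an atomic counter shared by all threads
    for probe_row, key in enumerate(probe_keys):
        build_row = index.get(key, NOT_FOUND)
        if build_row != NOT_FOUND:
            slot = counter  # a kernel would reserve the slot with an atomic add
            counter += 1
            out[slot] = (probe_row, build_row)
    return out[:counter]


def find_rows(build_keys, probe_keys):
    """Find-style lookup: one output slot per probe key, no shared counter."""
    index = build_index(build_keys)
    return [index.get(key, NOT_FOUND) for key in probe_keys]


def pairs_from_find(found):
    """Separate compaction pass that drops the probes without a match."""
    return [(probe_row, build_row)
            for probe_row, build_row in enumerate(found)
            if build_row != NOT_FOUND]


build = [10, 20, 30, 40]
probe = [20, 99, 40, 10]
pairs_retrieve = retrieve_pairs(build, probe)
pairs_find = pairs_from_find(find_rows(build, probe))
```

Because each probe key matches at most one build key, the output position in the `find` variant is fixed by the probe index, so no contended counter is needed. With duplicate keys (the multiset case) a probe can produce several matches, which is where a `retrieve`-style operation is still required.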

Is a clean and nice block-wise retrieve API worth a 20% performance slowdown?
