[Performance] Faster slice sampler #2031

Merged
merged 6 commits into from
Mar 22, 2024

vmoens (Contributor) commented Mar 21, 2024

TODO:

  • Make sure that this works with shared buffers (calling extend on one worker won't erase the cache on the other!)

cc @ahmed-touati @Cadene

pytorch-bot bot commented Mar 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/2031

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 20 Unrelated Failures

As of commit 4eb145b with merge base 660d827:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label Mar 21, 2024
vmoens added the performance label Mar 21, 2024

github-actions bot commented Mar 21, 2024

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of CPU Benchmark Tests

Total Benchmarks: 91. Improved: $\large\color{#35bf28}5$. Worsened: $\large\color{#d91a1a}12$.

Name Max Mean Ops Ops on Repo HEAD Change
test_single 62.2475ms 54.7210ms 18.2745 Ops/s 17.8861 Ops/s $\color{#35bf28}+2.17\%$
test_sync 30.4470ms 29.1686ms 34.2834 Ops/s 34.4402 Ops/s $\color{#d91a1a}-0.46\%$
test_async 52.1722ms 26.8124ms 37.2962 Ops/s 36.5925 Ops/s $\color{#35bf28}+1.92\%$
test_simple 0.3999s 0.3389s 2.9511 Ops/s 2.9356 Ops/s $\color{#35bf28}+0.53\%$
test_transformed 0.5266s 0.4756s 2.1025 Ops/s 2.1166 Ops/s $\color{#d91a1a}-0.67\%$
test_serial 1.2536s 1.2009s 0.8327 Ops/s 0.8302 Ops/s $\color{#35bf28}+0.31\%$
test_parallel 1.0770s 1.0242s 0.9763 Ops/s 0.9725 Ops/s $\color{#35bf28}+0.40\%$
test_step_mdp_speed[True-True-True-True-True] 0.1539ms 21.1090μs 47.3731 KOps/s 47.0284 KOps/s $\color{#35bf28}+0.73\%$
test_step_mdp_speed[True-True-True-True-False] 39.3930μs 13.1737μs 75.9087 KOps/s 78.2412 KOps/s $\color{#d91a1a}-2.98\%$
test_step_mdp_speed[True-True-True-False-True] 34.7440μs 12.4711μs 80.1855 KOps/s 80.6377 KOps/s $\color{#d91a1a}-0.56\%$
test_step_mdp_speed[True-True-True-False-False] 29.3540μs 7.6922μs 130.0022 KOps/s 134.9625 KOps/s $\color{#d91a1a}-3.68\%$
test_step_mdp_speed[True-True-False-True-True] 58.0770μs 22.5685μs 44.3096 KOps/s 44.3384 KOps/s $\color{#d91a1a}-0.07\%$
test_step_mdp_speed[True-True-False-True-False] 56.4350μs 14.2736μs 70.0596 KOps/s 71.5608 KOps/s $\color{#d91a1a}-2.10\%$
test_step_mdp_speed[True-True-False-False-True] 39.0520μs 13.6201μs 73.4209 KOps/s 73.5524 KOps/s $\color{#d91a1a}-0.18\%$
test_step_mdp_speed[True-True-False-False-False] 35.5260μs 8.8072μs 113.5437 KOps/s 115.9136 KOps/s $\color{#d91a1a}-2.04\%$
test_step_mdp_speed[True-False-True-True-True] 68.0760μs 23.8625μs 41.9068 KOps/s 41.4831 KOps/s $\color{#35bf28}+1.02\%$
test_step_mdp_speed[True-False-True-True-False] 39.9640μs 15.7575μs 63.4619 KOps/s 64.6992 KOps/s $\color{#d91a1a}-1.91\%$
test_step_mdp_speed[True-False-True-False-True] 83.3820μs 13.8522μs 72.1907 KOps/s 73.3301 KOps/s $\color{#d91a1a}-1.55\%$
test_step_mdp_speed[True-False-True-False-False] 33.0720μs 8.7303μs 114.5431 KOps/s 112.7358 KOps/s $\color{#35bf28}+1.60\%$
test_step_mdp_speed[True-False-False-True-True] 51.3560μs 24.7853μs 40.3465 KOps/s 39.7711 KOps/s $\color{#35bf28}+1.45\%$
test_step_mdp_speed[True-False-False-True-False] 42.4190μs 16.8142μs 59.4736 KOps/s 61.2857 KOps/s $\color{#d91a1a}-2.96\%$
test_step_mdp_speed[True-False-False-False-True] 44.7630μs 14.7085μs 67.9879 KOps/s 66.8635 KOps/s $\color{#35bf28}+1.68\%$
test_step_mdp_speed[True-False-False-False-False] 0.1117ms 10.3019μs 97.0695 KOps/s 102.7007 KOps/s $\textbf{\color{#d91a1a}-5.48\%}$
test_step_mdp_speed[False-True-True-True-True] 60.2830μs 23.9268μs 41.7942 KOps/s 41.7524 KOps/s $\color{#35bf28}+0.10\%$
test_step_mdp_speed[False-True-True-True-False] 41.9580μs 15.8159μs 63.2274 KOps/s 64.8198 KOps/s $\color{#d91a1a}-2.46\%$
test_step_mdp_speed[False-True-True-False-True] 58.1890μs 15.9554μs 62.6746 KOps/s 62.7421 KOps/s $\color{#d91a1a}-0.11\%$
test_step_mdp_speed[False-True-True-False-False] 39.6340μs 9.9925μs 100.0752 KOps/s 101.4280 KOps/s $\color{#d91a1a}-1.33\%$
test_step_mdp_speed[False-True-False-True-True] 47.6790μs 25.6084μs 39.0497 KOps/s 39.5810 KOps/s $\color{#d91a1a}-1.34\%$
test_step_mdp_speed[False-True-False-True-False] 36.3670μs 17.1092μs 58.4482 KOps/s 61.2718 KOps/s $\color{#d91a1a}-4.61\%$
test_step_mdp_speed[False-True-False-False-True] 37.9210μs 17.0988μs 58.4837 KOps/s 59.0860 KOps/s $\color{#d91a1a}-1.02\%$
test_step_mdp_speed[False-True-False-False-False] 89.8370μs 11.3255μs 88.2964 KOps/s 91.3664 KOps/s $\color{#d91a1a}-3.36\%$
test_step_mdp_speed[False-False-True-True-True] 0.1011ms 26.2544μs 38.0888 KOps/s 38.0274 KOps/s $\color{#35bf28}+0.16\%$
test_step_mdp_speed[False-False-True-True-False] 50.6640μs 18.2495μs 54.7959 KOps/s 56.4244 KOps/s $\color{#d91a1a}-2.89\%$
test_step_mdp_speed[False-False-True-False-True] 46.3460μs 17.0292μs 58.7228 KOps/s 58.6099 KOps/s $\color{#35bf28}+0.19\%$
test_step_mdp_speed[False-False-True-False-False] 41.0770μs 11.2506μs 88.8838 KOps/s 90.6026 KOps/s $\color{#d91a1a}-1.90\%$
test_step_mdp_speed[False-False-False-True-True] 61.3740μs 27.5870μs 36.2490 KOps/s 36.4507 KOps/s $\color{#d91a1a}-0.55\%$
test_step_mdp_speed[False-False-False-True-False] 85.1390μs 19.5623μs 51.1188 KOps/s 53.1690 KOps/s $\color{#d91a1a}-3.86\%$
test_step_mdp_speed[False-False-False-False-True] 56.4550μs 18.0391μs 55.4350 KOps/s 56.2428 KOps/s $\color{#d91a1a}-1.44\%$
test_step_mdp_speed[False-False-False-False-False] 57.7910μs 12.1922μs 82.0196 KOps/s 83.2132 KOps/s $\color{#d91a1a}-1.43\%$
test_values[generalized_advantage_estimate-True-True] 10.4406ms 9.5631ms 104.5683 Ops/s 108.6454 Ops/s $\color{#d91a1a}-3.75\%$
test_values[vec_generalized_advantage_estimate-True-True] 39.9798ms 35.7996ms 27.9333 Ops/s 28.5532 Ops/s $\color{#d91a1a}-2.17\%$
test_values[td0_return_estimate-False-False] 0.2272ms 0.1766ms 5.6627 KOps/s 5.7591 KOps/s $\color{#d91a1a}-1.67\%$
test_values[td1_return_estimate-False-False] 23.2642ms 22.8395ms 43.7838 Ops/s 43.6765 Ops/s $\color{#35bf28}+0.25\%$
test_values[vec_td1_return_estimate-False-False] 37.4280ms 35.8148ms 27.9214 Ops/s 28.6381 Ops/s $\color{#d91a1a}-2.50\%$
test_values[td_lambda_return_estimate-True-False] 37.0803ms 33.6072ms 29.7556 Ops/s 30.4478 Ops/s $\color{#d91a1a}-2.27\%$
test_values[vec_td_lambda_return_estimate-True-False] 37.1078ms 35.6954ms 28.0148 Ops/s 28.4505 Ops/s $\color{#d91a1a}-1.53\%$
test_gae_speed[generalized_advantage_estimate-False-1-512] 8.2732ms 8.1605ms 122.5417 Ops/s 123.1707 Ops/s $\color{#d91a1a}-0.51\%$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 2.4584ms 1.9294ms 518.2865 Ops/s 555.4366 Ops/s $\textbf{\color{#d91a1a}-6.69\%}$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 0.5957ms 0.3507ms 2.8514 KOps/s 2.8455 KOps/s $\color{#35bf28}+0.21\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 41.4098ms 39.1982ms 25.5114 Ops/s 22.0761 Ops/s $\textbf{\color{#35bf28}+15.56\%}$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 3.5532ms 3.0319ms 329.8239 Ops/s 331.9050 Ops/s $\color{#d91a1a}-0.63\%$
test_dqn_speed 6.9681ms 1.3551ms 737.9322 Ops/s 695.0187 Ops/s $\textbf{\color{#35bf28}+6.17\%}$
test_ddpg_speed 2.9820ms 2.6830ms 372.7182 Ops/s 378.0220 Ops/s $\color{#d91a1a}-1.40\%$
test_sac_speed 9.7313ms 8.2612ms 121.0474 Ops/s 123.5812 Ops/s $\color{#d91a1a}-2.05\%$
test_redq_speed 14.4897ms 13.2607ms 75.4106 Ops/s 77.9603 Ops/s $\color{#d91a1a}-3.27\%$
test_redq_deprec_speed 14.9728ms 13.3221ms 75.0634 Ops/s 77.9220 Ops/s $\color{#d91a1a}-3.67\%$
test_td3_speed 16.1090ms 8.2068ms 121.8503 Ops/s 124.0594 Ops/s $\color{#d91a1a}-1.78\%$
test_cql_speed 37.3579ms 36.1080ms 27.6947 Ops/s 27.8215 Ops/s $\color{#d91a1a}-0.46\%$
test_a2c_speed 8.0638ms 7.3666ms 135.7482 Ops/s 137.3737 Ops/s $\color{#d91a1a}-1.18\%$
test_ppo_speed 9.1612ms 7.6782ms 130.2393 Ops/s 133.1884 Ops/s $\color{#d91a1a}-2.21\%$
test_reinforce_speed 7.3033ms 6.5535ms 152.5897 Ops/s 154.9115 Ops/s $\color{#d91a1a}-1.50\%$
test_iql_speed 33.2971ms 32.3145ms 30.9458 Ops/s 30.9785 Ops/s $\color{#d91a1a}-0.11\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 2.5375ms 2.2631ms 441.8754 Ops/s 479.0684 Ops/s $\textbf{\color{#d91a1a}-7.76\%}$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 97.7832ms 0.5770ms 1.7332 KOps/s 2.0291 KOps/s $\textbf{\color{#d91a1a}-14.58\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.6711ms 0.4747ms 2.1067 KOps/s 2.1287 KOps/s $\color{#d91a1a}-1.03\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 3.3964ms 2.3289ms 429.3933 Ops/s 482.0713 Ops/s $\textbf{\color{#d91a1a}-10.93\%}$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 1.1012ms 0.4927ms 2.0296 KOps/s 2.0597 KOps/s $\color{#d91a1a}-1.46\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.6407ms 0.4687ms 2.1335 KOps/s 2.1795 KOps/s $\color{#d91a1a}-2.11\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 1.7901ms 1.2096ms 826.7491 Ops/s 774.5112 Ops/s $\textbf{\color{#35bf28}+6.74\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 1.6407ms 1.1417ms 875.9153 Ops/s 817.8853 Ops/s $\textbf{\color{#35bf28}+7.10\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 3.4375ms 2.3867ms 418.9900 Ops/s 452.2994 Ops/s $\textbf{\color{#d91a1a}-7.36\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1.0785ms 0.6149ms 1.6263 KOps/s 1.6486 KOps/s $\color{#d91a1a}-1.35\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.8939ms 0.5882ms 1.7002 KOps/s 1.7172 KOps/s $\color{#d91a1a}-0.99\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 3.4560ms 2.2994ms 434.9000 Ops/s 483.7164 Ops/s $\textbf{\color{#d91a1a}-10.09\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.6569ms 0.5022ms 1.9911 KOps/s 2.0223 KOps/s $\color{#d91a1a}-1.54\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 3.9302ms 0.4814ms 2.0775 KOps/s 2.1249 KOps/s $\color{#d91a1a}-2.23\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 3.4917ms 2.3635ms 423.1000 Ops/s 474.0430 Ops/s $\textbf{\color{#d91a1a}-10.75\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.6125ms 0.4918ms 2.0333 KOps/s 2.0453 KOps/s $\color{#d91a1a}-0.59\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.6826ms 0.4712ms 2.1223 KOps/s 2.1749 KOps/s $\color{#d91a1a}-2.42\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 3.4840ms 2.3940ms 417.7068 Ops/s 456.8850 Ops/s $\textbf{\color{#d91a1a}-8.58\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1.1367ms 0.6168ms 1.6212 KOps/s 1.6450 KOps/s $\color{#d91a1a}-1.45\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.7855ms 0.5879ms 1.7009 KOps/s 1.7209 KOps/s $\color{#d91a1a}-1.17\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 0.1104s 7.6689ms 130.3973 Ops/s 134.7378 Ops/s $\color{#d91a1a}-3.22\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 14.3786ms 11.9633ms 83.5891 Ops/s 83.8722 Ops/s $\color{#d91a1a}-0.34\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 1.6076ms 1.0607ms 942.8066 Ops/s 958.8463 Ops/s $\color{#d91a1a}-1.67\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 0.1010s 5.5350ms 180.6675 Ops/s 185.3464 Ops/s $\color{#d91a1a}-2.52\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 14.2861ms 11.8917ms 84.0921 Ops/s 68.6599 Ops/s $\textbf{\color{#35bf28}+22.48\%}$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 3.9687ms 1.1515ms 868.4556 Ops/s 964.6981 Ops/s $\textbf{\color{#d91a1a}-9.98\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 0.1068s 8.0143ms 124.7775 Ops/s 173.6853 Ops/s $\textbf{\color{#d91a1a}-28.16\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 15.0415ms 12.2902ms 81.3657 Ops/s 81.1138 Ops/s $\color{#35bf28}+0.31\%$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 4.1109ms 1.4292ms 699.6817 Ops/s 744.2709 Ops/s $\textbf{\color{#d91a1a}-5.99\%}$

github-actions bot commented Mar 21, 2024

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of GPU Benchmark Tests

Total Benchmarks: 94. Improved: $\large\color{#35bf28}6$. Worsened: $\large\color{#d91a1a}2$.

Name Max Mean Ops Ops on Repo HEAD Change
test_single 0.1079s 0.1043s 9.5893 Ops/s 9.1300 Ops/s $\textbf{\color{#35bf28}+5.03\%}$
test_sync 91.7501ms 88.0483ms 11.3574 Ops/s 10.7516 Ops/s $\textbf{\color{#35bf28}+5.63\%}$
test_async 0.1819s 90.5767ms 11.0404 Ops/s 11.1805 Ops/s $\color{#d91a1a}-1.25\%$
test_single_pixels 0.1134s 0.1126s 8.8794 Ops/s 8.9232 Ops/s $\color{#d91a1a}-0.49\%$
test_sync_pixels 76.0009ms 68.0311ms 14.6992 Ops/s 14.9715 Ops/s $\color{#d91a1a}-1.82\%$
test_async_pixels 0.1007s 62.8028ms 15.9229 Ops/s 15.6392 Ops/s $\color{#35bf28}+1.81\%$
test_simple 0.7485s 0.6780s 1.4748 Ops/s 1.4887 Ops/s $\color{#d91a1a}-0.93\%$
test_transformed 0.9745s 0.8911s 1.1223 Ops/s 1.1274 Ops/s $\color{#d91a1a}-0.46\%$
test_serial 2.1693s 2.1182s 0.4721 Ops/s 0.4799 Ops/s $\color{#d91a1a}-1.62\%$
test_parallel 1.8928s 1.8162s 0.5506 Ops/s 0.5510 Ops/s $\color{#d91a1a}-0.08\%$
test_step_mdp_speed[True-True-True-True-True] 84.7010μs 33.5420μs 29.8134 KOps/s 28.7615 KOps/s $\color{#35bf28}+3.66\%$
test_step_mdp_speed[True-True-True-True-False] 45.2010μs 19.5138μs 51.2458 KOps/s 50.3475 KOps/s $\color{#35bf28}+1.78\%$
test_step_mdp_speed[True-True-True-False-True] 33.3100μs 18.4865μs 54.0936 KOps/s 52.8192 KOps/s $\color{#35bf28}+2.41\%$
test_step_mdp_speed[True-True-True-False-False] 27.3700μs 11.2500μs 88.8886 KOps/s 88.3643 KOps/s $\color{#35bf28}+0.59\%$
test_step_mdp_speed[True-True-False-True-True] 95.0710μs 34.6286μs 28.8778 KOps/s 28.2402 KOps/s $\color{#35bf28}+2.26\%$
test_step_mdp_speed[True-True-False-True-False] 40.2300μs 21.2517μs 47.0550 KOps/s 45.8746 KOps/s $\color{#35bf28}+2.57\%$
test_step_mdp_speed[True-True-False-False-True] 48.3900μs 20.2790μs 49.3121 KOps/s 48.9478 KOps/s $\color{#35bf28}+0.74\%$
test_step_mdp_speed[True-True-False-False-False] 31.1510μs 13.2582μs 75.4249 KOps/s 75.5476 KOps/s $\color{#d91a1a}-0.16\%$
test_step_mdp_speed[True-False-True-True-True] 60.2510μs 36.8132μs 27.1642 KOps/s 26.6369 KOps/s $\color{#35bf28}+1.98\%$
test_step_mdp_speed[True-False-True-True-False] 90.7920μs 23.3909μs 42.7517 KOps/s 42.2603 KOps/s $\color{#35bf28}+1.16\%$
test_step_mdp_speed[True-False-True-False-True] 36.9910μs 20.1754μs 49.5652 KOps/s 48.8036 KOps/s $\color{#35bf28}+1.56\%$
test_step_mdp_speed[True-False-True-False-False] 30.6100μs 13.1347μs 76.1342 KOps/s 75.1771 KOps/s $\color{#35bf28}+1.27\%$
test_step_mdp_speed[True-False-False-True-True] 71.2310μs 37.8947μs 26.3889 KOps/s 25.6757 KOps/s $\color{#35bf28}+2.78\%$
test_step_mdp_speed[True-False-False-True-False] 61.1910μs 24.8488μs 40.2434 KOps/s 39.5736 KOps/s $\color{#35bf28}+1.69\%$
test_step_mdp_speed[True-False-False-False-True] 39.1610μs 21.9996μs 45.4554 KOps/s 45.1817 KOps/s $\color{#35bf28}+0.61\%$
test_step_mdp_speed[True-False-False-False-False] 29.9510μs 14.7778μs 67.6691 KOps/s 66.7648 KOps/s $\color{#35bf28}+1.35\%$
test_step_mdp_speed[False-True-True-True-True] 55.9110μs 35.9942μs 27.7823 KOps/s 26.7138 KOps/s $\color{#35bf28}+4.00\%$
test_step_mdp_speed[False-True-True-True-False] 47.4600μs 23.2114μs 43.0822 KOps/s 42.3457 KOps/s $\color{#35bf28}+1.74\%$
test_step_mdp_speed[False-True-True-False-True] 50.0700μs 24.1731μs 41.3683 KOps/s 41.4149 KOps/s $\color{#d91a1a}-0.11\%$
test_step_mdp_speed[False-True-True-False-False] 31.9600μs 14.7400μs 67.8427 KOps/s 67.9126 KOps/s $\color{#d91a1a}-0.10\%$
test_step_mdp_speed[False-True-False-True-True] 63.1210μs 38.8398μs 25.7468 KOps/s 25.0651 KOps/s $\color{#35bf28}+2.72\%$
test_step_mdp_speed[False-True-False-True-False] 55.2310μs 25.0496μs 39.9208 KOps/s 39.6872 KOps/s $\color{#35bf28}+0.59\%$
test_step_mdp_speed[False-True-False-False-True] 49.2910μs 25.9779μs 38.4943 KOps/s 38.5669 KOps/s $\color{#d91a1a}-0.19\%$
test_step_mdp_speed[False-True-False-False-False] 45.4210μs 16.4559μs 60.7685 KOps/s 59.9086 KOps/s $\color{#35bf28}+1.44\%$
test_step_mdp_speed[False-False-True-True-True] 57.6310μs 40.3140μs 24.8053 KOps/s 24.3771 KOps/s $\color{#35bf28}+1.76\%$
test_step_mdp_speed[False-False-True-True-False] 43.2510μs 27.1628μs 36.8151 KOps/s 36.5516 KOps/s $\color{#35bf28}+0.72\%$
test_step_mdp_speed[False-False-True-False-True] 93.5210μs 25.9010μs 38.6085 KOps/s 38.2611 KOps/s $\color{#35bf28}+0.91\%$
test_step_mdp_speed[False-False-True-False-False] 35.4010μs 16.4781μs 60.6868 KOps/s 60.0247 KOps/s $\color{#35bf28}+1.10\%$
test_step_mdp_speed[False-False-False-True-True] 68.1510μs 41.6003μs 24.0383 KOps/s 23.7740 KOps/s $\color{#35bf28}+1.11\%$
test_step_mdp_speed[False-False-False-True-False] 64.0400μs 28.9196μs 34.5786 KOps/s 34.7468 KOps/s $\color{#d91a1a}-0.48\%$
test_step_mdp_speed[False-False-False-False-True] 49.1110μs 27.6703μs 36.1399 KOps/s 36.3637 KOps/s $\color{#d91a1a}-0.62\%$
test_step_mdp_speed[False-False-False-False-False] 37.0200μs 18.2675μs 54.7420 KOps/s 54.7225 KOps/s $\color{#35bf28}+0.04\%$
test_values[generalized_advantage_estimate-True-True] 27.0439ms 24.9036ms 40.1548 Ops/s 41.0822 Ops/s $\color{#d91a1a}-2.26\%$
test_values[vec_generalized_advantage_estimate-True-True] 82.3481ms 3.2221ms 310.3539 Ops/s 307.2974 Ops/s $\color{#35bf28}+0.99\%$
test_values[td0_return_estimate-False-False] 93.3020μs 66.4387μs 15.0515 KOps/s 15.3372 KOps/s $\color{#d91a1a}-1.86\%$
test_values[td1_return_estimate-False-False] 55.6854ms 55.4132ms 18.0463 Ops/s 18.2754 Ops/s $\color{#d91a1a}-1.25\%$
test_values[vec_td1_return_estimate-False-False] 2.1001ms 1.7779ms 562.4638 Ops/s 566.4542 Ops/s $\color{#d91a1a}-0.70\%$
test_values[td_lambda_return_estimate-True-False] 96.0303ms 88.8608ms 11.2536 Ops/s 11.7808 Ops/s $\color{#d91a1a}-4.48\%$
test_values[vec_td_lambda_return_estimate-True-False] 2.1137ms 1.7766ms 562.8645 Ops/s 566.8676 Ops/s $\color{#d91a1a}-0.71\%$
test_gae_speed[generalized_advantage_estimate-False-1-512] 24.4621ms 24.0664ms 41.5517 Ops/s 42.2189 Ops/s $\color{#d91a1a}-1.58\%$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512] 0.9009ms 0.7153ms 1.3981 KOps/s 1.4179 KOps/s $\color{#d91a1a}-1.40\%$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512] 0.7307ms 0.6627ms 1.5090 KOps/s 1.5309 KOps/s $\color{#d91a1a}-1.43\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512] 1.4907ms 1.4632ms 683.4380 Ops/s 685.2496 Ops/s $\color{#d91a1a}-0.26\%$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512] 0.9576ms 0.6853ms 1.4592 KOps/s 1.4826 KOps/s $\color{#d91a1a}-1.58\%$
test_dqn_speed 1.8528ms 1.4465ms 691.3075 Ops/s 669.9737 Ops/s $\color{#35bf28}+3.18\%$
test_ddpg_speed 3.1767ms 2.7655ms 361.6029 Ops/s 363.6794 Ops/s $\color{#d91a1a}-0.57\%$
test_sac_speed 8.5929ms 8.1659ms 122.4604 Ops/s 123.5544 Ops/s $\color{#d91a1a}-0.89\%$
test_redq_speed 11.7540ms 10.5931ms 94.4007 Ops/s 95.3757 Ops/s $\color{#d91a1a}-1.02\%$
test_redq_deprec_speed 11.8680ms 11.3237ms 88.3100 Ops/s 89.1693 Ops/s $\color{#d91a1a}-0.96\%$
test_td3_speed 8.1904ms 8.0882ms 123.6370 Ops/s 124.2931 Ops/s $\color{#d91a1a}-0.53\%$
test_cql_speed 26.4281ms 25.6721ms 38.9528 Ops/s 38.9657 Ops/s $\color{#d91a1a}-0.03\%$
test_a2c_speed 5.7440ms 5.5435ms 180.3910 Ops/s 177.6930 Ops/s $\color{#35bf28}+1.52\%$
test_ppo_speed 6.0444ms 5.8701ms 170.3555 Ops/s 167.1658 Ops/s $\color{#35bf28}+1.91\%$
test_reinforce_speed 5.3566ms 4.5343ms 220.5424 Ops/s 220.4312 Ops/s $\color{#35bf28}+0.05\%$
test_iql_speed 0.1142s 21.3877ms 46.7558 Ops/s 50.5884 Ops/s $\textbf{\color{#d91a1a}-7.58\%}$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 3.1128ms 2.9146ms 343.1051 Ops/s 343.7329 Ops/s $\color{#d91a1a}-0.18\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 1.3394ms 0.5445ms 1.8365 KOps/s 1.8217 KOps/s $\color{#35bf28}+0.81\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 0.7320ms 0.5235ms 1.9104 KOps/s 1.9290 KOps/s $\color{#d91a1a}-0.96\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 3.1484ms 2.9177ms 342.7352 Ops/s 345.2346 Ops/s $\color{#d91a1a}-0.72\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 1.7446ms 0.5440ms 1.8381 KOps/s 1.5346 KOps/s $\textbf{\color{#35bf28}+19.77\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.7037ms 0.5148ms 1.9427 KOps/s 1.9564 KOps/s $\color{#d91a1a}-0.70\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000] 1.6565ms 1.4761ms 677.4494 Ops/s 653.5166 Ops/s $\color{#35bf28}+3.66\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000] 1.4921ms 1.3946ms 717.0398 Ops/s 680.5898 Ops/s $\textbf{\color{#35bf28}+5.36\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 3.1308ms 3.0387ms 329.0838 Ops/s 327.9275 Ops/s $\color{#35bf28}+0.35\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 1.2790ms 0.6697ms 1.4933 KOps/s 1.4843 KOps/s $\color{#35bf28}+0.61\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 0.8662ms 0.6497ms 1.5391 KOps/s 1.5194 KOps/s $\color{#35bf28}+1.30\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000] 3.0013ms 2.9008ms 344.7321 Ops/s 344.9155 Ops/s $\color{#d91a1a}-0.05\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000] 0.6692ms 0.5451ms 1.8345 KOps/s 1.8281 KOps/s $\color{#35bf28}+0.35\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000] 4.4281ms 0.5253ms 1.9036 KOps/s 1.9020 KOps/s $\color{#35bf28}+0.08\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000] 3.1169ms 2.9407ms 340.0538 Ops/s 339.1315 Ops/s $\color{#35bf28}+0.27\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000] 0.6577ms 0.5371ms 1.8619 KOps/s 1.8614 KOps/s $\color{#35bf28}+0.03\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000] 0.7231ms 0.5127ms 1.9504 KOps/s 1.9455 KOps/s $\color{#35bf28}+0.25\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000] 3.1592ms 3.0294ms 330.1005 Ops/s 329.0955 Ops/s $\color{#35bf28}+0.31\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000] 0.8353ms 0.6691ms 1.4945 KOps/s 1.4871 KOps/s $\color{#35bf28}+0.50\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000] 4.6174ms 0.6576ms 1.5207 KOps/s 1.5364 KOps/s $\color{#d91a1a}-1.02\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400] 0.1330s 7.3881ms 135.3525 Ops/s 136.0825 Ops/s $\color{#d91a1a}-0.54\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400] 18.8000ms 15.1991ms 65.7934 Ops/s 65.7658 Ops/s $\color{#35bf28}+0.04\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400] 2.4625ms 1.1079ms 902.5690 Ops/s 932.0249 Ops/s $\color{#d91a1a}-3.16\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400] 0.1158s 6.9998ms 142.8608 Ops/s 141.2281 Ops/s $\color{#35bf28}+1.16\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400] 17.4517ms 15.0994ms 66.2278 Ops/s 57.4777 Ops/s $\textbf{\color{#35bf28}+15.22\%}$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400] 2.2616ms 1.1164ms 895.7304 Ops/s 928.5206 Ops/s $\color{#d91a1a}-3.53\%$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400] 0.1193s 9.6955ms 103.1401 Ops/s 135.4568 Ops/s $\textbf{\color{#d91a1a}-23.86\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400] 17.9278ms 15.4757ms 64.6174 Ops/s 63.8866 Ops/s $\color{#35bf28}+1.14\%$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400] 2.7688ms 1.4653ms 682.4717 Ops/s 632.8394 Ops/s $\textbf{\color{#35bf28}+7.84\%}$

vmoens (Contributor, Author) commented Mar 21, 2024

I was looking into pre-computing the trajectory indices at write time.

If we're not caching values, the bottleneck is the nonzero call:

stop_idx = end.transpose(0, -1).nonzero()

This nonzero is called on end (something that roughly looks like a tensor of done states).
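To make the role of this call concrete, here is a minimal sketch in plain Python, with lists standing in for tensors, of how stop indices can be derived from an end mask. The helpers `end_mask_from_traj_ids` and `stop_indices` are hypothetical names and this is not TorchRL's actual implementation. It only assumes that a step is flagged as an end when the next step belongs to a different trajectory, or when it is the last stored step.

```python
def end_mask_from_traj_ids(traj_ids):
    """Flag step i as an end when the next step belongs to another trajectory."""
    n = len(traj_ids)
    return [i == n - 1 or traj_ids[i] != traj_ids[i + 1] for i in range(n)]


def stop_indices(end):
    """Plain-Python analogue of a 1D end.nonzero(): a full scan of the mask."""
    return [i for i, flag in enumerate(end) if flag]


# Three trajectories of lengths 3, 2 and 4 stored back to back.
traj_ids = [0, 0, 0, 1, 1, 2, 2, 2, 2]
end = end_mask_from_traj_ids(traj_ids)
stops = stop_indices(end)
# Each trajectory starts right after the previous one stops.
starts = [0] + [s + 1 for s in stops[:-1]]
lengths = [stop - start + 1 for start, stop in zip(starts, stops)]
print(stops, starts, lengths)
```

The scan visits every stored step, which is why its cost grows with the size of the buffer when nothing is cached.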

If we want to "cache" whatever we can between two updates, we should update the result of nonzero to avoid calling nonzero over the whole end tensor. That's easier said than done!
Look at this example:

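import torch
torch.manual_seed(2)

batch = 4
time = 10

# Random boolean "end" mask of shape (batch, time).
ends = torch.zeros(batch, time, dtype=torch.bool).bernoulli_(0.2)

# (row, col) indices of the True entries, in row-major order.
nz = ends.nonzero()
print("original non zero", nz)

# Overwrite columns 2 to 5 with a fresh random slice, as a write to the storage would.
ends_slice = torch.zeros(batch, 4, dtype=torch.bool).bernoulli_(0.2)
ends2 = ends.clone()
ends2[:, 2:6] = ends_slice
nz2 = ends2.nonzero()

# Indices computed on the slice only, shifted to absolute column positions.
nz_slice = ends_slice.nonzero()
nz_slice[:, 1] += 2
print("non zero from the slice", nz_slice)

print("updated non zero")
print(nz2)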

That will give you

original non zero tensor([[0, 1],
        [0, 6],
        [1, 0],
        [1, 7],
        [1, 9],
        [2, 5],
        [2, 7],
        [2, 8],
        [3, 0],
        [3, 1],
        [3, 9]])
non zero from the slice tensor([[0, 3],
        [2, 5],
        [3, 4]])
updated non zero
tensor([[0, 1],
        [0, 3],
        [0, 6],
        [1, 0],
        [1, 7],
        [1, 9],
        [2, 5],
        [2, 7],
        [2, 8],
        [3, 0],
        [3, 1],
        [3, 4],
        [3, 9]])

So, given the update we made, turning the first nonzero result into the second requires us to
(1) look for any entry whose index in the last column is between 2 and 6 (the slice that we're updating) and remove it
(2) insert at the right spot the new non-zero elements that we computed from the updated "ends_slice" tensor.

(1) is O(T) and, if implemented in Python, it basically amounts to scanning through the whole set of non-zero end signals and creating a new tensor out of it. In short: it will be expensive and tedious.
(2) is even harder because we will need to scan the tensor a second time and insert the values where needed, which means some complicated shifts of the existing indices.
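For illustration only, the two steps can be sketched in plain Python, using sorted lists of (row, col) tuples in place of the nonzero tensors and the index values printed in the example above. `update_nonzero` is a hypothetical helper, not something that exists in TorchRL. Note that both steps still walk over the full list of existing indices, which is the cost concern raised here.

```python
from heapq import merge


def update_nonzero(nz, nz_slice, start, stop):
    """Update sorted (row, col) indices after columns [start, stop) were rewritten.

    nz: row-major sorted indices of the old mask.
    nz_slice: indices of the new slice, with columns already shifted by `start`.
    """
    # Step (1): drop every old index whose column falls in the rewritten window.
    kept = [(r, c) for (r, c) in nz if not (start <= c < stop)]
    # Step (2): merge the new indices back in while preserving row-major order.
    return list(merge(kept, sorted(nz_slice)))


# Values copied from the printed output of the example above.
nz = [(0, 1), (0, 6), (1, 0), (1, 7), (1, 9), (2, 5),
      (2, 7), (2, 8), (3, 0), (3, 1), (3, 9)]
nz_slice = [(0, 3), (2, 5), (3, 4)]

updated = update_nonzero(nz, nz_slice, start=2, stop=6)
print(updated)
```

Tuples compare lexicographically, so the merged list keeps the same row-major ordering that nonzero produces on a 2D tensor.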

I don't think this is worth anyone's time so I'm a bit skeptical that we can get this working.

So for now I will not be looking at writing the start/end of trajectories at write time since I can't see a viable implementation. The one I outlined above will be slower than just calling nonzero() on the whole thing.

But there is an intermediate solution if you're doing more than one sample() per collection (which will be the case on many occasions): we could cache the values and erase the cache whenever we call extend. If you're not sharing your storage between two buffers (!), you don't need to re-compute the start and end signals as long as the content of the storage has not changed.
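A minimal sketch of that caching idea, in plain Python and under stated assumptions: `CachedStopIndex` is a hypothetical stand-in for a buffer and its sampler, a list of booleans stands in for the stored end signals, and the only invalidation trigger is this object's own `extend`. It is not how the sampler is implemented in the PR, just an illustration of "compute once, reuse until the next write".

```python
class CachedStopIndex:
    """Hypothetical helper: caches stop indices until the storage changes."""

    def __init__(self):
        self.end = []        # stands in for the stored end/done flags
        self._cache = None   # cached stop indices, None means "stale"
        self.n_scans = 0     # counts how often the full scan runs

    def extend(self, new_end_flags):
        self.end.extend(new_end_flags)
        self._cache = None   # the storage changed: invalidate the cache

    def stop_idx(self):
        if self._cache is None:
            self.n_scans += 1
            self._cache = [i for i, flag in enumerate(self.end) if flag]
        return self._cache


buf = CachedStopIndex()
buf.extend([False, False, True, False, True])
for _ in range(20):          # 20 samples after one collection: a single scan
    buf.stop_idx()
buf.extend([False, True])    # new data arrives, the cache is dropped
stops = buf.stop_idx()
print(buf.n_scans, stops)
```

The sketch also shows why shared storage is the delicate case: if another object wrote to the same underlying list without going through this `extend`, the cache would never be invalidated and the stop indices would be stale.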

vmoens (Contributor, Author) commented Mar 21, 2024

Some benchmarks:

import time

import torch
import tqdm
from tensordict import TensorDict

from torchrl.data import ReplayBuffer, SliceSampler, LazyTensorStorage

for compile in [True, False]:
    for cached in [True, False]:

        rb = ReplayBuffer(storage=LazyTensorStorage(1_000_000),
                          sampler=SliceSampler(num_slices=16, traj_key="traj_idx", compile=compile, cache_values=cached),
                          batch_size=256)

        tds = TensorDict({
            "traj_idx": torch.arange(1_000_000) // 100,
            "x": torch.randn(1_000_000),
            ("next", "y"): torch.randn(1_000_000),
        }, [1_000_000]).split(1000)


        def iter_over_tds():
            while True:
                yield from tds


        iterator = iter_over_tds()
        rb.extend(next(iterator))
        rb.sample()

        n_samples = 20
        t0 = time.time()
        for i, data in tqdm.tqdm(enumerate(iterator), total=5000, desc=f"compile={compile}, cache={cached}"):
            rb.extend(data)
            for j in range(n_samples):
                rb.sample()
            if i == 5000:
                break
        print(f"compile={compile}, cache={cached}, time={time.time() - t0: 4.4f}")

Results:

compile=True, cache=True: 100%|██████████| 5000/5000 [00:16<00:00, 307.65it/s]
compile=True, cache=True, time= 16.2532
compile=True, cache=False: 100%|██████████| 5000/5000 [01:39<00:00, 50.17it/s]
compile=True, cache=False, time= 99.6700
compile=False, cache=True: 100%|██████████| 5000/5000 [00:18<00:00, 265.16it/s]
compile=False, cache=True, time= 18.8568
compile=False, cache=False: 100%|██████████| 5000/5000 [06:26<00:00, 12.95it/s]
compile=False, cache=False, time= 386.0987

So caching gives a clear and very impressive gain (about 6.1x with compile=True and about 20.5x with compile=False, going by the timings above), and compile helps too (about 1.16x with cache=True and about 3.9x with cache=False).
The total improvement over the previous state (compile=False, cache=False) for this example is about 24x!
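As a sanity check, the speedup factors can be recomputed directly from the wall-clock times printed in the results. The snippet below only does that arithmetic. The variable names are illustrative, and the times are copied from the log above rather than re-measured.

```python
# Wall-clock times in seconds from the results above, keyed by (compile, cache).
times = {
    (True, True): 16.2532,
    (True, False): 99.6700,
    (False, True): 18.8568,
    (False, False): 386.0987,
}


def speedup(slow, fast):
    """Ratio of the slow configuration's time to the fast configuration's time."""
    return times[slow] / times[fast]


cache_gain_compiled = speedup((True, False), (True, True))
cache_gain_eager = speedup((False, False), (False, True))
compile_gain_cached = speedup((False, True), (True, True))
compile_gain_uncached = speedup((False, False), (True, False))
total_gain = speedup((False, False), (True, True))

for name, value in [
    ("cache, compile=True", cache_gain_compiled),
    ("cache, compile=False", cache_gain_eager),
    ("compile, cache=True", compile_gain_cached),
    ("compile, cache=False", compile_gain_uncached),
    ("total", total_gain),
]:
    print(f"{name}: {value:.2f}x")
```

These are single-run timings on one machine, so the ratios are best read as orders of magnitude rather than precise figures.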

vmoens marked this pull request as ready for review March 22, 2024 08:50
test/test_rb.py Outdated
torchrl/data/replay_buffers/samplers.py Outdated
vmoens merged commit cd540bf into main Mar 22, 2024
28 of 51 checks passed
vmoens deleted the faster-slicesampler branch March 22, 2024 08:56
vmoens added a commit that referenced this pull request Mar 25, 2024
2 participants