[Performance] Faster slice sampler #2031

vmoens · 2024-03-21T11:03:27Z

TODO:

Make sure this works with shared buffers (calling extend on one worker won't erase the cache on the other!)

cc @ahmed-touati @Cadene

pytorch-bot · 2024-03-21T11:03:30Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/rl/2031

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 20 Unrelated Failures

As of commit 4eb145b with merge base 660d827:

NEW FAILURES - The following jobs have failed:

Unit-tests on MacOS CPU / tests (3.11) / macos-job (gh)
The process '/usr/local/bin/git' failed with exit code 128
Unit-tests on MacOS CPU / tests (3.8) / macos-job (gh)
The process '/usr/local/bin/git' failed with exit code 128
Unit-tests on Windows / unittests-cpu / windows-job (gh)
The process 'C:\Program Files\Git\cmd\git.exe' failed with exit code 128

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

Examples Tests on Linux / tests (3.9, 12.1) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Generate documentation / build-docs (3.9, 12.1) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Habitat Tests on Linux / tests (3.9, 11.6) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Libs Tests on Linux / unittests-brax (3.9, 12.1) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Libs Tests on Linux / unittests-gym (3.9, 12.1) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Libs Tests on Linux / unittests-jumanji (3.9, 12.1) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Libs Tests on Linux / unittests-pettingzoo / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Libs Tests on Linux / unittests-robohive (3.9, 12.1) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Libs Tests on Linux / unittests-sklearn (3.9, 12.1) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Libs Tests on Linux / unittests-vmas (3.9, 12.1) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Lint / python-source-and-configs / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
RLHF Tests on Linux / unittests (3.9, 12.1) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Unit-tests on Linux / tests-cpu (3.10) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Unit-tests on Linux / tests-cpu (3.11) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Unit-tests on Linux / tests-cpu (3.8) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Unit-tests on Linux / tests-cpu (3.9) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Unit-tests on Linux / tests-gpu (3.10, 12.1) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Unit-tests on Linux / tests-olddeps (3.8, 11.6) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Unit-tests on Linux / tests-optdeps (3.10, 12.1) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128
Unit-tests on Linux / tests-stable-gpu (3.10, 11.8) / linux-job (gh)
The process '/usr/bin/git' failed with exit code 128

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2024-03-21T11:11:43Z

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of CPU Benchmark Tests

Total Benchmarks: 91. Improved: $\large\color{#35bf28}5$. Worsened: $\large\color{#d91a1a}12$.

Name	Max	Mean	Ops	Ops on Repo `HEAD`	Change
test_single	62.2475ms	54.7210ms	18.2745 Ops/s	17.8861 Ops/s	$\color{#35bf28}+2.17\%$
test_sync	30.4470ms	29.1686ms	34.2834 Ops/s	34.4402 Ops/s	$\color{#d91a1a}-0.46\%$
test_async	52.1722ms	26.8124ms	37.2962 Ops/s	36.5925 Ops/s	$\color{#35bf28}+1.92\%$
test_simple	0.3999s	0.3389s	2.9511 Ops/s	2.9356 Ops/s	$\color{#35bf28}+0.53\%$
test_transformed	0.5266s	0.4756s	2.1025 Ops/s	2.1166 Ops/s	$\color{#d91a1a}-0.67\%$
test_serial	1.2536s	1.2009s	0.8327 Ops/s	0.8302 Ops/s	$\color{#35bf28}+0.31\%$
test_parallel	1.0770s	1.0242s	0.9763 Ops/s	0.9725 Ops/s	$\color{#35bf28}+0.40\%$
test_step_mdp_speed[True-True-True-True-True]	0.1539ms	21.1090μs	47.3731 KOps/s	47.0284 KOps/s	$\color{#35bf28}+0.73\%$
test_step_mdp_speed[True-True-True-True-False]	39.3930μs	13.1737μs	75.9087 KOps/s	78.2412 KOps/s	$\color{#d91a1a}-2.98\%$
test_step_mdp_speed[True-True-True-False-True]	34.7440μs	12.4711μs	80.1855 KOps/s	80.6377 KOps/s	$\color{#d91a1a}-0.56\%$
test_step_mdp_speed[True-True-True-False-False]	29.3540μs	7.6922μs	130.0022 KOps/s	134.9625 KOps/s	$\color{#d91a1a}-3.68\%$
test_step_mdp_speed[True-True-False-True-True]	58.0770μs	22.5685μs	44.3096 KOps/s	44.3384 KOps/s	$\color{#d91a1a}-0.07\%$
test_step_mdp_speed[True-True-False-True-False]	56.4350μs	14.2736μs	70.0596 KOps/s	71.5608 KOps/s	$\color{#d91a1a}-2.10\%$
test_step_mdp_speed[True-True-False-False-True]	39.0520μs	13.6201μs	73.4209 KOps/s	73.5524 KOps/s	$\color{#d91a1a}-0.18\%$
test_step_mdp_speed[True-True-False-False-False]	35.5260μs	8.8072μs	113.5437 KOps/s	115.9136 KOps/s	$\color{#d91a1a}-2.04\%$
test_step_mdp_speed[True-False-True-True-True]	68.0760μs	23.8625μs	41.9068 KOps/s	41.4831 KOps/s	$\color{#35bf28}+1.02\%$
test_step_mdp_speed[True-False-True-True-False]	39.9640μs	15.7575μs	63.4619 KOps/s	64.6992 KOps/s	$\color{#d91a1a}-1.91\%$
test_step_mdp_speed[True-False-True-False-True]	83.3820μs	13.8522μs	72.1907 KOps/s	73.3301 KOps/s	$\color{#d91a1a}-1.55\%$
test_step_mdp_speed[True-False-True-False-False]	33.0720μs	8.7303μs	114.5431 KOps/s	112.7358 KOps/s	$\color{#35bf28}+1.60\%$
test_step_mdp_speed[True-False-False-True-True]	51.3560μs	24.7853μs	40.3465 KOps/s	39.7711 KOps/s	$\color{#35bf28}+1.45\%$
test_step_mdp_speed[True-False-False-True-False]	42.4190μs	16.8142μs	59.4736 KOps/s	61.2857 KOps/s	$\color{#d91a1a}-2.96\%$
test_step_mdp_speed[True-False-False-False-True]	44.7630μs	14.7085μs	67.9879 KOps/s	66.8635 KOps/s	$\color{#35bf28}+1.68\%$
test_step_mdp_speed[True-False-False-False-False]	0.1117ms	10.3019μs	97.0695 KOps/s	102.7007 KOps/s	$\textbf{\color{#d91a1a}-5.48\%}$
test_step_mdp_speed[False-True-True-True-True]	60.2830μs	23.9268μs	41.7942 KOps/s	41.7524 KOps/s	$\color{#35bf28}+0.10\%$
test_step_mdp_speed[False-True-True-True-False]	41.9580μs	15.8159μs	63.2274 KOps/s	64.8198 KOps/s	$\color{#d91a1a}-2.46\%$
test_step_mdp_speed[False-True-True-False-True]	58.1890μs	15.9554μs	62.6746 KOps/s	62.7421 KOps/s	$\color{#d91a1a}-0.11\%$
test_step_mdp_speed[False-True-True-False-False]	39.6340μs	9.9925μs	100.0752 KOps/s	101.4280 KOps/s	$\color{#d91a1a}-1.33\%$
test_step_mdp_speed[False-True-False-True-True]	47.6790μs	25.6084μs	39.0497 KOps/s	39.5810 KOps/s	$\color{#d91a1a}-1.34\%$
test_step_mdp_speed[False-True-False-True-False]	36.3670μs	17.1092μs	58.4482 KOps/s	61.2718 KOps/s	$\color{#d91a1a}-4.61\%$
test_step_mdp_speed[False-True-False-False-True]	37.9210μs	17.0988μs	58.4837 KOps/s	59.0860 KOps/s	$\color{#d91a1a}-1.02\%$
test_step_mdp_speed[False-True-False-False-False]	89.8370μs	11.3255μs	88.2964 KOps/s	91.3664 KOps/s	$\color{#d91a1a}-3.36\%$
test_step_mdp_speed[False-False-True-True-True]	0.1011ms	26.2544μs	38.0888 KOps/s	38.0274 KOps/s	$\color{#35bf28}+0.16\%$
test_step_mdp_speed[False-False-True-True-False]	50.6640μs	18.2495μs	54.7959 KOps/s	56.4244 KOps/s	$\color{#d91a1a}-2.89\%$
test_step_mdp_speed[False-False-True-False-True]	46.3460μs	17.0292μs	58.7228 KOps/s	58.6099 KOps/s	$\color{#35bf28}+0.19\%$
test_step_mdp_speed[False-False-True-False-False]	41.0770μs	11.2506μs	88.8838 KOps/s	90.6026 KOps/s	$\color{#d91a1a}-1.90\%$
test_step_mdp_speed[False-False-False-True-True]	61.3740μs	27.5870μs	36.2490 KOps/s	36.4507 KOps/s	$\color{#d91a1a}-0.55\%$
test_step_mdp_speed[False-False-False-True-False]	85.1390μs	19.5623μs	51.1188 KOps/s	53.1690 KOps/s	$\color{#d91a1a}-3.86\%$
test_step_mdp_speed[False-False-False-False-True]	56.4550μs	18.0391μs	55.4350 KOps/s	56.2428 KOps/s	$\color{#d91a1a}-1.44\%$
test_step_mdp_speed[False-False-False-False-False]	57.7910μs	12.1922μs	82.0196 KOps/s	83.2132 KOps/s	$\color{#d91a1a}-1.43\%$
test_values[generalized_advantage_estimate-True-True]	10.4406ms	9.5631ms	104.5683 Ops/s	108.6454 Ops/s	$\color{#d91a1a}-3.75\%$
test_values[vec_generalized_advantage_estimate-True-True]	39.9798ms	35.7996ms	27.9333 Ops/s	28.5532 Ops/s	$\color{#d91a1a}-2.17\%$
test_values[td0_return_estimate-False-False]	0.2272ms	0.1766ms	5.6627 KOps/s	5.7591 KOps/s	$\color{#d91a1a}-1.67\%$
test_values[td1_return_estimate-False-False]	23.2642ms	22.8395ms	43.7838 Ops/s	43.6765 Ops/s	$\color{#35bf28}+0.25\%$
test_values[vec_td1_return_estimate-False-False]	37.4280ms	35.8148ms	27.9214 Ops/s	28.6381 Ops/s	$\color{#d91a1a}-2.50\%$
test_values[td_lambda_return_estimate-True-False]	37.0803ms	33.6072ms	29.7556 Ops/s	30.4478 Ops/s	$\color{#d91a1a}-2.27\%$
test_values[vec_td_lambda_return_estimate-True-False]	37.1078ms	35.6954ms	28.0148 Ops/s	28.4505 Ops/s	$\color{#d91a1a}-1.53\%$
test_gae_speed[generalized_advantage_estimate-False-1-512]	8.2732ms	8.1605ms	122.5417 Ops/s	123.1707 Ops/s	$\color{#d91a1a}-0.51\%$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512]	2.4584ms	1.9294ms	518.2865 Ops/s	555.4366 Ops/s	$\textbf{\color{#d91a1a}-6.69\%}$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512]	0.5957ms	0.3507ms	2.8514 KOps/s	2.8455 KOps/s	$\color{#35bf28}+0.21\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512]	41.4098ms	39.1982ms	25.5114 Ops/s	22.0761 Ops/s	$\textbf{\color{#35bf28}+15.56\%}$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512]	3.5532ms	3.0319ms	329.8239 Ops/s	331.9050 Ops/s	$\color{#d91a1a}-0.63\%$
test_dqn_speed	6.9681ms	1.3551ms	737.9322 Ops/s	695.0187 Ops/s	$\textbf{\color{#35bf28}+6.17\%}$
test_ddpg_speed	2.9820ms	2.6830ms	372.7182 Ops/s	378.0220 Ops/s	$\color{#d91a1a}-1.40\%$
test_sac_speed	9.7313ms	8.2612ms	121.0474 Ops/s	123.5812 Ops/s	$\color{#d91a1a}-2.05\%$
test_redq_speed	14.4897ms	13.2607ms	75.4106 Ops/s	77.9603 Ops/s	$\color{#d91a1a}-3.27\%$
test_redq_deprec_speed	14.9728ms	13.3221ms	75.0634 Ops/s	77.9220 Ops/s	$\color{#d91a1a}-3.67\%$
test_td3_speed	16.1090ms	8.2068ms	121.8503 Ops/s	124.0594 Ops/s	$\color{#d91a1a}-1.78\%$
test_cql_speed	37.3579ms	36.1080ms	27.6947 Ops/s	27.8215 Ops/s	$\color{#d91a1a}-0.46\%$
test_a2c_speed	8.0638ms	7.3666ms	135.7482 Ops/s	137.3737 Ops/s	$\color{#d91a1a}-1.18\%$
test_ppo_speed	9.1612ms	7.6782ms	130.2393 Ops/s	133.1884 Ops/s	$\color{#d91a1a}-2.21\%$
test_reinforce_speed	7.3033ms	6.5535ms	152.5897 Ops/s	154.9115 Ops/s	$\color{#d91a1a}-1.50\%$
test_iql_speed	33.2971ms	32.3145ms	30.9458 Ops/s	30.9785 Ops/s	$\color{#d91a1a}-0.11\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000]	2.5375ms	2.2631ms	441.8754 Ops/s	479.0684 Ops/s	$\textbf{\color{#d91a1a}-7.76\%}$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000]	97.7832ms	0.5770ms	1.7332 KOps/s	2.0291 KOps/s	$\textbf{\color{#d91a1a}-14.58\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000]	0.6711ms	0.4747ms	2.1067 KOps/s	2.1287 KOps/s	$\color{#d91a1a}-1.03\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000]	3.3964ms	2.3289ms	429.3933 Ops/s	482.0713 Ops/s	$\textbf{\color{#d91a1a}-10.93\%}$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000]	1.1012ms	0.4927ms	2.0296 KOps/s	2.0597 KOps/s	$\color{#d91a1a}-1.46\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000]	0.6407ms	0.4687ms	2.1335 KOps/s	2.1795 KOps/s	$\color{#d91a1a}-2.11\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000]	1.7901ms	1.2096ms	826.7491 Ops/s	774.5112 Ops/s	$\textbf{\color{#35bf28}+6.74\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000]	1.6407ms	1.1417ms	875.9153 Ops/s	817.8853 Ops/s	$\textbf{\color{#35bf28}+7.10\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000]	3.4375ms	2.3867ms	418.9900 Ops/s	452.2994 Ops/s	$\textbf{\color{#d91a1a}-7.36\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000]	1.0785ms	0.6149ms	1.6263 KOps/s	1.6486 KOps/s	$\color{#d91a1a}-1.35\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000]	0.8939ms	0.5882ms	1.7002 KOps/s	1.7172 KOps/s	$\color{#d91a1a}-0.99\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000]	3.4560ms	2.2994ms	434.9000 Ops/s	483.7164 Ops/s	$\textbf{\color{#d91a1a}-10.09\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000]	0.6569ms	0.5022ms	1.9911 KOps/s	2.0223 KOps/s	$\color{#d91a1a}-1.54\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000]	3.9302ms	0.4814ms	2.0775 KOps/s	2.1249 KOps/s	$\color{#d91a1a}-2.23\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000]	3.4917ms	2.3635ms	423.1000 Ops/s	474.0430 Ops/s	$\textbf{\color{#d91a1a}-10.75\%}$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000]	0.6125ms	0.4918ms	2.0333 KOps/s	2.0453 KOps/s	$\color{#d91a1a}-0.59\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000]	0.6826ms	0.4712ms	2.1223 KOps/s	2.1749 KOps/s	$\color{#d91a1a}-2.42\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000]	3.4840ms	2.3940ms	417.7068 Ops/s	456.8850 Ops/s	$\textbf{\color{#d91a1a}-8.58\%}$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000]	1.1367ms	0.6168ms	1.6212 KOps/s	1.6450 KOps/s	$\color{#d91a1a}-1.45\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000]	0.7855ms	0.5879ms	1.7009 KOps/s	1.7209 KOps/s	$\color{#d91a1a}-1.17\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400]	0.1104s	7.6689ms	130.3973 Ops/s	134.7378 Ops/s	$\color{#d91a1a}-3.22\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400]	14.3786ms	11.9633ms	83.5891 Ops/s	83.8722 Ops/s	$\color{#d91a1a}-0.34\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400]	1.6076ms	1.0607ms	942.8066 Ops/s	958.8463 Ops/s	$\color{#d91a1a}-1.67\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400]	0.1010s	5.5350ms	180.6675 Ops/s	185.3464 Ops/s	$\color{#d91a1a}-2.52\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400]	14.2861ms	11.8917ms	84.0921 Ops/s	68.6599 Ops/s	$\textbf{\color{#35bf28}+22.48\%}$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400]	3.9687ms	1.1515ms	868.4556 Ops/s	964.6981 Ops/s	$\textbf{\color{#d91a1a}-9.98\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400]	0.1068s	8.0143ms	124.7775 Ops/s	173.6853 Ops/s	$\textbf{\color{#d91a1a}-28.16\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400]	15.0415ms	12.2902ms	81.3657 Ops/s	81.1138 Ops/s	$\color{#35bf28}+0.31\%$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400]	4.1109ms	1.4292ms	699.6817 Ops/s	744.2709 Ops/s	$\textbf{\color{#d91a1a}-5.99\%}$

github-actions · 2024-03-21T11:14:50Z

$\color{#D29922}\textsf{\Large⚠\kern{0.2cm}\normalsize Warning}$ Result of GPU Benchmark Tests

Total Benchmarks: 94. Improved: $\large\color{#35bf28}6$. Worsened: $\large\color{#d91a1a}2$.

Name	Max	Mean	Ops	Ops on Repo `HEAD`	Change
test_single	0.1079s	0.1043s	9.5893 Ops/s	9.1300 Ops/s	$\textbf{\color{#35bf28}+5.03\%}$
test_sync	91.7501ms	88.0483ms	11.3574 Ops/s	10.7516 Ops/s	$\textbf{\color{#35bf28}+5.63\%}$
test_async	0.1819s	90.5767ms	11.0404 Ops/s	11.1805 Ops/s	$\color{#d91a1a}-1.25\%$
test_single_pixels	0.1134s	0.1126s	8.8794 Ops/s	8.9232 Ops/s	$\color{#d91a1a}-0.49\%$
test_sync_pixels	76.0009ms	68.0311ms	14.6992 Ops/s	14.9715 Ops/s	$\color{#d91a1a}-1.82\%$
test_async_pixels	0.1007s	62.8028ms	15.9229 Ops/s	15.6392 Ops/s	$\color{#35bf28}+1.81\%$
test_simple	0.7485s	0.6780s	1.4748 Ops/s	1.4887 Ops/s	$\color{#d91a1a}-0.93\%$
test_transformed	0.9745s	0.8911s	1.1223 Ops/s	1.1274 Ops/s	$\color{#d91a1a}-0.46\%$
test_serial	2.1693s	2.1182s	0.4721 Ops/s	0.4799 Ops/s	$\color{#d91a1a}-1.62\%$
test_parallel	1.8928s	1.8162s	0.5506 Ops/s	0.5510 Ops/s	$\color{#d91a1a}-0.08\%$
test_step_mdp_speed[True-True-True-True-True]	84.7010μs	33.5420μs	29.8134 KOps/s	28.7615 KOps/s	$\color{#35bf28}+3.66\%$
test_step_mdp_speed[True-True-True-True-False]	45.2010μs	19.5138μs	51.2458 KOps/s	50.3475 KOps/s	$\color{#35bf28}+1.78\%$
test_step_mdp_speed[True-True-True-False-True]	33.3100μs	18.4865μs	54.0936 KOps/s	52.8192 KOps/s	$\color{#35bf28}+2.41\%$
test_step_mdp_speed[True-True-True-False-False]	27.3700μs	11.2500μs	88.8886 KOps/s	88.3643 KOps/s	$\color{#35bf28}+0.59\%$
test_step_mdp_speed[True-True-False-True-True]	95.0710μs	34.6286μs	28.8778 KOps/s	28.2402 KOps/s	$\color{#35bf28}+2.26\%$
test_step_mdp_speed[True-True-False-True-False]	40.2300μs	21.2517μs	47.0550 KOps/s	45.8746 KOps/s	$\color{#35bf28}+2.57\%$
test_step_mdp_speed[True-True-False-False-True]	48.3900μs	20.2790μs	49.3121 KOps/s	48.9478 KOps/s	$\color{#35bf28}+0.74\%$
test_step_mdp_speed[True-True-False-False-False]	31.1510μs	13.2582μs	75.4249 KOps/s	75.5476 KOps/s	$\color{#d91a1a}-0.16\%$
test_step_mdp_speed[True-False-True-True-True]	60.2510μs	36.8132μs	27.1642 KOps/s	26.6369 KOps/s	$\color{#35bf28}+1.98\%$
test_step_mdp_speed[True-False-True-True-False]	90.7920μs	23.3909μs	42.7517 KOps/s	42.2603 KOps/s	$\color{#35bf28}+1.16\%$
test_step_mdp_speed[True-False-True-False-True]	36.9910μs	20.1754μs	49.5652 KOps/s	48.8036 KOps/s	$\color{#35bf28}+1.56\%$
test_step_mdp_speed[True-False-True-False-False]	30.6100μs	13.1347μs	76.1342 KOps/s	75.1771 KOps/s	$\color{#35bf28}+1.27\%$
test_step_mdp_speed[True-False-False-True-True]	71.2310μs	37.8947μs	26.3889 KOps/s	25.6757 KOps/s	$\color{#35bf28}+2.78\%$
test_step_mdp_speed[True-False-False-True-False]	61.1910μs	24.8488μs	40.2434 KOps/s	39.5736 KOps/s	$\color{#35bf28}+1.69\%$
test_step_mdp_speed[True-False-False-False-True]	39.1610μs	21.9996μs	45.4554 KOps/s	45.1817 KOps/s	$\color{#35bf28}+0.61\%$
test_step_mdp_speed[True-False-False-False-False]	29.9510μs	14.7778μs	67.6691 KOps/s	66.7648 KOps/s	$\color{#35bf28}+1.35\%$
test_step_mdp_speed[False-True-True-True-True]	55.9110μs	35.9942μs	27.7823 KOps/s	26.7138 KOps/s	$\color{#35bf28}+4.00\%$
test_step_mdp_speed[False-True-True-True-False]	47.4600μs	23.2114μs	43.0822 KOps/s	42.3457 KOps/s	$\color{#35bf28}+1.74\%$
test_step_mdp_speed[False-True-True-False-True]	50.0700μs	24.1731μs	41.3683 KOps/s	41.4149 KOps/s	$\color{#d91a1a}-0.11\%$
test_step_mdp_speed[False-True-True-False-False]	31.9600μs	14.7400μs	67.8427 KOps/s	67.9126 KOps/s	$\color{#d91a1a}-0.10\%$
test_step_mdp_speed[False-True-False-True-True]	63.1210μs	38.8398μs	25.7468 KOps/s	25.0651 KOps/s	$\color{#35bf28}+2.72\%$
test_step_mdp_speed[False-True-False-True-False]	55.2310μs	25.0496μs	39.9208 KOps/s	39.6872 KOps/s	$\color{#35bf28}+0.59\%$
test_step_mdp_speed[False-True-False-False-True]	49.2910μs	25.9779μs	38.4943 KOps/s	38.5669 KOps/s	$\color{#d91a1a}-0.19\%$
test_step_mdp_speed[False-True-False-False-False]	45.4210μs	16.4559μs	60.7685 KOps/s	59.9086 KOps/s	$\color{#35bf28}+1.44\%$
test_step_mdp_speed[False-False-True-True-True]	57.6310μs	40.3140μs	24.8053 KOps/s	24.3771 KOps/s	$\color{#35bf28}+1.76\%$
test_step_mdp_speed[False-False-True-True-False]	43.2510μs	27.1628μs	36.8151 KOps/s	36.5516 KOps/s	$\color{#35bf28}+0.72\%$
test_step_mdp_speed[False-False-True-False-True]	93.5210μs	25.9010μs	38.6085 KOps/s	38.2611 KOps/s	$\color{#35bf28}+0.91\%$
test_step_mdp_speed[False-False-True-False-False]	35.4010μs	16.4781μs	60.6868 KOps/s	60.0247 KOps/s	$\color{#35bf28}+1.10\%$
test_step_mdp_speed[False-False-False-True-True]	68.1510μs	41.6003μs	24.0383 KOps/s	23.7740 KOps/s	$\color{#35bf28}+1.11\%$
test_step_mdp_speed[False-False-False-True-False]	64.0400μs	28.9196μs	34.5786 KOps/s	34.7468 KOps/s	$\color{#d91a1a}-0.48\%$
test_step_mdp_speed[False-False-False-False-True]	49.1110μs	27.6703μs	36.1399 KOps/s	36.3637 KOps/s	$\color{#d91a1a}-0.62\%$
test_step_mdp_speed[False-False-False-False-False]	37.0200μs	18.2675μs	54.7420 KOps/s	54.7225 KOps/s	$\color{#35bf28}+0.04\%$
test_values[generalized_advantage_estimate-True-True]	27.0439ms	24.9036ms	40.1548 Ops/s	41.0822 Ops/s	$\color{#d91a1a}-2.26\%$
test_values[vec_generalized_advantage_estimate-True-True]	82.3481ms	3.2221ms	310.3539 Ops/s	307.2974 Ops/s	$\color{#35bf28}+0.99\%$
test_values[td0_return_estimate-False-False]	93.3020μs	66.4387μs	15.0515 KOps/s	15.3372 KOps/s	$\color{#d91a1a}-1.86\%$
test_values[td1_return_estimate-False-False]	55.6854ms	55.4132ms	18.0463 Ops/s	18.2754 Ops/s	$\color{#d91a1a}-1.25\%$
test_values[vec_td1_return_estimate-False-False]	2.1001ms	1.7779ms	562.4638 Ops/s	566.4542 Ops/s	$\color{#d91a1a}-0.70\%$
test_values[td_lambda_return_estimate-True-False]	96.0303ms	88.8608ms	11.2536 Ops/s	11.7808 Ops/s	$\color{#d91a1a}-4.48\%$
test_values[vec_td_lambda_return_estimate-True-False]	2.1137ms	1.7766ms	562.8645 Ops/s	566.8676 Ops/s	$\color{#d91a1a}-0.71\%$
test_gae_speed[generalized_advantage_estimate-False-1-512]	24.4621ms	24.0664ms	41.5517 Ops/s	42.2189 Ops/s	$\color{#d91a1a}-1.58\%$
test_gae_speed[vec_generalized_advantage_estimate-True-1-512]	0.9009ms	0.7153ms	1.3981 KOps/s	1.4179 KOps/s	$\color{#d91a1a}-1.40\%$
test_gae_speed[vec_generalized_advantage_estimate-False-1-512]	0.7307ms	0.6627ms	1.5090 KOps/s	1.5309 KOps/s	$\color{#d91a1a}-1.43\%$
test_gae_speed[vec_generalized_advantage_estimate-True-32-512]	1.4907ms	1.4632ms	683.4380 Ops/s	685.2496 Ops/s	$\color{#d91a1a}-0.26\%$
test_gae_speed[vec_generalized_advantage_estimate-False-32-512]	0.9576ms	0.6853ms	1.4592 KOps/s	1.4826 KOps/s	$\color{#d91a1a}-1.58\%$
test_dqn_speed	1.8528ms	1.4465ms	691.3075 Ops/s	669.9737 Ops/s	$\color{#35bf28}+3.18\%$
test_ddpg_speed	3.1767ms	2.7655ms	361.6029 Ops/s	363.6794 Ops/s	$\color{#d91a1a}-0.57\%$
test_sac_speed	8.5929ms	8.1659ms	122.4604 Ops/s	123.5544 Ops/s	$\color{#d91a1a}-0.89\%$
test_redq_speed	11.7540ms	10.5931ms	94.4007 Ops/s	95.3757 Ops/s	$\color{#d91a1a}-1.02\%$
test_redq_deprec_speed	11.8680ms	11.3237ms	88.3100 Ops/s	89.1693 Ops/s	$\color{#d91a1a}-0.96\%$
test_td3_speed	8.1904ms	8.0882ms	123.6370 Ops/s	124.2931 Ops/s	$\color{#d91a1a}-0.53\%$
test_cql_speed	26.4281ms	25.6721ms	38.9528 Ops/s	38.9657 Ops/s	$\color{#d91a1a}-0.03\%$
test_a2c_speed	5.7440ms	5.5435ms	180.3910 Ops/s	177.6930 Ops/s	$\color{#35bf28}+1.52\%$
test_ppo_speed	6.0444ms	5.8701ms	170.3555 Ops/s	167.1658 Ops/s	$\color{#35bf28}+1.91\%$
test_reinforce_speed	5.3566ms	4.5343ms	220.5424 Ops/s	220.4312 Ops/s	$\color{#35bf28}+0.05\%$
test_iql_speed	0.1142s	21.3877ms	46.7558 Ops/s	50.5884 Ops/s	$\textbf{\color{#d91a1a}-7.58\%}$
test_rb_sample[TensorDictReplayBuffer-ListStorage-RandomSampler-4000]	3.1128ms	2.9146ms	343.1051 Ops/s	343.7329 Ops/s	$\color{#d91a1a}-0.18\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000]	1.3394ms	0.5445ms	1.8365 KOps/s	1.8217 KOps/s	$\color{#35bf28}+0.81\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000]	0.7320ms	0.5235ms	1.9104 KOps/s	1.9290 KOps/s	$\color{#d91a1a}-0.96\%$
test_rb_sample[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000]	3.1484ms	2.9177ms	342.7352 Ops/s	345.2346 Ops/s	$\color{#d91a1a}-0.72\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000]	1.7446ms	0.5440ms	1.8381 KOps/s	1.5346 KOps/s	$\textbf{\color{#35bf28}+19.77\%}$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000]	0.7037ms	0.5148ms	1.9427 KOps/s	1.9564 KOps/s	$\color{#d91a1a}-0.70\%$
test_rb_sample[TensorDictReplayBuffer-LazyMemmapStorage-sampler6-10000]	1.6565ms	1.4761ms	677.4494 Ops/s	653.5166 Ops/s	$\color{#35bf28}+3.66\%$
test_rb_sample[TensorDictReplayBuffer-LazyTensorStorage-sampler7-10000]	1.4921ms	1.3946ms	717.0398 Ops/s	680.5898 Ops/s	$\textbf{\color{#35bf28}+5.36\%}$
test_rb_sample[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000]	3.1308ms	3.0387ms	329.0838 Ops/s	327.9275 Ops/s	$\color{#35bf28}+0.35\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000]	1.2790ms	0.6697ms	1.4933 KOps/s	1.4843 KOps/s	$\color{#35bf28}+0.61\%$
test_rb_sample[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000]	0.8662ms	0.6497ms	1.5391 KOps/s	1.5194 KOps/s	$\color{#35bf28}+1.30\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-RandomSampler-4000]	3.0013ms	2.9008ms	344.7321 Ops/s	344.9155 Ops/s	$\color{#d91a1a}-0.05\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-10000]	0.6692ms	0.5451ms	1.8345 KOps/s	1.8281 KOps/s	$\color{#35bf28}+0.35\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-10000]	4.4281ms	0.5253ms	1.9036 KOps/s	1.9020 KOps/s	$\color{#35bf28}+0.08\%$
test_rb_iterate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-4000]	3.1169ms	2.9407ms	340.0538 Ops/s	339.1315 Ops/s	$\color{#35bf28}+0.27\%$
test_rb_iterate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-10000]	0.6577ms	0.5371ms	1.8619 KOps/s	1.8614 KOps/s	$\color{#35bf28}+0.03\%$
test_rb_iterate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-10000]	0.7231ms	0.5127ms	1.9504 KOps/s	1.9455 KOps/s	$\color{#35bf28}+0.25\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-ListStorage-None-4000]	3.1592ms	3.0294ms	330.1005 Ops/s	329.0955 Ops/s	$\color{#35bf28}+0.31\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-10000]	0.8353ms	0.6691ms	1.4945 KOps/s	1.4871 KOps/s	$\color{#35bf28}+0.50\%$
test_rb_iterate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-10000]	4.6174ms	0.6576ms	1.5207 KOps/s	1.5364 KOps/s	$\color{#d91a1a}-1.02\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-RandomSampler-400]	0.1330s	7.3881ms	135.3525 Ops/s	136.0825 Ops/s	$\color{#d91a1a}-0.54\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-RandomSampler-400]	18.8000ms	15.1991ms	65.7934 Ops/s	65.7658 Ops/s	$\color{#35bf28}+0.04\%$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-RandomSampler-400]	2.4625ms	1.1079ms	902.5690 Ops/s	932.0249 Ops/s	$\color{#d91a1a}-3.16\%$
test_rb_populate[TensorDictReplayBuffer-ListStorage-SamplerWithoutReplacement-400]	0.1158s	6.9998ms	142.8608 Ops/s	141.2281 Ops/s	$\color{#35bf28}+1.16\%$
test_rb_populate[TensorDictReplayBuffer-LazyMemmapStorage-SamplerWithoutReplacement-400]	17.4517ms	15.0994ms	66.2278 Ops/s	57.4777 Ops/s	$\textbf{\color{#35bf28}+15.22\%}$
test_rb_populate[TensorDictReplayBuffer-LazyTensorStorage-SamplerWithoutReplacement-400]	2.2616ms	1.1164ms	895.7304 Ops/s	928.5206 Ops/s	$\color{#d91a1a}-3.53\%$
test_rb_populate[TensorDictPrioritizedReplayBuffer-ListStorage-None-400]	0.1193s	9.6955ms	103.1401 Ops/s	135.4568 Ops/s	$\textbf{\color{#d91a1a}-23.86\%}$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyMemmapStorage-None-400]	17.9278ms	15.4757ms	64.6174 Ops/s	63.8866 Ops/s	$\color{#35bf28}+1.14\%$
test_rb_populate[TensorDictPrioritizedReplayBuffer-LazyTensorStorage-None-400]	2.7688ms	1.4653ms	682.4717 Ops/s	632.8394 Ops/s	$\textbf{\color{#35bf28}+7.84\%}$

vmoens · 2024-03-21T13:11:21Z

I was looking into pre-computing the trajectory indices at write time.

If we're not caching values, the bottleneck is the nonzero call:

rl/torchrl/data/replay_buffers/samplers.py

Line 839 in 660d827

stop_idx = end.transpose(0, -1).nonzero()

This nonzero is called on the end tensor (something that roughly looks like a tensor of done states).
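
To make the shape of that call concrete, here is a minimal sketch that is not taken from the TorchRL source. It assumes a boolean `end` tensor laid out as [time, batch] (a hypothetical layout chosen only for illustration) and shows that transposing before `nonzero()` groups the stop indices by batch element, because `nonzero()` returns indices in row-major order.

```python
import torch

# Hypothetical end-of-trajectory flags, assumed layout: [time, batch].
end = torch.tensor(
    [
        [False, False],
        [True, False],
        [False, True],
        [True, True],
    ]
)

# Same call shape as the quoted line: after the transpose, each row of the
# result is a (batch, time) pair, and rows are grouped by batch element.
stop_idx = end.transpose(0, -1).nonzero()
print(stop_idx.tolist())
```

The call has to visit every element of `end`, so its cost grows with the size of the storage when nothing is cached.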

If we want to "cache" whatever we can between two updates, we should update the result of nonzero to avoid calling nonzero over the whole end tensor. That's easier said than done!
Look at this example:

import torch
torch.manual_seed(2)

batch = 4
time = 10

ends = torch.zeros(batch, time, dtype=torch.bool).bernoulli_(0.2)

nz = ends.nonzero()
print("original non zero", nz)

ends_slice = torch.zeros(batch, 4, dtype=torch.bool).bernoulli_(0.2)
ends2 = ends.clone()
ends2[:, 2:6] = ends_slice
nz2 = ends2.nonzero()

nz_slice = ends_slice.nonzero()
nz_slice[:, 1] += 2
print("non zero from the slice", nz_slice)

print("updated non zero")
print(nz2)

That will give you

original non zero tensor([[0, 1],
        [0, 6],
        [1, 0],
        [1, 7],
        [1, 9],
        [2, 5],
        [2, 7],
        [2, 8],
        [3, 0],
        [3, 1],
        [3, 9]])
non zero from the slice tensor([[0, 3],
        [2, 5],
        [3, 4]])
updated non zero
tensor([[0, 1],
        [0, 3],
        [0, 6],
        [1, 0],
        [1, 7],
        [1, 9],
        [2, 5],
        [2, 7],
        [2, 8],
        [3, 0],
        [3, 1],
        [3, 4],
        [3, 9]])

So, to turn the first nonzero result into the second given the update we made, we need to
(1) find every entry whose index in the last column falls inside the slice we're updating (2:6, i.e. columns 2 to 5) and remove it;
(2) insert, at the right spot, the new non-zero elements that we computed from the updated "ends_slice" tensor.

(1) is O(T) and, if implemented in Python, it will basically amount to scanning through the whole set of non-zero end signals and creating a new tensor out of it. In short: it will be expensive and tedious.
(2) is even harder because we will need to scan the tensor a second time and insert the values where needed, which means some complicated shifts of the existing indices.
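
For reference, the two steps can be sketched in vectorized form as below. This is only an illustration under assumptions: `update_nonzero` and its arguments are hypothetical names, not TorchRL API, and the function only handles a 2-D tensor whose columns `[start, stop)` were overwritten. It still visits every cached index and adds a sort, which is in line with the cost concern raised here.

```python
import torch


def update_nonzero(nz, new_slice, start, stop, n_cols):
    """Refresh cached 2-D nonzero indices after columns [start, stop) changed.

    Hypothetical helper for illustration only.
    """
    # Step (1): drop cached indices whose column lies in the overwritten window.
    keep = (nz[:, 1] < start) | (nz[:, 1] >= stop)
    kept = nz[keep]
    # Step (2): nonzero indices of the new window, shifted to global columns.
    fresh = new_slice.nonzero()
    fresh[:, 1] += start
    merged = torch.cat([kept, fresh], dim=0)
    # Restore the row-major order that Tensor.nonzero() produces.
    order = (merged[:, 0] * n_cols + merged[:, 1]).argsort()
    return merged[order]


torch.manual_seed(0)
ends = torch.rand(4, 10) < 0.2
ends_slice = torch.rand(4, 4) < 0.2
ends2 = ends.clone()
ends2[:, 2:6] = ends_slice

updated = update_nonzero(ends.nonzero(), ends_slice, 2, 6, n_cols=10)
print(torch.equal(updated, ends2.nonzero()))
```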

I don't think this is worth anyone's time so I'm a bit skeptical that we can get this working.

So for now I will not be looking at writing the start/end of trajectories at write time since I can't see a viable implementation. The one I outlined above will be slower than just calling nonzero() on the whole thing.

But there is an intermediate solution if you're doing more than one sample() per collection (which will be the case on many occasions): we could cache the values and erase the cache whenever we call extend. If you're not sharing your storage between two buffers (!), you don't need to recompute the start and end signals as long as the content of the storage has not changed.
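
The caching idea can be sketched as follows. This is a minimal illustration, not the implementation in this PR: `CachedStopIndex` and all of its attributes are hypothetical names, and the storage is reduced to a flat circular buffer of end flags.

```python
import torch


class CachedStopIndex:
    """Hypothetical helper: caches nonzero() of the end flags until the next write."""

    def __init__(self, capacity: int):
        self.ends = torch.zeros(capacity, dtype=torch.bool)
        self.cursor = 0
        self._cache = None
        self.n_recomputed = 0  # counts how often nonzero() actually runs

    def extend(self, new_ends: torch.Tensor) -> None:
        n = new_ends.numel()
        # Circular write, then move the cursor.
        idx = (self.cursor + torch.arange(n)) % self.ends.numel()
        self.ends[idx] = new_ends
        self.cursor = (self.cursor + n) % self.ends.numel()
        # The storage changed, so the cached indices are stale.
        self._cache = None

    def stop_idx(self) -> torch.Tensor:
        if self._cache is None:
            self._cache = self.ends.nonzero()
            self.n_recomputed += 1
        return self._cache


buf = CachedStopIndex(capacity=8)
buf.extend(torch.tensor([False, True, False, True]))
for _ in range(20):
    buf.stop_idx()
print(buf.n_recomputed)  # prints 1: a single nonzero() call serves 20 lookups
```

The invalidation only happens in the object whose `extend` was called, which is why a storage shared by two buffers would leave the other buffer's cache stale.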

vmoens · 2024-03-21T15:25:32Z

Some benchmarks:

import time

import torch
import tqdm
from tensordict import TensorDict

from torchrl.data import ReplayBuffer, SliceSampler, LazyTensorStorage

for compile in [True, False]:
    for cached in [True, False]:

        rb = ReplayBuffer(storage=LazyTensorStorage(1_000_000),
                          sampler=SliceSampler(num_slices=16, traj_key="traj_idx", compile=compile, cache_values=cached),
                          batch_size=256)

        tds = TensorDict({
            "traj_idx": torch.arange(1_000_000) // 100,
            "x": torch.randn(1_000_000),
            ("next", "y"): torch.randn(1_000_000),
        }, [1_000_000]).split(1000)


        def iter_over_tds():
            while True:
                yield from tds


        iterator = iter_over_tds()
        rb.extend(next(iterator))
        rb.sample()

        n_samples = 20
        t0 = time.time()
        for i, data in tqdm.tqdm(enumerate(iterator), total=5000, desc=f"compile={compile}, cache={cached}"):
            rb.extend(data)
            for j in range(n_samples):
                rb.sample()
            if i == 5000:
                break
        print(f"compile={compile}, cache={cached}, time={time.time() - t0: 4.4f}")

Results:

compile=True, cache=True: 100%|██████████| 5000/5000 [00:16<00:00, 307.65it/s]
compile=True, cache=True, time= 16.2532
compile=True, cache=False: 100%|██████████| 5000/5000 [01:39<00:00, 50.17it/s]
compile=True, cache=False, time= 99.6700
compile=False, cache=True: 100%|██████████| 5000/5000 [00:18<00:00, 265.16it/s]
compile=False, cache=True, time= 18.8568
compile=False, cache=False: 100%|██████████| 5000/5000 [06:26<00:00, 12.95it/s]
compile=False, cache=False, time= 386.0987

So caching gives a clear and very impressive gain (compile=True => about 6.1x, compile=False => about 20.5x), and compile helps too (cache=True => about 1.16x, cache=False => about 3.9x).
The total improvement over the previous state (compile=False, cache=False) for this example is about 23.8x, i.e. roughly 24x faster!
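
As a quick check, the ratios can be recomputed directly from the wall-clock times in the log above. The snippet below is only arithmetic on those four numbers; `times` and `speedup` are names introduced here for illustration, and the results may differ slightly from rounded figures quoted in the discussion.

```python
# Wall-clock seconds copied from the log above, keyed by (compile, cache).
times = {
    (True, True): 16.2532,
    (True, False): 99.6700,
    (False, True): 18.8568,
    (False, False): 386.0987,
}


def speedup(slow, fast):
    """Ratio of the slow configuration's time to the fast one's."""
    return times[slow] / times[fast]


comparisons = [
    ("cache gain, compile=True", (True, False), (True, True)),
    ("cache gain, compile=False", (False, False), (False, True)),
    ("compile gain, cache=True", (False, True), (True, True)),
    ("compile gain, cache=False", (False, False), (True, False)),
    ("total, both enabled vs. neither", (False, False), (True, True)),
]
for label, slow, fast in comparisons:
    print(f"{label}: {speedup(slow, fast):.2f}x")
```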

test/test_rb.py

torchrl/data/replay_buffers/samplers.py

(cherry picked from commit cd540bf)

init

530d7dc

facebook-github-bot added the CLA Signed label Mar 21, 2024

vmoens added the performance label Mar 21, 2024

vmoens added 4 commits March 21, 2024 15:37

amend

bd398eb

amend

a174e3c

amend

cf22af4

amend

f48a43d

vmoens marked this pull request as ready for review March 22, 2024 08:50

vmoens commented Mar 22, 2024

test/test_rb.py (outdated)

torchrl/data/replay_buffers/samplers.py (outdated)

Apply suggestions from code review

4eb145b

vmoens merged commit cd540bf into main Mar 22, 2024
28 of 51 checks passed

vmoens deleted the faster-slicesampler branch March 22, 2024 08:56

vmoens added a commit that referenced this pull request Mar 25, 2024

[Performance] Faster slice sampler (#2031)

d74fc05

(cherry picked from commit cd540bf)
