From eeefdedb2ceee3ae1abfe88896cae3b8b62b4c05 Mon Sep 17 00:00:00 2001
From: whj
Date: Mon, 26 Aug 2024 14:04:04 +0800
Subject: [PATCH] Update irs.md

Co-Authored-By: Roger Creus <31919499+roger-creus@users.noreply.github.com>
---
 docs/tutorials/mt/irs.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/docs/tutorials/mt/irs.md b/docs/tutorials/mt/irs.md
index d42020e0..df618b70 100644
--- a/docs/tutorials/mt/irs.md
+++ b/docs/tutorials/mt/irs.md
@@ -33,9 +33,9 @@ The workflow of RLeXplore is illustrated as follows:
 
 Commonly, at each time step, the agent receives observations from the environment and predicts actions. The environment then executes the actions and returns feedback to the agent, which consists of a next observation, a reward, and a terminal signal. During the data collection process, the ***.watch()*** function is used to monitor the agent-environment interactions. For instance, E3B [1] updates an estimate of an ellipsoid in an embedding space after observing every state. At the end of the data collection rollouts, ***.compute()*** computes the corresponding intrinsic rewards. Note that ***.compute()*** is only called once per rollout using batched operations, which makes RLeXplore a highly efficient framework. Additionally, RLeXplore provides several utilities for reward and observation normalization. Finally, the ***.update()*** function is called immediately after ***.compute()*** to update the reward module if necessary (e.g., train the forward dynamics models in Disagreement [2] or the predictor network in RND [3]). All operations are subject to the standard workflow of the Gymnasium API.
 ```
-[1] \cite{henaff2022exploration}
-[2] \cite{pathak2019self}
-[3] \cite{burda2018exploration}
+[1] Henaff M, Raileanu R, Jiang M, et al. Exploration via elliptical episodic bonuses[J]. Advances in Neural Information Processing Systems, 2022, 35: 37631-37646.
+[2] Pathak D, Agrawal P, Efros A A, et al. Curiosity-driven exploration by self-supervised prediction[C]//International conference on machine learning. PMLR, 2017: 2778-2787.
+[3] Burda Y, Edwards H, Storkey A, et al. Exploration by random network distillation[C]//Seventh International Conference on Learning Representations. 2019: 1-17.
 ```
 
 ## Workflow
@@ -227,11 +227,11 @@ Run `example.py` and you'll see the intrinsic reward module is invoked:
 ```
 
 ## RLeXplore with Stable-baselines, CleanRL, ...
-**RLeXplore** can be seamlessly integrated with existing RL libraries, such as Stable-Baselines3, CleanRL, etc. We provide specific examples [GitHub](https://github.com/RLE-Foundation/RLeXplore#tutorials).
+**RLeXplore** can be seamlessly integrated with existing RL libraries, such as Stable-Baselines3, CleanRL, etc. We provide specific examples on [GitHub](https://github.com/RLE-Foundation/RLeXplore#tutorials).
 
 ## Custom Intrinsic Reward
 Since **RLeXplore** provides a standardized workflow and modular components of intrinsic rewards, which facilitates the creation, modification, and testing of new ideas. See the example of creating custom intrinsic rewards on [GitHub](https://github.com/RLE-Foundation/RLeXplore/blob/main/5%20custom_intrinsic_reward.ipynb).
 
 ## Benchmark Results
 
-We conducted extensive experiments to evaluate the performance of **RLeXplore** on multiple well-recognized exploration tasks, such as *Super Mario Bros*, *MiniGrid*, etc. Please view our [Wandb](https://wandb.ai/yuanmingqi/RLeXplore/reportlist) space for the benchmark results. 
\ No newline at end of file
+We conducted extensive experiments to evaluate the performance of **RLeXplore** on multiple well-recognized exploration tasks, such as *SuperMarioBros*, *MiniGrid*, etc. Please view our [Wandb](https://wandb.ai/yuanmingqi/RLeXplore/reportlist) space for the benchmark results.
\ No newline at end of file
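
The updated paragraph in the first hunk describes the ***.watch()*** / ***.compute()*** / ***.update()*** workflow only in prose, so a minimal sketch of that loop follows. The `rllte.xplore.reward` import path, the `RND(envs, device=...)` constructor, and the exact keyword arguments passed to the three calls are assumptions made for illustration rather than the library's confirmed API; only the overall flow (watch every transition during collection, compute intrinsic rewards once per rollout on batched samples, then update the module) follows the documentation text above.

```python
# A minimal, illustrative sketch of the watch/compute/update workflow described
# in the patched section. The import path, the RND constructor, and the keyword
# signatures below are assumptions for illustration, not the confirmed API;
# see the RLeXplore repository for the official examples.
import gymnasium as gym
import torch

from rllte.xplore.reward import RND  # assumed import path for the reward module

n_envs, n_steps = 4, 128
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(n_envs)])
device = "cuda" if torch.cuda.is_available() else "cpu"
irs = RND(envs, device=device)  # assumed constructor signature

obs, _ = envs.reset()
keys = ("observations", "actions", "rewards", "terminateds", "truncateds", "next_observations")
buffer = {key: [] for key in keys}

for _ in range(n_steps):  # one data-collection rollout
    actions = envs.action_space.sample()  # stand-in for the agent's policy
    next_obs, rewards, terminateds, truncateds, _ = envs.step(actions)

    step_data = dict(zip(keys, (obs, actions, rewards, terminateds, truncateds, next_obs)))
    tensors = {key: torch.as_tensor(value, device=device) for key, value in step_data.items()}

    # .watch() monitors every agent-environment interaction, e.g. E3B updates
    # its ellipsoid estimate here (keyword names are assumed).
    irs.watch(**tensors)

    for key, value in tensors.items():
        buffer[key].append(value)
    obs = next_obs

# .compute() is called once per rollout on the batched samples, and .update()
# refits the reward module (e.g. the RND predictor network) right afterwards.
samples = {key: torch.stack(values) for key, values in buffer.items()}
intrinsic_rewards = irs.compute(samples=samples)  # assumed shape: (n_steps, n_envs)
irs.update(samples=samples)
```

In a full training loop the resulting intrinsic rewards would typically be scaled and added to the extrinsic rewards before the policy update, which is the usual way such a module is wired into libraries like Stable-Baselines3 or CleanRL.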