Multi-Agent RLlib (MARLlib) is a MARL benchmark based on Ray and one of its toolkits, RLlib. It provides the MARL research community with a unified platform for developing and evaluating new ideas in various multi-agent environments. MARLlib has four core features:
- It collects most of the existing MARL algorithms widely acknowledged by the community and unifies them under one framework.
- It provides a solution that lets different multi-agent environments interact with agents through the same interface.
- It guarantees high efficiency in both the training and sampling processes.
- It provides trained results, including learning curves and pretrained models for each task-algorithm combination, with fine-tuned hyper-parameters to guarantee credibility.
Project Website: https://sites.google.com/view/marllib
We collected most of the existing multi-agent environments and multi-agent reinforcement learning algorithms and unified them under one framework based on Ray's RLlib to boost MARL research.
The implemented MARL baselines cover independent learning (IQL, A2C, DDPG, TRPO, PPO), centralized critic learning (COMA, MADDPG, MAPPO, HATRPO), and value decomposition (QMIX, VDN, FACMAC, VDA2C).
Popular environments like SMAC, MaMujoco, and Google Research Football are provided with a unified interface.
The algorithm code and environment code are fully separated. Changing the environment needs no modification on the algorithm side and vice versa.
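To give a concrete sense of what this unified interface looks like, here is a minimal sketch of an RLlib-style multi-agent environment. The class and its contents are a toy example, not an actual MARLlib environment file; MARLlib's own wrappers in envs/base_env follow the same dict-keyed reset/step pattern but add task-specific details.

```python
# Minimal sketch of the dict-based multi-agent interface RLlib expects
# (ray.rllib.env.MultiAgentEnv). TwoAgentToyEnv is a made-up example.
import numpy as np
from gym.spaces import Box, Discrete
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class TwoAgentToyEnv(MultiAgentEnv):
    """Hypothetical two-agent environment used only for illustration."""

    def __init__(self, env_config=None):
        self.agents = ["agent_0", "agent_1"]
        self.observation_space = Box(low=-1.0, high=1.0, shape=(4,), dtype=np.float32)
        self.action_space = Discrete(2)
        self._step_count = 0

    def reset(self):
        self._step_count = 0
        # Observations are returned as a dict keyed by agent id.
        return {a: self.observation_space.sample() for a in self.agents}

    def step(self, action_dict):
        self._step_count += 1
        done = self._step_count >= 10
        obs = {a: self.observation_space.sample() for a in self.agents}
        rewards = {a: 1.0 for a in action_dict}
        dones = {a: done for a in self.agents}
        dones["__all__"] = done  # RLlib's episode-termination flag
        infos = {a: {} for a in self.agents}
        return obs, rewards, dones, infos
```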
Here we provide a table comparing MARLlib with existing benchmarks.
Benchmark | Learning Mode | Available Env | Algorithm Type | Algorithm Number | Continuous Control | Asynchronous Interact | Distributed Training | Framework |
---|---|---|---|---|---|---|---|---|
PyMARL | CP | 1 | VD | 5 | | | | * |
PyMARL2 | CP | 1 | VD | 12 | | | | PyMARL |
MARL-Algorithms | CP | 1 | VD+Comm | 9 | | | | * |
EPyMARL | CP | 4 | IL+VD+CC | 10 | | | | PyMARL |
Marlbenchmark | CP+CL | 4 | VD+CC | 5 | ✔️ | | | pytorch-a2c-ppo-acktr-gail |
MAlib | SP | 8 | SP | 9 | ✔️ | | | * |
MARLlib | CP+CL+CM+MI | 10 | IL+VD+CC | 18 | ✔️ | ✔️ | ✔️ | Ray/RLlib |
CP, CL, CM, and MI represent the cooperative, collaborative, competitive, and mixed task learning modes. IL, VD, and CC represent the independent learning, value decomposition, and centralized critic categorizations. SP represents self-play. Comm represents communication-based learning. An asterisk denotes that the benchmark uses its own framework.
The RLlib tutorial can be found at this link, and quick examples can be found at this link. These will help you dive into RLlib quickly.
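If RLlib itself is new to you, a minimal single-agent run like the sketch below (plain RLlib under Ray 1.8, not MARLlib-specific) illustrates the tune-driven training loop that MARLlib builds on.

```python
# Minimal plain-RLlib example (single-agent CartPole) showing the
# tune-driven training loop; MARLlib wraps this machinery for MARL.
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",                         # built-in RLlib trainer name
    config={
        "env": "CartPole-v0",      # any registered gym environment
        "framework": "torch",
        "num_workers": 1,          # parallel rollout workers
    },
    stop={"episode_reward_mean": 150},
)
ray.shutdown()
```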
We hope MARLlib can benefit everyone interested in RL/MARL.
Most of the popular environments in MARL research have been incorporated into this benchmark:
Env Name | Learning Mode | Observability | Action Space | Observations |
---|---|---|---|---|
LBF | Mixed | Both | Discrete | Discrete |
RWARE | Collaborative | Partial | Discrete | Discrete |
MPE | Mixed | Both | Both | Continuous |
SMAC | Cooperative | Partial | Discrete | Continuous |
MetaDrive | Collaborative | Partial | Continuous | Continuous |
MAgent | Mixed | Partial | Discrete | Discrete |
Pommerman | Mixed | Both | Discrete | Discrete |
MaMujoco | Cooperative | Partial | Continuous | Continuous |
GRF | Collaborative | Full | Discrete | Continuous |
Hanabi | Cooperative | Partial | Discrete | Discrete |
Each environment has a readme file that serves as the instruction manual for the task, covering environment settings, installation, and important notes.
We provide three types of MARL algorithms as our baselines:
Independent Learning: IQL, DDPG, PG, A2C, TRPO, PPO
Centralized Critic: COMA, MADDPG, MAAC, MAPPO, MATRPO, HATRPO, HAPPO
Value Decomposition: VDN, QMIX, FACMAC, VDAC, VDPPO
Here is a chart describing the characteristics of each algorithm:
Algorithm | Supported Task Mode | Needs Global State | Action Space | Learning Mode | Type |
---|---|---|---|---|---|
IQL* | Mixed | No | Discrete | Independent Learning | Off Policy |
PG | Mixed | No | Both | Independent Learning | On Policy |
A2C | Mixed | No | Both | Independent Learning | On Policy |
DDPG | Mixed | No | Continuous | Independent Learning | Off Policy |
TRPO | Mixed | No | Both | Independent Learning | On Policy |
PPO | Mixed | No | Both | Independent Learning | On Policy |
COMA | Mixed | Yes | Both | Centralized Critic | On Policy |
MADDPG | Mixed | Yes | Continuous | Centralized Critic | Off Policy |
MAA2C* | Mixed | Yes | Both | Centralized Critic | On Policy |
MATRPO* | Mixed | Yes | Both | Centralized Critic | On Policy |
MAPPO | Mixed | Yes | Both | Centralized Critic | On Policy |
HATRPO | Cooperative | Yes | Both | Centralized Critic | On Policy |
HAPPO | Cooperative | Yes | Both | Centralized Critic | On Policy |
VDN | Cooperative | No | Discrete | Value Decomposition | Off Policy |
QMIX | Cooperative | Yes | Discrete | Value Decomposition | Off Policy |
FACMAC | Cooperative | Yes | Continuous | Value Decomposition | Off Policy |
VDAC | Cooperative | Yes | Both | Value Decomposition | On Policy |
VDPPO* | Cooperative | Yes | Both | Value Decomposition | On Policy |
IQL is the multi-agent version of Q-learning. MAA2C and MATRPO are the centralized critic versions of A2C and TRPO. VDPPO is the value decomposition version of PPO.
Current task & available-algorithm mapping: Y = available, N = not suitable, P = partially available in some scenarios. (Note: in our code, independent algorithms may not carry the I prefix; for instance, IPPO appears simply as PPO.)
Env w Algorithm | IQL | PG | A2C | DDPG | TRPO | PPO | COMA | MADDPG | MAAC | MATRPO | MAPPO | HATRPO | HAPPO | VDN | QMIX | FACMAC | VDAC | VDPPO |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
LBF | Y | Y | Y | N | Y | Y | Y | N | Y | Y | Y | Y | Y | P | P | P | P | P |
RWARE | Y | Y | Y | N | Y | Y | Y | N | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y |
MPE | P | Y | Y | P | Y | Y | P | P | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y |
SMAC | Y | Y | Y | N | Y | Y | Y | N | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y |
MetaDrive | N | Y | Y | Y | Y | Y | N | N | N | N | N | N | N | N | N | N | N | N |
MAgent | Y | Y | Y | N | Y | Y | Y | N | Y | Y | Y | Y | Y | N | N | N | N | N |
Pommerman | Y | Y | Y | N | Y | Y | P | N | Y | Y | Y | Y | Y | P | P | P | P | P |
MaMujoco | N | Y | Y | Y | Y | Y | N | Y | Y | Y | Y | Y | Y | N | N | Y | Y | Y |
GRF | Y | Y | Y | N | Y | Y | Y | N | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y |
Hanabi | Y | Y | Y | N | Y | Y | Y | N | Y | Y | Y | Y | Y | N | N | N | N | N |
You can find a comprehensive list of existing MARL algorithms in different environments here.
Install Ray
pip install ray==1.8.0 # version sensitive
Add the MARLlib patch
cd patch
python add_patch.py
Answer Y to replace the source-package code.
Attention: the above is the common installation. Each environment needs extra dependencies. Please read the installation instructions in envs/base_env/install.
python marl/main.py --algo_config=MAPPO [--finetuned] --env-config=smac with env_args.map_name=3m
--finetuned is optional; it forces the use of the fine-tuned hyper-parameters.
We provide an introduction to the code directory to help you get familiar with the codebase:
- Top-level directory structure: see image/code-MARLlib.png
- MARL directory structure: see image/code-MARL.png
- ENVS directory structure: see image/code-ENVS.png
MARLlib makes it easy to incorporate a new environment. Besides the ten we have already implemented, it supports almost all kinds of MARL environments. Before contributing a new environment, you should know:
Things you ought to cover:
- provide a new environment interface Python file, following the style of MARLlib/envs/base_env (see the sketch below)
- provide a corresponding config YAML file, following the style of MARLlib/envs/base_env/config
- provide a corresponding instruction readme file, following the style of MARLlib/envs/base_env/install
Things that are not essential:
- changing the MARLlib data processing pipeline
- providing a unique runner or controller specific to the environment
- worrying about data logging
The ten environments already included cover great diversity in action space, observation space, agent-environment interaction style, task mode, and additional information such as action masks. The best practice for incorporating your environment is to find an existing similar one and provide the same interface.
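As an illustration of the first item above, the sketch below shows the rough shape of a new environment interface file. All names here are hypothetical, and the real wrappers in MARLlib/envs/base_env additionally read their YAML config and may expose further keys such as action_mask.

```python
# Hypothetical sketch of a new environment interface file; the dict-style
# observation carries a per-agent "obs" plus a shared "state", which
# centralized-critic and value-decomposition methods can make use of.
import numpy as np
from gym.spaces import Box, Dict as GymDict, Discrete
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class MyTaskEnv(MultiAgentEnv):
    def __init__(self, env_config=None):
        self.num_agents = 2
        self.agents = [f"agent_{i}" for i in range(self.num_agents)]
        obs_box = Box(-1.0, 1.0, shape=(8,), dtype=np.float32)
        state_box = Box(-1.0, 1.0, shape=(16,), dtype=np.float32)
        self.observation_space = GymDict({"obs": obs_box, "state": state_box})
        self.action_space = Discrete(5)
        self._t = 0

    def _build_obs(self):
        # Placeholder data; a real wrapper would query the underlying simulator.
        state = np.zeros(16, dtype=np.float32)
        return {a: {"obs": np.zeros(8, dtype=np.float32), "state": state}
                for a in self.agents}

    def reset(self):
        self._t = 0
        return self._build_obs()

    def step(self, action_dict):
        self._t += 1
        done = self._t >= 25
        obs = self._build_obs()
        rewards = {a: 0.0 for a in self.agents}
        dones = {a: done for a in self.agents}
        dones["__all__"] = done
        return obs, rewards, dones, {a: {} for a in self.agents}
```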
Our patch files fix most RLlib-related errors in MARL.
Here we only list common bugs that are not RLlib-related (these are usually caused by mistakes on the user side).
- Observation/action out-of-space bug:
  - Make sure the observation/action space defined in the env's init function:
    - has the same data type as the data the env returns (e.g., float32 vs. float64)
    - covers the range of the returned data (e.g., Box(-2, 2))
  - Make sure the returned observation contains the required keys (e.g., action_mask/state).
- Action NaN is invalid bug:
  - This is a common bug, especially in continuous control problems. Carefully fine-tune the algorithm's hyper-parameters:
    - use a smaller learning rate
    - bound the action values
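A small illustration of both checks, assuming gym-style spaces (the shapes and numbers here are made up):

```python
# (1) Declared spaces must match the dtype/range of the data the env returns.
# (2) Bound continuous actions before applying them to avoid NaN blow-ups.
import numpy as np
from gym.spaces import Box

obs_space = Box(low=-2.0, high=2.0, shape=(4,), dtype=np.float32)

obs = np.array([0.5, -1.0, 1.9, 0.0], dtype=np.float32)  # env-returned observation
assert obs.dtype == obs_space.dtype                       # same dtype (float32 vs float64)
assert obs_space.contains(obs)                            # values inside Box(-2, 2)

act_space = Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)
raw_action = np.array([3.7, -0.2], dtype=np.float32)      # policy output may overshoot
action = np.clip(raw_action, act_space.low, act_space.high)  # keep it inside the bounds
```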