Framework to support LTL task allocation to a team of cooperative robots acting under multiple objectives.
- About The Project
- Installation
- Usage
- Specifying Tasks
- Learning Policies for Agents
- Training
- Visualisation
- Data and Results
- Contact
- Acknowledgments
This framework supports multiagent reinforcement learning in very large or unknown environments. The key idea is to use a shared parameter network during the learning phase, after which each agent executes its own policy; this is the so-called Centralised Training Decentralised Execution (CTDE) paradigm. The framework implements deterministic task allocation to cooperative robot teams: it parameterises the task allocation space for the agents and updates the allocation parameters with a policy gradient approach. Parameterising the task allocation space is a distinctive way of scalarising the multi-objective problem specifically in terms of task allocation. A minimal sketch of this idea follows.
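As a rough illustration of the task-allocation idea (not the framework's actual code; the names, shapes and the placeholder return below are assumptions), the allocation space can be parameterised with one logit per agent-task pair and updated with a policy-gradient step:

```python
import torch

# Illustrative only: kappa holds one logit per (agent, task) pair. A softmax over
# tasks gives each agent a distribution over task assignments; the allocation
# parameters are updated with an ordinary policy-gradient (REINFORCE-style) step.
num_agents, num_tasks = 2, 2
kappa = torch.zeros(num_agents, num_tasks, requires_grad=True)
optimiser = torch.optim.Adam([kappa], lr=1e-2)

allocation_dist = torch.distributions.Categorical(logits=kappa)  # one dist per agent
allocation = allocation_dist.sample()                            # a task index per agent
log_prob = allocation_dist.log_prob(allocation).sum()

# Placeholder scalarised return obtained by executing the sampled allocation.
scalarised_return = torch.tensor(1.0)

loss = -log_prob * scalarised_return
optimiser.zero_grad()
loss.backward()
optimiser.step()
```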
- Create a new anaconda environment. There is an `environment.yml` file in the root directory which can be installed with `conda env create -n myenv -f environment.yml`.
- Clone the repository.
- For development, run `pip install -e .`
- Troubleshooting: if there are issues with PyTorch, go to the PyTorch website and get the correct drivers to configure your GPU.
The example implemented in this research is a multiagent environment in which agents interact to learn how to complete a set of tasks. The environment is an augmented version of the teamgrid environment. In this setting there are a number of challenges to overcome, including how the agents learn to resolve conflicts such as moving to the same square or picking up the same object.
Environments are customised in the `mappo/envs` directory. For example, a simple test environment with two agents and two tasks can be found in `mappo/envs/so_small_ltl_env.py`. The environments included are listed below, followed by a brief usage sketch:
- `so_small_ltl_env.py`: a 2-agent, 2-task small grid environment.
- `task${x}_agent2_simple.py`: 2 agents must complete $x$ tasks, i.e. interact with different objects to complete a mission.
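Assuming the environments follow the standard Gym registration pattern (the training scripts register the environment before use), usage might look like the following; the id, entry point and class name here are hypothetical, so check `mappo/envs` for the real ones:

```python
import gym

# Hypothetical id and entry point -- check mappo/envs for the actual class
# names and whatever registration helper the training scripts use.
gym.register(
    id="SmallLTL-v0",
    entry_point="mappo.envs.so_small_ltl_env:SmallLTLEnv",
)

env = gym.make("SmallLTL-v0")
obs = env.reset()   # per-agent observations (old Gym API assumed)
```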
In this framework tasks are specified using LTL parse trees. An example parse tree, for the formula $(\neg r)\,\mathcal{U}\,(j \wedge ((\neg p)\,\mathcal{U}\,k))$, is shown below:
```
          U
         / \
      not   and
       |   /   \
       r  j     U
               / \
            not   k
             |
             p
```
There are a number of possible representations; this framework uses prefix notation, so the parse tree above is written as `U not r and j U not p k`. The prefix notations are injected directly into the environments in the `mappo/envs` directory.
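For illustration, a prefix-notation task can be held as a nested tuple and flattened into a token sequence. The exact operator tokens and data structure used in `mappo/envs` may differ, so treat this as a sketch:

```python
# The parse tree above, written as a nested (operator, *operands) tuple.
# Operator token names here mirror the tree's labels and are illustrative.
ltl_task = ("U", ("not", "r"), ("and", "j", ("U", ("not", "p"), "k")))

def to_prefix(node):
    """Flatten a nested-tuple formula into a prefix token list."""
    if isinstance(node, str):        # atomic proposition
        return [node]
    op, *operands = node
    tokens = [op]
    for operand in operands:
        tokens.extend(to_prefix(operand))
    return tokens

print(to_prefix(ltl_task))
# ['U', 'not', 'r', 'and', 'j', 'U', 'not', 'p', 'k']
```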
The main network used for the experiments in this framework is `AC_MA_MO_LTL_Model`, which can be imported from `mappo/networks`:
from mappo.networks.mo_ma_ltlnet import AC_MA_MO_LTL_Model
The task learning network is an actor-critic network where the critic outputs a tensor with shape `(..., num_objectives)`.
The network architecture used to learn teamgrid policies that address LTL task specifications is shown below. Note that, unlike independent-agent architectures, a single shared network is used for all agents. An agent does not 'see' the observations of other agents, but the network is trained on the observations of all agents; in this way the CTDE requirement is met.
```
 Input: img, task, ltl
          |
          +--------------------------------+
          |                                |
          v                                v
 Image Convolution (64 filters)      LTL Embedding
          |                                |
          v                                v
 Memory LSTM (if use_memory=True)       LTL GRU
          |                                |
          +---------------+----------------+
                          |
                          v
               Embedding: embedding
                          |
                          v
          Composed Features: composed_x
               |                     |
               v                     v
 Actor Network:             Critic Network:
 logits for actions         critic values
```
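The sketch below mirrors this data flow in PyTorch. It is illustrative only: layer sizes, channel counts and the observation format are assumptions, and the real model lives in `mappo/networks/mo_ma_ltlnet.py` as `AC_MA_MO_LTL_Model`.

```python
import torch
import torch.nn as nn

class SharedActorCriticSketch(nn.Module):
    """Minimal sketch of the shared CTDE network described above.

    Layer sizes and the observation format are assumptions; see
    mappo/networks/mo_ma_ltlnet.py for the real AC_MA_MO_LTL_Model.
    """

    def __init__(self, ltl_vocab_size, num_actions, num_objectives, embed_dim=32):
        super().__init__()
        # Image branch: convolution over the grid observation -> 64-dim feature.
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=2), nn.ReLU(),
            nn.Conv2d(16, 64, kernel_size=2), nn.ReLU(),
            nn.AdaptiveMaxPool2d(1), nn.Flatten(),          # -> (batch, 64)
        )
        # Optional recurrent memory over the image features.
        self.memory = nn.LSTMCell(64, 64)
        # LTL branch: token embedding followed by a GRU over the prefix formula.
        self.ltl_embed = nn.Embedding(ltl_vocab_size, embed_dim)
        self.ltl_gru = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Heads over the composed features.
        self.actor = nn.Sequential(
            nn.Linear(64 + embed_dim, 64), nn.Tanh(), nn.Linear(64, num_actions))
        self.critic = nn.Sequential(
            nn.Linear(64 + embed_dim, 64), nn.Tanh(),
            nn.Linear(64, num_objectives))                  # shape (..., num_objectives)

    def forward(self, img, ltl_tokens, memory_state=None):
        x = self.conv(img)                                   # image embedding
        h, c = self.memory(x, memory_state)                  # recurrent memory
        _, ltl_h = self.ltl_gru(self.ltl_embed(ltl_tokens))  # (1, batch, embed_dim)
        composed_x = torch.cat([h, ltl_h.squeeze(0)], dim=-1)
        logits = self.actor(composed_x)                      # action logits
        values = self.critic(composed_x)                     # one value per objective
        return logits, values, (h, c)
```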
Training uses a novel multiagent multi-objective Proximal Policy Optimisation (PPO) algorithm with mini-batch updates. In the multiagent version of PPO the idea is to share the parameters of the policy so that each agent learns directly from the trajectories of all other agents.
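A rough sketch of what such a shared-parameter, mini-batch PPO update can look like is given below. The `policy` interface, hyperparameters and loss weighting are assumptions, not the framework's actual code:

```python
import torch

def ppo_update(policy, optimiser, batch, clip_eps=0.2, epochs=4, minibatch_size=256):
    """Illustrative clipped-PPO update over the pooled trajectories of all agents.

    `policy(obs)` is assumed to return an action distribution and per-objective
    values; this is a sketch, not the framework's exact implementation.
    """
    obs, actions, old_log_probs, advantages, returns = batch   # pooled over agents
    for _ in range(epochs):
        # Mini-batch updates over a shuffled copy of the pooled experience.
        for idx in torch.randperm(obs.shape[0]).split(minibatch_size):
            dist, values = policy(obs[idx])
            ratio = torch.exp(dist.log_prob(actions[idx]) - old_log_probs[idx])
            clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
            policy_loss = -torch.min(ratio * advantages[idx],
                                     clipped * advantages[idx]).mean()
            value_loss = (values - returns[idx]).pow(2).mean()  # all objectives
            loss = policy_loss + 0.5 * value_loss
            optimiser.zero_grad()
            loss.backward()
            optimiser.step()
```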
Training the model occurs in the `mappo/eval/team_grid` directory, for example `mappo/eval/team_grid/experiments.py`.
The following steps are followed:
- Register the environment and initialise the data recorder.
- Initialise the input parameters.
- Construct an observation environment for each agent.
- Specify the device and call the model constructor.
- There are two sets of parameters to update: $\kappa$ (task allocation) and $\theta$ (policy).
- Initialise the PPO algorithm and its parameters.
- While the frame count is less than the total number of frames and the best score for each objective is below some threshold, collect experiences for each agent and update $\kappa, \theta$.
- The loss function depends on both parameter sets: the first term manages the cost of each agent $i$ performing tasks, while the second term manages the probability of completion for each task $j$ (an illustrative shape is sketched after this list).
- Print the outputs of the training and save models based on high-performing outputs.
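The loss equation itself is not reproduced in this README. Based only on the description above (a cost term per agent $i$ and a completion-probability term per task $j$), a plausible shape, offered as an assumption rather than the paper's exact definition, is:

```latex
% Illustrative shape only: an expected cost term per agent i plus a task-completion
% probability term per task j, both under the allocation (kappa) and policy (theta).
\mathcal{L}(\kappa, \theta) =
    \sum_{i} \mathbb{E}_{\nu_\kappa, \pi_\theta}\!\left[ C_i \right]
    + \sum_{j} \Pr\nolimits_{\nu_\kappa, \pi_\theta}\!\left[ \text{task } j \text{ is completed} \right]
```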
After training has completed, the learned policies can be visualised using:
python mappo/eval/team_grid/dec_visualisation.py
An example of the simple training environment can be seen below.
Results for experiments can be found in the data folder.
Distributed under the Apache-2.0 License. See LICENSE for more information.
Thomas Robinson - @tmrobai - Email
Project Link: https://github.com/tmrob2/ltl2teambot