Application

RL is not yet popular in production deployments (as of 2020/2021): its sample efficiency is low, and it may have no significant edge over expert systems and traditional analytical models in real-life settings.

Recommender Systems

RL is used in recommender systems because of their inherently interactive and dynamic nature; however, data sparsity is the greatest pain point. Both GANs and supervised learning with importance sampling have been developed to address this issue.

  • Generative Adversarial User Model for Reinforcement Learning Based Recommendation System (ICML 19')

This paper proposes a novel model-based RL framework (for higher sample efficiency) to imitate user behavior dynamics and learn the reward function.

The generative model yields the optimal policy (a distribution over actions) given a particular state and action set. With a careful choice of regularizer, we can derive a closed-form solution for the generative model.

(This could be related to Fenchel-Young-ization?) Thus the distribution is determined by the reward function. (This form clearly relates to maximum-entropy IRL.)
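To make the closed form concrete, below is a minimal sketch assuming the regularizer is the negative Shannon entropy, in which case the optimal distribution over the candidate action set is a softmax over rewards, the same form that appears in maximum-entropy IRL (the function name and the temperature eta are illustrative, not from the paper):

```python
import numpy as np

def closed_form_policy(rewards, eta=1.0):
    """Closed-form user policy under a (hypothetical) entropy regularizer.

    Maximizing  E_{a~phi}[r(s, a)] + H(phi) / eta  over the probability
    simplex yields the softmax below; `rewards` holds r(s, a) for each
    candidate action in the current display set.
    """
    logits = eta * np.asarray(rewards, dtype=float)
    logits -= logits.max()          # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Example: 4 candidate items with learned rewards under the current state.
print(closed_form_policy([0.2, 1.5, -0.3, 0.7], eta=2.0))
```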

The adversarial training is a minimax process: one side generates a reward that differentiates the actual user actions from those produced by the generative model (serving as the discriminator / outer min), while the other maximizes the policy return given the true state (serving as the generator / inner max).

Since the number of candidate items can be large, the authors propose a cascading DQN. The cascading Q-network breaks the multi-dimensional action into dimensions decided one by one; when making the decision for one dimension, we use the "optimal" Q-values of the dimensions already decided.
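Here is a minimal sketch of the cascading idea (not the paper's exact architecture): to pick k items from a large pool without scoring all k-subsets, the j-th Q-function scores one more item conditioned on the items already chosen, so the search cost stays linear in k. The callable signature q_j(state, chosen, item) is an assumption for illustration:

```python
import numpy as np

def cascading_argmax(q_funcs, state, candidates):
    """Greedy decoding with a cascade of k Q-functions, one per slot."""
    chosen, pool = [], list(candidates)
    for q_j in q_funcs:                 # decide one action dimension at a time
        scores = [q_j(state, chosen, item) for item in pool]
        chosen.append(pool.pop(int(np.argmax(scores))))   # no repeated items
    return chosen

# Toy usage: 2 slots, each scored by a trivial stand-in Q-function.
qs = [lambda s, chosen, item: -abs(item - 5)] * 2
print(cascading_argmax(qs, state=None, candidates=range(10)))  # [5, 4]
```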

  • Deep Reinforcement Learning for List-wise Recommendations (2017)

This paper defines the recommendation process as a sequential decision-making problem and uses actor-critic to implement list-wise recommendation. Not many new ideas from today's perspective.

  • Reinforcement Learning to Optimize Long-term User Engagement in Recommender Systems (2019)

This paper improves the state design by taking into account how long users stay on the page.

The common state design (this paper included) is a sequence {u, [(item_i, action_mask_i, action_i, other_info_i) for i in range(n)]}.
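Spelled out as a (hypothetical) Python structure, that state might look like the following; the field names mirror the tuple above, everything else is an assumption:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Interaction:
    item_id: int              # item_i: the item shown at step i
    action_mask: bool         # action_mask_i: whether feedback was possible
    action: Optional[int]     # action_i: e.g. click / skip / purchase
    other_info: dict = field(default_factory=dict)   # dwell time, position, ...

@dataclass
class State:
    user_id: int                    # u
    history: List[Interaction]      # the sequence fed to the policy network
```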

This paper uses an S-network to simulate the environment. It uses an importance-weighted loss against the ground truth to mimic the real-world situation and provide samples.
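A minimal sketch of the general importance-weighting recipe (not the paper's exact loss): transitions are logged under a behavior policy, but we want the simulator to match what the current policy would encounter, so each sample's loss is reweighted by the probability ratio between the two:

```python
import torch
import torch.nn.functional as F

def importance_weighted_loss(pred, target, pi_new, pi_behavior):
    """Importance-weighted regression loss for an environment model.

    pred, target: (batch, dim) simulator outputs vs. logged outcomes.
    pi_new, pi_behavior: (batch,) action probabilities under the current
    policy and under the logging (behavior) policy, respectively.
    """
    w = (pi_new / pi_behavior.clamp_min(1e-8)).detach()   # importance ratios
    per_sample = F.mse_loss(pred, target, reduction="none").mean(dim=-1)
    return (w * per_sample).mean()
```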

Note that the Q-network uses multiple LSTMs, as a single LSTM is not enough to extract all the information.

  • Recommendations with Negative Feedback via Pairwise Deep Reinforcement Learning (2018)

This paper considers an important piece of prior knowledge in recommender systems, namely that positive feedback (likes) is far rarer than negative feedback (dislikes); it embeds both clicks and skips into the state space.

The proposed DEERS framework not only minimizes the TD loss, but also maximizes the difference between the Q-values of the chosen action and a proposed "competitor" action to enforce a ranking between competing items.
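A minimal sketch of the shape of such an objective (the variable names and the hinge form are illustrative assumptions, not the paper's exact loss):

```python
import torch
import torch.nn.functional as F

def deers_style_loss(q_sa, td_target, q_sa_competitor, margin=1.0):
    """One-step TD loss plus a pairwise ranking term.

    q_sa:            Q(s, a) of the action actually taken.
    td_target:       r + gamma * max_a' Q_target(s', a'), computed outside.
    q_sa_competitor: Q(s, a_competitor) of a rival candidate item.
    """
    td = F.mse_loss(q_sa, td_target)                         # standard TD error
    rank = F.relu(margin - (q_sa - q_sa_competitor)).mean()  # hinge ranking
    return td + rank
```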

  • Toward Simulating Environments in Reinforcement Learning Based Recommendations

This paper also uses a GAN to compensate for data sparsity, as in the ICML 19' paper. Such a method can be viewed as opponent modeling in MARL, if you imagine the customer as the "opponent" of your recommender system.

See https://zhuanlan.zhihu.com/p/77332847 for reference and a more detailed explanation.

Packet Switching

Packet switching involves decision-making processes that can be tackled with RL.

  • Neural Packet Classification

The mainstream solution for packet classification is a decision tree over IP ranges (i.e., high-dimensional boxes).

This paper uses RL to generate better decision trees: each split of the decision tree is treated as an "action", and the agent minimizes a loss weighted by time and space usage.

To better parallelize the training process, the authors treat the construction of each node as one episode, effectively training a 1-step RL agent (but not a contextual bandit, since those 1-step episodes are correlated).
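A minimal sketch of the reward shape implied above (the paper's exact reward differs; the weights are tunable assumptions): each split is rewarded for reducing classification time, proxied by tree depth, while being penalized for memory usage.

```python
def split_reward(depth_after_split, bytes_after_split,
                 time_weight=1.0, space_weight=1e-4):
    """Reward for one tree-split 'action', treated as its own 1-step episode.

    Trades off classification time (proxied by the resulting tree depth)
    against the memory footprint of the resulting node.
    """
    return -(time_weight * depth_after_split
             + space_weight * bytes_after_split)

# Example: a split yielding depth 12 and a 4 KB node.
print(split_reward(12, 4096))   # -12.4096
```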

  • A Deep Reinforcement Learning Perspective on Internet Congestion Control (ICML 19')

This paper presents a test suite for RL-guided congestion control based on the OpenAI Gym interface.

For congestion control, actions are changes to the sending rate, and states are bounded histories of network statistics.
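A skeleton of what such a Gym-style environment looks like (a sketch matching the description above, not the paper's actual test suite; the measurement and reward logic are placeholders):

```python
import numpy as np
import gym
from gym import spaces

class CongestionControlEnv(gym.Env):
    """Sending-rate control with a bounded history of network statistics."""

    def __init__(self, history_len=10, stats_dim=3):
        # Action: a bounded multiplicative change to the sending rate.
        self.action_space = spaces.Box(low=-1.0, high=1.0, shape=(1,))
        # State: the last `history_len` vectors of network statistics
        # (e.g. latency and throughput measurements).
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf, shape=(history_len, stats_dim))
        self.history = np.zeros((history_len, stats_dim))
        self.rate = 1.0

    def reset(self):
        self.history[:] = 0.0
        self.rate = 1.0
        return self.history

    def step(self, action):
        self.rate *= 1.0 + float(action[0])      # apply the rate change
        stats = self._measure_network()          # would query a simulator
        self.history = np.roll(self.history, -1, axis=0)
        self.history[-1] = stats
        return self.history, self._reward(stats), False, {}

    def _measure_network(self):
        return np.zeros(self.history.shape[1])   # placeholder statistics

    def _reward(self, stats):
        return 0.0    # placeholder, e.g. throughput minus latency penalties
```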

Network Intrusion Detection

There are some works applying RL to network intrusion by modeling the problem as a competitive MARL environment / Bayesian security game, but there is no technical improvement w.r.t. RL in this field.

Network intrusion detection can be divided roughly into host-based (based on log files) and network-based (based on flows & packets); it can also be partitioned into signature-based and anomaly-based approaches.

  • An Overview of Flow-based and Packet-based Intrusion Detection Performance in High-Speed Networks

  • PHY-layer Spoofing Detection with Reinforcement Learning in Wireless Networks

Anti-poaching

The anti-poaching problem can be viewed as a green security game that is solvable by RL. Prof. Fei Fang at CMU has conducted a lot of research in this field.

  • Dual-Mandate Patrols: Multi-Armed Bandits for Green Security (AAAI 21' best paper runner-up)

[Under construction]

  • Green Security Game with Community Engagement (AAMAS 20')

This is a game-theoretic paper rather than an RL paper.

"Community" means that the defender can "recruit" informants, but the (possibly multiple) attacker may update the strategy accordingly. The informant-atacker social network forms a bipartite graph.

One point worth noting: to solve such a competitive multi-agent problem, the authors analyze the parameterized fixed point of the game, then convert it into a bi-level optimization problem.

Such a paradigm can be used in competitive MARL settings.

  • Deep Reinforcement Learning for Green Security Games with Real-Time Information (AAAI 19')

The authors use a 7×7 grid for the experiments; such a size can serve as a reference for our own experiment design.

Traffic Control

Traffic control is one of the most common applications for multi-agent RL.

  • Multi-agent Deep Reinforcement Learning for Large-Scale Traffic Signal Control (2019)

Independent RL agents despite the multi-agent environment; the states of neighbouring lights are fed into each agent's network for communication.

  • Integrating Independent and Centralized Multi-Agent Reinforcement Learning For Traffic Signal Network Optimization

This paper improves on independent RL.

It solves a set of problems under two assumptions:

  1. The credit assignment is clear (i.e. local & global rewards are readily available). This is a strong assumption in MARL;

  2. No agent is special; all agents are homogeneous.

It designs a global Q-value, yet every individual agent maximizes its local reward; the gradient descent considers both the global and the local Q-values.

Then, the authors impose a regularizer that limits the difference between each local Q-value and the global Q-value, maintaining a "balance" between traffic lights.
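A minimal sketch of such a combined objective (the names and the exact penalty form are illustrative assumptions, not the paper's loss):

```python
import torch

def traffic_signal_loss(local_td_losses, local_qs, global_q, reg_coef=0.1):
    """Sum of per-agent TD losses plus a local-global consistency penalty.

    local_td_losses: list of scalar TD losses, one per (homogeneous) agent.
    local_qs:        list of each agent's Q-value tensors for the batch.
    global_q:        the global Q-value tensor for the same batch.
    """
    td = sum(local_td_losses)
    reg = sum((q_i - global_q).pow(2).mean() for q_i in local_qs)
    return td + reg_coef * reg
```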

Public Health

RL could be used in public health both for government-level control and for individual-agent simulation.

  • A Microscopic Epidemic Model and Pandemic Prediction Using Multi-Agent Reinforcement Learning

Individuals can be seen as agents optimizing their own rewards. However, to actually use the model, it is hard to pin down the reward function in the first place.

One point worth noting (though not RL-related) is that "microscopic" is actually not precise in this title; "agent-based" would be more accurate.

  • A brief review of synthetic population generation practices in agent-based social simulation

  • Generating Realistic Synthetic Population Datasets

The above two papers are not direct applications of RL, but they are related to agent-based epidemic modeling.

  • Reinforcement learning for optimization of covid-19 mitigation policies

A work applying RL to Covid-19 mitigation. Nothing new technologically.

Resource Provisioning

Resource provisioning requires predicting future demand and optimizing under the given constraints; due to the high-dimensional nature of provisioning, it is widely solved by two-stage methods or differentiable optimization layers in neural networks.

This category includes both resource distribution and mitigation strategies (e.g., handling VM errors online).

  • A Hierarchical Framework of Cloud Resource Allocation and Power Management Using Deep Reinforcement Learning (2017)

  • Energy-Efficient Virtual Machines Consolidation in Cloud Data Centers Using Reinforcement Learning

  • Resource Management with Deep Reinforcement Learning (2016)

A work by MIT's Hongzi Mao; one of the earliest and most classic works on RL for resource management.

The paper uses a policy gradient method (NOT actor-critic, since the return can be calculated exactly instead of bootstrapped).

The state and step design is the key contribution of this paper. Resource allocation within one timestep can be very complicated, so the author models the process as filling "slots" in a square image, and the passage of one real timestep is modeled as a special action that moves all slots upward by one unit.
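A minimal sketch of that representation (a DeepRM-style "image" state; the array shapes and helper names are assumptions):

```python
import numpy as np

def make_state(machine_image, job_slots):
    """Flatten the cluster image plus pending-job images into one state.

    machine_image: (time, resources) grid, 1 where capacity is occupied.
    job_slots:     list of (duration, resources) grids, one per pending job.
    """
    return np.concatenate([machine_image.ravel()]
                          + [job.ravel() for job in job_slots])

def advance_time(machine_image):
    """The special action: shift all allocated slots up by one timestep."""
    shifted = np.roll(machine_image, -1, axis=0)
    shifted[-1, :] = 0            # the freed bottom row becomes empty
    return shifted
```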

  • Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions (2020)

This work uses a multi-armed bandit to optimize the mitigation strategy. The core contribution is the design of the MDP; there are many devils in the details on a production line (e.g., data correlations, or how to retrieve data efficiently when a long observation window may be needed to determine the effect of an action).
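As a minimal sketch of the bandit formulation (the arms, rewards, and UCB rule here are illustrative assumptions, not the production system): each arm is a mitigation action, its reward reflects whether the interruption was averted, and UCB balances exploring under-tried mitigations against exploiting the best-known one.

```python
import math

def ucb_pick(counts, mean_rewards, t, c=2.0):
    """Choose a mitigation action (arm) by the UCB1 rule.

    counts:       times each mitigation (e.g. reboot, live-migrate) was tried.
    mean_rewards: empirical success rate of each mitigation so far.
    t:            total number of decisions made.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm                  # try every mitigation once first
    return max(range(len(counts)),
               key=lambda a: mean_rewards[a]
                             + c * math.sqrt(math.log(t) / counts[a]))

# Example: three mitigations, 10 decisions so far.
print(ucb_pick([4, 3, 3], [0.5, 0.66, 0.33], t=10))
```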

Autonomous Driving

It could be a long time before RL-powered autonomous driving is actually put into use.

[Under construction]

  • Reinforcement Learning for Autonomous Driving with Latent State Inference and Spatial-Temporal Relationships

  • Driverless Car: Autonomous Driving Using Deep Reinforcement Learning in Urban Environment