RL_history_evolution

Common RL Algorithms and Their History and Evolution


1. Early RL Algorithms (Pre-2000s)

These algorithms laid the theoretical foundation for modern RL.

a. Dynamic Programming (Bellman, 1950s)

  • Key Idea: Solve RL problems using recursive Bellman Equations for value functions.
  • Examples:
    • Value Iteration: Iteratively update value functions for each state.
    • Policy Iteration: Alternate between policy evaluation and policy improvement.
  • Limitations:
    • Requires a perfect model of the environment.
    • Computationally expensive for large state spaces.
  • Additional Context: Learning automata, which originated in the Soviet Union with Tsetlin's work in 1973 [1], are another important early milestone in the history of RL.
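
As an illustration of the value-iteration backup described above, here is a minimal sketch on a tiny, hypothetical 3-state MDP (the transition table `P`, `gamma`, and the threshold `theta` are illustrative assumptions, not taken from any reference):

```python
import numpy as np

# Value iteration on a toy, hypothetical 3-state MDP.
# P[s][a] is a list of (probability, next_state, reward) transitions.
P = {
    0: {0: [(1.0, 1, 0.0)], 1: [(1.0, 0, 0.0)]},
    1: {0: [(1.0, 2, 1.0)], 1: [(1.0, 0, 0.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},  # absorbing state
}
gamma, theta = 0.9, 1e-8  # discount factor and convergence threshold

V = np.zeros(len(P))
while True:
    delta = 0.0
    for s in P:
        # Bellman optimality backup: V(s) <- max_a sum_s' p(s'|s,a) * (r + gamma * V(s'))
        v_new = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:
        break

print(V)  # converged state values
```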

b. Temporal Difference (TD) Learning (1980s)

  • Key Idea: Combine dynamic programming with Monte Carlo methods to learn from incomplete episodes.
  • Example: TD(0).
  • Significance: Introduced the concept of bootstrapping: updating the value estimate of a state from the estimated value of its successor, rather than waiting for the complete return (a minimal sketch follows this list).
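
A minimal TD(0) sketch, assuming a hypothetical episodic environment interface with `reset()` returning a state and `step(a)` returning `(next_state, reward, done)`, plus a fixed `policy` function (all names here are illustrative):

```python
def td0(env, policy, episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation; V maps state -> value estimate."""
    V = {}
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            # Bootstrapped target: reward plus the *estimated* value of the next state.
            target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
            V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
            s = s_next
    return V
```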

c. Q-Learning (1989)

  • Key Idea: Learn an action-value function $Q(s, a)$ to maximize expected reward without needing a model of the environment.
  • Algorithm: Update $Q(s, a)$ iteratively using: $Q(s, a) \leftarrow Q(s, a) + \alpha \big(r + \gamma \max_{a'} Q(s', a') - Q(s, a)\big)$
  • Limitations: Struggles with high-dimensional state-action spaces and continuous spaces.
  • Performance: One comparative study reported that Q-Learning achieved a higher cumulative reward (112,000) than SARSA (-43,400) [2] (a tabular sketch of the update follows this list).
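
A tabular Q-learning sketch implementing the update rule above, with an ε-greedy behaviour policy (the environment interface and hyperparameters are assumptions, matching the TD(0) sketch):

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(lambda: [0.0] * n_actions)  # Q[s][a], initialized to zero
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s_next, r, done = env.step(a)
            # Off-policy target: bootstrap with the greedy (max) action in s'.
            target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q
```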

d. SARSA (1996)

  • Key Idea: A model-free, on-policy alternative to Q-learning that updates $Q(s, a)$ using the next action $a'$ actually chosen by the current policy: $Q(s, a) \leftarrow Q(s, a) + \alpha \big(r + \gamma Q(s', a') - Q(s, a)\big)$
  • Recent Insights: In the same comparison [2], Q-Learning showed faster processing times, while SARSA made more accurate decisions (a sketch highlighting the on-policy difference follows this list).
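
A matching SARSA sketch (same assumed environment interface as above); the bootstrap term uses the action the ε-greedy policy actually takes next, which is the only difference from the Q-learning sketch:

```python
import random
from collections import defaultdict

def sarsa(env, n_actions, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(lambda: [0.0] * n_actions)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda i: Q[s][i])

    for _ in range(episodes):
        s, done = env.reset(), False
        a = eps_greedy(s)
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            # On-policy target: bootstrap with the action the policy will actually take.
            target = r + (0.0 if done else gamma * Q[s_next][a_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q
```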

2. Function Approximation and Early Neural Networks (1990s)

  • Problem: Classic algorithms struggled with high-dimensional state spaces.
  • Solution: Introduce function approximation to generalize across states.
  • Notable Work:
    • TD-Gammon (1992): A landmark use of neural networks in RL, developed by Gerald Tesauro to train a backgammon-playing agent. It used 198 input signals and achieved intermediate-level play through self-play and TD($\lambda$) [14].
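
A minimal sketch of the function-approximation idea: semi-gradient TD(0) with a linear value function (here `phi` is an assumed feature function mapping a state to a NumPy vector; the environment interface is the same hypothetical one used above):

```python
import numpy as np

def linear_td0(env, policy, phi, n_features, episodes=1000, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0): V(s) ~= w . phi(s), so updates generalize across similar states."""
    w = np.zeros(n_features)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            s_next, r, done = env.step(policy(s))
            v_s = w @ phi(s)
            v_next = 0.0 if done else w @ phi(s_next)
            td_error = r + gamma * v_next - v_s
            w += alpha * td_error * phi(s)  # gradient of v_s w.r.t. w is phi(s)
            s = s_next
    return w
```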

3. Modern RL with Deep Learning (2010s)

This era integrated deep learning with RL to handle large state-action spaces.

a. Deep Q-Networks (DQN, 2013)

  • Key Paper: "Playing Atari with Deep Reinforcement Learning" (DeepMind, 2013).
  • Key Idea: Combine Q-learning with deep neural networks to approximate ( Q(s, a) ).
  • Innovations:
    • Experience Replay: Store past experiences in a buffer and sample them randomly to break correlation in training data.
    • Target Networks: Use a separate network to compute the target Q-value, improving stability.
  • Applications: Mastered Atari games directly from pixels [3].
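
A sketch of the two DQN innovations (experience replay and a target network), assuming PyTorch and a small discrete-action problem; the network sizes, hyperparameters, and variable names are illustrative, not from the original paper:

```python
import random
from collections import deque

import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 4, 2, 0.99  # illustrative sizes

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())  # target network starts as a frozen copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

replay = deque(maxlen=100_000)  # experience replay buffer of (s, a, r, s', done) tuples

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(replay, batch_size)  # random sampling breaks temporal correlations
    s, a, r, s2, done = (torch.as_tensor(x) for x in zip(*batch))
    q = q_net(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Target Q-value comes from the separate, periodically synced target network.
        target = r.float() + gamma * (1.0 - done.float()) * target_net(s2.float()).max(1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# In a full training loop, target_net.load_state_dict(q_net.state_dict()) is re-run every N steps.
```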

b. Policy Gradient Methods

  • Address limitations of value-based methods like DQN by directly optimizing policies.
  • Examples:
    • REINFORCE (1992): Basic policy gradient algorithm using Monte Carlo estimation of returns.
    • Actor-Critic Methods (2000s): Combine a policy model (actor) and a value function (critic) to reduce variance in gradient estimation.
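
For reference, the REINFORCE gradient estimator, where $G_t$ is the Monte Carlo return from time step $t$ and $\tau$ is a trajectory sampled from the current policy $\pi_\theta$:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\Big]
$$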

c. Trust Region Policy Optimization (TRPO, 2015)

  • Key Paper: "Trust Region Policy Optimization" (Schulman et al.).
  • Key Idea: Constrain policy updates to stay within a trust region, ensuring stable updates and avoiding catastrophic policy changes.
  • Limitation: Computationally expensive due to second-order optimization.
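
The constrained problem TRPO solves at each update, with $\hat{A}$ an advantage estimate and $\delta$ the trust-region size:

$$
\max_\theta \; \mathbb{E}\Big[\tfrac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat{A}(s, a)\Big]
\quad \text{s.t.} \quad
\mathbb{E}\big[D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)\big] \le \delta
$$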

d. Proximal Policy Optimization (PPO, 2017)

  • Key Paper: "Proximal Policy Optimization Algorithms" (Schulman et al., OpenAI).
  • Key Idea: Simplify TRPO by replacing trust-region constraints with a clipping mechanism in the objective function.
  • Advantages: Easier to implement and scales well for large-scale applications.
  • Significance: Explicitly designed to prevent excessively large policy updates during training, improving stability [15] (a loss sketch follows this list).
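
A minimal sketch of PPO's clipped surrogate loss in PyTorch (the clip range `epsilon` of 0.2 is a common default; advantages are assumed to be precomputed, e.g. with GAE):

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Negative clipped surrogate objective, to be minimized."""
    ratio = torch.exp(log_probs_new - log_probs_old)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Taking the minimum removes the incentive to push the ratio outside [1 - eps, 1 + eps].
    return -torch.min(unclipped, clipped).mean()
```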

e. Advantage Actor-Critic (A2C/A3C, 2016)

  • Key Idea: Extend actor-critic methods by training multiple workers in parallel environments; A3C does this asynchronously, while A2C is the synchronous variant (a loss sketch follows this list).
  • Advantage: Faster, more stable training due to parallel data collection.
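
A sketch of the per-batch loss terms typically used in A2C: a policy-gradient term weighted by the advantage, a value-regression term, and an entropy bonus (the coefficients are common defaults, not prescribed values):

```python
import torch

def a2c_loss(log_probs, values, returns, entropy, value_coef=0.5, entropy_coef=0.01):
    advantages = returns - values            # the critic baseline reduces gradient variance
    policy_loss = -(log_probs * advantages.detach()).mean()
    value_loss = advantages.pow(2).mean()    # regress V(s) toward the observed returns
    return policy_loss + value_coef * value_loss - entropy_coef * entropy.mean()
```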

4. Evolution to Model-Based RL (Late 2010s - Present)

  • Problem: Model-free methods are sample-inefficient, requiring millions of interactions with the environment.
  • Solution: Leverage a learned model of the environment to improve sample efficiency.
  • Notable Algorithms:
    • World Models (2018): Learn a generative model of the environment to simulate and plan.
    • Model-Based Value Expansion (MVE, 2018): Use short-horizon rollouts from a learned model to improve value targets and augment training (a sketch follows this list).
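
A sketch of the model-based value-expansion idea: roll a learned dynamics model forward for a few steps and bootstrap with a learned value function at the horizon (`model`, `policy`, and `value_fn` are assumed, hypothetical components returning compatible numeric types):

```python
def mve_target(s, model, policy, value_fn, horizon=3, gamma=0.99):
    """H-step imagined return plus a bootstrapped value at the horizon."""
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy(s)
        s, r = model(s, a)        # imagined transition from the learned model
        ret += discount * r
        discount *= gamma
    return ret + discount * value_fn(s)
```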

5. Multi-Agent RL (MARL)

  • Key Challenge: Environments with multiple interacting agents (e.g., cooperative or competitive games).
  • Notable Algorithms:
    • Independent Q-Learning (IQL): Treat each agent as an independent Q-learner.
    • Multi-Agent Deep Deterministic Policy Gradient (MADDPG, 2017): Extend DDPG for multi-agent settings by including other agents’ actions in the critic.
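
A minimal PyTorch sketch of the centralized-critic idea in MADDPG: the critic conditions on all agents' observations and actions, while each actor still acts from its own observation (sizes are illustrative):

```python
import torch
import torch.nn as nn

n_agents, obs_dim, act_dim = 3, 8, 2  # illustrative sizes

# Centralized critic: input is the concatenation of every agent's observation and action.
critic = nn.Sequential(
    nn.Linear(n_agents * (obs_dim + act_dim), 128),
    nn.ReLU(),
    nn.Linear(128, 1),
)

joint_obs = torch.randn(1, n_agents * obs_dim)
joint_act = torch.randn(1, n_agents * act_dim)
q_value = critic(torch.cat([joint_obs, joint_act], dim=1))  # one such critic per agent in practice
```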

6. RL with Human Feedback (2020s)

  • Key Development: Combine RL with human feedback to train AI systems aligned with human preferences.
  • Examples:
    • RLHF (Reinforcement Learning from Human Feedback): Used to train large language models like ChatGPT.
  • Origins: Foundational research on RLHF began in 2017 by OpenAI and DeepMind [7].
  • Purpose: Specifically designed for tasks with complex, ill-defined goals that are difficult to specify mathematically [7].
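
A sketch of the preference-based reward-model loss at the core of RLHF: given reward-model scores for a preferred and a rejected response, the model is trained so the preferred one scores higher (a Bradley-Terry style objective; variable names are illustrative):

```python
import torch.nn.functional as F

def preference_loss(rewards_chosen, rewards_rejected):
    """Maximize log P(chosen preferred over rejected) under a Bradley-Terry model."""
    return -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
```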

Summary of Algorithm Evolution

| Era | Key Algorithms | Highlights |
| --- | --- | --- |
| Pre-2000s | Dynamic Programming, Q-Learning | Foundations of RL, model-free learning. |
| 1990s | TD-Gammon, Function Approximation | Early neural networks in RL. |
| 2010-2015 | DQN, Policy Gradient Methods | Deep RL revolution for high-dimensional tasks. |
| 2015-2020 | TRPO, PPO, A3C, DDPG | Stability and scalability improvements. |
| 2020s | Model-Based RL, RLHF, MARL | Sample efficiency, multi-agent, human alignment. |

Reinforcement learning has evolved from simple theoretical constructs to robust algorithms capable of tackling complex, real-world problems. Each new wave of development addresses specific limitations, driving RL closer to general-purpose AI.