This program applies a simple single-state Q-Learning algorithm to learn the optimal bid in different types of sealed-bid auctions with customizable parameters. The following components of the program are described below:
- The Game
- The Algorithm
- Customizable Parameters
  - Game Parameters
  - Algorithm Parameters
  - Agent Parameters
- Auction Types
  - Traditional (First-Price Sealed Bid Auction)
  - Reverse (Second-Price Sealed Bid Auction)
  - Loser Pays
- Output
- Future Directions
The game implemented here is a two-player sealed-bid auction for an item. Essentially, we have our automated agent, who is trying to optimize their bid to gain the highest possible utility. The other player, against whom our agent repeatedly plays, determines their bid by drawing from a normal distribution dictated by input parameters. Additionally, the value of the item being bid on is drawn from a normal distribution, again based on user input parameters.
One important rule is that our agent wins all ties between itself and the other player.
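As a rough illustration of one round of this game, the sketch below draws the other player's bid and the item's value from normal distributions and applies the tie-breaking rule. The function and parameter names (`play_round`, `other_mean`, `value_std`, etc.) are assumptions for illustration, not the program's actual API:

```python
import numpy as np

rng = np.random.default_rng(0)

def play_round(agent_bid, other_mean, other_std, value_mean, value_std):
    """Simulate one sealed-bid round; our agent wins all ties."""
    other_bid = rng.normal(other_mean, other_std)    # the other player's bid
    item_value = rng.normal(value_mean, value_std)   # the value of the item up for auction
    agent_wins = agent_bid >= other_bid              # ties go to our agent
    return agent_wins, other_bid, item_value
```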
I employ a simple, model-free reinforcement learning algorithm to learn the optimal bidding strategy. Essentially, a table of action values is iteratively updated until it converges on an optimal distribution of action choices. The update rule is the standard Q-Learning one, `Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))`, which is fully described at https://en.wikipedia.org/wiki/Q-learning. In our case, we only have one state: the initial state where no bids have been selected. Also, our reward function is based on the utility derived by the agent rather than on the absolute winnings of the game.
As an additional note, epsilon annealing is applied in this algorithm so that exploration is reduced as we head towards a convergence state.
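For concreteness, here is a minimal sketch of the single-state update with epsilon-greedy, annealed exploration. The bid discretization, annealing schedule, and placeholder reward function are illustrative assumptions, not the program's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
bids = np.arange(0, 101)            # discretized bid space (granularity is an assumption)
Q = np.zeros(len(bids))             # single state, so the Q-table is just a vector over bids
alpha, gamma, epsilon = 0.1, 0.5, 1.0

def simulate_reward(bid):
    """Placeholder reward: utility from one first-price round (see the game sketch above)."""
    other_bid = rng.normal(50, 10)
    item_value = rng.normal(60, 5)
    if bid >= other_bid:            # our agent wins all ties
        return item_value - bid     # first-price payoff
    return 0.0

for _ in range(10_000):
    # epsilon-greedy action selection
    if rng.random() < epsilon:
        action = rng.integers(len(bids))
    else:
        action = int(np.argmax(Q))

    reward = simulate_reward(bids[action])

    # Q-Learning update; with a single state, the "next state" is the same state
    Q[action] += alpha * (reward + gamma * Q.max() - Q[action])

    # anneal epsilon so exploration shrinks as we approach convergence
    epsilon = max(0.01, epsilon * 0.9995)
```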
Game Parameters:
- Auction Type: The style of auction to be played by the agents. This will determine how winnings are distributed from each round of the game
Algorithm Parameters:
- Rounds: The number of times you would like the agent to play the auction; a higher number will lead to more desirable convergence properties
- Alpha: The learning rate of the algorithm, essentially how strongly our agent updates its action values after each round of the auction
- Gamma: The discount factor of the algorithm, i.e. how much the agent values future rewards relative to immediate ones. For desirable properties, it is advised to set this to 0.5
- Epsilon: The exploration rate of the algorithm, or the rate at which the algorithm initially explores the action space
Agent Parameters:
- Risk Aversion: A multiplier that linearly scales losing outcomes so that losses weigh more heavily in the agent's experienced utility (see the sketch after this list)
- Risk Loving: A multiplier that linearly scales winning outcomes so that wins weigh more heavily in the agent's experienced utility
- Other Mean: The mean of the normal distribution from which the other player draws their bid
- Other Std: The standard deviation of the normal distribution from which the other player draws their bid
- Value Mean: The mean of the normal distribution describing the value of the item being bid on
- Value Std: The standard deviation of the normal distribution describing the value of the item being bid on
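One plausible reading of the two risk parameters is a piecewise-linear utility over each round's payoff, as in the sketch below. The exact functional form used by the program may differ; `utility` and its arguments are illustrative names:

```python
def utility(payoff, risk_aversion=1.0, risk_loving=1.0):
    """Scale losses by risk_aversion and gains by risk_loving (assumed piecewise-linear form)."""
    if payoff < 0:
        return risk_aversion * payoff   # losses feel worse when risk_aversion > 1
    return risk_loving * payoff         # wins feel better when risk_loving > 1
```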
Auction Types:
- Traditional: A typical first-price auction where the winner pays the amount they bid; the winner's payoff is the value of the item being bid on minus the winning bid
- Reverse: Essentially a "Second-Price" auction in this case, where the winner pays the amount of the second-highest bid (the losing bid here); the winner's payoff is the value of the item being bid on minus the losing bid
- Loser Pays: An unconventional auction where the loser pays the winning bid and does not win the item being bid on; thus the winner's payoff is the entire value of the item being bid on (a payoff sketch for all three types follows below)
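The sketch below shows how the winner's payoff could be computed for each of these auction types, following the descriptions above. The function name, string labels, and argument names are illustrative, not the program's actual interface:

```python
def winner_payoff(auction_type, item_value, winning_bid, losing_bid):
    """Payoff to the winning bidder under each supported auction type."""
    if auction_type == "traditional":   # first-price: winner pays their own bid
        return item_value - winning_bid
    if auction_type == "reverse":       # second-price: winner pays the losing bid
        return item_value - losing_bid
    if auction_type == "loser_pays":    # loser pays the winning bid; winner keeps the full value
        return item_value
    raise ValueError(f"unknown auction type: {auction_type}")
```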
The current state of the program outputs three metrics and four charts.
The three metrics are:
- The average value of the winnings for our agent across all rounds played
- The average utility derived by our agent across all rounds played
- The optimal bid for the agent to play, taken from the converged Q-values at the end of the Q-Learning iterations
The four charts are:
- A histogram of the distribution of winning values from all rounds
- A histogram of the distribution of derived utility from all rounds
- The ending L1-normalized Q-values for all of the possible bids (higher values mean the bid is closer to optimal; see the sketch after this list)
- An overlaid histogram of all of the bids played by our agent compared to the bids played by the other player
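The sketch below shows one way these summaries could be derived from the recorded rounds and the final Q-table. Names such as `winnings_history`, `utility_history`, `Q`, and `bids`, as well as the exact normalization used, are assumptions tied to the earlier sketches:

```python
import numpy as np

def summarize(winnings_history, utility_history, Q, bids):
    """Compute the three reported metrics plus L1-normalized Q-values for the bar chart."""
    avg_winnings = float(np.mean(winnings_history))   # metric 1: average winnings per round
    avg_utility = float(np.mean(utility_history))     # metric 2: average derived utility per round
    optimal_bid = bids[int(np.argmax(Q))]             # metric 3: bid with the highest final Q-value
    q_normalized = Q / np.abs(Q).sum()                # divide by the L1 norm (sign is preserved)
    return avg_winnings, avg_utility, optimal_bid, q_normalized
```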
Some interesting future directions to possibly explore are:
- Different or more complex utility functions for our agent
- Multiple other players rather than a single other player
- Different reinforcement algorithms for optimal choice convergence
- Functionality to run the simulation multiple times to make population inference
- More complex other player bidding strategies
- Different auction types