Reinforcement Learning and how it works

The tic-tac-toe agent uses reinforcement learning to improve its gameplay. Each turn, it examines the board's state and consults a table of values to decide where to place its token (either O or X). In the table, each state and possible action (token placement) has a value ranging from -1 to 1. During play, the agent records every state-action pair it encounters in a list so it can review its moves after the game. Once the game ends, it adjusts the values of the states it visited and the actions it took based on whether it won, drew or lost: winning state-action pairs move closer to 1, losing ones move towards -1, and a draw shifts them closer to 0. The agent was trained by playing against itself 200,000 times. This type of learning, where we wait for a whole episode to finish and then review our actions afterwards, is called Monte Carlo learning.
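
Here is a minimal sketch of what that post-game table update could look like. The game's actual code isn't shown on this page, so the names, the learning rate and the reward values here are my own illustrative assumptions; only the overall idea (record the episode, then nudge every visited pair toward the final result) comes from the description above.

    # Sketch of a tabular Monte Carlo update: all values are assumptions.
    from collections import defaultdict

    value_table = defaultdict(float)   # unseen (state, action) pairs start at 0.0
    LEARNING_RATE = 0.1                # assumed step size, not the game's real value

    def update_after_game(episode, outcome):
        """Nudge every (state, action) pair visited this game toward the
        final reward: +1 for a win, -1 for a loss, 0 for a draw."""
        reward = {"win": 1.0, "loss": -1.0, "draw": 0.0}[outcome]
        for state, action in episode:
            old = value_table[(state, action)]
            value_table[(state, action)] = old + LEARNING_RATE * (reward - old)

    # During play the agent records each move it makes...
    episode = [("000000000", 4), ("021000000", 8)]   # (board string, chosen cell)
    # ...and only updates the table once the game has ended (Monte Carlo style).
    update_after_game(episode, "win")

Because every visited pair is pulled toward a reward that is itself between -1 and 1, the table values stay inside that range.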

Curiosities 

  • Do me a favour: either reload the game or play until you're making the first move. Start by placing your token in the right-center, and then on your next turn place it in the bottom-center. You will see that the agent can win on its next turn by completing a diagonal line. Now place your third token in the left-center and you will notice something curious happen. The agent doesn't win the game; instead, it places its token in a spot that, once again, guarantees a win on the following move. Wherever you place your next token is up to you, but you will lose. This is because I didn't incorporate discounting into my algorithm. Humans use discounting in our everyday lives. If I gave you the option of £5 today or £6 in a year, which would you pick? Many would choose the immediate £5, even though the future offer is greater. That's because now is certain, while the future is unpredictable. In reinforcement learning, we sometimes apply discounting to emphasize that a present reward is more valuable than the same reward in the future: now is guaranteed, but the future can change. Since I didn't use discounting, the agent doesn't really have a concept of time. This is why the agent didn't take the winning move straight away: a guaranteed win in the future is worth exactly the same as a win right now. In fact, a guaranteed win in a thousand years' time is worth exactly the same to it as a win right now, which isn't very humanlike. The short sketch after this bullet shows the difference a discount factor makes.
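
The toy calculation below isn't from the game's code; it just illustrates the point about discounting. With a discount factor of 1 (what the agent effectively uses), a win now and a win several moves from now have identical value, whereas any factor below 1 makes the immediate win more attractive.

    # Illustrative only: value today of a guaranteed +1 reward k moves away.
    def discounted_win_value(moves_until_win, gamma):
        return (gamma ** moves_until_win) * 1.0

    for gamma in (1.0, 0.9):
        now = discounted_win_value(0, gamma)
        later = discounted_win_value(3, gamma)
        print(f"gamma={gamma}: win now = {now:.3f}, win in 3 moves = {later:.3f}")

    # gamma=1.0: both are worth 1.000, so the agent is indifferent (the behaviour above).
    # gamma=0.9: the immediate win (1.000) beats the delayed one (0.729).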

Problems

  • I had an issue with my agent not learning, even though the algorithm was correct, the parameters were tuned and enough episodes had been run. The problem was with the board's state, which was represented by a string of nine numbers: 0 for an empty space, 1 for X, and 2 for O. The critical oversight was that the state didn't record which token (X or O) the agent was playing as. Imagine there's an X on both the top-left and top-right corners with an empty spot in between, and you're told it's your turn but not which token you're holding: you'd be at a loss for the best move. This ambiguity also muddles the value of state-action pairs. For instance, if the centre spot holds a value of 1, implying a certain win, then placing a token in that spot would be the optimal choice regardless of which token you hold, which is wrong. I could have fixed this by incorporating the agent's token into the state. Instead, I changed the state representation: '1' now means 'my token' and '2' means 'the opponent's token', so the agent and the player see mirrored versions of the same state. A small sketch of this encoding follows this bullet.
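
A minimal sketch of that relative encoding is below. The function name and the exact way the conversion is done are my own assumptions; the encoding itself ('1' = my token, '2' = opponent's token) is as described above.

    # Convert an absolute board ('1' = X, '2' = O) to the agent-relative form.
    def relative_state(board, my_token):
        """board: 9-character string, '0' = empty, '1' = X, '2' = O.
        my_token: '1' if the agent is playing X, '2' if it is playing O.
        Returns the relative encoding: '1' = my token, '2' = opponent's."""
        opponent = "2" if my_token == "1" else "1"
        mapping = {"0": "0", my_token: "1", opponent: "2"}
        return "".join(mapping[cell] for cell in board)

    # X on the top-left and top-right corners with an empty spot in between:
    board = "101000000"
    print(relative_state(board, "1"))  # '101000000' -- as seen by the X player
    print(relative_state(board, "2"))  # '202000000' -- as seen by the O player

With this encoding, the same position always looks the same to whichever side is about to move, so the value table no longer needs to know which token the agent holds.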

