Monte Carlo Tree Search in NetLogo

Última modificación: 18 de Noviembre de 2020, y ha tenido 438 vistas

Etiquetas utilizadas: || || || ||

This post is the continuation of a previous post (in spanish) about the fundamentals of Monte Carlo Tree Search method for solving adversarial games. Maybe it is a good idea to start with it before reading this one (or any other source where this background is explained).

You can find all the code in the Adversarial Search folder from NetLogo-IA Project.

To implement Monte Carlo Tree Search (MCTS) in NetLogo we will need two main structures: states (with rules or actions) that reflect every configuration of the game/process, and nodes that will allow to build the tree game.

Although nodes will have a very specific structure, because they will apply directly the MCTS algorithm, states only have some few requirements, and their inner structure will depend on the game features. Indeed, here we will use only a kind of API providing methods to access, modify and create states (and rules/actions):

• state: it will be any structure that works with this interface:
get-content s         : get the content of the state s (game configuration)
get-playerJustMoved s : player (1 or 2) that has generated this state s
create-state c p : returns a new state with content c made by player p
• rules: structure that allows to execute actions on states. It works with this interface:
apply r s      : returns a new state, the result to apply the rule r over the state s
get-rules s : returns a list with the rules applicable to state s
get-result s p : returns the result of the game in state s from the point of view of player p

Note: In the implementation all the procedures are prefixed by MCTS:.

We will show some examples of how to correctly define this structure at the end of this post, when building real game players for some games by using MCTS algorithm.

The data structure that we will use to store the nodes of the tree built by MCTS algorithm will be:

• MCTSnode: agent that stores the nodes structure of the MCTS tree for the execution of the algorithm.

It has the following inner structure:

state       : state stored in the node
wins : accumulated value of the reward got by the tree from this node
visits : number of plays in the tree that uses this node
untriedRules: rules applicable to the state that have not been already used to generate children

And works with the following interface:

create-MCTSnode s : returns a MCTSnode as root with state=s, wins=0, visits=0, unTriedRules=all applicable rules.

to-report create-MCTSnode [s]
let rep nobody
create-MCTSnodes 1 [
set state s
set wins 0
set visits 0
set untriedRules get-rules s
set rep self
set shape "circle"
set color blue
set label (word state " W/V:" wins "/" visits)
ht
]
report rep
end

UCTSelectChild : returns the child of the node with best UCT.

to-report UCTSelectChild
let v visits
let child max-one-of out-link-neighbors [(wins / visits) + sqrt(2 * (ln v) / visits)]
report child
end

As we have used $\sqrt{2}$ as constant to balance exploration and explotation, this UCT function will work fine when the rewards (wins) from the terminal states are in $[0,1]$.

AddChild r : creates and returns a new child by applying the rule r to the current node.

let rep nobody
hatch-MCTSnodes 1 [
set state apply r state
set wins 0
set visits 0
set untriedRules get-rules state
create-MCTSrule-from myself [
set rule r
set label rule
]
set rep self
set label state
]
set untriedRules remove r untriedRules
report rep
end

Update result : updates the values of the node taking result as reward.

to Update [result]
set visits visits + 1
set wins wins + result
end

parent : returns the parent of the current node.

to-report parent
report one-of in-MCTSrule-neighbors
end

Once we have defined the auxiliary data structures and the interface to manipulate them, we can program the main procedure, UCT, that will return the best action to take on the current state to maximize our reward.

UCT root-state N

Following MCTS, this procedure will repeat N times from current state (the root node of the tree) the next steps:

1. Select a node in the current tree to be expanded.
2. Expand the selected node.
3. Simulate(Rollout) one complete game from the current expanded node.
4. Propagate the result of this game to all the involved nodes in the tree.

Finally, returns the best child of the root following the results obtained in the simulations.

The implementation in NetLogo can be as follows:

to-report UCT [root-state iter]
; Create the root node (with the input state)
let rootnode create-MCTSnode root-state

;Main loop for iter iterations
repeat iter [
; Start again from the root node
let cnode rootnode

; Selection stage:
; Using UCT selection. we go from the root node to a current leaf that
; has to be expanded

while [empty? [untriedRules] of cnode and any? [my-out-MCTSrules] of cnode]
[
set cnode [UCTSelectChild] of cnode
]

; Expand stage:
; Expand the previous node with one child (any of them)

; if we can expand (i.e. state/node is non-terminal)
if not empty? [untriedRules] of cnode
[
let r one-of untriedRules
]
]

; Simulation/Rollout stage:
; From this new node we play a random complete game (we use only states,
; not nodes, because we only need to know the final state and who is the
; winner)

let s [state] of cnode
while [not empty? get-rules s]
[
let r one-of (get-rules s)
set s (apply r s)
]

; Backpropagate stage:
; Update the nodes from the expanded one to the root

while [cnode != nobody]
[
Update get-result s (get-playerJustMoved state)
set label (word state " W/V:" wins "/" visits)
set cnode parent
]
]
]

; Return the best found rule

let rep nobody
; return the move that was most visited
set rep max-one-of out-MCTSrule-neighbors [visits]
]
report [rule] of MCTSrule [who] of rootnode [who] of rep
end

By default, the tree is hidden... you can show it adding some simple instructions to see how it evolves.

Some examples

Nim

Nim is a mathematical strategy game where two players take turns removing (i.e., nimming) objects/chips from one heap or pile (there are also variants with several heaps). On every turn, a player can remove from 1 to 3 objects. The winner is who removes the last chip.

Although there exists a winner strategy for this game, and it is possible to provide a perfect heuristic to play it, but we will ignore it and will prepare the state structure in order to play a blind MCTS:

The states will be given by a list with two numbers:

[content player]

In this case, the content is the number of chips in the heap, and the player will be 1 or 2.

The rules are numbers... the number of chips to remove.

The interface to manage correctly these states are:

; Get the content of the state
to-report get-content [s]
report first s
end
; Get the player that generates the state
to-report get-playerJustMoved [s]
report last s
end
; Create a state from the content and player
to-report create-state [c p]
report (list c p)
end
; Get the rules applicable to the state
to-report get-rules [s]
report (range 1 (min (list 4 (1 + get-content s))))
end
; Apply the rule r to the state s
to-report apply [r s]
let c get-content s
let p get-playerJustMoved s
report create-state (c - r) (3 - p)
end
; Move the result from the last state to the current one
to-report get-result [s p]
let pl get-playerJustMoved s
ifelse pl = p [report 1] [report 0]
end

Indeed, it will be very common to use a very similar structure as this one, and in a lot of cases we will have to only adapt the procedures get-rules, apply and get-result to the features of our game.

After that, preparing a main procedure to play against the computer is as easy as the following one:

; A function that removes n chips (for human player). After that, it
; tests if the game is over and, otherwise, let the computer play (with UCT).
to remove-chips [n]
if n <= num_chips [
set num_chips num_chips - n
]
ifelse num_chips = 0
[ user-message "You Win!!!"]
[ let m UCT (list num_chips 1) Max_iterations
user-message (word "I remove " m)
set num_chips num_chips - m
if num_chips = 0 [user-message "I Win!!!"]
]
end

where num_chips is a global variable where the current number of chips in the heap is stored, and Max_iterations is the number of iterations that UCT algorithm will perform to search for the optimal move.

Tic Tac Toe

In this case the states will keep the same structure as the previous example, but storing in the content a list with the values of the board [$p_0$ $p_1$ $p_2$ $p_3$ $p_4$ $p_5$ $p_6$ $p_7$ $p_8$], where $p_i$ can be $0$ (no piece), $1$ (piece from player 1), or $2$ (piece from player 2):

p0 p1 p2
p3 p4 p5
p6 p7 p8

The rules are positions (from 0 to 9) where the player put his piece.

The procedures that we must be adapted for these structures are:

; Get the rules applicable to the state
to-report get-rules [s]
let c get-content s
; Filter the empty places in the board
report filter [x -> item x c = 0] (range 0 9)
end
; Apply the rule r to the state s
to-report apply [r s]
let c get-content s
let p get-playerJustMoved s
; Fill the r place with the number of the current player
report create-state (replace-item r c (3 - p)) (3 - p)
end
; Move the result from the last state to the current one
to-report get-result [s p]
let pl get-playerJustMoved s
let c get-content s
; L have the 3-lines of the board
let L [ [0 1 2] [3 4 5] [6 7 8] [0 3 6] [1 4 7] [2 5 8] [0 4 8] [2 4 6] ]
; For every line, we see if it is filled with the same player
foreach L [
lin ->
let val map [x -> (item x c)] lin
if first val = last val and first val = (first bf val) [
ifelse first val = p [report 1] [report 0]
]
]
; if there is no winner lines, and the board is full, then it is a draw
if empty? get-rules s [report 0.5]
report [false]
end

In order to make an interactive interface for the game, we will use a world of $3\times 3$ patches that will store the board, and the pieces will be a breed of turtles:

breed [pieces piece]
patches-own [
value ; to store the piece (0/1/2) of this place
]

We need an auxiliary procedure to transform between the states (lists) and patches-board:

to-report board-to-state
report map [x -> [value] of x] (sort patches)
end

Then, a procedure to start a new game:

to start
ca
ask patches with [(pxcor + pycor) mod 2 = 0] [set pcolor grey]
end

and a forever procedure (that is, a procedure that is executed in a forever button) to catch the interaction of the user (with the mouse) and call the actions and the UCT algorithm when needed (when the IA plays):

to go
let played? false
if mouse-down? [
if not any? pieces-here [
set value 1
sprout-pieces 1 [
set shape "o"
set color white]
set played? true
]
]
if get-result (list (board-to-state) 1) 1 = 1 [
user-message "You win!!!"
stop
]
if get-result (list (board-to-state) 2) 2 = 0.5 [
user-message "Draw!!!"
stop
]
wait .1
if played? [
let m UCT (list (board-to-state) 1) Max_iterations
ask (item m (sort patches)) [
set value 2
sprout-pieces 1 [
set shape "x"
set color white]
]
]
if get-result (list (board-to-state) 2) 2 = 1 [
user-message "I win!!!"
stop
]
if get-result (list (board-to-state) 2) 2 = 0.5 [
user-message "Draw!!!"
stop
]
]
end