Introduction to CS & Engineering (0CCS0CSE)
Assignment 23: Episode
1 Value Function
Implementing Eq. 1 can cause confusion because V(S) appears on both sides of the equation and, in Python, V(S) is a dictionary. This document explains lines 23-25 of Algorithm 1.
V(S_t) = V(S_t) + α[R_{t+1} + γV(S_{t+1}) − V(S_t)]    (1)
Although lines 23 and 24 appear to update the valueFunction dictionary in Algorithm 1, they do not; they only retrieve information from the value function dictionary. Introducing two new variables, v_st1 and v_st0, in place of V(S_{t+1}) and V(S_t) makes it clear that only line 25 changes the dictionary:
v_st1 ⇐ GetValueOf(board)
v_st0 ⇐ GetValueOf(previousState)
V(S_t) ⇐ v_st0 + session.learningRate × (reward + (session.discountRate × v_st1) − v_st0)
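For reference, here is a minimal Python sketch of how lines 23-25 might be written inside the agent class. It assumes the dictionary is stored in self.valueFunction, that getValueOf is the helper method described below, and that previousState.getKey() is a hypothetical method returning the dictionary key for a board (it is not defined by the assignment):

    v_st1 = self.getValueOf(board)           # line 23: read V(S_{t+1})
    v_st0 = self.getValueOf(previousState)   # line 24: read V(S_t)
    key = previousState.getKey()             # hypothetical key helper
    # line 25: the only statement that writes to the dictionary
    self.valueFunction[key] = v_st0 + session.learningRate * (
        reward + session.discountRate * v_st1 - v_st0)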
Furthermore, GetValueOf(...) is a multi-step process: (1) get the key from the board; (2) check whether the key is in valueFunction. Either (i) the key is in valueFunction, in which case return the value associated with it in the dictionary, e.g., return self.valueFunction[key], or (ii) the key is not in valueFunction, in which case add the key to the dictionary, initialise its value to zero, and return 0. It is best to add a new method, getValueOf(self, board), which does all of this; a sketch follows below. In Algorithm 1, lines 23 and 24, both board and previousState are TicTacToe objects.
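A minimal sketch of such a method, assuming the agent class stores the dictionary in self.valueFunction and that board.getKey() is a hypothetical helper returning a hashable encoding of the TicTacToe board:

    def getValueOf(self, board):
        # Step (1): derive the dictionary key from the board.
        key = board.getKey()  # hypothetical; any hashable encoding of the board works
        # Step (2i): the key is already in the dictionary, return its stored value.
        if key in self.valueFunction:
            return self.valueFunction[key]
        # Step (2ii): the key is unseen, initialise it to zero and return 0.
        self.valueFunction[key] = 0.0
        return 0.0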
Algorithm 1 This method executes a single TicTacToe game and updates the state-value table after every move played by the RL agent.
1: procedure episode(board, opponent, session)
2:
3: result ⇐ True
4: turn ⇐ 0
5: previousState ⇐ CopyBoard()
6:
7: while not board.isGameOver() and result do
8: if turn > 1 then
9: turn ⇐ 0
10: end if
11:
12: agentMoved ⇐ False
13:
14: if (turn is 0 and session.agentFirst) or (turn is 1 and not session.agentFirst) then
15: result ⇐ makeTrainingMove(board, session.epsilon)
16: agentMoved ⇐ True
17: else
18: result ⇐ opponent.makeMove(board)
19: end if
20:
21: if agentMoved then
22: reward ⇐ getReward(board)
23: V(S_{t+1}) ⇐ GetValueOf(board)
24: V(S_t) ⇐ GetValueOf(previousState)
25: V(S_t) ⇐ V(S_t) + session.learningRate × (reward + (session.discountRate × V(S_{t+1})) − V(S_t))
26: previousState ⇐ CopyBoard()
27:
28: end if
29:
30: turn ⇐ turn + 1
31: end while
32:
33: reward ⇐ GetReward(board)
34: V(S_{t+1}) ⇐ GetValueOf(board)
35: V(S_{t+1}) ⇐ V(S_{t+1}) + session.learningRate × reward
36: end procedure
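Putting the pieces together, a minimal Python sketch of the episode procedure might look as follows. The names makeTrainingMove, makeMove, isGameOver, getReward, epsilon, agentFirst, learningRate and discountRate come from Algorithm 1; attaching the method to the agent class, board.copy() and board.getKey() are assumptions made only for this sketch:

    def episode(self, board, opponent, session):
        # Plays one game and applies the TD update after every agent move.
        result = True
        turn = 0
        previousState = board.copy()  # assumed copy helper for a TicTacToe object

        while not board.isGameOver() and result:
            if turn > 1:
                turn = 0

            agentMoved = False
            if (turn == 0 and session.agentFirst) or (turn == 1 and not session.agentFirst):
                result = self.makeTrainingMove(board, session.epsilon)
                agentMoved = True
            else:
                result = opponent.makeMove(board)

            if agentMoved:
                reward = self.getReward(board)
                v_st1 = self.getValueOf(board)          # line 23
                v_st0 = self.getValueOf(previousState)  # line 24
                key = previousState.getKey()            # hypothetical key helper
                # line 25: the only write into the dictionary during the loop
                self.valueFunction[key] = v_st0 + session.learningRate * (
                    reward + session.discountRate * v_st1 - v_st0)
                previousState = board.copy()

            turn += 1

        # lines 33-35: final update once the game has ended
        reward = self.getReward(board)
        v_st1 = self.getValueOf(board)
        self.valueFunction[board.getKey()] = v_st1 + session.learningRate * reward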