There is a nice tutorial that explains how Q-Learning works here. The following Python code implements the basic principles of Q-Learning:
Let’s assume we have a state matrix defining how we can transition between states, and a goal state (5):
    import numpy as np

    GOAL_STATE = 5

    # rows are states, columns are actions
    STATE_MATRIX = np.array([
        [np.nan, np.nan, np.nan, np.nan, 0., np.nan],
        [np.nan, np.nan, np.nan, 0., np.nan, 100.],
        [np.nan, np.nan, np.nan, 0., np.nan, np.nan],
        [np.nan, 0., 0., np.nan, 0., np.nan],
        [0., np.nan, np.nan, 0., np.nan, 100.],
        [np.nan, 0., np.nan, np.nan, 0., 100.],
    ])
    Q_MATRIX = np.zeros(STATE_MATRIX.shape)
Visually this can be represented as follows:
For example, if we are in state 0, we can go to state 4, as defined by the 0. in row 0, column 4. If we are in state 4, we can go directly to state 5, as defined by the 100. in row 4, column 5. np.nan marks impossible transitions. Finally, we initialize the Q-Matrix with zeros.
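To check this reading of the matrix, the possible transitions of every state can be listed directly. This is only a small sketch: the helper name `valid_actions` and the `nan` shorthand are made up for this illustration and are not part of the code in this post.

```python
import numpy as np

nan = np.nan  # shorthand, used only in this sketch

# Same matrix as above: rows are states, columns are actions.
STATE_MATRIX = np.array([
    [nan, nan, nan, nan, 0., nan],
    [nan, nan, nan, 0., nan, 100.],
    [nan, nan, nan, 0., nan, nan],
    [nan, 0., 0., nan, 0., nan],
    [0., nan, nan, 0., nan, 100.],
    [nan, 0., nan, nan, 0., 100.],
])


def valid_actions(state):
    """Return the column indices of this state's row that are not np.nan."""
    return np.flatnonzero(~np.isnan(STATE_MATRIX[state])).tolist()


for state in range(STATE_MATRIX.shape[0]):
    print(state, "->", valid_actions(state))
```

State 0 only lists state 4, while state 4 lists states 0, 3 and 5, which agrees with the description above.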
Now the Q-Learning algorithm is simple. The comments in the following code segment guide you through the steps:
    import random

    # MAX_EPISODES and EPSILON are constants that must be defined beforehand
    # (their values are not shown here). Note that EPSILON is used both as the
    # exploration probability and as the discount factor in the update.
    for _ in range(MAX_EPISODES):
        # pick a random start state
        state = random.randint(0, 5)
        while state != GOAL_STATE:
            # find possible actions for this state
            candidate_actions = _find_next(STATE_MATRIX[state])
            # randomly pick one action; the action is the index of the next state
            action = random.choice(candidate_actions)
            # determine which transitions are possible from that next state
            next_actions = _find_next(STATE_MATRIX[action])
            values = [Q_MATRIX[action][item] for item in next_actions]
            # add some exploration randomness...
            if random.random() < EPSILON:
                # so we do not always select the best
                max_val = random.choice(values)
            else:
                max_val = max(values)
            # calc the Q value matrix
            Q_MATRIX[state][action] = STATE_MATRIX[state][action] + EPSILON * max_val
            # next step
            state = action
We need one little helper routine for this. It determines which next steps are possible from a given row of the state matrix:
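    def _find_next(items):
        """Return the indices of the possible transitions in a matrix row."""
        res = []
        for i, item in enumerate(items):
            # np.nan >= 0 is False, so impossible transitions are skipped
            if item >= 0:
                res.append(i)
        return res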
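To see what a single update in the loop does, two updates can be worked through by hand. The sketch below is not the code from this post: it always takes the maximum (no exploration), and the names `R`, `Q`, `GAMMA` and `update` as well as the discount value 0.8 are assumptions chosen for illustration.

```python
import numpy as np

nan = np.nan  # shorthand, used only in this sketch

# Reward matrix copied from above: rows are states, columns are actions.
R = np.array([
    [nan, nan, nan, nan, 0., nan],
    [nan, nan, nan, 0., nan, 100.],
    [nan, nan, nan, 0., nan, nan],
    [nan, 0., 0., nan, 0., nan],
    [0., nan, nan, 0., nan, 100.],
    [nan, 0., nan, nan, 0., 100.],
])
Q = np.zeros(R.shape)

GAMMA = 0.8  # assumed discount factor; not a value given in this post


def update(state, action):
    """Apply one greedy update; the action index is also the next state."""
    next_state = action
    possible = ~np.isnan(R[next_state])  # transitions allowed from next_state
    Q[state, action] = R[state, action] + GAMMA * Q[next_state, possible].max()
    return Q[state, action]


first = update(1, 5)   # 100 + 0.8 * 0: nothing is known about state 5 yet
second = update(3, 1)  # 0 + 0.8 * 100: the reward of 1 -> 5 is propagated back
print(first, second)
```

Under these assumptions the value for 3 -> 1 is 0.8 times the value for 1 -> 5. This is how the reward at the goal spreads backwards to states that are further away.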
Finally we can output the results:
    # normalize so that the largest Q value becomes 1.0
    Q_MATRIX = Q_MATRIX / Q_MATRIX.max()
    np.set_printoptions(formatter={'float': '{: 0.3f}'.format})
    print(Q_MATRIX)
This will output the following Q-Matrix:
    [[ 0.000  0.000  0.000  0.000  0.800  0.000]
     [ 0.000  0.000  0.000  0.168  0.000  1.000]
     [ 0.000  0.000  0.000  0.107  0.000  0.000]
     [ 0.000  0.800  0.134  0.000  0.134  0.000]
     [ 0.044  0.000  0.000  0.107  0.000  1.000]
     [ 0.000  0.000  0.000  0.000  0.000  0.000]]
For example, this shows that the best path from state 2 to state 5 is: 2 -> 3 (0.107), 3 -> 1 (0.8), 1 -> 5 (1.0). In each state we simply follow the transition with the highest Q value.
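Reading such a path off the matrix can also be automated. The following is a minimal sketch that uses the Q-Matrix printed above as input. The function name `greedy_path` and the `max_steps` guard are assumptions added for this illustration. It relies on impossible transitions having a Q value of 0, which is the case here.

```python
import numpy as np

GOAL_STATE = 5

# The normalized Q-Matrix from the output above.
Q_MATRIX = np.array([
    [0.000, 0.000, 0.000, 0.000, 0.800, 0.000],
    [0.000, 0.000, 0.000, 0.168, 0.000, 1.000],
    [0.000, 0.000, 0.000, 0.107, 0.000, 0.000],
    [0.000, 0.800, 0.134, 0.000, 0.134, 0.000],
    [0.044, 0.000, 0.000, 0.107, 0.000, 1.000],
    [0.000, 0.000, 0.000, 0.000, 0.000, 0.000],
])


def greedy_path(q_matrix, start, goal, max_steps=10):
    """Follow the highest Q value from start until the goal is reached."""
    path = [start]
    state = start
    # max_steps guards against endless loops on a badly trained matrix
    while state != goal and len(path) <= max_steps:
        state = int(np.argmax(q_matrix[state]))
        path.append(state)
    return path


print(greedy_path(Q_MATRIX, 2, GOAL_STATE))  # [2, 3, 1, 5]
print(greedy_path(Q_MATRIX, 0, GOAL_STATE))  # [0, 4, 5]
```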