Teaching an AI to Play Tetris with DQN Self-Learning

Author | 李秋鍵

Editor | Elle

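Since the 1980s, game AI has changed dramatically and has given rise to "self-thinking AI". NPCs observe and analyze what happens in the game and respond specifically to the player's behavior. Instead of pursuing a single fixed goal, they behave in more flexible and varied ways.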

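Self-thinking AI of this kind is built on finite state machines and behavior trees, which essentially amount to combinations of many if-else branches. A finite state machine is organized around the AI's current state: the developer writes the conditions for moving from one state to another, and each state has its own goals, strategy, and actions.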
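To make the idea concrete, here is a minimal sketch of a finite state machine for a hypothetical guard NPC. The state names, the `distance_to_player` and `health` inputs, and the thresholds are all assumptions invented for illustration; they do not come from any particular game.

```python
# Minimal finite state machine sketch for a hypothetical guard NPC.
# States, inputs and thresholds are illustrative assumptions.

def next_state(state, distance_to_player, health):
    """Return the NPC's next state given its current state and observations."""
    if state == "patrol":
        # Start chasing once the player comes close enough.
        return "chase" if distance_to_player < 10 else "patrol"
    if state == "chase":
        if health < 30:
            return "flee"      # too hurt to keep fighting
        if distance_to_player >= 10:
            return "patrol"    # lost track of the player
        return "chase"
    if state == "flee":
        # Resume patrolling once the player is far away.
        return "patrol" if distance_to_player >= 20 else "flee"
    raise ValueError("unknown state: " + state)


state = "patrol"
trace = []
# Each tuple is (distance_to_player, health) observed on one game tick.
for observation in [(15, 100), (8, 100), (5, 20), (12, 20), (25, 20)]:
    state = next_state(state, *observation)
    trace.append(state)
print(trace)
```

Each state only lists the transitions that matter to it, which is exactly the "combination of if-else branches" described above.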

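A behavior tree, by contrast, is organized around the current behavior: through a series of condition checks it decides which behavior strategy and which concrete behavior to adopt next. A classic example is its use in Pac-Man.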
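As a rough illustration, a behavior tree for a Pac-Man-style ghost could be sketched as follows. This is not how any real Pac-Man implementation works; the node helpers (`selector`, `sequence`), the `world` dictionary keys, and the action names are all hypothetical.

```python
# Minimal behavior tree sketch for a hypothetical Pac-Man-style ghost.
# Node helpers, `world` keys and action names are illustrative assumptions.

def selector(*children):
    """Try children in order and return the first action that is not None."""
    def run(world):
        for child in children:
            action = child(world)
            if action is not None:
                return action
        return None
    return run

def sequence(condition, action):
    """Two-step sequence: check `condition(world)`, then yield `action`."""
    def run(world):
        return action if condition(world) else None
    return run

ghost_tree = selector(
    sequence(lambda w: w["pacman_powered_up"], "run_away"),
    sequence(lambda w: w["pacman_visible"], "chase_pacman"),
    sequence(lambda w: True, "wander"),  # fallback behavior
)

print(ghost_tree({"pacman_powered_up": False, "pacman_visible": True}))
print(ghost_tree({"pacman_powered_up": True, "pacman_visible": True}))
print(ghost_tree({"pacman_powered_up": False, "pacman_visible": False}))
```

The order of the children encodes priority: the ghost only considers chasing when it does not need to run away.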

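In recent years, rapid progress in artificial intelligence has directly driven the development of game AI. One of the most representative algorithms is DQN self-learning, in which the computer plays the game by itself and learns from its failures until it reaches a high level of play.

This includes the AI in games such as Honor of Kings (王者榮耀). Today we will use DQN to teach a computer to play Tetris.

Part of the training result is shown in the figure below:

Before we start, it is worth reviewing the basics of the DQN algorithm.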

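Basic Principles

1. What is DQN?

DQN (Deep Q-Network, the algorithm behind deep Q-learning) is widely regarded as the pioneering work of deep reinforcement learning (DRL). It combines deep learning with reinforcement learning to achieve end-to-end learning from perception to action.

2. How does DQN work?

(1) It uses Q-Learning to construct training labels from the reward.

(2) It uses experience replay (a replay buffer) to address sample correlation and the non-stationary data distribution.

(3) It uses one CNN (MainNet) to produce the current Q-values and a second CNN (Target) to produce the target Q-values.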
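The three ingredients above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the article's implementation: `toy_target_q` is a hypothetical stand-in for the Target CNN, the transitions are made up, and the buffer size is arbitrary. Note also that the training code later in this article evaluates next-state Q-values with the same `readout` network rather than a separate Target network.

```python
import random
from collections import deque

GAMMA = 0.99  # discount factor

def q_targets(batch, target_q, gamma=GAMMA):
    """Build the regression label y for each (s, a, r, s_next, terminal) tuple.

    `target_q(state)` stands in for the Target network and returns a list
    with one Q-value per action.
    """
    labels = []
    for _, _, reward, next_state, terminal in batch:
        if terminal:
            labels.append(reward)  # no future return after game over
        else:
            labels.append(reward + gamma * max(target_q(next_state)))
    return labels

# Experience replay: a bounded buffer that drops the oldest transitions.
replay = deque(maxlen=1000)
for step in range(50):
    # Toy transitions: the state is an integer, reward 1.0 only on the last step.
    terminal = step == 49
    replay.append((step, step % 2, 1.0 if terminal else 0.0, step + 1, terminal))

def toy_target_q(state):
    # Hypothetical fixed Q-values, standing in for a CNN forward pass.
    return [0.5, 2.0]

# Sampling at random breaks the correlation between consecutive frames.
minibatch = random.sample(list(replay), 8)
print(q_targets(minibatch, toy_target_q))
```

The MainNet would then be trained to make its Q-value for the chosen action match each label, typically with a squared-error loss.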

3. What does the DQN network model look like?

The code is explained below, with comments at the key points.

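Code Implementation

1. Importing the required modules:

#!/usr/bin/env python
# Note: this code uses the TensorFlow 1.x API (tf.placeholder, tf.train.Saver, ...).
from __future__ import print_function

import tensorflow as tf
import cv2
import sys
import random
import numpy as np
from collections import deque  # double-ended queue, used for the replay memory

sys.path.append("game/")  # add the game directory to the module search path
import dummy_game
import tetris_fun as game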

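2. Definitions and initialization before training:

GAME = 'fangkuai'  # game name, used in log and checkpoint file names
ACTIONS = 5  # number of game actions
GAMMA = 0.99  # discount factor in the reinforcement learning update
OBSERVE = 100000.  # number of timesteps in the observation period
EXPLORE = 2000000.  # number of timesteps over which epsilon is annealed
FINAL_EPSILON = 0.0001  # final (minimum) value of epsilon
INITIAL_EPSILON = 0.0001  # initial value of epsilon (equal to the final value here, so epsilon stays constant)
REPLAY_MEMORY = 50000  # capacity of the replay memory
BATCH = 32  # number of samples used in each parameter update
FRAME_PER_ACTION = 1  # number of frames per action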
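The roles of `OBSERVE`, `EXPLORE`, `INITIAL_EPSILON`, and `FINAL_EPSILON` are easier to see in a small standalone sketch of a linear annealing schedule. The function `epsilon_at` and the numbers passed to it are hypothetical and chosen only to make the decay visible; with the article's settings the initial and final values are both 0.0001, so epsilon never changes.

```python
def epsilon_at(t, initial, final, observe, explore):
    """Linearly anneal epsilon from `initial` to `final`.

    Epsilon stays at `initial` for the first `observe` steps, then drops by a
    constant amount per step until it reaches `final` after `explore` more steps.
    """
    if t <= observe:
        return initial
    progress = min(1.0, (t - observe) / float(explore))
    return initial + (final - initial) * progress

# Hypothetical values chosen only to make the decay visible.
for t in (0, 100, 600, 1100, 5000):
    print(t, epsilon_at(t, initial=1.0, final=0.1, observe=100, explore=1000))
```

A larger initial epsilon makes the agent explore more at the start, at the cost of slower early progress.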

3. Defining the helper functions and parameters of the neural network:

# Create a weight variable drawn from a truncated normal distribution with standard deviation 0.01
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev = 0.01)
    return tf.Variable(initial)

# Create a bias variable with an initial value of 0.01
def bias_variable(shape):
    initial = tf.constant(0.01, shape = shape)
    return tf.Variable(initial)

# Convolution: slide the kernel W over the input x with the given stride
def conv2d(x, W, stride):
    return tf.nn.conv2d(x, W, strides = [1, stride, stride, 1], padding = "SAME")

# Max pooling with a 2x2 window and a stride of 2
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize = [1, 2, 2, 1], strides = [1, 2, 2, 1], padding = "SAME")

# Build the Q-network, which estimates the value of each action in a given state
def createNetwork():
    # Convolutional and fully connected layer parameters
    W_conv1 = weight_variable([8, 8, 4, 32])
    b_conv1 = bias_variable([32])
    W_conv2 = weight_variable([4, 4, 32, 64])
    b_conv2 = bias_variable([64])
    W_conv3 = weight_variable([3, 3, 64, 64])
    b_conv3 = bias_variable([64])
    W_fc1 = weight_variable([1600, 512])
    b_fc1 = bias_variable([512])
    W_fc2 = weight_variable([512, ACTIONS])
    b_fc2 = bias_variable([ACTIONS])

    # Input layer: a stack of four 80x80 frames
    s = tf.placeholder("float", [None, 80, 80, 4])

    # Hidden layers with ReLU activations
    h_conv1 = tf.nn.relu(conv2d(s, W_conv1, 4) + b_conv1)
    h_pool1 = max_pool_2x2(h_conv1)
    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2, 2) + b_conv2)
    #h_pool2 = max_pool_2x2(h_conv2)
    h_conv3 = tf.nn.relu(conv2d(h_conv2, W_conv3, 1) + b_conv3)
    #h_pool3 = max_pool_2x2(h_conv3)
    #h_pool3_flat = tf.reshape(h_pool3, [-1, 256])
    h_conv3_flat = tf.reshape(h_conv3, [-1, 1600])
    h_fc1 = tf.nn.relu(tf.matmul(h_conv3_flat, W_fc1) + b_fc1)

    # Output layer: one Q-value per action
    readout = tf.matmul(h_fc1, W_fc2) + b_fc2
    return s, readout, h_fc1
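The value 1600 used when flattening `h_conv3` follows from the layer strides. With `padding = "SAME"`, TensorFlow produces an output of size ceil(input / stride) along each spatial dimension, so the arithmetic can be checked with a short standalone sketch (the helper `same_out` is introduced here for illustration only):

```python
import math

def same_out(size, stride):
    """Spatial output size of a conv/pool layer with padding="SAME"."""
    return int(math.ceil(size / float(stride)))

size = 80                      # input is 80 x 80 x 4
size = same_out(size, 4)       # conv1, stride 4  -> 20
size = same_out(size, 2)       # 2x2 max pool     -> 10
size = same_out(size, 2)       # conv2, stride 2  -> 5
size = same_out(size, 1)       # conv3, stride 1  -> 5
flat = size * size * 64        # conv3 has 64 output channels
print(size, flat)              # 5 1600
```

If the input resolution or any stride is changed, the first dimension of `W_fc1` and the reshape size have to be updated to match.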

4. The training code:

# Train the network: play the game, store transitions, and fit Q-values to the observed rewards
def trainNetwork(s, readout, h_fc1, sess):
    # Define the loss function
    a = tf.placeholder("float", [None, ACTIONS])
    y = tf.placeholder("float", [None])
    readout_action = tf.reduce_sum(tf.multiply(readout, a), reduction_indices=1)
    cost = tf.reduce_mean(tf.square(y - readout_action))
    train_step = tf.train.AdamOptimizer(1e-6).minimize(cost)

    # Start the game emulator; it opens a window that shows the game in real time
    game_state = game.GameState()

    # Create a deque to hold the replay memory
    D = deque()

    # Log files (the "logs_" + GAME directory must already exist)
    a_file = open("logs_" + GAME + "/readout.txt", 'w')
    h_file = open("logs_" + GAME + "/hidden.txt", 'w')

    # Get the initial state by doing nothing, and preprocess the frame into an 80x80x4 stack
    do_nothing = np.zeros(ACTIONS)
    do_nothing[0] = 1
    x_t, r_0, terminal = game_state.frame_step(do_nothing)  # frame_step is a method of the game program
    x_t = cv2.cvtColor(cv2.resize(x_t, (80, 80)), cv2.COLOR_BGR2GRAY)
    ret, x_t = cv2.threshold(x_t,1,255,cv2.THRESH_BINARY)  # binarize the image to black and white
    s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)

    # Used to load and save the network parameters
    saver = tf.train.Saver()
    sess.run(tf.initialize_all_variables())
    checkpoint = tf.train.get_checkpoint_state("saved_networks")  # directory where models are saved
    if checkpoint and checkpoint.model_checkpoint_path:
        saver.restore(sess, checkpoint.model_checkpoint_path)
        print("Successfully loaded:", checkpoint.model_checkpoint_path)
    else:
        print("Could not find old network weights")

    # Start training
    epsilon = INITIAL_EPSILON  # initial value of epsilon
    t = 0  # timestep counter
    while "flappy bird" != "angry bird":  # always true, so the loop runs forever
        # Choose an action with the epsilon-greedy policy
        readout_t = readout.eval(feed_dict={s : [s_t]})[0]
        a_t = np.zeros([ACTIONS])
        action_index = 0
        if t % FRAME_PER_ACTION == 0:
            # Take a random action with probability epsilon
            if random.random() <= epsilon:
                print("----------Random Action----------")
                action_index = random.randrange(ACTIONS)
                a_t[action_index] = 1  # use the same index that is logged
            # Otherwise take the action with the highest Q(s, a) predicted by the network
            else:
                action_index = np.argmax(readout_t)
                a_t[action_index] = 1
        else:
            a_t[0] = 1  # do nothing

        # Gradually lower epsilon as the game goes on, to reduce random actions
        if epsilon > FINAL_EPSILON and t > OBSERVE:
            epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE

        # Run the selected action and observe the next state and reward
        x_t1_colored, r_t, terminal = game_state.frame_step(a_t)  # returns the next frame, the reward, and whether the game has ended
        x_t1 = cv2.cvtColor(cv2.resize(x_t1_colored, (80, 80)), cv2.COLOR_BGR2GRAY)
        ret, x_t1 = cv2.threshold(x_t1, 1, 255, cv2.THRESH_BINARY)
        x_t1 = np.reshape(x_t1, (80, 80, 1))
        #s_t1 = np.append(x_t1, s_t[:,:,1:], axis = 2)
        s_t1 = np.append(x_t1, s_t[:, :, :3], axis=2)  # newest frame first, keep the three most recent old frames

        # Store the transition in D so it can be sampled during parameter updates
        D.append((s_t, a_t, r_t, s_t1, terminal))
        if len(D) > REPLAY_MEMORY:
            D.popleft()

        # Only update the network parameters after the observation period
        if t > OBSERVE:
            # Sample a random minibatch from D
            minibatch = random.sample(D, BATCH)

            # Split the batch into current states, actions, rewards, and next states
            s_j_batch = [d[0] for d in minibatch]
            a_batch = [d[1] for d in minibatch]
            r_batch = [d[2] for d in minibatch]
            s_j1_batch = [d[3] for d in minibatch]

            # Compute the new target values for Q(s, a)
            y_batch = []
            readout_j1_batch = readout.eval(feed_dict = {s : s_j1_batch})
            for i in range(0, len(minibatch)):
                terminal = minibatch[i][4]
                # If the game ended, the target is just the reward
                if terminal:
                    y_batch.append(r_batch[i])
                else:
                    y_batch.append(r_batch[i] + GAMMA * np.max(readout_j1_batch[i]))

            # Update the network parameters with one gradient step
            train_step.run(feed_dict = {
                y : y_batch,
                a : a_batch,
                s : s_j_batch}
            )

        # Advance to the next state for the next iteration
        s_t = s_t1
        t += 1

        # Save the parameters every 10000 iterations
        if t % 10000 == 0:
            saver.save(sess, 'saved_networks/' + GAME + '-dqn', global_step = t)

        # Print game information
        state = ""
        if t <= OBSERVE:
            state = "observe"
        elif t > OBSERVE and t <= OBSERVE + EXPLORE:
            state = "explore"
        else:
            state = "train"

        print("TIMESTEP", t, "/ STATE", state,
            "/ EPSILON", epsilon, "/ ACTION", action_index, "/ REWARD", r_t,
            "/ Q_MAX %e" % np.max(readout_t))

        # write info to files (disabled)
        '''
        if t % 10000 <= 100:
            a_file.write(",".join([str(x) for x in readout_t]) + ' ')
            h_file.write(",".join([str(x) for x in h_fc1.eval(feed_dict={s:[s_t]})[0]]) + ' ')
            cv2.imwrite("logs_tetris/frame" + str(t) + ".png", x_t1)
        '''
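The epsilon-greedy step at the top of the training loop can also be looked at in isolation. The following is a minimal pure-Python sketch, assuming the Q-values arrive as a plain list; `choose_action` is a hypothetical helper, not a function from the article's code. It returns both the chosen index and the one-hot action vector so that the action that is logged is always the action that is executed.

```python
import random

def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy choice; returns (action_index, one_hot_vector)."""
    n = len(q_values)
    if rng.random() < epsilon:
        index = rng.randrange(n)                          # explore
    else:
        index = max(range(n), key=lambda i: q_values[i])  # exploit (argmax)
    one_hot = [0] * n
    one_hot[index] = 1  # the same index is used for logging and for the action
    return index, one_hot

# With epsilon = 0 the choice is purely greedy.
print(choose_action([0.1, 0.7, 0.2, 0.0, 0.3], epsilon=0.0))
```

With `epsilon=1.0` every action is random, which corresponds to pure exploration.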

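Summary

The training results are shown below:

Score after 5 seconds of training:

Score after 10 seconds of training:

As these results show, the computer's Tetris play gets steadily better as it learns.

About the author: 李秋鍵 is a CSDN blog expert and the author of a CSDN 達人課 course. He is currently a master's student at China University of Mining and Technology. His projects include an Android wuxia (martial arts) game, a VIP video parser, and a text-rewriting writing bot; he has published several papers and has won multiple awards in advanced mathematics competitions.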
