I. Assignment Description
In this HW, you will implement some deep reinforcement learning methods yourself:
1. Policy Gradient
2. Actor-Critic
The environment for this HW is OpenAI Gym's lunar lander. The goal is to make the lunar lander touch down between the two flags.
What is a lunar lander?
"LunarLander-v2" simulates a spacecraft landing on the surface of the moon.
The task is to make the lander land "safely" on the landing pad between the two yellow flags. The landing pad is always at coordinates (0, 0), and these coordinates are the first two numbers in the state vector. "LunarLander-v2" actually consists of an "agent" and an "environment". In this assignment, we use the function "step()" to control the actions of the agent.
"step()" then returns the observation/state and the reward given by the environment...
Box(8,) means the observation is an 8-dimensional vector.
"Discrete(4)" means the agent can take four kinds of actions:
- 0 means the agent takes no action
- 2 means the agent fires the main engine
- 1 and 3 mean the agent fires the left and right orientation engines, accelerating it to the left or to the right
Next, we will try to make the agent interact with the environment.
Before taking any action, we recommend calling the "reset()" function to reset the environment. This function also returns the initial state of the environment.
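As a minimal sketch of creating the environment and resetting it (assuming the classic Gym API used throughout this assignment, where reset() returns only the state):
import gym

env = gym.make('LunarLander-v2')
print(env.observation_space)   # Box(8,)     -> the state is an 8-dimensional vector
print(env.action_space)        # Discrete(4) -> four possible actions
initial_state = env.reset()    # reset the environment and get the initial state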
1. Policy Gradient
Policy Gradient outputs the action (or a probability distribution over actions) directly from the state. The simplest way to produce this output is a neural network. How should the network be trained so that it eventually converges? With backpropagation we need a loss function and minimize it by gradient descent. In reinforcement learning, however, we do not know whether an action is correct or not; we can only judge how relatively good an action is from the reward. If an action receives a large reward, we increase its probability; if it receives a small reward, we decrease its probability.
In each iteration we have to collect a lot of interaction data before performing a single parameter update.
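This idea can be written as a surrogate loss whose gradient is the policy gradient; a minimal sketch (the variable names here are illustrative, not taken from the assignment code):
import torch

def policy_gradient_loss(log_probs, returns):
    # log_probs: log pi(a_t | s_t) of the sampled actions, shape (T,)
    # returns:   the reward signal used to weight each action, shape (T,)
    # Minimizing this loss raises the probability of actions with large
    # returns and lowers it for actions with small returns.
    return (-log_probs * returns).sum()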
2. Actor-Critic
Add a baseline (a critic) so we can judge whether an action is really better than average.
Assign a different weight to each action in a trajectory instead of the same total reward.
Combine this with a decay (discount) factor so that rewards far in the future count less for the current action.
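Putting these ideas together, each action is weighted by a discounted return minus a baseline; a minimal sketch (the names and the simple list-based implementation are illustrative):
def discounted_returns(rewards, gamma=0.99):
    # rewards: [r1, r2, ..., rT] collected in one episode
    # output:  [r1 + gamma*r2 + gamma^2*r3 + ..., ..., rT]
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def advantages(returns, values):
    # subtract a baseline, e.g. the critic's state-value estimates
    return [g - v for g, v in zip(returns, values)]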
II. Experiments
1. simple
# torch.set_deterministic(True)           # older API name, deprecated in newer PyTorch versions
torch.use_deterministic_algorithms(True)  # make PyTorch use deterministic algorithms for reproducibility
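For the simple baseline, the main change is making the run reproducible. The assignment notebook's own fix(env, seed) helper is not shown here; a minimal sketch of what such a helper typically does (an illustrative assumption, not the original code):
import random
import numpy as np
import torch

def seed_everything(env, seed):
    env.seed(seed)               # old Gym API: seed the environment
    env.action_space.seed(seed)  # seed action sampling
    torch.manual_seed(seed)
    np.random.seed(seed)
    random.seed(seed)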
training result:
testing:
test reward:
server:
score:
2. medium
……
NUM_BATCH = 500        # update the agent for 500 batches in total
rate = 0.99            # discount factor for the accumulative decaying reward
……
while True:
    action, log_prob = agent.sample(state)  # at, log(at|st)
    next_state, reward, done, _ = env.step(action)
    log_probs.append(log_prob)  # [log(a1|s1), log(a2|s2), ..., log(at|st)]
    seq_rewards.append(reward)
    state = next_state
    total_reward += reward
    total_step += 1
    if done:
        final_rewards.append(reward)
        total_rewards.append(total_reward)
        # calculate accumulative (discounted) rewards back to front:
        # seq_rewards[t] becomes r_t + rate*r_{t+1} + rate^2*r_{t+2} + ...
        for i in range(2, len(seq_rewards) + 1):
            seq_rewards[-i] += rate * seq_rewards[-i + 1]
        rewards += seq_rewards
        break
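As a quick check of the accumulation above, with a hypothetical three-step episode:
seq_rewards = [1.0, 2.0, 3.0]
rate = 0.99
for i in range(2, len(seq_rewards) + 1):
    seq_rewards[-i] += rate * seq_rewards[-i + 1]
print(seq_rewards)  # [1 + 0.99*(2 + 0.99*3), 2 + 0.99*3, 3.0] = [5.9203, 4.97, 3.0]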
training result:
testing:
test reward:
server:
score:
3. strong
from torch.optim.lr_scheduler import StepLR
class ActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        # shared feature extractor for both actor and critic
        self.fc = nn.Sequential(
            nn.Linear(8, 16),
            nn.Tanh(),
            nn.Linear(16, 16),
            nn.Tanh()
        )
        self.actor = nn.Linear(16, 4)   # outputs action logits
        self.critic = nn.Linear(16, 1)  # outputs a state-value estimate
        self.values = []                # value estimates collected during sampling
        self.optimizer = optim.SGD(self.parameters(), lr=0.001)

    def forward(self, state):
        hid = self.fc(state)
        self.values.append(self.critic(hid).squeeze(-1))
        return F.softmax(self.actor(hid), dim=-1)

    def learn(self, log_probs, rewards):
        values = torch.stack(self.values)
        # use the critic's value as a baseline: weight each log-prob by (reward - value)
        loss = (-log_probs * (rewards - values.detach())).sum()
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        self.values = []

    def sample(self, state):
        action_prob = self(torch.FloatTensor(state))
        action_dist = Categorical(action_prob)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)
        return action.item(), log_prob
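Note that in the learn() above, only the actor and the shared layers receive gradients: the critic's output is detached inside the loss, so the critic's own weights are never updated. A common remedy is to add a value-regression term for the critic. A hedged sketch of such an alternative learn() (an assumption about how one might extend it, not the notebook's original code; it assumes rewards holds the discounted returns as a float tensor):
    def learn(self, log_probs, rewards):
        values = torch.stack(self.values)
        advantage = rewards - values.detach()
        actor_loss = (-log_probs * advantage).sum()
        # regress the critic's value estimates toward the observed returns
        critic_loss = F.mse_loss(values, rewards.float(), reduction='sum')
        loss = actor_loss + critic_loss
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        self.values = []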
training result:
testing:
test reward:
server:
score:
III. Code
**Preparation**
First, we need to install all the necessary packages.
One of them is gym, built by OpenAI, which is a toolkit for developing reinforcement learning algorithms.
We can use "step()" to make the agent act according to the randomly selected action "random_action".
The "step()" function returns four values:
- observation / state
- reward
- done (True / False)
- other information
observation, reward, done, info = env.step(random_action)
print(done)
Reward
The landing pad is always at coordinates (0, 0), and these coordinates are the first two numbers in the state vector. The reward for moving from the top of the screen to the landing pad with zero speed is about 100 to 140 points. If the lander moves away from the landing pad, it loses reward. An episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points respectively. Each leg with ground contact is worth +10 points. Firing the main engine costs -0.3 points per frame. Solving the task means reaching 200 points.
Random agent
Before we start training, let's see whether a random agent can land on the moon successfully.
env.reset()
img = plt.imshow(env.render(mode='rgb_array'))
done = False
while not done:
    action = env.action_space.sample()
    observation, reward, done, _ = env.step(action)
    img.set_data(env.render(mode='rgb_array'))
    display.display(plt.gcf())        # show the current figure
    display.clear_output(wait=True)
Policy Gradient
Now we can build a simple policy network. The network takes the state as input and returns a probability distribution over the actions in the action space.
class PolicyGradientNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 16)
        self.fc3 = nn.Linear(16, 4)

    def forward(self, state):
        hid = torch.tanh(self.fc1(state))
        hid = torch.tanh(self.fc2(hid))
        return F.softmax(self.fc3(hid), dim=-1)
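A quick sanity check of the shapes, assuming the notebook's imports (torch, etc.) are available:
net = PolicyGradientNetwork()
dummy_state = torch.zeros(8)      # a fake 8-dimensional observation
probs = net(dummy_state)
print(probs.shape, probs.sum())   # torch.Size([4]), probabilities that sum to 1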
Then we need to build a simple agent. The agent acts according to the output of the policy network above. The agent can do a few things:
- learn(): update the policy network from the log probabilities and rewards.
- sample(): after receiving an observation from the environment, use the policy network to decide which action to take. The return values of this function are the action and its log probability.
from torch.optim.lr_scheduler import StepLR

class PolicyGradientAgent():
    def __init__(self, network):
        self.network = network
        self.optimizer = optim.SGD(self.network.parameters(), lr=0.001)

    def forward(self, state):
        return self.network(state)

    def learn(self, log_probs, rewards):
        loss = (-log_probs * rewards).sum()  # You don't need to revise this to pass simple baseline (but you can)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    def sample(self, state):
        action_prob = self.network(torch.FloatTensor(state))
        action_dist = Categorical(action_prob)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)
        return action.item(), log_prob
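StepLR is imported above but never used. If you want a learning-rate schedule, one possible way to wire it in (an illustrative assumption, not part of the original notebook) is:
class PolicyGradientAgentWithScheduler(PolicyGradientAgent):
    def __init__(self, network):
        super().__init__(network)
        # decay the learning rate by a factor of 0.9 every 100 calls to learn()
        self.scheduler = StepLR(self.optimizer, step_size=100, gamma=0.9)

    def learn(self, log_probs, rewards):
        super().learn(log_probs, rewards)
        self.scheduler.step()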
Training the agent
Now let's start to train our agent.
By treating all the interactions between the agent and the environment as training data, the policy network can learn from all of these attempts.
agent.network.train()  # switch the network into training mode
EPISODE_PER_BATCH = 5  # update the agent every 5 episodes
NUM_BATCH = 500        # update the agent for 500 batches in total
avg_total_rewards, avg_final_rewards = [], []
prg_bar = tqdm(range(NUM_BATCH))  # progress bar
for batch in prg_bar:
    log_probs, rewards = [], []
    total_rewards, final_rewards = [], []
    # collect trajectories
    for episode in range(EPISODE_PER_BATCH):
        state = env.reset()
        total_reward, total_step = 0, 0
        seq_rewards = []
        while True:
            action, log_prob = agent.sample(state)  # at, log(at|st)
            next_state, reward, done, _ = env.step(action)
            log_probs.append(log_prob)  # [log(a1|s1), log(a2|s2), ..., log(at|st)]
            # seq_rewards.append(reward)
            state = next_state
            total_reward += reward
            total_step += 1
            rewards.append(reward)  # change here
            # ! IMPORTANT !
            # Current reward implementation: immediate reward, given action_list: a1, a2, a3, ...
            #                                                  rewards:           r1, r2, r3, ...
            # medium: change "rewards" to the accumulative decaying reward, given action_list: a1, a2, a3, ...
            #         rewards: r1+0.99*r2+0.99^2*r3+..., r2+0.99*r3+0.99^2*r4+..., r3+0.99*r4+0.99^2*r5+...
            # boss  : implement Actor-Critic
            if done:
                final_rewards.append(reward)
                total_rewards.append(total_reward)
                break
    print(f"rewards looks like ", np.shape(rewards))
    print(f"log_probs looks like ", np.shape(log_probs))
    # record the training process
    avg_total_reward = sum(total_rewards) / len(total_rewards)
    avg_final_reward = sum(final_rewards) / len(final_rewards)
    avg_total_rewards.append(avg_total_reward)
    avg_final_rewards.append(avg_final_reward)
    prg_bar.set_description(f"Total: {avg_total_reward: 4.1f}, Final: {avg_final_reward: 4.1f}")
    # update the agent
    # rewards = np.concatenate(rewards, axis=0)
    rewards = (rewards - np.mean(rewards)) / (np.std(rewards) + 1e-9)  # normalize the rewards (zero mean, unit standard deviation)
    # torch.stack joins the per-step log-prob tensors (all the same shape) into one tensor along a new dimension;
    # torch.from_numpy creates a tensor from the normalized reward array
    agent.learn(torch.stack(log_probs), torch.from_numpy(rewards))
    print("logs prob looks like ", torch.stack(log_probs).size())
    print("torch.from_numpy(rewards) looks like ", torch.from_numpy(rewards).size())
Training result
During training, we recorded "avg_total_reward", which denotes the average total reward of the episodes before updating the policy network. In theory, if the agent gets better, the avg_total_reward increases.
plt.plot(avg_total_rewards)
plt.title("Total Rewards")
plt.show()
In addition, "avg_final_reward" denotes the average final reward of the episodes. Specifically, the final reward is the last reward received in an episode, which indicates whether the craft landed successfully or not.
plt.plot(avg_final_rewards)
plt.title("Final Rewards")
plt.show()
Testing
The testing result will be the average reward of 5 testing runs.
fix(env, seed)
agent.network.eval()  # set the network into evaluation mode
NUM_OF_TEST = 5  # Do not revise this !!!
test_total_reward = []
action_list = []
for i in range(NUM_OF_TEST):
    actions = []
    state = env.reset()
    img = plt.imshow(env.render(mode='rgb_array'))
    total_reward = 0
    done = False
    while not done:
        action, _ = agent.sample(state)
        actions.append(action)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        img.set_data(env.render(mode='rgb_array'))
        display.display(plt.gcf())
        display.clear_output(wait=True)
    print(total_reward)
    test_total_reward.append(total_reward)
    action_list.append(actions)  # save the result of testing
Action distribution
distribution = {}
for actions in action_list:
    for action in actions:
        if action not in distribution.keys():
            distribution[action] = 1
        else:
            distribution[action] += 1
print(distribution)
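The server-simulation code below loads the action list from PATH, so the testing result has to be saved to that file first; a minimal sketch (the filename here is only an assumption):
PATH = "Action_List.npy"  # hypothetical filename; use whatever path the assignment expects
np.save(PATH, np.array(action_list, dtype=object))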
Server
The code below simulates the environment on the judge server. It can be used for testing.
action_list = np.load(PATH, allow_pickle=True)  # The action list you upload
seed = 543  # Do not revise this
fix(env, seed)
agent.network.eval()  # set the network to evaluation mode
test_total_reward = []
if len(action_list) != 5:
    print("Wrong format of file !!!")
    exit(0)
for actions in action_list:
    state = env.reset()
    img = plt.imshow(env.render(mode='rgb_array'))
    total_reward = 0
    done = False
    for action in actions:
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    print(f"Your reward is : %.2f"%total_reward)
    test_total_reward.append(total_reward)
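To get a single score from the simulation, you can average the five episode rewards (a one-line sketch in the same style as the print above):
print(f"Your final reward is : %.2f"%np.mean(test_total_reward))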