Gazebo仿真環(huán)境下的強化學習實現(xiàn)
主體源碼參照《Goal-Driven Autonomous Exploration Through Deep Reinforcement Learning》
1. 源碼拉取
git clone https://github.com/reiniscimurs/DRL-robot-navigation
筆者采用其強化學習方法,但是對于仿真環(huán)境以及機器人模型仍然用自己的包,源碼中采用了與論文強相關的用法
2. 強化學習實現(xiàn)
2.1 環(huán)境
源碼:
class GazeboEnv:
"""Superclass for all Gazebo environments."""
def __init__(self, launchfile, environment_dim):
self.environment_dim = environment_dim
self.odom_x = 0
self.odom_y = 0
self.goal_x = 1
self.goal_y = 0.0
self.upper = 5.0
self.lower = -5.0
self.velodyne_data = np.ones(self.environment_dim) * 10
self.last_odom = None
self.set_self_state = ModelState()
self.set_self_state.model_name = "r1"
self.set_self_state.pose.position.x = 0.0
self.set_self_state.pose.position.y = 0.0
self.set_self_state.pose.position.z = 0.0
self.set_self_state.pose.orientation.x = 0.0
self.set_self_state.pose.orientation.y = 0.0
self.set_self_state.pose.orientation.z = 0.0
self.set_self_state.pose.orientation.w = 1.0
self.gaps = [[-np.pi / 2 - 0.03, -np.pi / 2 + np.pi / self.environment_dim]]
for m in range(self.environment_dim - 1):
self.gaps.append(
[self.gaps[m][1], self.gaps[m][1] + np.pi / self.environment_dim]
)
self.gaps[-1][-1] += 0.03
port = "11311"
subprocess.Popen(["roscore", "-p", port])
print("Roscore launched!")
# Launch the simulation with the given launchfile name
rospy.init_node("gym", anonymous=True)
if launchfile.startswith("/"):
fullpath = launchfile
else:
fullpath = os.path.join(os.path.dirname(__file__), "assets", launchfile)
if not path.exists(fullpath):
raise IOError("File " + fullpath + " does not exist")
subprocess.Popen(["roslaunch", "-p", port, fullpath])
print("Gazebo launched!")
# Set up the ROS publishers and subscribers
self.vel_pub = rospy.Publisher("/r1/cmd_vel", Twist, queue_size=1)
self.set_state = rospy.Publisher(
"gazebo/set_model_state", ModelState, queue_size=10
)
self.unpause = rospy.ServiceProxy("/gazebo/unpause_physics", Empty)
self.pause = rospy.ServiceProxy("/gazebo/pause_physics", Empty)
self.reset_proxy = rospy.ServiceProxy("/gazebo/reset_world", Empty)
self.publisher = rospy.Publisher("goal_point", MarkerArray, queue_size=3)
self.publisher2 = rospy.Publisher("linear_velocity", MarkerArray, queue_size=1)
self.publisher3 = rospy.Publisher("angular_velocity", MarkerArray, queue_size=1)
self.velodyne = rospy.Subscriber(
"/velodyne_points", PointCloud2, self.velodyne_callback, queue_size=1
)
self.odom = rospy.Subscriber(
"/r1/odom", Odometry, self.odom_callback, queue_size=1
)
強化學習中環(huán)境用于產(chǎn)生狀態(tài)輸入,并通過智能體的動作產(chǎn)生新的狀態(tài),在類中設定了目標和里程計的私有域,代表了自己當前的目標點和位姿信息。
初始化方法中設置了若干發(fā)布者和訂閱者/gazebo/unpause_physics
用于在Gazebo仿真環(huán)境中取消暫停物理模擬。當Gazebo仿真環(huán)境處于暫停狀態(tài)時,物理模擬會停止,仿真中的物體將不再移動或交互。通過調(diào)用 /gazebo/unpause_physics 服務,你可以使仿真環(huán)境恢復到正常運行狀態(tài),允許物理模擬繼續(xù)進行。/gazebo/pause_physics
是一個ROS服務,用于在Gazebo仿真環(huán)境中暫停物理模擬。通過調(diào)用 /gazebo/pause_physics 服務,你可以將仿真環(huán)境的物理模擬暫停,這將導致仿真中的物體停止運動,仿真時間暫停。
注意,在初始化方法中可以看出,roslaunch也是通過程序啟動,節(jié)點一并啟動(在本機有Anaconda的情況下會有麻煩)
2.2 動作空間
由發(fā)布者可以看出強化學習的策略的動作空間:linear_velocity
:輸出線速度angular_velocity
:輸出角速度
2.3 狀態(tài)空間
如圖,作者的強化學習架構(gòu)為其輸入空間為24維向量,其中,180度掃描的激光點進行了20等分,選取每個等分區(qū)域的最小值作為顯著性代表值,剩下的4維為與目標的距離、與目標的轉(zhuǎn)角差、機器人的線速度、機器人的角速度
distance = np.linalg.norm(
[self.odom_x - self.goal_x, self.odom_y - self.goal_y]
)
theta = beta - angle
if theta > np.pi:
theta = np.pi - theta
theta = -np.pi - theta
if theta < -np.pi:
theta = -np.pi - theta
theta = np.pi - theta
vel_cmd = Twist()
vel_cmd.linear.x = action[0]
vel_cmd.angular.z = action[1]
robot_state = [distance, theta, action[0], action[1]]
def velodyne_callback(self, v):
data = list(pc2.read_points(v, skip_nans=False, field_names=("x", "y", "z")))
self.velodyne_data = np.ones(self.environment_dim) * 10
for i in range(len(data)):
if data[i][2] > -0.2:
dot = data[i][0] * 1 + data[i][1] * 0
mag1 = math.sqrt(math.pow(data[i][0], 2) + math.pow(data[i][1], 2))
mag2 = math.sqrt(math.pow(1, 2) + math.pow(0, 2))
beta = math.acos(dot / (mag1 * mag2)) * np.sign(data[i][1])
dist = math.sqrt(data[i][0] ** 2 + data[i][1] ** 2 + data[i][2] ** 2)
for j in range(len(self.gaps)):
if self.gaps[j][0] <= beta < self.gaps[j][1]:
self.velodyne_data[j] = min(self.velodyne_data[j], dist)
break
2.4 獎勵空間
在每一步仿真后都會計算獎勵,獎勵的準則為到達目標加200分,碰撞-100分,再根據(jù)角速度和線速度以及顯著性的最小激光點進行動態(tài)調(diào)整,引導機器人
def step(self, action):
...
reward = self.get_reward(target, collision, action, min_laser)
return state, reward, done, target
def get_reward(target, collision, action, min_laser):
if target:
return 100.0
elif collision:
return -100.0
else:
r3 = lambda x: 1 - x if x < 1 else 0.0
return action[0] / 2 - abs(action[1]) / 2 - r3(min_laser) / 2
2.5 TD3訓練
采用Actor-Critic的方法進行訓練,訓練為常規(guī)的強化學習架構(gòu),本文按照讀者有強化學習基礎行進,重要的是輸入的參數(shù),要定義好維度問題。
import os
import time
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from numpy import inf
from torch.utils.tensorboard import SummaryWriter
from replay_buffer import ReplayBuffer
from velodyne_env import GazeboEnv
def evaluate(network, epoch, eval_episodes=10):
avg_reward = 0.0
col = 0
for _ in range(eval_episodes):
count = 0
state = env.reset()
done = False
while not done and count < 501:
action = network.get_action(np.array(state))
a_in = [(action[0] + 1) / 2, action[1]]
state, reward, done, _ = env.step(a_in)
avg_reward += reward
count += 1
if reward < -90:
col += 1
avg_reward /= eval_episodes
avg_col = col / eval_episodes
print("..............................................")
print(
"Average Reward over %i Evaluation Episodes, Epoch %i: %f, %f"
% (eval_episodes, epoch, avg_reward, avg_col)
)
print("..............................................")
return avg_reward
class Actor(nn.Module):
def __init__(self, state_dim, action_dim):
super(Actor, self).__init__()
self.layer_1 = nn.Linear(state_dim, 800)
self.layer_2 = nn.Linear(800, 600)
self.layer_3 = nn.Linear(600, action_dim)
self.tanh = nn.Tanh()
def forward(self, s):
s = F.relu(self.layer_1(s))
s = F.relu(self.layer_2(s))
a = self.tanh(self.layer_3(s))
return a
class Critic(nn.Module):
def __init__(self, state_dim, action_dim):
super(Critic, self).__init__()
self.layer_1 = nn.Linear(state_dim, 800)
self.layer_2_s = nn.Linear(800, 600)
self.layer_2_a = nn.Linear(action_dim, 600)
self.layer_3 = nn.Linear(600, 1)
self.layer_4 = nn.Linear(state_dim, 800)
self.layer_5_s = nn.Linear(800, 600)
self.layer_5_a = nn.Linear(action_dim, 600)
self.layer_6 = nn.Linear(600, 1)
def forward(self, s, a):
s1 = F.relu(self.layer_1(s))
self.layer_2_s(s1)
self.layer_2_a(a)
s11 = torch.mm(s1, self.layer_2_s.weight.data.t())
s12 = torch.mm(a, self.layer_2_a.weight.data.t())
s1 = F.relu(s11 + s12 + self.layer_2_a.bias.data)
q1 = self.layer_3(s1)
s2 = F.relu(self.layer_4(s))
self.layer_5_s(s2)
self.layer_5_a(a)
s21 = torch.mm(s2, self.layer_5_s.weight.data.t())
s22 = torch.mm(a, self.layer_5_a.weight.data.t())
s2 = F.relu(s21 + s22 + self.layer_5_a.bias.data)
q2 = self.layer_6(s2)
return q1, q2
# TD3 network
class TD3(object):
def __init__(self, state_dim, action_dim, max_action):
# Initialize the Actor network
self.actor = Actor(state_dim, action_dim).to(device)
self.actor_target = Actor(state_dim, action_dim).to(device)
self.actor_target.load_state_dict(self.actor.state_dict())
self.actor_optimizer = torch.optim.Adam(self.actor.parameters())
# Initialize the Critic networks
self.critic = Critic(state_dim, action_dim).to(device)
self.critic_target = Critic(state_dim, action_dim).to(device)
self.critic_target.load_state_dict(self.critic.state_dict())
self.critic_optimizer = torch.optim.Adam(self.critic.parameters())
self.max_action = max_action
self.writer = SummaryWriter()
self.iter_count = 0
def get_action(self, state):
# Function to get the action from the actor
state = torch.Tensor(state.reshape(1, -1)).to(device)
return self.actor(state).cpu().data.numpy().flatten()
# training cycle
def train(
self,
replay_buffer,
iterations,
batch_size=100,
discount=1,
tau=0.005,
policy_noise=0.2, # discount=0.99
noise_clip=0.5,
policy_freq=2,
):
av_Q = 0
max_Q = -inf
av_loss = 0
for it in range(iterations):
# sample a batch from the replay buffer
(
batch_states,
batch_actions,
batch_rewards,
batch_dones,
batch_next_states,
) = replay_buffer.sample_batch(batch_size)
state = torch.Tensor(batch_states).to(device)
next_state = torch.Tensor(batch_next_states).to(device)
action = torch.Tensor(batch_actions).to(device)
reward = torch.Tensor(batch_rewards).to(device)
done = torch.Tensor(batch_dones).to(device)
# Obtain the estimated action from the next state by using the actor-target
next_action = self.actor_target(next_state)
# Add noise to the action
noise = torch.Tensor(batch_actions).data.normal_(0, policy_noise).to(device)
noise = noise.clamp(-noise_clip, noise_clip)
next_action = (next_action + noise).clamp(-self.max_action, self.max_action)
# Calculate the Q values from the critic-target network for the next state-action pair
target_Q1, target_Q2 = self.critic_target(next_state, next_action)
# Select the minimal Q value from the 2 calculated values
target_Q = torch.min(target_Q1, target_Q2)
av_Q += torch.mean(target_Q)
max_Q = max(max_Q, torch.max(target_Q))
# Calculate the final Q value from the target network parameters by using Bellman equation
target_Q = reward + ((1 - done) * discount * target_Q).detach()
# Get the Q values of the basis networks with the current parameters
current_Q1, current_Q2 = self.critic(state, action)
# Calculate the loss between the current Q value and the target Q value
loss = F.mse_loss(current_Q1, target_Q) + F.mse_loss(current_Q2, target_Q)
# Perform the gradient descent
self.critic_optimizer.zero_grad()
loss.backward()
self.critic_optimizer.step()
if it % policy_freq == 0:
# Maximize the actor output value by performing gradient descent on negative Q values
# (essentially perform gradient ascent)
actor_grad, _ = self.critic(state, self.actor(state))
actor_grad = -actor_grad.mean()
self.actor_optimizer.zero_grad()
actor_grad.backward()
self.actor_optimizer.step()
# Use soft update to update the actor-target network parameters by
# infusing small amount of current parameters
for param, target_param in zip(
self.actor.parameters(), self.actor_target.parameters()
):
target_param.data.copy_(
tau * param.data + (1 - tau) * target_param.data
)
# Use soft update to update the critic-target network parameters by infusing
# small amount of current parameters
for param, target_param in zip(
self.critic.parameters(), self.critic_target.parameters()
):
target_param.data.copy_(
tau * param.data + (1 - tau) * target_param.data
)
av_loss += loss
self.iter_count += 1
# Write new values for tensorboard
self.writer.add_scalar("loss", av_loss / iterations, self.iter_count)
self.writer.add_scalar("Av. Q", av_Q / iterations, self.iter_count)
self.writer.add_scalar("Max. Q", max_Q, self.iter_count)
def save(self, filename, directory):
torch.save(self.actor.state_dict(), "%s/%s_actor.pth" % (directory, filename))
torch.save(self.critic.state_dict(), "%s/%s_critic.pth" % (directory, filename))
def load(self, filename, directory):
self.actor.load_state_dict(
torch.load("%s/%s_actor.pth" % (directory, filename))
)
self.critic.load_state_dict(
torch.load("%s/%s_critic.pth" % (directory, filename))
)
# Set the parameters for the implementation
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") # cuda or cpu
seed = 0 # Random seed number
eval_freq = 5e3 # After how many steps to perform the evaluation
max_ep = 500 # maximum number of steps per episode
eval_ep = 10 # number of episodes for evaluation
max_timesteps = 5e6 # Maximum number of steps to perform
expl_noise = 1 # Initial exploration noise starting value in range [expl_min ... 1]
expl_decay_steps = (
500000 # Number of steps over which the initial exploration noise will decay over
)
expl_min = 0.1 # Exploration noise after the decay in range [0...expl_noise]
batch_size = 40 # Size of the mini-batch
discount = 0.99999 # Discount factor to calculate the discounted future reward (should be close to 1)
tau = 0.005 # Soft target update variable (should be close to 0)
policy_noise = 0.2 # Added noise for exploration
noise_clip = 0.5 # Maximum clamping values of the noise
policy_freq = 2 # Frequency of Actor network updates
buffer_size = 1e6 # Maximum size of the buffer
file_name = "TD3_velodyne" # name of the file to store the policy
save_model = True # Weather to save the model or not
load_model = False # Weather to load a stored model
random_near_obstacle = True # To take random actions near obstacles or not
# Create the network storage folders
if not os.path.exists("./results"):
os.makedirs("./results")
if save_model and not os.path.exists("./pytorch_models"):
os.makedirs("./pytorch_models")
# Create the training environment
environment_dim = 20
robot_dim = 4
env = GazeboEnv("multi_robot_scenario.launch", environment_dim)
time.sleep(5)
torch.manual_seed(seed)
np.random.seed(seed)
state_dim = environment_dim + robot_dim
action_dim = 2
max_action = 1
# Create the network
network = TD3(state_dim, action_dim, max_action)
# Create a replay buffer
replay_buffer = ReplayBuffer(buffer_size, seed)
if load_model:
try:
network.load(file_name, "./pytorch_models")
except:
print(
"Could not load the stored model parameters, initializing training with random parameters"
)
# Create evaluation data store
evaluations = []
timestep = 0
timesteps_since_eval = 0
episode_num = 0
done = True
epoch = 1
count_rand_actions = 0
random_action = []
# Begin the training loop
while timestep < max_timesteps:
# On termination of episode
if done:
if timestep != 0:
network.train(
replay_buffer,
episode_timesteps,
batch_size,
discount,
tau,
policy_noise,
noise_clip,
policy_freq,
)
if timesteps_since_eval >= eval_freq:
print("Validating")
timesteps_since_eval %= eval_freq
evaluations.append(
evaluate(network=network, epoch=epoch, eval_episodes=eval_ep)
)
network.save(file_name, directory="./pytorch_models")
np.save("./results/%s" % (file_name), evaluations)
epoch += 1
state = env.reset()
done = False
episode_reward = 0
episode_timesteps = 0
episode_num += 1
# add some exploration noise
if expl_noise > expl_min:
expl_noise = expl_noise - ((1 - expl_min) / expl_decay_steps)
action = network.get_action(np.array(state))
action = (action + np.random.normal(0, expl_noise, size=action_dim)).clip(
-max_action, max_action
)
# If the robot is facing an obstacle, randomly force it to take a consistent random action.
# This is done to increase exploration in situations near obstacles.
# Training can also be performed without it
if random_near_obstacle:
if (
np.random.uniform(0, 1) > 0.85
and min(state[4:-8]) < 0.6
and count_rand_actions < 1
):
count_rand_actions = np.random.randint(8, 15)
random_action = np.random.uniform(-1, 1, 2)
if count_rand_actions > 0:
count_rand_actions -= 1
action = random_action
action[0] = -1
# Update action to fall in range [0,1] for linear velocity and [-1,1] for angular velocity
a_in = [(action[0] + 1) / 2, action[1]]
next_state, reward, done, target = env.step(a_in)
done_bool = 0 if episode_timesteps + 1 == max_ep else int(done)
done = 1 if episode_timesteps + 1 == max_ep else int(done)
episode_reward += reward
# Save the tuple in replay buffer
replay_buffer.add(state, action, reward, done_bool, next_state)
# Update the counters
state = next_state
episode_timesteps += 1
timestep += 1
timesteps_since_eval += 1
# After the training is done, evaluate the network and save it
evaluations.append(evaluate(network=network, epoch=epoch, eval_episodes=eval_ep))
if save_model:
network.save("%s" % file_name, directory="./models")
np.save("./results/%s" % file_name, evaluations)
3. 總結(jié)
實際上利用Gazebo進行強化學習無非是環(huán)境獲取上的不同,Gazebo的環(huán)境控制需要使用ROS服務進行控制,狀態(tài)可以通過Gazebo進行獲取,同時某些必要數(shù)據(jù)需要從話題中獲取,最重要的是組織獲取 的數(shù)據(jù)(通過話題等)與控制Gazebo的仿真步驟之間的組合。
強化學習的手段與傳統(tǒng)算法無異,強化學習方法最重要的是要確定動作空間、狀態(tài)空間、獎勵空間,這三個空間是work的前提。
在筆者自己的環(huán)境中,將獲取里程計的話題以及獲取激光點的數(shù)據(jù)替換成為了自己的本機的仿真環(huán)境匹配的話題。
self.point_cloud = rospy.Subscriber(
"/scan", LaserScan, self.laser_callback, queue_size=1
)
self.odom = rospy.Subscriber(
"/odom", Odometry, self.odom_callback, queue_size=1
)
self.frontiers_sub = rospy.Subscriber(
"/frontiers", Marker, self.frontiers, queue_size=1
)
self.center_frontiers_sub = rospy.Subscriber(
"/center_frontiers", Marker, self.center_frontiers, queue_size=1
)
與Gazebo仿真環(huán)境相關的話題有很多,用于在ROS中與仿真環(huán)境進行通信和交互。以下是一些常見的與Gazebo相關的話題以及它們的作用:
/gazebo/model_states
:這個話題發(fā)布了所有仿真中模型的狀態(tài)信息,包括它們的位置、姿態(tài)、線速度、角速度等。
/gazebo/link_states
:類似于 /gazebo/model_states
,但是發(fā)布了模型的每個鏈接的狀態(tài)信息,適用于多鏈接模型。
/gazebo/set_model_state
:這是一個用于設置模型狀態(tài)的話題。通過向該話題發(fā)送消息,你可以控制模型的位置、姿態(tài)等屬性。
/gazebo/set_link_state
:類似于 /gazebo/set_model_state
,但用于設置鏈接的狀態(tài),允許你更精細地控制模型的不同部分。
/gazebo/unpause_physics
和 /gazebo/pause_physics
:這些話題用于控制仿真的暫停和繼續(xù)。通過向 /gazebo/unpause_physics 發(fā)送請求,可以取消暫停仿真物理模擬,而通過向 /gazebo/pause_physics 發(fā)送請求,可以將仿真暫停。
/gazebo/reset_world
和 /gazebo/reset_simulation
:這些話題用于重置仿真環(huán)境。 /gazebo/reset_world 用于重置仿真世界的狀態(tài),而 /gazebo/reset_simulation 用于重置仿真整個仿真會話的狀態(tài)。
/gazebo/delete_model
:通過向這個話題發(fā)送消息,你可以請求刪除指定名稱的模型,從仿真環(huán)境中移除它。
/gazebo/spawn_sdf_model
和 /gazebo/spawn_urdf_model
:這些話題用于在仿真環(huán)境中生成新的SDF或URDF模型。通過向這些話題發(fā)送消息,你可以在仿真環(huán)境中動態(tài)生成模型。
/gazebo/apply_body_wrench
:這個話題用于對仿真環(huán)境中的物體施加力或扭矩,以模擬外部力或操作。文章來源:http://www.zghlxwxcb.cn/news/detail-688767.html
/gazebo/camera/*
:用于模擬相機傳感器,例如 /gazebo/camera/image
可以獲取相機的圖像數(shù)據(jù)。文章來源地址http://www.zghlxwxcb.cn/news/detail-688767.html
到了這里,關于Gazebo仿真環(huán)境下的強化學習實現(xiàn)的文章就介紹完了。如果您還想了解更多內(nèi)容,請在右上角搜索TOY模板網(wǎng)以前的文章或繼續(xù)瀏覽下面的相關文章,希望大家以后多多支持TOY模板網(wǎng)!