Policy Iteration Adaptive Dynamic Programming Algorithm for Discrete-Time Nonlinear Systems, 2014, Derong Liu, Fellow, IEEE, and Qinglai Wei, Member, IEEE
This paper is the first to analyze the convergence and stability of a policy iteration (PI) algorithm for discrete-time nonlinear systems. An initial admissible control law is obtained first (e.g., by repeated trials); the iterative value function is then monotonically nonincreasing and converges to the optimal value of the HJB equation. It is proven that every iterative control law stabilizes the nonlinear system. Neural networks are used to approximate the value function and to compute the optimal control, and the convergence of the weight matrices is analyzed.
According to "Discrete-Time Nonlinear HJB Solution Using Approximate Dynamic Programming: Convergence Proof" (Asma Al-Tamimi, Frank L. Lewis, Murad Abu-Khalaf, IEEE Transactions on Systems, Man, and Cybernetics, Part B, 2008): with a zero initial value function, the value iteration (VI) algorithm produces iterative control laws that are not guaranteed to stabilize the system; only the converged control law can be used to control the system.
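As a point of comparison, the VI recursion analyzed in that reference can be sketched as follows, using the standard ADP notation $x_{k+1}=F(x_k,u_k)$ for the dynamics and $U(x_k,u_k)$ for the utility (these symbols are assumed here, not defined in this note):

$$V_0(x_k) \equiv 0, \qquad v_i(x_k) = \arg\min_{u_k}\big\{U(x_k,u_k)+V_i(F(x_k,u_k))\big\}, \qquad V_{i+1}(x_k) = \min_{u_k}\big\{U(x_k,u_k)+V_i(F(x_k,u_k))\big\}.$$

The value functions $V_i$ are nondecreasing and approach $J^*$ from below, but the intermediate laws $v_i$ need not be stabilizing.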
Conventional VI algorithms are carried out offline, whereas the PI algorithm is close to an online iterative ADP scheme.
- The policy iteration algorithm searches for the optimal control law.
- Every iterative control law stabilizes the nonlinear system.
- Under policy iteration, the iterative performance index (cost function) converges to the optimal value; the recursion is sketched after this list.
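With the same assumed notation ($F$ for the dynamics, $U$ for the utility, $v_0$ an initial admissible control law), the discrete-time PI recursion alternates a policy evaluation step and a policy improvement step, roughly:

$$\text{evaluation:}\quad V_i(x_k) = U(x_k, v_i(x_k)) + V_i\big(F(x_k, v_i(x_k))\big),$$
$$\text{improvement:}\quad v_{i+1}(x_k) = \arg\min_{u_k}\big\{U(x_k,u_k) + V_i(F(x_k,u_k))\big\}.$$

Unlike VI, the evaluation step solves for $V_i$ under the current admissible control law rather than taking a one-step minimum.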
To obtain the optimal control law, the optimal value function must be obtained first; however, before the iterative control laws are considered, $J^*(x_k)$ is unknown. Traditional dynamic programming faces the curse of dimensionality, and since the control sequence is infinite it is almost impossible to obtain the optimal control directly from the HJB equation.
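For reference, the discrete-time HJB equation referred to here takes the standard form (same assumed notation as above):

$$J^*(x_k) = \min_{u_k}\big\{U(x_k,u_k) + J^*\big(F(x_k,u_k)\big)\big\}, \qquad u^*(x_k) = \arg\min_{u_k}\big\{U(x_k,u_k) + J^*\big(F(x_k,u_k)\big)\big\}.$$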
As the number of iterations tends to infinity, the PI algorithm converges: the iterative control law approaches the optimal control law, and the iterative value function is monotonically nonincreasing and converges to the optimal value.
Theorem 3.1 shows that under the PI algorithm the iterative value function is monotonically nonincreasing. Corollary 3.1 then shows that, starting from any admissible control law, all subsequent iterative control laws are admissible. Theorem 3.2 shows that the iterative value function converges to the optimal performance index; the proof has three steps: (a) the limit of the iterative performance index satisfies the HJB equation; (b) for any admissible control law with performance index $P(x_k)$, the limiting value function satisfies $V_{\infty}(x_k) \le P(x_k)$; (c) the limiting value function is equivalent to the optimal value function. Correspondingly, Corollary 3.2 shows that the iterative control law converges to the optimal control law. The continuous-time and discrete-time policy iteration algorithms differ: their HJB equations are different, and the continuous-time analysis is based on derivatives.
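In the assumed notation, these results can be summarized as

$$J^*(x_k) \le \cdots \le V_{i+1}(x_k) \le V_i(x_k) \le \cdots \le V_0(x_k), \qquad \lim_{i\to\infty} V_i(x_k) = J^*(x_k).$$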
Algorithm 1 selects a positive semidefinite function and obtains an admissible control law by training two neural networks until a prescribed accuracy is reached. Algorithm 2 gives the discrete-time PI algorithm. After the action network is built, the admissibility of its control law remains unknown until the weights of the critic network converge, which is why Algorithm 1 is offline.
Theorem 3.3 gives the condition on the value function under which an admissible control law is obtained.
Two ways to obtain the iterative value function (sketched after this list):
- define inner and outer iterations that are updated separately;
- compute it directly as a summation along the closed-loop trajectory.
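In the assumed notation, the two policy evaluation forms can be sketched as

$$\text{inner/outer iteration:}\quad V_i^{(j+1)}(x_k) = U(x_k, v_i(x_k)) + V_i^{(j)}\big(F(x_k, v_i(x_k))\big), \quad j = 0, 1, \ldots$$
$$\text{direct summation:}\quad V_i(x_k) = \sum_{j=0}^{\infty} U\big(x_{k+j}, v_i(x_{k+j})\big),$$

where the trajectory $x_{k+j}$ is generated under $v_i$. The MATLAB implementation below uses the summation form, truncated at eval_step steps.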
Theorem 4.1 shows that the network weights representing the iterative value function and the iterative control law converge to the optimal ones.
For unstable systems, although the optimal control law can be obtained by either value iteration or policy iteration, not every control law produced by value iteration stabilizes the system, the properties of the VI iterative controls cannot be analyzed, and VI can only be implemented offline. Policy iteration, by contrast, guarantees the stability of the system throughout the iterations.
Simulations: (1) a linear system, comparing policy iteration with the conventional solution of the algebraic Riccati equation; (2) a nonlinear system, comparing policy iteration with value iteration; (3) an inverted pendulum system; (4) nonlinear satellite attitude control.
Outlook: since neural networks are used to approximate the iterative value function and the iterative control law, approximation errors are unavoidable; in that case neither the convergence of the iterative performance index function nor the stability of the system under the iterative control laws can be proven, and additional criteria and proofs are needed in follow-up work.
MATLAB implementation
Check the feasibility of the PI algorithm on a linear system.
Pre-generated state training data
Given the system matrices and the cost weight matrices, state training data are generated with random functions.
%------------------------- generate training data & system information ----------------------------
clear; close all; clc;
% system matrices
A = [ 0, 0.1;...
0.3, -1 ];
B = [ 0;...
0.5 ];
state_dim = size(A,1);
control_dim = size(B,2);
% cost function parameters
Q = 1*eye(state_dim);
R = 1*eye(control_dim);
% training data
x_train = zeros(state_dim,1);
x0 = [1;-1];
for i = 1:50
x_train = [x_train, zeros(state_dim,1)];
x_train = [x_train,2*(rand(state_dim,1)-0.5)];
x_train = [x_train,1*(rand(state_dim,1)-0.5)];
x_train = [x_train,0.5*(rand(state_dim,1)-0.5)];
end
r = randperm(size(x_train,2)); % random permutation of the column indices
x_train = x_train(:, r); % reorder
save training_data/state_data x_train state_dim control_dim A B Q R x0;
Obtaining an admissible control law
Compare the optimal gain and the Riccati-equation solution returned by LQR with the initial control law produced by training the action network, and evaluate the performance index obtained under each control law.
%-------------------------- obtain the initial admissible control ----------------------------
clear; close all; clc;
load training_data/state_data.mat
[K, P] = dlqr(A,B,Q,100*R); % discrete-time LQR gives the gain K and the Riccati solution P; R is inflated (100*R) so the resulting law is admissible but not optimal for the original cost
actor_target = -K*x_train; % control law induced by this gain, used as the training target
cover = 1;
if isempty(dir('training_data/actor_init.mat')) == 1 || cover == 1 % train when no saved initial admissible control exists (cover == 1 forces retraining)
% action network
actor_init_middle_num = 15; % number of hidden-layer neurons in the action network
actor_init_epoch = 10000;
actor_init_err_goal = 1e-9;
actor_init_lr = 0.01;
actor_init = newff(minmax(x_train), [actor_init_middle_num control_dim], {'tansig' 'purelin'},'trainlm');
actor_init.trainParam.epochs = actor_init_epoch;
actor_init.trainParam.goal = actor_init_err_goal;
actor_init.trainParam.show = 10;
actor_init.trainParam.lr = actor_init_lr;
actor_init.biasConnect = [1;0];
actor_init = train(actor_init, x_train, actor_target);
save training_data/actor_init actor_init % save the trained initial admissible control
else
load training_data/actor_init
end
%-------------------------- test the initial control ----------------------------
x = x0;
x_net = x;
xx = x;
xx_net = x_net;
uu = [];
uu_net = [];
Fsamples = 200;
JK = 0; % performance index under the optimal gain K from dlqr
Jnet = 0; % performance index under the initial admissible control from the action network
h = waitbar(0,'Please wait');
for k = 1:Fsamples
u = -K*x;
u_net = actor_init(x_net);
JK = JK + x'*Q*x + u'*R*u;
Jnet = Jnet + x_net'*Q*x_net + u_net'*R*u_net;
x = A*x + B*u;
xx = [xx x];
x_net = A*x_net + B*u_net;
xx_net = [xx_net x_net];
uu = [uu u];
uu_net = [uu_net u_net];
waitbar(k/Fsamples,h,['Running...',num2str(k/Fsamples*100),'%']);
end
close(h)
JK
Jnet
figure,
plot(0:Fsamples,xx,'b-',0:Fsamples,xx_net,'r--','linewidth',1)
xlabel('Time steps');
ylabel('States');
set(gca,'FontName','Times New Roman','FontSize',14,'linewidth',1);
grid on;
figure,
plot(0:Fsamples-1,uu,'b-',0:Fsamples-1,uu_net,'r--','linewidth',1)
xlabel('Time steps');
ylabel('Control');
set(gca,'FontName','Times New Roman','FontSize',14,'linewidth',1);
grid on;
PI algorithm implementation
Load the training data, then build the action and critic networks, which perform policy improvement and policy evaluation respectively, updating the control law and the value function.
function pi_algorithm
% This demo checks the feasibility of the policy iteration adaptive dynamic
% programming algorithm
%-------------------------------- start -----------------------------------
clear; close all; clc;
% information of system & cost function
global A; global B; global Q; global R;
load training_data/state_data.mat;
load training_data/actor_init.mat; % load the pre-generated state data and the initial admissible control law
% action network
actor = actor_init;
actor_epoch = 20000;
actor_err_goal = 1e-9;
actor_lr = 0.01;
actor.trainParam.epochs = actor_epoch;
actor.trainParam.goal = actor_err_goal;
actor.trainParam.show = 10;
actor.trainParam.lr = actor_lr;
% critic network
critic_middle_num = 15;
critic_epoch = 10000;
critic_err_goal = 1e-9;
critic_lr = 0.01;
critic = newff(minmax(x_train), [critic_middle_num 1], {'tansig' 'purelin'},'trainlm'); % build the critic network
critic.trainParam.epochs = critic_epoch;
critic.trainParam.goal = critic_err_goal; % stop training once this error goal is reached
critic.trainParam.show = 10; % refresh the training display every 10 epochs
critic.trainParam.lr = critic_lr;
critic.biasConnect = [1;0];
epoch = 10;
eval_step = 400;
performance_index = ones(1,epoch + 1);
figure(1),hold on;
h = waitbar(0,'Please wait');
for i = 1:epoch
% update critic
% evaluate policy
critic_target = evaluate_policy(actor, x_train, eval_step);
critic = train(critic,x_train,critic_target);
performance_index(i) = critic(x0);
figure(1),plot(i,performance_index(i),'*'),xlim([1 epoch]),hold on;
waitbar(i/epoch,h,['Training controller...',num2str(i/epoch*100),'%']);
if i == epoch
break;
end
% update actor
actor_target = zeros(control_dim,size(x_train,2));
for j = 1:size(x_train,2)
x = x_train(:,j);
if x == zeros(state_dim,1)
ud = zeros(control_dim,1);
else
objective = @(u) cost_function(x,u) + critic(controlled_system(x,u));
u0 = actor(x);
ud = fminunc(objective, u0);
end
actor_target(:,j) = ud;
end
actor = train(actor, x_train, actor_target);
end
close(h)
figure(1),
xlabel('Iterations');
ylabel('$V(x_0)$','Interpreter','latex');
set(gca,'FontName','Times New Roman','FontSize',14,'linewidth',1);
grid on;
hold off;
save training_results/actor_critic actor critic
end
%---------------------------- evaluate policy -----------------------------
function y = evaluate_policy(actor,x,eval_step)
critic_target = zeros(1,size(x,2));
for k = 1:eval_step
uep = actor(x); % action-network output under the current policy
critic_target = critic_target + cost_function(x,uep); % accumulate the utility along the trajectory
x = controlled_system(x,uep); % propagate the controlled system one step
end
y = critic_target;
end
%--------------------------- output of system ----------------------------
function y = controlled_system(x,u)
% system matrices
global A; global B;
y = A*x + B*u; % for nonlinear systems, element-wise (dot) operations should be adopted here
end
%----------------------------- cost function ------------------------------
function y = cost_function(x,u)
global Q; global R;
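% x and u hold one column per training sample; diag(.) below extracts the
% per-column quadratic costs, so y is returned as a 1-by-N row vector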
y = (diag(x'*Q*x) + diag(u'*R*u))';
end
Verification of the PI algorithm
Load the pre-generated state data and the trained critic and action networks.
Compare against the optimal control law and the Riccati-equation solution computed by LQR (the exact solution of the HJB equation for this linear case).
Substitute the converged near-optimal iterative value function and control law into the performance index function.
clear; close all; clc;
load training_data/state_data.mat;
load training_results/actor_critic.mat
[Kopt, Popt] = dlqr(A,B,Q,R);
x = x0;
x_net = x;
xx = x;
xx_net = x_net;
uu_opt = [];
uu_net = [];
Jreal = 0;
Fsamples = 50;
h = waitbar(0,'Please wait');
for k = 1:Fsamples
uopt = -Kopt*x;
x = A*x + B*(uopt);
xx = [xx x];
u_net = sim(actor,x_net);
Jreal = Jreal + x_net'*Q*x_net + u_net'*R*u_net;
x_net = A*x_net + B*u_net;
xx_net = [xx_net x_net];
uu_opt = [uu_opt uopt];
uu_net = [uu_net u_net];
waitbar(k/Fsamples,h,['Running...',num2str(k/Fsamples*100),'%']);
end
close(h)
Jopt = x0'*Popt*x0
Jnet = critic(x0)
Jreal
figure(1),
plot(0:Fsamples,xx,'b-',0:Fsamples,xx_net,'r--','linewidth',1)
legend('Optimal ','NN','Interpreter','latex');
xlabel('Time steps');
ylabel('States');
set(gca,'FontName','Times New Roman','FontSize',14,'linewidth',1);
grid on;
figure(2),
plot(0:Fsamples-1,uu_opt,'b-',0:Fsamples-1,uu_net,'r--','linewidth',1)
legend('Optimal ','NN','Interpreter','latex');
xlabel('Time steps');
ylabel('Control');
set(gca,'FontName','Times New Roman','FontSize',14,'linewidth',1);
grid on;