

Kernel Dynamic Policy Programming: Practical
Reinforcement Learning for High-dimensional Robots
Yunduan Cui, Takamitsu Matsubara, and Kenji Sugimoto

Abstract— Applying value function based reinforcement learning algorithms to real robots has been infeasible because approximating a high-dimensional value function is difficult. The difficulty of such high-dimensional value function approximation in previous methods is twofold: 1) instability of the value function approximation under non-smooth policy updates, and 2) the computational complexity associated with a high-dimensional state-action space. To cope with these issues, in this paper we propose Kernel Dynamic Policy Programming (KDPP), which smoothly updates the value function in an implicit high-dimensional feature space. The smooth policy update is promoted by adding the Kullback-Leibler divergence between the current and updated policies to the reward function as a regularization term that stabilizes the value function approximation. The computational complexity is reduced by applying the kernel trick in the value function approximation. KDPP can therefore be interpreted as a novel yet practical extension of Dynamic Policy Programming (DPP) and kernelized value function-based reinforcement learning methods that combines their strengths. We successfully applied KDPP to learn unscrewing a bottle cap with a Pneumatic Artificial Muscle (PAM) driven humanoid robot hand, a system with a 24-dimensional state space, using a limited number of samples and commonplace computational resources.

I. INTRODUCTION

Compared with typical electric motor driven robots that follow repeatable commands in fixed environments, humanoid robots that are compliant with humans (e.g., robots driven by pneumatic artificial muscles or series elastic actuators) are increasingly required to self-update according to feedback and to learn tasks that assist us in daily life without supervision. As a solution that enables robots to iteratively search for optimal behaviors by interacting with the environment without knowledge of its model, reinforcement learning (RL) [1] is becoming an important part of robot learning [2].

RL algorithms are divided into two groups: value function approaches, which obtain optimal behaviors by following a globally optimal value function over all states and actions, and policy search, which gradually reaches a local optimum by updating a given policy within its neighborhood. If the optimal value function is known, we can find a globally optimal solution by greedily choosing actions based on it. However, learning an accurate approximate value function in high-dimensional robot systems has been difficult. The value function approximation is unstable without sufficient samples covering all states and actions, and overly large policy updates without smoothness easily lead to divergent learning due to this instability [3]. Moreover, the computational complexity quickly becomes intractable as the system dimensionality increases. On the other hand, policy search improves the stability of learning and reduces the computational complexity by updating one policy instead of covering the whole state-action space. In spite of the requirement of preparing suitable parameterized control policies (e.g., central pattern generators or dynamic movement primitives) based on prior knowledge of the task and a careful selection of the initial policy parameters, policy search algorithms are popular in robot control and have been successfully applied to several robot tasks [4]–[6].

Kernel Dynamic Policy Programming (KDPP) is proposed in this study to overcome the limitations of the value function approach in robot control. To stabilize the value function approximation, KDPP exploits the smooth policy update of Dynamic Policy Programming (DPP) [7], [8], which limits overly large policy updates by introducing the Kullback-Leibler divergence into the reward function as a regularization term. The kernel trick [9] is employed to implicitly represent and update the high-dimensional approximate value function using inner products of pairs of generated samples, which considerably reduces the computational complexity. KDPP can thus be interpreted as a novel but practical extension of DPP and kernelized value function-based RL methods [10], [11] that combines their strengths.

To investigate the performance of KDPP, we applied it to an N-DOF manipulator reaching task and show the scalability of KDPP as N increases. Based on this success, we implemented KDPP to control the Shadow Dexterous Hand, a Pneumatic Artificial Muscle (PAM) driven humanoid robot hand, to unscrew a bottle cap. The whole system has a 24-dimensional state space with 625 discrete actions, which is, to our knowledge, impractical for conventional value function approach algorithms. KDPP successfully converged to good solutions with a small number of samples and iterations in our experiments.

This paper is organized as follows. The smooth policy update inherited from DPP is introduced in Section II. Section III gives details of the kernel method and the KDPP algorithm. Sections IV and V present the results of the simulation and the real robot experiment, respectively. The conclusions and future work are discussed in Section VI.
II. DYNAMIC POLICY PROGRAMMING

A. Dynamic Policy Programming Recursion

Optimal control methods find the optimal policy of a Markov decision process defined by (S, A, T, R, γ), where S = {s_1, ..., s_n} is a finite set of states, A = {a_1, ..., a_m} is a finite set of actions, T^a_{ss'} represents the transition probability from s to s' under action a, r^a_{ss'} = R(s, s', a) is the corresponding reward, and γ ∈ (0, 1) is the discount factor. The policy π(a|s) denotes the probability of taking action a in state s. The optimal policy π* attains the optimal value function V*, which satisfies the Bellman equation:

V^*(s) = \max_{\pi} \sum_{a \in A} \pi(a|s) \sum_{s' \in S} T^{a}_{ss'} \left( r^{a}_{ss'} + \gamma V^*(s') \right), \quad \forall s \in S.   (1)

Dynamic Policy Programming [7], [8] adds the Kullback-Leibler divergence between the policy π and a baseline policy π̄ to Eq. (1) as a regularization term to control the policy deviation:

V^*_{\bar{\pi}}(s) = \max_{\pi} \sum_{a \in A} \pi(a|s) \left[ \sum_{s' \in S} T^{a}_{ss'} \left( r^{a}_{ss'} + \gamma V^*_{\bar{\pi}}(s') \right) - \frac{1}{\eta} \log \frac{\pi(a|s)}{\bar{\pi}(a|s)} \right].   (2)

Equation (2) minimizes the distance between π and π̄ while maximizing the expected reward. To obtain the solution of this Bellman equation, the action preferences [1] for all state-action pairs (s, a) ∈ S × A in the (t + 1)-th iteration are defined according to [8]:

\Psi_{t+1}(s, a) = \frac{1}{\eta} \log \bar{\pi}_t(a|s) + \sum_{s' \in S} T^{a}_{ss'} \left( r^{a}_{ss'} + \gamma V^{t}_{\bar{\pi}}(s') \right).   (3)

Instead of the optimal value function, DPP learns optimal action preferences to determine the optimal control policy throughout the state-action space. The main loop of DPP is calculated as:

V^{t+1}_{\bar{\pi}}(s) = \frac{1}{\eta} \log \sum_{a \in A} \exp\left( \eta \Psi_t(s, a) \right),   (4)

\bar{\pi}_{t+1}(a|s) = \frac{\exp\left( \eta \Psi_t(s, a) \right)}{\sum_{a' \in A} \exp\left( \eta \Psi_t(s, a') \right)}.   (5)

Plugging Eqs. (4) and (5) into Eq. (3), we obtain the DPP recursion Ψ_{t+1}(s, a) = OΨ_t(s, a), calculated by:

O\Psi_t(s, a) = \Psi_t(s, a) - M_{\eta}\Psi_t(s) + \sum_{s' \in S} T^{a}_{ss'} \left( r^{a}_{ss'} + \gamma M_{\eta}\Psi_t(s') \right),   (6)

where M_{\eta}\Psi_t(s) is the Boltzmann soft-max operator:

M_{\eta}\Psi_t(s) = \sum_{a \in A} \frac{\exp\left( \eta \Psi_t(s, a) \right)}{\sum_{a' \in A} \exp\left( \eta \Psi_t(s, a') \right)} \, \Psi_t(s, a).   (7)
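To make the recursion concrete, the following is a minimal tabular NumPy sketch of Eqs. (6) and (7), not the authors' implementation; the transition tensor T, reward tensor R, and the toy MDP at the bottom are placeholders introduced only for illustration.

```python
import numpy as np

def boltzmann_softmax(psi, eta):
    """M_eta Psi(s): softmax-weighted average of the action preferences (Eq. 7)."""
    w = np.exp(eta * (psi - psi.max(axis=1, keepdims=True)))  # numerically stabilized weights
    w /= w.sum(axis=1, keepdims=True)
    return (w * psi).sum(axis=1)                              # one value per state

def dpp_sweep(psi, T, R, eta, gamma):
    """One application of the DPP operator O Psi of Eq. (6) on a tabular MDP.
    T[s, a, s2] is the transition probability, R[s, a, s2] the reward."""
    m = boltzmann_softmax(psi, eta)                           # M_eta Psi(s), shape (|S|,)
    backup = np.einsum('sap,sap->sa', T, R + gamma * m[None, None, :])
    return psi - m[:, None] + backup

# toy usage on a random 5-state, 3-action MDP
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(5), size=(5, 3))   # shape (5, 3, 5), rows sum to 1 over s'
R = rng.normal(size=(5, 3, 5))
psi = np.zeros((5, 3))
for _ in range(100):
    psi = dpp_sweep(psi, T, R, eta=1.0, gamma=0.95)
```

The greedy policy and the soft-max value of Eqs. (4) and (5) can then be read off from the converged preference table psi.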
B. Approximate Dynamic Policy Programming with Linear Function Approximation

To apply DPP to large-scale problems with continuous states s ∈ S based on samples, i.e., reinforcement learning, one approach is to use linear function approximation for the action preferences. Define the n-th state-action pair from a set of N samples as x_n = [s_n, a_n], n = 1:N, and let φ(x_n) denote the m × 1 output vector of m basis functions, [ϕ_1(x_n), ..., ϕ_m(x_n)]^T. The approximate action preferences in the t-th iteration follow Ψ̂_t(x_n) = φ(x_n)^T Θ_t, where Θ_t is the corresponding m × 1 weight vector. The loss function is defined as:

J(\Theta; \hat{\Psi}_t) \triangleq \| \Phi \Theta - O\hat{\Psi}_t \|_2^2,   (8)

where Φ = [φ(x_1), ..., φ(x_N)]^T and OΨ̂_t is the N × 1 vector with elements:

O\hat{\Psi}_t(x_n) = \hat{\Psi}_t(x_n) + r^{a_n}_{s_n s'_n} + \gamma M_{\eta}\hat{\Psi}_t(s'_n) - M_{\eta}\hat{\Psi}_t(s_n).   (9)

Thus, the empirical least-squares solution minimizing J(Θ; Ψ̂_t) is given by:

\Theta_{t+1} = [\Phi^T \Phi + \sigma^2 I]^{-1} \Phi^T O\hat{\Psi}_t,   (10)

where σ is used to avoid over-fitting due to the small number of samples.
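As a minimal illustration of Eq. (10), the regularized least-squares fit reduces to a single linear solve. The design matrix Phi and target vector O_psi below are random stand-ins for the basis-function outputs and the targets of Eq. (9); this is a sketch, not the authors' code.

```python
import numpy as np

def dpp_ls_update(Phi, O_psi, sigma):
    """Theta_{t+1} = (Phi^T Phi + sigma^2 I)^{-1} Phi^T O_psi  (Eq. 10)."""
    m = Phi.shape[1]
    A = Phi.T @ Phi + sigma**2 * np.eye(m)
    return np.linalg.solve(A, Phi.T @ O_psi)

# example with random features and targets
rng = np.random.default_rng(1)
Phi = rng.normal(size=(200, 20))     # N = 200 samples, m = 20 basis functions
O_psi = rng.normal(size=200)
theta = dpp_ls_update(Phi, O_psi, sigma=0.01)
psi_hat = Phi @ theta                # approximate action preferences at the samples
```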
III. KERNEL DYNAMIC POLICY PROGRAMMING

According to [7], [8], even though DPP with linear function approximation has a better error bound and a lower sample requirement than other value function approach algorithms such as Q-learning [12] and Least-Squares Policy Iteration (LSPI) [13] in simulations, its application in robot control is limited by the computational complexity in high-dimensional systems. Our previous study [14] combined the smooth policy update with Nearest Neighbor Search (NNS) to locally update the value function and was successfully applied to a real robot. However, it remains limited in very high-dimensional systems, where even 1% of the total number of basis functions required for linear function approximation is still too large for our computational resources.

A. Applying the Kernel Trick in DPP

The kernel trick implicitly represents a high-dimensional space by computing inner products between pairs of data points in that space instead of computing the coordinates of the data [9]. It is therefore suitable for efficiently representing high-dimensional state spaces in robots. For policy search algorithms, the kernel trick has been applied to efficiently represent a very rich function class for policy approximation [15]. Moreover, it has been combined with the Kullback-Leibler divergence in [16] to integrate both robust policy updates and kernel embeddings. On the other hand, kernel ridge regression [17] has been applied to value function based RL algorithms in [10], [11]. However, to the best of our knowledge, few kernelized value function-based RL methods have been applied to real robots, due to the brittle value function approximation caused by insufficient samples covering the whole state-action space. In this paper, we employ kernel ridge regression together with the smooth policy update in KDPP to efficiently and stably update the high-dimensional approximate value function in robot systems.

To apply kernel ridge regression to the least-squares solution of DPP with linear function approximation, we express the weight vector by the dual variables α_t = [α^t_1, ..., α^t_N]^T as:

\Theta_t = \sum_{i=1}^{N} \alpha^t_i \phi(x_i) = \Phi^T \alpha_t,   (11)

and define the matrix of inner products as K := ΦΦ^T, where [K]_{ij} = ⟨φ(x_i), φ(x_j)⟩ =: k(x_i, x_j).
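The paper does not commit to a specific kernel function k(·,·). As an illustration only, the Gram matrix K can be built with a Gaussian (RBF) kernel over concatenated state-action vectors; the kernel choice, bandwidth, and the random data below are assumptions of this sketch.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) for all pairs of rows of X and Y."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

# X holds N state-action samples x_n = [s_n, a_n] as rows (dimensions are illustrative)
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 25))
K = rbf_kernel(X, X)          # [K]_ij = k(x_i, x_j), the matrix of inner products
```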
The approximate action preferences therefore follow:

\hat{\Psi}_t(x_n) = \phi(x_n)^T \Theta_t = \sum_{i=1}^{N} k(x_n, x_i) \, \alpha^t_i.   (12)

After translating Eq. (10) using the Woodbury identity,

[\Phi^T \Phi + \sigma^2 I]^{-1} \Phi^T O\hat{\Psi}_t = \Phi^T [\Phi \Phi^T + \sigma^2 I]^{-1} O\hat{\Psi}_t,   (13)

we obtain the solution represented by the dual variables:

\alpha_{t+1} = [K + \sigma^2 I]^{-1} O\hat{\Psi}_t,   (14)

where OΨ̂_t is calculated by plugging Eq. (12) into Eq. (9).
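A minimal sketch of this dual update: given a Gram matrix over the regression samples and the targets OΨ̂_t of Eq. (9), Eq. (14) is one linear solve, and Eq. (12) evaluates the preferences at any point. The random Gram matrix and targets below are placeholders, not data from the paper.

```python
import numpy as np

def kernel_dpp_update(K, O_psi, sigma):
    """alpha_{t+1} = (K + sigma^2 I)^{-1} O_psi  (Eq. 14)."""
    return np.linalg.solve(K + sigma**2 * np.eye(K.shape[0]), O_psi)

def evaluate_preferences(k_query, alpha):
    """Psi_hat(x_q) = sum_i k(x_q, x_i) alpha_i  (Eq. 12); k_query[q, i] = k(x_q, x_i)."""
    return k_query @ alpha

# toy usage: a random positive semi-definite Gram matrix and random targets
rng = np.random.default_rng(3)
B = rng.normal(size=(50, 50))
K = B @ B.T                                        # stand-in for [K]_ij = k(x_i, x_j)
O_psi = rng.normal(size=50)
alpha = kernel_dpp_update(K, O_psi, sigma=0.01)
psi_at_samples = evaluate_preferences(K, alpha)    # preferences at the training samples
```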
B. Online Selection of the Regression Subset

With kernel ridge regression, the high-dimensional approximation of the action preferences is efficiently calculated from kernels over sample pairs following Eq. (12). However, building the kernel matrix over all N samples in each iteration is computationally infeasible. Instead, we follow [10] and [11] and select a subset of samples D_k = [x̃_m], m = 1:M, with M ≪ N, to reduce the computational complexity:

\phi(x) \approx \sum_{i=1}^{M} a_i \phi(\tilde{x}_i).   (15)

Solving \min_{a \in \mathbb{R}^M} \| \phi(x) - \sum_{i=1}^{M} a_i \phi(\tilde{x}_i) \|^2, the parameter vector a follows:

a = K_{MM}^{-1} k_M(x),   (16)

where [K_{MM}]_{ij} = k(\tilde{x}_i, \tilde{x}_j) and k_M(x) denotes the vector [k(x, \tilde{x}_1), ..., k(x, \tilde{x}_M)]^T. Defining [K_{NM}]_{ij} = k(x_i, \tilde{x}_j), we obtain:

k(x_i, x_j) \approx k_M(x_i)^T K_{MM}^{-1} k_M(x_j),   (17)

K \approx K_{NM} K_{MM}^{-1} K_{NM}^T,   (18)

which reduces the calculation of Eqs. (12) and (14) from all samples [x_n], n = 1:N, to the subset [x̃_m], m = 1:M.

To select D_k online, the variance of a sample x_* approximated by the subset is calculated as:

\delta_* = k(x_*, x_*) - k_M(x_*)^T K_{MM}^{-1} k_M(x_*),   (19)

which is equivalent to the variance of the predictive distribution in Gaussian process regression [18]. A new sample x_* is considered sufficiently different to be added to D_k if the corresponding δ_* exceeds a given threshold.
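The novelty test of Eq. (19) can be sketched as follows: a candidate joins the dictionary only if its projection variance onto the current subset exceeds the threshold. The kernel callable and the threshold value are assumptions carried over from the earlier sketches.

```python
import numpy as np

def novelty(x_new, dictionary, kernel):
    """delta = k(x, x) - k_M(x)^T K_MM^{-1} k_M(x)  (Eq. 19)."""
    if len(dictionary) == 0:
        return np.inf                      # the first sample is always added
    D = np.stack(dictionary)               # M x d matrix of dictionary points
    K_MM = kernel(D, D)
    k_M = kernel(D, x_new[None, :]).ravel()
    k_xx = kernel(x_new[None, :], x_new[None, :])[0, 0]
    # K_MM is recomputed here for clarity; an incremental update would be used in practice
    return k_xx - k_M @ np.linalg.solve(K_MM, k_M)

def select_subset(samples, kernel, tol):
    """Greedy online construction of the regression subset D_k."""
    dictionary = []
    for x in samples:
        if novelty(x, dictionary, kernel) > tol:
            dictionary.append(x)
    return dictionary

# usage (assuming the rbf_kernel sketched earlier and a threshold of 1e-4):
# D_k = select_subset(X, rbf_kernel, tol=1e-4)
```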
C. Kernel Dynamic Policy Programming

Kernel Dynamic Policy Programming (KDPP) is proposed in this subsection. In KDPP, samples are obtained by rollout-based interaction with the process in every iteration and are reused during learning. The data generated in the t-th iteration are defined as D^t = {D^t_1, ..., D^t_I}, containing I rollout trajectories, where each trajectory has J samples: D^t_i = {(s^t_{i,j}, a^t_{i,j}, s'^t_{i,j})}, j = 1:J.

The pseudocode is shown in Algorithm 1. Lines 3 to 15 of Algorithm 1 represent the process of generating samples and updating the kernel subset. At the beginning (t = 0), KDPP first generates samples with a purely random policy π_random and builds the subset for kernel ridge regression, D_k, from the samples D^0 following Eq. (19) with threshold TOL. The corresponding dual variable vector α_0 is set to 0. For the later iterations t ≥ 1, samples are generated by a soft-max exploration policy defined as:

\pi_{\mathrm{explore}}(a|s) = \frac{\exp\left( \eta_{\mathrm{explore}} \hat{\Psi}_t(s, a) \right)}{\sum_{a' \in A} \exp\left( \eta_{\mathrm{explore}} \hat{\Psi}_t(s, a') \right)},   (20)

where η_explore is the temperature that controls the randomness of the soft-max function. α_t is expanded according to δ in lines 8 to 15 and updated in lines 16 to 21.

Algorithm 1 Kernel Dynamic Policy Programming
Require: η, σ, T, I, J and TOL
 1: initialize D_k = ∅, α = ∅ and M = 0
 2: for t = 0, 1, ..., T
 3:   if t == 0
 4:     generate D^t by π_random
 5:   else
 6:     generate D^t by π_explore
 7:   for each x^t_{i,j} = [s^t_{i,j}, a^t_{i,j}], i = 1:I, j = 1:J, in D^t
 8:     if D_k == ∅
 9:       M = M + 1
10:       add x̃_M = x^t_{i,j} to D_k and α_M = 0 to α_t
11:     else
12:       calculate δ of x^t_{i,j} following Eq. (19)
13:       if δ > TOL
14:         M = M + 1
15:         add x̃_M = x^t_{i,j} to D_k and α_M = 0 to α_t
16:   for each (s^l_{i,j}, a^l_{i,j}, s'^l_{i,j}), l = 0:t, i = 1:I, j = 1:J, calculate:
17:     Ψ̂_t([s^l_{i,j}, a]) and Ψ̂_t([s'^l_{i,j}, a]) for all a ∈ A with D_k
18:     M_η Ψ̂_t(s^l_{i,j}) and M_η Ψ̂_t(s'^l_{i,j}) by Eq. (7)
19:     OΨ̂_t([s^l_{i,j}, a^l_{i,j}]) by Eq. (9)
20:   update α_t = [K_{MM} + σ^2 I]^{-1} OΨ̂_t
21: return D_k and α_T
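A compact sketch of the soft-max exploration policy of Eq. (20), followed by a comment-only outline of the outer loop of Algorithm 1. The environment interface and the helper names (collect_rollouts, build_targets, and the routines from the earlier sketches) are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def explore_action(psi_row, eta_explore, rng):
    """Sample an action from the soft-max exploration policy of Eq. (20).
    psi_row holds the approximate preferences Psi_hat_t(s, a) for every action a."""
    z = eta_explore * (psi_row - psi_row.max())      # stabilized logits
    p = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(psi_row), p=p)

# Outline of Algorithm 1 (comments only; env, kernel, TOL, sigma and the
# helpers novelty, select_subset are assumed from the sketches above):
#   D_k, alpha = [], None
#   for t in range(T + 1):
#       rollouts = collect_rollouts(env, random_policy if t == 0 else explore_action)
#       for x in rollouts:                                  # lines 7-15: grow D_k
#           if novelty(x, D_k, kernel) > TOL:
#               D_k.append(x)
#       O_psi = build_targets(D_k, alpha, kernel)           # lines 16-19: Eqs. (7), (9), (12)
#       K_MM = kernel(np.stack(D_k), np.stack(D_k))
#       alpha = np.linalg.solve(K_MM + sigma**2 * np.eye(len(D_k)), O_psi)   # line 20
```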
IV. SIMULATION RESULTS

In this section, KDPP's learning performance is investigated in simulations of an N-DOF manipulator reaching task (N = 4, 10, 20) in comparison with DPP [7], [8], LSPI [13], and kernel-based Least-Squares Policy Iteration (KLSPI) [11].

A. Simulation Setting

In this simulation, the continuous state is [θ_1, θ_2, ..., θ_N]^T, where θ_i ∈ [−π/2, π/2] rad represents the angle of the i-th joint of the manipulator. Each joint has three discrete actions: increase or decrease the angle by 0.0875 rad, or maintain the current angle. The first joint is placed at the position [0, 0], and the length of each link between two joints is set to 1/N m. All angles are initialized to 0 rad at the start of a roll-out. Defining the target to reach in the two-dimensional plane as X_target = 0.6830, Y_target = 0, the reward function is set as R = −1000 × ((X_target − X)^2 + (Y_target − Y)^2), where (X, Y) is the current position of the end-effector. The parameters of KDPP are set as follows: γ = 0.95, η = 0.0001, σ = 0.01, and TOL = 0.0001. The learning is finished when the reward of the last state in a test roll-out is more than −1. The simulation results are based on 10 repeated experiments. Figure 1 shows the learning results in simulation. The average number of features used for the approximate value function is shown in Fig. 2.
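For concreteness, the reward of the reaching task can be computed from the joint angles by planar forward kinematics as sketched below. The link length 1/N m and the target (0.6830, 0) follow the text, while the forward-kinematics convention (joint angles accumulated along the chain, base at the origin) is our own assumption.

```python
import numpy as np

def end_effector(thetas):
    """Planar forward kinematics for an N-DOF arm with link length 1/N, base at (0, 0).
    Assumes each theta_i is measured relative to the previous link."""
    n = len(thetas)
    angles = np.cumsum(thetas)                  # absolute link orientations
    x = np.sum(np.cos(angles)) / n
    y = np.sum(np.sin(angles)) / n
    return x, y

def reaching_reward(thetas, target=(0.6830, 0.0)):
    """R = -1000 * ((X_target - X)^2 + (Y_target - Y)^2), as in Section IV-A."""
    x, y = end_effector(thetas)
    return -1000.0 * ((target[0] - x) ** 2 + (target[1] - y) ** 2)

print(reaching_reward(np.zeros(4)))             # fully extended 4-DOF arm: X = 1, Y = 0
```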
Fig. 1: Simulation results for the 4, 10, and 20 DOF manipulator reaching tasks. (A × B means that the learning at each iteration generates A roll-outs, each roll-out having B steps.)

Fig. 2: Average number of features for the approximate value function (4, 10, and 20 DOF).

B. Results

All simulation results support that KDPP is capable of efficiently learning a high-dimensional value function with high stability, and they encouraged us to apply KDPP to real-world high-dimensional robots. In the 4-DOF manipulator reaching task, the two kernel value function approach algorithms converged to a higher reward than conventional DPP and LSPI with the same number of samples, while KDPP obtained a better result than KLSPI. DPP and LSPI require 9^4 × 3^4 = 531441 radial basis functions (RBFs) for linear function approximation if we set nine RBFs per state dimension and three RBFs per action dimension. This results in more than 100 GB of memory on our computational server, while the calculation time exceeds 30 min per iteration. On the other hand, KDPP and KLSPI efficiently represent the value function by tiny subsets built from samples (fewer than 1000 features). In the 10-DOF manipulator reaching task, we define that an action only affects one joint at each time step, so that the number of actions is reduced to 3 × 10. However, the total number of RBFs for linear function approximation, i.e., 9^10 × 30 ≈ 10^11, is still intractable, and we could not run this simulation with DPP and LSPI under our limited computational resources. In contrast, fewer than 2000 features are used to calculate the value function in KDPP. Even though both KDPP and KLSPI learned good solutions with 20 × 50 samples per iteration, KDPP still converged while KLSPI could not learn good solutions when the number of samples was reduced to 10 × 50. In the 20-DOF manipulator reaching task, 9^20 × 60 ≈ 7 × 10^20 RBFs would be required to build the value function by linear function approximation. Only KDPP learned good solutions, with fewer than 5000 features employed to calculate the value function, while KLSPI failed to converge with the limited number of samples.

V. EXPERIMENTS

A. Experimental Setting

As a rough approximation of a real muscle, the PAM is an attractive actuator for a wide variety of robots owing to its high power-to-weight ratio and good flexibility. However, modeling and controlling PAMs is challenging due to the nonlinearities in their structure, such as the pressure dynamics, hysteresis phenomena, and the effect of the mass flow rate, according to [19]. In this section, KDPP is applied to the Shadow Dexterous Hand [20], a PAM-driven humanoid robot hand, to learn unscrewing a bottle cap using two fingers, as humans do, without model dynamics or an initial policy. This is an appealing task to be solved by RL because knowledge of the physical interaction between the robot and the environment (e.g., the friction needed to unscrew the cap with the fingers) is variable and difficult to obtain.

In the Shadow Dexterous Hand, each finger has three joints. One joint is controlled by two antagonistic PAMs with discrete actions u: filling in compressed air, releasing compressed air, and keeping the current air pressure. Therefore, the control state of one joint is the four-dimensional vector [θ, θ̇, P_1, P_2]^T with the two-dimensional action [u_1, u_2]^T, where θ is the joint angle, θ̇ is the angular velocity, P_1 and P_2 are the air pressures of the two PAMs, and u_1 and u_2 are the corresponding actions.
We apply KDPP to control two full fingers with six joints to unscrew the bottle cap, giving a 24-dimensional state space. We coded 625 actions such that only one joint in each finger fills or releases compressed air at each time step. The experimental setting is shown in Fig. 3. A Kinect sensor was set in front of the Shadow Dexterous Hand to obtain the current state of the bottle cap by capturing the position of a red marker attached to the cap. During learning, if the cap is unscrewed, i.e., the marker moves down along the Y axis, a high reward is obtained.

Fig. 3: Real experimental setting of unscrewing a bottle cap using the Shadow Dexterous Hand.

In the learning process, ten trajectories with 10 × 50 samples were generated every iteration. The initial position was fixed close to the bottle cap, as shown in Fig. 3. The same KDPP parameter settings as in the simulations were employed. The control loop runs at 1.25 Hz, and each action operates for 0.25 s. Before each iteration, one test rollout with a greedy control policy is carried out twice; its reward R_test = 1000 × (Y_start − Y_end) represents the movement of the marker along the Y axis between its initial and final positions. The stopping criterion is R_test ≥ 1800, reached when the cap has been unscrewed by more than 120°.
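The test reward and stopping rule reduce to a few lines; the marker coordinates come from the Kinect in the real setup, and the coordinate units in the example below (as well as treating the marker position as a scalar Y value) are assumptions of this sketch.

```python
def test_reward(y_start, y_end):
    """R_test = 1000 * (Y_start - Y_end): downward marker motion along Y means progress."""
    return 1000.0 * (y_start - y_end)

def task_solved(y_start, y_end, threshold=1800.0):
    """Stopping criterion from the text: R_test >= 1800, i.e. the cap turned by more than 120 degrees."""
    return test_reward(y_start, y_end) >= threshold

# example with hypothetical marker coordinates (units assumed)
print(test_reward(420.0, 418.0), task_solved(420.0, 418.0))
```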
B. Results

According to Fig. 4, KDPP successfully learned the value function with the 24-dimensional state and 625 discrete actions and produced a good policy that controls the Shadow Dexterous Hand to unscrew the bottle cap within ten iterations. Figure 4a shows the learning curves of five experiments with error bars of one standard deviation. KDPP converged to a high-reward policy in all experiments. The average size of the kernel feature set is 3571, which results in an efficient learning process (less than 1 GB of memory is used, and the update time is ≤ 10 s on our computational server using MATLAB). Figure 4b shows examples of learned results in iterations 1, 4, and 8. One policy learned in iteration 8, which controls two PAMs to drive the Shadow Dexterous Hand to unscrew the bottle cap in 20 steps, is shown in Fig. 4c. The discrete action values are defined as 1: filling in compressed air, 0: keeping the air pressure, and −1: releasing compressed air.

Fig. 4: The learning result of unscrewing a bottle cap using the Shadow Dexterous Hand. (a) The average learning curve of the reward. (b) Typical example of the learning process in one experiment. (c) One learned policy in the 8th iteration.

VI. CONCLUSIONS AND FUTURE WORK

While policy search algorithms with the Kullback-Leibler divergence (e.g., Relative Entropy Policy Search [5] and Guided Policy Search [6]) have recently been well developed for robot control, this study is, to the best of our knowledge, the first to explore in real-world robots the capability of value function based RL algorithms with a Kullback-Leibler divergence regularizer, which focus on learning and updating the global value function rather than a local control policy. As a practical extension of DPP and kernelized value function-based RL methods, KDPP stabilizes the value function approximation by adding the Kullback-Leibler divergence as a regularization term to keep the policy update smooth. Exploiting this smoothness of the policy update, KDPP employs the kernel trick to implicitly represent and update the high-dimensional approximate value function using the generated samples, which considerably reduces the computational complexity. According to our simulation results, KDPP learned good solutions in very high-dimensional systems with a small number of samples, while other kernel value function approach algorithms such as KLSPI could not.

As a real application, KDPP was implemented on the Shadow Dexterous Hand to learn unscrewing a bottle cap. The whole system has a 24-dimensional state space with 625 discrete actions. Generating 500 samples per iteration, KDPP successfully learned good solutions within 10 iterations. The average size of the implicit feature set obtained with the kernel trick is only 3571, while more than 10^10 RBFs would be required for linear function approximation in conventional value function based RL algorithms. This result indicates KDPP's potential to be applied to other advanced robot systems.

Future work is divided into two parts. On the algorithm side, further reducing the computational complexity and moving to higher dimensional systems (≥ 100 dimensions) remain important issues. On the robot control side, we will implement KDPP on the Shadow Dexterous Hand with BioTac sensors to achieve more challenging tasks involving touch pressure.

VII. ACKNOWLEDGMENT

This work was supported by the New Energy and Industrial Technology Development Organization (NEDO).

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[2] S. Schaal and C. G. Atkeson, "Learning control in robotics," IEEE Robotics & Automation Magazine, vol. 17, no. 2, pp. 20–29, 2010.
[3] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.

[4] G. Endo, J. Morimoto, T. Matsubara, J. Nakanishi, and G. Cheng, "Learning CPG-based biped locomotion with a policy gradient method: Application to a humanoid robot," The International Journal of Robotics Research, vol. 27, no. 2, pp. 213–228, 2008.
[5] J. Peters, K. Mülling, and Y. Altun, "Relative entropy policy search," in Association for the Advancement of Artificial Intelligence (AAAI), pp. 1607–1612, 2010.
[6] S. Levine, N. Wagener, and P. Abbeel, "Learning contact-rich manipulation skills with guided policy search," in 2015 IEEE International Conference on Robotics and Automation (ICRA), pp. 156–163, IEEE, 2015.
[7] M. G. Azar, V. Gómez, and B. Kappen, "Dynamic policy programming with function approximation," in International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 119–127, 2011.
[8] M. G. Azar, V. Gómez, and H. J. Kappen, "Dynamic policy programming," The Journal of Machine Learning Research, vol. 13, no. 1, pp. 3207–3245, 2012.
[9] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[10] T. Jung and D. Polani, "Kernelizing LSPE(λ)," in 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), pp. 338–345, IEEE, 2007.
[11] X. Xu, D. Hu, and X. Lu, "Kernel-based least squares policy iteration for reinforcement learning," IEEE Transactions on Neural Networks, vol. 18, no. 4, pp. 973–992, 2007.
[12] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[13] M. G. Lagoudakis and R. Parr, "Least-squares policy iteration," The Journal of Machine Learning Research, vol. 4, pp. 1107–1149, 2003.
[14] Y. Cui, T. Matsubara, and K. Sugimoto, "Local update dynamic policy programming in reinforcement learning of pneumatic artificial muscle-driven humanoid hand control," in 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), pp. 1083–1089, IEEE, 2015.
[15] G. Lever and R. Stafford, "Modelling policies in MDPs in reproducing kernel Hilbert space," in AISTATS, 2015.
[16] H. van Hoof, J. Peters, and G. Neumann, "Learning of non-parametric control policies with high-dimensional state features," in AISTATS, 2015.
[17] C. Saunders, A. Gammerman, and V. Vovk, "Ridge regression learning algorithm in dual variables," in Proceedings of the 15th International Conference on Machine Learning (ICML), pp. 515–521, Morgan Kaufmann, 1998.
[18] C. K. Williams and C. E. Rasmussen, Gaussian Processes for Machine Learning. MIT Press, 2006.
[19] F. Daerden and D. Lefeber, "Pneumatic artificial muscles: actuators for robotics and automation," European Journal of Mechanical and Environmental Engineering, vol. 47, no. 1, pp. 11–21, 2002.
[20] R. Walker, "Shadow dextrous hand technical specification," Shadow Robot Company, 2013.
