Factorial Kernel Dynamic Policy Programming For Vinyl Acetate Monomer Plant Model Control
Abstract— This research focuses on applying reinforcement learning to chemical plant control problems in order to optimize production while maintaining plant stability without requiring knowledge of the plant models. Since a typical chemical plant has a large number of sensors and actuators, the control problem of such a plant can be formulated as a Markov decision process involving a high-dimensional state and a huge number of actions that might be difficult to solve by previous methods due to computational complexity and sample insufficiency. To overcome these issues, we propose a new reinforcement learning method, Factorial Kernel Dynamic Policy Programming (FKDPP), that employs 1) a factorial policy model and 2) a factor-wise kernel-based smooth policy update by regularization with the Kullback-Leibler divergence between the current and updated policies. To validate its effectiveness, FKDPP is evaluated on the Vinyl Acetate Monomer (VAM) plant model, a popular benchmark chemical plant control problem. Compared with previous methods that cannot directly process a huge number of actions, our proposed method leverages the same number of training samples and achieves a better control strategy for VAM yield, quality, and plant stability.

I. INTRODUCTION

Chemical plants consist of several processing units that cooperatively produce chemical products via complex interactions, and their coordination is important for safe and profitable plant operations. The conventional control strategies for chemical plants are primarily formed as a set of heuristics [1, 2] and optimized in both steady and dynamic simulation [3]. Model predictive control (MPC), another popular method in chemical process control [4], is model-based and suffers from exorbitant online computational cost for large-scale plants as well as long prediction horizons [5]. A method that relies on neither human engineering knowledge nor models remains elusive in this field.

As a popular test problem in chemical plant design, optimization, and control, the Vinyl Acetate Monomer (VAM) plant model benchmark problem was originally proposed by [6]. This problem is uniquely suited for researchers pursuing simulation, design, and control studies as it:
1) Has a realistically large flowsheet, containing several standard chemical unit operations with high-level chemical interactions.
2) Includes typical industrial characteristics of recycle streams and energy integration.
This model has been further developed by [7, 8], while several other studies on control-system design have been conducted [2, 9, 10]. Based on experienced engineers' intuition and judgment in eliminating and evaluating candidate architectures, these works mainly focus on classical controller design procedures that require analysis of the process dynamics, development of an abstract mathematical model, and derivation of a control law that meets certain design criteria.

As an integral part of contemporary machine learning, reinforcement learning [11] agents search for optimal policies by interacting with their environment without any prior knowledge, and they express a remarkably broad range of control problems in a natural manner. Example applications include electrical power systems control [12], dynamic power management [13], and robot control [14]. Such a bio-inspired approach is suitable for learning the control policies of chemical process plants. However, the application of reinforcement learning to chemical process plants, which commonly feature a large number of sensors, controllers, and parameters and require high stability for safety, is a less explored area. Some previous works with reinforcement learning have been conducted besides heuristic solutions. Among model-free methods, [15] applied neural networks and reinforcement learning to cool a reactor simulator via one valve. Among model-based methods, MPC-based dynamic programming was utilized for a plant with one- or two-dimensional states and actions for one control parameter in [5]. However, to the best of the authors' knowledge, there is no application of reinforcement learning to a large-scale problem like the VAM plant model in the literature, due to the curse of dimensionality in the high-dimensional state, the intractable computational cost with large action spaces, and the difficulty of accurately modelling complex plant systems.

In this research, we focus on applying reinforcement learning to large-scale chemical plant control problems to optimize production while maintaining plant stability without any knowledge of the plant models. Since a typical chemical plant has a large number of sensors and actuators, its control problem can be formulated as a Markov decision process involving a high-dimensional state and a large action space that might be difficult to solve by previous methods due to computational complexity and sample insufficiency. To overcome these issues, a new approach, Factorial Kernel Dynamic Policy Programming (FKDPP), is proposed. It is inspired by both Kernel Dynamic Policy Programming (KDPP) [16], which outperforms other conventional reinforcement learning methods such as LSPI [17] in robot control tasks with high-dimensional states, and [18], which factorizes the high-dimensional state of a Hidden Markov Model to reduce computational complexity. FKDPP enjoys a factorial policy model and a factor-wise kernel-based smooth policy update by regularization with the Kullback-Leibler divergence between the current and updated policies. Thus, the proposed approach is suited for learning near-optimal policies in large-scale chemical plant models. The effectiveness of FKDPP is validated on the VAM plant model benchmark problem, with experimental results showing that the proposed method outperforms other methods in terms of VAM yield, quality, and system stability.

∗ Indicates equal contribution.
1 Division of Information Science, Graduate School of Science and Technology, Nara Institute of Science and Technology (NAIST), Nara, Japan.
2 Yokogawa Electric Corporation.

TABLE I: Observed values and control parameters of the investigated task.

Observed values
  Name      Description                                                      Control criteria
  FI560.F   flow rate of VAM                                                 to be optimized as VAM yield
  LI550.L   level (%) of decanter tank                                       < 100%
  s407.F    flow rate from distillation column in Part 7 to decanter tank    > 0
  QI531.Q   acetic-acid density                                              < 500 ppm
  QI560.Q   VAM density (inversely proportional to VAM quality)              < 150 ppm
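For concreteness, the operating criteria in Table I can be encoded as a small checker that later supports the reward shaping described in Section IV. The following Python sketch is illustrative only; the names CRITERIA and violates_criteria and the dictionary-based observation format are assumptions, not code from the paper.

```python
# Operating criteria from Table I (observed values of the investigated task).
CRITERIA = {
    "LI550.L": lambda v: v < 100.0,   # decanter tank level (%) must stay below 100%
    "s407.F":  lambda v: v > 0.0,     # flow from the Part-7 distillation column must stay positive
    "QI531.Q": lambda v: v < 500.0,   # acetic-acid density (ppm)
    "QI560.Q": lambda v: v < 150.0,   # VAM density (ppm); lower means higher VAM quality
}

def violates_criteria(observation):
    """Return True if any critical condition in Table I is violated."""
    return any(not ok(observation[name]) for name, ok in CRITERIA.items())
```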
A. Dynamic Policy Programming

The optimal value function, which includes a term that maximizes the expected discounted total reward while minimizing the difference between the current policy π and the baseline policy π̄, follows a Bellman equation:

$$V_{\bar{\pi}}^{*}(s) = \max_{\pi} \sum_{a\in\mathcal{A}} \pi(a|s) \Big[ \sum_{s'\in\mathcal{S}} T_{ss'}^{a}\big(r_{ss'}^{a} + \gamma V_{\bar{\pi}}^{*}(s')\big) - \frac{1}{\eta}\log\frac{\pi(a|s)}{\bar{\pi}(a|s)} \Big]. \quad (1)$$

For solving the optimal value function, the action preferences [11] for all state-action pairs (s, a) ∈ S × A in the (t + 1)-th iteration are defined according to [19]:

$$\Psi_{t+1}(s,a) = \frac{1}{\eta}\log\bar{\pi}^{t}(a|s) + \sum_{s'\in\mathcal{S}} T_{ss'}^{a}\big[r_{ss'}^{a} + \gamma V_{\bar{\pi}}^{t}(s')\big]. \quad (2)$$

Instead of the optimal value function, DPP learns optimal action preferences to determine the optimal control policy throughout the state-action space. The DPP recursion is

$$\Psi_{t+1}(s,a) = \Psi_{t}(s,a) + \sum_{s'\in\mathcal{S}} T_{ss'}^{a}\big[r_{ss'}^{a} + \gamma \mathcal{M}_{\eta}\Psi_{t}(s')\big] - \mathcal{M}_{\eta}\Psi_{t}(s), \quad (3)$$

where $\mathcal{M}_{\eta}$ denotes the Boltzmann softmax operator over action preferences, $\mathcal{M}_{\eta}\Psi(s) = \sum_{a\in\mathcal{A}} \frac{\exp(\eta\Psi(s,a))}{\sum_{a'\in\mathcal{A}}\exp(\eta\Psi(s,a'))}\,\Psi(s,a)$.

DPP can be extended to large-scale (continuous) state spaces via function approximation, i.e., linear function approximation with basis functions. Here, we define the n-th state-action pair from a set of N samples as $x_{n} = [s_{n}, a_{n}]_{n=1:N}$, where $\phi(x_{n})$ denotes the m × 1 output vector of m basis functions, $[\varphi_{1}(x_{n}), ..., \varphi_{m}(x_{n})]^{T}$. The approximate action preferences in the t-th iteration follow $\hat{\Psi}_{t}(x_{n}) = \phi(x_{n})^{T}\theta_{t}$, where $\theta_{t}$ is the corresponding m × 1 weight vector. The empirical least-squares solution that minimizes the loss function $J(\theta;\hat{\Psi}_{t}) \triangleq \|\Phi\theta - \mathcal{O}\hat{\Psi}_{t}\|_{2}^{2}$ is given by

$$\theta_{t+1} = [\Phi^{T}\Phi + \sigma^{2}I]^{-1}\Phi^{T}\mathcal{O}\hat{\Psi}_{t}, \quad (4)$$

where σ is used to avoid over-fitting due to the small number of samples, $\Phi = [\phi(x_{1}), ..., \phi(x_{N})]^{T}$, and $\mathcal{O}\hat{\Psi}_{t}$ is an N × 1 vector with elements

$$\mathcal{O}\hat{\Psi}_{t}(x_{n}) \triangleq \hat{\Psi}_{t}(x_{n}) + r_{s_{n}s'_{n}}^{a_{n}} + \gamma\mathcal{M}_{\eta}\hat{\Psi}_{t}(s'_{n}) - \mathcal{M}_{\eta}\hat{\Psi}_{t}(s_{n}). \quad (5)$$
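To make the approximate update in Eqs. (4)-(5) concrete, the following minimal NumPy sketch computes the target vector and the regularized least-squares weights. It assumes M_η is the Boltzmann-weighted average of action preferences and that features are tabulated for every discrete action at each sampled state; the function names (softmax_operator, dpp_fa_update) are illustrative and not taken from the paper.

```python
import numpy as np

def softmax_operator(psi_row, eta):
    """M_eta: Boltzmann-weighted average of action preferences for one state."""
    w = np.exp(eta * (psi_row - psi_row.max()))  # shift for numerical stability
    w /= w.sum()
    return float(w @ psi_row)

def dpp_fa_update(Phi, Phi_s, Phi_s_next, rewards, theta, eta=1.0, gamma=0.95, sigma=0.1):
    """One approximate DPP iteration (Eqs. 4-5) with linear function approximation.

    Phi        : (N, m) features of sampled state-action pairs x_n = [s_n, a_n]
    Phi_s      : (N, |A|, m) features of (s_n, a) for every discrete action a
    Phi_s_next : (N, |A|, m) features of (s'_n, a) for every discrete action a
    rewards    : (N,) sampled rewards r^{a_n}_{s_n s'_n}
    """
    psi_x = Phi @ theta                                                        # Psi_hat_t(x_n)
    m_s = np.array([softmax_operator(P @ theta, eta) for P in Phi_s])          # M_eta Psi_t(s_n)
    m_s_next = np.array([softmax_operator(P @ theta, eta) for P in Phi_s_next])  # M_eta Psi_t(s'_n)
    o = psi_x + rewards + gamma * m_s_next - m_s                               # Eq. (5)
    m = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + sigma ** 2 * np.eye(m), Phi.T @ o)    # Eq. (4)
```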
B. Kernel Dynamic Policy Programming

The main limitation of DPP in high-dimensional systems is the intractable computational complexity that results from an exponentially growing number of basis functions. Based on DPP, KDPP combines the kernel trick and smooth policy updates to learn tasks represented as high-dimensional MDPs with both increased stability and significantly reduced computational complexity [16]. Kernel ridge regression is applied to the least-squares solution in Eq. (4). The weight vector is represented by dual variables $\alpha_{t} = [\alpha_{t}^{1}, ..., \alpha_{t}^{N}]^{T}$ as

$$\theta_{t} = \sum_{i=1}^{N}\alpha_{t}^{i}\phi(x_{i}) = \Phi^{T}\alpha_{t}, \quad (6)$$

and we define the matrix of inner products as $K := \Phi\Phi^{T}$ such that $[K]_{ij} = \langle\phi(x_{i}),\phi(x_{j})\rangle =: k(x_{i},x_{j})$. The approximate action preferences therefore follow:

$$\hat{\Psi}_{t}(x_{n}) = \phi(x_{n})^{T}\theta_{t} = \sum_{i=1}^{N}k(x_{n},x_{i})\alpha_{t}^{i}. \quad (7)$$

After translating Eq. (4) using the Woodbury identity,

$$[\Phi^{T}\Phi + \sigma^{2}I]^{-1}\Phi^{T}\mathcal{O}\hat{\Psi}_{t} = \Phi^{T}[\Phi\Phi^{T} + \sigma^{2}I]^{-1}\mathcal{O}\hat{\Psi}_{t}, \quad (8)$$

the solution can also be represented by dual variables as

$$\alpha_{t+1} = [K + \sigma^{2}I]^{-1}\mathcal{O}\hat{\Psi}_{t}, \quad (9)$$

where $\mathcal{O}\hat{\Psi}_{t}$ is calculated by plugging Eq. (7) into Eq. (5). According to [16], a suitable kernel subset of all samples $D_{k} = [\tilde{x}_{m}]_{m=1:M}$, M ≪ N, can be built via online selection in order to reduce the computational complexity, and therefore enables learning in the high-dimensional state space.
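As an illustration of the dual-variable form in Eqs. (7) and (9), the sketch below performs the kernelized update over a kernel subset. The choice of an RBF kernel and the names rbf_kernel, kdpp_dual_update, and kernel_preferences are assumptions made here for illustration; the actual kernel and subset-selection procedure are those of [16].

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) for all pairs of rows."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def kdpp_dual_update(X_subset, o_targets, sigma=0.1, bandwidth=1.0):
    """Eq. (9): alpha_{t+1} = (K + sigma^2 I)^{-1} O Psi_t on the kernel subset.

    o_targets holds the elements of O Psi_t (Eq. 5) evaluated at the subset points.
    """
    K = rbf_kernel(X_subset, X_subset, bandwidth)
    return np.linalg.solve(K + sigma ** 2 * np.eye(K.shape[0]), o_targets)

def kernel_preferences(X_query, X_subset, alpha, bandwidth=1.0):
    """Eq. (7): Psi_hat(x) = sum_i k(x, x_i) * alpha_i."""
    return rbf_kernel(X_query, X_subset, bandwidth) @ alpha
```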
C. Factorial KDPP

KDPP has shown efficient learning in the control of a pneumatic muscle-driven robotic hand with a 32-dimensional state space [16], while other conventional methods such as LSPI [17] could not. On the other hand, calculating $\mathcal{M}_{\eta}\Psi_{t}(s)$ in Eq. (3) over a huge discrete action set A is intractable. For example, define M discrete actions for one control parameter. To control N parameters, the entire set of all possible actions becomes $|\mathcal{A}| = M^{N}$ (Solution 2 in Fig. 2), which is intractable with large M and N. As one solution for maintaining a tractable computational complexity, the experiments in [16] carefully coded actions to limit control to only one parameter per action (Solution 1 in Fig. 2). This reduces the size of the action set to M × N. However, this trick clearly weakens control capability and is not suitable for tasks requiring the simultaneous control of several units (e.g., chemical plant control). To address this, Factorial Kernel Dynamic Policy Programming (FKDPP) is proposed to learn the action space dimension by dimension under the following factored policy within the KDPP framework:

$$\pi(a|s) = \prod_{n=1}^{N}\pi^{(n)}(a^{(n)}|s). \quad (10)$$

FKDPP divides the discrete action set for N control parameters, A, into N subsets $\mathcal{A}^{n}$, each of which contains only M discrete actions, and assigns them to N KDPP agents respectively. The policy π(a|s) then turns into N policies $\pi^{(n)}(a^{(n)}|s)$, as in Solution 3 in Fig. 2. With each agent searching M discrete values in its subset, N agents factorially search all $M^{N}$ discrete actions in A. This hugely decreases the computational complexity without losing control capability. Inheriting the regularization with the Kullback-Leibler divergence from KDPP, FKDPP features a factor-wise kernel-based smooth policy update that stabilizes the learning among multiple agents, since an over-large update of each agent's policy is avoided in every iteration.

FKDPP adds a loop to learn each subset of actions separately, according to Algorithm 1. Each subset is allocated to an agent, which is updated with the corresponding samples generated through a soft-max exploration policy:

$$\pi_{\mathrm{explore}}^{n}(a^{n}|s) = \frac{\exp\big(\eta_{\mathrm{explore}}\hat{\Psi}_{t}^{n}(s,a^{n})\big)}{\sum_{a^{n'}\in\mathcal{A}^{n}}\exp\big(\eta_{\mathrm{explore}}\hat{\Psi}_{t}^{n}(s,a^{n'})\big)}. \quad (11)$$

By doing this, tasks with a huge discrete action set over multiple control parameters are divided into several sub-tasks with smaller action sets that are tractable for KDPP. Details of KDPP's kernel subset selection and weight update (lines 9 and 10 in Algorithm 1) are covered in [16].

Algorithm 1: Factorial KDPP
Require: number of iterations T, number of action dimensions N.
1: Initialize kernel subsets D_n^Kernel = Ø, n = 1, ..., N.
2: for iteration t = 0, 1, 2, ..., T
3:   if t == 0
4:     Generate samples in D_t with random policies π_n^0.
5:   else
6:     Generate samples in the t-th iteration D_t by setting π_n^t, n = 1, ..., N, as softmax exploration policies.
7:   end if
8:   for each dimension of actions n = 1, 2, ..., N
9:     Select samples from D_t to build the kernel subset for the n-th dimension, D_n^Kernel.
10:    Update the dual weight vector α_n with samples D_i, i = 0, 1, ..., t, and kernel subset D_n^Kernel following KDPP.
11:   end for
12: end for
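The per-dimension exploration in Algorithm 1 and Eqs. (10)-(11) can be sketched as follows. This is a schematic illustration under the assumption that each agent exposes a preference function Ψ̂ⁿ_t(s, aⁿ); the class and method names (FactorialExplorer, select_action) are not from the paper.

```python
import numpy as np

class FactorialExplorer:
    """Schematic factor-wise softmax exploration for FKDPP (Eqs. 10-11).

    Each of the N agents owns one action subset A_n (M discrete values) and a
    preference function psi_n(s, a); the joint action is the concatenation of
    the N independently sampled sub-actions, so N*M preferences are evaluated
    per step instead of M**N.
    """

    def __init__(self, action_subsets, preference_fns, eta_explore=1.0, seed=None):
        self.action_subsets = action_subsets   # list of N arrays, each of length M
        self.preference_fns = preference_fns   # list of N callables: psi_n(state, a) -> float
        self.eta = eta_explore
        self.rng = np.random.default_rng(seed)

    def _softmax_policy(self, n, state):
        """pi^n_explore(a^n | s) over the n-th action subset (Eq. 11)."""
        prefs = np.array([self.preference_fns[n](state, a) for a in self.action_subsets[n]])
        weights = np.exp(self.eta * (prefs - prefs.max()))  # shifted for numerical stability
        return weights / weights.sum()

    def select_action(self, state):
        """Sample one sub-action per dimension (line 6 of Algorithm 1)."""
        return [self.rng.choice(self.action_subsets[n], p=self._softmax_policy(n, state))
                for n in range(len(self.action_subsets))]
```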
Fig. 2: Solutions for handling a huge discrete action set in reinforcement learning. Solution 1: a limited discrete action set A′ in which each action controls only one parameter, |A′| = |A^1| + ... + |A^N| = M × N; this loses control capability. Solution 2: the full discrete action set A in which each action controls all parameters, |A| = |A^1| × ... × |A^N| = M^N; this is intractable with large N and M. Solution 3 (factorial action set): each subset A^n has M discrete values and multiple agents learn a policy for each subset; the search space is reduced from M^N to M × N without losing control capability.
IV. EXPERIMENTAL RESULTS

In this section, FKDPP and KDPP are applied to the task introduced in Section II-B. According to Table I, the state space has nine dimensions, including the control states (FI560.F, LI550.L, s407.F, QI531.Q, and QI560.Q) and the control parameters (FC550.SVM, TC540.SVM, TC501.SVM, and PC501.SVM). The discrete action set is defined by increasing/decreasing the four control parameters with 10 discrete actions each (M = 10) within the ranges aFC550 ∈ [−2, 2], aTC540 ∈ [−4, 4], aTC501 ∈ [−10, 10], and aPC501 ∈ [−2, 2]. The total number of actions is close to 10^4, resulting in an intractable computation in Eq. (3) for KDPP under Solution 2 in Fig. 2. Therefore, Solution 1 is used in KDPP to consider only M × N = 40 action combinations. For FKDPP, the action space with 10^4 actions is factorized by four agents, so each agent handles M = 10 actions. The VAM plant simulation used in this experiment is detailed in [8] and is implemented on the commercial dynamic simulator Visual Modeler, developed by Omega Simulation Co., Ltd. Each algorithm is trained over 30 iterations, with each iteration consisting of 200 steps. Each step simulates approximately 30 minutes due to the lengthy chemical processes.
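The factored action subsets described above can be built directly from the stated ranges, as in the short sketch below; the use of evenly spaced increments is an assumption for illustration, since the paper only specifies M = 10 discrete actions per range.

```python
import numpy as np
from itertools import product

# M = 10 discrete increments per control parameter, within the stated ranges.
action_subsets = {
    "FC550.SVM": np.linspace(-2, 2, 10),
    "TC540.SVM": np.linspace(-4, 4, 10),
    "TC501.SVM": np.linspace(-10, 10, 10),
    "PC501.SVM": np.linspace(-2, 2, 10),
}

# Solution 2 (joint action set): 10^4 combinations, intractable in Eq. (3).
joint_actions = list(product(*action_subsets.values()))
print(len(joint_actions))                              # 10000

# Solution 3 (FKDPP): four agents, each searching only its own 10 actions.
print(sum(len(a) for a in action_subsets.values()))    # 40
```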
The reward function used by both methods is defined as:

$$R = 30 \times \mathrm{FI560.F} - 2 \times \mathrm{LI550.L} - 5 \times \mathrm{QI560.Q} - 20 \times \mathrm{QI531.Q}. \quad (12)$$

It follows the strategy of giving a high reward when FI560.F is increased (VAM quantity up) or QI560.Q is decreased (VAM quality up). If any critical condition in Table I is violated, the reward is sharply decreased.
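A minimal sketch of this reward, assuming the observed values arrive as a dictionary and using an illustrative fixed penalty for violated criteria (the paper only states that the reward is sharply decreased, without giving the penalty value):

```python
def vam_reward(obs, violated, violation_penalty=-1000.0):
    """Reward of Eq. (12); the penalty for violating Table I criteria is illustrative.

    obs maps observed-value names (e.g. "FI560.F") to their current readings;
    violated is the result of a criteria check such as the one sketched after Table I.
    """
    if violated:
        return violation_penalty
    return (30.0 * obs["FI560.F"] - 2.0 * obs["LI550.L"]
            - 5.0 * obs["QI560.Q"] - 20.0 * obs["QI531.Q"])
```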
Figures 3a and 3b show the mean and variance of the VAM yield (FI560.F) and quality (QI560.Q) for each 200-step period during the 30 iterations of learning. FKDPP quickly converged to a good policy that maximized production (Fig. 3a), while KDPP resulted in a lower mean and larger variance due to system instability (e.g., depleting the tank or causing other catastrophic failure scenarios). FKDPP also has a lower mean of QI560.Q during learning compared with KDPP, as shown in Fig. 3b.

After 30 learning iterations, the policies of FKDPP and KDPP are tested. According to Fig. 3c and 3d, FKDPP outperformed KDPP in both VAM quantity and quality. The sequences of control parameters are shown in Fig. 3e and 3f, where FKDPP facilitates exploration with multiple control parameters at each step. On the other hand, KDPP only operates a single control parameter at each step, which is insufficient in terms of exploration, especially when each step lasts for 30 minutes. In Fig. 3f, KDPP has to use more iterations to operate each dimension of the action, resulting in worse performance. These results show that, under identical learning conditions, the proposed FKDPP achieves superior performance compared with methods in which the policy is not factorized.

In terms of computational complexity for this task, with its nine-dimensional state and an action set for four control parameters, FKDPP took an average of 0.0017 s per step to search over 10^4 actions, while KDPP took 0.0014 s to consider 10 × 4 actions. Because FKDPP reduces the computational complexity by limiting the number of action combinations (Eq. 3), the computation time for FKDPP, with 10 actions per agent, is only negligibly slower than running a single agent with 40 actions in the case of KDPP.

V. DISCUSSIONS AND CONCLUSIONS

In this paper, a novel reinforcement learning approach, Factorial Kernel Dynamic Policy Programming (FKDPP), which is applicable to tasks with both a multi-dimensional state and a huge action set, was proposed and successfully applied to the Vinyl Acetate Monomer plant model control problem as a first step towards controlling complex chemical plants. Besides utilizing the smooth policy update and kernel tricks to learn efficiently in a high-dimensional state space, FKDPP also factorizes its action space for multiple control parameters into several small subsets and learns them separately. Its performance was investigated in a challenging task with a nine-dimensional state and 10^4 discrete actions for four control parameters. Experimental results show that FKDPP can outperform methods that do not factorize the action space in optimizing both the quality and quantity of VAM production while maintaining plant stability. FKDPP was able to search 10^4 actions with a computation time comparable to a more conventional method with a limited 10 × 4 action set.

For future work, we aim to apply FKDPP to control the whole VAM plant, not only to further improve VAM production but also to optimize the system's stability and by-product recycling. Algorithmically, the effect of the Kullback-Leibler divergence should be evaluated. Since the idea of factorizing the policy is based on the assumption that the conventional heuristic strategy always handles each control parameter independently, different factorized structures of the action space should also be investigated for better control performance.

VI. ACKNOWLEDGMENT

We gratefully acknowledge the support of the commercial dynamic simulator software Visual Modeler from Omega Simulation Co., Ltd. for this research.
Fig. 3: Results for FKDPP (factorial search over 10^4 actions) and KDPP (search over 40 actions). (a) Mean and variance of VAM yield (FI560.F) during 30 iterations' learning. (b) Mean and variance of VAM quality (QI560.Q) during 30 iterations' learning; lower values indicate a higher quality. (c) VAM yield with the policies after 30 iterations' learning; the purple line is the benchmark from the provided equilibrium parameters. (d) VAM quality with the policies after 30 iterations' learning; the purple line is the benchmark from the provided equilibrium parameters. (e) Control actions following the FKDPP policies after 30 iterations' learning. (f) Control actions following the KDPP policies after 30 iterations' learning.