
Conference Paper · August 2018
DOI: 10.1109/COASE.2018.8560593


Factorial Kernel Dynamic Policy Programming for
Vinyl Acetate Monomer Plant Model Control
Yunduan Cui1∗, Lingwei Zhu1∗, Morihiro Fujisaki2, Hiroaki Kanokogi2 and Takamitsu Matsubara1

∗ Indicates equal contribution.
1 Division of Information Science, Graduate School of Science and Technology, Nara Institute of Science and Technology (NAIST), Nara, Japan
2 Yokogawa Electric Corporation

Abstract— This research focuses on applying reinforcement learning to chemical plant control problems in order to optimize production while maintaining plant stability without requiring knowledge of the plant models. Since a typical chemical plant has a large number of sensors and actuators, the control problem of such a plant can be formulated as a Markov decision process involving a high-dimensional state and a huge number of actions, which might be difficult to solve by previous methods due to computational complexity and sample insufficiency. To overcome these issues, we propose a new reinforcement learning method, Factorial Kernel Dynamic Policy Programming (FKDPP), that employs 1) a factorial policy model and 2) a factor-wise kernel-based smooth policy update by regularization with the Kullback-Leibler divergence between the current and updated policies. To validate its effectiveness, FKDPP is evaluated on the Vinyl Acetate Monomer (VAM) plant model, a popular benchmark chemical plant control problem. Compared with previous methods that cannot directly process a huge number of actions, our proposed method leverages the same number of training samples and achieves a better control strategy for VAM yield, quality, and plant stability.

I. INTRODUCTION

Chemical plants consist of several processing units that cooperatively produce chemical products via complex interactions, and their coordination is important for safe and profitable plant operations. The conventional control strategies for chemical plants are primarily formed as a set of heuristics [1, 2] and optimized in both steady and dynamic simulation [3]. Model predictive control (MPC), another popular method in chemical process control [4], is a model-based method and suffers from exorbitant online computational cost with large-scale plants as well as long prediction lengths [5]. A method that relies on neither human engineering knowledge nor models remains elusive in this field.

As a popular test problem in chemical plant design, optimization, and control, the Vinyl Acetate Monomer (VAM) plant model benchmark problem was originally proposed by [6]. This problem is uniquely suited for researchers pursuing simulation, design, and control studies as it:
1) Has a realistically large flowsheet, containing several standard chemical unit operations with high-level chemical interactions.
2) Includes typical industrial characteristics of recycle streams and energy integration.
This model has been further developed by [7, 8], while several other studies on control-system design have been conducted [2, 9, 10]. Based on experienced engineers' intuition and judgment in eliminating and evaluating candidate architectures, these works mainly focus on classical controller design procedures that require analysis of the process dynamics, development of an abstract mathematical model, and derivation of a control law that meets certain design criteria.

As an integral part of contemporary machine learning, reinforcement learning [11] agents search for optimal policies by interacting with their environment without any prior knowledge and express a remarkably broad range of control problems in a natural manner. Example applications include electrical power system control [12], dynamic power management [13], and robot control [14]. Such a bio-inspired approach is suitable for learning the control policies of chemical process plants. However, the application of reinforcement learning to chemical process plants, which commonly feature a large number of sensors, controllers, and parameters and require high stability for safety, is a less explored area. Some previous works with reinforcement learning were conducted besides heuristic solutions. For model-free methods, [15] applied neural networks and reinforcement learning to cool a reactor simulator via one valve. For model-based methods, MPC-based dynamic programming was utilized in [5] for a plant with one- or two-dimensional states and actions for one control parameter. However, to the best of the authors' knowledge, there is no application of reinforcement learning to a large-scale problem like the VAM plant model in the literature, due to the curse of dimensionality in the high-dimensional state, the intractable computational cost with large action spaces, and the difficulty in accurately modelling complex plant systems.

In this research, we focus on applying reinforcement learning to large-scale chemical plant control problems to optimize production while maintaining plant stability without any knowledge of the plant models. Since a typical chemical plant has a large number of sensors and actuators, its control problem can be formulated as a Markov decision process involving a high-dimensional state and a large action space that might be difficult to solve by previous methods due to computational complexity and sample insufficiency. To overcome these issues, a new approach, Factorial Kernel Dynamic Policy Programming (FKDPP), is proposed. Inspired by both Kernel Dynamic Policy Programming (KDPP) [16], which outperforms other conventional reinforcement learning methods such as LSPI [17] in robot control tasks with high-dimensional states, and [18], which factorizes the
high-dimensional state Hidden Markov Model to reduce computational complexity, FKDPP enjoys a factorial policy model and a factor-wise kernel-based smooth policy update by regularization with the Kullback-Leibler divergence between the current and updated policies. Thus, the proposed approach is suited for learning near-optimal policies in large-scale chemical plant models. The effectiveness of FKDPP is validated on the VAM plant model benchmark problem, with experimental results showing that the proposed method outperforms other methods in terms of VAM yield, quality and system stability.

II. PROBLEM STATEMENT

A. VAM Plant Description

The process-flow diagram of the VAM plant model is shown in Fig. 1. It consists of eight parts.

Fig. 1: Process-flow diagram of the VAM plant model [8].

Part 1 is for the input of raw materials: ethylene (C2H4), oxygen (O2), and acetic acid (AcOH or CH3COOH) are provided as fresh feed streams.

Part 2 converts the raw materials into vinyl acetate (VAM, CH2=CHOCOCH3, with = denoting a double chemical bond) along with water (H2O) and carbon dioxide (CO2) as byproducts in a reactor. The following reactions take place during this process:
C2H4 + CH3COOH + 1/2 O2 → CH2=CHOCOCH3 + H2O,
C2H4 + 3 O2 → 2 CO2 + 2 H2O.

Part 3 contains a cooler, a separator and a compressor. Since both reactions are highly exothermic, heat from the reaction is dissipated by boiler feed water (BFW) circulation. Steam is generated on the shell side of the reactor, while gas emitted from the reactor is processed in this part: leftover acetic acid, water and vinyl acetate are condensed as liquid VAM crude, while the separated gases, including unreacted ethylene, oxygen, carbon dioxide, inert ethane (C2H6), and a small amount of gaseous vinyl acetate, are compressed for circulation.

Part 4 has an absorber to capture vinyl acetate gas via cold acetic acid from Parts 3 and 7 and send it to Part 6. Other gases are fed into Part 5.

Part 5 is the gas-purge system. It keeps the concentrations of CO2 around 5∼10 mol% and C2H6 around 5 mol% in the gas recycle line.

Part 6 is the intermediate buffer tank that mixes vinyl acetate and acetic acid with the VAM crude condensed from Part 3.

Part 7 contains a distillation column that separates the VAM crude and acetic acid coming from the intermediate buffer tank (Part 6). The VAM-water mixture is then discharged from the bottom, while the acetic acid is recycled to the absorber (Part 4) and the raw-material feed section (Part 1).

Part 8 contains a decanter where the product vinyl acetate is finally decanted.

B. Tasks

In this paper, we focus on observing and controlling the decanter tank (Part 8) and the distillation column (Part 7), with the overall objective of optimizing VAM yield and its quality while maintaining stability, as the first step towards the optimal control of the whole VAM plant in Fig. 1 by reinforcement learning. This problem features a nine-dimensional state space including the observed values and control parameters detailed in Table I. A finite discrete action set is defined as increasing/decreasing each control parameter. The initial parameters of the plant are provided to start from equilibrium. The specific objective is to optimize the quality and quantity of the VAM production while keeping the level of the decanter tank and the flow rate from the distillation column within safe limits. This is done by manipulating the decanter's reflux-flow rate and feed-flow temperature, and the distillation column's steam-flow rate and pressure.

TABLE I: Observed values and control parameters of the investigated task.

Observed values:
- FI560.F: flow rate of VAM; control criterion: to be optimized as VAM yield.
- LI550.L: level (%) of the decanter tank; control criterion: < 100%.
- s407.F: flow rate from the distillation column in Part 7 to the decanter tank; control criterion: > 0.
- QI531.Q: acetic-acid density; control criterion: < 500 ppm.
- QI560.Q: VAM density (inversely proportional to VAM quality); control criterion: < 150 ppm.

Control parameters:
- FC550.SVM: PID parameter for the decanter tank's reflux-flow rate into the distillation column. Increase: VAM yield −, stability +, resilient to reverse flow. Decrease: VAM yield +, stability −, prone to reverse flow.
- TC540.SVM: PID parameter for the decanter feed-flow temperature. Increase: decanter feed flow −, stability +. Decrease: decanter feed flow +, stability −.
- TC501.SVM: PID parameter for the distillation column's steam-flow rate. Increase: steam amount +, column temperature +, VAM yield +, VAM quality −. Decrease: steam amount −, column temperature −, VAM yield −, VAM quality +.
- PC501.SVM: PID parameter for the pressure of the distillation column. Increase: feed N2 into the pipeline, acetic-acid density −. Decrease: open the gas valve, acetic-acid density +.
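As a rough illustration of how the control criteria in Table I can be monitored during learning, the following minimal Python sketch encodes the observed-value limits as a safety check. The dictionary keys mirror the tag names in Table I, but the function name, data layout, and the use of a plain dict for observations are our own illustrative assumptions, not part of the benchmark.

```python
# Illustrative only: Table I control criteria encoded as a safety check.
SAFETY_LIMITS = {
    "LI550.L": lambda v: v < 100.0,   # decanter tank level (%) must stay below 100%
    "s407.F":  lambda v: v > 0.0,     # flow from distillation column (Part 7) to decanter tank
    "QI531.Q": lambda v: v < 500.0,   # acetic-acid density (ppm)
    "QI560.Q": lambda v: v < 150.0,   # VAM density (ppm); lower means higher VAM quality
}

def is_safe(observation: dict) -> bool:
    """True if all Table I control criteria hold for the current observation."""
    return all(check(observation[name]) for name, check in SAFETY_LIMITS.items())
```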

III. APPROACH

A. Dynamic Policy Programming Recursion

Dynamic Policy Programming (DPP) [19] solves the Markov decision process (MDP) with smooth policy updates by employing the Kullback-Leibler divergence between the current and new policies as a regularization term. According to [19, 20], such a smooth policy update is beneficial when working with a limited number of samples. An MDP is defined by (S, A, T, R, γ): S = {s_1, ..., s_n} is a finite set of states, A = {a_1, ..., a_m} is a finite set of discrete actions, T^a_{ss'} represents the transition probability from s to s' under a, r^a_{ss'} = R(s, s', a) is the corresponding reward, and γ ∈ (0, 1) is the discount factor. The policy π(a|s) denotes the probability of taking action a under state s. The optimal value function with the regularization term, which maximizes the expected discounted total reward while minimizing the difference between the current policy π and a baseline policy π̄, follows the Bellman equation:

V^*_{\bar\pi}(s) = \max_{\pi} \sum_{a \in \mathcal{A}} \pi(a|s) \Big[ \sum_{s' \in \mathcal{S}} T^a_{ss'} \big( r^a_{ss'} + \gamma V^*_{\bar\pi}(s') \big) - \frac{1}{\eta} \log \frac{\pi(a|s)}{\bar\pi(a|s)} \Big].  (1)

For solving the optimal value function, the action preferences [11] for all state-action pairs (s, a) ∈ S × A in the (t+1)-th iteration are defined according to [19]:

\Psi_{t+1}(s, a) = \frac{1}{\eta} \log \bar\pi^t(a|s) + \sum_{s' \in \mathcal{S}} T^a_{ss'} \big( r^a_{ss'} + \gamma V^t_{\bar\pi}(s') \big).  (2)

Instead of the optimal value function, DPP learns optimal action preferences to determine the optimal control policy throughout the state-action space. The DPP recursion is calculated by:

\Psi_{t+1}(s, a) = \mathcal{O}\Psi_t(s, a) = \Psi_t(s, a) - \mathcal{M}_\eta\Psi_t(s) + \sum_{s' \in \mathcal{S}} T^a_{ss'} \big( r^a_{ss'} + \gamma \mathcal{M}_\eta\Psi_t(s') \big),  (3)

where \mathcal{M}_\eta\Psi_t(s) = \sum_{a \in \mathcal{A}} \frac{\exp(\eta\Psi_t(s,a))}{\sum_{a' \in \mathcal{A}} \exp(\eta\Psi_t(s,a'))} \Psi_t(s,a) is the Boltzmann soft-max operator.

DPP can be extended to large-scale (continuous) state spaces via function approximation, e.g. linear function approximation with basis functions. Here, we define the n-th state-action pair from a set of N samples as x_n = [s_n, a_n]_{n=1:N}, where φ(x_n) denotes the m × 1 output vector of m basis functions, [ϕ_1(x_n), ..., ϕ_m(x_n)]^T. The approximate action preferences in the t-th iteration follow Ψ̂_t(x_n) = φ(x_n)^T θ_t, where θ_t is the corresponding m × 1 weight vector. The empirical least-squares solution minimizing the loss function J(θ; Ψ̂_t) ≜ ‖Φθ − OΨ̂_t‖²₂ is given by

\theta_{t+1} = [\Phi^T\Phi + \sigma^2 I]^{-1} \Phi^T \mathcal{O}\hat\Psi_t,  (4)

where σ is used to avoid over-fitting due to the small number of samples, Φ = [φ(x_1), ..., φ(x_N)]^T, and OΨ̂_t is an N × 1 vector with elements

\mathcal{O}\hat\Psi_t(x_n) \triangleq \hat\Psi_t(x_n) + r^{a_n}_{s_n s'_n} + \gamma \mathcal{M}_\eta\hat\Psi_t(s'_n) - \mathcal{M}_\eta\hat\Psi_t(s_n).  (5)
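To make the recursion concrete, the sketch below implements the Boltzmann soft-max operator and one sweep of the DPP recursion of Eq. (3) for a small tabular MDP with known transition probabilities and rewards. It is a minimal illustration under those assumptions (tabular arrays, known model), not the authors' implementation, and all names are ours.

```python
import numpy as np

def boltzmann_softmax(psi_s, eta):
    """M_eta Psi(s): soft-max-weighted average of the action preferences at one state."""
    w = np.exp(eta * (psi_s - psi_s.max()))   # shift by the max for numerical stability
    return np.dot(w / w.sum(), psi_s)

def dpp_sweep(psi, T, r, gamma, eta):
    """One DPP recursion (Eq. 3) on a tabular MDP.
    psi: |S| x |A| action preferences; T and r: |S| x |A| x |S| transitions and rewards."""
    n_states, n_actions = psi.shape
    m = np.array([boltzmann_softmax(psi[s], eta) for s in range(n_states)])  # M_eta Psi(s)
    psi_next = np.empty_like(psi)
    for s in range(n_states):
        for a in range(n_actions):
            backup = np.sum(T[s, a] * (r[s, a] + gamma * m))  # E_{s'}[r + gamma * M_eta Psi(s')]
            psi_next[s, a] = psi[s, a] - m[s] + backup
    return psi_next
```

Repeating dpp_sweep and acting by soft-max on the resulting preferences reproduces the tabular behaviour that Eqs. (4)-(5) approximate with basis functions when the model is unknown and only samples are available.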

B. Kernel Dynamic Policy Programming

The main limitation of DPP in high-dimensional systems is the intractable computational complexity that results from an exponentially growing number of basis functions. Based on DPP, KDPP combines the kernel trick and smooth policy updates to learn tasks represented as high-dimensional MDPs with both increased stability and significantly reduced computational complexity [16]. Kernel ridge regression is applied to the least-squares solution in Eq. (4). The weight vector is represented by dual variables α_t = [α_t^1, ..., α_t^N]^T as

\theta_t = \sum_{i=1}^{N} \alpha_t^i \, \phi(x_i) = \Phi^T \alpha_t,  (6)

and we define the matrix of inner products as K := ΦΦ^T such that [K]_{ij} = ⟨φ(x_i), φ(x_j)⟩ =: k(x_i, x_j). The approximate action preferences therefore follow:

\hat\Psi_t(x_n) = \phi(x_n)^T \theta_t = \sum_{i=1}^{N} k(x_n, x_i) \, \alpha_t^i.  (7)

Eq. (4) can be translated using the Woodbury identity:

[\Phi^T\Phi + \sigma^2 I]^{-1} \Phi^T \mathcal{O}\hat\Psi_t = \Phi^T [\Phi\Phi^T + \sigma^2 I]^{-1} \mathcal{O}\hat\Psi_t,  (8)

so the solution can also be represented by dual variables as

\alpha_{t+1} = [K + \sigma^2 I]^{-1} \mathcal{O}\hat\Psi_t,  (9)

where OΨ̂_t is calculated by plugging Eq. (7) into Eq. (5). According to [16], a suitable kernel subset of all samples, D_k = [x̃_m]_{m=1:M} with M ≪ N, can be built via an online selection in order to reduce the computational complexity, which enables learning in the high-dimensional state space.
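A compact sketch of the dual update of Eq. (9), assuming a Gaussian kernel over state-action vectors and a pre-computed target vector OΨ̂_t; the kernel choice and function names are illustrative, and the online kernel-subset selection of [16] is omitted for brevity.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth):
    """k(x, x') for all pairs between two batches of state-action vectors (rows)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def kdpp_dual_update(X, o_psi, sigma, bandwidth):
    """Eq. (9): alpha_{t+1} = (K + sigma^2 I)^{-1} O_Psi_hat, with K_ij = k(x_i, x_j)."""
    K = gaussian_kernel(X, X, bandwidth)
    return np.linalg.solve(K + sigma ** 2 * np.eye(len(X)), o_psi)

def kdpp_preferences(X_query, X_train, alpha, bandwidth):
    """Eq. (7): Psi_hat(x) = sum_i k(x, x_i) * alpha_i."""
    return gaussian_kernel(X_query, X_train, bandwidth) @ alpha
```

Working in the dual only requires the N × N Gram matrix, which is what makes the Woodbury rewriting of Eq. (8) attractive when the number of basis functions would otherwise explode with the state dimension.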
C. Factorial KDPP

KDPP has demonstrated efficient learning in the control of a pneumatic muscle-driven robotic hand with a 32-dimensional state space [16], where other conventional methods such as LSPI [17] could not. On the other hand, calculating \mathcal{M}_\eta\Psi_t(s) in Eq. (3) over a huge discrete action set A is intractable. For example, define M discrete actions for one control parameter. To control N parameters, the entire set of all possible actions grows to |A| = M^N (Solution 2 in Fig. 2), which is intractable with large M and N. As one way to maintain a tractable computational complexity, the experiments in [16] carefully coded the actions so that only one parameter could be controlled in each action (Solution 1 in Fig. 2), which reduces the size of the action set to MN. However, this trick clearly weakens control capability and is not suitable for tasks requiring the simultaneous control of several units (e.g. chemical plant control). To address this, Factorial Kernel Dynamic Policy Programming (FKDPP) is proposed to learn the action space dimension by dimension, separately, under the following KDPP framework:

\pi(a|s) = \prod_{n=1}^{N} \pi^{(n)}(a^{(n)}|s).  (10)

FKDPP divides the discrete action set for N control parameters, A, into N subsets A_n, each of which contains only M discrete actions, and assigns them to N KDPP agents respectively. The policy π(a|s) then turns into N policies π^(n)(a^(n)|s), as in Solution 3 in Fig. 2. With each agent searching M discrete values in its own subset, the N agents factorially cover all M^N discrete actions in A. This hugely decreases the computational complexity without losing control capability. Inheriting the regularization with the Kullback-Leibler divergence from KDPP, FKDPP features a factor-wise kernel-based smooth policy update that stabilizes the learning among multiple agents, since an over-large update of each agent's policy is avoided at every iteration.

FKDPP adds a loop to learn each subset of actions separately, according to Algorithm 1 (listed below). Each subset is allocated to an agent, which is updated with the corresponding samples generated through a soft-max exploration policy:

\pi^{n}_{\mathrm{explore}}(a^n|s) = \frac{\exp(\eta_{\mathrm{explore}} \hat\Psi^n_t(s, a^n))}{\sum_{a^{n\prime} \in \mathcal{A}_n} \exp(\eta_{\mathrm{explore}} \hat\Psi^n_t(s, a^{n\prime}))}.  (11)

By doing this, a task with a huge discrete action set over multiple control parameters is divided into several sub-tasks with smaller action sets that are tractable for KDPP. Details of KDPP's kernel subset selection and weight update (lines 9 and 10 in Algorithm 1) are covered in [16].

Algorithm 1: Factorial KDPP
Require: number of iterations T, number of action dimensions N.
1: Initialize kernel subsets D_n^Kernel = Ø, n = 1, ..., N.
2: for iteration t = 0, 1, 2, ..., T
3:   if t == 0
4:     Generate samples in D_t with random policies π_n^0.
5:   else
6:     Generate samples in the t-th iteration D_t by setting π_n^t, n = 1, ..., N, as soft-max exploration policies.
7:   end if
8:   for each dimension of actions n = 1, 2, ..., N
9:     Select samples from D_t to build the kernel subset for the n-th dimension, D_n^Kernel.
10:    Update the dual weight vector α_n with samples D_i, i = 0, 1, ..., t, and kernel subset D_n^Kernel, following KDPP.
11:  end for
12: end for
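The factorial policy of Eqs. (10)-(11) can be sketched as follows: each agent owns the M discrete actions of one control parameter and explores with its own soft-max policy, so a joint action is assembled from N independent draws. The class layout is our own, and the random values standing in for the kernel-based preferences Ψ̂^n_t are placeholders, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

class FactorAgent:
    """One KDPP agent owning the M discrete actions of a single control parameter."""
    def __init__(self, actions, eta_explore):
        self.actions = np.asarray(actions, dtype=float)   # subset A_n, |A_n| = M
        self.eta = eta_explore

    def preferences(self, state):
        # Placeholder for the kernel-based preferences Psi_hat^n_t(s, a); random for illustration.
        return rng.normal(size=len(self.actions))

    def explore(self, state):
        """Soft-max exploration over this agent's own subset only (Eq. 11)."""
        psi = self.preferences(state)
        p = np.exp(self.eta * (psi - psi.max()))
        p /= p.sum()
        return rng.choice(self.actions, p=p)

# N = 4 factors with M = 10 actions each: four policies over 10 values instead of one over 10^4.
agents = [FactorAgent(np.linspace(-1.0, 1.0, 10), eta_explore=1.0) for _ in range(4)]
state = np.zeros(9)                                        # nine-dimensional plant state
joint_action = [agent.explore(state) for agent in agents]  # one increment per control parameter
```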
Fig. 2: Solutions for handling a huge discrete action set in reinforcement learning. Solution 1 (limited discrete action set A′): each action controls only one parameter, so |A′| = MN, but control capability is lost. Solution 2 (full discrete action set A): each action controls all parameters, so |A| = M^N, which is intractable for large M and N. Solution 3 (factorial action set): the action set is factorized into N subsets with |A^(n)| = M, and N agents each learn a policy π^(n)(a^(n)|s) for one subset, reducing the search space from M^N to MN without losing control capability.

IV. EXPERIMENTAL RESULTS

In this section, FKDPP and KDPP are applied to the task introduced in Section II-B. According to Table I, the state space has nine dimensions, including the observed values (FI560.F, LI550.L, s407.F, QI531.Q and QI560.Q) and the control parameters (FC550.SVM, TC540.SVM, TC501.SVM and PC501.SVM). The discrete action set is defined as increasing/decreasing the four control parameters by 10 discrete actions each (M = 10), following: a_FC550 ∈ [−2, 2], a_TC540 ∈ [−4, 4], a_TC501 ∈ [−10, 10] and a_PC501 ∈ [−2, 2]. The total number of actions is close to 10^4, resulting in an intractable computation in Eq. (3) for KDPP under Solution 2 in Fig. 2. Therefore, Solution 1 is used in KDPP to consider only M × N = 40 action combinations. For FKDPP, the action space with 10^4 actions is factorized by four agents, with M = 10 actions for each agent.
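The action-set sizes quoted above can be checked directly; the grids below are an illustrative reading of the stated ranges, assuming the M = 10 actions per parameter are evenly spaced increments (the paper does not specify the spacing).

```python
import numpy as np

# Assumed even spacing of the M = 10 per-parameter increments over the stated ranges.
grids = {
    "FC550.SVM": np.linspace(-2, 2, 10),
    "TC540.SVM": np.linspace(-4, 4, 10),
    "TC501.SVM": np.linspace(-10, 10, 10),
    "PC501.SVM": np.linspace(-2, 2, 10),
}
joint_actions = int(np.prod([len(g) for g in grids.values()]))   # Solution 2: 10^4 combinations
factored_actions = sum(len(g) for g in grids.values())           # Solution 1 / FKDPP cost: 40
print(joint_actions, factored_actions)                           # 10000 40
```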
The VAM plant simulation used in this experiment is detailed in [8] and is implemented on the commercial dynamic simulator Visual Modeler, developed by Omega Simulation Co., Ltd. Each algorithm is trained over 30 iterations, with each iteration consisting of 200 steps. Each step simulates approx. 30 minutes due to the lengthy chemical processes. The reward function used by both methods is defined as:

R = 30 × FI560.F − 2 × LI550.L − 5 × QI560.Q − 20 × QI531.Q.  (12)

It follows the strategy of giving a high reward when FI560.F is increased (VAM quantity up) or QI560.Q is decreased (VAM quality up). If any critical condition in Table I is violated, the reward is sharply decreased.
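A direct transcription of Eq. (12), with the constraint handling described above, might look like the sketch below; the magnitude of the violation penalty is an assumption, since the paper only states that the reward is sharply decreased when a limit in Table I is broken.

```python
def vam_reward(obs, violation_penalty=1000.0):
    """Eq. (12) plus an assumed penalty when a Table I control criterion is violated."""
    r = (30.0 * obs["FI560.F"] - 2.0 * obs["LI550.L"]
         - 5.0 * obs["QI560.Q"] - 20.0 * obs["QI531.Q"])
    violated = (obs["LI550.L"] >= 100.0 or obs["s407.F"] <= 0.0
                or obs["QI531.Q"] >= 500.0 or obs["QI560.Q"] >= 150.0)
    return r - violation_penalty if violated else r
```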
Figures 3a and 3b show the mean and variance of VAM yield (FI560.F) and quality (QI560.Q) for each 200-step period during 30 iterations of learning. FKDPP quickly converged to a good policy that maximized production (Fig. 3a), while KDPP resulted in a lower mean and larger variance due to system instability (e.g. depleting the tank or causing other catastrophic failure scenarios). FKDPP also had a lower mean of QI560.Q during learning compared with KDPP, as shown in Fig. 3b.

After 30 learning iterations, the policies of FKDPP and KDPP were tested. According to Figs. 3c and 3d, FKDPP outperformed KDPP in both VAM quantity and quality. The sequences of control parameters are shown in Figs. 3e and 3f, where FKDPP facilitates exploration with multiple control parameters at each step. On the other hand, KDPP only operates a single control parameter at each step, which is insufficient in terms of exploration, especially when each step lasts for 30 minutes. In Fig. 3f, KDPP has to use more iterations to operate each dimension of the action, resulting in worse performance. These results show that, under identical learning conditions, the proposed FKDPP achieves superior performance compared with methods in which the policy is not factorized.

In terms of computational complexity for this task, with its nine-dimensional state and an action set over four control parameters, FKDPP took an average of 0.0017 s per step to search over 10^4 actions, while KDPP took 0.0014 s to consider 10 × 4 actions. Because FKDPP reduces the computational complexity by limiting the number of action combinations entering Eq. (3), its computation time with 10 actions per agent is only negligibly slower than running a single agent with 40 actions in the case of KDPP.

V. DISCUSSIONS AND CONCLUSIONS

In this paper, a novel reinforcement learning approach, Factorial Kernel Dynamic Policy Programming (FKDPP), which is applicable to tasks with both a multi-dimensional state and a huge action set, was proposed and successfully applied to the Vinyl Acetate Monomer plant model control problem as a first step towards controlling complex chemical plants. Besides utilizing the smooth policy update and kernel tricks to learn efficiently in a high-dimensional state space, FKDPP also factorizes its action space over multiple control parameters into several small subsets and learns them separately. Its performance was investigated in a challenging task with a nine-dimensional state and 10^4 discrete actions for four control parameters. Experimental results show that FKDPP can outperform methods that do not factorize the action space in optimizing both the quality and quantity of VAM production while maintaining plant stability. FKDPP was able to search 10^4 actions with a computation time comparable to a more conventional method with a limited 10 × 4 action set.

For future work, we aim to apply FKDPP to control the whole VAM plant, not only to further improve VAM production but also to optimize the system's stability and by-product recycling. Algorithmically, the effect of the Kullback-Leibler divergence should be evaluated. Since the idea of factorizing the policy is based on the assumption that the conventional heuristic strategy always handles each control parameter independently, different factorized structures of the action space should also be investigated for better control performance.

VI. ACKNOWLEDGMENT

We gratefully acknowledge the support of the commercial dynamic simulator software Visual Modeler from Omega Simulation Co., Ltd. for this research.
Fig. 3: Learning results of the VAM plant simulation. (a) Mean and variance of VAM yield (FI560.F) during 30 iterations' learning, comparing FKDPP (factorial search in 10^4 actions) and KDPP (search in 40 actions). (b) Mean and variance of VAM quality (QI560.Q) during 30 iterations' learning; lower values indicate higher quality. (c) VAM yield with the policies after 30 iterations' learning; the purple line is the benchmark from the provided equilibrium parameters. (d) VAM quality with the policies after 30 iterations' learning; the purple line is the benchmark from the provided equilibrium parameters. (e) Control actions following the FKDPP policies after 30 iterations' learning. (f) Control actions following the KDPP policies after 30 iterations' learning.

REFERENCES

[1] A. Zheng, R. V. Mahajanam, and J. M. Douglas, "Hierarchical procedure for plantwide control system synthesis," AIChE Journal, vol. 45, no. 6, pp. 1255–1265, 1999.
[2] D. G. Olsen, W. Y. Svrcek, and B. R. Young, "Plantwide control study of a vinyl acetate monomer process design," Chemical Engineering Communications, vol. 192, no. 10, pp. 1243–1257, 2005.
[3] T. J. McAvoy, "Synthesis of plantwide control systems using optimization," Industrial & Engineering Chemistry Research, vol. 38, no. 8, pp. 2984–2994, 1999.
[4] R. Cheng, J. F. Forbes, and W. San Yip, "Dantzig–Wolfe decomposition and plant-wide MPC coordination," Computers & Chemical Engineering, vol. 32, no. 7, pp. 1507–1522, 2008.
[5] J. H. Lee and W. Wong, "Approximate dynamic programming approach for process control," Journal of Process Control, vol. 20, no. 9, pp. 1038–1048, 2010.
[6] M. L. Luyben and B. D. Tyréus, "An industrial design/control study for the vinyl acetate monomer process," Computers & Chemical Engineering, vol. 22, no. 7-8, pp. 867–877, 1998.
[7] R. Chen, K. Dave, T. J. McAvoy, and M. Luyben, "A nonlinear dynamic model of a vinyl acetate process," Industrial & Engineering Chemistry Research, vol. 42, no. 20, pp. 4478–4487, 2003.
[8] Y. Machida, S. Ootakara, H. Seki, Y. Hashimoto, M. Kano, Y. Miyake, N. Anzai, M. Sawai, T. Katsuno, and T. Omata, "Vinyl acetate monomer (VAM) plant model: A new benchmark problem for control and operation study," IFAC-PapersOnLine, vol. 49, no. 7, pp. 533–538, 2016.
[9] R. Chen and T. McAvoy, "Plantwide control system design: Methodology and application to a vinyl acetate process," Industrial & Engineering Chemistry Research, vol. 42, no. 20, pp. 4753–4771, 2003.
[10] H. Seki, M. Ogawa, T. Itoh, S. Ootakara, H. Murata, Y. Hashimoto, and M. Kano, "Plantwide control system design of the benchmark vinyl acetate monomer production plant," Computers & Chemical Engineering, vol. 34, no. 8, pp. 1282–1295, 2010.
[11] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[12] D. Ernst, M. Glavic, P. Geurts, and L. Wehenkel, "Approximate value iteration in the reinforcement learning context. Application to electrical power system control," International Journal of Emerging Electric Power Systems, vol. 3, no. 1, 2005.
[13] W. Liu, Y. Tan, and Q. Qiu, "Enhanced Q-learning algorithm for dynamic power management with performance constraint," in Proceedings of the Conference on Design, Automation and Test in Europe, pp. 602–605, European Design and Automation Association, 2010.
[14] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[15] J. Hoskins and D. Himmelblau, "Process control via artificial neural networks and reinforcement learning," Computers & Chemical Engineering, vol. 16, no. 4, pp. 241–251, 1992.
[16] Y. Cui, T. Matsubara, and K. Sugimoto, "Kernel dynamic policy programming: Applicable reinforcement learning to robot systems with high dimensional states," Neural Networks, vol. 94, pp. 13–23, 2017.
[17] M. G. Lagoudakis and R. Parr, "Least-squares policy iteration," The Journal of Machine Learning Research, vol. 4, pp. 1107–1149, 2003.
[18] T. Matsubara, V. Gómez, and H. J. Kappen, "Latent Kullback Leibler control for continuous-state systems using probabilistic graphical models," in The 30th Conference on Uncertainty in Artificial Intelligence, pp. 583–592, 2014.
[19] M. G. Azar, V. Gómez, and H. J. Kappen, "Dynamic policy programming," The Journal of Machine Learning Research, vol. 13, no. 1, pp. 3207–3245, 2012.
[20] Y. Cui, Practical Model-free Reinforcement Learning in Complex Robot Systems with High Dimensional States. PhD thesis, Nara Institute of Science and Technology, 2017.

