Hang Zhao1, Qijin She1, Chenyang Zhu1, Yin Yang2, Kai Xu1*
1National University of Defense Technology, 2Clemson University
…further complicate 3D-BPP in its real-world applications.

As an echo to those challenges, we design a deep reinforcement learning algorithm for 3D-BPP. To maximize its applicability, we carefully accommodate restrictions raised in actual usage. For instance, we require item placement to satisfy order dependence and to not induce unstable stacking. An item is immediately packed upon its arrival, and no adjustment is permitted after it is packed. To this end, we opt to formulate our problem as a constrained Markov decision process (CMDP) (Altman 1999) and propose a constrained DRL approach based on the on-policy actor-critic framework (Mnih et al. 2016; Wu et al. 2017).

In particular, we introduce a prediction-and-projection scheme for the training of constrained DRL. The agent first predicts a feasibility mask for the placement actions as an auxiliary task. It then uses the mask to modulate the action probabilities output by the actor. This supervision and projection enables the agent to learn a feasible policy very efficiently. We also show that our method is general, with the ability to handle lookahead items, multi-bin packing, and item re-orienting. With a thorough test and validation, we demonstrate that our algorithm outperforms existing methods by a noticeable margin. It even demonstrates human-level performance in a preliminary user study.

2 Related Work

1D-BPP is one of the most famous problems in combinatorial optimization, and the related literature dates back to the sixties (Kantorovich 1960). Many variants and generalizations of 1D-BPP arise in practical contexts such as the cutting stock problem (CSP), in which we want to cut bins to produce desired items of different weights while minimizing the total number of bins used. A comprehensive bibliography on 1D-BPP and CSP can be found in (Sweeney and Paternoster 1992). Known to be strongly NP-hard, most existing literature focuses on designing good heuristic and approximation algorithms and analyzing their worst-case performance (Coffman, Garey, and Johnson 1984). For example, the well-known greedy algorithm, the next fit algorithm (NF), has a linear time complexity of O(N), and its worst-case performance ratio is 2, i.e., NF needs at most twice as many bins as the optimal solution does (De La Vega and Lueker 1981). The first fit algorithm (FF) allows an item to be packed into previous bins that are not yet full, and its time complexity increases to O(N log N). The best fit algorithm (BF) aims to reduce the residual capacity of all the non-full bins. Both FF and BF have a better worst-case performance ratio of 17/10 than NF (Johnson et al. 1974). Pre-sorting all the items yields the offline version of those greedy strategies, sometimes also known as the decreasing version (Martello 1990). While straightforward, NF, FF, and BF form a foundation for more sophisticated approximations to 1D-BPP (e.g., see (Karmarkar and Karp 1982)) or its exact solutions (Martello and Toth 1990; Scholl, Klein, and Jürgens 1997; Labbé, Laporte, and Martello 1995; Delorme, Iori, and Martello 2016). We also refer the reader to the BPPLib library (Delorme, Iori, and Martello 2018), which includes implementations of most known algorithms for the 1D-BPP problem.

2D- and 3D-BPP are natural generalizations of the original BPP. Here, an item does not only have a scalar-valued weight but a high-dimensional size of width, height, and/or depth. The main difference between 1D- and 2D-/3D-packing problems is the verification of the feasibility of the packing, i.e., determining whether an accommodation of the items inside the bin exists such that items do not interpenetrate and the packing is within the bin size. The complexity and the difficulty increase significantly for high-dimensional BPP instances. In theory, it is possible to generalize exact 1D solutions like MTP (Martello and Toth 1990) or branch-and-bound (Delorme, Iori, and Martello 2016) algorithms to 2D-BPP (Martello and Vigo 1998) and 3D-BPP (Martello, Pisinger, and Vigo 2000). However, according to the timing statistics reported in (Martello, Pisinger, and Vigo 2000), exactly solving 3D-BPP at a size matching an actual parcel packing pipeline, which could deal with tens of thousands of parcels, remains infeasible. Resorting to approximation algorithms is a more practical choice for us. Hifi et al. (2010) proposed a mixed linear programming algorithm for 3D-BPP by relaxing the integer constraints in the problem. Crainic et al. (2008) refined the idea of corner points (Martello, Pisinger, and Vigo 2000), where an upcoming item is placed at so-called extreme points to better explore the unoccupied space in a bin. Heuristic local search iteratively improves an existing packing by searching within a neighbourhood function over the set of solutions. There have been several strategies for designing fast approximate algorithms, e.g., guided local search (Faroe, Pisinger, and Zachariasen 2003), greedy search (De Castro Silva, Soma, and Maculan 2003), and tabu search (Lodi, Martello, and Vigo 1999; Crainic, Perboli, and Tadei 2009). Similar strategies have also been adapted to online BPP (Ha et al. 2017; Wang et al. 2016). In contrast, genetic algorithms lead to better solutions as a global, randomized search (Li, Zhao, and Zhang 2014; Takahara and Miyamoto 2005).

Deep reinforcement learning (DRL) has demonstrated tremendous success in learning complex behaviour skills and solving challenging control tasks with high-dimensional raw sensory state spaces (Lillicrap et al. 2015; Mnih et al. 2015, 2016). The existing research can largely be divided into two lines: on-policy methods (Schulman et al. 2017; Wu et al. 2017; Zhao et al. 2018) and off-policy ones (Mnih et al. 2015; Wang et al. 2015; Barth-Maron et al. 2018). On-policy algorithms optimize the policy with agent-environment interaction data sampled from the current policy. While the inability to reuse old data makes them less data-efficient, updates calculated from on-policy data lead to stable optimization. In contrast, off-policy methods are more data-efficient but less stable. In our problem, agent-environment interaction data is easy to obtain (at 2000 FPS), thus data efficiency is not our main concern. We base our method on the on-policy actor-critic framework. In addition, we formulate online 3D-BPP as constrained DRL and solve it by projecting the trajectories sampled from the actor to the constrained state-action space, instead of resorting to more involved constrained policy optimization (Achiam et al. 2017).
        Figure 2: Left: The environment state of the agent includes the configuration of the bin (the grey boxes) and the size of the
        next item to be packed (green box). The bin configuration is parameterized as a height map H over a L × W grid. The feasibility
        mask M is a binary matrix of size L × W indicating the placement feasibility at each grid cell. The three dimensions of the next
        item are stored into a L × W × 3 tensor D. Right: The network architecture (the three losses other than the standard actor and
        critic losses are shown in red color).
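To make the state encoding in Figure 2 concrete, the following sketch is our own Python/NumPy illustration (not the authors' code): it assembles the height map H and broadcasts the next item's dimensions into the L × W × 3 tensor D described in the caption. Stacking them into a single array is our assumption about how the inputs could be fed to the state CNN.

```python
import numpy as np

L, W = 10, 10  # bin resolution used in the paper's experiments

def build_state(height_map, item_dims):
    """Stack the bin height map with the next item's dimensions.

    height_map: (L, W) integer array, current stacked height per cell.
    item_dims:  (l, w, h) of the next item to pack.
    Returns an (L, W, 4) array: channel 0 is H, channels 1-3 broadcast
    the item size over the grid (the L x W x 3 tensor D in Figure 2).
    """
    H = np.asarray(height_map, dtype=np.int32).reshape(L, W, 1)
    D = np.ones((L, W, 3), dtype=np.int32) * np.asarray(item_dims, dtype=np.int32)
    return np.concatenate([H, D], axis=-1)

# Example: empty bin, next item of size 2 x 3 x 2.
state = build_state(np.zeros((L, W)), (2, 3, 2))
print(state.shape)  # (10, 10, 4)
```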
RL for combinatorial optimization has a distinguished history (Gambardella and Dorigo 1995; Zhang and Dietterich 2000) and is still an active direction, with an especially intensive focus on TSP (Bello et al. 2016). Early attempts strive for heuristics selection using RL (Nareyek 2003). Bello et al. (2016) combined RL pretraining and active search and demonstrated that RL-based optimization outperforms supervised learning frameworks when tackling NP-hard combinatorial problems. Recently, Hu et al. (2017) proposed a DRL solution to 3D-BPP. Laterre et al. (2018) introduced a rewarding strategy based on self-play. Different from ours, these works deal with an offline setting where the main goal is to find an optimal sequence of items, inspired by the Pointer Network (Vinyals, Fortunato, and Jaitly 2015).

3 Method

In online 3D-BPP, the agent is agnostic of li, wi, or hi of all the items in I – only the immediately incoming ones Io ⊂ I are observable. As soon as an item arrives, we pack it into the bin, and no further adjustment will be applied. As the complexity of BPP decreases drastically for bigger items, we further constrain the sizes of all items to li ≤ L/2, wi ≤ W/2, and hi ≤ H/2. We start with our problem statement under the context of DRL and the formulation based on constrained DRL. We then show how we solve the problem via predicting action feasibility in the actor-critic framework.

3.1 Problem statement and formulation

The 3D-BPP can be formulated as a Markov decision process, which is a tuple (S, A, P, R). S is the set of environment states; A is the action set; R : S × A → R is the reward function; P : S × A × S → [0, 1] is the transition probability function. P(s′|s, a) gives the probability of transiting from s to s′ for a given action a. Our method is model-free since we do not learn P(s′|s, a). The policy π : S → A maps states to probability distributions over actions, with π(a|s) denoting the probability of selecting action a under state s. For DRL, we seek a policy π maximizing the accumulated discounted reward J(π) = E_{τ∼π}[Σ_{t=0}^{∞} γ^t R(s_t, a_t)]. Here, γ ∈ [0, 1] is the discount factor, and τ = (s_0, a_0, s_1, ...) is a trajectory sampled based on the policy π.

The environment state of 3D-BPP is comprised of two parts: the current configuration of the bin and the coming items to be placed. For the first part, we parameterize the bin by discretizing its bottom area as an L × W regular grid along the length (X) and width (Y) directions, respectively. We record at each grid cell the current height of stacked items, leading to a height map Hn (see Figure 2). Here, the subscript n indicates that n is the next item to be packed. Since all the dimensions are integers, Hn ∈ Z^{L×W} can be expressed as a 2D integer array. The dimensionality of item n is given as dn = [ln, wn, hn] ∈ Z^3. Working with integer dimensions helps to reduce the state/action space and accelerates the policy learning significantly. A spatial resolution of up to 30 × 30 is sufficient in many real scenarios.

Putting it together, the current environment state can be written as sn = {Hn, dn, dn+1, ..., dn+k−1}. We first consider the case where k = |Io| = 1, and name this special instance BPP-1. In other words, BPP-1 only considers the immediately coming item n, i.e., Io = {n}. We then generalize it to BPP-k with k > 1 afterwards.

BPP-1 In BPP-1, the agent places n's front-left-bottom (FLB) corner (Figure 2 (left)) at a certain grid point, or loading position (LP), in the bin. For instance, suppose the agent chooses to put n at the LP (xn, yn). This action is represented as an = xn + L · yn ∈ A, where the action set is A = {0, 1, ..., L · W − 1}. After an is executed, Hn is updated by adding hn to the maximum height over the cells covered by n: Hn(x, y) ← hmax(x, y) + hn for x ∈ [xn, xn + ln], y ∈ [yn, yn + wn], with hmax(x, y) being the maximum height among those cells. The state transition is deterministic: P(H′|Hn, an) = 1 for H′ = H′n and P(H′|Hn, an) = 0 otherwise.
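As a concrete illustration of the BPP-1 action encoding and the deterministic height-map transition described above, here is a minimal sketch (our own Python/NumPy rendering, not the authors' implementation; the inclusive index ranges in the text are interpreted as the l × w cell footprint):

```python
import numpy as np

L, W = 10, 10

def decode_action(a):
    """Recover the loading position (x, y) from the flat index a = x + L * y."""
    return a % L, a // L

def place_item(H, a, item_dims):
    """Apply a BPP-1 placement: raise the item's footprint to hmax + h.

    H: (L, W) height map; a: flat action index; item_dims: (l, w, h).
    Returns the updated height map (the transition is deterministic).
    """
    l, w, h = item_dims
    x, y = decode_action(a)
    footprint = H[x:x + l, y:y + w]
    hmax = footprint.max()          # highest cell under the item
    H = H.copy()
    H[x:x + l, y:y + w] = hmax + h  # the item rests flat on that height
    return H

H = np.zeros((L, W), dtype=int)
H = place_item(H, a=0, item_dims=(2, 3, 2))  # put a 2x3x2 item at the origin
```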
During packing, the agent needs to secure enough space in the bin to host item n. Meanwhile, it is equally important to have n statically equilibrated by the support underneath at the LP so that the entire stack of items is physically stable. Rigorously evaluating the physical stability at an LP is involved, taking into account n's center of mass, moment of inertia, and rotational stability (Goldstein, Poole, and Safko 2002). All of these are normally unknown, as the mass distribution differs among items. To this end, we employ a conservative and simplified criterion. Specifically, an LP is considered feasible if it not only provides sufficient room for n but also satisfies any of the following conditions with n placed: 1) over 60% of n's bottom area and all four of its bottom corners are supported by existing items; or 2) over 80% of n's bottom area and three out of four bottom corners are supported; or 3) over 95% of n's bottom area is supported. We store the feasibility of all the LPs for item n in a feasibility mask Mn, an L × W binary matrix (also see Figure 2).

Since not all actions are allowed, our problem becomes a constrained Markov decision process (CMDP) (Altman 1999). Typically, one augments the MDP with an auxiliary cost function C : S × A → R mapping state-action tuples to costs, and requires that the expectation of the accumulated cost be bounded by cm: JC(π) = E_{τ∼π}[Σ_{t=0}^{∞} γ_C^t C(s_t, a_t)] ≤ cm. Several methods have been proposed to solve CMDPs based on, e.g., algorithmic heuristics (Uchibe and Doya 2007), primal-dual methods (Chow et al. 2017), or constrained policy optimization (Achiam et al. 2017). While these methods are proven effective, it is unclear how they fit 3D-BPP instances, where the constraint is rendered as a discrete mask. In this work, we propose to exploit the mask M to guide the DRL training, enforcing the feasibility constraint without introducing excessive training complexity.

Feasibility constraints We devise a prediction-and-projection mechanism to enforce feasibility constraints. First, we introduce an independent multilayer perceptron module, namely the mask predictor (Figure 2 (right)), to predict the feasibility mask Mn for the item n. The predictor takes the state CNN features of the current state as input and is trained with the ground-truth mask as supervision. Next, we use the predicted mask to modulate the output, i.e., the probability distribution of the actions. In theory, if the LP at (x, y) is infeasible for n, the corresponding probability P(an = x + L·y | sn) should be set to 0. However, we find that setting it to a small positive quantity, e.g., 10⁻³, works better in practice – it provides a strong penalty to an invalid action but a smoother transformation beneficial to the network training. The inset shows that softening the mask-based modulation improves the training convergence. To further discourage infeasible actions, we explicitly minimize the summed probability of all infeasible LPs: Einf = Σ_{(x,y): Mn(x,y)=0} P(an = x + L·y | sn), which is plugged into the final loss function for training.

Loss function Our loss function is a composite of the standard actor and critic losses, the mask prediction loss Lmask, the infeasibility loss Einf, and an action entropy term; their definitions are given in the supplemental material (Section B).
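A minimal sketch of the projection step above (our own PyTorch-style illustration, not the authors' implementation; the 10⁻³ floor is treated as a hyperparameter, and whether Einf is measured before or after renormalization is our assumption):

```python
import torch

def project_action_distribution(logits, mask, eps=1e-3):
    """Modulate actor logits with the predicted feasibility mask.

    logits: (L*W,) raw actor outputs; mask: (L*W,) with 1 = feasible LP.
    Infeasible LPs keep a small probability eps instead of exactly 0,
    which penalizes invalid actions while keeping training smooth.
    """
    probs = torch.softmax(logits, dim=-1)
    soft_mask = mask + (1.0 - mask) * eps      # 1 where feasible, eps where not
    probs = probs * soft_mask
    probs = probs / probs.sum()                # renormalize to a distribution
    e_inf = (probs * (1.0 - mask)).sum()       # summed probability of infeasible LPs
    return probs, e_inf

logits = torch.randn(100)                      # L = W = 10
mask = (torch.rand(100) > 0.5).float()
probs, e_inf = project_action_distribution(logits, mask)
```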
Table 1: This ablation study compares the space utilization and the total number of packed items with different combinations of MP, MC and FE, on the CUT-2 dataset.
Figure 3: The permutation tree for Io = {1, 2, 3}. To find the best packing for item 1, our method explores different virtual placing orders satisfying the order dependence constraint, e.g., 1 cannot be placed on top of the virtually placed 2 or 3.

To make the current placement account for the future ones, we opt to "hallucinate" the placement of future items by updating the height map accordingly. Conditioned on the virtually placed future items, the decision for the current item can be globally more optimal. However, such virtual placement must satisfy the order dependence constraint, which stipulates that earlier items should never be packed on top of later ones. In particular, given two items p and q, p < q in Io, if q is (virtually) placed before p, we require that the placement of p be spatially independent of the placement of q. This means p can never be packed at any LPs that overlap with q. The constraint is enforced by setting the height values in H at the corresponding LPs to H, the maximum height value allowed: Hp(x, y) ← H for all x ∈ [xq, xq + lq] and y ∈ [yq, yq + wq]. Combining explicit height map updating with feasibility mask prediction, the agent utilizes the trained policy with the order dependence constraint satisfied implicitly.

Monte Carlo permutation tree search We opt to search for a better an by exploring the permutations of the sequence (dn, ..., dn+k−1). This amounts to a permutation tree search during which only the actor network is evaluated – no training is needed. Figure 3 shows a k-level permutation tree: a path (r, v1, v2, ..., vk) from the root to a leaf forms a possible permutation of the placement of the k items in Io, where r is the (empty) root node and item(vi) denotes the i-th item placed in the permutation. item(vi) < item(vj) means item(vi) arrives before item(vj) in the actual order. If i > j along a permutation path, meaning that item(vj) is virtually placed before item(vi), we block the LPs corresponding to item(vj)'s occupancy to avoid placing item(vi) on top of item(vj). Clearly, enumerating all the permutations of k items quickly becomes prohibitive with an O(k!) complexity. To make the search scalable, we adapt Monte Carlo tree search (MCTS) (Silver et al. 2017) to our problem. With MCTS, the permutation tree is expanded in a priority-based fashion by evaluating how likely a node leads to the optimal solution. The latter is evaluated by sampling a fixed number of paths starting from that node and computing for each path a value summing the accumulated reward and the critic value ("reward to go") at the leaf (k-th level) node. After the search, we choose the action an corresponding to the permutation with the highest path value. Please refer to the supplemental material for more details on our adaptations of the standard MCTS. MCTS allows a scalable lookahead for BPP-k with a complexity of O(km), where m is the number of paths sampled.

4 Experiments

We implement our framework on a desktop computer (Ubuntu 16.04) equipped with an Intel Xeon Gold 5115 CPU @ 2.40 GHz, 64 GB memory, and an Nvidia Titan V GPU with 12 GB memory. The DRL and all other networks are implemented with PyTorch (Paszke et al. 2019). The model training takes about 16 hours at a spatial resolution of 10 × 10. The test time of the BPP-1 model (no lookahead) is less than 10 ms. Please refer to the supplemental material for more implementation details.

Training and test set We set L = W = H = 10 in our experiments with 64 pre-defined item dimensions (|I| = 64). Results with higher spatial resolutions are given in the supplemental material. We also set li ≤ L/2, wi ≤ W/2, and hi ≤ H/2 to avoid over-simplified scenarios. The training and test sequences are synthesized by generating items out of I such that the total volume of items is equal to or bigger than the bin's volume. We first create a benchmark called RS where the sequences are generated by sampling items out of I randomly. A disadvantage of random sampling is that the optimality of a sequence is unknown (unless one performs a brute-force search). Without knowing whether the sequence would lead to a successful packing, it is difficult to gauge the packing performance with this benchmark.

Therefore, we also generate training sequences via cutting stock (Gilmore and Gomory 1961). Specifically, items in a sequence are created by sequentially "cutting" the bin into items of the 64 pre-defined types, so that we know the sequence can be perfectly packed, restoring the bin. There are two variations of this strategy. CUT-1: after the cutting, we sort the resulting items into the sequence based on the Z coordinates of their FLBs, from bottom to top. If the FLBs of two items have the same Z coordinate, their order in the sequence is randomly determined. CUT-2: the cut items are sorted based on their stacking dependency: an item can be added to the sequence only after all of its supporting items are there. A 2D toy example is given in the inset figure with the FLB of each item highlighted. Under CUT-1, both {1, 2, 3, 4} and {2, 1, 3, 4} are valid item sequences. If we use CUT-2 on the other hand, {1, 3, 2, 4} and {2, 4, 1, 3} would also be valid sequences, as the placement of 3 or 4 depends on 1 or 2. For the testing purpose, …
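For concreteness, the sketch below is our own illustration in Python (the authors' Algorithm 3 in the supplemental is the authoritative generator): it enumerates a 64-type item set with edge lengths from 2 to 5 (consistent with the ranges shown in Figure 11), and samples an RS-style sequence whose total item volume reaches the bin volume.

```python
import itertools
import random

L = W = H = 10                      # bin size used in the paper's experiments
ITEM_TYPES = [d for d in itertools.product(range(2, 6), repeat=3)]  # 64 types, edges 2..5

def sample_rs_sequence(rng=random):
    """Sample an RS-style item sequence: random item types until the
    cumulative item volume reaches the bin volume (10 x 10 x 10)."""
    sequence, volume = [], 0
    while volume < L * W * H:
        item = rng.choice(ITEM_TYPES)
        sequence.append(item)
        volume += item[0] * item[1] * item[2]
    return sequence

print(len(ITEM_TYPES))           # 64
print(sample_rs_sequence()[:5])  # first few items of one random sequence
```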
Figure 5: HM shows a clear advantage over vector-based height parameterizations (HV and ISV).
Ablation study Table 1 reports an ablation study. From the results, we find that the packing performance drops significantly if we do not incorporate the feasibility mask prediction (MP) during training. The performance is impaired if the mask constraint (MC) is not enforced with our projection scheme. The feasibility-based entropy (FE) is also beneficial for both the training and the final performance. Figure 4 demonstrates the packing results visually for different method settings.

Height parameterization Next, we show that the environment parameterization using the proposed 2D height map (HM) (i.e., the H matrix) is necessary and effective. To this end, we compare our method using HM against variants employing vector-based height parameterizations (see Figure 5).

Scalability of BPP-k With the capability of lookahead, it is expected that the agent better exploits the remaining space in the bin and delivers a more compact packing. On the other hand, due to the NP-hard nature, big k values increase the environment space exponentially. Therefore, it is important to understand whether MCTS is able to effectively navigate a space at the scale of O(k!) towards a good packing strategy. In Figure 7(a,b), we compare our method with a brute-force permutation search, which traverses all k! permutations of the k coming items and chooses the best packing strategy (i.e., the global optimum). We also compare to MCTS-based ac…

# bins   Space uti.   # items per bin   # total items   Decision time
1        67.4%        17.6              17.6            2.2 × 10⁻³ s
4        69.4%        18.8              75.2            6.3 × 10⁻³ s
9        72.1%        19.1              171.9           1.8 × 10⁻² s
16       75.3%        19.6              313.6           2.8 × 10⁻² s
25       77.8%        20.2              505.0           4.5 × 10⁻² s

Method                   # items / % Space uti.
                         RS             CUT-1           CUT-2
Boundary rule (Online)   8.7 / 34.9%    10.8 / 41.2%    11.1 / 40.8%
BPH (Online)             8.7 / 35.4%    13.5 / 51.9%    13.1 / 49.2%
LBP (Offline)            12.9 / 54.7%   14.9 / 59.1%    15.2 / 59.5%
Our BPP-1 (Online)       12.2 / 50.5%   19.1 / 73.4%    17.5 / 66.9%
References

Bello, I.; Pham, H.; Le, Q. V.; Norouzi, M.; and Bengio, S. 2016. Neural combinatorial optimization with reinforcement learning. arXiv preprint arXiv:1611.09940.

Chow, Y.; Ghavamzadeh, M.; Janson, L.; and Pavone, M. 2017. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18(1): 6070–6120.

Coffman, E. G.; Garey, M. R.; and Johnson, D. S. 1984. Approximation algorithms for bin-packing: an updated survey. In Algorithm design for computer system design, 49–106. Springer.

Crainic, T. G.; Perboli, G.; and Tadei, R. 2008. Extreme point-based heuristics for three-dimensional bin packing. Informs Journal on Computing 20(3): 368–384.

Crainic, T. G.; Perboli, G.; and Tadei, R. 2009. TS2PACK: A two-level tabu search for the three-dimensional bin packing problem. European Journal of Operational Research 195(3): 744–760.

De Castro Silva, J.; Soma, N.; and Maculan, N. 2003. A greedy search for the three-dimensional bin packing problem: the packing static stability case. International Transactions in Operational Research 10(2): 141–153.

De La Vega, W. F.; and Lueker, G. S. 1981. Bin packing can be solved within 1 + ε in linear time. Combinatorica 1(4): 349–355.

Hifi, M.; Kacem, I.; Negre, S.; and Wu, L. 2010. A linear programming approach for the three-dimensional bin-packing problem. Electronic Notes in Discrete Mathematics 36: 993–1000.

Hu, H.; Zhang, X.; Yan, X.; Wang, L.; and Xu, Y. 2017. Solving a new 3D bin packing problem with deep reinforcement learning method. arXiv preprint arXiv:1708.05930.

Johnson, D. S.; Demers, A.; Ullman, J. D.; Garey, M. R.; and Graham, R. L. 1974. Worst-case performance bounds for simple one-dimensional packing algorithms. SIAM Journal on Computing 3(4): 299–325.

Kantorovich, L. V. 1960. Mathematical methods of organizing and planning production. Management Science 6(4): 366–422.

Karabulut, K.; and İnceoğlu, M. M. 2004. A hybrid genetic algorithm for packing in 3D with deepest bottom left with fill method. In International Conference on Advances in Information Systems, 441–450. Springer.

Karmarkar, N.; and Karp, R. M. 1982. An efficient approximation scheme for the one-dimensional bin-packing problem. In 23rd Annual Symposium on Foundations of Computer Science (sfcs 1982), 312–320. IEEE.

Korte, B.; and Vygen, J. 2012. Bin packing. In Combinatorial Optimization, 499–516. Springer.

Li, X.; Zhao, Z.; and Zhang, K. 2014. A genetic algorithm for the three-dimensional bin packing problem with heterogeneous bins. In IIE Annual Conference. Proceedings, 2039. Institute of Industrial and Systems Engineers (IISE).

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Lodi, A.; Martello, S.; and Vigo, D. 1999. Approximation algorithms for the oriented two-dimensional bin packing problem. European Journal of Operational Research 112(1): 158–166.

Man Jr, E. C.; Garey, M.; and Johnson, D. 1996. Approximation algorithms for bin packing: A survey. Approximation algorithms for NP-hard problems, 46–93.

Martello, S. 1990. Knapsack problems: algorithms and computer implementations. Wiley-Interscience Series in Discrete Mathematics and Optimization.

Martello, S.; Pisinger, D.; and Vigo, D. 2000. The three-dimensional bin packing problem. Operations Research 48(2): 256–267.

Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 8024–8035.

Scholl, A.; Klein, R.; and Jürgens, C. 1997. Bison: A fast hybrid procedure for exactly solving the one-dimensional bin packing problem. Computers & Operations Research 24(7): 627–645.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017. Mastering the game of Go without human knowledge. Nature 550(7676): 354–359.

Skiena, S. S. 1997. The Stony Brook algorithm repository. http://www.cs.sunysb.edu/algorith/implement/nauty/implement.shtml.

Sweeney, P. E.; and Paternoster, E. R. 1992. Cutting and packing problems: a categorized, application-orientated research bibliography. Journal of the Operational Research Society 43(7): 691–706.

Takahara, S.; and Miyamoto, S. 2005. An evolutionary approach for the multiple container loading problem. In Fifth International Conference on Hybrid Intelligent Systems (HIS'05), 6–pp. IEEE.

Uchibe, E.; and Doya, K. 2007. Constrained reinforcement learning from intrinsic and extrinsic rewards. In 2007 IEEE 6th International Conference on Development and Learning, 163–168. IEEE.

Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. In Advances in Neural Information Processing Systems, 2692–2700.

Zhao, Y.; Xu, K.; Zhu, E.; Liu, X.; Zhu, X.; and Yin, J. 2018. Triangle lasso for simultaneous clustering and optimization in graph datasets. IEEE Transactions on Knowledge and Data Engineering 31(8): 1610–1623.
In this supplemental document, we report more implementation details, including the network architecture, training configurations, and the Monte Carlo permutation tree search.
• The user study design is reported in Section E.
• Section F analyzes the performance difference between step-wise reward and termination reward in our problem.
• The details of the reward function to penalize unsafe place…

B Implementation Details

We report the details of our implementation in this section, and our source code is also submitted with this supplemental material.

Network architecture and training configurations A detailed specification of our network is shown in Figure 9. Our pipeline consists of three major components: an actor network, a critic network, and the feasibility mask predictor. It takes three inputs: the height map Hn and the dimensionality dn = [ln, wn, hn] ∈ Z³ of the current item n to be packed as the state, and the feasibility mask Mn as the ground truth. Note that Mn is only used during training.

The whole network is trained via a composite loss consisting of the actor loss Lactor, critic loss Lcritic, mask prediction loss Lmask, infeasibility loss Einf, and action entropy loss Eentropy. The recoverable definitions are:

Lactor = (Rn − V(sn)) log P(an | sn),
Lcritic = (Rn − V(sn))²,
Lmask = Σ_{(x,y)} (M^gt_n(x, y) − M^pred_n(x, y))²,
Einf = Σ_{(x,y): Mn(x,y)=0} P(an = x + L·y | sn).

Figure 9: Detailed network architecture.

Afterwards, a softmax operation is adopted to output the final action distribution. Note that the infeasibility penalty can be absent at test time thanks to Einf, and our method still works well in this situation.

Monte Carlo permutation tree search Our algorithm is inspired by the Monte Carlo tree search of (Silver et al. 2017). The main differences are: first, the goal of our MCTS is to find the best packing order for the next k items; second, the max reward is used in our MCTS instead of the mean reward as the evaluation strategy. Algorithm 1 outlines the entire procedure step by step, wherein T is the maximum simulation time, I is the set of lookahead items, Last is the first item after I, i0 is the current item, s is the state input of each node, and a is the action of each node. The environment simulator SimEnv takes the height map, the next item dimension, and the action (position of the item) as input, and returns the updated height map. The action-choosing function uses the policy network from the BPP-1 model to select the action with the highest probability. N is the number of visits of a node, and Q is the expected return from a node. We test our method on our three benchmarks, and more results can be found in Figure 10.
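The sketch below is our own PyTorch illustration of how the composite loss listed in Section B above could be assembled; the weighting coefficients are hypothetical placeholders, the entropy term follows the usual actor-critic form, and the actor term uses the standard minimized policy-gradient sign convention.

```python
import torch
import torch.nn.functional as F

def composite_loss(probs, action, value, ret, pred_mask, gt_mask,
                   w_critic=0.5, w_mask=1.0, w_inf=1.0, w_ent=0.01):
    """Combine the actor, critic, mask-prediction, infeasibility, and entropy terms.

    probs: (L*W,) projected action distribution; action: chosen index.
    value: critic output V(s_n); ret: return R_n; masks: (L*W,) in [0, 1].
    """
    advantage = (ret - value).detach()
    l_actor = -advantage * torch.log(probs[action] + 1e-8)
    l_critic = (ret - value).pow(2)
    l_mask = F.mse_loss(pred_mask, gt_mask, reduction='sum')
    e_inf = (probs * (1.0 - gt_mask)).sum()             # probability mass on infeasible LPs
    e_entropy = -(probs * torch.log(probs + 1e-8)).sum()
    return l_actor + w_critic * l_critic + w_mask * l_mask + w_inf * e_inf - w_ent * e_entropy
```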
Figure 10: Packing performance on (a) RS, (b) CUT-1, and (c) CUT-2. When k increases, the space utilization rate first goes up and then enters a "plateau" zone.
…Algorithm 2. Here, val is the last state value estimation of bin b, V is the state value estimation function via a value network, n is the current item, B is the set of bins, and H is the set of height maps of the bins. The default score for each bin at the beginning is sdef = −0.2.

Orientated items Our method can also incorporate items with different orientations. We multiply the action space and the related mask by how many different orientations are considered; e.g., we will have an m-times larger feasibility mask Mn and action space if m different poses are allowed for an item. Doing so induces more flexibility for the packing, and it potentially leads to a better result. This is observed and reported in Table 4. Note that orientation only happens around the Z axis in our problem setting.

Table 4: Performance comparison with and without orientation on different benchmarks.

                   RS       CUT-1    CUT-2
w/ orientation     62.1%    76.2%    70.2%
w/o orientation    50.5%    73.4%    66.9%

C Benchmark Construction

All 64 pre-defined items are visualized in Figure 11. Algorithm 3 outlines how the dataset is constructed given the bin size L, W, H and a valid item size threshold.

The sequences in the RS benchmark are generated by random sampling. Each item along the sequence is picked out of our pre-defined item set I randomly. However, as everything is random, we do not know the optimal packing configuration of an RS sequence ourselves (unless we run an exhaustive branch-and-bound search (Martello, Pisinger, and Vigo 2000), which is much too time consuming to accomplish). For a better quantitative evaluation, we also generate item sequences via cutting stock (Gilmore and Gomory 1961). It is clear that a sequence created by cutting the bin can be packed back into the bin perfectly, with a space utilization of 100%. Algorithm 3 provides the detailed procedure of the data generation.

D Heuristic Baseline Method

Online BPP is an under-investigated problem. To better demonstrate the effectiveness of our method, we design a heuristic baseline approach against which to evaluate the performance of our DRL method. We report the details of this baseline in this section. The method is designed based on a simple observation: a human would try to keep the packed volume of the bin regular during packing, so as to keep the remaining free space as large as possible. Such "regularity" is used as the metric to measure a packing action.

To describe the regularity of a bin, we introduce the concept of the spare cuboid. As shown in Figure 12, a spare cuboid is an unoccupied, rectangular space in the bin, and the regularity of a bin is defined based on its maximum spare cuboids. Intuitively, we would like to have a bigger maximum spare cuboid. If a packed bin has many small-size spare cuboids, it implies the remaining space of this bin is not "regular". As illustrated in Figure 12, the right packing strategy would leave the biggest spare cuboid. The regularity of the bin is then defined by the maximum rectangular residual space, or maximum spare cuboid. Since I is pre-defined, we know how many items can be packed into a maximum spare cuboid. Based on this, we rate each maximum spare cuboid c by the number of item types that can be packed into it: RS_c = |I_valid| + c_volume, with I_valid ⊆ I. If a maximum spare cuboid fits all the items in I, an additional reward is given: RS_c = |I| + c_volume + 10. The final score BS_p of a bin after packing the current item at p is the sum of RS_c over its maximum spare cuboids c, and we can find the best packing position p_best as:

p_best = argmax_p (1/|C|) Σ_{c∈C} RS_c     (3)

E User Study

Figure 13 shows the interface of our user study app, which consists of two parts: visualization and action space. The test sequences are randomly picked from the CUT-2 test set. Users can drag our UI to change the viewing angle and thus obtain a full observation of the packed items. To help users make better decisions, our app allows them to choose any suggestion circle in the action space and virtually place the item before they make…
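To make the regularity score of Section D concrete, here is a small sketch (our own Python illustration, under the assumption that the maximum spare cuboids of a candidate placement have already been extracted; the scoring constants follow the text above):

```python
def spare_cuboid_score(cuboid_dims, item_types, bonus=10):
    """Rate one maximum spare cuboid: number of item types it can host,
    plus its volume, plus a bonus if it fits every type in the item set."""
    cl, cw, ch = cuboid_dims
    fits = [d for d in item_types if d[0] <= cl and d[1] <= cw and d[2] <= ch]
    volume = cl * cw * ch
    score = len(fits) + volume
    if len(fits) == len(item_types):
        score += bonus
    return score

def position_score(max_spare_cuboids, item_types):
    """Score a candidate loading position by averaging over its maximum
    spare cuboids; the best position maximizes this score (Eq. 3)."""
    scores = [spare_cuboid_score(c, item_types) for c in max_spare_cuboids]
    return sum(scores) / len(scores) if scores else 0.0
```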
Figure 11: The 64 pre-defined item types, with dimensions ranging from (2, 2, 2) to (5, 5, 5).
Figure 14: Left: Packing performance at different resolutions; Size-10 means the resolution of the test bin is 10 × 10 × 10, etc. Second from right: Imposing the C&B rule leads to inferior performance (lower average reward).
Figure 15: Both (a) and (b) have the same height map but different space utilizations, which is ambiguous for the agent when predicting the state value given only a termination reward.

(Figure panels: (a) before packing the red item; (b) following the C&B rule, 16 items; (c) our method, 19 items.)

…room for next moves, rather than only taking the current situation into consideration. This move reserves enough space around the red item for the following item, and this decision lets our method pack 3 more items when dealing with the same sequence.

Figure 18: Packing results of our BPP-1 model with 125 types of pre-defined items (RS: 0.726 / 24 items and 0.698 / 34 items; CUT-1: 0.830 / 27 items and 0.762 / 56 items; CUT-2: 0.884 / 41 items and 0.772 / 62 items).
Generalizability with unseen items We also test our method with unseen items. In this experiment, we randomly choose 40 items from the pre-defined item set I to train an agent and test it on the complete I. All the items are generated with RS, and their test dimensions may not have been seen during training. The result is presented in Table 7. It shows that our method does demonstrate some generalizability and provides a reasonable benchmark.

Generalizability with untrained sequences Since our RS, CUT-1 and CUT-2 benchmarks are constructed from different types of sequences, we can also evaluate the performance of our method on sequence types different from the training data. The results are presented in Table 8; our method still performs well when tested on varied sequences. Note that our model trained on CUT-2 attains the best generalization, since this benchmark has the best balance between variation and completeness.

Table 8: Evaluation of generalizability with untrained sequences.

Test     Train    Space utilization   # Packed items
RS       RS       50.5%               12.2
RS       CUT-1    47.6%               10.7
RS       CUT-2    43.4%               11.6
CUT-1    RS       60.8%               15.7
CUT-1    CUT-1    73.4%               19.1
CUT-1    CUT-2    69.4%               17.9
CUT-2    RS       60.9%               16.1
CUT-2    CUT-1    62.4%               16.6
CUT-2    CUT-2    66.9%               17.5

…l ≤ L/2, w ≤ W/2, and h ≤ H/2 to ensure the complexity of BPP, which also means more small items are added (one of the item's axes must be 1). The results can be seen in Table 9 and Figure 18.
Different DRL frameworks We also test our method with different DRL frameworks on the CUT-2 dataset with well-tuned parameters. For on-policy methods, we evaluated A2C (Mnih et al. 2016) and ACKTR (Wu et al. 2017); for off-policy methods, we evaluated DQN (Mnih et al. 2015), RAINBOW (?), and SAC (Haarnoja et al. 2018). Figure 17 and Table 10 demonstrate that ACKTR achieves the fastest convergence and the best performance.
Algorithm fragment (MCTS best-child selection and backup):

42  return argmax_{v′ ∈ children(v)} ( v′.Q + c · v′.N / (1 + v′.N) );
43  Function BACKUP(v, r):
44      while v is not null do
45          v.N ← v.N + 1;
46          v.Q ← max(v.Q, r);
47          v ← parent of v;

Algorithm fragment (sequence generation):

22  while Lvalid ≠ ∅ do
23      Randomly pop an item_i from Lvalid satisfying lz_i = Hn(item_i);
24      Add item_i into S;
25      Hn(item_i) ← Hn(item_i) + h_i;
26  return S;
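A small Python rendering of the MCTS fragment above (our own sketch; the node fields mirror the pseudocode, and the max-reward backup follows the supplemental's description of using the max rather than the mean return):

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.N = 0                # visit count
        self.Q = float("-inf")    # best (max) return seen through this node

def best_child(v, c):
    """Pick the child maximizing Q plus an exploration bonus (as in line 42)."""
    return max(v.children, key=lambda u: u.Q + c * u.N / (1 + u.N))

def backup(v, ret):
    """Propagate a simulated return up the tree, keeping the max reward."""
    while v is not None:
        v.N += 1
        v.Q = max(v.Q, ret)
        v = v.parent
```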