CvxNet: Learnable Convex Decomposition
Abstract
sentations are key to capturing 3D data whose topology is unknown [47, 46], while part-based models provide a natural decomposition of an object into its semantic components. This creates a representation useful to reason about extent, support, mass, contact, ... quantities that are key to describing the scene, and eventually to designing action plans [29, 28].

Contributions. In this paper, we propose a novel representation for geometry based on primitive decomposition. The representation is parsimonious, as we approximate geometry via a small number of convex elements, while we seek to allow this low-dimensional representation to be automatically inferred from data – without any human supervision. More specifically, inspired by recent works [67, 21, 43], we train our pipeline in a self-supervised manner: predicting the primitive configuration as well as their parameters by checking whether the reconstructed geometry matches the geometry of the target. We note how we inherit a number of interesting properties from several of the aforementioned representations. As it is part-based, it is naturally locally supported, and by training on a shape collection, parts acquire a semantic association (i.e. the same element is used to represent the backs of chairs). Although part-based, each part is not restricted to belong to the class of boxes [67], ellipsoids [21], or sphere-meshes [66], but to the more general class of convexes. As a convex is defined by a collection of half-space constraints, it can be simultaneously decoded into an explicit (polygonal mesh) as well as an implicit (indicator function) representation. Because our encoder decomposes geometry into convexes, it is immediately usable in any application requiring real-time physics simulation, as collision resolution between convexes is efficiently decided by GJK [23] (Figure 1). Finally, parts can interact via structuring [21] to generate smooth blending between parts.

Recently, meshes have also been considered as the output of a network [26, 32, 70]. A key weakness of these models is the fact that they may produce self-intersecting meshes. Another natural high-dimensional representation that has garnered some traction in vision is the point cloud representation. Point clouds are the natural representation of objects if one is using sensors such as depth cameras or LiDAR, and they require far less memory than voxels. Qi et al. [52, 54] used point clouds as a representation for discriminative deep learning tasks. Hoppe et al. [30] used point clouds for surface mesh reconstruction (see also [3] for a survey of other techniques). Fan et al. [19] and Lin et al. [37] used point clouds for 3D reconstruction using deep learning. However, these approaches require additional non-trivial post-processing steps to generate the final 3D mesh.

Primitives. Far more common is to approximate the input shape by a set of volumetric primitives. From this perspective, representing shapes as voxels is a special case, where the primitives are unit cubes in a lattice. Another fundamental way to describe 3D shapes is via Constructive Solid Geometry [33]. Sharma et al. [59] present a model that outputs a program (i.e. a set of Boolean operations on shape primitives) that generates the input image or shape. In general, this is a fairly difficult task. Some of the classical primitives used in graphics and computer vision are blocks world [58], generalized cylinders [5], geons [4], and even Lego pieces [69]. In [67], a deep CNN is used to interpret a shape as a union of simple rectangular prisms. They also note that their model provides a consistent parsing across shapes (i.e. the head is captured by the same primitive), allowing some interpretability of the output. In [49], cuboids were extended to superquadrics, showing that the extra flexibility results in much lower error and qualitatively better looking results.
Figure 2. An example of how a collection of hyperplane parameters for an image specifies the indicator function of a convex object. The soft-max allows gradients to propagate through all hyperplanes and allows for the generation of smooth convexes, while the sigmoid parameter controls the slope of the transition in the generated indicator. Note that our soft-max function is a LogSumExp.
representing shapes is to describe them as a collection of convex objects. Several methods for convex decomposition of meshes have been proposed [25, 51]. In machine learning, however, we only find some initial attempts to approach convex hull computation via neural networks [34]. Splitting meshes into exactly convex parts generally produces too many pieces [13]. As such, it is more prudent to seek a small number of approximately convex objects that generate the input shape [22, 36, 38, 40, 39]. Recently, [65] also extended convex decomposition to the spatio-temporal domain, by considering moving geometry. Our method is most related to [67] and [21], in that we train an occupancy function for the input shapes. However, we choose our space of functions so that their level sets are approximately convex, and use these as building blocks.

3.1. Differentiable convex indicator – Figure 2

We define a decoder that, given a collection of (unordered) half-space constraints, constructs the indicator function of a single convex object; such a function can be evaluated at any point x ∈ R³. We define H_h(x) = n_h · x + d_h as the signed distance of the point x from the h-th plane with normal n_h and offset d_h. Given a sufficiently large number H of half-planes, the signed distance function of any convex object can be approximated by taking the intersection (max operator) of the signed distance functions of the planes. To facilitate gradient learning, instead of the maximum we use the smooth maximum function LogSumExp, and define the approximate signed distance function Φ(x):

    Φ(x) = LogSumExp{δ H_h(x)},    (1)
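To make the construction concrete, here is a minimal NumPy sketch (our illustration, not the authors' released code) that evaluates Eq. (1) and converts it to a smooth indicator via the sigmoid of Eq. (8); the values of δ and σ are placeholders for the smoothness hyper-parameters.

```python
import numpy as np

def soft_convex_indicator(x, normals, offsets, delta=10.0, sigma=75.0):
    """Smooth indicator of one convex given hyperplanes H_h(x) = n_h.x + d_h.

    x:       (N, 3) query points (in the convex's local frame)
    normals: (H, 3) unit normals n_h
    offsets: (H,)   offsets d_h
    """
    h = x @ normals.T + offsets            # (N, H) signed distances H_h(x)
    z = delta * h
    m = z.max(axis=1, keepdims=True)       # numerically stable LogSumExp
    phi = (m + np.log(np.exp(z - m).sum(axis=1, keepdims=True))).ravel()
    # Eq. (8): Phi < 0 inside the convex, so the sigmoid maps inside to ~1.
    return 1.0 / (1.0 + np.exp(sigma * phi))
```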
Figure 3. Convex auto-encoder – The encoder E creates a low-dimensional latent vector representation λ, decoded into a collection of hyperplanes by the decoder D. The training loss involves reconstructing the value of the input image at random pixels x.
to discover some form of correlation between their parameters. Towards this goal, we employ the bottleneck auto-encoder architecture illustrated in Figure 3. Given an input, the encoder E derives a latent representation λ from the input. Then, a decoder D derives the collection of hyperplane parameters. While in theory permuting the H hyperplanes generates the same convex, the decoder D correlates a particular hyperplane with a corresponding orientation. This is visible in Figure 4, where we color-code different 2D hyperplanes and indicate their orientation distribution in a simple 2D auto-encoding task for a collection of axis-aligned ellipsoids. As ellipsoids and oriented cuboids are convexes, we argue that the architecture in Figure 3 allows us to generalize the core geometric primitives proposed in VP [67] and SIF [21]; we verify this claim in Figure 5.

3.3. Multi convex decomposition – Figure 6

Having a learnable pipeline for a single convex object, we can now generalize the expressivity of our model by representing generic non-convex objects as compositions of convex elements [65]. To achieve this task, an encoder E outputs a low-dimensional latent representation λ of all K convexes that D decodes into a collection of K parameter tuples. Each tuple (indexed by k) is comprised of a shape code β_k and a corresponding transformation T_k(x) = x + c_k that maps points from world coordinates to local coordinates; c_k is the predicted translation vector (Figure 6).

3.4. Training losses

First and foremost, we want the (ground truth) indicator function of our object O to be well approximated:

    L_approx(ω) = E_{x∼R³} ‖Ô(x) − O(x)‖²,    (3)

where Ô(x) = max_k{C_k(x)} and C_k(x) = C(T_k(x) | β_k). The application of the max operator produces a perfect union of indicator functions. While constructive solid geometry typically applies the min operator to compute the union of signed distance functions, note that we apply the max operator to indicator functions instead, with the same effect; see Appendix A for more details. We couple the approximation loss with a small set of auxiliary losses that enforce the desired properties of our decomposition.
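Continuing the NumPy sketch above (and reusing soft_convex_indicator from it), the union Ô and a Monte-Carlo estimate of Eq. (3) could look as follows; the (normals, offsets, translation) tuple packing is our own illustrative choice, not the authors' interface:

```python
import numpy as np

def union_indicator(x, params, delta=10.0, sigma=75.0):
    """O_hat(x) = max_k C_k(T_k(x)), with T_k(x) = x + c_k.
    params: list of (normals_k, offsets_k, c_k) tuples, one per convex."""
    per_convex = [soft_convex_indicator(x + c, n, d, delta, sigma)
                  for (n, d, c) in params]
    return np.stack(per_convex, axis=0).max(axis=0)

def approx_loss(x, labels, params):
    """Monte-Carlo estimate of Eq. (3); labels are O(x) in {0, 1}."""
    return np.mean((union_indicator(x, params) - labels) ** 2)
```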
Figure 6. Multi-convex auto-encoder – Our network approximates input geometry as a composition of convex elements. Note that this network does not prescribe how the final image is generated, but merely outputs the shape {β_k} and pose {T_k} parameters of the abstraction. Note that this is an illustration where the parameters {β_k}, {T_k} have been directly optimized via SGD with a preset δ.
In the supplementary material, we prove that minimizing L_unique leads to a unique solution and centers the convex body at the origin. This loss further ensures that “inactive” hyperplane constraints can be readily re-activated during learning: because they fit tightly around the surface, they are sensitive to shape changes.

Guidance loss (auxiliary). As we will describe in Section 3.5, we use offline sampling to speed up training. However, this can cause severe issues. In particular, when a convex “falls within the cracks” of sampling (i.e. ∄x | C(x) > 0.5), it can be effectively removed from the learning process. This can easily happen when the convex enters a degenerate state (i.e. d_h = 0 ∀h). Unfortunately, these degenerate configurations are encouraged by the loss (5). We can prevent collapses by ensuring that each convex represents a certain amount of information (i.e. samples):

    L_guide(ω) = (1/K) Σ_k (1/N) Σ_{x ∈ N_k^N} ‖C_k(x) − O(x)‖²,    (6)
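A hedged NumPy sketch of Eq. (6) follows. We read N_k^N as the N offline samples nearest to the k-th convex (whose center in world coordinates is −c_k, since T_k(x) = x + c_k); this reading is an assumption consistent with Figure 7 (right). The function reuses soft_convex_indicator from the earlier sketch.

```python
import numpy as np

def guidance_loss(samples, labels, params, n_nearest=64):
    """Sketch of Eq. (6). samples: (M, 3) offline samples, labels: (M,) = O(x).
    params: list of (normals_k, offsets_k, c_k) tuples.
    N_k^N is read as the n_nearest samples closest to convex k (assumption)."""
    losses = []
    for (n, d, c) in params:
        # T_k(x) = x + c, so the convex's local origin sits at world point -c.
        idx = np.argsort(np.linalg.norm(samples + c, axis=1))[:n_nearest]
        c_k = soft_convex_indicator(samples[idx] + c, n, d)
        losses.append(np.mean((c_k - labels[idx]) ** 2))
    return np.mean(losses)
```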
Observations. Note that we supervise the indicator function C rather than Φ, as the latter does not represent the signed distance function of a convex (e.g. ‖∇Φ(x)‖ ≠ 1). Also note how the loss in (4) is reminiscent of SIF [21, Eq. 1], where the overall surface is modeled as a sum of meta-ball implicit functions [7] – which the authors call “structuring”. The core difference lies in the fact that SIF [21] models the surface of the object ∂O as an iso-level of the function post structuring – therefore, in most cases, the iso-surfaces of the individual primitives do not approximate the target surface, resulting in a slight loss of interpretability in the generated representation.

Figure 7. Auxiliary losses – Our L_unique loss (left) prevents the existence of a null-space in the specification of convexes, and (middle) ensures inactive hyperplanes can be easily activated during training. (right) Our L_guide moves convexes towards the representation of samples drawn from within the object x ∈ O.

3.5. Implementation details

To increase training speed, we sample a set of points on the ground-truth shape offline, precompute the ground truth quantities, and then randomly sub-sample from this set during our training loop. For volumetric samples, we use the samples from OccNet [43], while for surface samples we employ the “near-surface” sampling described in SIF [21]. Following SIF [21], we also tune down L_approx of “near-surface” samples by 0.1. We draw 100k random samples from the bounding box of O and 100k samples from each side
of ∂O to construct the point samples and labels. At training time we use a sub-sampled set of 1024 points from each sample source. Although Mescheder et al. [43] claim that uniform volumetric samples are more effective than surface samples, we find that balancing these two strategies yields the best performance – this can be attributed to the complementary effect of the losses in (3) and (4).

Architecture details. In all our experiments, we use the same architecture while varying the number of convexes and hyperplanes. For the {Depth}-to-3D task, we use 50 convexes, each with 50 hyperplanes. For the RGB-to-3D task, we use 50 convexes, each with 25 hyperplanes. Similar to OccNet [43], we use ResNet18 as the encoder E for both the {Depth}-to-3D and the RGB-to-3D experiments. A fully connected layer then generates the latent code λ ∈ R²⁵⁶ that is provided as input to the decoder D. For the decoder D we use a sequential model with four hidden layers with (1024, 1024, 2048, |H|) units respectively. The output dimension is |H| = K(4 + 3H), where for each of the K elements we specify a translation (3 DOFs) and a smoothness (1 DOF), and each hyperplane is specified by its (unit) normal and offset from the origin (3H DOFs). In all our experiments, we use a batch size of 32 and train with Adam with a learning rate of 1e−4, β₁ = 0.9 and β₂ = 0.999. As determined by grid-search on the validation set, we set the weights for our losses to {L_approx: 1.0, L_decomp: 0.1, L_unique: 0.001, L_guide: 0.01, L_loc: 1.0} and σ = 75.
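As a sanity check of this bookkeeping, here is a sketch (ours, not the authors' code) of how a flat decoder output of size K(4 + 3H) could be split into per-convex parameters; packing each hyperplane's unit normal and offset into a single unscaled 3-vector is one plausible reading of the 3H DOFs, labeled here as an assumption.

```python
import numpy as np

def unpack_decoder_output(raw, K=50, H=50):
    """Split a flat decoder output of size K * (4 + 3H) into parameters."""
    per_k = raw.reshape(K, 4 + 3 * H)
    c = per_k[:, :3]                          # (K, 3) translations (3 DOFs)
    delta = per_k[:, 3]                       # (K,)  smoothness (1 DOF)
    planes = per_k[:, 4:].reshape(K, H, 3)    # 3 numbers per hyperplane
    # Assumption: the direction encodes the unit normal, the magnitude the offset.
    d = np.linalg.norm(planes, axis=-1)       # (K, H) offsets d_h
    n = planes / np.maximum(d[..., None], 1e-9)  # (K, H, 3) unit normals n_h
    return c, delta, n, d

c, delta, n, d = unpack_decoder_output(np.random.randn(50 * (4 + 3 * 50)))
```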
4. Experiments

We use the ShapeNet [12] dataset in our experiments. We use the same voxelization, renderings, and data split as in Choy et al. [16]. Moreover, we use the same multi-view depth renderings as [21] for our {Depth}-to-3D experiments, where we render each example from cameras placed on the vertices of a dodecahedron. Note that this problem is harder than 3D auto-encoding with point cloud input as proposed by OccNet [43], and resembles more closely the single-view reconstruction problem. At training time we need ground truth inside/outside labels, so we employ the watertight meshes from [43] – this also ensures a fair comparison to this method. For the quantitative evaluation of semantic decomposition, we use labels from PartNet [44] and exploit its overlap with ShapeNet.

Methods. We quantitatively compare our method to a number of self-supervised algorithms with different characteristics. First, we consider VP [67], which learns a parsimonious approximation of the input via (the union of) oriented boxes. We also compare to the Structured Implicit Function SIF [21] method, which represents solid geometry as an iso-level of a sum of weighted Gaussians; like VP [67], and in contrast to OccNet [43], this method provides an interpretable encoding of geometry. Finally, from the class of techniques that directly learn non-interpretable representations of implicit functions, we select OccNet [43], P2M [70], and AtlasNet [26]; in contrast to the previous methods, these solutions do not provide any form of shape decomposition. As OccNet [43] only reports results on RGB-to-3D tasks, we extend the original codebase to also solve {Depth}-to-3D tasks. We follow the same data pre-processing used by SIF [21].

Metrics. With Ô and ∂Ô we respectively indicate the indicator and the surface of the union of our primitives. We then use three quantitative metrics to evaluate the performance of 3D reconstruction: (1) the Volumetric IoU; note that with 100K uniform samples to estimate this metric, our estimation is more accurate than the 32³ voxel grid estimation used by [16]; (2) the Chamfer-L1 distance, a smooth relaxation of the symmetric Hausdorff distance, measuring the average between reconstruction accuracy E_{o∼∂O}[min_{ô∈∂Ô} ‖o − ô‖] and completeness E_{ô∼∂Ô}[min_{o∈∂O} ‖ô − o‖] [18]; (3) following the arguments presented in [64], we also employ the F-score to quantitatively assess performance; it can be understood as “the percentage of correctly reconstructed surface”.
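For concreteness, a brute-force NumPy sketch of the Chamfer-L1 and F-score computations on sampled surface points; the threshold τ below is a placeholder, not the value used in the paper:

```python
import numpy as np

def chamfer_l1(gt_pts, rec_pts):
    """Mean of accuracy (rec -> gt) and completeness (gt -> rec) distances."""
    d = np.linalg.norm(gt_pts[:, None, :] - rec_pts[None, :, :], axis=-1)
    return 0.5 * (d.min(axis=0).mean() + d.min(axis=1).mean())

def f_score(gt_pts, rec_pts, tau=0.01):
    """Harmonic mean of precision/recall of 'correctly reconstructed' points."""
    d = np.linalg.norm(gt_pts[:, None, :] - rec_pts[None, :, :], axis=-1)
    precision = (d.min(axis=0) < tau).mean()  # reconstructed points near GT
    recall = (d.min(axis=1) < tau).mean()     # GT points near reconstruction
    return 2 * precision * recall / max(precision + recall, 1e-9)
```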
4.1. Abstraction – Figures 8, 9, 10

As our convex decomposition is learnt on a shape collection, the convex elements produced by our decoder are in natural correspondence – e.g. we expect the same k-th convex to represent the leg of a chair across the chairs dataset. We analyze this quantitatively on the PartNet dataset. We do so by verifying whether the k-th component is consistently mapped to the same PartNet part label; see Figure 8 (left) for the distribution of PartNet labels within each component. We can then assign the most commonly associated label to a given convex to segment the PartNet point cloud, achieving a relatively high accuracy; see Figure 8 (right). This reveals how our latent representation captures the semantic structure in the dataset. We also evaluate our shape

                 Accuracy
Part    Cvx       BAE       BAE*
back    91.50%    86.36%    91.81%
arm     38.94%    65.75%    71.32%
base    71.95%    88.46%    91.75%
seat    90.63%    73.66%    92.91%

Figure 8. (left) The distribution of PartNet labels within each convex ID (4 out of 50). (right) The binary classification accuracy for each semantic part when using the convex ID to label each point. Cvx represents CvxNet. BAE [14] is a baseline for unsupervised part segmentation. BAE* is the supervised version of BAE.
Figure 9. Analysis of accuracy vs. # primitives – (left) The ground truth object to be reconstructed and the single shape-abstraction generated by VP [67]. (middle) Quantitative evaluation (ShapeNet/Multi) of abstraction performance with an increasing number of primitives – the closer the curve is to the top-left, the better. (right) A qualitative visualization of the primitives and corresponding reconstructions.
Figure 12. ShapeNet/Multi – Qualitative comparisons to SIF [21], AtlasNet [26], OccNet [43], VP [67] and SQ [49]; methods receive RGB input, except that VP uses a voxelized input and SQ a point-cloud input. (*Note that the OccNet [43] results are post-processed with smoothing.)
Table 1. Reconstruction performance on ShapeNet/Multi – We evaluate our method against P2M [70], AtlasNet [26], OccNet [43] and SIF [21]. We provide as input either (left) a collection of depth maps or (right) a single color image. For AtlasNet [26], note that IoU cannot be measured as the meshes are not watertight. We omit VP [67], as it only produces a very rough shape decomposition.
in terms of reconstruction accuracy and inference/training time. Moreover, we quantitatively demonstrate that using signed distance as supervision for L_approx produces significantly worse results, and that, at the cost of slightly worse performance, we can collapse L_guide and L_loc into one. Finally, we perform an ablation study with respect to our losses, and verify that each is beneficial towards effective learning.

permutation-invariant encoders [62], or remove encoders altogether in favor of auto-decoder architectures [48].

A. Union of Smooth Indicator Functions

We define the smooth indicator function for the k-th object:

    C_k(x) = Sigmoid_σ(−Φ_δ^k(x)),    (8)
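To see why taking the max over indicator functions yields the same union as taking the min over signed distance functions, consider this small 1D numerical check (our sketch; the intervals and the value of σ are arbitrary):

```python
import numpy as np

# Two 1-D "convexes": intervals [-2, -0.5] and [0.5, 2], as signed distances.
sdf_a = lambda x: np.maximum(x + 0.5, -2 - x)   # <= 0 inside [-2, -0.5]
sdf_b = lambda x: np.maximum(x - 2, 0.5 - x)    # <= 0 inside [0.5, 2]

sigma = 50.0
ind = lambda sdf, x: 1.0 / (1.0 + np.exp(sigma * sdf(x)))  # smooth indicator

x = np.linspace(-3, 3, 13)
union_sdf = np.minimum(sdf_a(x), sdf_b(x))             # CSG union on SDFs: min
union_ind = np.maximum(ind(sdf_a, x), ind(sdf_b, x))   # our union: max

# Both agree on membership: inside <=> sdf < 0 <=> indicator > 0.5.
assert np.array_equal(union_sdf < 0, union_ind > 0.5)
```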
References

[1] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, and Leonidas Guibas. Learning representations and generative models for 3d point clouds. In International Conference on Machine Learning, pages 40–49, 2018.
[2] Baptiste Angles, Marco Tarini, Loic Barthe, Brian Wyvill, and Andrea Tagliasacchi. Sketch-based implicit blending. ACM Transaction on Graphics (Proc. SIGGRAPH Asia), 2017.
[3] Matthew Berger, Andrea Tagliasacchi, Lee M Seversky, Pierre Alliez, Gael Guennebaud, Joshua A Levine, Andrei Sharf, and Claudio T Silva. A survey of surface reconstruction from point clouds. In Computer Graphics Forum, volume 36, pages 301–329. Wiley Online Library, 2017.
[4] Irving Biederman. Recognition-by-components: a theory of human image understanding. Psychological review, 1987.
[5] Thomas Binford. Visual perception by computer. In IEEE Conference of Systems and Control, 1971.
[6] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3D faces. In ACM Trans. on Graphics (Proc. of SIGGRAPH), 1999.
[7] James F Blinn. A generalization of algebraic surface drawing. ACM Trans. on Graphics (TOG), 1(3):235–256, 1982.
[8] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it smpl: Automatic estimation of 3d human pose and shape from a single image. In Proc. of the European Conf. on Comp. Vision, 2016.
[9] Mario Botsch, Leif Kobbelt, Mark Pauly, Pierre Alliez, and Bruno Lévy. Polygon mesh processing. AK Peters/CRC Press, 2010.
[10] Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Generative and discriminative voxel modeling with convolutional neural networks. arXiv preprint arXiv:1608.04236, 2016.
[11] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
[12] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[13] Bernard M Chazelle. Convex decompositions of polyhedra. In Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 70–79. ACM, 1981.
[14] Zhiqin Chen, Kangxue Yin, Matthew Fisher, Siddhartha Chaudhuri, and Hao Zhang. Bae-net: Branched autoencoder for shape co-segmentation. Proceedings of International Conference on Computer Vision (ICCV), 2019.
[15] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. Proc. of Comp. Vision and Pattern Recognition (CVPR), 2019.
[16] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In Proc. of the European Conf. on Comp. Vision. Springer, 2016.
[17] Erwin Coumans and Yunfei Bai. PyBullet, a python module for physics simulation for games, robotics and machine learning. pybullet.org, 2016–2019.
[18] Siyan Dong, Matthias Niessner, Andrea Tagliasacchi, and Kevin Kai Xu. Multi-robot collaborative dense scene reconstruction. ACM Trans. on Graphics (Proc. of SIGGRAPH), 2019.
[19] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 605–613, 2017.
[20] Matheus Gadelha, Subhransu Maji, and Rui Wang. 3d shape induction from 2d views of multiple objects. In International Conference on 3D Vision (3DV), 2017.
[21] Kyle Genova, Forrester Cole, Daniel Vlasic, Aaron Sarna, William T Freeman, and Thomas Funkhouser. Learning shape templates with structured implicit functions. arXiv preprint arXiv:1904.06447, 2019.
[22] Mukulika Ghosh, Nancy M Amato, Yanyan Lu, and Jyh-Ming Lien. Fast approximate convex decomposition using relative concavity. Computer-Aided Design, 45(2):494–504, 2013.
[23] Elmer G Gilbert, Daniel W Johnson, and S Sathiya Keerthi. A fast procedure for computing the distance between complex objects in three-dimensional space. IEEE Journal on Robotics and Automation, 1988.
[24] Rohit Girdhar, David F Fouhey, Mikel Rodriguez, and Abhinav Gupta. Learning a predictable and generative vector representation for objects. In Proc. of the European Conf. on Comp. Vision, pages 484–499. Springer, 2016.
[25] Ronald L. Graham. An efficient algorithm for determining the convex hull of a finite planar set. Info. Pro. Lett., 1:132–133, 1972.
[26] Thibault Groueix, Matthew Fisher, Vladimir G Kim, Bryan C Russell, and Mathieu Aubry. A papier-mache approach to learning 3d surface generation. In Proc. of Comp. Vision and Pattern Recognition (CVPR), 2018.
[27] Kan Guo, Dongqing Zou, and Xiaowu Chen. 3d mesh labeling via deep convolutional neural networks. ACM Transactions on Graphics (TOG), 35(1):3, 2015.
[28] Eric Heiden, David Millard, and Gaurav Sukhatme. Real2sim transfer using differentiable physics. Workshop on Closing the Reality Gap in Sim2real Transfer for Robotic Manipulation, 2019.
[29] Eric Heiden, David Millard, Hejia Zhang, and Gaurav S Sukhatme. Interactive differentiable simulation. arXiv preprint arXiv:1905.10706, 2019.
[30] Hugues Hoppe, Tony DeRose, Tom Duchamp, John McDonald, and Werner Stuetzle. Surface reconstruction from unorganized points. In ACM SIGGRAPH Computer Graphics, volume 26, pages 71–78. ACM, 1992.
[31] Angjoo Kanazawa, Shubham Tulsiani, Alexei A Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. In Proc. of the European Conf. on Comp. Vision, 2018.
[32] Chen Kong, Chen-Hsuan Lin, and Simon Lucey. Using locally corresponding cad models for dense 3d reconstructions from a single image. In Proc. of Comp. Vision and Pattern Recognition (CVPR), pages 4857–4865, 2017.
[33] David H Laidlaw, W Benjamin Trumbore, and John F Hughes. Constructive solid geometry for polyhedral objects. In ACM Trans. on Graphics (Proc. of SIGGRAPH), 1986.
[34] Yee Leung, Jiang-She Zhang, and Zong-Ben Xu. Neural networks for convex hull computation. IEEE Transactions on Neural Networks, 8(3):601–611, 1997.
[35] Yiyi Liao, Simon Donne, and Andreas Geiger. Deep marching cubes: Learning explicit surface representations. In Proc. of Comp. Vision and Pattern Recognition (CVPR), 2018.
[36] Jyh-Ming Lien and Nancy M Amato. Approximate convex decomposition of polyhedra. In Proceedings of the 2007 ACM symposium on Solid and physical modeling, pages 121–131. ACM, 2007.
[37] Chen-Hsuan Lin, Chen Kong, and Simon Lucey. Learning efficient point cloud generation for dense 3d object reconstruction. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[38] Guilin Liu, Zhonghua Xi, and Jyh-Ming Lien. Nearly convex segmentation of polyhedra through convex ridge separation. Computer-Aided Design, 78:137–146, 2016.
[39] Khaled Mamou and Faouzi Ghorbel. A simple and efficient approach for 3d mesh approximate convex decomposition. In 2009 16th IEEE international conference on image processing (ICIP), pages 3501–3504. IEEE, 2009.
[40] Khaled Mamou, E Lengyel, and Ed AK Peters. Volumetric hierarchical approximate convex decomposition. Game Engine Gems 3, pages 141–158, 2016.
[41] Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In Proceedings of the IEEE international conference on computer vision workshops, pages 37–45, 2015.
[42] Daniel Maturana and Sebastian Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 922–928. IEEE, 2015.
[43] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. arXiv preprint arXiv:1812.03828, 2018.
[44] Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. Partnet: A large-scale benchmark for fine-grained and hierarchical part-level 3d object understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 909–918, 2019.
[45] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proc. of Comp. Vision and Pattern Recognition (CVPR), pages 5115–5124, 2017.
[46] Richard A Newcombe, Dieter Fox, and Steven M Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2015.
[47] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohi, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In Proc. ISMAR. IEEE, 2011.
[48] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deepsdf: Learning continuous signed distance functions for shape representation. arXiv preprint arXiv:1901.05103, 2019.
[49] Despoina Paschalidou, Ali Osman Ulusoy, and Andreas Geiger. Superquadrics revisited: Learning 3d shape parsing beyond cuboids. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
[50] Jason Patnode. Character Modeling with Maya and ZBrush: Professional polygonal modeling techniques. Focal Press, 2012.
[51] Franco P Preparata and Se June Hong. Convex hulls of finite sets of points in two and three dimensions. Communications of the ACM, 20(2):87–93, 1977.
[52] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[53] Charles R Qi, Hao Su, Matthias Nießner, Angela Dai, Mengyuan Yan, and Leonidas J Guibas. Volumetric and multi-view cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656, 2016.
[54] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
[55] Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J Black. Generating 3d faces using convolutional mesh autoencoders. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[56] Danilo Jimenez Rezende, SM Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, and Nicolas Heess. Unsupervised learning of 3d structure from images. In Advances in Neural Information Processing Systems, 2016.
[57] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. Octnet: Learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3577–3586, 2017.
[58] Lawrence G Roberts. Machine perception of three-dimensional solids. PhD thesis, Massachusetts Institute of Technology, 1963.
[59] Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kalogerakis, and Subhransu Maji. Csgnet: Neural shape parser for constructive solid geometry. In Proc. of Comp. Vision and Pattern Recognition (CVPR), 2018.
[60] Shuran Song and Jianxiong Xiao. Deep sliding shapes for amodal 3d object detection in rgb-d images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 808–816, 2016.
[61] David Stutz and Andreas Geiger. Learning 3d shape completion from laser scan data with weak supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1955–1964, 2018.
[62] Weiwei Sun, Wei Jiang, Eduard Trulls, Andrea Tagliasacchi, and Kwang Moo Yi. Attentive context normalization for robust permutation-equivariant learning. arXiv preprint arXiv:1907.02545, 2019.
[63] Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In Proceedings of the IEEE International Conference on Computer Vision, pages 2088–2096, 2017.
[64] Maxim Tatarchenko, Stephan R Richter, René Ranftl, Zhuwen Li, Vladlen Koltun, and Thomas Brox. What do single-view 3d reconstruction networks learn? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3405–3414, 2019.
[65] Daniel Thul, Sohyeon Jeong, Marc Pollefeys, et al. Approximate convex decomposition and transfer for animated meshes. In SIGGRAPH Asia 2018 Technical Papers, page 226. ACM, 2018.
[66] Anastasia Tkach, Mark Pauly, and Andrea Tagliasacchi. Sphere-meshes for real-time hand modeling and tracking. ACM Transaction on Graphics (Proc. SIGGRAPH Asia), 2016.
[67] Shubham Tulsiani, Hao Su, Leonidas J Guibas, Alexei A Efros, and Jitendra Malik. Learning shape abstractions by assembling volumetric primitives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[68] Ali Osman Ulusoy, Andreas Geiger, and Michael J Black. Towards probabilistic volumetric reconstruction using ray potentials. In International Conference on 3D Vision (3DV), 2015.
[69] Anton van den Hengel, Chris Russell, Anthony Dick, John Bastian, Daniel Pooley, Lachlan Fleming, and Lourdes Agapito. Part-based modelling of compound scenes from images. In Proc. of Comp. Vision and Pattern Recognition (CVPR), pages 878–886, 2015.
[70] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In Proc. of the European Conf. on Comp. Vision, 2018.
[71] Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-cnn: Octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG), 36(4):72, 2017.
[72] Peng-Shuai Wang, Chun-Yu Sun, Yang Liu, and Xin Tong. Adaptive o-cnn: a patch-based deep representation of 3d shapes. In SIGGRAPH Asia 2018 Technical Papers, page 217. ACM, 2018.
[73] Jiajun Wu, Chengkai Zhang, Tianfan Xue, Bill Freeman, and Josh Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in neural information processing systems, pages 82–90, 2016.
[74] Zhirong Wu, Shuran Song, Aditya Khosla, Fisher Yu, Linguang Zhang, Xiaoou Tang, and Jianxiong Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proc. of Comp. Vision and Pattern Recognition (CVPR), pages 1912–1920, 2015.
[75] Fenggen Yu, Kun Liu, Yan Zhang, Chenyang Zhu, and Kai Xu. Partnet: A recursive part decomposition network for fine-grained and hierarchical shape segmentation. Proc. of Comp. Vision and Pattern Recognition (CVPR), 2019.
[76] Chuhang Zou, Ersin Yumer, Jimei Yang, Duygu Ceylan, and Derek Hoiem. 3d-prnn: Generating shape primitives with recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, 2017.
Losses      IoU     Chamfer-L1   F-Score
Original    0.731   0.080        73.49%
Merged      0.720   0.087        71.62%
                           {Depth}-to-3D            RGB-to-3D
Class     # exemplars   F% Single   F% Multi    F% Single   F% Multi
table        5958         70.97       71.71       50.29       48.10
car          5248         77.11       77.75       51.34       47.33
chair        4746         62.21       65.39       38.12       38.49
plane        2832         83.68       84.68       75.19       68.16
sofa         2222         67.89       75.44       43.36       42.11
rifle        1661         82.73       83.63       69.62       63.74
lamp         1624         46.46       51.37       32.49       31.41
vessel       1359         65.71       70.77       48.44       45.88
bench        1272         68.20       77.68       59.27       54.64
speaker      1134         50.37       60.24       28.07       28.45
cabinet      1101         66.75       76.09       45.73       46.09
display       767         61.66       71.41       40.31       38.96
phone         737         84.93       89.28       63.58       66.09
mean         2359         68.36       73.50       49.68       47.65

Figure 16. Single-class vs. multi-class – It is interesting to note that training on single vs. multi class behaves differently from what one would expect (i.e. that overfitting to a single class is beneficial). Note how training in multi-class on the {Depth}-to-3D input improves the reconstruction performance across the entire benchmark. Conversely, with RGB-to-3D input, single-class training is beneficial in most cases. We explain this by the fact that RGB inputs have complex texture which is stable within each class but not easily transferable across classes. Contrarily, local geometries learned from {depth} are agnostic to classes.

Figure 13. CvxNet for physics simulation – A zoom-in of the simulation pipeline used to generate the example in Figure 1. The animated results can be found in the supplementary video. For each convex, the hyperplane equations are converted via a “geometric dual” to a collection of points; the convexes to be used in simulation are then computed as the convex hull of these points. Unused hyperplanes, e.g. H₅, become points inside the convex hull. Then we take the inverse dual to get the convex hull in the original space. This convex hull can be used by physics engines to simulate animations. The simulation is computed and rendered by Houdini, via its rigid-body world solver based on Bullet [17].
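The “geometric dual” of Figure 13 can be implemented with standard polar duality; the sketch below is our reading of that pipeline (assuming the origin lies strictly inside the convex, i.e. d_h < 0, and using SciPy's ConvexHull), not the authors' implementation:

```python
import numpy as np
from scipy.spatial import ConvexHull

def hyperplanes_to_vertices(normals, offsets):
    """Convert {x : n_h.x + d_h <= 0} to the vertices of the convex.

    Dual step: n.x <= -d  <=>  (n / -d).x <= 1, so each plane maps to the
    point p_h = n_h / -d_h; redundant (unused) planes land strictly inside
    the dual hull, matching Figure 13.
    """
    dual_pts = normals / -offsets[:, None]
    hull = ConvexHull(dual_pts)
    # Inverse dual: each facet a.x + b <= 0 of the dual hull (rows of
    # hull.equations are [a, b]) becomes the primal vertex a / -b.
    eqs = hull.equations
    return eqs[:, :3] / -eqs[:, 3:4]
```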
Figure 14. Depth vs. Color – A qualitative example illustrating the degradation in performance as we move from {depth} to a “weaker” RGB input. In the top row we illustrate a model where the frequency of the surface is low. In this case, both the {depth} input and the RGB input can approximate the shape relatively accurately. In contrast, the bottom row shows that when the ground truth surface has high frequency, the RGB results lose many details while the {depth} results remain accurate.

K \ H        12                    25                    50
12      0.709 / 0.139 / 44    0.712 / 0.127 / 59    0.714 / 0.124 / 87
25      0.717 / 0.100 / 60    0.720 / 0.099 / 92    0.721 / 0.096 / 153
50      0.724 / 0.088 / 92    0.730 / 0.083 / 156   0.731 / 0.080 / 280

Figure 17. Ablation on model complexity – We analyze how the number of hyperplanes H and the number of convexes K relate to mIoU / Chamfer-L1 / inference time (ms). The mIoU and Chamfer-L1 are measured on the test set of ShapeNet (multi-class) with multi-view depth input. We measure the inference time of a batch with 32 examples and 2048 points, which is equivalent to one forward propagation step at training time.
Metric      All     −L_decomp   −L_unique   −L_guide   −L_loc
Vol-IoU    0.567      0.558       0.545       0.551      0.558
Chamfer    0.245      0.308       0.313       0.335      0.618
F%         47.36      45.29       44.03       45.88      46.01

Figure 18. Ablation on losses – We test on ShapeNet multi with RGB input, where each column (from 2 to 5) removes (−) one loss term from the training. We observe that each loss improves the overall performance.

Supervision   IoU     Chamfer-L1   F-Score
Indicator    0.747      0.035      83.22%
SDF          0.650      0.045      73.08%

Figure 19. Ablation SDF training vs. indicator training – We study the difference between learning signed distance functions (i.e., replacing O(x) with the signed distance in (3) and removing the Sigmoid in (2)) and learning indicator functions. Note how performance degrades significantly when the model learns to predict signed distances.

|λ|   F-Score
 32    72.24%
 64    73.31%
128    73.14%
256    73.49%

Figure 20. Ablation latent size – We study the performance of our architecture as we vary the number of latent dimensions in Figure 3. Note how the reconstruction performance remains relatively stable as we vary this hyper-parameter.

B. Proof of auxiliary null-space loss

Consider a hyperplane H_h(x) = n_h · x + d_h, parametrized by the pair (n_h, d_h). Translating this hyperplane by x₀ gives us the hyperplane (n_h, d_h + n_h · x₀) = (n_h, H_h(x₀)). We would like to show that, given a collection of hyperplanes h = {(n_h, d_h)}, if we consider all possible translations of this collection, there is a unique one that minimizes

    L_unique = (1/H) Σ_h H_h(x)².

In fact, we can rewrite this sum as

    Σ_h (H_h(x))² = ‖(H_h(x))_h‖² = ‖(d_h + n_h · x)_h‖² = ‖d + N x‖²,

where d = (d_h)_h is the vector with entries d_h, and N is the matrix formed by the n_h as rows. However, it is well known that the above minimization has a unique solution, and the minimizing solution can be computed explicitly (by using the Moore–Penrose inverse of N, for instance).
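A quick numerical check of this claim (our sketch): the minimizer of ‖d + N x‖² is x* = −N⁺ d, where N⁺ is the Moore–Penrose pseudoinverse, and the gradient vanishes there.

```python
import numpy as np

rng = np.random.default_rng(0)
N = rng.normal(size=(8, 3))        # matrix with the 8 normals n_h as rows
d = rng.normal(size=8)             # offsets d_h

x_star = -np.linalg.pinv(N) @ d    # closed-form minimizer of ||d + N x||^2

grad = 2 * N.T @ (d + N @ x_star)  # gradient of the objective at x_star
assert np.allclose(grad, 0.0, atol=1e-9)
```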
We also note that L_unique has a nice geometric description: minimizing L_unique essentially “centers” the convex body at the origin. To see this we first prove the following:

Observation 1. Let H_h(x) = n_h · x + d_h be a hyperplane and let x₀ be any point. Then the distance between x₀ and the hyperplane H_h is given by

    |H_h(x₀)| / ‖n_h‖.    (14)
Also note that if we translate h by x₀, then we get the new set of hyperplanes h′ = {(n_h, d_h + n_h · x₀)} = {(n_h, H_h(x₀))}. Therefore,

    L_unique = (1/H) Σ_h H_h(0)².    (15)