Unsupervised classification of sensor readings — self vs. background vs. anomaly — with no explicit geometry or kinematics.
Deep Learning & Robotics Challenge 2018 · Team The Boring Panda — Daniel Plop & Giorgio Giannone · in collaboration with the argmax.ai group.
We were given a 7-DOF Panda robotic arm with 9 single-point lidars and an RGBD camera on the end-effector, controllable point-to-point via a Python API.
Goal: label every incoming sensor reading as self, background, or other — and do it with unsupervised / semi-supervised learning instead of hand-built geometric models.
The hypothesis: perception on a manipulator can be solved as a statistical problem, without explicitly modelling the robot's configuration or geometry. The payoff is a model that generalizes across environments and quantifies its own uncertainty, which is exactly what you want when the world is dynamic and labels are unavailable.
| Module | Task | Headline metric |
|---|---|---|
| Anomaly detection | Is something new in the scene? | Threshold-free, calibrated uncertainty (predictive σ) |
| Clustering | Self vs. background | 89.8% global accuracy · 69.4% recall on self (3,000 labeled lidar points) |
| Collision detection | Anomaly vs. likely collision | F1 0.61 avg · up to 0.96 per lidar (2,000 points) |
No manual labeling in the loop, no per-sensor heuristics; every decision results from a likelihood.
Classical robotics solves well-defined tasks beautifully with geometry and control — no learning required. But once the task is loosely specified, the agent isn't perfectly known, and the environment keeps changing, those assumptions break.
| Geometry-first | Learning-first |
|---|---|
| Build an exact system model, add ML for narrow tasks. | Build a statistical model, inject geometric/task priors on top. |
Supervised learning is the wrong fit here too: labeling thousands of sensor points per second is impractical and high-variance, discriminative models assume that train and test data share a distribution (whereas domain shift is what happens in practice in robotics), and they output point estimates instead of uncertainty. So we opted for unsupervised learning.
Logging lidar over a fixed trajectory immediately revealed a clean multi-modal structure — the signature of distinct physical regimes (self / background / moving object).
| Per-lidar temporal traces (mm) with density estimates over one trajectory. |
A Gaussian Mixture Model on a tiny hand-crafted feature confirmed the idea. From a short window of a time series, we build a 2-D embedding with [max; std]. On 30 windows (10 points each), the GMM clustered 70% correctly, and the misses were dynamic samples genuinely indistinguishable from static ones in this embedding. If a 2-D hand-crafted feature gets us this far, a learned representation should solve the full task.
| Ground truth: dynamic (red) vs. static (blue). | GMM clustering prediction. |
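
The [max; std] embedding described above can be sketched in a few lines. This is a minimal illustration, not the team's code: the function name `window_features`, the non-overlapping windowing, the use of the population standard deviation, and the two example traces are all assumptions made for the example.

```python
from statistics import pstdev


def window_features(readings, window=10):
    """Split a 1-D lidar trace (mm) into non-overlapping windows and
    return one (max, std) pair per window."""
    feats = []
    for start in range(0, len(readings) - window + 1, window):
        w = readings[start:start + window]
        # Population std is an assumption; the source only says "std".
        feats.append((max(w), pstdev(w)))
    return feats


# Hypothetical traces: a static stretch, then an object passing by.
static = [500.0] * 10
dynamic = [500.0, 480.0, 430.0, 390.0, 350.0, 340.0, 370.0, 420.0, 470.0, 500.0]
features = window_features(static + dynamic)
```

In this toy trace both windows share the same maximum and differ only in their standard deviation, which hints at why some dynamic windows can sit close to static ones in such a small embedding. The resulting pairs could then be passed to any two-component GMM implementation.
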
A hierarchical pipeline of three modules, each answering one question. Input: 9 lidar readings + 7 joint positions (variants also consume depth images).
| Anomaly → Clustering → Collision: a hierarchy of statistical decisions over raw sensor input. |
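
One way to read the hierarchy is as a simple routing function. The sketch below is our interpretation of the text, not a confirmed implementation: the function name `classify_reading`, the three callables, and the returned label strings are hypothetical stand-ins for the trained modules, and the exact routing order is an assumption.

```python
def classify_reading(reading, state, is_anomaly, cluster, is_collision):
    """Route one lidar reading through the three stages.

    is_anomaly, cluster and is_collision are hypothetical callables that
    stand in for the trained modules; each takes (reading, state).
    """
    if not is_anomaly(reading, state):
        # Normal readings are split into the two learned modes.
        return cluster(reading, state)  # expected: "self" or "background"
    # Anomalous readings are refined into generic anomaly vs. likely collision.
    if is_collision(reading, state):
        return "collision"
    return "anomaly"
```

Keeping each stage behind its own callable means any module can be retrained or swapped without touching the routing logic.
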
Notation.
Solve a proxy regression task: an MLP (two 256-unit tanh layers) predicts, per lidar, a mean and a standard deviation from the robot state, trained to minimize negative log-likelihood. When the model can't reconstruct a reading within its own confidence interval, that reading is an anomaly. The threshold is the learned predictive uncertainty — no heuristics.
| Lidar #3: injected anomalies fall outside the model's predicted range. |
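
The training objective and the decision rule can be illustrated without any network code. In this sketch, `mu` and `sigma` stand for the MLP's per-lidar outputs. The interval width `k` is an assumption of ours, since the source only says a reading must fall within the model's own confidence interval.

```python
import math


def gaussian_nll(x, mu, sigma):
    """Negative log-likelihood of reading x under N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)


def is_anomaly(x, mu, sigma, k=2.0):
    """Flag x when it falls outside mu +/- k * sigma.

    k is an assumed interval width, not a value taken from the source.
    """
    return abs(x - mu) > k * sigma


# Hypothetical prediction for one lidar: 500 mm with sigma = 20 mm.
inside = is_anomaly(530.0, 500.0, 20.0)   # within two sigma
outside = is_anomaly(560.0, 500.0, 20.0)  # beyond two sigma
```

Because `sigma` is predicted per reading, the same rule is strict where the model is confident and tolerant where it is not.
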
If a reading is normal, a selector network mimicking a GMM splits it into two learned modes. The likelihood is the same as in anomaly detection, now weighted by per-cluster mixture weights.
| Lidar #3, controlled experiment. Red lines mark the ground-truth bounds of self. |
→ 89.8% global accuracy, 69.4% recall on self (3,000 labeled points).
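
A mixture likelihood of this kind can be written as a weighted sum of Gaussian densities, with the cluster label taken from the most responsible component. The sketch below assumes that form. The function names, the example parameters, and the mapping of component 0 to self are hypothetical, and in the real model the weights, means, and standard deviations would come from the networks.

```python
import math


def log_normal(x, mu, sigma):
    """Log-density of x under N(mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def component_logs(x, weights, mus, sigmas):
    """log(weight_k) + log N(x; mu_k, sigma_k^2) per component (weights > 0)."""
    return [math.log(w) + log_normal(x, m, s) for w, m, s in zip(weights, mus, sigmas)]


def mixture_nll(x, weights, mus, sigmas):
    """Negative log-likelihood of x under the mixture, via log-sum-exp."""
    logs = component_logs(x, weights, mus, sigmas)
    top = max(logs)
    return -(top + math.log(sum(math.exp(l - top) for l in logs)))


def assign_cluster(x, weights, mus, sigmas):
    """Index of the component with the highest responsibility for x."""
    logs = component_logs(x, weights, mus, sigmas)
    return max(range(len(logs)), key=logs.__getitem__)


# Hypothetical modes: component 0 near the arm, component 1 far away.
params = ([0.5, 0.5], [100.0, 800.0], [30.0, 100.0])
near = assign_cluster(120.0, *params)
far = assign_cluster(750.0, *params)
```

With a single component of weight 1, `mixture_nll` reduces to the plain Gaussian negative log-likelihood, which is the sense in which the two modules share one objective.
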
Framed as multi-label binary classification: corrupt random columns of a 10-point window with Gaussian noise (~50 mm), label originals 0 and perturbations 1, and learn to separate a generic anomaly from a likely collision — independently per lidar.
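
The corruption step can be sketched as follows. This is an illustration under stated assumptions: the function name `corrupt_window`, the per-column corruption probability `p_corrupt`, and reading "~50 mm" as the noise standard deviation are our choices, not details confirmed by the source.

```python
import random


def corrupt_window(window, noise_mm=50.0, p_corrupt=0.3, rng=None):
    """Perturb random lidar columns of a window with Gaussian noise.

    window: list of rows (time steps), each a list of per-lidar readings in mm.
    Returns (corrupted_window, labels) where labels[j] is 1 if lidar column j
    was perturbed and 0 if it was left untouched.
    """
    rng = rng or random.Random()
    n_lidars = len(window[0])
    # One independent binary label per lidar (multi-label setup).
    labels = [1 if rng.random() < p_corrupt else 0 for _ in range(n_lidars)]
    corrupted = [
        [v + rng.gauss(0.0, noise_mm) if labels[j] else v for j, v in enumerate(row)]
        for row in window
    ]
    return corrupted, labels


# Hypothetical 10-point window over 9 lidars, all reading 500 mm.
window = [[500.0] * 9 for _ in range(10)]
corrupted, labels = corrupt_window(window, rng=random.Random(0))
```

The labels come for free from the corruption itself, so the classifier can be trained without any manual annotation.
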
Per-lidar results for the collision class (2,000 test points):
| metric | L0 | L1 | L2 | L3 | L4 | L5 | L6 | L7 | L8 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| N | 100 | 180 | 140 | 180 | 100 | 170 | 200 | 160 | 180 | 156.6 |
| Sensitivity | 0.96 | 0.33 | 0.22 | 0.56 | 0.84 | 0.97 | 0.22 | 0.20 | 0.57 | 0.54 |
| IoU | 0.92 | 0.32 | 0.20 | 0.50 | 0.73 | 0.91 | 0.19 | 0.17 | 0.50 | 0.49 |
| F1 | 0.96 | 0.49 | 0.33 | 0.67 | 0.84 | 0.95 | 0.32 | 0.29 | 0.66 | 0.61 |
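
For reference, the three metrics in the table can be computed from confusion counts of the collision class as shown below. The definitions are the standard ones, with IoU taken as the Jaccard index of the positive class, which is our reading of the table. The counts in the example are hypothetical and are not taken from the experiments.

```python
def collision_metrics(tp, fp, fn):
    """Sensitivity, IoU and F1 for the positive (collision) class.

    Assumes at least one true positive or false negative, so that no
    denominator is zero.
    """
    sensitivity = tp / (tp + fn)      # recall on collisions
    iou = tp / (tp + fp + fn)         # Jaccard index of the positive class
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    return sensitivity, iou, f1


# Hypothetical counts for a single lidar.
sensitivity, iou, f1 = collision_metrics(tp=96, fp=4, fn=4)
```

Under these definitions IoU and F1 are tied by IoU = F1 / (2 - F1), so the two rows rank the lidars identically and IoU is always the stricter number.
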
Worked: a single statistical pipeline senses the environment with no per-sensor heuristics, learns reusable structure, and reports calibrated uncertainty on every prediction.
Open: the unsupervised classifier doesn't yet generalize fully, and the representation isn't directly actionable — perceiving a dynamic scene is not the same as knowing how to act in it.
Next: latent-variable models for richer representations and sequence models for temporal dynamics, to make the perception layer robust enough to act on.