Map Extraction Paper
Abstract
We propose a new approach, named PolyMapper, to circumvent the conventional pixel-wise segmentation of (aerial) images and predict objects in a vector representation directly. PolyMapper directly extracts the topological map of a city from overhead images as collections of building footprints and road networks. In order to unify the shape representation for different types of objects, we also propose a novel sequentialization method that reformulates a graph structure as closed polygons. Experiments are conducted on both existing and self-collected large-scale datasets of several cities. Our empirical results demonstrate that our end-to-end learnable model is capable of drawing polygons of building footprints and road networks that very closely approximate the structure of existing online map services, in a fully automated manner. Quantitative and qualitative comparison to the state of the art also shows that our approach achieves good levels of performance. To the best of our knowledge, the automatic extraction of large-scale topological maps is a novel contribution in the remote sensing community that we believe will help develop models with more informed geometrical constraints.

Figure 1: PolyMapper result for Boston overlaid on top of the original aerial imagery. Buildings and roads are directly predicted as polygons. See additional results in Fig. 10.
1. Introduction

A fundamental research task in computer vision is pixel-accurate image segmentation, where steady progress has been measured with benchmark challenges such as [27, 12, 11]. The classical approach in this field consists of assigning a label to each image pixel describing what category it belongs to, thus yielding a labeled image as output. However, for many applications, this is not the final desired output from a user's point of view. In this paper, we will instead focus on applications that require a graph or polygon representation as output. Our interest is in developing a method that, from an input image, directly produces a polygon representation that describes geometric objects using a vector data structure. Motivated by the success of recent works [10, 8, 5, 1], we avoid explicit pixel-wise labeling altogether, and instead directly predict polygons from images in an end-to-end learnable approach.

Our research is inspired by the insight that for many applications, image segmentation is just an intermediate step of a more comprehensive workflow that aims at a higher-level, abstract, vectorized representation of the image content. A good example is automated map generation from aerial imagery, where existing research has mostly focused on aerial image segmentation, such as [9, 48, 50, 30, 20, 51, 31]. We make this application our core scenario because we have access to virtually unlimited data from OpenStreetMap (OSM) [17, 16, 14] and high-resolution RGB orthophotos from Google Maps.

Usually, a full mapping pipeline consists of converting an orthophoto to a semantically meaningful raster map (i.e., semantic segmentation), followed by further processing such as object shape refinement, vectorization, and map generalization techniques. Here, we turn this multi-step workflow into an end-to-end learnable deep learning architecture, PolyMapper, which outputs topological maps of buildings and roads directly, given aerial imagery as input.

Our approach performs object detection, instance segmentation, and vectorization within a unified approach that relies on modern CNN architectures and RNNs with convolutional long short-term memory (ConvLSTM) [45] modules. As illustrated in Fig. 5, the CNN takes as input a city tile and extracts keypoints and edge evidence of building footprints and road networks, which are fed sequentially to the multi-layer ConvLSTM modules. The latter produces a vector representation for each object in a given tile.
In the case of roads, we also propose an approach that reformulates the topology of roads (typically an undirected graph) as polygons by following a maze solving algorithm that guarantees the shape consistency (sequences) of different objects (see Sec. 3.3). Finally, the roads from different tiles are connected and combined with the buildings to form a complete city map. A PolyMapper result for the city of Boston is shown in Fig. 1, while the results for Chicago and Sunnyvale are illustrated in Fig. 10.

We validate our approach for the automated mapping of road networks and building footprints on existing publicly available datasets and the newly collected PolyMapper dataset. Experimental results (see Sec. 4) outperform or are on par with the state-of-the-art, per-pixel instance segmentation methods [18, 28], and recent research that proposes custom-tailored approaches for only one of the tasks, road network prediction [32, 4] or building footprint extraction [38]. Our approach has the significant advantage that it generalizes to both building and road delineation, and could potentially be extended to other objects.
2. Related work

Building segmentation from overhead data has been a core research interest for decades, and discussing all works is beyond the scope of this paper [19, 34, 20]. Before the comeback of deep learning, building footprints were often delineated with multi-step, bottom-up approaches and a combination of multi-spectral overhead imagery and airborne LiDAR, e.g., [46, 2]. A modern approach is [6], which applies a fully convolutional neural network to combine evidence from optical overhead imagery and a digital surface model to jointly reason about building footprints. Today, building footprint delineation from a single image is most often approached via semantic segmentation as part of a broader multi-class task, and many works exist, e.g., [40, 24, 30, 51, 20, 31]. Microsoft recently extracted all building footprints in the US from aerial images by, first, running semantic segmentation with a CNN and, second, refining footprints with a heuristic polygonization approach¹. A current benchmark challenge that aims at extracting building footprints is [38], which we use to evaluate the performance of our approach. Another large-scale dataset that includes both building footprints and road networks is SpaceNet [49]. All processing takes place in the Amazon Cloud on satellite images of lower resolution than the aerial images in this paper.

¹ We are not aware of any scientific publication of this work and thus refer the reader to the corresponding GitHub repository that describes the workflow and shares data.

Road network extraction in images goes back to (at least) [3], where road pixels were identified using several image processing operations at a local scale. Shortly afterwards, [13] was probably the first work to explicitly incorporate topology, by searching for long 1-dimensional structures. One of the most sophisticated methods of the pre-deep-learning era was introduced in [47, 23], who center their approach on marked point processes (MPP) that allow them to include elaborate priors on the connectivity and intersection geometry of roads. To the best of our knowledge, the first (non-convolutional) deep learning approach to road network extraction was proposed by [35, 36]. The authors train a deep belief network to detect image patches containing roads, and a second network repairs small network gaps at large scale. [53] propose to model the longevity and connectivity of road networks with a higher-order CRF, which is extended in [52] to sampling more flexible, road-like higher-order cliques through collections of shortest paths, and to also model buildings with higher-order cliques in [39]. [33] combine OSM and aerial images to augment maps with additional information, like the road width, using an MRF formulation, which scales to large regions and achieves good results at several locations worldwide. Two recent works apply deep learning to road center-line extraction in aerial images. DeepRoadMapper [32] introduces a hierarchical processing pipeline that first segments roads with CNNs, encodes end points of street segments as vertices in a graph connected with edges, thins output segments to road center-lines, and repairs gaps with an augmented road graph. RoadTracer [4] uses an iterative search process guided by a CNN-based decision function to derive the road network graph directly from the output of the CNN. To the best of our knowledge, [4] is as yet the only work that completely eliminates the intermediate, explicit pixel-wise image labeling step and outputs road center-lines directly, like our method.

Polygon prediction in images has a long history, with methods such as level sets [44] or active contour models [21]. While these methods follow an iterative energy minimization scheme and usually are a final component of multi-step, bottom-up workflows (e.g., [7, 15] for road network refinement), directly predicting polygons from images is a relatively new research direction. We are aware of only six works that move away from pixel-wise labeling and directly predict 2D polygons [10, 8, 4, 5, 1, 29]. Interestingly, [10, 5] apply an unsupervised strategy without making use of deep learning and achieve good results for superpixel polygons [10] and polygonal object segmentation [5]. [8] designed a semi-automated approach where a human annotator first provides bounding boxes surrounding an object of interest; a deep learning approach consisting of an RNN coupled with a CNN then generates a polygon outlining the target object. A recent extension of this work [1] increases the output resolution by adding a graph neural network (GNN) [43, 25]. This approach, as well as the original work of [8], still relies on user input to provide an initial bounding box around the object of interest, or to correct a predicted vertex of the polygon if needed. [29] extracts building footprints by formulating active contours as a deep learning task, where a structured loss imposes learned shape priors that refine an initial extraction result.
In summary, the prior works mentioned above either focus on pixel-level outputs or can only handle a single type of object. The absence of direct topological map extraction in the field of remote sensing is what motivates us to develop a fully automated, end-to-end learnable approach to detect the geometrical shapes of buildings and roads in a given overhead image.

Figure 2: Workflow of our method for both building footprint and road network extraction. The only difference between road and building processing is that we use the corresponding local skip feature via RoIAlign for buildings (bounding boxes provided by FPN), but the entire feature map for roads.
3. Method

We introduce a new, generic approach for extracting topological maps from aerial images using polygons. We first start by discussing the use of polygon representations to describe objects in an image.

3.1. Polygon Representation

We represent objects as polygons. As in [8, 1], we rely on a CNN to find keypoints based on image evidence, which are then connected sequentially by an RNN. A fundamental difference of PolyMapper is that it runs fully automatically, without any human intervention, in contrast to [8, 1], which were originally designed for speeding up manual object annotation. All the models discussed in [8, 1] (including their "prediction mode") require a user to first draw a bounding box that contains the target object and potentially provide additional manual intervention (e.g., drag/add/delete some keypoints) if the object is not correctly delineated.

We refrain from any manual intervention altogether and propose a fully automated workflow. This is, however, difficult for mainly two reasons: (1) multiple objects of interest can appear in a given image patch, and (2) the shapes of different target objects can vary significantly. For instance, buildings are closed shapes of limited extent in the image, while road networks span across entire scenes and are best described with a general graph topology. We therefore present two enhancements to address these problems and then introduce the general pipeline, shown in Fig. 2, for generating object polygons.

3.2. Multiple Targets

Prior work such as [8, 1] is only applicable when a bounding box is provided for each object of interest. These methods are therefore not able to detect objects such as multiple buildings in a given image. We first address the case of buildings by adding a bounding box detection step to partition the image into individual building instances, which allows computing separate polygons for all buildings. To this end, we have integrated the Feature Pyramid Network (FPN) [26] into our workflow and made it an end-to-end model. The FPN further enhances the performance of the region proposal network (RPN) used by Faster R-CNN [42] by exploiting the multi-scale, pyramidal hierarchy of CNNs, resulting in a set of so-called feature pyramids. Once images with individual buildings have been generated, the rest of the pipeline follows the generic procedure described in Sec. 3.4.
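To make the detection-then-crop step concrete, the following sketch shows how per-building local features could be cropped from the 1/8-resolution skip features with RoIAlign using torchvision; the tensor shapes, channel count, and box values are illustrative assumptions of ours, not values from the paper, though the 28×28 output matches the feature size reported in Sec. 3.5.

```python
import torch
from torchvision.ops import roi_align

# Skip features at ~1/8 of a 300x300 input (Sec. 3.4); the channel count
# is an assumption for illustration.
skip_features = torch.randn(1, 128, 38, 38)

# Two hypothetical building boxes from the FPN, in image coordinates
# (x1, y1, x2, y2), one tensor per image in the batch.
boxes = [torch.tensor([[40.0, 60.0, 140.0, 180.0],
                       [200.0, 30.0, 290.0, 120.0]])]

# RoIAlign crops a fixed-size 28x28 local feature map per detected
# building; spatial_scale maps image coordinates to the 1/8-scale grid.
local_features = roi_align(skip_features, boxes,
                           output_size=(28, 28), spatial_scale=1.0 / 8)
print(local_features.shape)  # torch.Size([2, 128, 28, 28])
```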
3.3. From Graphs to Polygons

The inherent topology of objects such as roads or rivers is a general graph instead of a polygon, and the vertices of this graph are not necessarily connected in a sequential manner. In order to reformulate the topology of these objects as a polygon, we follow the principle of a maze solving algorithm, the wall follower, which is also known as the left-/right-hand rule (see Fig. 3): if a maze is simply connected, then by keeping one hand in contact with one wall of the maze, the algorithm is guaranteed to reach an exit.

We apply this principle to extract road sequences. As shown in Fig. 3, the road network can be regarded as a bidirected graph. Each road segment has two directed edges with opposite directions. We assume that for a given pair of directed edges, an edge's partner is always on its left when facing the direction of travel. Suppose we are standing at an arbitrary edge and we travel according to the following rules: (1) always walk facing the direction of the edge; (2) turn right when encountering an intersection; (3) turn around when encountering a dead end. Following this set of rules, we arrive back at the starting point after completing a full cycle (see Fig. 3b). Finally, we connect all keypoints on the way (i.e., intersections and dead ends) in the order of traveling in order to obtain a "polygon" (see Fig. 3c). In this way, the vertices that are originally not sequential in the road graph become ordered.

With a larger patch size or denser road networks, multiple polygons can exist, as shown in Fig. 4. However, we can only get a single polygon by following the rules described above. In order to get all the polygons in a graph, we need to traverse all the road segments twice (forward and backward). In practice, the sequence generation procedure goes as follows: we first traverse all edges in an arbitrary polygon, and for the directed edges that were not visited, we randomly select one and traverse it following the set of rules until all edges in the graph have been visited.
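A minimal sketch of this sequentialization, assuming the road graph is given as vertex coordinates plus undirected edges (the function name and data layout are ours, not the paper's). Which consistent turn is taken at intersections (left vs. right) only flips the orientation of the resulting polygons; here the next directed edge is the first one encountered when rotating clockwise from the reversed incoming direction, which corresponds to the paper's right turn in image coordinates.

```python
import math

def polygons_from_road_graph(coords, edges):
    """Sequentialize an undirected road graph into closed "polygons" with
    the wall-follower rule of Sec. 3.3. Every road segment is traversed
    twice, once per direction; each closed walk ends at its start vertex."""
    adj = {v: [] for v in coords}
    darts = set()                        # directed edges not yet visited
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
        darts.update([(u, v), (v, u)])

    def heading(u, v):
        (x0, y0), (x1, y1) = coords[u], coords[v]
        return math.atan2(y1 - y0, x1 - x0)

    def next_dart(u, v):
        # First neighbor met when rotating clockwise from the reversed
        # incoming direction; at a dead end only u remains: turn around.
        back = heading(v, u)
        def turn(w):
            a = (back - heading(v, w)) % (2.0 * math.pi)
            return a if a > 0.0 else 2.0 * math.pi
        return (v, min(adj[v], key=turn))

    polygons = []
    while darts:
        start = cur = next(iter(darts))  # paper: pick any unvisited edge
        sequence = [cur[0]]
        while True:
            darts.discard(cur)
            sequence.append(cur[1])
            cur = next_dart(*cur)
            if cur == start:
                break
        polygons.append(sequence)
    return polygons

# Example: the T-junction of Fig. 3 (vertex 2 is the intersection).
coords = {1: (0, -1), 2: (0, 0), 3: (-1, 0), 4: (1, 0)}
print(polygons_from_road_graph(coords, [(1, 2), (2, 3), (2, 4)]))
# -> one closed sequence visiting 2 three times, e.g. [1, 2, 3, 2, 4, 2, 1]
```

On this example, the traversal yields a single closed sequence matching the order 1→2→3→2→4→2→1 of Fig. 3c, up to the (arbitrary) starting edge.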
Figure 3: Maze wall follower approach to sequentialization of road topology. (a) Example aerial view of a T-junction, (b) wall follower sequence, (c) resulting "polygon" with sequence order 1→2→3→2→4→2→1.

Figure 4: Road polygon extraction for a larger patch, leading to one outer anticlockwise polygon (orange) and two inner clockwise polygons (blue and green).
3.4. Pipeline

CNN Part. For an input image, we first use a VGG-16 without tail layers as the CNN backbone to extract skip features [41] with 1/8 the size of the input image (see Fig. 2). Meanwhile, the FPN also takes features from different layers of the backbone to construct a feature pyramid and predicts multiple bounding boxes containing the buildings. For a single building, with the skip feature map and its bounding box, followed by RoIAlign [18], the local features F are obtained. We apply convolutional layers to the features in order to generate a heat-map mask of building boundaries B that delineates the object of interest. This is followed by additional convolutional layers outputting a mask of candidate keypoints, denoted by V. Both B and V have a size equal to 1/8 the size of the input image. Among all candidate keypoints, we select the w points with the highest score in V as starting point y0 (same as y−1, see Fig. 5).

As illustrated in Fig. 2, the main procedure of road network extraction is identical to the case of buildings. We only adapt the RoI definition and vertex selection to the road case. While building RoIs are sampled within an image patch, a road RoI corresponds to the entire image patch. Naturally, the generated heatmap B refers to the roads' centerlines instead of building boundaries. Vertex selection is adapted to the road topology by selecting start point candidates at image edges and choosing the one with the highest score as starting point y0 (same as y−1) to predict the unique outer polygon. Note that each segment of the outer polygon should be passed twice unless the segment is shared with an inner polygon. Thus, after the outer polygon is predicted, we choose two vertices of a segment that is passed only once as y−1 and y0 (in reverse direction) to further predict a potential inner polygon.

RNN Part. As illustrated in Fig. 5, the RNN outputs the potential location of yt+1, P(yt+1 | yt, yt−1, y0), at each step t. We input both yt and yt−1 to compute the conditional probability distribution of yt+1 because this allows defining a unique direction: given two neighboring vertices with an order in a polygon, the next vertex in this polygon is uniquely determined. Note that the distribution also involves the end signal <eos> (end of sequence), which indicates that the polygon has reached a closed shape and the prediction procedure should come to an end. The final, end vertex in a polygon thus corresponds to the very first, starting vertex y0, which therefore has to be included at each step.
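One decoding step of this kind can be sketched with a minimal ConvLSTM cell in the spirit of [45]; this is our own illustrative PyTorch code, not the authors' implementation, and the channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: all four gates come from one convolution
    over the concatenated input and hidden state."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], 1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

# One step with the inputs of Fig. 5: features F, masks B and V, and
# one-hot vertex grids yt, yt-1, y0, all on the 28x28 grid of Sec. 3.5.
F_, B_, V_ = (torch.randn(1, 128, 28, 28),
              torch.rand(1, 1, 28, 28), torch.rand(1, 1, 28, 28))
yt = yprev = y0 = torch.zeros(1, 1, 28, 28)
cell = ConvLSTMCell(in_ch=128 + 5, hid_ch=64)
state = (torch.zeros(1, 64, 28, 28), torch.zeros(1, 64, 28, 28))
h, state = cell(torch.cat([F_, B_, V_, yt, yprev, y0], 1), state)
logits = nn.Conv2d(64, 1, 1)(h).flatten(1)  # scores over the 28x28 grid;
# a separate scalar head would supply the <eos> score.
```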
In practice, we ultimately concatenate F, B, V, y0 (also y−1 for polygon prediction in the case of roads) and feed the resulting tensor to a multi-layer RNN with ConvLSTM [45] cells in order to sequentially predict the vertices that will delineate the object of interest, until it predicts the <eos> symbol. For buildings, we simply connect all sequentially predicted vertices to obtain the final building polygon. In the case of roads, the predicted polygon(s) themselves are not needed directly but are rather used as a set of edges between vertices. We thus use all the individual line segments that make up the polygon(s) for further processing. Specifically, each of the predicted segments e is associated with a score s_e calculated as

s_e = \int_0^1 B(e(u)) \, du \in [0, 1],

where e(u) = u e_1 + (1 − u) e_2, B is the heatmap of centerlines, and e_1 and e_2 are the two extremities of e. We remove segments with low scores and connect the remaining segments to form the entire graph.
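The line integral can be approximated by sampling the heatmap along the segment; a small sketch under the assumption that B is a 2D array with values in [0, 1] and that endpoints are given in (x, y) pixel coordinates (the names are ours).

```python
import numpy as np

def segment_score(B, e1, e2, n=32):
    """Approximate s_e = integral of B(e(u)) du, u in [0, 1], by averaging
    bilinearly interpolated heatmap values along the segment."""
    u = np.linspace(0.0, 1.0, n)[:, None]
    pts = u * np.asarray(e1, float) + (1.0 - u) * np.asarray(e2, float)
    h, w = B.shape
    x = np.clip(pts[:, 0], 0, w - 1.001)
    y = np.clip(pts[:, 1], 0, h - 1.001)
    x0, y0 = x.astype(int), y.astype(int)
    fx, fy = x - x0, y - y0
    vals = (B[y0, x0] * (1 - fx) * (1 - fy) + B[y0, x0 + 1] * fx * (1 - fy) +
            B[y0 + 1, x0] * (1 - fx) * fy + B[y0 + 1, x0 + 1] * fx * fy)
    return float(vals.mean())  # stays in [0, 1] when B is in [0, 1]

# Segments scoring below a threshold (0.7 in Sec. 3.5) would be discarded.
```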
Figure 5: Keypoint sequence prediction produced by the RNN for buildings and roads. At each time step t, the RNN takes the current vertex yt and previous vertex yt−1 as input, as well as the first vertex y0, and outputs a conditional probability distribution P(yt+1 | yt, yt−1, y0). When the polygon reaches its starting keypoint and becomes a closed shape, the end signal <eos> is raised. Note that the RNN also takes features generated by the CNN (see Fig. 2) as input at each time step.
3.5. Implementation Details

We set the model parameters using size 28×28 for F, B, V and yt, and set the number of layers of the RNN to 3 (buildings) and 4 (roads). The maximum length of a sequence during training is set to 30 for both cases. The total loss in the building case is a combined loss from the FPN, CNN and RNN parts. The FPN loss consists of a cross-entropy loss for anchor classification and a smooth L1 loss for anchor regression. The CNN loss refers to the log loss for the masks of boundaries and vertices, and the RNN loss is the cross-entropy loss for the multi-class classification at each time step. In the road case, the FPN loss is excluded.

For training, we use the Adam [22] optimizer with batch size 4 and an initial learning rate of 0.0001, as well as default β1 and β2. We trained our model on 4 GPUs for a day for buildings and 12 hours for roads. During training, we force the order in which we visit the edges of the building polygons to be anticlockwise, while for the road polygons we follow the set of rules described in Sec. 3.3.
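As a sketch, the combined building loss described above could be assembled in PyTorch as follows; the tensor names are placeholders, and the unweighted sum is our assumption, since the text does not specify weights between the terms.

```python
import torch.nn.functional as F

def building_loss(anchor_logits, anchor_labels,    # FPN classification
                  anchor_deltas, anchor_targets,   # FPN box regression
                  mask_logits, mask_targets,       # boundary/vertex masks B, V
                  vertex_logits, vertex_targets):  # per-step RNN prediction
    fpn = (F.cross_entropy(anchor_logits, anchor_labels) +
           F.smooth_l1_loss(anchor_deltas, anchor_targets))
    cnn = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    rnn = F.cross_entropy(vertex_logits, vertex_targets)
    return fpn + cnn + rnn  # road case: drop the FPN term
```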
In the inference phase, we use beam search with a width w (which is 5 in our experiments). For buildings, we select the top w vertices with the highest probability in V as the starting vertices, followed by a general beam search procedure. Among the w polygon candidates, we choose the one with the highest probability as the output. Similarly, for roads, we select vertices at the edge of the image, choose the top w with the highest score as starting points, and follow the general beam search algorithm. After the outer polygon is predicted, we can further predict potential inner polygon(s), as mentioned in Sec. 3.4. Finally, we use a threshold of 0.7 (which was found to yield good results) in our experiments to exclude unmatched edges.
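A generic sketch of this inference loop is given below; next_vertex_logprobs stands in for the ConvLSTM decoder (its interface here is our assumption), mapping a partial vertex sequence to scored continuations, with None denoting <eos>.

```python
def beam_search_polygon(start_candidates, next_vertex_logprobs,
                        width=5, max_len=30):
    """Return the closed vertex sequence with the highest total
    log-probability. start_candidates: list of (vertex, logprob) pairs,
    e.g. the top-w peaks of the vertex mask V."""
    beams = [(lp, [v]) for v, lp in start_candidates[:width]]  # (score, seq)
    finished = []
    for _ in range(max_len):
        expansions = []
        for score, seq in beams:
            for vertex, lp in next_vertex_logprobs(seq):
                if vertex is None:                # <eos>: close the polygon
                    finished.append((score + lp, seq + [seq[0]]))
                else:
                    expansions.append((score + lp, seq + [vertex]))
        beams = sorted(expansions, key=lambda b: b[0], reverse=True)[:width]
        if not beams:
            break
    best = max(finished, key=lambda b: b[0], default=None)
    return best[1] if best else None
```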
In addition, for the topological map extraction from a relatively large-scale overhead image of a city, we first divide the whole image into several patches with 50% coverage. In the training phase for building footprints, incomplete footprints at the edge of the image are still used; however, they are excluded in the inference scheme. In the case of roads, in order to get a complete city road network, some post-processing is performed, such as splicing road networks in adjacent patches and removing small loops of the graph and duplicated vertices and edges.

As for efficiency, the average inference time on a single GPU is 0.38s for buildings and 0.29s for roads per image patch (300×300 pixels).
4. Experiments

We are not aware of any publicly available dataset² that contains labeled building footprints and road networks together with aerial images at large scale, and thus create our own dataset (see Sec. 4.3). In order to compare our results to the state of the art, we resort to evaluating building footprint extraction and road network delineation separately on popular task-specific datasets, crowdAI [37] and RoadTracer [4] (see Sec. 4.2).

² Note that the only dataset that contains both building footprints and road centerlines is SpaceNet [49], which runs on the Amazon Cloud and uses images of lower resolution than ours. In addition, we are not aware of any scientific publication of a state-of-the-art approach that uses it.

Table 1: Building extraction results on the crowdAI dataset [37]

Method AP AP50 AP75 APS APM APL AR AR50 AR75 ARS ARM ARL
Mask R-CNN [18, 38] 41.9 67.5 48.8 12.4 58.1 51.9 47.6 70.8 55.5 18.1 65.2 63.3
PANet [28] 50.7 73.9 62.6 19.8 68.5 65.8 54.4 74.5 65.2 21.8 73.5 75.0
PolyMapper 55.7 86.0 65.1 30.7 68.5 58.4 62.1 88.6 71.4 39.4 75.6 75.4

Figure 6: Building footprint extraction results on two example patches of the crowdAI dataset [37] achieved with (a) Mask R-CNN [18, 37], (b) PANet [28], and (c) PolyMapper. Note that results in (a) and (b) are images labeled per pixel, whereas PolyMapper outputs polygons, i.e., vertices connected with line segments.
4.1. Evaluation Measures

For building extraction, we report the standard MS COCO measures, including average precision (AP, averaged over IoU thresholds), AP50, AP75, and APS, APM, APL (AP at different scales). To measure the proportion of buildings detected by our approach with respect to the ground truth, we additionally evaluate average recall (AR), which is not commonly used in previous works such as [18, 28]. Both AP and AR are evaluated using mask IoU. However, we would like to emphasize that in contrast to the pixel-wise output masks produced by common methods for building footprint extraction, our outputs are polygon representations of building footprints.

Figure 7: Comparison of pixel-wise semantic segmentation results of (a) Mask R-CNN and (b) PANet with (c) our direct polygon prediction PolyMapper for an example building.

Evaluating the quality of road networks in terms of topology is a non-trivial problem. [53] propose a connectivity measure, SP, which centers on evaluating shortest path distances between randomly chosen point pairs in the road graph. SP generates a large number of pairs of vertices, computes the shortest path between each pair in both the ground truth and the predicted map, and outputs the fraction of pairs where the predicted length is equal (up to a buffer of 10%) to the ground truth, shorter (erroneous shortcut), or longer (undetected piece of road).

In addition to SP, we propose a new topology evaluation measure that compares shortest paths through graphs [53] using a measure based on average precision (AP) and average recall (AR). This allows an evaluation similar to that of building footprints and compares ground truth and predicted road graphs in a meaningful way. Similar to the definition in [32], we define the similarity score for the lengths of two shortest paths, d* and d, in the ground truth and predicted road graphs as the ratio of minimum and maximum values,

\mathrm{IoU}(d^*, d) = \mathrm{IoU}(d, d^*) = \frac{\min(d^*, d)}{\max(d^*, d)} \in [0, 1]. \tag{1}

Then, with a given IoU threshold t, we can define the weighted precision and recall as follows,

\mathrm{AP}^{\mathrm{IoU}=t} = \frac{\sum_i d_i \,\mathbb{1}[\mathrm{IoU}(d_i, d^*_{j_i}) \ge t]}{\sum_i d_i}, \tag{2}

\mathrm{AR}^{\mathrm{IoU}=t} = \frac{\sum_j d^*_j \,\mathbb{1}[\mathrm{IoU}(d^*_j, d_{i_j}) \ge t]}{\sum_j d^*_j}, \tag{3}

where \mathbb{1}[\cdot] is the indicator function, and d_i and d^*_{j_i} refer to the i-th shortest path in the inferred map and its corresponding shortest path with index j_i in the ground truth graph (similarly for d^*_j and d_{i_j}). Note that the shortest path computation is expensive, and it is infeasible to compute all possible paths exhaustively. We thus randomly sample 100 start vertices and 1,000 end vertices for each of them, which yields 100,000 shortest paths in total.
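A sketch of Eq. (2) on networkx graphs follows; how predicted vertices are matched to ground-truth vertices is not spelled out in the text, so the correspondence is passed in as a given `match` mapping (an assumption of ours). AR in Eq. (3) is the same computation with the roles of the two graphs swapped.

```python
import random
import networkx as nx

def path_iou(d_star, d):
    return min(d_star, d) / max(d_star, d)        # Eq. (1)

def shortest_path_ap(pred, gt, match, t=0.9,
                     n_starts=100, n_ends=1000, weight="length"):
    """Length-weighted precision of sampled shortest paths, Eq. (2)."""
    num = den = 0.0
    nodes = list(pred.nodes)
    for s in random.sample(nodes, min(n_starts, len(nodes))):
        for e in random.sample(nodes, min(n_ends, len(nodes))):
            try:
                d = nx.shortest_path_length(pred, s, e, weight=weight)
                d_star = nx.shortest_path_length(gt, match[s], match[e],
                                                 weight=weight)
            except (nx.NetworkXNoPath, KeyError):
                continue
            if d > 0:
                den += d
                num += d if path_iou(d_star, d) >= t else 0.0
    return num / den if den else 0.0
```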
4.2. Comparison to State-of-the-art

Buildings. We use the crowdAI dataset [37] to validate building footprint extraction results and to compare to the state of the art. This large-scale dataset is split as follows: the training set consists of ∼280,000 images with ∼2,400,000 annotated building footprints, and the test set contains ∼60,000 images with ∼515,000 buildings. Each individual building is annotated in polygon format as a sequence of vertices according to MS COCO [27] standards.

We compare the performance of our model on the crowdAI dataset [37] to the state-of-the-art methods Mask R-CNN [18, 38] and PANet [28]; the results are shown in Tab. 1.
Table 2: Road network extraction results on the RoadTracer dataset [4]
Method SP±5% SP±10% AP85 AP90 AP95 AR85 AR90 AR95
DeepRoadMapper [32] 11.9 15.6 35.9 28.4 19.1 58.2 45.7 27.8
RoadTracer [4] 47.2 61.8 64.9 56.6 42.4 85.3 76.5 56.8
PolyMapper 45.7 61.1 65.5 57.2 40.7 84.2 74.8 53.7
4.3. PolyMapper Dataset

No publicly available dataset provides large-scale annotations of building footprints and road networks for aerial imagery. Thus we created our own dataset, following the same procedure used to obtain the crowdAI [37] and RoadTracer [4] datasets. This new dataset contains building footprints and road networks from OSM [17, 16, 14] and aerial images from Google Maps. We collected data for the three US cities of Boston, Chicago, and Sunnyvale. We did not choose European cities in this work because many buildings typically share the same roof, and polygonal instance segmentation is thus ill-defined (i.e., a single building in the aerial image is often split into multiple instance annotations). As for Asian cities, they usually have a lot of missing annotations in OSM. Our new PolyMapper dataset contains ∼400,000 images; each patch is of size 300×300 pixels and shows zoom level 19 (scale ∼22.57 cm per pixel) in Google Maps, covering 466.587 km² with ∼3,000,000 buildings.

Figure 10: PolyMapper results for (a) Chicago and (b) Sunnyvale. Results for Boston are shown in Fig. 1.

5. Conclusion

We have proposed a novel approach that is able to directly extract topological maps from city overhead imagery with a CNN-RNN architecture. We also propose a novel reformulation method that can sequentialize a graph structure as closed polygons to unify the shapes of different types of objects. Our empirical results on a variety of datasets demonstrate a high level of performance for delineating building footprints and road networks using raw aerial images as input. Overall, PolyMapper performs better than or on par with state-of-the-art methods that are custom-tailored to either building or road network extraction at the pixel level. A favorable property of PolyMapper is that it produces topological structures instead of conventional per-pixel masks, which are much closer to those of real online map services, and are more natural and less redundant. We view our framework as a starting point for a new research direction that directly learns high-level, geometrical shape priors from raw input data through deep neural networks to predict vectorized object representations.
References

[1] David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive annotation of segmentation datasets with polygon-rnn++. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 859–868, 2018.

[2] Mohammad Awrangjeb, Mehdi Ravanbakhsh, and Clive S. Fraser. Automatic detection of residential buildings using lidar data and multispectral imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 65(5):457–467, 2010.

[3] Ruzena Bajcsy and Mohamad Tavakoli. Computer recognition of roads from satellite pictures. IEEE T. Systems, Man, and Cybernetics, 6(9):623–637, 1976.

[4] Favyen Bastani, Songtao He, Mohammad Alizadeh, Hari Balakrishnan, Samuel Madden, Sanjay Chawla, Sofiane Abbar, and David DeWitt. Roadtracer: Automatic extraction of road networks from aerial images. In Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, June 2018.

[5] Jean-Philippe Bauchet and Florent Lafarge. Kippi: Kinetic polygonal partitioning of images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[6] Ksenia Bittner, Fathalrahman Adam, Shiyong Cui, Marco Körner, and Peter Reinartz. Building footprint extraction from vhr remote sensing images combined with normalized dsms using fused fully convolutional networks. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 11(8):2615–2629, 2018.

[7] Matthias Butenuth and Christian Heipke. Network snakes: graph-based object delineation with active contour models. Machine Vision and Applications, 23(1):91–109, 2012.

[8] Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. Annotating object instances with a polygon-rnn. In CVPR, volume 1, page 2, 2017.

[9] Mauro Dalla Mura, Jon Atli Benediktsson, Björn Waske, and Lorenzo Bruzzone. Morphological attribute profiles for the analysis of very high resolution images. IEEE Transactions on Geoscience and Remote Sensing, 48(10):3747–3762, 2010.

[10] Liuyun Duan and Florent Lafarge. Image partitioning into convex polygons. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3119–3127, 2015.

[11] Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.

[12] Mark Everingham, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.

[13] Martin A. Fischler, Jay Martin Tenenbaum, and H. C. Wolf. Detection of roads and linear structures in low-resolution aerial imagery using a multisource knowledge integration technique. Computer Graphics and Image Processing, 15:201–223, 1981.

[14] Jean-Francois Girres and Guillaume Touya. Quality Assessment of the French OpenStreetMap Dataset. Transactions in GIS, 14(4):435–459, 2010.

[15] Jens C. Goepfert, Franz Rottensteiner, and Christian Heipke. Network Snakes for Adapting GIS Roads to Height Data of Different Data Sources - Performance Analysis Using ALS Data and Stereo Images. In ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, volume I-3, pages 209–214, 2012.

[16] Mordechai Haklay. How Good is Volunteered Geographical Information? A Comparative Study of OpenStreetMap and Ordnance Survey Datasets. Environment and Planning B: Urban Analytics and City Science, 37(4):682–703, 2010.

[17] Mordechai Haklay and Patrick Weber. OpenStreetMap: User-Generated Street Maps. IEEE Pervasive Computing, 7(4):12–18, 2008.

[18] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.

[19] Christian Heipke, Hélène Mayer, Christian Wiedemann, and Olivier Jamet. Evaluation of automatic road extraction. In 3D Reconstruction and Modeling of Topographic Objects, 1997.

[20] Pascal Kaiser, Jan Dirk Wegner, Aurélien Lucchi, Martin Jaggi, Thomas Hofmann, and Konrad Schindler. Learning aerial image segmentation from online maps. IEEE Transactions on Geoscience and Remote Sensing, 55(11):6054–6068, 2017.

[21] Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1(4):321–331, 1988.

[22] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[23] Caroline Lacoste, Xavier Descombes, and Josiane Zerubia. Point Processes for unsupervised line network extraction in remote sensing. PAMI, 27(10):1568–1579, 2005.

[24] Adrien Lagrange, Bertrand Le Saux, Anne Beaupere, Alexandre Boulch, Adrien Chan-Hon-Tong, Stéphane Herbin, Hicham Randrianarivo, and Marin Ferecatu. Benchmarking classification of earth-observation data: from learning explicit features to convolutional networks. In International Geoscience and Remote Sensing Symposium (IGARSS), 2015.

[25] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard S. Zemel. Gated graph sequence neural networks. In ICLR, 2016.

[26] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 936–944, 2017.

[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision (ECCV), 2014.

[28] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[29] Diego Marcos, Devis Tuia, Benjamin Kellenberger, Lisa Zhang, Min Bai, Renjie Liao, and Raquel Urtasun. Learning deep structured active contours end-to-end. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8877–8885, 2018.

[30] Dimitrios Marmanis, Konrad Schindler, Jan Dirk Wegner, and Silvano Galliani. Semantic segmentation of aerial images with an ensemble of cnns. ISPRS Annals – ISPRS Congress, 2016.

[31] Dimitris Marmanis, Konrad Schindler, Jan Dirk Wegner, Silvano Galliani, Mihai Datcu, and Uwe Stilla. Classification with an edge: improving semantic image segmentation with boundary detection. ISPRS Journal of Photogrammetry and Remote Sensing, 135:158–172, 2018.

[32] Gellert Máttyus, Wenjie Luo, and Raquel Urtasun. Deeproadmapper: Extracting road topology from aerial images. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[33] Gellért Máttyus, Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Enhancing road maps by parsing aerial images around the world. In International Computer Vision Conference, pages 1689–1697, 2015.

[34] Helmut Mayer, Stefan Hinz, Uwe Bacher, and Emmanuel Baltsavias. A test of automatic road extraction approaches. In IAPRS, volume 36(3), pages 209–214, 2006.

[35] Volodymyr Mnih and Geoffrey E. Hinton. Learning to detect roads in high-resolution aerial images. In European Conference on Computer Vision, 2010.

[36] Volodymyr Mnih and Geoffrey E. Hinton. Learning to label aerial images from noisy data. In International Conference on Machine Learning, 2012.

[37] Sharada Prasanna Mohanty. Crowdai dataset. https://www.crowdai.org/challenges/mapping-challenge/dataset_files, 2018.

[38] Sharada Prasanna Mohanty. Crowdai mapping challenge 2018: Baseline with mask rcnn. https://github.com/crowdai/crowdai-mapping-challenge-mask-rcnn, 2018.

[39] Javier A. Montoya-Zegarra, Jan Dirk Wegner, Lubor Ladicky, and Konrad Schindler. Semantic segmentation of aerial images in urban areas with class-specific higher-order cliques. In ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, volume II(3/W4), pages 127–133, 2015.

[40] Sakrapee Paisitkriangkrai, Jamie Sherrah, Pranam Janney, and Anton van den Hengel. Effective semantic pixel labelling with convolutional networks and conditional random fields. In CVPRws, 2015.

[41] Pedro O. Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. Learning to refine object segments. In European Conference on Computer Vision, pages 75–91. Springer, 2016.

[42] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.

[43] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. Trans. Neur. Netw., 20(1):61–80, Jan. 2009.

[44] James Albert Sethian. Level Set Methods. Cambridge University Press, 1 edition, 1996.

[45] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 802–810, Cambridge, MA, USA, 2015. MIT Press.

[46] Gunho Sohn and Ian Dowman. Data fusion of high-resolution satellite imagery and lidar data for automatic building extraction. ISPRS Journal of Photogrammetry and Remote Sensing, 62:43–63, 2007.

[47] Radu Stoica, Xavier Descombes, and Josiane Zerubia. A Gibbs Point Process for road extraction from remotely sensed images. IJCV, 57(2):121–136, 2004.

[48] Piotr Tokarczyk, Jan Dirk Wegner, Stefan Walk, and Konrad Schindler. Beyond hand-crafted features in remote sensing. In ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, volume II-3/W1, pages 35–40, 2013.

[49] Adam van Etten, Dave Lindenbaum, and Todd M. Bacastow. Spacenet: A remote sensing dataset and challenge series. arXiv, arXiv:1807.01232v2:1–21, 2018.

[50] Michele Volpi and Vittorio Ferrari. Semantic segmentation of urban scenes by learning local class interactions. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1–9, 2015.

[51] Michele Volpi and Devis Tuia. Dense semantic labeling of subdecimeter resolution images with convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing, 55(2):881–893, 2017.

[52] Jan Dirk Wegner, Javier Alexander Montoya-Zegarra, and Konrad Schindler. Road networks as collections of minimum cost paths. ISPRS Journal of Photogrammetry and Remote Sensing, 108:128–137, 2015.

[53] Jan Dirk Wegner, Javier A. Montoya-Zegarra, and Konrad Schindler. A higher-order crf model for road network extraction. In CVPR, pages 1698–1705, 2013.