
Open access

Score-based Graph Learning for Urban Flow Prediction

Authors:

Fan Zhou

ACM Transactions on Intelligent Systems and Technology, Volume 15, Issue 3

Article No.: 59, Pages 1 - 25

https://doi.org/10.1145/3655629

Published: 17 May 2024


Abstract

Accurate urban flow prediction (UFP) is crucial for a range of smart city applications such as traffic management, urban planning, and risk assessment. To capture the intrinsic characteristics of urban flow, recent efforts have utilized spatial and temporal graph neural networks to deal with the complex dependence between the traffic in adjacent areas. However, existing graph neural network based approaches suffer from several critical drawbacks, including improper graph representation of urban traffic data, lack of semantic correlation modeling among graph nodes, and coarse-grained exploitation of external factors. To address these issues, we propose DiffUFP, a novel probabilistic graph-based framework for UFP. DiffUFP consists of two key designs: (1) a semantic region dynamic extraction method that effectively captures the underlying traffic network topology, and (2) a conditional denoising score-based adjacency matrix generator that takes spatial, temporal, and external factors into account when constructing the adjacency matrix, rather than simply concatenating them as existing studies do. Extensive experiments conducted on real-world datasets demonstrate the superiority of DiffUFP over the state-of-the-art UFP models and the effectiveness of the two specific modules.

1 Introduction

With the advancement of mobile computing technology such as GPS, cellular, and mobile apps, it has become more and more convenient to collect spatial and temporal data such as human mobility, trajectories of vehicles, and shared bicycles [8, 67], which, in turn, enable the analysis of city traffic and flow movement that play crucial roles in various applications, ranging from intelligent transportation systems and urban planning to public safety (e.g., epidemic spread analysis) and smart city management [66]. As an important and challenging problem, urban flow prediction (UFP) has received increasing attention and has been extensively studied in recent years [21, 23, 26, 27, 34, 42, 59, 61].

Because of the inherent spatio-temporal property of urban flow, early approaches [29, 60] use convolutional neural networks (CNNs) to learn the spatial correlation, whereas recurrent neural networks (RNNs) such as LSTM and GRUs are typically used to capture the dynamic temporal correlation. Later, researchers began to exploit various advanced deep learning models like residual networks [26, 59] and meta-learning [36] to better capture the spatio-temporal correlations between different areas. Graph neural networks (GNNs) have recently demonstrated their ability to effectively represent graph-structured data and have been widely adopted in spatio-temporal learning [11, 12, 53]. As a result, a number of GNN-based methods for urban traffic analysis and flow prediction have been proposed [11, 56, 63]. These methods, in general, formulate UFP as a Spatio-Temporal Graph (STG) prediction problem and have achieved promising results by combining GNNs with some temporal dynamic modeling techniques [36, 61, 62].

Nonetheless, accurate UFP is difficult due to the complex spatio-temporal interactions. Existing graph-based UFP methods still have three flaws that prevent them from addressing the UFP issue effectively.

The first flaw is semantically insufficient or inflexible construction of the urban flow graph. Existing GNN-based UFP models [30, 56, 63] partition the map into grids evenly and take the equal-sized regions as nodes to construct the urban flow graph, as illustrated in Figure 1. They model the inflow and outflow of a node by aggregating the information (i.e., flow) of its immediate neighbor (1-hop) nodes, which allows the GNN to easily handle the input flow data and forecast the output urban flow. However, this general processing method fails to consider the real-world road networks and semantic regions (e.g., schools, parks, and residential areas). For example, several grids may divide a semantic region, and a trajectory may lead to multi-hop message passing in GNNs and therefore inaccurate predictions. Besides, grid partition typically leads to a large-scale graph that requires more complex GNNs to capture the underlying interactions between nodes and costs more intensive computations.

Fig. 1.

Researchers have suggested region-based urban flow graphs to conduct relational inference on a higher semantic level to address the shortcomings of grid-based urban flow graphs. For example, a recent study [24] uses Mincut theory to cluster semantically similar regions, which is also semantically inadequate for creating the STG. Sun et al. [42] consider irregular regions partitioned according to real-world road networks as nodes to construct the STG. Nevertheless, it is inflexible for general STG construction, as it requires additional information regarding the road networks, which vary from city to city and are dynamically changed.

The second flaw is coarse-grained and manually crafted temporal features. The time attribute is another important dimension of urban traffic and a key factor in accurately predicting urban traffic. Examining the temporal dynamics allows us to learn the trends and periodicity of urban flow. Previous methods usually follow the pioneering work [59], which extracts time series segments of three different interval lengths from the traffic data to reflect three temporal correlations: closeness, periodicity, and trend. For example, ST-GDN [62] uses multi-scale self-attention to explore the multi-level temporal contextual information and to capture the temporal hierarchy of traffic flow regularities. However, existing methods need to manually define the temporal resolutions in advance, although the temporal property of urban traffic is too complicated to determine all temporal resolutions explicitly. Besides, temporal properties at different levels can influence each other. For example, the daily peaks of inflows and outflows in outdoor stadiums differ in winter and summer, so the seasonal temporal property influences the daily temporal property. Therefore, it is necessary to capture the temporal dynamics in a flexible way rather than through the manually crafted and coarse-grained time feature engineering of existing studies.

The third flaw is improper usage of external factors. External factors such as weather, weekends, and holidays are important indicators of urban traffic flow [24, 61]. These factors, in essence, affect people’s travel patterns and hence significantly impact flow prediction. For example, flows between residential areas and entertainment regions drop drastically in poor weather. However, most previous solutions incorporate external factors straightforwardly. For instance, some approaches [26, 42, 59, 60] simply fuse the external factors with the flow data and then feed them into fully connected neural networks for prediction. This simplistic utilization disregards the correlations between external factors and urban flow transitions. To study the effects of external influences and the function of the land at the same time, STRN [24] proposes a meta learner that employs matrix factorization to identify relationships between regions and external factors. This approach, however, is constrained by a fixed region partition and does not take into account varied combinations of external factors.

To address the preceding challenges, we propose DiffUFP, a novel graph-based framework for UFP. First, we introduce a dynamic semantic region extractor to learn the underlying traffic structure from the Euclidean grid-like urban flow map. It extracts the key semantic regions dynamically and models the interactions between different semantic regions. The resulting dynamic node representations strengthen the message passing between nodes and let the model learn the temporal property of urban traffic smoothly. Second, we design a spatio-temporal adjacency matrix generator to understand the transitions of urban flows under diversified external factors. Specifically, DiffUFP constructs a semantic adjacency matrix to find the similarities of flow data and capture temporal semantic connections. To handle various external factors, we use a denoising score-based model to generate an adjacency matrix conditioned on external factors. After fusing it with a spatial adjacency matrix, we obtain the final spatio-temporal adjacency matrix. With the node and edge representations produced by these two components, our graph-based model can effectively deal with the UFP problem.

The contributions of this work are fourfold:

- We propose a novel probabilistic GNN-based method for UFP, which constructs the STG to model the flow of urban crowds flexibly and reasonably.

- We design a dynamic graph-based node representation method to sufficiently capture the temporal dynamics. Our approach extracts the critical semantic regions from the Euclidean grid-like urban flow map by analyzing the spatio-temporal evolution of urban flow data.

- We introduce a conditional denoising score-based method to generate the adjacency matrix from latent semantic space, allowing the model to capture complex spatial and temporal transitions between different regions by taking full advantage of external factors.

- We conduct extensive experimental evaluations on real-world datasets collected from cities in different countries and regions. The results demonstrate the superiority of our method, which not only significantly improves the accuracy of UFP but also provides interpretations of the model behaviors.

The rest of this article is organized as follows. In Section 2, we review the related work and position our work in that context. Section 3 formally defines the problem and introduces necessary background knowledge. Section 4 presents the details of our solution and insightful analysis. In Section 5, we compare DiffUFP to the state-of-the-art models under various configurations and demonstrate DiffUFP’s effect in solving the UFP problem. Finally, we conclude this study and outline future work in Section 6.

2 Related Work

We now review the related studies in UFP, graph-based spatio-temporal data analysis, and diffusion-based generative models, and point out the difference of this work.

2.1 Urban Flow Prediction

As one of the essential tasks in traffic management, UFP has attracted increasing interest in the past decade [57, 69]. Classic machine learning techniques have been widely used to model urban flows, such as ARIMA [35] and support vector regression [3]. However, the conventional machine learning models suffer from the underfitting problem due to their poor ability to learn high-dimensional representations and the requirements of expensive feature engineering. Recently, deep learning methods have been introduced to improve the performance of learning non-linear relations and complex interactions between flows in different areas. Earlier research applies RNN and its variants (e.g., LSTM and GRU) to capture the temporal features of urban flow sequences [28, 58]. For example, DeepUrbanEvent [19] builds the system with RNN to predict the crowd dynamics at big events. Nevertheless, the preceding methods fail to consider intrinsic spatial correlations.

Zhang et al. [60] leverage CNNs to extract the spatial correlations between grid regions on the citywide urban flow map. To model the spatial information, the extended works combine RNNs and CNNs [29, 54] or utilize residual neural networks [13] to solve the long-term dependencies. For example, ST-ResNet [59] exploits the residual connection to alleviate the overfitting problem in UFP. DeepSTN+ [26] employs a modified convolutional network structure to model the long-range spatial correlations among crowd flows. In addition, some researchers take the transitions of urban flows as another task to perform multi-task UFP. For instance, Zhang et al. [61] propose a multi-task framework to simultaneously predict the node flow and edge flow. MT-ASTN [48] designs a shared-private framework to predict crowd flows and the origin-destination locations of the crowd flow in an adversarial learning manner. Some hybrid approaches have also been proposed and significantly improved urban flow prediction. For example, ST-MetaNet+ [37] uses meta-learning to learn the traffic-related embeddings of nodes and edges from the geo-graph attributes and the traffic context from the dynamic traffic states. STDEN [17] models the spatio-temporal dynamic process of the potential energy field as a differential equation network to integrate physical principles and data-driven models. DeepCrowd [18] proposes a high-dimensional and pyramid architecture attention mechanism based on convolutional LSTM to deal with human mobility in a dataset generated from real-world smartphone applications.

With the increasing interest in acquiring fine-grained urban flow data from coarse-grained urban flow data, a few studies focus on the fine-grained urban flow inference (FUFI) problem [51, 52]. UrbanFM [23] is the first study that formalizes the FUFI problem, which employs deep residual architecture [13] as the backbone and designs a normalization method to capture the spatial constraint. FODE [65] takes advantage of neural ordinary differential equations to balance inference accuracy and computational efficiency. UrbanPy [34] adopts a pyramid architecture to solve the large-scale upsampling issues in FUFI. Liang et al. [25] propose a universal method called DeepLGR that can address the UFP and FUFI simultaneously, which consists of a global module for holistic information learning and a local module to capture the nearby information. In this article, we focus on the UFP problem. However, our method can be easily adapted to address the FUFI problem.

2.2 Graph-Based Spatio-Temporal Learning

GNNs have proved to be a helpful framework for modeling non-Euclidean graph-structured data [31, 49, 68]. Recently, exploiting GNNs for spatio-temporal data mining has been studied extensively, among which traffic speed prediction is a typical spatio-temporal learning task. For example, Yu et al. [56] integrate a graph convolutional network (GCN) [7] and a convolutional sequence model to capture the spatial and temporal correlations, respectively. Yao et al. [55] combine CNNs and the graph embedding method to extract the spatio-temporal signals. STDN [54] learns the transition regularities of the traffic flow by exploiting the periodically shifted attention. Geng et al. [11] employ a multi-modal GCN to learn the interactions between regions, including neighboring ones. Wang et al. [50] propose a spatio-temporal GNN with a learnable positional attention mechanism to capture spatial and temporal patterns comprehensively. Zheng et al. [63] use the graph multi-attention mechanism and an encoder-decoder framework for traffic prediction. Wang et al. [47] introduce the attention mechanism to aggregate information from adjacent roads. To overcome the limit of network depth and improve the capacity of capturing longer-range spatio-temporal correlations, STGODE [10] proposes a continuous representation method of GNNs by utilizing a tensor-based ordinary differential equation.

The graph-based methods have also been used in UFP, and the grid regions of the urban flow map are generally considered as nodes [61]. Pan et al. [36] utilize the meta-learning method to perform knowledge transfer across regions using a recurrent graph attentive network. Zhang et al. [62] propose a graph-based framework that embeds multi-level temporal contextual signals by a multi-scale self-attention network. Notwithstanding the promising improvements achieved by these approaches, they ignore the gap between grid-like urban flow map partition and real-world road networks. Meanwhile, the semantic information over time and the impact of external factors on the transition pattern of people are not well considered in previous studies. In contrast, our method can extract the traffic network’s underlying structure from the Euclidean grid-like urban flow map and learn the transitions of urban flows through a well-designed conditional denoising score-based model.

2.3 Diffusion-Based Generative Model

Diffusion-based models refer to a family of methods that learn generative models as transition operators of Markov chains. In general, they define a Markov chain of diffusion steps that slowly adds Gaussian noise to data and learn to reverse the diffusion process to generate desired data samples from the noise. There are two kinds of diffusion models: the denoising diffusion probabilistic model (DDPM) [14] and the score-based model [40]. The latter optimizes the score matching objective [16], whereas DDPM optimizes a variational lower bound on the log-likelihood.

Recently, deep diffusion models have shown strong performance in a range of tasks such as image generation [14, 40] and audio processing [5, 20, 44, 45]. For example, Kong et al. [20] and Chen et al. [5] use DDPM to generate high-fidelity audio conditional on mel-spectrograms. Ho et al. [14] achieve comparable and even better sample quality than the GAN-based image generation method [2]. Song and Ermon [40] designed a noise conditional score-based network and later provided a unified framework exploring the stochastic differential equation to improve score-based generative models [41], where the diffusion process is modeled as the discretization of a continuous stochastic differential equation. Dockhorn et al. [9] propose a well-designed forward diffusion process by augmenting the data variable with an additional velocity variable, which brings a smoother diffusion process and requires fewer sampling overheads. Chao et al. [4] formulate a new training objective to assist the classifier in matching the gradients of the authentic log-likelihood density under conditional situations. Besides, Rasul et al. [38] utilize DDPM for time series forecasting and show excellent performance. Tashiro et al. [46] successfully employ a score-based model to handle probabilistic time series imputation. Niu et al. [33] adopt a score-based model to model the graph and propose a permutation equivariant GNN. However, leveraging the diffusion-based model to handle the complicated spatio-temporal dependence is still under exploration.

3 Preliminaries

In this section, we first define the UFP problem studied in this work and introduce the necessary background of score-based models. The frequently used notations throughout this article are summarized in Table 1.

Table 1.

Notation	Description
\(I, J\)	number of rows and columns of meshing city flow map
\(V=\lbrace r_{i,j}\rbrace\)	grid regions set, \(1 \le i \le I, 1 \le j \le J\)
N	number of grid regions (i.e., \(N = I \times J\))
\(\mathcal {P}\)	traffic trajectories set
\(T_r\)	a trajectory of \(\mathcal {P}\)
\(L_{T_r}\)	number of coordinate points contained in \(T_r\)
\(\mathcal {T}\)	available time interval set
g	geospatial coordinate (i.e., longitude and latitude)
\(m_{t, i, j}^{in}, m_{t, i, j}^{out}\)	inflow and outflow of region \(r_{i,j}\) at time t
\(\tau _{in}, \tau _{\it out}\)	number of timestamps for history/future urban flow
\(\mathbf {M}^S, \mathbf {M}^T\)	historical urban flow and future urban flow
\(\mathbf {E}\)	historical external factors
\(\mathbf {m}_{t}, \mathbf {e}_t\)	tensor of node flow data and external factors at time t
\(N_r\)	number of excavated semantic regions
\(\mathbf {X}^C, \mathbf {X}^P, \mathbf {X}^T\)	fragment of historic urban flows that denotes near history, recent time, and weekly periodicity, respectively
\(\mathbf {X}^f\)	fused historic sequence of urban flows
\(\mathbf {X}^{\prime }\)	dynamically extracted nodes representations
\(\mathbf {X}\)	fused nodes representations
\(\mathbf {A}^{sp}\)	spatial adjacency matrix
\(\mathbf {A}^{te}\)	temporal semantic adjacency matrix
\(\mathbf {A}^{\prime }\)	conditional generative semantic adjacency matrix
\(\mathbf {A}^{st}\)	spatio-temporal semantic adjacency matrix

Table 1. Notations

3.1 Problem Formulation

We consider a city that is partitioned into \(N\) (\(N = I \times J\)) equal-sized cell regions based on the longitude and latitude. Let \(V=\lbrace r_{1,1},\ldots ,r_{i,j},\ldots ,r_{I,J}\rbrace\) denote all cell regions, where \(r_{i,j}\) represents the cell region in the i-th row and j-th column of the grid map.

Definition 1 (Urban Flow).

Let \(\mathcal {T} = \left\lbrace t_1,\ldots ,t_{|\mathcal {T}|}\right\rbrace\) be a sequence of time intervals, \(| \cdot |\) denote the cardinality of the set, and \(\mathcal {P}\) be the collection of urban flow trajectories. Given a cell region \(r_{i,j}\), the corresponding inflow \(m_{t, i, j}^{in}\) and outflow \(m_{t, i, j}^{out}\) of the urban traffic in time slot t are defined as follows:

\[\begin{aligned} m_{t, i, j}^{in} &= \sum _{T_{r} \in \mathcal {P}} \left|\left\lbrace g_{l} \mid g_{l-1} \notin r_{i, j} \wedge g_{l} \in r_{i, j} \wedge \tau _{l-1} \in t \wedge l > 1\right\rbrace \right|, \\ m_{t, i, j}^{out} &= \sum _{T_{r} \in \mathcal {P}} \left|\left\lbrace g_{l} \mid g_{l} \in r_{i, j} \wedge g_{l+1} \notin r_{i, j} \wedge \tau _{l} \in t \wedge l < L_{T_r}-1\right\rbrace \right|, \end{aligned}\]

where \(T_r : g_1 \rightarrow g_2 \rightarrow \ldots \rightarrow g_{L_{T_r}}\) is a trajectory in \(\mathcal {P}\), \(g_l\) is the geospatial coordinate with index \(l \in [2,L_{T_r}]\), and \(L_{T_r}\) denotes the length of \(T_r\). Note that \(g_l \in r_{i,j}\) means the object (e.g., a person, taxi, or bicycle) is within region \(r_{i,j}\), and \(\tau _l\) is the corresponding timestamp. The inflow and outflow of all regions at time t are denoted as a crowd flow tensor \(\mathbf {m}_t \in \mathbb {R}^{2 \times I \times J }\). Following previous studies [61], we flatten \(\mathbf {m}_t\) to a two-dimensional tensor with the shape \(1 \times 2N\).
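As a minimal sketch of Definition 1, the following Python snippet counts inflow and outflow from trajectories on a regular grid. The names `cell_of`, `count_flows`, and `slot_of`, the unit cell size, and the toy trajectories are assumptions made for illustration and do not come from the article. The sketch also counts every pair of consecutive points, so it treats the index bounds in the definition loosely.

```python
from collections import defaultdict


def cell_of(point, origin=(0.0, 0.0), cell_size=1.0):
    """Map a (lat, lon) point to a (row, col) grid cell (hypothetical helper)."""
    row = int((point[0] - origin[0]) // cell_size)
    col = int((point[1] - origin[1]) // cell_size)
    return row, col


def count_flows(trajectories, slot_of, cell_fn=cell_of):
    """Count inflow and outflow per (time slot, cell).

    trajectories: list of trajectories, each a list of (timestamp, point) pairs.
    slot_of: function mapping a timestamp to a time-slot index.
    """
    inflow = defaultdict(int)
    outflow = defaultdict(int)
    for traj in trajectories:
        for (ts_prev, p_prev), (_, p_cur) in zip(traj, traj[1:]):
            c_prev, c_cur = cell_fn(p_prev), cell_fn(p_cur)
            if c_prev == c_cur:
                continue  # the object stayed inside the same cell
            slot = slot_of(ts_prev)  # both counts use the earlier timestamp
            outflow[(slot, c_prev)] += 1  # the object left c_prev
            inflow[(slot, c_cur)] += 1    # the object entered c_cur
    return inflow, outflow


# Toy data: two trajectories that both end up in cell (0, 1).
trajectories = [
    [(0, (0.5, 0.5)), (1, (0.5, 1.5)), (2, (0.5, 1.7))],
    [(0, (1.5, 1.5)), (3, (0.5, 1.5))],
]
inflow, outflow = count_flows(trajectories, slot_of=lambda ts: ts // 2)
```

Stacking the two counters over all cells of one time slot would give the \(2 \times I \times J\) tensor \(\mathbf {m}_t\) described above.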

Urban traffic is closely related to external factors, such as time of day, events (e.g., holidays, weekends) [64], and weather conditions (temperature, wind speed, etc.) [42]. At time t, we denote these external factors as \(\mathbf {e}_t \in \mathbb {R}^{l_e}\).

Now we can formally define the problem of UFP.

Problem 1 (Urban Flow Prediction).

Given the historical traffic flow \(\mathbf {M}^S=\left(\mathbf {m}_{t-\tau _{\it in}+1}, \ldots , \mathbf {m}_{t}\right)\) and the corresponding external factors \(\mathbf {E}=\left(\mathbf {e}_{t-\tau _{\it in}+1}, \ldots , \mathbf {e}_{t}\right)\), a UFP model \(f_\text{UFP}\) forecasts the traffic flow \(\mathbf {M}^T=\left(\hat{\mathbf {m}}_{t+1}, \ldots , \hat{\mathbf {m}}_{t + \tau _{\it out}}\right)\) over the next \(\tau _{\it out}\) timesteps:

\begin{align} \mathbf {M}^T = f_\text{UFP}(\mathbf {M}^S, \mathbf {E}). \end{align}

(1)
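To make the input and output windows of Problem 1 concrete, the sketch below slices a flow sequence into (history, external factors, target) samples. It assumes that the target covers the next \(\tau _{\it out}\) timesteps after t. The function name `make_samples` and the toy sequences are hypothetical, and integers stand in for the flow tensors \(\mathbf {m}_t\).

```python
def make_samples(flows, externals, tau_in, tau_out):
    """Slice aligned sequences into (M^S, E, M^T) samples.

    flows[t] and externals[t] play the roles of m_t and e_t.
    Each sample uses tau_in past steps and tau_out future steps.
    """
    if len(flows) != len(externals):
        raise ValueError("flows and externals must have the same length")
    samples = []
    for t in range(tau_in - 1, len(flows) - tau_out):
        history = flows[t - tau_in + 1 : t + 1]      # m_{t-tau_in+1}, ..., m_t
        factors = externals[t - tau_in + 1 : t + 1]  # e_{t-tau_in+1}, ..., e_t
        target = flows[t + 1 : t + 1 + tau_out]      # the next tau_out steps
        samples.append((history, factors, target))
    return samples


# Toy sequences: integers and strings stand in for flow tensors and factors.
flows = list(range(10))
externals = [f"e{t}" for t in range(10)]
samples = make_samples(flows, externals, tau_in=3, tau_out=2)
```

A model \(f_\text{UFP}\) would then be trained to map each (history, factors) pair to its target.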

3.2 Score-Based Generative Model

Given a dataset where each point is drawn independently from an underlying data distribution \(p(\mathbf {x})\), we call \(\nabla _{\mathbf {x}} \log p(\mathbf {x})\) its score function. Because taking the gradient of the log-density removes the normalizing constant, the score function does not depend on the partition function and is, in many cases, easier to estimate and model than the probability density function itself.

A score-based model trains a neural network \(\mathbf {s}_{\boldsymbol {\theta }}(\mathbf {x})\) parameterized by \(\boldsymbol {\theta }\), called the score network, so that \(\mathbf {s}_{\boldsymbol {\theta }}(\mathbf {x}) \approx \nabla _{\mathbf {x}} \log p(\mathbf {x})\). Once the score function is known, Langevin dynamics can be used to sample from the corresponding distribution: it provides an MCMC (Markov chain Monte Carlo) procedure that requires only the score function. Estimating the score function from data and then generating new samples with Langevin dynamics is known as score-based generative modeling.

However, score estimates are often inaccurate in low-density regions, where few data points are available for computing the score matching objective. As a remedy, NCSN [40] perturbs the data with Gaussian noise of different scales and uses a single neural network to jointly estimate the score functions of all noise-perturbed data distributions. Ideally, with enough data and model capacity, one can obtain the optimal score network. Samples are then generated by annealed Langevin dynamics, a variant of Langevin dynamics that combines information from all noise scales.
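The sampling procedure can be illustrated with a small, self-contained sketch of annealed Langevin dynamics in one dimension. This is not the article's implementation. An analytic score of a Gaussian \(N(2, 0.5^2)\) perturbed by noise of scale \(\sigma\) stands in for a trained score network, and the noise levels, step size `eps`, and number of steps are illustrative assumptions. The step size is scaled with \(\sigma ^2\) across noise levels, which is one common choice.

```python
import math
import random


def gaussian_score(x, sigma, mu=2.0, s=0.5):
    """Score of N(mu, s^2) after adding Gaussian noise of scale sigma.

    Stand-in for a trained score network s_theta(x, sigma).
    """
    return -(x - mu) / (s * s + sigma * sigma)


def annealed_langevin(score_fn, sigmas, steps_per_level=100, eps=0.002, rng=None):
    """Draw one sample by running Langevin updates from large to small noise."""
    if rng is None:
        rng = random.Random()
    x = rng.uniform(-5.0, 5.0)  # arbitrary starting point
    sigma_min = sigmas[-1]
    for sigma in sigmas:  # noise levels in decreasing order
        alpha = eps * (sigma / sigma_min) ** 2  # step size for this level
        for _ in range(steps_per_level):
            noise = rng.gauss(0.0, 1.0)
            x = x + 0.5 * alpha * score_fn(x, sigma) + math.sqrt(alpha) * noise
    return x


rng = random.Random(0)
sigmas = [1.0, 0.56, 0.32, 0.18, 0.1]
samples = [annealed_langevin(gaussian_score, sigmas, rng=rng) for _ in range(500)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
```

The large noise levels move samples quickly toward the bulk of the distribution, and the small ones refine them. With these settings, the sample mean should land near 2 and the variance near that of the slightly noise-perturbed target.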

4 Methodologies

Figure 2 illustrates the overall framework of DiffUFP. As shown in the left part of Figure 2, the input of the node representation learning module includes closeness \(\mathbf {X}^C\), period \(\mathbf {X}^P\), and trend \(\mathbf {X}^T\), following previous studies [59]. The three key time series are extracted from \(\mathbf {M}^S\) and used to capture three different properties of urban traffic flow. We first leverage multi-head self-attention to mine the critical information of each sequence. Then, the three outputs are concatenated and fed into a Multi-Layer Perceptron (MLP) to produce the fused input sequence \(\mathbf {X}^f\). After that, \(\mathbf {X}^f\) is fed into the dynamic semantic region extractor to generate \(\mathbf {X}^{\prime }\), which is added element-wise to \(\mathbf {X}^f\) to produce more expressive node representations \(\mathbf {X}\). The edge representation learning module takes external factors \(\mathbf {E}\) and \(\mathbf {X}\) as inputs, and a spatio-temporal adjacency matrix generator is designed to generate edge representations \(\mathbf {A}^{st}\), which bear rich spatio-temporal semantic information. Finally, the edge representation \(\mathbf {A}^{st}\) and the node representation \(\mathbf {X}\) are fed into two STG layers to predict the future flow. In the following, we explain the details of the main components of DiffUFP, including the dynamic semantic region extractor (Section 4.1) and the spatio-temporal adjacency matrix generator (Section 4.2).

Fig. 2.

4.1 Dynamic Semantic Region Extractor

In UFP, the explicit traffic structure is usually unavailable, and we only have access to the grid-based traffic distributions. The critical semantic regions change over time with complex spatial and temporal relationships. It is therefore inefficient to simply take grid regions as nodes. Hence, we propose a dynamic semantic region extraction method to enhance the urban flow representations. Our approach learns to discover the well-delimited salient regions in space and time to model the interactions between various semantic entities in urban traffic networks. Semantic nodes associate themselves with such salient regions, enabling the message passing between nodes to model the urban flow transitions effectively by exploiting this inductive bias.

The main structure of the dynamic semantic region extractor is illustrated in Figure 3. This module first receives the fused input sequence \(\mathbf {X}^f \in \mathbb {R}^{\tau _{in}^{\prime } \times 2 \times I \times J}\) and then infers the locations and sizes of \(N_r\) salient semantic regions. These semantic regions are then treated as graph nodes and projected to their original positions on the input map. Therefore, the dynamic semantic region extractor consists primarily of two components: a region extractor and a graph processor.

Fig. 3.

4.1.1 Region Extractor.

The region extractor aggregates the input features to extract the salient semantic regions that are parameterized by their location \((\Delta x,\Delta y)\) and size \((w, h)\). Two convolution layers are used to capture the local information from the input while preserving sufficient positional information at each timestep. For each semantic region node, we use a fully connected network (FCN) \(f_i\) to generate the latent representation \(\hat{\mathbf {h}}_{i, t}\):

\begin{align} \hat{\mathbf {h}}_{i, t}={f_i}\Big (\text{CNN}\Big (\mathbf {{X}}^{f}_{t}\Big)\Big) \in \mathbb {R}^{C}, \forall i \in [1, N_r]. \end{align}

(2)

In our experiments, we set the channel dimension C to 2 and the convolution kernel to \(3 \times 3\) with stride 1. To capture the temporal information, we use the GRU [6] to fuse the historical latent representations of semantic regions and thus obtain the temporal hidden state \(\mathbf {z}_{i, t}\):

\begin{align} \mathbf {z}_{i,t} = \text{GRU}\big (\mathbf {z}_{i,t-1},\hat{\mathbf {h}}_{i, t}\big) \in \mathbb {R}^{C}, \forall i \in [1, N_r], \end{align}

(3)

where \(\mathbf {z}_{i,0}\) is a randomly initialized vector. After a linear projection, we get the predicted region parameters:

\begin{align} \Delta x_{i, t}, \Delta y_{i, t}, w_{i, t}, h_{i, t} =\xi \left(W_{o} \mathbf {z}_{i, t}\right) \in \mathbb {R}^{4}, \end{align}

(4)

where \(W_{o} \in \mathbb {R}^{4 \times C}\), \(\xi\) denotes another FCN that controls the initialization of the size and location of the predicted regions, and \((\Delta x_{i, t}, \Delta y_{i, t})\) and \((w_{i, t}, h_{i, t})\) represent the center location and the size of the i-th semantic region at time t, respectively. Each region node is then interpolated onto the grid-based map using a kernel function \(\mathcal {K}\) that is separable along the two axes:

\begin{align} &k_{x}^{i}\left(p_{x}\right)=\max \left(0, w_{i}-\left|\Delta x_{i}-p_{x}\right|\right), \end{align}

(5)

\begin{align} &k_{y}^{i}\left(p_{y}\right)=\max \left(0, h_{i}-\left|\Delta y_{i}-p_{y}\right|\right), \end{align}

(6)

\begin{align} \mathcal {K}^{i}\left(p_{x}, p_{y}\right)=k_{x}^{i}\left(p_{x}\right) k_{y}^{i}\left(p_{y}\right) \in \mathbb {R}, \end{align}

(7)

where \((p_{x}, p_{y})\) indicates a spatial location on the input urban flow map, and \(w_{i}\), \(h_{i}\), \(\Delta x_{i}\), and \(\Delta y_{i}\) are the predicted region parameters from Equation (4): \((w_{i}, h_{i})\) determines the size of the region and \((\Delta x_{i}, \Delta y_{i})\) its location. The position embedding of each node is obtained by summing the input features over all grid locations, weighted by the kernel output:

\begin{align} \mathbf {pe}_{i}=\sum _{p_{x}=1}^{J} \sum _{p_{y}=1}^{I} \mathcal {K}^{i}\left(p_{x}, p_{y}\right) \mathbf {X}^f_{p_{x}, p_{y}} \in \mathbb {R}^{C}, \forall i \in [1, N_r], \end{align}

(8)

where \(\mathbf {pe}_{i}\) denotes the position embedding of the i-th semantic node.
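Equations (5) through (8) can be sketched in plain Python as follows. The function names, the nested-list layout `features[c][row][col]`, and the toy \(3 \times 3\) grid are assumptions made for illustration. Grid positions are 1-indexed to match the summation bounds in Equation (8).

```python
def kernel_1d(center, size, p):
    """Triangular kernel max(0, size - |center - p|) along one axis."""
    return max(0.0, size - abs(center - p))


def region_kernel(dx, dy, w, h, n_rows, n_cols):
    """Separable kernel over the grid, stored as kernel[py - 1][px - 1]."""
    return [
        [kernel_1d(dx, w, px) * kernel_1d(dy, h, py) for px in range(1, n_cols + 1)]
        for py in range(1, n_rows + 1)
    ]


def position_embedding(kernel, features):
    """Kernel-weighted sum of features[c][row][col], one value per channel."""
    return [
        sum(k * x for k_row, x_row in zip(kernel, channel) for k, x in zip(k_row, x_row))
        for channel in features
    ]


# Toy 3 x 3 map with C = 2 channels.
features = [
    [[1.0] * 3 for _ in range(3)],                          # constant channel
    [[float(r * 3 + c) for c in range(3)] for r in range(3)],  # linear channel
]
# One region centered at (2, 2) with width and height 2.
kernel = region_kernel(dx=2, dy=2, w=2, h=2, n_rows=3, n_cols=3)
pe = position_embedding(kernel, features)
```

Because the kernel decays linearly with distance from the region center, cells near the center contribute most to the embedding, and the parameters \((\Delta x, \Delta y, w, h)\) enter the sum through simple piecewise-linear functions.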

4.1.2 Graph Processor.

The goal of the graph processor is to map the semantic region nodes to the input city map. It achieves this by a general recurrent STG processing procedure [32]. At each timestep t, each pair of nodes exchanges messages as

\begin{align} \mathbf {pe}_{i, t}=\sum _{j=1}^{N_r} \kappa \big (\mathbf {pe}_{j, t}, \mathbf {pe}_{i, t}\big) \operatorname{MLP}\big (\big [\mathbf {pe}_{j, t} ; \mathbf {pe}_{i, t}\big ]\big) \in \mathbb {R}^{C}, \end{align}

(9)

where \(\kappa (\cdot ,\cdot)\) indicates the dot product attention operation and MLP is a three-layer perceptron. To incorporate the temporal information, the feature of each node is updated with a shared GRU across timesteps. We remap the semantic region nodes to the city map by multiplying the nodes’ features and the kernel function output:

\begin{align} \mathbf {X^{\prime }}_{t, p_{x}, p_{y}}=\sum _{i=1}^{N_r} \mathcal {K}_{t}^{i}\left(p_{x}, p_{y}\right) \hat{\mathbf {pe}}_{i, t} \in \mathbb {R}^{C}, \end{align}

(10)

where \(\hat{\mathbf {pe}}_{i, t}\) denotes the feature of the i-th node after the GRU update at timestep t. The final input feature \(\mathbf {X}\) is generated as

\begin{align} \mathbf {X} = \mathbf {X}^f \oplus \mathbf {X}^{\prime } \in \mathbb {R}^{\tau _{in}^{\prime } \times 2 \times I \times J}, \end{align}

(11)

where \(\oplus\) is element-wise addition and \(\mathbf {X}\) is taken as the node representation of the graph-based model.

4.2 Spatio-Temporal Adjacency Matrix Generator

It is crucial to accurately model the transitions in urban traffic flow. Currently, researchers typically create traffic flow graphs by rasterizing urban maps and connecting geographically adjacent regions [37, 61], or they use road networks as a reference [42]. However, these methods leave room for improvement in three aspects. First, most studies only consider static spatial distances and neglect the temporal aspect of learning urban flow transitions. Second, external factors that directly impact traffic transitions, such as extreme weather conditions or holidays, should be considered during modeling. Last, to accurately predict future urban flows, models must generalize well to traffic patterns that are unseen in their training dataset.

To address these issues, we design a spatio-temporal adjacency matrix generator that produces a semantically rich adjacency matrix capturing both geographic proximity and semantic similarity. We implement it via a conditional score-based model that can effectively incorporate external factors into adjacency matrix generation. The structure of the spatio-temporal adjacency matrix generator is illustrated in Figure 4. Its inputs are the node representations \(\mathbf {X}\) and the external factors \(\mathbf {E}\). First, we construct the semantic adjacency matrix \(\mathbf {A}^{te}\). Then, a conditional score-based generative model is used to incorporate external factors and complement unseen traffic patterns simultaneously. The output of the conditional score-based generative model is denoted by \(\mathbf {A}^{\prime }\). Finally, we obtain the target spatio-temporal adjacency matrix \(\mathbf {A}^{st}\) by fusing \(\mathbf {A}^{\prime }\) and the spatial adjacency matrix \(\mathbf {A}^{sp}\). The details are organized as follows. Section 4.2.1 explains how to construct the spatial adjacency matrix, which is the basis of building the urban flow graph. Section 4.2.2 establishes the semantic edges based on the input matrix \(\mathbf {X}\). Building on these, Section 4.2.3 details the conditional score-based graph generation process that produces the final spatio-temporal semantic adjacency matrix.

Fig. 4.

4.2.1 Spatial Adjacency Matrix Construction.

We rasterize the urban map to construct the spatial adjacency matrix \(\mathbf {A}^{sp}\) that captures the connectivity between two nodes according to their geographic locations. For a pair of nodes a and b, the corresponding value of \(\mathbf {A}^{sp}\) is set as

\begin{align} \mathbf {A}_{a, b}^{sp}= {\left\lbrace \begin{array}{ll}1,& \text{ if } \left|a_i - b_i \right|\le \epsilon ^{sp} \wedge \left|a_j - b_j \right|\le \epsilon ^{sp} \\ 0,& \text{otherwise }\end{array}\right.}, \end{align}

(12)

where \((a_i,a_j)\) and \((b_i,b_j)\) are the coordinates of nodes a and b in the grid-like urban map, respectively, and \(\epsilon ^{sp}\) is the hyperparameter to control the sparsity of \(\mathbf {A}^{sp}\).
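As an illustration of Equation (12), the following sketch builds \(\mathbf {A}^{sp}\) for a small grid with NumPy. The function name `spatial_adjacency`, the row-major flattening of grid cells into node indices, and the 4 × 4 example grid are assumptions made for illustration; they are not taken from the DiffUFP implementation.

```python
import numpy as np

def spatial_adjacency(n_rows: int, n_cols: int, eps_sp: float) -> np.ndarray:
    """Binary adjacency between grid cells, following Equation (12)."""
    # Row and column coordinate of every cell (cells flattened in row-major order).
    rows, cols = np.divmod(np.arange(n_rows * n_cols), n_cols)
    # Pairwise absolute coordinate differences via broadcasting.
    d_row = np.abs(rows[:, None] - rows[None, :])
    d_col = np.abs(cols[:, None] - cols[None, :])
    # Two cells are connected when both differences are within eps_sp.
    return ((d_row <= eps_sp) & (d_col <= eps_sp)).astype(np.int8)

A_sp = spatial_adjacency(4, 4, eps_sp=1)
```

Under Equation (12) every cell is also connected to itself, so with \(\epsilon ^{sp}=1\) a corner cell of the 4 × 4 grid has four nonzero entries and an interior cell has nine. A larger \(\epsilon ^{sp}\) yields a denser matrix.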

4.2.2 Semantic Adjacency Matrix Construction.

In addition to the geographical neighborhood, the semantic neighborhood also influences traffic flow prediction. The semantic neighborhood here refers to the nodes with similar traffic patterns. For instance, regions around shopping malls share similar inflow and outflow patterns regardless of geographical distance. Although other traffic-related information such as the user trajectory would also contribute to traffic patterns, in the case of UFP we can only access the urban flow due to privacy policy. In this work, we propose a semantic adjacency matrix to capture this kind of semantic neighborhood.

For a given node, the traffic pattern can typically be depicted by the inflow and outflow sequences represented as time series. Therefore, traffic similarity can be measured by the inflow and outflow similarity between two nodes. Here we use the Dynamic Time Warping (DTW) [1] algorithm to calculate the similarities of flows (inflow and outflow) in different nodes.

In this way, DiffUFP can obtain more meaningful neighbors, enriching the graph representation of traffic flow. It is worth noting that STGODE [10] first proposed semantic neighbors. However, STGODE only considered the sum of inflow and outflow, which is coarse grained. The semantic adjacency matrix at time t, \(\mathbf {A}_{t}^{\it te}\), is constructed as follows:

\begin{equation} \mathbf {A}^{\it te}_{t,i,j} = {\left\lbrace \begin{array}{ll}1, & \operatorname{DTW}\left(S_{t,i,j}^{\it in}, S_{t,i,j}^{\it out}\right)\lt \epsilon ^{te} \\ 0, & \text{ otherwise } \end{array}\right.}, \end{equation}

(13)

where \(\epsilon ^{\it te}\) is the threshold that controls the sparsity of the adjacency matrix, \(S_{t,i,j}^{\it in}\) and \(S_{t,i,j}^{\it out}\) denote the inflow and outflow sequences of cell region \(r_{i,j}\) extracted from \(\mathbf {X}\) around time t, and \(\operatorname{DTW}(\cdot ,\cdot)\) denotes the DTW distance between two sequences. Note that, unlike the static spatial adjacency matrix \(\mathbf {A}^{sp}\), \(\mathbf {A}^{\it te} \in \mathbb {R}^{\tau _{in}^{\prime } \times I \times J}\) is time varying.
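To make the DTW thresholding concrete, the sketch below implements the classic dynamic-programming DTW distance and applies a threshold to it. Several choices here are assumptions for illustration only: the absolute-difference local cost, the helper names `dtw_distance` and `semantic_adjacency`, and in particular the pairwise rule that sums the inflow and outflow DTW distances between two nodes. The text states that inflow and outflow similarity between nodes is measured with DTW but does not spell out this exact combination.

```python
import numpy as np

def dtw_distance(s: np.ndarray, q: np.ndarray) -> float:
    """Dynamic Time Warping distance with absolute-difference local cost."""
    n, m = len(s), len(q)
    D = np.full((n + 1, m + 1), np.inf)   # D[a, b]: best cost aligning s[:a] with q[:b]
    D[0, 0] = 0.0
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            cost = abs(s[a - 1] - q[b - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            D[a, b] = cost + min(D[a - 1, b], D[a, b - 1], D[a - 1, b - 1])
    return float(D[n, m])

def semantic_adjacency(inflow: np.ndarray, outflow: np.ndarray, eps_te: float) -> np.ndarray:
    """inflow, outflow: shape (num_nodes, window). Hypothetical pairwise rule."""
    n = inflow.shape[0]
    A = np.zeros((n, n), dtype=np.int8)
    for i in range(n):
        for j in range(n):
            # Assumed combination: sum of inflow and outflow DTW distances.
            d = dtw_distance(inflow[i], inflow[j]) + dtw_distance(outflow[i], outflow[j])
            A[i, j] = int(d < eps_te)
    return A

# Nodes 0 and 1 follow the same shape with a time shift; node 2 is unrelated.
inflow = np.array([[1, 2, 3, 2], [1, 1, 2, 3], [9, 9, 9, 9]], dtype=float)
outflow = inflow.copy()
A_te = semantic_adjacency(inflow, outflow, eps_te=3.0)
```

Because DTW allows elastic alignment, the time-shifted nodes 0 and 1 end up connected while node 2 stays isolated from them, regardless of how far apart the cells are on the grid.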

4.2.3 Conditional Score-Based Graph Generation.

Two things are essential when forecasting urban flows. The first is the external factors such as weather and special events, and the other is the unseen traffic patterns.

External factors impact the mobility patterns of urban crowds and hence the transitions of urban flow. However, most existing approaches incorporate external factors in the same way as they embed urban flows, or merely through several feedforward layers. For example, they fuse the learned embeddings of external factors with the representation of the traffic flow graph via an addition operation or FCNs [26, 59, 61]. These methods are straightforward but do not consider the fine-grained relations between external factors and urban flow. Here we argue that the external effect should be considered when constructing the urban flow graph rather than only in the final representation learning. Meanwhile, fine-grained traffic patterns would benefit UFP but are usually absent from the data. On the one hand, the trajectory sampling frequency is too low to derive meaningful patterns in urban data. On the other hand, the inflow and outflow in a region are estimated over a period rather than in real time. Therefore, the urban flow graph derived from them is coarse grained.

In this work, we leverage the score-based generative model to address the preceding concerns simultaneously. As a powerful class of probabilistic generative models, score-based models have been used to enhance graph-based neural networks and have achieved promising performance. For example, Niu et al. [33] propose a permutation equivariant GNN to model the score function of the input graph, which provides a desirable inductive bias and achieves high generation quality. Inspired by its success, we employ a denoising diffusion framework to learn the gradient of the data distribution of the graphs and generate a more expressive adjacency matrix in a conditional manner. Unlike Niu et al. [33], we need to generate an unweighted directed graph in the latent semantic space. Moreover, to take full advantage of external factors, we have to generate the semantic adjacency matrix with external factors as the condition.

Our conditional generation model can be formalized as estimating \(p(\mathbf {A}_t \mid \mathbf {e}_t)\); that is, we generate the semantic adjacency matrix at time t, \(\mathbf {A}_t\), conditioned on the external factors \(\mathbf {e}_t\). Specifically, we perturb \(\mathbf {A}_t\) and \(\mathbf {e}_t\) with white noise of varying scales to meet the requirement that the probability density function be non-zero everywhere. The noise is assumed to be sufficiently small so that \(p_\sigma (\tilde{\mathbf {A}}_t) \approx p(\mathbf {A}_t)\) and \(p_\tau (\tilde{\mathbf {e}}_t) \approx p(\mathbf {e}_t)\). According to Bayes’ theorem,

\begin{align} p_{\sigma ,\tau }(\tilde{\mathbf {A}}_t \mid \tilde{\mathbf {e}}_t) = p_{\sigma ,\tau }(\tilde{\mathbf {e}}_t \mid \tilde{\mathbf {A}}_t) p_\sigma (\tilde{\mathbf {A}}_t) / p_\tau (\tilde{\mathbf {e}}_t). \end{align}

(14)

Correspondingly, we can decompose the conditional score function into a mixture of scores by taking the log-gradient on both sides of Equation (14):

\begin{align} \begin{aligned}\nabla _{\tilde{\mathbf {A}}} \log p_{\sigma , \tau }(\tilde{\mathbf {A}}_t \mid \tilde{\mathbf {e}}_t)=&\nabla _{\tilde{\mathbf {A}}} \log p_{\sigma , \tau }(\tilde{\mathbf {e}}_t \mid \tilde{\mathbf {A}}_t)+\nabla _{\tilde{\mathbf {A}}} \log p_{\sigma }(\tilde{\mathbf {A}}_t) -\underbrace{\nabla _{\tilde{\mathbf {A}}} \log p_{\tau }(\tilde{\mathbf {e}}_t)}_{=0}, \end{aligned} \end{align}

(15)

where \(\nabla _{\tilde{\mathbf {A}}} \log p_{\sigma , \tau }(\tilde{\mathbf {A}}_t \mid \tilde{\mathbf {e}}_t)\) is the posterior score, \(\nabla _{\tilde{\mathbf {A}}} \log p_{\sigma , \tau }(\tilde{\mathbf {e}}_t \mid \tilde{\mathbf {A}}_t)\) is the likelihood score, and \(\nabla _{\tilde{\mathbf {A}}} \log p_{\sigma }(\tilde{\mathbf {A}}_t)\) is the prior score. The last term vanishes because \(p_{\tau }(\tilde{\mathbf {e}}_t)\) does not depend on \(\tilde{\mathbf {A}}_t\). A conditional score model can therefore be obtained by combining the log-gradient of a differentiable classifier \(p(\tilde{\mathbf {e}}_t \mid \tilde{\mathbf {A}}_t; \phi)\) with a prior score model \(\mathbf {s}(\tilde{\mathbf {A}}_t; \theta)\):

\begin{align} \nabla _{\tilde{\mathbf {A}}} \log p(\tilde{\mathbf {A}}_t \mid \tilde{\mathbf {e}}_t ; \theta , \phi)=\nabla _{\tilde{\mathbf {A}}} \log p(\tilde{\mathbf {e}}_t \mid \tilde{\mathbf {A}}_t ; \phi)+\mathbf {s}(\tilde{\mathbf {A}}_t ; \theta). \end{align}

(16)
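The decomposition of the posterior score into a likelihood score plus a prior score can be checked numerically on a case where all three terms have closed forms. The sketch below is a toy scalar example and not part of DiffUFP: it assumes a standard normal prior on a scalar a and a Gaussian likelihood for e centered at a with standard deviation s, for which the posterior is again Gaussian. All function names are hypothetical.

```python
import numpy as np

def prior_score(a):
    """d/da log N(a; 0, 1)."""
    return -a

def likelihood_score(a, e, s):
    """d/da log N(e; a, s^2): gradient of the likelihood with respect to a."""
    return (e - a) / s ** 2

def posterior_score(a, e, s):
    """d/da log N(a; e / (1 + s^2), s^2 / (1 + s^2)), the exact Gaussian posterior."""
    mean = e / (1.0 + s ** 2)
    var = s ** 2 / (1.0 + s ** 2)
    return -(a - mean) / var

a = np.linspace(-3.0, 3.0, 7)
e, s = 1.5, 0.8
lhs = posterior_score(a, e, s)                       # left-hand side of Equation (15)
rhs = likelihood_score(a, e, s) + prior_score(a)     # right-hand side of Equation (15)
```

The two arrays agree up to floating-point error. In the model described above, the closed-form likelihood score is replaced by the gradient of a learned classifier and the closed-form prior score by a learned score network.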

We train the classifier with the cross-entropy loss function \(L_{\mathrm{CE}}(\phi)\) and the score model with the denoising score matching loss function \(L_{\mathrm{DE}}(\theta)\):

\begin{align} &L_{\mathrm{CE}}(\phi)=\mathbb {E}_{p_{\sigma ,\tau }(\tilde{\mathbf {A}} \mid \tilde{\mathbf {e}})}[-\log p(\tilde{\mathbf {e}} \mid \tilde{\mathbf {A}}; \phi)], \end{align}

(17)

\begin{align} &L_{\mathrm{DE}}(\theta)= \frac{1}{2}\mathbb {E}_{p_{\sigma }(\tilde{\mathbf {A}}, \mathbf {A})}\left\Vert \mathbf {s}(\tilde{\mathbf {A}} ; \theta)-\nabla _{\tilde{\mathbf {A}}} \log p_{\sigma }(\tilde{\mathbf {A}} \mid \mathbf {A})\right\Vert ^{2}. \end{align}

(18)

The total training loss function of the conditional adjacency matrix generation is as follows:

\begin{align} L_{\mathrm{Total}}=L_{\mathrm{DE}}(\theta)+\lambda L_{\mathrm{CE}}(\phi), \end{align}

(19)

where \(\lambda \gt 0\) is a balance coefficient. The detailed training process is shown in Figure 5.

Fig. 5.

For the score model, we add Gaussian noise to the temporal semantic adjacency matrix \(\mathbf {A}^{te}\) obtained in Section 4.2.2. In particular, to preserve as much semantic information as possible, the diffusion process uses \(\mathbf {A}^{te}\) before it is quantized by the threshold \(\epsilon ^{te}\). In the following, we omit the superscript of \(\mathbf {A}^{te}\) for brevity. Specifically, we choose a geometric sequence \(\lbrace \sigma _{i}\rbrace _{i=0}^{L}\) as the set of noise scales, where \(L=1,000\), \(\sigma _L = 10\), \(\sigma _0 = 0.01\), and \(\sigma _{i}=\sigma _{0}(\frac{\sigma _{L }}{\sigma _{0}})^{\frac{i}{L}}\). Let \(p_{\sigma _i}(\tilde{\mathbf {A}}_i \mid \mathbf {A})=\mathcal {N}(\tilde{\mathbf {A}}_i \mid \mathbf {A}, \sigma _{i}^{2} \mathbf {I})\), and denote the corresponding noise-perturbed data distribution as \(p_{\sigma _i}(\tilde{\mathbf {A}}_i) \triangleq \int p_{\sigma _i}(\tilde{\mathbf {A}}_i \mid \mathbf {A}) p(\mathbf {A}) \mathrm{d} \mathbf {A}\). We train a joint score network to estimate the score function of each \(p_{\sigma _i}(\tilde{\mathbf {A}}_i)\) by optimizing \(L_{\mathrm{DE}}(\theta)\) in the form

\begin{align} \frac{1}{2 L} \sum _{i=1}^{L} \mathbb {E}_{p(\mathbf {A})} \mathbb {E}_{p_{\sigma _{i}}(\tilde{\mathbf {A}}_i \mid \mathbf {A})}\left[\left\Vert \sigma _{i} \mathbf {s}_{\boldsymbol {\theta }}\Big (\tilde{\mathbf {A}}_i, \sigma _{i}\Big)+\frac{\tilde{\mathbf {A}}_i-\mathbf {A}}{\sigma _{i}}\right\Vert _{2}^{2}\right]. \end{align}

(20)
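The geometric noise schedule and the objective in Equation (20) can be sketched in a few lines of NumPy. The sketch uses the Gaussian perturbation kernel, whose score is \(-(\tilde{\mathbf {A}}_i-\mathbf {A})/\sigma _i^{2}\); weighting the squared error of Equation (18) by \(\sigma _i^{2}\) gives the residual inside Equation (20). The function names, the single noise draw per scale, the 2 × 2 toy matrix, and the two stand-in score functions are assumptions for illustration; a real score network would take their place.

```python
import numpy as np

def geometric_sigmas(sigma_0: float, sigma_L: float, L: int) -> np.ndarray:
    """Noise scales sigma_i = sigma_0 * (sigma_L / sigma_0) ** (i / L) for i = 0..L."""
    i = np.arange(L + 1)
    return sigma_0 * (sigma_L / sigma_0) ** (i / L)

def dsm_loss(score_fn, A, sigmas, rng) -> float:
    """One-sample Monte Carlo estimate of Equation (20), summing over i = 1..L."""
    total = 0.0
    for sigma in sigmas[1:]:
        # Perturb the clean matrix with Gaussian noise of scale sigma_i.
        A_tilde = A + sigma * rng.standard_normal(A.shape)
        # Residual: sigma_i * s(A_tilde, sigma_i) + (A_tilde - A) / sigma_i.
        residual = sigma * score_fn(A_tilde, sigma) + (A_tilde - A) / sigma
        total += float(np.sum(residual ** 2))
    return total / (2 * (len(sigmas) - 1))

sigmas = geometric_sigmas(0.01, 10.0, 1000)
A = np.array([[0.0, 0.3], [0.7, 1.0]])   # toy continuous adjacency values
rng = np.random.default_rng(0)

# If the data distribution were a point mass at A, the exact score of the
# perturbed distribution would be -(A_tilde - A) / sigma^2 and the loss vanishes.
oracle = lambda A_tilde, sigma: -(A_tilde - A) / sigma ** 2
zero = lambda A_tilde, sigma: np.zeros_like(A_tilde)

loss_oracle = dsm_loss(oracle, A, sigmas, rng)
loss_zero = dsm_loss(zero, A, sigmas, rng)
```

The oracle score drives the loss to numerical zero, while a score function that ignores its input leaves a strictly positive loss. The \(\sigma _i\) weighting keeps the contribution of every noise scale on a comparable order of magnitude.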

Besides, we use \(L_{\mathrm{CE}}(\phi)\) to update the classifier \(\nabla _{\tilde{\mathbf {A}}} \log p(\tilde{\mathbf {e}} \mid \tilde{\mathbf {A}} ; \phi),\) allowing it to capture the fine-grained relationships between transitions of urban flows and external factors.

Subsequently, we perform conditional sampling (outlined in Algorithm 1) based on Equation (16) to generate the semantic matrix \(\tilde{\mathbf {A}}_0\). Because the samples \(\mathbf {A}^{\text{(sample)}}\) from score-based generation are in the continuous space, we discretize the obtained continuous adjacency matrix to a binary one at the end of Langevin dynamics and acquire the enhanced temporal adjacency matrix \(\mathbf {A}^{\prime }\) by a quantization operation:

\begin{align} \mathbf {A}^{\prime }_{i,j} = \mathbb {1}_{\mathbf {A}^{\text{(sample)}}_{i,j}\gt 0.5}, \end{align}

(21)

where \(\mathbb {1}\) denotes the indicator function, which takes the value 1 if the condition holds and 0 otherwise. We acquire \(\mathbf {A}^{\prime }_{1},\mathbf {A}^{\prime }_{2},\dots ,\mathbf {A}^{\prime }_{\tau _{in}}\) and take their mean to get \(\mathbf {A}^{\prime }\). Finally, we fuse \(\mathbf {A}^{\prime }\) and the spatial adjacency matrix \(\mathbf {A}^{sp}\) to compute the final adjacency matrix \(\mathbf {A}^{st}\):

\begin{align} \mathbf {A}^{st}=\mathbf {A}^{\prime } \oplus \beta \mathbf {A}^{sp}, \end{align}

(22)

where \(\beta \gt 0\) is a balance coefficient.
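The post-processing in Equations (21) and (22) amounts to thresholding, averaging, and a weighted sum. The following sketch shows these three steps on hand-written 2 × 2 matrices; the helper names `quantize` and `fuse` and the example values are assumptions for illustration, while the threshold 0.5 and \(\beta = 0.2\) are the values stated in the text.

```python
import numpy as np

def quantize(sample: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Equation (21): entries above the threshold become 1, all others 0."""
    return (sample > threshold).astype(np.float64)

def fuse(samples, A_sp: np.ndarray, beta: float) -> np.ndarray:
    """Average the quantized per-timestep matrices, then apply Equation (22)."""
    A_prime = np.mean([quantize(s) for s in samples], axis=0)
    return A_prime + beta * A_sp

# Two continuous samples, one per input timestep (toy values).
samples = [
    np.array([[0.9, 0.2], [0.6, 0.4]]),
    np.array([[0.7, 0.8], [0.1, 0.3]]),
]
A_sp = np.ones((2, 2))
A_st = fuse(samples, A_sp, beta=0.2)
```

In this toy case the result is `[[1.2, 0.7], [0.7, 0.2]]`: entries that are active in both timesteps contribute 1, entries active in only one contribute 0.5, and the spatial matrix adds a constant 0.2 wherever two cells are geographic neighbors.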

4.3 Prediction and Loss Function

The final step of DiffUFP is to predict the future urban flow. To this end, the learned node and edge representations are fed into the STG layers. It is worth noting that, unlike STGODE [10], DiffUFP employs only one fused adjacency matrix to simultaneously capture the spatial and semantic knowledge. This property reduces the number of parameters and hence accelerates the training process. The two STG layers are followed by a max-pooling layer and an output layer that generate the prediction.

We use the Huber loss [15], which is a tradeoff between the squared error loss and the absolute error loss, as our objective function. Letting x and \(\hat{x}\) be the ground truth and the prediction, respectively, the Huber loss \(\mathcal {L}(x, \hat{x})\) is formulated as

\begin{align} \mathcal {L}(x, \hat{x})=\left\lbrace \!\!\begin{array}{lr}\frac{1}{2}(x-\hat{x})^{2} & \text{ for }|x-\hat{x}| \le \delta \\ \delta |x-\hat{x}|-\frac{1}{2} \delta ^{2}, & \text{otherwise}\end{array}\right., \end{align}

(23)

where \(\delta\) is the hyperparameter to control the model sensitivity to outliers. The detailed processing steps are summarized in Algorithm 2.
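As a concrete reading of Equation (23), the sketch below evaluates the Huber loss elementwise with NumPy. The function name `huber_loss` and the default \(\delta = 1\) are assumptions for illustration; the value of \(\delta\) used by DiffUFP is not given in this section.

```python
import numpy as np

def huber_loss(x, x_hat, delta: float = 1.0) -> np.ndarray:
    """Elementwise Huber loss from Equation (23)."""
    err = np.abs(np.asarray(x, dtype=float) - np.asarray(x_hat, dtype=float))
    quadratic = 0.5 * err ** 2                     # used when |x - x_hat| <= delta
    linear = delta * err - 0.5 * delta ** 2        # used otherwise
    return np.where(err <= delta, quadratic, linear)

losses = huber_loss([0.0, 0.0, 0.0], [0.5, 1.0, 3.0], delta=1.0)
```

The three errors 0.5, 1.0, and 3.0 give losses 0.125, 0.5, and 2.5. The two branches meet at \(|x-\hat{x}| = \delta\), and beyond that point the loss grows linearly, so a single large outlier contributes far less than it would under the squared error.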

4.4 Computational Complexity

Compared with previous graph-based UFP models, DiffUFP introduces two extra components: the dynamic semantic region extractor and the spatio-temporal adjacency matrix generator.

Dynamic Semantic Region Extractor. The whole procedure of the dynamic semantic region extractor can be divided into three parts: (1) extracting the local information from inputs, (2) learning the spatio-temporal interactions of semantic regions, and (3) remapping the extracted semantic regions to the urban map. Let T represent the total number of timesteps, E denote the set of edges connecting semantic regions, and \(K(=3)\) denote the number of iterations for spatio-temporal message passing. The total time complexity is \(\mathcal {O}(T \times (2|E|) \times K + T \times N_r \times (K+1))\). Given that \(|E|\) is upper bounded by \(N_r(N_r-1)/2\), the complexity is \(\mathcal {O}(T \times {N_r}^2 \times K)\).
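As a rough illustration of the scale involved, the per-timestep operation count implied by the expression above can be evaluated for the region counts reported in Section 5.1.3 (\(N_r = 5\) for TaxiBJ and \(N_r = 7\) for BikeNYC), taking \(|E|\) at its upper bound and ignoring constant factors hidden by the \(\mathcal {O}\) notation:

\begin{align*} N_r = 5:&\quad |E| \le \tfrac{5 \cdot 4}{2} = 10, \qquad 2|E|K + N_r(K+1) \le 2 \cdot 10 \cdot 3 + 5 \cdot 4 = 80, \\ N_r = 7:&\quad |E| \le \tfrac{7 \cdot 6}{2} = 21, \qquad 2|E|K + N_r(K+1) \le 2 \cdot 21 \cdot 3 + 7 \cdot 4 = 154. \end{align*}

In both cases the edge term dominates, which is consistent with the quadratic dependence on \(N_r\) in the simplified bound.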

Spatio-Temporal Adjacency Matrix Generator. It is usually slow to generate a sample from the Markov chain of the reverse process, as the number of diffusion steps T can be large [39]. To speed up the training and sampling process, we use the denoising score-based model to generate the adjacency matrix conditionally. Our score-based graph generation is therefore much faster than directly generating the time series, which would treat UFP as a multivariate time series forecasting task. The computational complexity is reduced from \(N^2 \times T \times 2\) to \(N^2\).

5 Experiments

In this section, we conduct extensive experiments on two real-world urban flow datasets to evaluate DiffUFP. Through the experiments, we want to answer the following research questions:

—

RQ1: Can DiffUFP outperform the state-of-the-art methods in UFP?

—

RQ2: How do the critical designs (i.e., the spatio-temporal adjacency matrix generator and dynamic semantic region extractor) improve the prediction performance of DiffUFP?

—

RQ3: How do the important parameters of DiffUFP influence the performance of DiffUFP?

—

RQ4: Can the proposed dynamic semantic region extractor extract meaningful semantic regions?

5.1 Experimental Settings

5.1.1 Datasets.

The statistics of the two traffic flow datasets, described next, are summarized in Table 2:


Dataset	TaxiBJ	BikeNYC
Data type	Taxi GPS	Bike rent
Location	Beijing	New York
Start time	7/1/2013	4/1/2014
End time	4/10/2016	12/30/2014
Time interval	0.5h	1h
Holidays	41	20
Weather	15 types	\(\backslash\)
Temperature/\({ }^{\circ } \mathrm{C}\)	[–24.6, 41.0]	\(\backslash\)
Wind speed/mph	[0, 48.6]	\(\backslash\)
Latitude	[39.82, 39.99]	[40.65, 40.81]
Longitude	[116.26, 116.49]	[–74.07, –74.00]

Table 2. Statistics of the Datasets

—

TaxiBJ: This data contains the trajectories of taxi GPS in Beijing and is divided into four time periods: July 1 to October 30, 2013, March 1 to June 30, 2014, March 1 to June 30, 2015, and November 1, 2015 to April 10, 2016. Each trajectory is mapped into 32 \(\times\) 32 grid-based geographical regions.

—

BikeNYC:¹ This dataset was collected from the NYC Bike system from April 1, 2014 to September 30, 2014. Each trajectory is projected into a 16 \(\times\) 8 grid map.

Data Preprocessing. For a fair comparison, we use the same data preprocessing as the baseline approaches [42, 54, 59, 61]. Specifically, we remove the data that do not have the full 48 and 24 timestamps in TaxiBJ and BikeNYC, respectively. To reflect the traffic dynamics along different temporal dimensions, traffic flow is partitioned into three parts, namely the closeness, period, and trend of the crowd flow, following [59]. The external factors can typically be classified into continuous features (e.g., temperature) and categorical features (e.g., weather). To combine the two kinds of features, we embed the categorical features into low-dimensional vectors. The final external features are formed by concatenating the two types of external features. We scale the traffic volumes and numerical external factors into the range [–1,1] using Min-Max normalization to eliminate the difference between the measurement units of different features. After that, the processed external features are fed into the adjacency matrix generator to learn the semantic adjacency matrix in a conditional fashion. The two datasets are divided into training, validation, and testing sets with a ratio of 6:2:2. During testing, we rescale the predicted flow back to the original range.
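The scaling and splitting steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the split is taken in chronological order, and the minimum and maximum are computed on the full array, since the text does not state which portion of the data they are computed on.

```python
import numpy as np

def minmax_fit(x: np.ndarray):
    """Return the minimum and maximum used for scaling."""
    return float(np.min(x)), float(np.max(x))

def minmax_scale(x: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Map values from [lo, hi] to [-1, 1]."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def minmax_inverse(z: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Map scaled values in [-1, 1] back to the original range."""
    return (z + 1.0) / 2.0 * (hi - lo) + lo

def split_6_2_2(n: int):
    """Index slices for a 6:2:2 train/validation/test split (assumed chronological)."""
    n_train = n * 6 // 10
    n_val = n * 2 // 10
    return slice(0, n_train), slice(n_train, n_train + n_val), slice(n_train + n_val, n)

flows = np.array([0.0, 5.0, 10.0, 20.0, 15.0, 10.0, 5.0, 0.0, 10.0, 20.0])
lo, hi = minmax_fit(flows)
scaled = minmax_scale(flows, lo, hi)
train_idx, val_idx, test_idx = split_6_2_2(len(flows))
restored = minmax_inverse(scaled[test_idx], lo, hi)   # rescaling applied at test time
```

Applying `minmax_inverse` to model outputs before computing the error metrics keeps the reported errors in the original flow units.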

5.1.2 Baselines.

We compare DiffUFP to the following 15 UFP models:

—

HA models the urban flows as a seasonal process (i.e., 1 day) and takes the average of the previous seasons as the prediction results.

—

ARIMA [35] is a classical time series model that combines autoregressive (AR) and moving average (MA) to perform prediction.

—

RNN [28] utilizes RNNs to capture the temporal dependencies to predict the sequential data.

—

Seq2Seq [43] is an encoder-decoder model that uses two stacked GRU layers to implement a sequence-to-sequence network. Node features are embedded by an FCN and fused with the outputs of the decoder.

—

DeepST [60] employs CNNs to encode the spatial interactions between grid regions on the citywide traffic flow map.

—

ST-ResNet [59] leverages the residual connection to alleviate the overfitting problem in spatio-temporal prediction.

—

DCRNN [22] uses diffusion RNNs to capture the dynamic spatio-temporal dependencies.

—

DMVST-Net [55] combines the convolutional recurrent networks and the graph embedding method to extract spatio-temporal signals and make flow predictions.

—

STDN [54] exploits the periodically shifted attention to learn the transition regularities of traffic flow.

—

MDL [61] is a multi-task deep learning framework that simultaneously predicts the node flow and edge flow.

—

ST-GCN [56] integrates a GCN and convolution sequence model to capture the spatial and temporal correlations.

—

ST-MGCN [11] employs a multi-modal GCN to learn region-wise interactions.

—

GMAN [63] integrates the graph multi-attention into an encoder-decoder traffic prediction framework.

—

ST-MetaNet [36] utilizes the meta-learning method to perform knowledge transfer across urban flows with a recurrent graph attentive network.

—

ST-GDN [62] employs a multi-scale attention network to capture multi-level temporal dynamics on the hierarchical GNNs.

5.1.3 Hyperparameters.

The hyperparameter configurations of DiffUFP are as follows:

—

The hyperparameters of the dynamic semantic region extractor: The number of semantic regions \(N_r\) is 5 for TaxiBJ and 7 for BikeNYC. The number of channels \(C = 1\), as the outflow and inflow are two views of a semantic region. The dimensions of the hidden states of the GRUs used in the region extractor and the graph processor are selected by grid search over \(\lbrace 32,64,128\rbrace\).

—

The hyperparameter of the spatio-temporal adjacency matrix generator: The thresholds \(\epsilon ^{sp}\) and \(\epsilon ^{te}\) of \(\mathbf {A}^{sp}\) and \(\mathbf {A}^{te}\) are set to 2.0 and 0.5, respectively. Besides, we set \(\sigma _1 = 0.01\) and \(\sigma _L = 1\). During sampling, we conduct 100 iterations, and the balance coefficient \(\beta\) is set to 0.2.

—

The structure of the STG: The hidden dimensions of TCN blocks are set to 64,32,64. Each STG layer contains two blocks. The batch sizes of training, validating, and testing are all 16.

5.1.4 Evaluation Protocols.

To evaluate the performance of DiffUFP, we adopt three widely used metrics: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE):

\begin{align*} \mathrm{RMSE} &= \sqrt {\frac{1}{M} \sum _{i=1}^{M}\left(\widehat{\mathbf {X}}_{i}-\mathbf {X}_{i}\right)^{2}}, \\ \mathrm{MAE} &= \frac{1}{M} \sum _{i=1}^{M}\left|\widehat{\mathbf {X}}_{i}-\mathbf {X}_{i}\right|, \\ \mathrm{MAPE} &= \frac{1}{M} \sum _{i=1}^{M}\left|\frac{\widehat{\mathbf {X}}_{i}-\mathbf {X}_{i}}{\mathbf {X}_{i}}\right|, \end{align*}

where M is the total number of predicted traffic flow values, and \(\widehat{\mathbf {X}}_{i}\) and \(\mathbf {X}_{i}\) denote the i-th predicted flow and its ground truth, respectively. For all three metrics, a smaller value means higher accuracy.
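For reference, the three metrics can be computed with NumPy as in the sketch below. The function names and the example values are illustrative assumptions. The sketch also assumes that every ground-truth value is non-zero, because MAPE is undefined otherwise; how zero-flow cells are handled in the reported experiments is not stated here.

```python
import numpy as np

def rmse(pred: np.ndarray, true: np.ndarray) -> float:
    """Root Mean Squared Error over all predicted values."""
    return float(np.sqrt(np.mean((pred - true) ** 2)))

def mae(pred: np.ndarray, true: np.ndarray) -> float:
    """Mean Absolute Error over all predicted values."""
    return float(np.mean(np.abs(pred - true)))

def mape(pred: np.ndarray, true: np.ndarray) -> float:
    """Mean Absolute Percentage Error as a fraction; assumes no zero ground truth."""
    return float(np.mean(np.abs((pred - true) / true)))

true = np.array([10.0, 20.0, 40.0])
pred = np.array([12.0, 18.0, 40.0])
scores = {"RMSE": rmse(pred, true), "MAE": mae(pred, true), "MAPE(%)": 100 * mape(pred, true)}
```

RMSE penalizes large individual errors more heavily than MAE, while MAPE normalizes each error by the true flow, so low-volume regions weigh as much as busy ones. Multiplying MAPE by 100 gives the percentage form used in the result tables.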

5.2 Performance Comparison (RQ1)

The performance of DiffUFP and baseline methods on two datasets is reported in Table 3 and Table 4. We test all of the baseline models five times and report the average results as “mean \(\pm\) standard deviation.” By examining the results, we have the following observations.


Method	30min			1h			2h
Method	MAE	RMSE	MAPE(%)	MAE	RMSE	MAPE(%)	MAE	RMSE	MAPE(%)
HA	26.16 \(\pm\) 0.00	56.47 \(\pm\) 0.00	34.21 \(\pm\) 0.00	26.16 \(\pm\) 0.00	56.47 \(\pm\) 0.00	34.21 \(\pm\) 0.00	26.16 \(\pm\) 0.00	56.47 \(\pm\) 0.00	34.21 \(\pm\) 0.00
ARIMA	24.24 \(\pm\) 0.00	41.75 \(\pm\) 0.00	32.67 \(\pm\) 0.00	27.12 \(\pm\) 0.00	58.27 \(\pm\) 0.00	32.67 \(\pm\) 0.00	41.21 \(\pm\) 0.00	76.98 \(\pm\) 0.00	32.67 \(\pm\) 0.00
RNN	17.85 \(\pm\) 0.17	33.90 \(\pm\) 0.21	24.79 \(\pm\) 0.23	19.63 \(\pm\) 0.10	40.42 \(\pm\) 0.12	27.86 \(\pm\) 0.13	31.45 \(\pm\) 0.11	68.21 \(\pm\) 0.19	29.80 \(\pm\) 0.23
Seq2Seq	17.18 \(\pm\) 0.12	30.16 \(\pm\) 0.10	24.54 \(\pm\) 0.16	17.82 \(\pm\) 0.05	35.12 \(\pm\) 0.07	27.64 \(\pm\) 0.07	22.09 \(\pm\) 0.06	62.75 \(\pm\) 0.14	29.54 \(\pm\) 0.17
DeepST	17.12 \(\pm\) 0.45	30.44 \(\pm\) 0.52	24.29 \(\pm\) 0.42	17.44 \(\pm\) 0.73	36.52 \(\pm\) 0.78	27.23 \(\pm\) 0.83	19.85 \(\pm\) 0.83	44.81 \(\pm\) 0.88	29.27 \(\pm\) 0.81
ST-ResNet	16.83 \(\pm\) 0.34	29.23 \(\pm\) 0.49	23.85 \(\pm\) 0.43	16.88 \(\pm\) 0.50	32.90 \(\pm\) 0.69	26.96 \(\pm\) 0.75	18.94 \(\pm\) 0.57	36.43 \(\pm\) 0.72	29.13 \(\pm\) 0.76
DCRNN	16.40 \(\pm\) 0.04	28.38 \(\pm\) 0.09	23.61 \(\pm\) 0.10	15.84 \(\pm\) 0.05	32.34 \(\pm\) 0.08	26.59 \(\pm\) 0.11	18.26 \(\pm\) 0.15	36.91 \(\pm\) 0.17	28.84 \(\pm\) 0.23
DMVST-Net	15.88 \(\pm\) 0.31	28.41 \(\pm\) 0.45	23.05 \(\pm\) 0.39	16.63 \(\pm\) 0.44	32.18 \(\pm\) 0.59	26.28 \(\pm\) 0.53	18.63 \(\pm\) 0.48	35.92 \(\pm\) 0.64	28.81 \(\pm\) 0.70
STDN	16.65 \(\pm\) 1.89	28.93 \(\pm\) 2.46	23.90 \(\pm\) 3.04	21.52 \(\pm\) 3.24	34.48 \(\pm\) 3.35	26.42 \(\pm\) 3.43	24.72 \(\pm\) 3.36	37.26 \(\pm\) 3.43	29.06 \(\pm\) 3.56
MDL	15.45 \(\pm\) 0.20	27.33 \(\pm\) 0.53	22.86 \(\pm\) 0.40	16.98 \(\pm\) 0.24	32.09 \(\pm\) 0.54	25.93 \(\pm\) 0.47	18.42 \(\pm\) 0.39	36.49 \(\pm\) 0.49	28.35 \(\pm\) 0.44
ST-GCN	14.73 \(\pm\) 0.13	26.79 \(\pm\) 0.31	22.34 \(\pm\) 0.26	15.56 \(\pm\) 0.14	31.74 \(\pm\) 0.35	25.85 \(\pm\) 0.36	17.96 \(\pm\) 0.27	36.31 \(\pm\) 0.31	28.26 \(\pm\) 0.32
ST-MGCN	14.01 \(\pm\) 0.18	25.13 \(\pm\) 0.20	21.53 \(\pm\) 0.23	15.08 \(\pm\) 0.11	30.72 \(\pm\) 0.23	25.17 \(\pm\) 0.25	17.18 \(\pm\) 0.22	35.03 \(\pm\) 0.34	27.75 \(\pm\) 0.39
GMAN	13.53 \(\pm\) 0.14	24.79 \(\pm\) 0.32	21.04 \(\pm\) 0.30	15.04 \(\pm\) 0.18	30.65 \(\pm\) 0.42	25.01 \(\pm\) 0.36	17.04 \(\pm\) 0.19	34.82 \(\pm\) 0.61	27.53 \(\pm\) 0.52
ST-MetaNet+	12.80 \(\pm\) 0.15	23.65 \(\pm\) 0.28	20.73 \(\pm\) 0.25	14.72 \(\pm\) 0.18	29.70 \(\pm\) 0.40	24.88 \(\pm\) 0.32	16.85 \(\pm\) 0.16	33.93 \(\pm\) 0.35	27.48 \(\pm\) 0.38
ST-GDN	12.23 \(\pm\) 0.18	23.08 \(\pm\) 0.20	20.28 \(\pm\) 0.18	14.65 \(\pm\) 0.19	29.43 \(\pm\) 0.44	24.63 \(\pm\) 0.33	16.64 \(\pm\) 0.21	33.51 \(\pm\) 0.47	27.01 \(\pm\) 0.45
DiffUFP	11.97 \(\pm\) 0.13	22.31 \(\pm\) 0.19	20.05 \(\pm\) 0.14	14.21 \(\pm\) 0.12	28.90 \(\pm\) 0.18	24.29 \(\pm\) 0.20	16.02 \(\pm\) 0.19	32.72 \(\pm\) 0.24	26.42 \(\pm\) 0.30

Table 3. Performance Comparisons on the TaxiBJ Dataset


Method	1h			2h			3h
Method	MAE	RMSE	MAPE(%)	MAE	RMSE	MAPE(%)	MAE	RMSE	MAPE(%)
HA	9.54 \(\pm\) 0.00	19.02 \(\pm\) 0.00	36.34 \(\pm\) 0.00	9.54 \(\pm\) 0.00	19.02 \(\pm\) 0.00	36.34 \(\pm\) 0.00	9.54 \(\pm\) 0.00	19.02 \(\pm\) 0.00	36.34 \(\pm\) 0.00
ARIMA	4.89 \(\pm\) 0.00	11.10 \(\pm\) 0.00	34.52 \(\pm\) 0.00	7.83 \(\pm\) 0.00	16.45 \(\pm\) 0.00	34.52 \(\pm\) 0.00	8.82 \(\pm\) 0.00	17.63 \(\pm\) 0.00	34.52 \(\pm\) 0.00
RNN	4.50 \(\pm\) 0.12	9.14 \(\pm\) 0.16	28.56 \(\pm\) 0.31	7.51 \(\pm\) 0.15	14.93 \(\pm\) 0.23	31.23 \(\pm\) 0.40	8.04 \(\pm\) 0.17	14.32 \(\pm\) 0.35	33.24 \(\pm\) 0.66
Seq2Seq	4.13 \(\pm\) 0.09	8.86 \(\pm\) 0.12	26.42 \(\pm\) 0.19	7.24 \(\pm\) 0.13	12.96 \(\pm\) 0.17	31.18 \(\pm\) 0.22	7.62 \(\pm\) 0.41	13.80 \(\pm\) 0.55	32.67 \(\pm\) 0.30
DeepST	3.64 \(\pm\) 0.26	8.05 \(\pm\) 0.29	25.31 \(\pm\) 0.38	7.03 \(\pm\) 0.35	12.24 \(\pm\) 0.33	31.09 \(\pm\) 0.46	7.43 \(\pm\) 0.43	13.29 \(\pm\) 0.81	32.51 \(\pm\) 0.78
ST-ResNet	3.35 \(\pm\) 0.14	6.56 \(\pm\) 0.17	24.18 \(\pm\) 0.36	6.84 \(\pm\) 0.18	11.80 \(\pm\) 0.21	30.02 \(\pm\) 0.44	7.27 \(\pm\) 0.21	12.91 \(\pm\) 0.29	32.03 \(\pm\) 0.72
DCRNN	3.40 \(\pm\) 0.02	4.61 \(\pm\) 0.03	24.02 \(\pm\) 0.20	6.58 \(\pm\) 0.05	10.42 \(\pm\) 0.10	29.43 \(\pm\) 0.25	7.32 \(\pm\) 0.08	12.84 \(\pm\) 0.12	31.98 \(\pm\) 0.36
DMVST-Net	3.12 \(\pm\) 0.06	6.10 \(\pm\) 0.14	23.59 \(\pm\) 0.26	6.62 \(\pm\) 0.10	10.93 \(\pm\) 0.18	29.24 \(\pm\) 0.28	7.14 \(\pm\) 0.13	12.42 \(\pm\) 0.19	31.87 \(\pm\) 0.42
STDN	3.10 \(\pm\) 0.48	5.85 \(\pm\) 0.63	23.37 \(\pm\) 2.48	6.63 \(\pm\) 0.62	11.25 \(\pm\) 0.82	28.78 \(\pm\) 2.67	7.13 \(\pm\) 0.78	13.09 \(\pm\) 0.89	31.23 \(\pm\) 3.14
MDL	3.34 \(\pm\) 0.03	5.14 \(\pm\) 0.05	23.11 \(\pm\) 0.42	6.71 \(\pm\) 0.11	11.41 \(\pm\) 0.13	28.42 \(\pm\) 0.49	7.50 \(\pm\) 0.16	13.06 \(\pm\) 0.27	30.82 \(\pm\) 0.57
ST-GCN	3.28 \(\pm\) 0.04	4.76 \(\pm\) 0.05	22.54 \(\pm\) 0.25	6.60 \(\pm\) 0.07	10.38 \(\pm\) 0.14	27.60 \(\pm\) 0.31	7.26 \(\pm\) 0.12	12.68 \(\pm\) 0.20	30.55 \(\pm\) 0.50
ST-MGCN	3.14 \(\pm\) 0.03	4.51 \(\pm\) 0.05	22.41 \(\pm\) 0.29	6.24 \(\pm\) 0.06	9.63 \(\pm\) 0.08	27.21 \(\pm\) 0.36	6.89 \(\pm\) 0.08	11.71 \(\pm\) 0.10	30.12 \(\pm\) 0.48
GMAN	3.08 \(\pm\) 0.03	4.32 \(\pm\) 0.05	21.93 \(\pm\) 0.24	6.14 \(\pm\) 0.05	9.42 \(\pm\) 0.11	26.89 \(\pm\) 0.31	6.57 \(\pm\) 0.10	11.58 \(\pm\) 0.17	30.06 \(\pm\) 0.46
ST-MetaNet+	3.04 \(\pm\) 0.04	4.46 \(\pm\) 0.09	21.99 \(\pm\) 0.25	5.98 \(\pm\) 0.08	9.21 \(\pm\) 0.13	26.93 \(\pm\) 0.30	6.54 \(\pm\) 0.10	11.34 \(\pm\) 0.15	29.65 \(\pm\) 0.42
ST-GDN	2.87 \(\pm\) 0.05	3.89 \(\pm\) 0.09	21.92 \(\pm\) 0.20	5.80 \(\pm\) 0.12	8.97 \(\pm\) 0.14	26.19 \(\pm\) 0.24	6.33 \(\pm\) 0.15	10.95 \(\pm\) 0.17	28.99 \(\pm\) 0.38
DiffUFP	2.42 \(\pm\) 0.03	3.54 \(\pm\) 0.05	20.84 \(\pm\) 0.17	5.53 \(\pm\) 0.07	8.42 \(\pm\) 0.09	25.31 \(\pm\) 0.20	5.97 \(\pm\) 0.08	9.54 \(\pm\) 0.12	27.67 \(\pm\) 0.30

Table 4. Performance Comparisons on the BikeNYC Dataset

First, statistical methods such as HA and ARIMA perform poorly, with the largest prediction errors on both datasets. This result demonstrates the high complexity of UFP, which cannot be accurately modeled by simple statistical approaches. Compared with the statistical methods, neural network based models exhibit much better performance. RNN and Seq2Seq model urban flow as sequences and utilize RNNs to capture the temporal property. However, they cannot learn the spatial dependencies that are essential in predicting urban flows. CNN-based models (e.g., DeepST, ST-ResNet) learn the spatial correlations with the convolutional operation, but they only consider the static correlations without correctly modeling urban traffic dynamics.

Second, hybrid methods (e.g., DCRNN, STDN, and MDL) introduce the attention mechanism and can capture spatial and temporal knowledge simultaneously. Therefore, these approaches usually achieve better performance than RNN-based and CNN-based models. We found that STDN is unstable, with notably large standard deviations, because its convolution kernels and long short-term models share parameters. Besides, MDL considers the grid regions of the urban flow map as nodes and the flow transitions between different areas as edges. However, it leverages a multi-task learning framework to simultaneously predict the node flow and edge flow, which may ignore subtle correlations and compromise the prediction performance. This result suggests that we need more expressive models to learn the spatial and temporal dependencies for the UFP task.

Third, GNN-based methods, including ours, exhibit the best performance due to their ability to model both long-term and short-term spatio-temporal correlations. DiffUFP achieves better performance than the other methods due to its two specially designed modules. Compared to the direct graph construction in previous studies, DiffUFP exploits a dynamic semantic region extractor to enhance the urban flow representation learning, which allows our model to discover the traffic interactions between semantic regions. Meanwhile, the spatio-temporal adjacency matrix generator enables DiffUFP to search for geographically and semantically similar neighborhoods and aggregate more related urban flows in GNNs. Notably, the margin of DiffUFP over the baselines is maintained, and in several cases widens, as the prediction horizon increases (e.g., the MAE gap to ST-GDN on TaxiBJ grows from 0.26 at 30 minutes to 0.62 at 2 hours), which suggests the strength of our method in capturing long-term flow dependencies.

5.3 Ablation Study (RQ2)

To verify the effectiveness of different modules of DiffUFP, we conduct an ablation experiment by designing four variants of DiffUFP:

—

DiffUFP w/o N removes the dynamic semantic region extractor and directly takes the grid-like regions as the nodes to construct the STGs.

—

DiffUFP w/o E removes the spatio-temporal adjacency matrix generator. Instead, we take the external information as another branch of input features and use the spatial adjacency matrix \(\mathbf {A}^{sp}\) as the final edge-wise representation, as in previous studies [62].

—

DiffUFP w/o D replaces the score-based model with a simple FCN to generate the enhanced semantic adjacency matrix \(\mathbf {A}^{\prime }\) conditioned on external factors.

—

DiffUFP w/o N&E removes both the dynamic semantic region extractor and the spatio-temporal adjacency matrix generator from DiffUFP.

The ablation results are presented in Table 5. The poor performance of DiffUFP w/o N&E confirms the significant effects of our two important designs. DiffUFP w/o E gains a remarkable improvement over DiffUFP w/o N&E, indicating that the dynamic semantic region extractor is capable of learning the underlying structure of traffic networks by extracting the semantic regions dynamically from the grid-like urban flow map. In most cases, DiffUFP w/o N is superior to DiffUFP w/o E (the exception is the 3h horizon on BikeNYC), showing that our conditional score-based adjacency matrix can help the model capture the intricate transitional patterns and learn the influence of external factors. DiffUFP w/o D outperforms DiffUFP w/o E in all cases, showing that the semantic adjacency matrix can learn the dynamic semantic pattern. Meanwhile, compared with DiffUFP, DiffUFP w/o D yields weaker performance, demonstrating the effectiveness of the score-based model in mining the intrinsic relationships between external factors and traffic flow transitions. According to the evaluations, both DiffUFP w/o N and DiffUFP w/o E achieve decent performance on multi-step predictions, which verifies the effects of the dynamic semantic region extractor and the spatio-temporal adjacency matrix generator in boosting the expressiveness of the GNN on long-term temporal horizon predictions.

Table 5. Ablation Results (RMSE) on Two Datasets

Model	1h	2h	3h
Experimental results on TaxiBJ
DiffUFP w/o N&E	31.30\(_{\pm 0.36}\)	35.52\(_{\pm 0.50}\)	37.96\(_{\pm 0.56}\)
DiffUFP w/o E	30.84\(_{\pm 0.33}\)	34.27\(_{\pm 0.50}\)	36.18\(_{\pm 0.51}\)
DiffUFP w/o N	29.93\(_{\pm 0.29}\)	33.65\(_{\pm 1.87}\)	35.64\(_{\pm 0.47}\)
DiffUFP w/o D	29.57\(_{\pm 0.31}\)	33.04\(_{\pm 0.48}\)	35.10\(_{\pm 0.45}\)
DiffUFP	28.97\(_{\pm 0.18}\)	32.71\(_{\pm 0.24}\)	34.47\(_{\pm 0.41}\)
Experimental results on BikeNYC
DiffUFP w/o N&E	4.51\(_{\pm 0.10}\)	9.70\(_{\pm 0.18}\)	12.08\(_{\pm 0.23}\)
DiffUFP w/o E	4.05\(_{\pm 0.08}\)	9.14\(_{\pm 0.15}\)	10.59\(_{\pm 0.22}\)
DiffUFP w/o N	3.88\(_{\pm 0.07}\)	8.86\(_{\pm 0.12}\)	10.87\(_{\pm 0.15}\)
DiffUFP w/o D	3.83\(_{\pm 0.09}\)	8.73\(_{\pm 0.15}\)	10.44\(_{\pm 0.17}\)
DiffUFP	3.54\(_{\pm 0.05}\)	8.42\(_{\pm 0.09}\)	9.54\(_{\pm 0.12}\)

We compare DiffUFP to its four variants and use DiffUFP w/o N&E as the base to compute the improvements of predictions on different horizons.
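The improvements over the base variant can be recomputed directly from the mean RMSE values in Table 5. The sketch below is a minimal illustration and assumes that "improvement" means the relative RMSE reduction with respect to the variant without both modules; this formula, the dictionary layout, and the function names are our assumptions for illustration and are not specified in the text. The standard deviations are ignored.

```python
# Mean RMSE values copied from Table 5 (the +/- spread is ignored here).
ABLATION_RMSE = {
    "TaxiBJ": {
        "DiffUFP w/o N&E": [31.30, 35.52, 37.96],
        "DiffUFP w/o E": [30.84, 34.27, 36.18],
        "DiffUFP w/o N": [29.93, 33.65, 35.64],
        "DiffUFP w/o D": [29.57, 33.04, 35.10],
        "DiffUFP": [28.97, 32.71, 34.47],
    },
    "BikeNYC": {
        "DiffUFP w/o N&E": [4.51, 9.70, 12.08],
        "DiffUFP w/o E": [4.05, 9.14, 10.59],
        "DiffUFP w/o N": [3.88, 8.86, 10.87],
        "DiffUFP w/o D": [3.83, 8.73, 10.44],
        "DiffUFP": [3.54, 8.42, 9.54],
    },
}
HORIZONS = ["1h", "2h", "3h"]
BASE = "DiffUFP w/o N&E"  # variant without both modules, used as the reference


def relative_improvement(base_rmse, model_rmse):
    """Percentage RMSE reduction relative to the base (assumed definition)."""
    return 100.0 * (base_rmse - model_rmse) / base_rmse


def improvements(dataset):
    """Per-horizon improvement of every non-base variant on one dataset."""
    rows = ABLATION_RMSE[dataset]
    base = rows[BASE]
    return {
        name: [relative_improvement(b, v) for b, v in zip(base, values)]
        for name, values in rows.items()
        if name != BASE
    }


if __name__ == "__main__":
    for dataset in ABLATION_RMSE:
        print(dataset)
        for name, gains in improvements(dataset).items():
            cells = ", ".join(f"{h}: {g:.2f}%" for h, g in zip(HORIZONS, gains))
            print(f"  {name}: {cells}")
```

Under this assumed definition, the gain of the full model over the base grows with the horizon on BikeNYC, whereas on TaxiBJ it stays within a narrower band.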

5.4 Hyperparameter Sensitivity Analysis (RQ3)

Finally, we investigate the impacts of important parameters on DiffUFP in terms of UFP performance. Figure 6 reports the results of the parameter sensitivity of DiffUFP.

Fig. 6.

—

Batch Size B. We select the batch size from \(\lbrace 16, 32, 64, 128, 256\rbrace\). Figure 6(a) illustrates that for both datasets, the predictions are most accurate when the batch size is 16 (i.e., the larger the batch size, the worse the prediction results). This result suggests that a small batch size is enough for our model to learn the dynamics of urban flows.

—

Extracted Semantic Regions \(N_r\). The parameter \(N_r\) controls the number of semantic regions extracted by DiffUFP. We vary \(N_r\) from 3 to 15 and plot the results in Figure 6(b). The best predictions occur when \(N_r=5\) and \(N_r=7\) in TaxiBJ and BikeNYC, respectively. This difference is mainly due to the different trajectory properties of the two datasets. TaxiBJ is generated by taxis, whereas BikeNYC is produced by bikes. Typically, bikes are more flexible than taxis and can visit more semantic regions.

—

Threshold \(\epsilon ^{sp}\). The threshold \(\epsilon ^{sp}\) controls the sparsity of the spatial adjacency matrix. We vary the spatial adjacency threshold from 1.0 to 4.0. As shown in Figure 6(c), the prediction errors are lowest when \(\epsilon ^{sp}\) is 2.0 on both datasets. The results also reveal that even though modern transportation enlarges the range of human mobility, people prefer to travel within a moderately sized zone most of the time.

—

Coefficient \(\beta\). This is a variable used to balance the spatial information fusion in Equation (22). We adjust this coefficient from 0.05 to 1.0 to test the sensitivity of DiffUFP to the parameter \(\beta\), as illustrated in Figure 6(d). For both datasets, DiffUFP achieves the best performance when \(\beta\) is 0.2. This result demonstrates that a relatively small proportion of spatial adjacency information is sufficient for accurate flow predictions. Meanwhile, it also suggests that semantic adjacency knowledge is more important for learning the transitions of urban flows, as the semantic adjacency matrix is good at capturing the complex long-range dependencies.
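To make the roles of \(\epsilon ^{sp}\) and \(\beta\) concrete, the following toy sketch shows how a distance threshold changes the sparsity of a spatial adjacency matrix and how a coefficient can weight spatial against semantic adjacency. Both rules used here are assumptions for illustration only: the binary thresholding on Euclidean distance and the convex combination \(\beta \mathbf{A}^{sp} + (1-\beta)\mathbf{A}^{\prime}\) are stand-ins, and the exact forms used by DiffUFP are those given in Section 4.2 and Equation (22). The coordinates and the semantic matrix are made-up values.

```python
from math import hypot


def spatial_adjacency(coords, eps_sp):
    """Binary adjacency: 1 if two region centers are within eps_sp (assumed rule)."""
    n = len(coords)
    adj = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        for j, (xj, yj) in enumerate(coords):
            if i != j and hypot(xi - xj, yi - yj) <= eps_sp:
                adj[i][j] = 1.0
    return adj


def density(adj):
    """Fraction of non-zero entries; a lower value means a sparser matrix."""
    n = len(adj)
    return sum(1 for row in adj for v in row if v > 0) / (n * n)


def fuse(a_sp, a_sem, beta):
    """Assumed fusion: beta * A_sp + (1 - beta) * A_sem, entry by entry."""
    return [
        [beta * s + (1.0 - beta) * m for s, m in zip(row_sp, row_sem)]
        for row_sp, row_sem in zip(a_sp, a_sem)
    ]


if __name__ == "__main__":
    # Hypothetical region centers on a line (grid units).
    coords = [(0, 0), (1, 0), (3, 0), (6, 0)]
    for eps in (1.0, 2.0, 4.0):
        print(f"eps_sp={eps}: density={density(spatial_adjacency(coords, eps)):.3f}")

    # Hypothetical semantic adjacency linking distant regions 0 and 3.
    a_sem = [
        [0.0, 0.5, 0.0, 0.9],
        [0.5, 0.0, 0.1, 0.0],
        [0.0, 0.1, 0.0, 0.2],
        [0.9, 0.0, 0.2, 0.0],
    ]
    fused = fuse(spatial_adjacency(coords, 2.0), a_sem, beta=0.2)
    for row in fused:
        print([round(v, 2) for v in row])
```

In this toy setting, a small \(\beta\) keeps the long-range link between regions 0 and 3 strong even though the two regions are not spatial neighbors, which is the kind of dependency the semantic adjacency matrix is meant to capture.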

5.5 Semantic Region Discovery (RQ4)

A meaningful semantic region should have a significant volume of flow interactions and correspond to important districts of traffic networks. To validate the effect of our dynamic semantic region extractor, we plot the extracted semantic regions in Beijing. As depicted in Figure 7, the extracted semantic regions are consistent with the roads and transfer stations that are experiencing or will experience high volumes of in/outflows. For instance, the red area in Figure 7(a) covers the Sanyuan bridge, a large-scale flyover where traffic jams frequently occur. Besides, by contrasting the current urban flow maps (Figure 7(a) and Figure 7(c)) with the extracted semantic regions (Figure 7(b) and Figure 7(d)), we find that these regions will undergo large flow transitions in the near future. These phenomena demonstrate that our dynamic graph-based node representation method can discover salient regions as well as the underlying structure of urban traffic networks.

Fig. 7.

Finally, we visualize the prediction errors made by DiffUFP and its variants in different regions. Specifically, we plot the results on TaxiBJ in Figure 8, where brighter pixels indicate larger prediction errors. Clearly, the three variants are generally brighter than DiffUFP, which further verifies the effectiveness of the two essential designs in improving the flow prediction performance. Furthermore, we map an extracted semantic region (the areas in the green rectangles) back to the urban map. In the rest of the map, DiffUFP generally performs better than the other three variants, which means the semantic region plays an important role in learning the spatio-temporal interactions between traffic flows. However, the predictions inside the semantic areas are very similar (i.e., it is difficult for DiffUFP to discriminate the inner flows within the green rectangles). This happens because the proposed dynamic semantic region extractor extracts the semantic regions without access to the actual road networks, which limits its ability to make more accurate predictions in the semantic regions. How to improve the performance of DiffUFP in learning the spatio-temporal correlations inside the semantic regions is beyond the scope of this article and is left as future work.
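A per-region error map of this kind can be sketched as follows. The snippet is a toy illustration, not the plotting code behind Figure 8: it assumes the plotted quantity is the absolute error per grid cell averaged over time steps, and it compares the mean error inside and outside a hypothetical rectangle standing in for a semantic region. All function names and numbers are made up for the example.

```python
def error_map(pred, truth):
    """Mean absolute error per grid cell, averaged over time steps.

    pred and truth are nested lists indexed as [time][row][col].
    """
    steps = len(pred)
    rows, cols = len(pred[0]), len(pred[0][0])
    return [
        [
            sum(abs(pred[t][r][c] - truth[t][r][c]) for t in range(steps)) / steps
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def mean_error(err, cells):
    """Average error over a list of (row, col) cells; NaN if the list is empty."""
    if not cells:
        return float("nan")
    return sum(err[r][c] for r, c in cells) / len(cells)


def split_by_rectangle(err, top, left, bottom, right):
    """Return (mean error inside, mean error outside) an inclusive rectangle."""
    inside, outside = [], []
    for r, row in enumerate(err):
        for c in range(len(row)):
            in_rect = top <= r <= bottom and left <= c <= right
            (inside if in_rect else outside).append((r, c))
    return mean_error(err, inside), mean_error(err, outside)


if __name__ == "__main__":
    # Two time steps on a 2x2 grid with made-up flow values.
    truth = [[[10, 20], [30, 40]], [[12, 18], [30, 44]]]
    pred = [[[11, 20], [27, 40]], [[12, 20], [33, 40]]]
    err = error_map(pred, truth)
    print(err)
    # Treat the bottom row as a hypothetical semantic region.
    print(split_by_rectangle(err, top=1, left=0, bottom=1, right=1))
```

Separating the error inside and outside the rectangle gives a simple numerical counterpart to the visual comparison, for instance to check whether a variant degrades mostly within the semantic regions or across the whole map.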

Fig. 8.

6 Conclusion and Future Work

We presented DiffUFP, a novel GNN-based model for accurate UFP. It provides a new perspective on solving the spatio-temporal learning problem by constructing a more meaningful flow graph. We proposed a dynamic semantic region extractor that extracts semantic regions dynamically. Moreover, we introduced a conditional score-based adjacency matrix generator to capture the fine-grained impact of external factors on the transitions of urban flows. Comprehensive experiments on benchmark datasets verified the performance of the proposed model. In the future, we are interested in speeding up the training and sampling process of the conditional score-based networks. We also plan to apply DiffUFP to other spatio-temporal tasks such as flow super-resolution [24] and traffic speed forecasting [10].

Footnote

https://ride.citibikenyc.com/system-data

References

[1] Donald J. Berndt and James Clifford. 1994. Using dynamic time warping to find patterns in time series. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (AAAIWS). 359–370.
