The development of IoT and communication technologies has opened up numerous opportunities to assess a variety of phenomena in cities, such as traffic, pollution, and economic wealth. City data are diverse in nature and vary in format, availability, volume, spatio-temporal dependencies, and sensitivity concerns, to name a few aspects. All these data should be processed and analyzed to derive comprehensive insights. Therefore, solutions are needed to work with such diverse data in a robust, efficient, secure, and ethical manner. This section reviews the main issues and approaches developed in the smart cities context in (1) data availability and quality, (2) data heterogeneity and integration, (3) data management, (4) data analysis, (5) ethics, (6) data privacy, and (7) data security.
4.1 Data Availability and Quality
Different taxonomies have been applied to classify urban data. For instance, Zheng et al. [
372] suggest a division of urban data by the nature of the phenomena they present, like geographical, traffic, mobile phone signals, commuting, environment monitoring, social network, economy, energy, and health care data. Another suggested taxonomy is based on data structures (point- and network-based types of data) and spatio-temporal properties (spatio-temporal static, spatial static but temporal dynamic, and spatio-temporal dynamic) [
371]. Additionally, available urban data can be divided into five pools, including firewall (within the legacy systems of public agencies), open data, social, sensors/IoT, and commercial data [
141]. Finally, urban data has also been divided according to the degree of personal information it contains, into non-personal data, aggregate data, de-identified data, and personal information [
214]. In this subsection, we will highlight the urban data availability aspect, categorizing our exploration into open data, citizen-contributed data, and commercial data solutions. Also, we will discuss corresponding data quality considerations.
Open Data. Data are the key enabler for the vision and realization of smart cities. According to a European strategy for data, Big Data are considered as one of the key enablers to maximize the growth potential of the European digital economy and society [
109]. Therefore, a large effort has been made to encourage data suppliers and owners, including municipalities and governments, to open their data for both research and business. To gain the benefits, an adaptation of municipal vision and governance strategies could be required to coordinate, enable, and support various forms of data-sharing initiatives [
202]. Open data are data that anyone can access, use, and share; they are available in a machine-readable format and licensed to permit use for any purpose [
35]. Governments and municipalities play a crucial role in the management of cities’ data assets so that data-driven tools can be used to address challenges that cities face [
73]. Therefore, there is also a strong recent trend to release much of public agencies’ data as open data [
141]. This is known as Open Government Data, which is defined as information collected, produced, or paid for by public bodies and licensed for free re-use for any purpose [
35]. A number of open-source and commercial data portal platforms exist, providing the ability to publish data, enabling data access and visualizations like CKAN,
1 DKAN,
2 Socrata,
3 Opendatasoft,
4 PublishMyData.
5 Their availability, as well as the strong demand to share urban data, has resulted in a number of urban data platforms, containing both open and restricted in-use data. Barns [
73] classifies these into data repositories—open data portals whose main goal is to provide data sharing capabilities; data showcases that aim to visualize data, but where the data itself is not always available or machine-readable; city scores—visualizations of city performance in regard to a certain set of indicators; and data marketplaces enabling data access and reuse with performance monitoring. Examples of data repositories include the New York City open data portal [
119], which enables data access within a number of categories. Among the full information about the dataset, it is also possible to see the data snapshots and visualize the data in external services. Another example of data repositories is the Moscow City Government open data portal,
6 providing access to data classified into thematic topics, like healthcare, education, and culture. Datasets are equipped with basic information, including, among others, dates, formats, links to the source, and contact information of the persons responsible. Well-known city dashboards include the Dublin Dashboard,
7 which provides rich visualization opportunities as well as possibilities to get the data available. The London Datastore [
32] also provides rich opportunities to visually explore the data, as well as gain access to it. However, when compared to other city dashboards, the London Datastore provides data-driven analytics based on their alignment to strategic planning and governance challenges for City Hall [
73].
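Many of these portals also expose programmatic access alongside their web interfaces. As an illustration, the following minimal Python sketch queries a CKAN-compatible portal’s action API for datasets matching a keyword and lists their downloadable resources; the portal URL and the search term are placeholders, and the response fields follow CKAN’s documented package_search format.

```python
import requests

PORTAL = "https://data.example-city.gov"  # hypothetical CKAN-compatible portal

def find_datasets(keyword, rows=5):
    """Search the portal and print the downloadable resources of each matching dataset."""
    response = requests.get(
        f"{PORTAL}/api/3/action/package_search",
        params={"q": keyword, "rows": rows},
        timeout=30,
    )
    response.raise_for_status()
    for dataset in response.json()["result"]["results"]:
        print(dataset["title"])
        for resource in dataset.get("resources", []):
            # Each resource usually exposes a format (CSV, JSON, ...) and a direct URL.
            print(f"  {resource.get('format', '?')}: {resource.get('url')}")

find_datasets("air quality")
```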
Table 5 provides a brief summary of selected available datasets. For deeper insights, an interested reader could refer to Ma et al. [
228], who survey available city datasets.
There are a number of initiatives in the
European Union (EU) advancing data sharing. For instance, the open data portal
8 provides access to data published by EU institutions and bodies. In addition, the portal provides opportunities for data visualizations and work with linked data. Furthermore, the European Data Portal harvests the metadata of public sector information available on public data portals across European countries.
9 Other data sharing activities include INSPIRE Geoportal
10 that collects data provided by EU Member States and several European Free Trade Association countries under the
Infrastructure for Spatial Information in Europe (INSPIRE) Directive that focuses on creating an infrastructure for sharing environmental spatial information. Yet another known initiative is Copernicus, the Earth observation programme coordinated and managed by the European Commission and implemented in partnership with the Member States, the European Space Agency, the European Organization for the Exploitation of Meteorological Satellites, the European Center for Medium-Range Weather Forecasts, EU Agencies and Mercator Océan.
11 Copernicus provides a number of services categorized under atmosphere, marine, land, climate change, security, and emergency themes, as well as access to satellites and
in situ sensor data.
While acknowledging the power of such dashboards and portals, it is important to note that they require considerable effort to remain useful and provide utility for communities, municipalities/governments, and businesses. First, their purpose and interpretation should be as clear as possible, since the data itself, as well as the data processing and analysis steps, are known to be technology- and methodology-dependent, limited in time and location, and could be biased in interpretation [
200,
201]. Second, such data platforms require active maintenance and support to ensure that they contain up-to-date information of the required quality. Support is also needed for both data providers and data consumers. For instance, proper effort is required to share the data. The data provider must ensure the content quality (completeness, cleanness, and accuracy), timeliness and consistency support, and a suitable data representation model (use of standardized solutions, proper formats, and linked data), supply proper metadata, and address the legal aspects, i.e., provide a license to use the data [
35]. After the data are published, they should be properly maintained, i.e., data access should be checked and the data itself and its metadata assessed and updated, as data lineage and metadata allow users to assess trustworthiness and data quality [
201].
Legal issues regarding publishing and use of the data require careful treatment. For example, data ownership, legal grounds, and terms of use are often unclear for particular data sources within data repositories. Many data repositories have statements and references to legal documents in their terms and conditions on what kind of data are stored and how to use them, e.g., the Moscow City Government open data portal. However, including licence information in the data source description itself provides better transparency and eliminates confusion; the London Datastore is an example of this practice.
Citizen-Contributed Data. The premise of citizen-contributed data is to facilitate and collect input for decision-making at large. Different approaches exist to harnessing citizens’ data [
211], including:
—
crowd markets: to enable the aggregation of online individuals as collaborative input;
—
social media mining: to retrieve publicly expressed opinions and content;
—
urban and in situ sensing platforms: to unobtrusively collect data from citizens’ daily dwellings.
Crowd Markets. Amazon’s Mechanical Turk [
2] and Figure-Eight [
21] (previously Crowdflower) are today’s largest platforms for aggregating online individuals’ time to complete tasks that are computationally intensive but relatively trivial to a human. These platforms are purposefully generic, and a variety of tasks can be created. These tasks include answering surveys, writing reviews, annotating images, and transcribing audio, i.e., tasks that are challenging to automate due to a high risk of error. The main challenge of crowd markets is to sustain the crowd size and quality. The literature shows that higher-paid tasks can attract workers at a higher rate. Emphasis on the importance of the work has a statistically significant and consistent positive effect on the quality of the work [
290]. A practical example of leveraging crowd markets is Zensors [
45], which enables sensing of any visually observable property. Zensors streams images that the crowd processes and labels according to a well-defined set of instructions, enabling near-instant counting and other high-level sensing. Once sufficient human-based input is available, ML is applied to fully automate the process when the accuracy of the algorithms is sufficiently high (
\(\gt\)90%). This approach is also used by the Google Crowdsource initiative [
19], where gamification and recognition given as badges are used to sustain and train ML classification algorithms.
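As a minimal sketch of the handover logic used by such systems, the following Python code (using scikit-learn) accumulates crowd-provided labels and switches to automatic classification only once a model trained on them crosses an accuracy threshold; the 90% threshold, the choice of classifier, and the assumption that image features have already been extracted are illustrative rather than taken from the cited systems.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

ACCURACY_THRESHOLD = 0.90  # hand the task over to ML once the model is "good enough"

def maybe_automate(features, crowd_labels):
    """Train on the crowd labels collected so far; return a model only if it beats the threshold."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, crowd_labels, test_size=0.2, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    if accuracy >= ACCURACY_THRESHOLD:
        return model   # stop routing images to crowd workers, classify automatically
    return None        # keep collecting human answers
```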
Social Media Mining. Mining online social media at a large scale allows us to consider users’ posts of opinions and content to gain insights into unfolding events [
290]. The widespread availability of smartphones and high-speed Internet has enabled a range of systems that collect a variety of different types of user contributions. For example, it is now possible to collect videos and photos in the field, e.g., on YouTube, Instagram, Twitter, and Facebook. These platforms allow user-driven tagging with relevant keywords. The primary use of this media is within the platforms themselves, but researchers have also treated such user-generated content as sensor data originating from end-users. Providing a system that allows users to easily contextualize and tag high-level data results in a valuable repository of knowledge. For example, Wheelmap
12 allows users to tag and search for wheelchair accessible places using a smartphone and a browser. Others share where they are [
343] or whether a place is recommended [
208], or reported the destruction aftermath of an earthquake [
348]. Researchers keep exploring ways to use device sensors, for example, in Citizen Science [
266]. Citizen Science can be interpreted as individuals becoming active participants and stakeholders of data. Large-scale efforts, such as Wikipedia and OpenStreet Maps, allow users to publicly augment and annotate online information as text or geo-fenced markers. This wealth of everyday information about and around us creates numerous possibilities for new applications and research in general. Social media-enabled applications are primarily driven by smartphones for
in situ context and are often deployed on application stores for ease of installation and updating the platform.
Urban and In-Situ Sensing Platforms. Urban and
in situ systems pervasively collect data from citizens without the need to set up or install an app on someone’s smartphone. These platforms often deploy sensors throughout a city. These can be invisible to citizens, e.g., underground traffic sensors, weather monitoring stations on top of a building, or they can be an integral part of the city, e.g., interactive public displays. A number of studies have investigated the use of public interactive displays for the purpose of data collection [
63,
89,
176]. Opinionizer [
89] is designed and placed in social gatherings (parties) to encourage socialization and interaction. Participants would add comments to a publicly visible and shared display. Due to the fear of “social embarrassment,” the authors suggest that public interactions should be purposeful.
The environment, both on and around the display, also affects its use and the data collected. The environment produces strong physical and social affordances, and such devices or solutions need to rapidly reveal their purpose regarding the social activity under study and be able to seamlessly and comfortably encourage citizens to switch from being onlookers to becoming participants. TextTales [
63] explored providing story authorship and civic discourse by installing a large, city-scale, and interactive public installation that would show a grid of text. A discussion on a certain photograph would start with text messages sent by citizens, displayed in a stream of comments.
Beyond public display, citizens can also be involved in larger efforts to affect society at large. One such project is vTaiwan,
13 which is an online-offline consultation process that brings together government ministries, elected representatives, scholars, experts, business leaders, civil society organizations, and citizens. The platform allows lawmakers to implement decisions with a greater degree of legitimacy. It combines a website, meetings, hackathons, and consultation processes. For example, vTaiwan was crucial in the debate on Uber operations in Taiwan.
14 In a similar approach, Decidim
15 is a digital platform for citizen participation, helping citizens, organizations, and public institutions to self-organize democratically at scale. It provides a political network, citizen-driven initiatives and consultations, and participatory budgeting, thus allowing a democratic and flexible system where everyone can voice their opinion.
Overall, citizen-contributed data are a valuable source of information, and in some cases, they are the only way to understand the phenomenon of interest. However, such data collection initiatives and the subsequent data analyses should be planned well and performed with care. For instance, if citizens are asked to perform a measurement, they should be instructed on how to do it to obtain reliable values [
90]. Some measurements may also require a calibration of the device [
272]. In addition, one should have a strategy to deal with data gaps due to behavioral patterns of people taking measurements [
284]. As in each study, one should ensure that a sample of users, contributing the data to the system, represents the population as fully as possible, and that no bias is introduced into the data collection strategy. Finally, privacy issues from such data collection initiatives should be checked and treated appropriately.
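To make the calibration point concrete, the sketch below fits a simple linear correction for a low-cost, citizen-operated sensor against co-located reference measurements and applies it to new readings; the values and the linear model are illustrative assumptions rather than a prescribed procedure.

```python
import numpy as np

# Co-located readings: citizen device vs. a trusted reference station (illustrative values).
device_readings = np.array([12.0, 18.5, 25.1, 33.0, 41.2])
reference_values = np.array([10.2, 16.0, 22.8, 30.1, 38.0])

# Least-squares fit of a linear correction: reference = slope * device + offset.
slope, offset = np.polyfit(device_readings, reference_values, deg=1)

def calibrate(raw_value):
    """Apply the fitted correction to a new citizen-contributed reading."""
    return slope * raw_value + offset

print(round(calibrate(28.0), 1))
```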
Commercial Data and Private–Public Partnership. A number of commercial organizations deploy infrastructures and utilize available urban data to provide and improve their services. Sharing these data with municipalities has been a subject of debate for a long time [
46]. However, challenges with data have enabled various forms of commercial involvement, such as data markets and hubs. Such organizations facilitate connections between data providers and data consumers, especially if the data cannot be openly shared. One example of such a solution is the Platform of Trust
16 in Finland, which enables data movement between systems and organizations, taking care of trustworthiness and data harmonization issues. They also involve the community so that interested people can participate in creating harmonization models that are then published as open-source code.
Additionally, possibilities have been explored for data exchange between public and private organizations, e.g., the
City Data Exchange (CDE) project created a marketplace for data exchange between public and private organizations [
251]. This project was a collaborative effort of the Municipality of Copenhagen, the Capital Region of Denmark, and Hitachi. The CDE service supported collaboration between different partners on the supply and demand of data and provided a platform for selling and purchasing data for both public and private organizations. Based on the project, a number of challenges were identified: an immature market, as even though some companies buy data for their services, many are not yet ready to include data sharing in their core business or strategy; a lack of use cases, which contributes to the reluctance to invest resources in selling or buying data; a fragmented landscape; reluctance to share data on an open data portal, e.g., for ethical or competitive reasons; and a lack of skills and competences to work with the data [
251].
The development of such joint efforts requires trustworthy data stewardship. That is, “trustworthiness is the virtue of reliably meeting one’s commitments, while trust is the belief of another that the trustee is trustworthy” [
253]. Several models have been suggested for collaboration in data use and sharing [
185]. For example, data collaboratives
17 represent a form of partnership where a number of parties, like governments, companies, and others, collaborate to exchange and integrate data to help to solve societal problems or create public value [
204]. Therefore, through such cross-sector and public–private collaboration initiatives it is possible to achieve much wider goals that would be difficult for the parties to accomplish on their own. One noteworthy example of a data collaborative in the smart city context is 9,292
18 which is a public–private collaborative gathering and sharing public transportation data in the Netherlands. Obviously, data collaboratives possess all the challenges that data integration initiatives have, since the data comes from diverse providers, in different formats, and with varying structures. However, as Klievink et al. [
204] emphasize, data collaboratives are a collaboration and innovation phenomenon rather than a data phenomenon. Therefore, organizational, incentivization, and governance challenges should be considered as well. From this perspective, a number of additional challenges arise regarding vulnerabilities in opening the data, its possible misuse, and overall trust within the partnership. Coordination problems also include matching potential data providers and data users, maintaining data control and its unforeseen uses when shared, matching a problem with the data attributes, ensuring the shared data are useful and usable by the user, and aligning the incentives of providers to share proprietary data with the goals of the users [
330]. Moreover, data collaboratives are not isolated constructs, therefore partners’ incentives, goals and collaboration overall depend on the context, like institutional and governance frameworks, government interests, transparency/inclusiveness culture, and the means by which collaboration is legitimized [
204]. Therefore, to achieve a successful collaborative, it could be helpful to organize the overall collaboration process and context in such a way that perceived vulnerabilities are dealt with [
204].
Another initiative is the data trust. The interest in data trusts started in 2017, when this model was proposed as a “set of relationships underpinned by a repeatable framework, compliant with parties’ obligations, to share data in a fair, safe and equitable way” [
162]. The Open Data Institute defines data trust as “a legal structure that provides independent stewardship of data” [
166]. There are a number of interpretations of data trusts; e.g., it is assumed that a data trust could be simply an arrangement of governance or a legal agreement, or that such practices could be aggregated into an architecture [
253]. Hardinges groups the different interpretations and uses of the data trust term into the following categories: a repeatable framework of terms and mechanisms; a mutual organization formed to manage data on behalf of its members; a legal structure; a store of data with restricted access; and public oversight of data access [
165]. For instance, Sidewalk Labs proposes the establishment of an Urban Data Trust (that could evolve into a public sector agency over time) serving as an independent digital governing entity for their Sidewalk Toronto project, ensuring that responsible data handling is in place for digital innovation activities (Responsible Data Use) [
214]. In addition to privacy laws, Sidewalk Labs suggests that all innovations aiming to collect/use urban data must go through a Responsible Data Usage Assessment conducted by the Urban Data Trust. This way, Sidewalk Labs aims to ensure proper privacy and security practices, provide and use consistent and transparent guidelines for responsible use of data, and make urban data a public asset [
214]. These goals align with O’Hara’s emphasis on the purpose of a data trust, which is “to define trustworthy and ethical data stewardship, and disseminate best practice” [
253].
Generally, successful engagement in any form of data-sharing partnership could require the adaptation of urban governance visions and strategies [
202], as well as a transformation of the parties’ institutional cultures and processes [
124]. A certain level of data quality could be expected from commercial or private–public partnership data, since such data are often an asset for the commercial success of the organizations. However, technological and methodological biases should not be excluded, since the data could be generated for a particular purpose but shared for potentially different ones [
200,
201]. Moreover, partnerships could involve a proper formalization of the responsibilities in data sharing (e.g., data representation models and metadata availability), usage (e.g., who, how, and for what purpose), and maintenance processes between the collaborating parties, making sharing and usage of the data smoother.
4.2 Data Heterogeneity and Integration
During the last few years, a large amount of heterogeneous data has become available from various applications and tools. This is also true in the smart cities context, where the rapid adoption of intelligent applications has created new, different, and numerous data collections. These new sources have brought new opportunities but also emerging challenges. An effective data analysis in the smart cities context has to consider the increasing amount of data coming from connected devices, multiple software solutions (developed by public and/or private institutions), and historical archives. However, since the systems producing and collecting data are heterogeneous, they provide data in multiple formats that must be integrated before they can be combined for effective analysis. The siloed and often incompatible nature of these sources has also made the interpretation and use of data more challenging [
279]. We will explore the different strategies that, according to the literature, can be applied for integration of data for smart cities, summarized in
Table 6:
—
Model data integration
—
Semantic data integration
—
Structural data integration
—
Software-delegating data integration
Model Data Integration. This approach to data integration has been developed in the previous decades starting from proposals focused on the integration of classical data models (such as Relational, XML, and Object-Oriented) [
161,
237], and continuing with suggestions more focused on recent data formats (such as streams, NoSQL databases) [
66,
217]. According to this methodology, all data, coming from different sources is collected in a central repository where an abstract model, grouping all the characteristics of the diverse sources, supports all the operations [
78]. A major benefit of this methodology is that the collected and integrated data (in theory) contain no redundancy, can be accessed uniformly, and can be trusted thanks to their integrity. Unfortunately, defining such a model is difficult, since concepts coming from different data models do not always map onto each other. For example, it could be quite challenging to integrate two dissimilar concepts into the same model, such as a link from a graph data model and a column from a columnar data model. Moreover, the characteristics of Big Data make the maintenance of such a unified model tricky, since the data model must be updated each time a new data source with a different data model is defined and needs to be integrated.
In the context of smart cities, the work by Ballari et al. [
70] presents one of the first approaches in this direction. The authors focus on integrating sensor data and highlight the difficulties in finding a global scalable solution. Even though they introduce a global model (providing dynamic interoperability and considering the concepts of proximity, adjacency, and containment in different dynamic contexts), they still cannot manage to introduce a global schema that can be used to store data in a scalable manner. The CitySDK project [
269] goes in the same direction, defining a global data model for integrating data concerning tourist information. Their global model designs structures for points of interest, events, itineraries, and categories/tags. The approach bases the data collection on a set of adapters that transfer the information from heterogeneous sources (mainly CSV, JSON, and XML files) to the global data model (implemented in document format and stored in MongoDB) using a REST API. This approach addresses the flexibility problem of the central data model by requiring the definition of a new adapter each time a specific data source is added to the system.
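The adapter idea can be illustrated with a short sketch: each source-specific adapter maps its native format onto one shared point-of-interest document that could then be stored in a document database such as MongoDB. The field names of the two sources below are invented for the example and do not reproduce the CitySDK model.

```python
import csv
import io
import json

def poi_from_csv_row(row):
    """Adapter for a CSV source with hypothetical column names."""
    return {"name": row["poi_name"],
            "category": row["type"],
            "location": {"lat": float(row["latitude"]), "lon": float(row["longitude"])}}

def poi_from_json_record(record):
    """Adapter for a JSON source with a different, hypothetical structure."""
    return {"name": record["title"],
            "category": record["tags"][0] if record.get("tags") else None,
            "location": {"lat": record["geo"]["lat"], "lon": record["geo"]["lng"]}}

csv_source = io.StringIO("poi_name,type,latitude,longitude\nCity Museum,culture,60.17,24.94\n")
json_source = '{"title": "Central Park", "tags": ["leisure"], "geo": {"lat": 60.18, "lng": 24.93}}'

documents = [poi_from_csv_row(row) for row in csv.DictReader(csv_source)]
documents.append(poi_from_json_record(json.loads(json_source)))
print(documents)  # every source now conforms to the same global document model
```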
More recent approaches have managed to establish architectures based on the meta models provided by new technologies. This is the case of the data hub-like architecture, proposed by Koh et al. [
207]. This approach integrates the technologies of stream processing, like Apache Kafka [
10] with the support of Apache Spark [
14] (also used for batch processing); the knowledge graph-structured base of Virtuoso for semantics, and the storage of Apache HBase [
9] for quick real-time retrieval. Finally, they use Vert.x [
42], a Java framework, to provide scalability through its natively asynchronous task processing and abstraction of microservices. The design is still quite new and would have to be tested to evaluate its performance.
Cacho et al. [
92] proposed viewing a smart city as a
System-of-Systems (SoS) to help develop a framework upon which governments can benefit from the integration of public and private systems for planning, administrative, and operative purposes. They also identify challenges to the development of smart cities, namely: the escalation and complexity of the SoS to be developed, the multitude of stakeholders, the variety of domains, and the emergent behaviors of the systems within. In this context, they describe the challenge of unifying information to handle the heterogeneity and interoperability of the system under analysis using a global meta-layer.
Semantic Data Integration. One popular strategy for data integration is to use knowledge representation and ontologies. In computer science, an “ontology is an explicit specification of conceptualization. The term is borrowed from philosophy, where Ontology is a systematic account of Existence” [
157]. To define an ontology on the top of a domain, in computer science, a representation of the knowledge with a set of concepts within a domain and the relationships between those concepts must be provided. This approach has been implemented and described in multiple cases, like [
71,
84,
275,
325]. The benefits of semantic data integration are modularity, scalability, and the fast and easy integration of different formats of data while removing the need to have a centralized system to store all the data together. Bansal et al. [
71] define a general Extract-Transform-Load framework, involving the creation of a semantic data model as a basis to integrate data from multiple sources. This is followed by the development of a distributed data collection that can be queried using the SPARQL query language. Psyllidis et al. [
276] focus on the smart cities domain and present a similar approach. The data from multiple heterogeneous urban sources are integrated into a global ontology. On top of that, the authors define various interactive Web components (e.g., a Web ontology browser and interactive knowledge graph) to access the integrated ontology graph. Bianchi et al. [
79] try to combine the definition of a semantic layer with a tool that gives domain experts the possibility to autonomously integrate multiple heterogeneous smart city data sources. Gaur et al. [
148] propose a multi-level smart city architecture integrating data from wireless sensors for pressure, temperature, electricity, and others. Their architecture is composed of four layers and each layer has one responsibility. Layer 1 receives data in many different formats. Layer 2 is in charge of processing all the data into a single format, like
Resource Description Framework (RDF) format. Layer 3 contains the inference engine for data integration and reasoning using semantic web technologies. Finally, Layer 4 is responsible for querying data. A different approach based on RDF-format data integration is presented by Consoli et al. [
111]. There, the authors describe a platform implementing an ontology-integration approach that leverages the help of domain experts. For each data source, an ontology is created. The common conceptual layer allows all the data to be converted into a target RDF data model. A similar solution for RDF-format data integration from sensors is presented in [
326].
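As a concrete, self-contained illustration of querying data that has been integrated under a shared vocabulary, the sketch below builds a tiny RDF graph with rdflib and runs a SPARQL query over it; the ex: namespace and its Sensor, measures, and locatedIn terms are invented for the example rather than taken from any of the cited ontologies.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/city#")  # illustrative vocabulary
g = Graph()

# Two sensors described with the same (made-up) ontology terms.
g.add((EX.sensor1, RDF.type, EX.Sensor))
g.add((EX.sensor1, EX.measures, Literal("PM2.5")))
g.add((EX.sensor1, EX.locatedIn, EX.DowntownDistrict))
g.add((EX.sensor2, RDF.type, EX.Sensor))
g.add((EX.sensor2, EX.measures, Literal("NO2")))
g.add((EX.sensor2, EX.locatedIn, EX.HarbourDistrict))

# SPARQL works uniformly over the integrated graph, regardless of where the triples came from.
query = """
    PREFIX ex: <http://example.org/city#>
    SELECT ?sensor ?pollutant WHERE {
        ?sensor a ex:Sensor ;
                ex:measures ?pollutant ;
                ex:locatedIn ex:DowntownDistrict .
    }
"""
for sensor, pollutant in g.query(query):
    print(sensor, pollutant)
```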
Bischof et al. [
84] share the consensus on the effectiveness of a semantic modeling strategy for smart cities and on the conceptual data model. The approach considers the data stream annotation with descriptions for data privacy and security, and data contextualization using hierarchies to categorize smart city data. In detail, the solution is based on the definition of a semantic description for smart city data, which is heterogeneous in nature, to facilitate discovery, indexing, querying, and so forth for future services. They consider data heterogeneity not just from the format point of view but also explore the nature of the data considering, for example, the different units of measurement that are provided. They propose to start collecting metadata and semantic descriptions and try to find a compromise with respect to the volume that this metadata might represent. The approach ends with the definition of the Semantic Sensor Network ontology developed by a World Wide Web Consortium incubator group which focuses on organizing and describing sensor capabilities and data processing. The HyperCat [
117] project developed a standard knowledge representation using knowledge graphs to provide a uniform and machine-readable way to discover and query data distributed among many data hubs, where each data hub can provide inputs from different IoT components and networks. In this approach, applications can identify and use the data they need independently of the specific data hub they belong to. Finally, we can also cite the CityGML open data model, based on the XML format, which is a standard for the storage and exchange of virtual 3D city models [
156].
A semantic data integration approach is of interest to organizations and standardization bodies as well. For example, it has been proposed by the Alliance for Internet of Things Innovation working group. Special attention must be devoted to the SAREF extension for smart cities [
37] that provides a detailed model for some interesting use cases. The ISO [
28] also works on smart city ontologies, for example, the foundation level concepts [
31], the indicators [
29] (populations, and so forth), and the city-level concepts [
31]. These ontologies constitute a very interesting and rich source for developing standardized access tools and models and have been considered in multiple approaches that follow a semantic modeling strategy.
Structural Data Integration. Many efforts have recently considered data integration from a less abstract point of view, exploring new possibilities offered by cloud platforms or data distribution tools. This kind of data integration looks at data as small pieces that must be integrated from a structural point of view. No generic data model is provided and no abstraction is defined at the application level. Structural data integration differs from model data integration because it does not strictly need a generic and abstract schema in a target model unifying the global vision of the data. It also differs from the software-delegating data integration that we discuss below because it operates at the physical layer: the integration step is done in the storage layer of the platforms and frameworks. In other words, the data integration step is handled purely from a technological and structural point of view. Petrolo et al. [
271] tackle the challenge of creating a smart city from the sensor standpoint. That is, they approach the problem bottom-up and focus on the layers of data generation and consolidation. The authors propose the VITAL Platform, combining the IoT and the Cloud of Things to help alleviate the heterogeneity of data generated from different systems on a pay-as-you-go scheme. This platform combines several protocols and communication technologies, including ontologies, semantic annotations, linked data, and semantic web services, to promote system interoperability. However, they note that big data volumes as well as privacy and security issues remain to be tackled. Both of these challenges have been approached by Rodrigues et al. [
289] with their SMAFramework. Their framework promises to reduce the trouble of dealing with multiple heterogeneous sources (both historical and real-time generated) while allowing for multiple layers of access and security that can satisfy arising privacy and security norms. Furthermore, the SMAFramework can add additional data sources in a plug-and-play fashion. Their framework is based on a Multi Aspect Graph, which they have tested on geospatial and temporal data from New York City combining tweets with trips carried out by yellow taxis. Puiu et al. [
279] propose a distributed framework called CityPulse to perform knowledge discovery and reasoning over real-time IoT data streams in cities. Their architecture includes a layer called “Sensor Connection,” which is responsible for collecting the read data from the different sensors. Later, the data gathered is passed to another layer that parses it to extract relevant information. After the parsing, there is a module that performs semantic annotations by using an ontology created within the CityPulse framework. After the messages are annotated, the data are published in a message bus. Since data in the bus is already annotated with the Uniform Resource Identifiers from the framework ontologies, an RDF Stream Processing module is able to query the data over the streams. Moreover, the framework is able to discover certain events based on the analysis of the incoming annotated streams. Finally, they use a Service Oriented Architecture to allow consumers to query relevant streams of the different sources or events that were discovered in the message bus.
ML has also become a powerful aid to data integration. According to research [
128,
297], there is a synergy between ML and data integration and it becomes stronger over time. Modern ML models help to solve the schema-matching phase that can be considered one of the hardest problems in data integration [
76]. For example, deep learning allows long text values to be compared via their embedding representations and has started to show promising results when matching texts and dirty data. Recently, SLiMFast [
285] has been proposed as a framework that expresses data fusion as a statistical learning problem over discriminatory probabilistic models and that can be adapted to explore the smart city data integration scenario. In the same context, Costa et al. [
112] define a framework having a unified data warehouse that collects and stores all the available data in raw format. Their approach uses an internal model that exploits the characteristics of the Hadoop framework [
8]. Unfortunately, their meta-model is not accessible from the outside and not many details about the conceptual data integration task are provided. Finally, Raghavan et al. [
280] propose a prototype application based on a cloud-based API and architecture. Their solution defines specific layers providing (and restricting) simple but useful standard operations that hide the heterogeneity of the components. In these approaches, the tuning and optimization phases are critical steps that strictly depend on the characteristics of the input dataset. The challenges behind the generalization and optimization of these methodologies are just at the first exploration phase, and much interest has been shown by the database research community [
146,
334].
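To give a flavor of ML-assisted schema matching, the following simplified sketch compares column names from two hypothetical sources using character n-gram TF-IDF vectors and cosine similarity; production matchers would also exploit value distributions, learned embeddings, or supervised models, so this is only an illustration of the idea.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Column names from two hypothetical data sources that need to be integrated.
source_a = ["passenger_count", "pickup_datetime", "trip_distance_km"]
source_b = ["num_passengers", "pickup_time", "distance_travelled"]

# Character n-grams tolerate abbreviations and small naming differences.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
vectors = vectorizer.fit_transform(source_a + source_b)
similarity = cosine_similarity(vectors[: len(source_a)], vectors[len(source_a):])

for i, column_a in enumerate(source_a):
    j = similarity[i].argmax()
    print(f"{column_a}  ->  {source_b[j]}  (score={similarity[i][j]:.2f})")
```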
Software-Delegating Data Integration. During the last few years, a new category of data integration approaches has been developed leveraging the power and the flexibility of the data access software layers available on cloud computing platforms and architectures. We classify these approaches under the name of software-delegating data integration. Specifically, this kind of data integration is performed by using the various services that are provided by the cloud platforms [
129]. For example, Ribeiro et al. [
287] propose an architecture based on microservices developed on the top of the Hadoop framework. Their proposal implements and improves the approach presented in InterSCity [
122] with a more scalable objective. An approach also based on distributed architecture is described in [
307]. In the proposed approach, data are collected from heterogeneous sources, converted internally into a target model according to a common protocol, and made available for the target analysis. This approach can be used in any context and can also be exploited by smart city applications. A similar scenario is presented in [
116], where a data integrator component is in charge of dispatching requests to data sources. Software-delegating data integration is very flexible and allows quick access to and integration of data according to standard operations and patterns. On the other hand, the integration possibilities and the overall maintenance become fully dependent on the tools and operations offered by the specific platforms and their APIs. Any change or evolution of the APIs can alter the results and impact data access.
4.3 Data Management
In recent years, data has gained significant momentum with the evolution of smart cities; therefore, data management at such a scale brings challenges [
69,
239]. Big data tools and technologies now support data acquisition, storage, analysis, and governance [
69]. However, given the volume, heterogeneity, and distributed nature of smart cities, it is still difficult to integrate and manage smart city data [
258]. This section will explore the challenges and state-of-the-art solutions for data acquisition, storage, processing, and governance.
Data Acquisition. Data collection or acquisition means retrieval of the data from the data sources and feeding this data into the analytics platform for storage and further processing [
336]. Data in smart cities is generated by diverse sources such as IoT, economic platforms, government offices, transportation, and social media [
56,
228]. These data vary greatly in their nature (text/images/video/numeric), velocities, and formats. Some data sources are quite
static, that is, they do not change often, like geospatial map data. Some data sources provide data at regular but relatively long intervals, such as daily or monthly. Often, such static data sources have defined APIs to get the data, or the data may be downloaded from other storage solutions. Since such data do not need to be processed and analyzed immediately, they can be loaded onto the data analysis platform, integrated with other data sources, and made available for deeper offline analysis (so-called batch processing) [
203,
336].
Many data sources generate data
continuously and at a high frequency, like sensor readings. Often, such data needs to be processed as it becomes available, to react quickly or detect certain patterns or anomalies. Such incrementally available data are referred to as a stream, the data record as an event, and the near-real time processing of data as stream processing [
203,
336]. In data stream terminology, we have producers (who generate events) and consumers (who process events) [
203]. Collecting and processing streaming data requires dealing with delayed, missing, or out-of-order data; managing situations where producers send messages at a faster rate than consumers can process; and ensuring fault tolerance [
203,
324,
336]. This also means that streaming data requires loosely coupled communication schemes. Common approaches here include messaging systems [
203] that implement different communication patterns. For example, in a request-reply pattern, the client expects a reply from the server. In a publish-subscribe pattern, clients subscribe to certain messages published by the server they are interested in. In a pipeline pattern, producers push the results, assuming that consumers are pulling for them [
336,
341]. Message-queuing systems facilitate communication between producers and consumers by inserting and reading the messages in the queues [
336,
341]. Such an approach provides loose coupling in time, solving a number of challenges of streaming systems, such as consumers lagging behind in their capability to process events. Another issue is handling the heterogeneity of producers and consumers. Message-queuing systems address this via message brokers, namely application-level gateways that convert incoming messages into ones that recipients can understand [
336,
341]. For example, in a publish-subscribe pattern, the brokers match the topics subscribed by the consumers to the topics published by producers [
203,
336,
367]. Examples of such systems include Apache ActiveMQ [
3] and Apache Kafka [
10].
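For illustration, the sketch below uses the kafka-python client to express the two sides of a publish-subscribe pattern for sensor events; the broker address, topic name, and message fields are assumptions made for the example.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"        # assumed broker address
TOPIC = "air-quality-events"     # assumed topic name

# Producer side: sensors publish events to a topic on the broker.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda event: json.dumps(event).encode("utf-8"))
producer.send(TOPIC, {"sensor_id": "s-17", "pm25": 31.4, "ts": "2024-05-01T10:00:00Z"})
producer.flush()

# Consumer side: any interested service subscribes to the topic and processes events.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")))
for message in consumer:
    print(message.value)  # react to each event, e.g., detect anomalies
    break
```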
The recent developments in big data and smart cities have given birth to a number of reliable, fault tolerant and flexible data acquisition and ingestion solutions, like Apache Flume [
6], Apache Spark [
14], Apache Kafka [
10], Apache Flink [
5], and Apache NiFi [
11]. Each of these frameworks is being widely used in academia and industry depending upon the requirements. In some cases, only one framework can meet the requirements, whereas the combination of these frameworks has also been observed [
239,
258]. Therefore, while choosing any such framework, one needs to be mindful of the final requirements. For example, if the data are collected at their origin, they may require initial transformation and cleaning. In addition, data sources may have diverse acquisition frequencies and may require frameworks capable of handling both low-latency and batch-oriented data, alongside data cleaning and data transformation functionalities.
Data Storage. The number of connected IoT devices worldwide is expected to reach 50 billion [
296]. Since data are a key ingredient for smart city services, solutions and tools for efficient data storage and access are needed [
93,
125].
Generally, smart city applications can be considered to be data-intensive. In addition to application-specific requirements, such applications should ensure that data are stored reliably and available for later use, search, and processing, and the results of expensive operations should be saved for speedy retrieval [
203].
In recent years, a number of advanced SQL, document, graph, NoSQL, NewSQL, and big data storage systems have been proposed and adopted by researchers and engineers. It is clear that some of them work better for certain tasks and provide certain guarantees, and the choice is always made based on the data model and system requirements [
93,
168,
203]. Examples include MongoDB [
33] which is a widely used document database, Apache Cassandra [
4] as a representative of wide-column data storage solutions, or VoltDB [
43] as a representative of NewSQL databases. Modern storage solutions enable distributed storage and processing by utilizing replication and sharding; they provide data querying capabilities and interfaces for most commonly used programming languages and third-party systems, and cluster management functionality. Distributed implementation enables scalability, fault tolerance, and latency reduction. However, as CAP theorem says, “in a distributed database system, you can have at most only two of Consistency, Availability, and Partition tolerance” [
168]. Here, Consistency refers to the property of delivering to every user of the database an identical data view at any given instant; Availability promises an operational state in the event of failure; and Partition tolerance ensures the ability to maintain operations in the case of the network failing between segments of the distributed system [
168]. Therefore, in distributed implementations, usually, there is a tradeoff between consistency guarantees and other features.
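As a small illustration of working with a document store such as MongoDB, the following sketch inserts a sensor reading as a GeoJSON-tagged document and retrieves nearby readings through a geospatial index; the database name, collection, and field layout are illustrative.

```python
from pymongo import MongoClient, GEOSPHERE

client = MongoClient("mongodb://localhost:27017")   # assumed local instance
readings = client["smart_city"]["air_quality"]      # hypothetical database/collection

# A geospatial index lets the store answer location-based queries efficiently.
readings.create_index([("location", GEOSPHERE)])

readings.insert_one({
    "sensor_id": "s-17",
    "pm25": 31.4,
    "ts": "2024-05-01T10:00:00Z",
    "location": {"type": "Point", "coordinates": [24.94, 60.17]},  # lon, lat (GeoJSON)
})

# Find readings within roughly 2 km of a point of interest.
nearby = readings.find({
    "location": {"$nearSphere": {
        "$geometry": {"type": "Point", "coordinates": [24.95, 60.17]},
        "$maxDistance": 2000}}})
for doc in nearby:
    print(doc["sensor_id"], doc["pm25"])
```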
Off-the-shelf big data management and processing platforms are available, such as Apache Hadoop [
8] and
High Performance Computing Cluster (HPCC) Systems platform [
26]. Such platforms and the software ecosystem of applications developed around them provide complete solutions from data acquisition to data storage, analysis, and results delivery to the end user.
Apache Hadoop is an open-source Java-based framework developed for data storage and processing in a distributed environment on commodity hardware. The main components of Apache Hadoop are: the
Hadoop Distributed File System (HDFS): a distributed file system
facilitating storage and high-throughput access to massive-scale data; Hadoop YARN: a cluster resource management framework; Hadoop MapReduce: a system for parallel processing of data; and Hadoop Common: common utilities supporting other modules [
8]. In addition, a number of tools have been developed for different purposes, e.g., to efficiently load the data to HDFS (like Apache Flume [
6]), facilitate data storage and access (like Apache HBase [
9]), process and analyze the data (like Apache Flink [
5] and Apache Spark [
14]), and to maintain configuration (Apache Zookeeper [
17]).
The
HPCC System platform is an open-source data lake platform supporting different data workflow capabilities [
296]. Its main components are: Enterprise Control Language—a data-oriented declarative programming language; Thor—a bulk data processing cluster that cleans, standardizes, and indexes inbound data; and Roxie—a real-time API/Query cluster for querying data after refinement by Thor [
40]. It also uses a distributed file system for storing data in the cluster following a record-oriented approach [
241]. The indexed data available in Thor clusters can be used for low-latency querying by copying it to Roxie clusters, which have been specifically designed for much faster responses than the batch-oriented Thor cluster [
241,
296]. In addition, as in Apache Hadoop, data are collected using different data acquisition frameworks such as Apache Flume [
6], whereas in HPCC, a simple web service can be used to upload data to Thor clusters [
267].
A number of big data storage solutions have also been proposed. For instance,
Apache Ozone [
12] is a scalable, robust, distributed object store for big data applications. It is designed to handle large amounts of data consistently, providing HTTP interfaces for integration with third-party applications. Ozone is built on top of existing Hadoop components, such as Hadoop YARN, HDFS, and Hadoop Key Management Server, and leverages their capabilities and integrations [
262]. Ozone is also compatible with the existing Hadoop ecosystem, such as MapReduce, Spark, Hive, and Impala, and can be deployed alongside HDFS or as a standalone storage system. Apache Ozone in comparison with HDFS has several benefits. For example, HDFS has a single namespace which can become a major challenge for metadata operations. It does not support object-based protocols, such as
Simple Storage Service (S3) [
351], which are commonly used in cloud-native applications these days. Moreover, the fixed block size in HDFS can lead to inefficient storage space utilization and network overhead when it comes to small files. Apache Ozone supports the S3 protocol and implements a Hadoop Compatible File System to cater to different application needs and preferences. Ozone also provides a rich set of features, such as security, replication, fault tolerance, and monitoring [
351]. The fault tolerance of Ozone is ensured through its self-healing properties that allow it to recover from sudden node failures, making the data highly available. In addition, it is capable of supporting a hierarchical namespace, enabling the maintenance of data in multiple buckets and directories [
12].
Smart city services often need to analyze patterns of moving entities changing their location in time (like vehicles or mobile phone users) or extent as well (like the spread of epidemic diseases) [
158]. Such time-dependent geometries are called moving objects [
158], therefore, storage solutions should be equipped with the means to represent and query the dynamics of such data. Ilarri et al. [
180] categorize state-of-the-art support for moving objects into two categories:
Moving Object Databases (MODs) and data streams. However, they do emphasize that the boundary between these two groups is not always clear. MODs enhance database technologies with representation and management of moving objects [
158,
180]. When compared to early spatio-temporal databases, MODs also allow for tracking continuous changes [
158]. In particular, research has been conducted into models for tracking moving objects and corresponding query languages, handling uncertainty, and indexing that ensures a low update overhead and efficient retrieval of the objects; please refer to [
180] for details. Prominent examples of MODs that are in active development are MobilityDB [
376], extending PostgreSQL and PostGIS with the moving object support, and SECONDO [
249], an extensible database management system supporting various data models. The development of big data technologies has facilitated the storage and processing of traces of a large number of moving objects. A number of efforts exist nowadays to work with spatial and spatio-temporal big data [
342]. These range from equipping Apache Hadoop with support for spatial data, like data formats, spatial index structures, and spatial operations (SpatialHadoop [
38]), and spatio-temporal capabilities (ST-Hadoop [
39]), to more recent proposals enriching Apache Spark [
14] and distributed storage products with spatial or spatio-temporal capabilities. For instance, Apache Sedona [
13] extends Apache Spark [
14] and Apache Flink [
5] with a set of tools for working with large-scale spatial data in cluster computing environments. Beast [
134] is a Spark-based solution for exploratory data analysis on spatio-temporal data supporting a variety of data formats. GeoMesa [
24] provides a set of tools for large geospatial data analytics. For instance, it adds spatio-temporal indexing on top of Accumulo, HBase [
9], and Cassandra [
4] databases to store spatial data types like points, lines, and polygons. Stream processing is enabled there by having spatial semantics on top of Apache Kafka [
10].
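As a hedged example of the kind of query such extensions enable, the sketch below uses Apache Sedona’s Python bindings to ask which vehicles were inside a bounding box during a given hour; it assumes a Spark installation with the Sedona packages configured, a hypothetical CSV of pings, and may need adjustment across Sedona versions.

```python
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator

# Assumes a Spark setup with the Apache Sedona jars already on the classpath.
spark = SparkSession.builder.appName("trajectory-window-query").getOrCreate()
SedonaRegistrator.registerAll(spark)  # registers the ST_* spatial SQL functions

# Hypothetical CSV of vehicle pings with columns: id, ts, lon, lat.
pings = spark.read.option("header", True).csv("vehicle_pings.csv")
pings.createOrReplaceTempView("pings")

# Which vehicles were inside a bounding box during the morning peak hour?
result = spark.sql("""
    SELECT DISTINCT id
    FROM pings
    WHERE ST_Contains(
            ST_PolygonFromEnvelope(24.90, 60.15, 24.98, 60.20),
            ST_Point(CAST(lon AS DOUBLE), CAST(lat AS DOUBLE)))
      AND ts BETWEEN '2024-05-01 08:00:00' AND '2024-05-01 09:00:00'
""")
result.show()
```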
Graph databases enable efficient storage and processing of graph data models, which is often met in the smart city domain, e.g., road network. A graph data model handles varying granularity and hierarchical differences in data well; and enables evolvability, meaning that the graph can be extended to reflect changes in the application domain [
168]. Examples of solutions available to help store and work with graph data models in a largely distributed environment include Neo4j Graph Data Platform [
34] and the Apache Giraph [
7] processing system. Such solutions enable deploying graph data models on large clusters, if needed, and enable distributed graph processing by partitioning the data and processes between the nodes.
Data Processing. Most smart city applications rely on processing large amounts of data [
339]. Depending on the application’s requirements, this processing can be roughly divided into two groups: batch processing and stream processing.
Batch processing, often also called offline processing, takes a large amount of input data, runs a job to process it, and produces the output [
203]. It is clear that jobs in batch processing could take a while. Therefore, they are often scheduled to run periodically, like once a day. If we consider the big data landscape of methods and technologies, then the MapReduce programming model [
123], allowing processing of a large amount of data in a distributed manner, was the most popular approach, implemented also in the Apache Hadoop framework [
8]. A MapReduce job consists of Map and Reduce tasks. First, the input data are split into portions that are processed by Map tasks in a parallel manner. Then, the results of Map tasks are used by the Reduce tasks to compute the final output. It is also common for MapReduce jobs to be chained together into workflows so that the output of one job becomes the input to the next job [
203]. However, the Hadoop MapReduce framework, e.g., does not have direct support for workflows, so the chaining occurs explicitly via storing intermediate results in the file system. This has certain downsides, such as a waste of storage space when intermediate results get replicated, redundancy of some programming code in map tasks, and the inability to start subsequent tasks before the previous ones are completed [
203]. Dataflow engines have been developed that aim to solve these issues. They handle an entire workflow as one job rather than breaking it up into independent subjobs. Examples include Apache Flink [
5], Apache Spark [
14], and Apache Tez [
16].
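To make the Map and Reduce phases concrete, the following self-contained Python sketch emulates a MapReduce job that computes the average speed per road segment; in a real deployment the framework would distribute the work, and the grouping step here corresponds to the shuffle between the Map and Reduce tasks.

```python
from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs; here: (road_segment, speed).
    segment, speed = record
    yield segment, float(speed)

def reduce_phase(key, values):
    # Aggregate all values for one key; here: average speed per segment.
    return key, sum(values) / len(values)

records = [("A1", 52), ("A1", 47), ("B7", 31), ("B7", 28), ("B7", 35)]

# Shuffle: group intermediate pairs by key, as the framework would between Map and Reduce.
groups = defaultdict(list)
for record in records:
    for key, value in map_phase(record):
        groups[key].append(value)

results = [reduce_phase(key, values) for key, values in groups.items()]
print(results)  # e.g., [('A1', 49.5), ('B7', 31.33...)]
```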
Stream processing, also often called near-real-time processing, processes events shortly after they happen. Therefore, stream processing has lower delays. There are a number of cases when stream processing is required, such as anomaly detection, finding patterns, or simply streaming analytics. Basic terminology and the technologies required to get stream data to processing engines were already presented earlier in this section. Here, we cover approaches for stream processing. Generally, there are two ways to process stream data: one-at-a-time and micro-batching [
203]. For example, Apache Spark allows the use of a micro-batching approach [
14]. In this approach, the processing engine splits the input data into small micro-batches, processes them, and produces the micro-batches of the results. The one-at-a-time approach is implemented by Apache Storm [
15], for example.
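A minimal sketch of the micro-batching style using PySpark Structured Streaming is shown below; the socket source, the comma-separated message format, and the window and trigger lengths are illustrative assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("microbatch-demo").getOrCreate()

# Hypothetical socket source emitting "sensor_id,value" lines.
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

readings = lines.selectExpr(
    "split(value, ',')[0] AS sensor_id",
    "CAST(split(value, ',')[1] AS DOUBLE) AS reading",
    "current_timestamp() AS event_time")

# Average reading per sensor over 1-minute windows, recomputed on every micro-batch.
averages = readings.groupBy(window(col("event_time"), "1 minute"),
                            col("sensor_id")).avg("reading")

query = (averages.writeStream.outputMode("complete").format("console")
         .trigger(processingTime="10 seconds")   # one micro-batch every 10 seconds
         .start())
query.awaitTermination()
```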
Smart city applications are complex constructs fueled by diverse kinds of data. Therefore, hybrid approaches, combining both batch and stream processing, are often required. A number of architectural solutions to combine batch and stream processing have been suggested [
121]. For instance, the Lambda architecture incorporates a batch layer, a speed layer for computation on recent data (real-time views), and a serving layer, a specialized distributed database that allows querying the results of batch analyses (batch views). The query result is composed of both batch and real-time views [
235]. Another approach is the Kappa architecture [
212], which simplifies the Lambda architecture by removing the batch layer. This architecture relies on the use of a log-based system (e.g., Apache Kafka) able to retain all the data so that they may be reprocessed if needed. Then, we need to deal with only one type of system, and making changes means simply running a new instance of the job on the whole data, writing the results into a new table, and redirecting the application to read the results from this new table. The old job and the old results table can then be stopped and removed. The Liquid architecture [
138] incorporates incremental processing, therefore reducing re-computation from scratch. Davoudian and Liu [
121] discuss these and some other data system architectures (incorporating, e.g., Semantic Web technologies).
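The query-time composition of the Lambda architecture can be sketched as follows; the batch and real-time views are represented here as plain in-memory dictionaries, which is an illustrative simplification of the serving and speed layers.

```python
# Stand-ins for the serving layer (batch views) and the speed layer (real-time views).
batch_view = {"downtown": 1520, "harbor": 830}     # e.g., counts precomputed overnight
realtime_view = {"downtown": 35, "airport": 12}    # incrementally updated since the last batch run

def query(zone):
    # A query merges the (possibly stale) batch view with the recent real-time view.
    return batch_view.get(zone, 0) + realtime_view.get(zone, 0)

print(query("downtown"))  # 1555
print(query("airport"))   # 12
```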
Data Governance. Data governance refers to the overall management of the availability, usability, integrity, and security of a platform’s data assets. In the context of data management, governance covers aspects related to data access control, metadata, the data lifecycle, data usage, and regulation compliance [
52]. It involves defining and implementing policies, standards, and procedures to ensure that the data are properly managed, compliant with regulations, e.g., the General Data Protection Regulation, and protected throughout its lifecycle. Data governance sits on top of other aspects of data management, i.e., acquisition, storage, processing, and analysis, and addresses the above-mentioned challenges.
A well-defined data governance framework is critical to ensuring compliance with existing data regulations and potential updates or modifications in real time [
198]. A reliable governance framework can also enable evidence-based auditing and granular reporting to the data controllers and data processors, especially in situations requiring legal examination. Additionally, data lifecycle management offers several advantages, including:
—
Enhanced Agility and Efficiency: By ensuring that useful, accurate, and relevant data are readily available to recipients, data lifecycle management increases the agility and efficiency of data handling.
—
Robust Data Protection Infrastructure: A well-implemented data lifecycle management system guarantees a strong data protection infrastructure, contributing to overall data security.
—
Automation Feasibility: There is the potential to automate data management processes, leading to significant savings in terms of human resources and time.
Once data are created at the source, they go through various stages during their lifetime. These stages include collection, ingestion, storage, access, alteration, archival, and destruction [133]. Various challenges exist when handling data governance at each of these stages in a smart city environment. For example:
—
Data Ownership and Sharing: Smart city platforms involve multiple stakeholders, including government agencies, private companies, and citizens. Clarifying data ownership and sharing policies is crucial to avoid conflicts and ensure that data are shared in a transparent and fair manner [
143,
187].
—
Mismatch Between Organizational Structures: This may lead to data silos, duplications or lack of control as smart city platforms often involve multiple systems and data sources that may not be integrated [
187]. To resolve such issues, organizations must have robust and standardized governance models across the entire data lifecycle, e.g., using the 4I framework [
118].
—
Interoperability and Data Quality: Smart city platforms rely on high-quality data to make informed decisions and enable intelligent automation. However, data quality can be affected by factors such as data entry errors, duplication, and data inconsistency. As discussed in
Section 4.2, ensuring data quality can be challenging, particularly when the data are generated from multiple sources and come in multiple formats [
72].
—
Data Access Management: Ensuring that data access policies are enforced consistently and efficiently requires a comprehensive automated access management system that includes authentication, authorization, and audit trails. This is challenging in large-scale smart city platforms with multiple stakeholders, and smart solutions are needed to address the challenges, e.g., by using an automated smart-contract driven framework [
366].
The dynamic and distributed nature of modern smart city platforms emphasizes the necessity of comprehensive data governance through the identification of each stage in the data lifecycle and the appropriate application of relevant controls, policies, and regulations. Identifying tags and metadata linked to each stage of the data lifecycle is also an essential requisite. This meticulous identification and tagging process by administrators of smart city platforms (see
Section 3.3) will not only contribute to effective data management but will also ensure adherence to specific regulations governing each phase of the data’s lifecycle [
196]. Tools such as Apache Atlas
19 or DataHub
20 provide frameworks to manage metadata and tags and enable enterprises to effectively and efficiently meet their compliance obligations. As an example, some of the above lifecycle stages can carry, but are not limited to, the following tags (a brief sketch of such tagging follows the list):
—
Data Collection: source, timestamp, collection region, owner, data format, unit of measurement, and description.
—
Data Ingestion: whether the data are encrypted, anonymized and/or transformed, encryption algorithm, and quality status.
—
Data Storage: timestamp, cloud provider, retention policies, storage format, storage locations, access point, and checksum.
—
Data Access: duration or scope, user ID, role, access type, access timestamp, and date of modification.
—
Data Deletion: deletion method, expiry date, destruction timestamp, retention policy, confirmation, and reporting.
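As an illustration of how such lifecycle tags might be attached to a dataset, the following sketch uses plain Python data classes; the field names follow the example tags above and do not correspond to any particular catalog schema (e.g., that of Apache Atlas or DataHub), and the sensor and department names are invented.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LifecycleRecord:
    stage: str                       # e.g., "collection", "ingestion", "storage"
    tags: dict
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

dataset_metadata = [
    LifecycleRecord("collection",
                    {"source": "air-quality-sensor-17", "owner": "city-env-dept",
                     "data_format": "json", "unit": "ug/m3"}),
    LifecycleRecord("ingestion",
                    {"encrypted": True, "anonymized": False,
                     "encryption_algorithm": "AES-256", "quality_status": "validated"}),
    LifecycleRecord("storage",
                    {"cloud_provider": "example-cloud", "retention_policy": "5y",
                     "storage_format": "parquet", "checksum": "sha256:..."}),
]

for record in dataset_metadata:
    print(record.stage, record.tags)
```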
Once these stages and associated tags are identified, an efficient management mechanism can be developed for smart city platforms [
281]. Policy engines, such as Apache Ranger,
21 can be employed to implement data lifecycle policies. Such engines and solutions should comprise the following essential components:
—
Policy Manager: Maintains a comprehensive list of data regulations and policies that a smart city provider is required to comply with, when handling the user applications and data.
—
Auditor: Records events occurring during the data lifecycle and maintains a track of these events for auditing by internal or external third-party auditors.
—
Policy Enforcer: Applies the policies and regulations stored in the policy manager to user data stored or processed in the platform. Enforcers can be configured as plugins that run on top of data processing or storage components.
By combining these three elements, a solid foundation for trustworthy data governance can be established for smart city platforms.
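A minimal sketch of how these three components could interact is given below; it is a hypothetical Python illustration and does not reflect the APIs of Apache Ranger or any other policy engine, and the roles, tags, and rules are invented for the example.

```python
from datetime import datetime, timezone

class PolicyManager:
    """Maintains the regulations and policies the platform must comply with."""
    def __init__(self):
        # Simplified rule set: role -> dataset tags the role is allowed to read.
        self.read_rules = {"traffic-analyst": {"traffic", "public"},
                           "health-officer": {"health", "public"}}

    def allowed(self, role, dataset_tag):
        return dataset_tag in self.read_rules.get(role, set())

class Auditor:
    """Records events occurring during the data lifecycle for later auditing."""
    def __init__(self):
        self.log = []

    def record(self, role, dataset_tag, decision):
        self.log.append({"time": datetime.now(timezone.utc).isoformat(),
                         "role": role, "dataset": dataset_tag, "decision": decision})

class PolicyEnforcer:
    """Applies stored policies to every data access, like a storage-layer plugin."""
    def __init__(self, manager, auditor):
        self.manager, self.auditor = manager, auditor

    def read(self, role, dataset_tag, data):
        decision = "allow" if self.manager.allowed(role, dataset_tag) else "deny"
        self.auditor.record(role, dataset_tag, decision)
        return data if decision == "allow" else None

enforcer = PolicyEnforcer(PolicyManager(), Auditor())
print(enforcer.read("traffic-analyst", "traffic", {"speed_kmh": 42}))  # allowed
print(enforcer.read("traffic-analyst", "health", {"heart_rate": 80}))  # denied -> None
```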
4.4 Data Analysis
Data analysis is a key enabler when it comes to finding knowledge about how citizens and smart city operations function and interact, and for discovering unknown patterns and potential for optimization. Often, the data enabling such analysis come from the ICT infrastructure of cities. Larger cities, as well as wealthier communities, are teeming with ICT technologies. However, diversity and inequality in sensing and communication infrastructures exist within and between cities. These issues further complicate ordinary data analytics pipelines in the smart city context, see
Figure 4. For discussion purposes, we have organized the data analysis challenges in the context of smart cities into four categories. Trustworthiness challenges concern the reliability, confidence, and truthfulness of data analytics results. Technological challenges include tools and platforms enabling the analysis of smart city big data streams. Methodological challenges include the development of methods and algorithms to treat particular aspects of urban data, such as how heterogeneous data can be fused for analysis. Finally, Ethical challenges explore issues arising from the rapid equipping of cities with ICTs, like data privacy. However, we believe that this categorization of challenges is also relevant to other application domains.
Trustworthiness Challenges. Data analysis requires research to satisfy certain validity criteria, which, in turn, can be compromised by biases and challenges arising either from decisions and choices made along the data processing pipeline or from circumstances over which the researcher has no control [
256].
Table 7 provides comments for addressing trustworthiness challenges.
As already discussed in
Section 4.1, data are not neutral and contain bias, since decisions are involved regarding what, how, when, by whom, and for what purpose the data have been measured or retrieved [
200,
201]. When, e.g., social data are used for analysis, data platforms may even have embedded functional and normative biases stemming from the possible ways of interacting with the system, expectations of acceptable behavioural patterns, and so forth [256]. Moreover, some data collection methods may favor certain kinds of communities over others, e.g., the use of mobile applications for reporting certain city issues, resulting in digital divide issues where not all communities are equally represented [252]. Therefore, the data themselves become biased, i.e., contain "systematic distortion in the sampled data that compromises its representativeness" [256]. In addition, in smart cities, data used for analysis were often originally generated for other purposes, e.g., the use of mobile phone data to identify mobility patterns [350]. Therefore, a clear understanding is needed of the problem at hand and the data to be selected for analysis, as well as of possible bias risks. One way to help in this direction is to support proper documentation of the data source and the dataset itself, clearly stating the purpose, phenomena, means, and limitations of the data collection and its subsequent use in the analysis. One example is datasheets, proposed by Gebru et al. [
149], accompanying each dataset and documenting its motivation, creation, intended uses, and other relevant information.
Challenges also arise when moving the data through the data processing pipeline. For instance, data cleaning, enrichment, and aggregation procedures may significantly affect the dataset content, structure, or representation [
256]. For example, decisions should be made on what to consider as an outlier and how to treat missing data. Manual data annotation is also prone to errors and subjectivity. Therefore, the quality of data should be assessed at each step [
74].
The data analysis methodology should be adequate for the goals of the research. Moreover, expertise and thoroughness are required in both method selection and results evaluation and interpretation [
256]. Algorithms, similarly to data, can be biased. For example, this may happen because biased data, or too little good-quality data, are used for their construction, or due to design choices based on a current or limited understanding of the phenomena. Koene et al. [205] highlight four potential sources of unfairness in algorithmic systems: biased values in the design (e.g., favoring one feature over another), biased training data, biased data (if the resulting algorithm works with problematic data), and inappropriate implementation of an algorithmic system. Since algorithms are becoming more and more integrated into human lives, appropriate measures must be in place to ensure their fairness, trustworthiness, and impartiality. However, how to assess the fairness of an algorithm is still an open research question [
291,
378]. Research into discrimination-aware ML and data mining has emerged to discover and prevent possible discrimination (“adversary treatment of people based on belonging to some group rather than individual merits” [
378]). For example, solutions have been proposed to prevent discrimination by either pre-processing the training data, model post-processing, or model regularization [
378]. Furthermore, transparency and accountability are considered to be promising tools to achieve algorithmic fairness [
205].
Efforts exist on different levels addressing particular algorithmic practices in legislation [
205]. For example, the Automated Decision Systems Task Force was established in New York City to develop recommendations for the use and policy regarding automated decision systems helping agencies and offices in urban decision-making [
50]. Additionally, expert groups and initiatives have been established to acknowledge the importance of dealing with ethical concerns of algorithmic systems. For instance, the Ethics Guidelines for Trustworthy AI have been published by the High-Level Expert Group on AI, a panel established by the EU [
25]. These guidelines present a comprehensive framework with an emphasis on ethics and robustness, aiming to attain reliable and trustworthy AI [
25]. The
Institute of Electrical and Electronics Engineers (IEEE) Global Initiative on Ethics of Autonomous and Intelligent Systems [
27] aims to support stakeholders involved in the development of autonomous systems in the ethical implementation of intelligent technologies. This initiative also works on the IEEE P70XX series of standards to put the ethical principles discussed by the initiative into practical guidelines. For example, IEEE P7003 “Algorithmic Bias Considerations” aims to provide a framework that helps developers to identify and mitigate biases in the outcomes of the algorithmic system [
206]. Working groups of the AI Subcommittee within ISO and International Electrotechnical Commission Joint Technical Committee (ISO/IEC JTC 1/SC 42) are examining the entire AI ecosystem, involving also aspects of AI trustworthiness [
30].
Technological Challenges. A number of technological challenges and solutions to support both batch and stream data analysis were already discussed in
Section 4.3. Therefore, we refer the reader to the Data Processing paragraphs of this section for details.
Methodological Challenges. A number of great surveys exist on the methods for urban data analysis, like heterogeneous data source fusion, methods to treat data sparsity issues, data analysis, and data visualization approaches [
103,
244,
339,
369,
371,
372]. Therefore, we do not repeat such works here and only provide a brief summary of selected methods in
Table 8. The actual landscape of methods used in the urban data analytics pipeline is, of course, much larger, and readers are advised to consult the original publications for details.
Instead, here we would like to discuss a few important aspects that are usually less discussed in data surveys in the context of smart cities: knowledge transfer and adaptation to real-world changes.
As we have discussed at the beginning of this section, cities (and even regions of cities) vary in the data available. Therefore, due to unique data characteristics and data scarcity or insufficiency issues, the knowledge gained in one urban place cannot be directly applied to another.
Humans can recognize and apply relevant knowledge and skills from experience to learn new tasks in new situations. For example, a person who can already play one musical instrument can learn to play another one much faster than a person who has never played any instrument before [
353]. However, it is challenging to design a computer system able to apply the acquired knowledge and skills to a new, not seen before, task. Moreover, traditional ML and data mining technologies have the assumption that both training and future data come from the same input feature space and have the same distributions [
353]. However, this is often not the case in the real world, as it might be expensive, time-consuming, or difficult to obtain training data that match the feature space and distribution of the test data. For instance, an activity recognition system may be developed for one person but used by another person with different sensors [
353], or some sensing capabilities may simply be unavailable in an urban space [
369,
371]. For such real-world examples, it is essential to utilize the already existing knowledge in new situations.
In urban computing, for instance, some knowledge could be derived from one city and partially reused in a city that does not possess as much data. Clearly, we cannot directly transfer an inference model learned from the source city's data, as the variables of interest in the target city could differ in their availability and characteristics. However, relations discovered in one city could hold for the city of interest, and this information could be useful for the problem at hand [
371].
To achieve knowledge sharing between urban spaces,
transfer learning methods can be utilized. Transfer learning methodology transfers knowledge between domains [
352,
353,
361,
372,
375]. Examples of implementing transfer learning in urban computing can be found for traffic and human mobility prediction [
189,
191], points of interest recommendation [
127], and for optimizing locations [
177,
223].
Pan and Yang [
260] and Zheng et al. [
372] provide a great introduction to the topic and taxonomy of transfer learning methods in general. When discussing what could be transferred in urban computing scenarios, Yang et al. [
361] suggest three general categories: cross-modality transfer, cross-region/cross-city transfer, and cross-application transfer.
Cross-modality transfer here refers to the situation when some data modality is missing from the region of interest but is present in another one. Therefore, it would be useful if the information about the modality of interest could be inferred and used to improve the performance of the target application.
Cross-region/cross-city transfer refers to the cases when the knowledge from data-rich cities is applied for the same or similar application in another city.
Cross-application transfer refers to the cases when the knowledge is retrieved from some existing related application for which the data are available [
361].
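As a simple illustration of cross-region/cross-city transfer, the following sketch trains a model on synthetic data from a data-rich source city and reuses its predictions as an additional feature for a small model fitted on scarce target-city data. The features, the synthetic data, and this particular transfer scheme are illustrative assumptions, not a method prescribed by the cited works.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Source city: plenty of labelled samples (features: hour of day, road density).
X_src = rng.uniform([0.0, 0.1], [24.0, 1.0], size=(5000, 2))
y_src = 30 + 2.0 * X_src[:, 1] * (24 - np.abs(X_src[:, 0] - 8)) + rng.normal(0, 1, 5000)

# Target city: only a handful of labelled samples, with slightly different dynamics.
X_tgt = rng.uniform([0.0, 0.1], [24.0, 1.0], size=(40, 2))
y_tgt = 25 + 1.6 * X_tgt[:, 1] * (24 - np.abs(X_tgt[:, 0] - 9)) + rng.normal(0, 1, 40)

# Model learned in the data-rich source city.
source_model = GradientBoostingRegressor().fit(X_src, y_src)

# The target model uses the source model's prediction as an extra feature,
# so relations learned in the source city are reused where they still hold.
X_tgt_aug = np.column_stack([X_tgt, source_model.predict(X_tgt)])
target_model = Ridge().fit(X_tgt_aug, y_tgt)

# Predicting congestion for a new location/time in the target city.
X_new = np.array([[8.0, 0.7]])
X_new_aug = np.column_stack([X_new, source_model.predict(X_new)])
print(target_model.predict(X_new_aug))
```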
The research community suggests different methods to facilitate knowledge transfer in urban computing [
361,
371,
372]. However, there are also certain challenges. First, urban computing has some unique characteristics, such as heterogeneous data modalities and spatio-temporal patterns and relationships [
352]. Also, Yang et al. [
361] emphasize the challenges of finding an appropriate source domain, the application-specific linking of source and target domains that requires expertise with different methods, data privacy-preserving issues becoming more common in urban computing, and assessing transferability, that is, quantitatively measuring the possible gain from applying transfer learning methods to particular source and target domains.
Cities are also living constructs that constantly change, making previously gained knowledge obsolete. For example, deployed ML models can degrade in performance due to changes in the data that occur for different reasons, e.g., because of changes in the physical environment where a sensor was deployed. This implies that systems developed for smart cities should be able to detect such changes and adapt themselves to maintain adequate performance. Such issues are related to adaptive learning and concept drift adaptation [
145,
225,
240]. Here,
concept drift refers to the phenomenon in which the data distribution changes over time in a dynamically changing and non-stationary environment, and
adaptive learning means updating ML models on the fly in response to concept drift [
145]. A large number of approaches have been suggested to deal with concept drift [
225,
240]. However, there are still a number of challenges. For instance, when we deal with large data systems relying on a number of data streams, it could be the case that drift occurs across multiple data streams [365]. A further challenge is dealing with the multiple types of concept drift that can occur in the real world [
105]. Distributed ML, like federated learning, poses certain challenges for handling concept drift [
95]. Concept drift detection research is also not well represented for non-traditional data streams, such as when the data are represented as a graph [
265]. Finally, there is still not much research on concept drift within unsupervised or semi-supervised settings [
145,
225].
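The following sketch illustrates a simple error-rate-based drift monitor, a much-simplified variant of the detectors discussed in the concept drift literature; the window size, margin, and retraining hook are illustrative assumptions rather than recommended settings.

```python
from collections import deque

class DriftMonitor:
    """Flag drift when the recent error rate exceeds a reference rate by a margin."""
    def __init__(self, window=200, margin=0.1):
        self.errors = deque(maxlen=window)
        self.baseline = None
        self.margin = margin

    def update(self, prediction, actual):
        self.errors.append(int(prediction != actual))
        if len(self.errors) < self.errors.maxlen:
            return False                      # still filling the first window
        error_rate = sum(self.errors) / len(self.errors)
        if self.baseline is None:
            self.baseline = error_rate        # first full window used as reference
            return False
        return error_rate > self.baseline + self.margin

monitor = DriftMonitor()
# In a deployed pipeline, feed each (prediction, observed label) pair:
#     if monitor.update(y_pred, y_true):
#         retrain the model on recent data (the adaptive learning step)
```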
Ethical Challenges. It is clear that privacy is one of the major concerns in the smart city context [
96,
200]. When we talk about data processing, one approach to preserving privacy is to reduce the amount of data to be transmitted and to carry out data analyses on the nodes holding the data themselves. For example, edge computing suggests analyzing the data in proximity to where they are collected, thereby supporting privacy [
197,
270]. There are also ML approaches that allow learning the model in a privacy-preserving way, for example, federated ML [
360]. However, there are certain challenges as well when we deal with distributed ML and edge intelligence, like data scarcity and consistency on edge devices and slow performance of collaborative learning tasks [
357]. Here, we would like to discuss ethical challenges rather than survey methods and technologies for distributed data analysis; for such information, the interested reader could refer to, e.g., [
344,
357,
360]. Therefore, we explore data-related ethical concerns in
Section 4.5, data privacy in particular in
Section 4.6, and measures to secure data in
Section 4.7.
4.5 Ethics
At face value, the stated goals of smart cities—improved quality of life, ecological sustainability, and so forth—are highly ethical ones. However, concerns have been raised that these goals may be pursued at the cost of harmful side effects, and/or in such a way that some groups are excluded from enjoying the benefits. As noted above, there is no clear consensus on what exactly constitutes a smart city, and as a result of this, compiling a comprehensive and systematic presentation of the ethics of smart cities is a considerable challenge. However, in recent years there have been several attempts to conceptualize and categorize the various ethical concerns relevant to smart cities.
Calvo [
94] identifies hyperconnectivity, algorithmization and datafication as key aspects of urban digital society, along with eight major ethical implications of these aspects: intrusion of privacy, social and economic exclusion, misuse of data, bias in decision-making algorithms, obsolescence of human skills and labor, dissolution of responsibility for decisions, objectification of human beings, and the imposition of technology on people. Goodman [
154] names three challenges to the democratic governance of smart cities: privatization of functions (e.g., planning) and assets (e.g., data) traditionally held to belong to the public sector, the conception of cities as platforms offering service providers access to public data, and a loss of autonomy through e.g., technology failure or vendor lock-in. Based on a review of the literature on smart city ethics, Ziosi et al. [
377] establish four dimensions that are invariant across multiple smart city definitions and give rise to ethical concerns; these are the network infrastructure, post-political governance, social inclusion, and sustainability.
In a synthesis of the above three articles, focusing specifically on ethical concerns related to the collection and use of data in smart cities, two major themes emerge:
—
Techno-Centric vs Human-Centric Smart Cities: adopting a techno-centric and techno-optimistic approach to smart city building can lead to emphasizing technological capabilities over human needs and optimizing relatively easy-to-quantify metrics such as economic efficiency over more elusive ones such as livability.
—
Public vs Private Control of Resources and Processes: as decisions related to city planning and governance are increasingly determined by data and algorithms, power over these decisions is increasingly being transferred from elected representatives and public authorities to private businesses that control the data and provide the algorithms.
From the perspective of data, perhaps the most obvious ethical issue involving smart cities is privacy. Regardless of the definition, the collection and processing of digital data in large quantities is one of the characteristic features of a smart city, and a significant portion of this will be the personal data of the citizens. As pointed out by König [
209], collecting ever more data is in contradiction with the principle of data minimization, and even if the data are rendered impersonal through anonymization, they may still become a threat to individuals’ informational autonomy through re-identification or re-purposing. A smart city is thus effectively a vast surveillance system with no feasible informed consent or opt-out mechanism available, and while the intention of the surveillance may be benign, the data flows involved may be so complex that the technical and legal safeguards in place are not enough to guarantee the security and privacy of the data. If some of the data are controlled by companies, these may have an interest in exploiting them commercially, exacerbating the risk to privacy; by partnering with such companies, the city is effectively acting as an enabler for what Zuboff [
380] has termed surveillance capitalism.
An archetypal example of surveillance technology is the surveillance camera. Combined with modern AI techniques, the footage captured by such a camera is no longer merely something to be viewed by a human authority after some kind of incident has occurred, but acts as input data to ML algorithms for purposes such as facial recognition.
Facial Recognition Technology (FRT) has a variety of security-related applications that can be argued to enhance safety in the city, but their civil rights implications cannot be ignored; in addition to the privacy issues, facial recognition systems have been observed to be prone to racial bias where people belonging to certain groups are more likely to be misidentified than others [
98], leading to concerns about the social justice impact of using biased algorithms for policing. Similar concerns have been raised about predictive policing systems, although there have been few independent empirical evaluations of the fairness of such systems and these have not produced clear evidence linking them to increased discrimination [
59]. In contrast, the use of FRT in policing has been found to contribute to greater racial disparity in arrests, although this cannot be simply attributed to algorithmic bias as the sole explanatory factor [
192].
Altogether, seven different types of harmful bias (or “sources of downstream harm”) in ML are identified in [
328]: historical, representation, measurement, aggregation, learning, evaluation and deployment bias. If these are not identified and eliminated, increased reliance on data and algorithms in smart city decision-making will result in decisions whose fairness is questionable. There is an issue with the transparency of the decisions as well, since the explainability of some popular ML techniques is poor [
80,
91] and the implementations may be guarded as trade secrets by their vendors, making it difficult, if not impossible, to subject them to thorough external auditing. Furthermore, there are accountability implications if the transition from traditional to smart city means that governance decisions are increasingly determined by data through opaque computational processes, since there is then a risk that responsibility for the decisions will become detached from traditional democratic processes.
Ziosi et al. [
377] use the term “post-political” to describe the increasing role of private organizations and automated decision-making in smart city governance. Besides the issues identified above, another problematic aspect of this is that underneath the ostensible objectivity and rationality of post-political governance through data and algorithms, the selection and prioritization of optimization targets is inherently political, since these reflect the values of the smart city. Goodman [
154] captures this by conceptualizing smart cities as digital platforms where the pursuit of efficiency may sideline other important values. Furthermore, they point out that decisions regarding what data to collect and how are also political, and if some groups of citizens are not adequately represented by the data, the members of such groups are at risk of being excluded from the benefits of the smart city. The people most likely to be excluded are those who are affected by existing digital divides and, therefore, are already at a disadvantage [
94,
154,
377].
Various authors have criticized the focus on technology and efficiency in smart cities and have advocated a more human-centric approach. Biloria [
83] puts this idea succinctly by introducing the concept of an empathic city. Human-centric models for smart city data governance are discussed in [
209] and [
264]. The MyData Global Network is advocating more human-centric governance of personal data in general; its guiding principles are codified in the MyData Declaration [
273], which calls for a transition from formal rights to actionable ones, from data protection to data empowerment, and from closed ecosystems to open ones. Several examples of cities pioneering initiatives aligned with the MyData principles are given by Lähteenoja and Sepp [
215].
In the literature, the concept of a smart city is frequently paired with that of a smart citizen. In terms of having an established definition, the latter is even more elusive than the former, but insofar as a smart city is one that emphasizes human values and needs over technological capabilities, a key aspect of smart citizenship is empowerment. From a data perspective, a human-centric smart city is thus one that not merely protects the data of its citizens but empowers them to use data to advance their personal values and goals and to participate in the definition of new data-based services. Technological innovation is a necessary enabler for this, but it is also necessary to ensure that the citizens have a sufficient level of data literacy to take advantage of the opportunities presented by smart city technology, lest this become another divide where some people are excluded from enjoying the benefits of the smart city. Proposed solutions are scarce in the literature, but the Urban Data School initiative described in [
355] is aimed at this exact purpose in the context of the Milton Keynes smart city project in the UK.
Table 9 presents a summary of the ethical considerations involved in addressing the data challenges of smart cities. Three relevant smart city aspects are identified here:
smart city goals, referring to the determination of the objectives and underlying values of the smart city;
smart city governance, referring to how decisions are made in the planning and operation of the smart city; and
smart city life, referring to how the everyday life of the individual citizen is transformed in the smart city. Associated with each of these aspects is an opportunity for betterment, and associated with each opportunity are risks arising from the central dichotomies identified above, techno-centric vs human-centric and public vs private.
4.6 Data Privacy
New and emerging technologies are promoting the development of an ecosystem for connected places within smart cities, but at the expense of a rapidly widening threat landscape. Attacks against smart infrastructure and privacy have made it clear in recent years that the demands of the smart city transformation, including data collection and processing needs, face significant multi-level governance requirements, such as the need for more transparency, accountability, and security and privacy [
171,
349]. Prior work has focused on establishing comprehensive threat modeling tools and conceptual frameworks to better protect smart cities, as well as describing threat actors, their
tactics, techniques, and procedures (TTPs), and how to mitigate attacks against connected things and places [
227].
Smart City Threat Modeling. Threat modeling is a method for systematically identifying various types of threat actors, attack vectors, and mitigation actions against malicious activities that may harm applications, networks, or other computer systems [
305]. Smart cities have unique cyber risks that span many vertical sectors and industries such as energy, transportation, healthcare, education, and public services. Particularly, the increased interconnection of devices and systems generates new challenges for city security management that go beyond conventional security issues. The four innovations listed below are expected to have, or already have, a significant impact on cyber risks in connected cities [
261]: (1) convergence of IT and Operational Technology, (2) the interoperability of new and old systems, (3) the integration and fusion of services, and (4) the proliferation of AI and automation [
195].
Against this backdrop, more research on smart city threat modeling is now available, with the goal of developing approaches and tools for better assessing system vulnerability and adopting cyber-security analytics [
106,
115,
144,
194,
356,
379]. Similarly, municipal, regional, and national governments are becoming more proactive in their legislative approaches to smart city threat modeling, allowing for a more focused and concentrated approach to smart and connected city cyber security [
171,
227]. An indicative example of this trend is the threat model for future smart cities developed by the
European Union Agency for Network and Information Security (ENISA) covering the healthcare and public transport sectors [
136,
220]. Similar efforts have been noticed in the respective national cyber-defense authorities across the globe [
113,
171,
288]. The subsections below introduce the high-level components of the threat modeling approaches for smart cities.
Threat Actors. Cyber Threat Actors (CTA) are responsible for a considerable number of threats to smart cities [
219]. These are groups or individuals who engage in malicious activities that intentionally aim to harm infrastructure for monetary or other gains. CTA groups are frequently divided into the following categories according to their underlying motives, goals, and known affiliations: (1) cybercriminals, (2) insiders, (3) nation-states, (4) hacktivists, (5) terrorist organizations, and (6) script kiddies [
298]. Among these threat actors, nation-state actors, also known as
Advanced Persistent Threats (APTs), are regarded as the most dangerous and stealthy operators [
219]. The MITRE corporation, which curates one of the most widely accessible knowledge bases of adversary tactics and techniques, currently lists about 135 APT groups and associates (i.e., threat groups, activity groups, and threat actors) that share similar methodologies (i.e., TTPs) and operate in different geographical regions [
242].
Attack Vectors. While the classification of threat actors can help analysts determine the magnitude of a threat, smart city threat modeling also requires prior knowledge
vis-à-vis the initial origin (i.e., the tail) of the attack vectors. Namely, the approach developed by ENISA considers two broad conditions that pivot around the perceived intentionality of a threat [
220]. These are distinguished between threats from
intentional attacks and threats from
accidents. In the context of public transport systems, intentional attacks can include the following: (1) eavesdropping and sniffing, (2) theft, (3) tampering and alteration, (4) unauthorized use and access, (5) distributed denial of service, (6) loss of reputation, and (7) ransomware. In addition to the threats that might be caused by certain individuals or groups, there is also the possibility of threats being caused by accidents, including: (1) hardware failure and/or malfunctioning, (2) operator or user error, (3) end of support or obsolescence, (4) electrical and frequency disturbance or interruption, (5) acts of nature, and (6) environmental incidents. Evidently, in the context of smart and connected people, places, and things, attacks against data (intentional or accidental) are the most common security threat and can inevitably erode privacy.
Data Privacy Models. Data privacy and confidentiality in the smart city are a multifaceted problem because data aggregation is used to form links, which makes anonymity difficult to accomplish [349]. In addition to the attack vectors outlined above, research in recent years has concentrated on numerous risks to data privacy. Cyber threats against privacy often take aim at: (1) personally identifiable information, comprising personal attributes such as Social Security Numbers that uniquely identify a person; and (2) quasi-identifying attributes, comprising a combination of attributes, such as name, age, and address, that, when combined with external information, may be used to re-identify all or part of the respondents to whom the information pertains. In particular, prior works have looked at various types of information disclosure that can lead to a privacy breach (i.e., to re-identification) [
368].
—
Identity disclosure happens when an adversary achieves the correct mapping of microdata (i.e., individual population unit records files) from a database to an existing real-life entity [
229].
—
Attribute disclosure occurs when the adversary is able to deduce more accurately any additional features of a person from the information accessible in the disclosed data [
331].
—
Inferential disclosure occurs when the attacker can infer or more accurately determine the confidential value of a variable in a dataset by comparing the statistical properties of the released data to the information available [
304].
—
Social link disclosure occurs when an attacker can re-identify a hidden relationship between two users that may lead to identity, attribute or inferential disclosure [
368].
—
Affiliation link disclosure happens when the adversary can determine that a person is affiliated to a specific group, resulting in a higher risk that may lead to identity, attribute or social link disclosure [
368].
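The following synthetic sketch shows how quasi-identifying attributes can lead to identity disclosure: a supposedly de-identified mobility record is linked to a public register by joining on zip code, birth year, and sex. All records, names, and column labels are synthetic and used only for illustration.

```python
import pandas as pd

# "De-identified" mobility records released without names.
deidentified = pd.DataFrame([
    {"zip": "00150", "birth_year": 1987, "sex": "F", "trips_per_day": 6},
    {"zip": "00530", "birth_year": 1962, "sex": "M", "trips_per_day": 2},
])

# Publicly available register containing the same quasi-identifiers plus names.
public_register = pd.DataFrame([
    {"name": "A. Example", "zip": "00150", "birth_year": 1987, "sex": "F"},
    {"name": "B. Sample",  "zip": "00530", "birth_year": 1962, "sex": "M"},
])

# Joining on the quasi-identifiers re-attaches identities to the released data.
linked = deidentified.merge(public_register, on=["zip", "birth_year", "sex"])
print(linked[["name", "trips_per_day"]])
```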
With the advent of data-driven sectors in smart cities (e.g., healthcare, transport, and governance), protecting data privacy without compromising the utility of the collected data has become a conundrum. A number of privacy-preserving algorithms and models have been proposed to address the various information disclosure risks, including k-anonymity [
331], l-diversity [
229], t-closeness [
221], and differential privacy [
130]. Similarly, solutions exist toward achieving trajectory privacy [
107,
190]. These algorithms leverage anonymization methods including
generalization, suppression, anatomization, bucketization, permutation, and perturbation [
329].
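As an example of one of these models, the following sketch implements the Laplace mechanism commonly used in differential privacy for a counting query; the epsilon value and the commuter-count scenario are illustrative assumptions.

```python
import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0, rng=np.random.default_rng()):
    # A counting query changes by at most 1 when one individual is added or
    # removed, so its sensitivity is 1; a smaller epsilon means stronger
    # privacy and therefore more noise added to the released result.
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# e.g., releasing the number of commuters passing a road-side sensor in one hour
print(laplace_count(true_count=1342, epsilon=0.5))
```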
Figure 5 presents an overview of the cybersecurity challenges, threat landscape, and privacy issues encountered by smart cities.