This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2020.3007869, IEEE Internet of Things Journal
IEEE INTERNET OF THINGS JOURNAL, VOL. XX, NO. XX, XXX 20XX 1
Abstract: Recently, the development of the Internet of Things (IoT) has brought plenty of opportunities and challenges in various fields. As an essential part of IoT, machine-to-machine (M2M) communications open a novel way in which machine-type communication devices (MTCDs) are connected and communicate without any human intervention. Meanwhile, delay-tolerant data plays an important role in M2M communications-based IoT, and it puts more emphasis on powerful data caching, computing and processing, as well as on the security and stability of data transmission. To meet these requirements in M2M communications networks, in this paper we introduce promising technologies such as edge computing and blockchain, and propose a joint optimization framework for caching, computation and security for delay-tolerant data in M2M communications networks based on a dueling deep Q-network (DQN). Through the dynamic decision process of the DQN, the optimal selection of caching servers, computing servers and blockchain systems can be made to achieve the maximum system rewards, which include higher efficiency of data processing, lower network costs and better security of data interaction. Extensive simulation results with different system parameters show that our proposed framework can effectively improve the system performance for blockchain-enabled M2M communications compared to the existing schemes.

Index Terms: Machine-to-machine communications, edge computing, edge caching, blockchain, dueling deep Q-network.

Copyright (c) 20xx IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.

This work was jointly supported in part by the National Natural Science Foundation of China under Grants 61901011 and 61671029, the China Postdoctoral Science Foundation under Grant No. 2018M640032, the Beijing Postdoctoral Science Foundation under Grant No. ZZ2019-73, the Chaoyang District Postdoctoral Science Foundation under Grant No. 2019ZZ-4, and the International Cooperation Seed Foundation of Faculty of Information Technology, Beijing University of Technology. (Corresponding author: Pengbo Si.)

Meng Li, Pengbo Si, Wenjun Wu, and Yanhua Zhang are with the Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, P.R. China (e-mail: limeng720@bjut.edu.cn; sipengbo@bjut.edu.cn; wenjunwu@bjut.edu.cn; zhangyh@bjut.edu.cn).

F. Richard Yu is with the Department of Systems and Computer Engineering, Carleton University, Ottawa, ON, K1S 5B6, Canada (e-mail: Richard.Yu@carleton.ca).

I. INTRODUCTION

Currently, with the increasing number of electronic devices, many of them are expected to be linked to the Internet and to constitute the Internet of Things (IoT) [1]. As an important part of IoT, machine-to-machine (M2M) communications emerge as a promising communication paradigm, in which machine-type communication devices (MTCDs) can communicate intelligently with very limited human intervention, in applications such as wearable devices, automotive electronics, smart grids, and industrial automation [2]–[4]. According to predictions and reports by research institutions and companies, the number of M2M connections will reach 14.6 billion by 2022 and around 50 billion in the near future [5]–[7].

Different from traditional human-to-human (H2H) communications, in M2M communications a large portion of the data traffic can tolerate a relatively long delay, for instance, the data traffic of intelligent meters, environmental monitoring, and other non-real-time services [8]. This delay-tolerant data is uploaded or downloaded periodically; it permits higher transmission latency but requires a powerful processing rate [9]. Meanwhile, most delay-tolerant computing tasks cannot be cached, processed and executed solely on the local devices, since MTCDs are usually equipped with limited battery, storage and computation resources for a relatively long working life [10], [11], and the micro central processing unit (micro-CPU) on an MTCD cannot execute complicated computing tasks to fulfill their computing needs [12]. On the other hand, as a distinctive feature of M2M communications, security and reliability are considered even more important, because sensitive data in IoT is usually scheduled, transmitted and exchanged between the various MTCDs without human control [13].

Facing these issues and challenges, in recent years many studies have focused on improving the capability of data caching and computing, as well as on enhancing the security and reliability of data traffic in various areas of IoT. In [14], the authors propose a novel nonorthogonal multiple access (NOMA)-based edge computing model for narrowband IoT (NB-IoT) networks, and present a joint optimization framework that minimizes the maximum task execution latency required per task bit across NB-IoT devices. The authors in [15] propose a new architecture for data synchronization based on fog computing, and design a synchronization algorithm for data caching and computing at the fog servers in order to decrease the communication cost and reduce the latency. Moreover, focusing on data security in IoT, the authors of [16] propose and analyze a novel blockchain-based scheme for IoT nodes, which aggregates the blockchain data in periodic updates and further reduces the communication cost of the connected IoT
      2327-4662 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
                              Authorized licensed use limited to: University of Exeter. Downloaded on July 11,2020 at 15:16:11 UTC from IEEE Xplore. Restrictions apply.
devices. A blockchain-enabled efficient data collection and secure sharing scheme is proposed in [17], which combines the Ethereum blockchain and deep reinforcement learning (DRL) to maximize the amount of collected data and ensure the security and reliability of data sharing.

Although various excellent works have been done on data caching, computing and security based on edge computing or blockchain in M2M communications-based IoT, these three important aspects have generally been considered separately. Most existing works focus only on the resource allocation of caching and computing through either edge computing or cloud computing, and ignore the joint optimization of local computing, edge computing and cloud computing in M2M communications [18]. More importantly, delay-sensitive data and delay-tolerant data have hardly been differentiated in the existing works. This mixed transmission strategy results in the overconsumption of communication resources and a serious degradation of the quality of service (QoS). Meanwhile, owing to the features of blockchain technology, delay-sensitive data usually cannot be uploaded to the blockchain systems, because the blockchain systems need enough time to execute the transactions and the smart contracts [19].

To address the above problems and challenges, in this paper, focusing on delay-tolerant data, we propose a novel framework that jointly considers caching, computing and security to improve system performance based on edge computing and blockchain in M2M communications networks. In addition, we introduce the dueling deep Q-network (DQN) algorithm to learn, train and derive the optimal decisions, so that the maximum system rewards can be obtained. The distinct features of this paper are listed as follows.

    • In this framework, edge computing is introduced in order to improve the capability of data caching and computing. Based on edge computing, the delay-tolerant computing tasks carried by MTCDs can be offloaded to closer edge computing servers, so that more computing tasks can be accommodated and executed selectively on the local device, edge computing servers or cloud computing servers according to the current network states, network environments and QoS requirements.
    • In order to enhance the data security and efficiency in M2M communications, blockchain is considered as a crucial technology in the proposed network model. Based on distributed blockchain nodes, the delay-tolerant data can be uploaded into the blockchain systems after computing and processing, and the data security can be authorized and ensured through the consensus mechanism.
    • The schedule and strategy of caching, computation offloading and blockchain systems usage are jointly considered and formulated as a discrete Markov decision process (MDP) to maximize the system rewards, which include a higher caching reward, a lower time overhead of data computation for edge computing servers, and efficient data processing for blockchain systems. Particularly, to handle the dynamic and large-dimension characteristics of the optimization and decision process, a dueling DQN-based optimization algorithm is adopted, and the appropriate caching storage, computing servers and blockchain systems can be selected after the training.
    • Extensive simulation results with different parameters show that the proposed scheme is more effective than the existing schemes. It is revealed that the system rewards are increased and the average service latency is decreased significantly through the dynamic decision process.

The rest of this paper is organized as follows. We review the related works on M2M communications, blockchain and edge computing in Section II. Next, the proposed network architecture and system model are presented in Section III. In Section IV, the selection and decision process of caching, computing servers and blockchain systems for delay-tolerant data in M2M communications is formulated, followed by the maximum system rewards derived from the optimal strategy. Then, the solution of the proposed optimization strategy is given and discussed in Section V. In Section VI, we present and discuss the simulation results. Finally, we conclude this work in Section VII with future works.

II. RELATED WORKS

In this section, some related works on delay-tolerant data in M2M communications are reviewed first. Then, some background on blockchain-enabled IoT networks is presented, followed by a description of integrated caching and computing in blockchain-enabled M2M communications.

A. Delay-Tolerant Data in M2M Communications

The concept of the delay-tolerant network was initially proposed for the InterPlaNetary Internet (IPN), because in that network environment data transmission has to suffer from very large latency, low data rates, possibly time-disjoint periods of reception, and intermittent scheduled connectivity [20], [21]. However, different from the IPN, in general IoT networks with M2M communications a large amount of data traffic can tolerate relatively long latency [9], such as environment monitoring, e-health, smart meters, and surveillance. For example, for smart meters deployed in households, the data is uploaded or downloaded periodically, e.g., every day, every week or every month. Thus, on one hand, the delay-tolerant data does not require transmission or execution with limited spectrum resources in real time [22]. On the other hand, this data in M2M communications networks has enough latency budget to be processed [23], for example by executing data computing on edge/cloud computing servers or by processing data at blockchain systems.

It should be noted that delay-tolerant data traffic does not mean that there is no delay limitation in M2M communications. It also has its own lifetime (delay) requirements, which are much longer than those of delay-sensitive data traffic [24].
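The lifetime requirement described in Section II-A can be pictured with a minimal sketch. It is only an illustration under stated assumptions: the class `DelayTolerantTask`, the function `feasible_options`, the option names and all latency numbers are hypothetical and do not come from the paper. The sketch keeps the processing options whose estimated latency fits within the lifetime of a delay-tolerant task and orders them from fastest to slowest.

```python
from dataclasses import dataclass


@dataclass
class DelayTolerantTask:
    """Hypothetical task record; not a structure defined in the paper."""
    name: str
    lifetime_s: float  # assumed lifetime (deadline) of the data, in seconds


def feasible_options(task, latency_by_option):
    """Return the options whose estimated latency fits the task lifetime,
    ordered from the lowest to the highest estimated latency."""
    fitting = [opt for opt, lat in latency_by_option.items()
               if lat <= task.lifetime_s]
    return sorted(fitting, key=lambda opt: latency_by_option[opt])


# Illustrative numbers only (assumptions, not measurements).
meter = DelayTolerantTask("smart-meter daily upload", lifetime_s=3600.0)
estimates = {
    "local": 5400.0,           # weak micro-CPU: too slow for this lifetime
    "edge": 12.0,
    "cloud": 45.0,
    "edge+blockchain": 600.0,  # consensus adds latency but still fits
}
options = feasible_options(meter, estimates)
print(options)
```

With these assumed numbers, local execution is excluded because it exceeds the lifetime, while the blockchain option remains admissible even though it is far slower than plain edge computing, which is the sense in which the data is delay-tolerant but not delay-unconstrained.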
B. Blockchain-Enabled IoT Networks

With the rapid development of urbanization, various challenges and problems have emerged all over the world. In order to solve these problems, the concept of smart cities has been proposed and considered as an effective solution [25]. IoT networks play a key role in the implementation of smart cities, while M2M communications are an important foundation for IoT [6]. Nevertheless, for M2M communications, data interaction and data security are more important than throughput increase or energy saving, since vast amounts of data in M2M communications are used to support industrial production, metering systems, etc., which require higher reliability and security [7].

Fortunately, as an emerging technology, blockchain has a huge potential to promote the development of smart cities and to enhance IoT services. Blockchain, first used as a peer-to-peer (P2P) ledger for Bitcoin economic transactions [26], can guarantee data security and efficiency by enabling anonymous and trustful transactions and removing all kinds of intermediaries. Nowadays, blockchain technology can bring many good features to IoT, such as trust-free operation, transparency, automation, decentralization and security [27]. For instance, owing to the features of blockchain, a single point of failure is unlikely when blockchain-based decentralized systems are applied in M2M communications. Thus, the security of machine-type communications networks can be enhanced effectively [19]. Moreover, in the blockchain-enabled IoT network, any collected data is signed using digital signatures, and it is linked and secured through one-way cryptographic hash functions [28]. Therefore, the data collection through blockchain-enabled M2M communications can be deemed transparent, and the reliability of IoT networks can be guaranteed adequately. Based on these advantages, blockchain technology will promote the implementation and deployment of a trusted, secure and transparent network environment for IoT.

C. Integrated Caching and Computing in Blockchain-Enabled M2M Communications

Edge caching and computing have been widely studied in recent years, and they are beneficial for content retrieval and data processing in different M2M applications [29]. Many existing works have focused on content storage and computation offloading at the edge of the network in order to decrease network overload and system costs. For example, an optimization scheme of content caching and request routing is proposed in [30] in order to minimize data traffic latency. In [31], the joint optimization of computation offloading and resource allocation is discussed to decrease energy consumption.

For blockchain-enabled M2M communications, the caching and computing process of data, which is called "mining", has been deemed a rigorous challenge in current network systems [32]. In order to provide powerful computation capacity, many data computation management schemes have been proposed and discussed in the existing works [33], [34], but they usually consider wired blockchain networks, and the limited capacity of caching and computing in conventional wireless communication networks is often ignored. In other words, it is essential to support powerful data storage and computation for blockchain-enabled M2M communications. As another promising technology, edge computing can deploy caching and computation resources closer to MTCDs, so that data caching and computing can be executed at edge computing servers [35]. Compared to conventional cloud computing, edge computing can significantly reduce the network overload and shorten the transmission latency. Therefore, integrating caching and computing through the edge computing paradigm can bring huge advantages and benefits for blockchain-enabled M2M communications.

III. NETWORK ARCHITECTURE AND SYSTEM MODEL

In this section, the physical and logical network architecture of the proposed scheme is presented first. Then, the communication model, caching model, computation model, as well as the latency constraint are given and discussed in detail. The latency model of blockchain systems is presented last.

A. Network Architecture

An example of the network architecture for blockchain-enabled M2M communications is depicted in Fig. 1, which consists of $M$ small cells, $N$ MTCDs for different applications or services, and one cloud services platform. In the $m$th ($m = 1, 2, \ldots, M$) small cell, we assume that one wireless access point (AP) is deployed, which is equipped with an edge computing server and blockchain systems; it is denoted as $\mathrm{AP}_m$ in this paper. The edge computing servers enable computation and storage of computing tasks, and the blockchain systems can upload data, record transactions and share information. In addition, we consider that $N_m$ MTCDs in this small cell can connect to the nearest AP to transmit delay-tolerant data randomly, and each MTCD is also equipped with a micro-CPU to execute lightweight computing tasks. We use $n_m$ ($n = 1, 2, \ldots, N$) to represent the $n$th MTCD in the $m$th cell.

Moreover, the cloud services platform includes one core controller equipped with powerful caching and computing servers. Typically, the cloud services platform is connected to all the APs through wired links, and the heavy or complicated computing tasks of the blockchain systems can be offloaded to and executed at the cloud computing servers [18]. Thus, the data computation in blockchain systems can be processed on edge computing servers or cloud computing servers.

B. Communication Model

In the communication model, we assume that the uplink and downlink channels between APs and MTCDs are symmetric according to the reciprocity theorem when the transmissions occur in the same coherence interval. In the proposed model, the channel radio propagation is assumed to
IEEE INTERNET OF THINGS JOURNAL, VOL. XX, NO. XX, XXX 20XX 5
be processed and executed on local devices, edge computing servers or cloud computing servers.

The computational ability, in other words the number of CPU cycles per second, of the $n_m$th MTCD, of the edge computing servers connected with AP$_m$, and of the cloud computing servers is represented as $F_{n_m}$, $F_{AP_m}$, and $F_c$, respectively. In addition, the delay-tolerant computing task carried by the $n_m$th MTCD at time slot $t$ is represented as $I_{n_m}(t) \triangleq \{\alpha_{n_m}(t), \beta_{n_m}(t)\}$, where $\alpha_{n_m}(t)$ denotes the size of the input data involved and $\beta_{n_m}(t)$ denotes the total number of CPU cycles required to accomplish the computing task [36]. At each time slot, the computing tasks may be executed on the local device, or transmitted and offloaded to edge computing servers or cloud computing servers. The time costs of data computing in each of these situations are given in detail as follows.

1) Local Computing: Simple computing tasks could be executed on the MTCD immediately. Then, the time consumption to execute these tasks is represented as

$$ t_{n_m} = \frac{\beta_{n_m}(t)}{F_{n_m}}. \tag{6} $$

2) Edge Computing: Complicated computing tasks, such as mining with blockchain systems, cannot be processed on the MTCDs by themselves. Therefore, the full task $I_{n_m}$ needs to be offloaded to its associated AP$_m$. In this case, the computing tasks have to be offloaded first and transferred through the wireless communication link. Thus, if the computing tasks are decided to execute on edge computing servers, the transmission time of offloading the computing data in the first step is represented as

$$ t_{n_m,off} = \frac{\alpha_{n_m}(t)}{c_{n_m,AP_m}(t)}. \tag{7} $$

When the full computing tasks have been offloaded, the edge computing servers connected to AP$_m$ execute these computing tasks, and the time consumption in this phase is represented as

$$ t_{AP_m,comp} = \frac{\beta_{n_m}(t)}{F_{AP_m}}. \tag{8} $$

As a result, the total execution time turns out to be

$$ t_{n_m,AP_m} = \frac{\alpha_{n_m}(t)}{c_{n_m,AP_m}(t)} + \frac{\beta_{n_m}(t)}{F_{AP_m}}. \tag{9} $$

3) Cloud Computing: Due to limited energy supply and data processing capacity, it is difficult to rely solely on local devices or edge computing servers to accomplish the complicated computing tasks. To improve the computing capacity, cloud computing is also considered in the proposed network architecture. Similar to edge computing, the delay-tolerant computing tasks need to be offloaded to the associated AP first; the time consumption is represented as $t_{n_m,off}$, which is given in (7). Nevertheless, because of the powerful computing ability of cloud computing servers, we assume that they can execute unlimited computing tasks concurrently and return the results within a fixed time consumption $t_{c,comp}$. Furthermore, due to the wired transmission between AP$_m$ and the cloud service platform, the time consumption for transmitting the input data from AP$_m$ to the cloud computing servers and returning the results is represented as $t_{c,trans}$. Therefore, in the case of cloud computing, the overhead of data processing time can be represented as

$$ t_{AP_m,c} = \frac{\alpha_{n_m}(t)}{c_{n_m,AP_m}(t)} + t_{c,comp} + t_{c,trans}. \tag{10} $$

Accordingly, considering the different computing modes of all MTCDs in the $m$th small cell, the total execution time of the computing tasks can be represented as

$$ t_{total} = \sum_{n_m=0}^{\lambda} \frac{\beta_{n_m}(t)}{F_{n_m}} + \sum_{n_m=0}^{\mu} \left( \frac{\alpha_{n_m}(t)}{c_{n_m,AP_m}(t)} + \frac{\beta_{n_m}(t)}{F_{AP_m}} \right) + \sum_{n_m=0}^{\nu} \left( \frac{\alpha_{n_m}(t)}{c_{n_m,AP_m}(t)} + t_{c,comp} + t_{c,trans} \right), \tag{11} $$

where $\lambda$, $\mu$, and $\nu$ are the numbers of computing tasks that are executed on MTCDs, edge computing servers and cloud computing servers, respectively, and they need to satisfy $0 \le \lambda, \mu, \nu \le N_m$ and $0 \le \lambda + \mu + \nu \le N_m$.

E. Latency Model with Blockchain Systems

Although latency is permissible for delay-tolerant data and its tolerable delay is usually much higher than that of delay-sensitive traffic, this does not mean that the latency of data traffic and computation is unlimited. For blockchain systems in particular, latency is an important index and cannot be ignored. Generally, the transaction processing of data in blockchain systems has two phases: a block is generated first, and then a consensus on the generated block is reached among all the users. Consequently, the latency of data processing in blockchain systems includes the latency of block generation and block confirmation, which is represented as

$$ T_{block} = T_g + T_d + T_v, \tag{12} $$

where $T_g$ is the average time required for the blockchain systems to produce a new block, $T_d$ is the time consumption of data delivery, and $T_v$ is the time cost for validation in blockchain systems. In this paper, we select practical Byzantine fault tolerance (PBFT) as the consensus mechanism in the proposed blockchain systems and, according to [19], $T_d$ can be calculated as

$$ T_d = \frac{1}{P} \Big[ \min\Big\{ \frac{M S_b}{c_{n_m,n'_m}}, T_{lim} \Big\} + \min\Big\{ \max_{n_i \neq n_m, n'_m} \frac{M S_b}{c_{n_m,n_i}}, T_{lim} \Big\} + \min\Big\{ \max_{n_i \neq n_j \neq n_m} \frac{M S_b}{c_{n_i,n_j}}, T_{lim} \Big\} + \min\Big\{ \max_{n_i \neq n_j} \frac{M S_b}{c_{n_i,n_j}}, T_{lim} \Big\} + \min\Big\{ \max_{n_i \neq n_m} \frac{M S_b}{c_{n_m,n_i}}, T_{lim} \Big\} \Big], \tag{13} $$

where $P$ is the batch size of the block, $S_b$ represents the number of bytes contained in each block, and $T_{lim}$ means the average time required for the block producer to create a new block.
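The computing-time expressions (6)-(11) can be checked numerically with a short sketch. It evaluates the three execution modes for a single task and then sums per-mode times as in (11). The function names, the task size, the link rate and the cloud delays `T_C_COMP` and `T_C_TRANS` are illustrative assumptions rather than values from the paper; the input size is taken in bits and the link rate in bit/s so that their ratio is in seconds.

```python
def t_local(beta, f_mtcd):
    """Eq. (6): local execution time beta / F_{n_m}."""
    return beta / f_mtcd

def t_offload(alpha, rate):
    """Eq. (7): wireless offloading time alpha / c_{n_m,AP_m}(t)."""
    return alpha / rate

def t_edge(alpha, beta, rate, f_ap):
    """Eq. (9): offloading time plus edge execution time beta / F_{AP_m}."""
    return t_offload(alpha, rate) + beta / f_ap

def t_cloud(alpha, rate, t_c_comp, t_c_trans):
    """Eq. (10): offloading time plus fixed cloud compute and wired transfer delays."""
    return t_offload(alpha, rate) + t_c_comp + t_c_trans

def t_total(local, edge, cloud, f_mtcd, f_ap, rate, t_c_comp, t_c_trans):
    """Eq. (11): sum over the tasks assigned to each mode.

    Each of local, edge and cloud is a list of (alpha, beta) pairs.
    """
    return (sum(t_local(b, f_mtcd) for _, b in local)
            + sum(t_edge(a, b, rate, f_ap) for a, b in edge)
            + sum(t_cloud(a, rate, t_c_comp, t_c_trans) for a, _ in cloud))

# Hypothetical task and link parameters, chosen only for illustration.
ALPHA, BETA = 8e6, 1e9            # input size in bits, required CPU cycles
F_MTCD, F_AP = 0.5e9, 5e9         # CPU cycles per second
RATE = 20e6                       # bit/s, assumed link capacity
T_C_COMP, T_C_TRANS = 0.05, 0.1   # seconds, assumed fixed cloud delays

local_time = t_local(BETA, F_MTCD)
edge_time = t_edge(ALPHA, BETA, RATE, F_AP)
cloud_time = t_cloud(ALPHA, RATE, T_C_COMP, T_C_TRANS)
total_time = t_total([(ALPHA, BETA)], [(ALPHA, BETA)], [(ALPHA, BETA)],
                     F_MTCD, F_AP, RATE, T_C_COMP, T_C_TRANS)
```

With these made-up numbers the slow local CPU dominates, so offloading is faster even after paying the transmission time; a different link rate or task size can reverse that ordering.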
$c_{n_i,n_j}$ denotes the data transmission rate of the link between each pair of MTCDs.

Moreover, the time cost for validation $T_v$ can be calculated as

$$ T_v = \frac{1}{P} \max_{n_m = 1, \ldots, N_m} \left\{ \frac{P\gamma + [P + 4(K + \varsigma - 1)]\delta}{F_{n_m}} \right\}, \tag{14} $$

where $\gamma$ and $\delta$ are the computing costs for verifying signatures and generating message authentication codes, $K$ is the number of block producers, and $\varsigma$ is the number of faulty replicas.

For delay-tolerant M2M communications in IoT networks, the smart MTCDs also expect to receive the finality of transactions within a finite time that satisfies their delay requirements. Therefore, we assume that one block should be issued and validated within a number of consecutive block intervals, so the following constraint should be satisfied:

$$ T_{block} \le \rho \cdot T_g, \tag{15} $$

where $\rho$ is the number of block intervals, which must satisfy $\rho > 1$.

IV. PERFORMANCE ANALYSIS AND DRL-BASED OPTIMIZATION FRAMEWORK

In this section, we formulate the optimization problem in deep reinforcement learning to handle the dynamic and large-dimensional characteristics of M2M communications in IoT networks. A prominent characteristic of the deep reinforcement learning framework is that it includes an offline deep convolutional neural network (CNN) construction phase and an online dynamic deep Q-learning phase. In other words, the action-value function with the corresponding actions and states is formulated offline, while the action selection and the dynamic network updating are performed online. With the modeling of action space, state space and reward function, the optimization problem is formulated as a DRL process as follows.

[...] $a_{ca}(t)$ be the set of data caching selection in each time slot, and it can be denoted as

$$ a_{ca}(t) = \{0, a_{ca,EC}(t), a_{ca,cloud}(t)\}, \tag{17} $$

where 0 represents that the data does not need to be stored on any caching server, and $a_{ca,EC}(t)$ represents the decision of data caching on the edge computing servers. Similarly, $a_{ca,cloud}(t)$ represents the decision that the data will be cached on the cloud servers.

Computing Node Selection: Following the data caching selection, the corresponding computing node will be selected and determined at each time slot. We denote the decision of computing node selection as $a_{comp}(t)$, which can be represented as

$$ a_{comp}(t) = \{a_{comp,l}(t), a_{comp,m}(t), a_{comp,c}(t)\}, \tag{18} $$

where $a_{comp,l}(t)$ represents that the MTCD will execute the computing tasks on the local device. Meanwhile, $a_{comp,m}(t)$ means that the computing task will be executed on the edge computing servers connected with the APs. In the same way, $a_{comp,c}(t)$ means that the cloud computing servers will execute the computing tasks offloaded by the MTCDs at each time slot $t$.

Blockchain Systems Decision: In the proposed network architecture, the blockchain systems can be used to ensure data security, but at the cost of inevitable latency and energy consumption. Thus, in each time slot, the decision of whether to select the blockchain systems can be represented as

$$ a_{block}(t) = \{0, 1\}, \tag{19} $$

where $a_{block}(t) = 0$ means the blockchain systems will not be selected and utilized, while $a_{block}(t) = 1$ represents that the blockchain systems will be selected and the data will be loaded to the block.
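One way to picture the action space defined by (17)-(19) is as the Cartesian product of the three per-slot decisions. The sketch below enumerates it. Treating the joint action as a product, and the string labels used for each option, are assumptions made for illustration; the text does not state how the three decisions are combined.

```python
from itertools import product

# Labels are hypothetical stand-ins for the symbols in (17)-(19).
CACHING = ("none", "edge", "cloud")      # a_ca(t): 0, a_ca,EC(t), a_ca,cloud(t)
COMPUTING = ("local", "edge", "cloud")   # a_comp(t): a_comp,l(t), a_comp,m(t), a_comp,c(t)
BLOCKCHAIN = (0, 1)                      # a_block(t): blockchain not used / used

def joint_actions():
    """Enumerate every (caching, computing, blockchain) combination."""
    return list(product(CACHING, COMPUTING, BLOCKCHAIN))

ACTIONS = joint_actions()
# Map each joint action to an integer index, e.g. one Q-value output per index.
ACTION_INDEX = {action: i for i, action in enumerate(ACTIONS)}
```

Under this assumption a single MTCD has 3 × 3 × 2 = 18 joint actions per time slot, which is the number of outputs a Q-network would need if it scored each combination separately.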
Moreover, $d_{n_m,AP_m}(t)$ is the distance between the $n_m$th MTCD and AP$_m$, and $c_{n_m,AP_m}(t)$ is the data traffic capacity of the link between the $n_m$th MTCD and AP$_m$. $\chi(t)$ represents the union of the data transaction sizes in blockchain systems.

[...] decision-making process and dimension increase exponentially with the number of actions and states, we introduce [...]

[Figure: agent-environment loop with the labels Environment, RL Agent, State s(t), Reward re(t), Action a(t) and Next State s(t+1).]

where $\theta$ is the learning rate, which must satisfy $\theta \in (0, 1)$.
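The update equation that the learning rate θ belongs to is not reproduced in the surviving text. Assuming it is the standard tabular Q-learning rule, Q(s, a) ← Q(s, a) + θ [re + ξ max over a' of Q(s', a') − Q(s, a)], with ξ as the discount factor, a minimal sketch looks as follows. The dictionary-based table, the state names and the action labels are hypothetical.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, theta=0.1, xi=0.9):
    """One tabular Q-learning step (assumed standard form, see lead-in).

    q maps (state, action) pairs to values; theta is the learning rate
    and xi the discount factor.
    """
    if not (1.0 > theta > 0.0):
        raise ValueError("learning rate theta must lie in (0, 1)")
    # Greedy estimate of the value of the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference error between the target and the current estimate.
    td_error = reward + xi * best_next - q[(state, action)]
    q[(state, action)] += theta * td_error
    return q[(state, action)]

q_table = defaultdict(float)              # unseen pairs start at 0.0
actions = ("local", "edge", "cloud")
q_table[("s1", "edge")] = 2.0             # pretend the next state is already valued
new_q = q_update(q_table, "s0", "local", reward=1.0,
                 next_state="s1", actions=actions, theta=0.5, xi=0.9)
```

A larger θ moves the estimate further toward the new target in each step, which speeds up learning but makes the values noisier; that trade-off is why θ is restricted to the open interval (0, 1).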
[...] a target DQN is introduced into the deep Q-learning algorithm to assist training; it is updated with a smaller learning rate to keep the training process stable and smooth. Then, the mean square deviation between the target Q value and the current Q value is defined as the loss function $L(\omega)$, which is represented as

$$ L(\omega) = E\left[ \left( re + \xi \max_{a'} Q(s', a'; \omega^-) - Q(s, a; \omega) \right)^2 \right], \tag{26} $$

where $\omega$ is the parameter of the neural network and $\omega^-$ is the parameter of the target DQN, which keeps the Q value stable and the training process smooth. Based on (26), DQN updates its network parameters by minimizing $L(\omega)$.

B. Dueling Deep Q-learning Network

As an improved algorithm of model-free DRL, dueling DQN can estimate the Q-values with lower variance and use the greedy policy to ensure adequate exploration of the action space. However, unlike traditional value-based DRL, dueling DQN calculates the state value and the action advantage separately. Thus, the calculation and derivation process of the dueling DQN algorithm can be formulated as the combination of the value of the environmental state and the advantage of the executed action, and it can be written as

$$ Q(s, a) = V(s) + A(a). \tag{27} $$

Through dueling DQN, the problem of repeated calculation of the same state value can be addressed, and the capability of estimating the environmental state with a clear optimization objective can also be improved [13]. Moreover, dueling DQN also utilizes a strategy called experience replay [45]. It stores past experiences in a replay memory and randomly samples mini-batches from the pool to train the deep neural network, which keeps the agent from concentrating only on what the network is currently doing. In addition, the ϵ-greedy policy is used to balance exploitation and exploration [13]. Through the above training process and continuous exploration, the available system rewards converge to an optimal value. Therefore, in this paper, we adopt dueling DQN to find the optimal strategy of data caching, computing, and blockchain systems selection for delay-tolerant data in M2M communications networks.

In addition, according to [46], [47], in order to apply the DRL model simply, we map the actions, states and system rewards of the proposed scheme to a grid model. The whole training process resembles playing a grid game. The formulated action space can be designed and discretized, and the action set that may be taken in each time slot is mapped to the moving direction in the grid model. Moreover, since the system state space is finite, we need to map the continuous state to a discrete state set. Meanwhile, different state sets are set as different levels and mapped to the RGB image. In other words, different colors in the image represent different levels of the state sets. Then, the agent explores the grid model with different actions. According to the different actions and the corresponding system states, different system rewards can be achieved after moving one step. Based on this method, the agent continuously explores different paths in the grid model, and the available system rewards converge to an optimal value after the whole training. By repeating this training process, the trained network model is obtained. For ease of understanding, the proposed scheme is shown in Fig. 3.

Fig. 3: Actions and system states mapped to an RGB image.

The workflow of dueling DQN is shown in Fig. 4, and the whole process of the proposed algorithm is shown in detail in Algorithm 1.

VI. SIMULATION RESULTS AND DISCUSSIONS

In this section, simulation results are presented to demonstrate the performance improvement achieved by our proposed scheme for delay-tolerant data transmission in blockchain-enabled M2M communications. Significant advantages can be observed in the results under various training parameters, different data sizes for computation offloading, and different acceptable delay constraints.

A. Simulation Environment

In the simulation, we consider a blockchain-enabled M2M communications environment with $M = 5$ small cells in a 1000 m × 1000 m region. In each small cell, there are $N_m = 50$ MTCDs and one AP equipped with blockchain systems and edge computing servers to offer wireless access and data computation services. Moreover, we also take into account one core controller with caching and cloud computing servers in the proposed network architecture. In the initial time slot, the channel bandwidth between the MTCD and the AP is set as 10 MHz. The transmit power of each MTCD and each AP is set to 100 mW and 10 W, respectively, while the system background noise power is 5 mW. The channel
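[Fig. 4 (workflow of dueling DQN): Input of size 84×84×3; filter 8×8, stride 4, giving 20×20×32; filter 4×4, stride 2, giving 9×9×64; filter 3×3, stride 1, giving 7×7×64; filter 7×7, stride 1, giving the 1×1×512 fully connected layer. The fully connected layer feeds the state value V(s) and the action value A(a), which are combined into Q(s(t), a(t)) to produce the action a(t) and the update.]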
                  Algorithm 1 Performance Optimization Framework for                                                        Meanwhile, the CPU computation capability of MTCD, edge
                  Delay-Tolerant Data in Blockchain-Enabled M2M Commu-                                                      computing servers or cloud computing servers is set to be
                  nications Based on Dueling DQN                                                                            Fnm = 0.5 GHz, FAPm = 5 GHz and Fm = 20 GHz,
                    1:   Initialize:                                                                                        respectively. For the aspect of data caching, the unit price
                    2:   Offline dueling DQN construction:                                                                  for data caching in servers at AP or at cloud is set as 10 and
                    3:   Parameters of DQN ω;                                                                               5, and the capacity of cache storage is set as 10 MB. Besides,
                    4:   Parameters of DQN ω − ;                                                                            for the aspect of data uploading into blockchain systems, the
                    5:   Loading the historical system state and Q(s, a) value                                              bath size of block P is set as 3, the number of bytes contained
                         estimates in experience memory;                                                                    in each block Sb = 5 MB, the average time required to create
                    6:   Pre-training the dueling DQN with input pairs with each                                            a new block Tlim = 1s, the computing costs for verifying
                         action and state (s, a) and corresponding Q(s, a; ω);                                              signatures or generating message authentication γ or δ is set
                    7:   Online dueling DQN execution:                                                                      as 2 MHz or 1 MHz, and the number of block producers is
                    8:   Initialize the environment and the initial state;                                                  set as 20. Furthermore, the weights υ, σ and τ are set as 0.3,
                    9:   for t = t, t + 1, t + 2, . . . , t + T do                                                          0.3 and 0.4. The aforementioned parameters are used widely
                   10:        for nm = 1, 2, . . . , Nm do                                                                  in the existing works [19], [36], [39].
                   11:            Execute a(t) based on ϵ-greedy policy, and obtain                                            The CNN is used as the evaluation network to calculate the
                         reward re(t), and next state s(t + 1);                                                             Q value and the target Q value. A 4-layer CNN is designed
                   12:            Form a sample (s(t), a(t), re(t), s(t + 1)), store                                        and adopted in the simulations. In the proposed scheme, we
                         it into experience memory;                                                                         formulate that the initial input image is firstly resized into
                   13:            Calculate the state value V (s(t)) and action ad-                                         84 × 84 × 3. The first hidden layer convolves 8 × 8 filters
                         vantage A(a(t));                                                                                   with stride 4 with the input image. The second hidden layer
                   14:            Obtain Q(s(t), a(t); ω) based on V (s(t)) and                                             convolves 4 × 4 filters with stride 2, and the size turns to
                         A(a(t)) through Eq. (27);                                                                          20 × 20 × 32, again followed by a rectifier nonlinearity. The
                   15:            Calculate the target Q-value from the target                                              third hidden layer convolves 3 × 3 filters with stride 1, the
                         DQN by Qtarget ← re(t + 1) + ξQtarget (s(t +                                                       size turns to 9 × 9 × 64. The final hidden layer convolves
                         1), arg max(Q(s(t + 1), a(t + 1); ω); ω − ));                                                      7 × 7 filters with stride 1, the size turns to 7 × 7 × 64. In each
                                   a∈A
                   16:            Update the target Q network with learning rate θ                                          convolutional layer, the rectified linear unit (ReLU) function
                         and loss function L(ω) in each step;                                                               is selected as the activation function. Then, through 4 layers
                   17:        end for                                                                                       of convolutional operations, the output nodes will be fully
                   18:   end for                                                                                            connected to be trained in the deep reinforcement learning.
                                                                                                                               The system performance of the proposed scheme is com-
                                                                                                                            pared with several existing schemes in the simulations, such
                                                                                                                            as greedy strategy and random selection strategy. We also
                  gain identically follows a Gaussian distribution with zero                                                consider and study the impact of different parameters in order
                  mean and unit variance, and the path loss exponent κ is set as                                            to ensure comparison fairness, such as different episodes in
                  3. In addition, for the aspect of data computing, the data size                                           the dueling DQN, different number of CPU cycles, different
                  for the computation offloading is αnm (t) = 600 KB, and the                                               number of data sizes for computation offloading, different
                  total number of CPU cycles is βnm (t) = 1200 Megacycles.                                                  capacity of cache storage at edge computing servers, different
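As an illustration of steps 13 to 15 of Algorithm 1, the following minimal Python sketch combines a state value and action advantages into Q-values and then forms the target Q-value. It assumes the common mean-subtracted dueling aggregation, Q(s, a) = V(s) + A(s, a) minus the mean advantage over all actions, which may differ in detail from Eq. (27). The function names and all numbers are hypothetical and are not taken from the simulations.

```python
from typing import List, Sequence


def dueling_q_values(state_value: float, advantages: Sequence[float]) -> List[float]:
    """Combine V(s) and A(s, a) into Q(s, a) with mean-subtracted aggregation.

    Assumption: this is the standard dueling form, not necessarily Eq. (27).
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (adv - mean_advantage) for adv in advantages]


def double_dqn_target(reward: float,
                      discount: float,
                      next_q_online: Sequence[float],
                      next_q_target: Sequence[float],
                      terminal: bool = False) -> float:
    """Target Q-value: pick the next action with the online network (omega),
    then evaluate that action with the target network (omega^-)."""
    if terminal:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + discount * next_q_target[best_action]


# Hypothetical values for one transition with three candidate actions.
q_online_next = dueling_q_values(2.0, [1.0, 3.0, 2.0])  # online network, omega
q_target_next = dueling_q_values(1.5, [0.5, 1.0, 3.0])  # target network, omega^-
target = double_dqn_target(reward=1.0, discount=0.9,
                           next_q_online=q_online_next,
                           next_q_target=q_target_next)
print(q_online_next, q_target_next, target)
```

Selecting the action with ω and evaluating it with ω⁻, as in step 15, decouples action selection from value estimation; the sketch leaves out the network itself, the experience memory and the gradient update of step 16.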
block size limitations, etc.

B. Performance of Convergence
   Fig. 5 shows the system rewards in the proposed blockchain-enabled M2M communications networks under different learning rates. The learning rate in DRL determines the magnitude by which the network parameters are updated along the gradient of the loss function. In other words, a higher learning rate means a larger parameter update range. As shown in Fig. 5, the system rewards are more stable with a lower learning rate, because smaller updates make it possible to find the precise position of the optimal value. Moreover, it can also be seen that a higher learning rate enables better convergence. Therefore, as a compromise between these two effects, we select a learning rate of 10^{-3} in the simulations of this paper.

Fig. 5: Comparison of the system rewards with different learning rates. [Plot: system rewards Re versus episode for learning rates 10^{-2}, 10^{-3} and 10^{-4}.]

Fig. 6: Comparison of the system rewards with different schemes. [Plot: system rewards versus episode.]

discussed above, owing to the dueling DQN, the proposed scheme converges fastest compared with the other existing schemes. Meanwhile, the system rewards of the proposed scheme are still higher than those of the other three existing schemes. The reason is that the system rewards depend on the appropriate selection of caching and computing servers as well as on the utilization of blockchain systems. In other words, the optimal selection of caching resources, computing resources and blockchain systems benefits delay-tolerant data in M2M communications networks.

C. Performance Comparison for Computation Offloading
   Fig. 7 depicts the comparison of the system rewards with different numbers of CPU cycles. In this figure, the system rewards of the proposed scheme increase much faster than those of the schemes that consider only edge computing or only cloud computing. The advantage of the proposed scheme is prominent because edge computing and cloud computing servers can be jointly utilized by the MTCD through the DQN. Hence, as the number of CPU cycles increases, the optimal selection and decision can be made to handle the computing tasks carried by MTCDs through computation offloading.
   Fig. 8 shows the comparison of the system rewards with different data sizes under different schemes. To ensure a fair comparison, we compare the proposed scheme with existing schemes, including the conventional DQN strategy, the greedy-based strategy and the random selection strategy. From this figure, it can be seen that as the data size for computation offloading increases, the system rewards of the proposed scheme and the conventional DQN strategy increase markedly.

Fig. 7: Comparison of the system rewards with different numbers of CPU cycles. [Plot: system rewards Re versus number of CPU cycles (Megacycles) for the proposed scheme, edge computing only and cloud computing only.]
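To give a feel for the computing parameters used in the simulations, the sketch below evaluates the pure computation time of one task as the number of CPU cycles divided by the CPU capability, using the default values of 1200 Megacycles and 0.5 GHz, 5 GHz and 20 GHz for the MTCD, the edge computing server and the cloud computing server. This is a simplifying assumption: transmission, caching and blockchain delays are ignored, so the numbers are not the service latencies reported in the figures. The names CPU_CYCLES, CAPABILITY_HZ and computing_time are illustrative only.

```python
# Default values quoted in the simulation setup.
CPU_CYCLES = 1200e6  # total CPU cycles of one task (1200 Megacycles)
CAPABILITY_HZ = {    # CPU computation capability at each execution site
    "mtcd": 0.5e9,   # 0.5 GHz
    "edge": 5e9,     # 5 GHz
    "cloud": 20e9,   # 20 GHz
}


def computing_time(cycles: float, capability_hz: float) -> float:
    """Seconds needed to execute `cycles` CPU cycles at `capability_hz` cycles/s.

    Assumption: computation only; no transmission or queueing delay.
    """
    return cycles / capability_hz


times = {site: computing_time(CPU_CYCLES, f) for site, f in CAPABILITY_HZ.items()}
for site, seconds in times.items():
    print(f"{site}: {seconds:.2f} s")
```

Under these assumptions the cloud server is the fastest at pure computation, so any latency penalty for cloud execution has to come from data transmission and offloading, which this sketch deliberately leaves out.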
The DRL-based schemes outperform the traditional schemes, such as the greedy-based strategy and the random selection strategy. Moreover, the system rewards of the proposed scheme are still higher than those of the conventional DQN strategy and the greedy-based strategy. The reason is that, faced with different processing methods of data computation, the optimal selection of computing servers in different time slots can be determined through training in the dueling DQN. The simulation results also demonstrate that different data computing resources should be allocated properly to increase system rewards.

Fig. 8: Comparison of the system rewards with different data sizes for computation offloading. [Plot: system rewards versus number of data sizes for computation offloading (KB).]

   To examine the influence of the proposed scheme on service latency, we compare it with schemes in which the computing tasks are executed only on edge computing servers or only on cloud computing servers. Fig. 9 shows the comparison of the average

Fig. 9: Comparison of the average service latency with different numbers of CPU cycles. [Plot: average service latency (s) versus number of CPU cycles (Megacycles) for executing all tasks at cloud servers, executing all tasks at edge computing servers, and the proposed scheme.]

computing ability, the average service latency remains almost stable across different numbers of CPU cycles. However, considering data transmission and offloading, the average service latency of the cloud computing strategy is higher than that of the other two schemes. Meanwhile, the advantage of the proposed scheme is prominent because the optimal decision can be made based on training by the dueling DQN, and the appropriate computing servers can be selected according to different network environments.

D. Performance Comparison for Joint Optimization
   To further compare the performance improvement achieved by the proposed scheme in various aspects, we focus on the average service latency with different capacities of cache storage at edge computing servers, and also consider the system rewards with different block size limitations, different acceptable delay constraints and different numbers of MTCDs under different optimization schemes.
   In Fig. 10, as the capacity of cache storage at edge computing servers increases, the average service latency decreases in all of the schemes, but it decreases much faster in the proposed scheme than in the other existing schemes. As can be seen in this figure, the proposed scheme achieves a lower service latency than the conventional DQN strategy and the greedy-based strategy, especially when the capacity of the cache storage is small. In addition, the service latency of the random selection strategy is longer because it has no optimization strategy for caching. Thanks to the dueling DQN in the proposed scheme, the advantage is prominent and the system performance can be improved significantly.
   As an important part of the proposed scheme, the impact of blockchain systems on the system performance cannot be ignored. In Fig. 11, we compare the system rewards with different block size limitations. The figure shows that the blockchain-enabled M2M communications networks obtain more system rewards with increasing block size in all of the schemes. However, the rewards cannot increase indefinitely because of the constraints in the blockchain systems. Based on the training by the dueling DQN, more delay-tolerant data in the proposed scheme has the chance to be selected and uploaded to the blockchain systems, which ensures data security. Although the system rewards increase in both the conventional DQN strategy and the greedy-based strategy, the proposed scheme outperforms them, since better decisions can be made by the dueling DQN and the proposed scheme pays more attention to long-term rewards over the whole time frame.
   Fig. 12 depicts the variation of system rewards with different acceptable delay requirements. From Fig. 12, it can be seen that with increasing acceptable latency, the
Fig. 10: Comparison of the average service latency with different capacities of cache storage at edge computing servers. [Plot: average service latency (s) versus capacity of cache storage at edge computing servers (MB); legend entries include the random selection strategy, the greedy-based strategy and the conventional DQN strategy.]

Fig. 12: Comparison of the system rewards with different acceptable delay constraints. [Plot: system rewards Re versus acceptable delay constraint (s); legend entries include the proposed scheme, the conventional DQN strategy and the greedy-based strategy.]
                                                             350
                       System Rewards Re
                                                                                                                                                                         300
                                                             300
                                                                                                                                                                         250
                                                             250
                                                                                                                                                                         200
                                                             200
                                                                                                                                                                         150
                                                             150
                                                                                                                                                                         100
                                                             100
                                                                                                                                                                         50
                                                                                                                                                                           20      30        40      50      60     70      80      90    100        110
                                                             50
                                                                   1        2        3       4        5     6      7       8       9         10                                                           Number of MTCDs
                                                                                            Block Size Limitation (MB)
                                                                                                                                                  Fig. 13: Comparison of the system rewards with different
                  Fig. 11: Comparison of the system rewards with different                                                                        numbers of MTCDs.
                  block size limitations.
VII. CONCLUSIONS AND FUTURE WORK

This paper proposed a novel scheme that jointly considers the resource allocation of caching storage, computing servers, and blockchain systems for delay-tolerant data in M2M communications networks, in order to reduce unnecessary latency and improve system performance. In the proposed framework, edge computing or cloud computing servers can be selected to execute the complicated computing tasks, and the blockchain systems can be utilized to ensure data security and authenticity. Because the selections and decisions span different network resources such as storage, computation, and blockchain, we employ the dueling DQN to solve the joint decision-making optimization problem. After training, the optimal decisions about caching servers, computing servers, and blockchain systems can be made with the maximum system rewards, which reflect lower data transmission latency, lower network costs, and better data security guarantees. Simulation results demonstrated that, compared with the existing schemes, the proposed framework increases the system rewards significantly while the stability of the proposed scheme is maintained. Future work is in progress to consider other important issues, such as integrating smart cities with the energy-efficient M2M communications and blockchain systems proposed in our framework.

ACKNOWLEDGMENT

We thank the editor and reviewers for their detailed reviews and constructive comments, which have helped to improve the quality of this paper.

REFERENCES

[1] A. Al-Fuqaha, M. Guizani, M. Mohammadi, M. Aledhari, and M. Ayyash, “Internet of Things: A survey on enabling technologies, protocols, and applications,” IEEE Commun. Surveys and Tutorials, vol. 17, no. 4, pp. 2347–2376, Fourthquarter 2015.
[2] N. Xia, H. Chen, and C. Yang, “Radio resource management in machine-to-machine communications-a survey,” IEEE Commun. Surveys and Tutorials, vol. 20, no. 1, pp. 791–828, Firstquarter 2018.
[3] Y. Lin, J. Huang, C. Fan, and W. Chen, “Local authentication and access control scheme in M2M communications with computation offloading,” IEEE Internet of Things Journal, vol. 5, no. 4, pp. 3209–3219, Aug. 2018.
[4] B. Al-Kaseem and H. Al-Raweshidy, “SD-NFV as an energy efficient approach for M2M networks using cloud-based 6LoWPAN testbed,” IEEE Internet of Things Journal, vol. 4, no. 5, pp. 1787–1797, Oct. 2017.
[5] Cisco, “Cisco visual networking index (VNI) complete forecast for 2017–2022,” Tech. Rep., 2018.
[6] A. Ali, W. Hamouda, and M. Uysal, “Next generation M2M cellular networks: Challenges and practical considerations,” IEEE Commun. Mag., vol. 53, no. 9, pp. 18–24, Sep. 2015.
[7] A. Barki, A. Bouabdallah, S. Gharout, and J. Traoré, “M2M security: Challenges and solutions,” IEEE Commun. Surveys and Tutorials, vol. 18, no. 2, pp. 1241–1254, Secondquarter 2016.
[8] M. T. Islam, A.-E. M. Taha, and S. Akl, “A survey of access management techniques in machine type communications,” IEEE Commun. Mag., vol. 52, no. 4, pp. 74–81, Apr. 2014.
[9] M. Li, P. Si, and Y. Zhang, “Random access and virtual resource allocation in software-defined cellular networks with machine-to-machine communications,” IEEE Trans. Veh. Tech., vol. 67, no. 10, pp. 9073–9086, Oct. 2018.
[10] M. Islam, A. M. Taha, and S. Akl, “A survey of access management techniques in machine type communications,” IEEE Commun. Mag., vol. 52, no. 4, pp. 74–81, Apr. 2014.
[11] F. Ghavimi and H.-H. Chen, “M2M communications in 3GPP LTE/LTE-A networks: Architectures, service requirements, challenges, and applications,” IEEE Commun. Surveys and Tutorials, vol. 17, no. 2, pp. 525–549, Secondquarter 2015.
[12] M. Chiang and T. Zhang, “Fog and IoT: An overview of research opportunities,” IEEE Internet of Things Journal, vol. 3, no. 6, pp. 854–864, Dec. 2016.
[13] C. Qiu, F. R. Yu, H. Yao, F. Xu, and C. Zhao, “Blockchain-based software-defined industrial Internet of Things: A dueling deep Q-learning approach,” IEEE Internet of Things Journal, vol. 6, no. 3, pp. 4627–4639, Jun. 2019.
[14] L. Qian, A. Feng, Y. Huang, Y. Wu, B. Ji, and Z. Shi, “Optimal SIC ordering and computation resource allocation in MEC-aware NOMA NB-IoT networks,” IEEE Internet of Things Journal, vol. 6, no. 2, pp. 2806–2816, Apr. 2019.
[15] T. Wang, J. Zhou, A. Liu, M. Bhuiyan, G. Wang, and W. Jia, “Fog-based computing and storage offloading for data synchronization in IoT,” IEEE Internet of Things Journal, vol. 6, no. 3, pp. 4272–4282, Jun. 2019.
[16] P. Danzi, A. Kalør, Č. Stefanović, and P. Popovski, “Delay and communication tradeoffs for blockchain systems with lightweight IoT clients,” IEEE Internet of Things Journal, vol. 6, no. 2, pp. 2354–2365, Apr. 2019.
[17] C. Liu, Q. Lin, and S. Wen, “Blockchain-enabled data collection and sharing for industrial IoT with deep reinforcement learning,” IEEE Internet of Things Journal, vol. 15, no. 6, pp. 3516–3526, Jun. 2019.
[18] M. Li, F. R. Yu, P. Si, and Y. Zhang, “Green machine-to-machine (M2M) communications with mobile edge computing (MEC) and wireless network virtualization,” IEEE Commun. Mag., vol. 56, no. 5, pp. 148–154, May 2018.
[19] M. Liu, F. R. Yu, Y. Teng, V. C. M. Leung, and M. Song, “Performance optimization for blockchain-enabled industrial Internet of Things (I-IoT) systems: A deep reinforcement learning approach,” IEEE Trans. Indust. Infor., vol. 15, no. 6, pp. 3559–3570, Jun. 2019.
[20] P. R. Pereira, A. Casaca, J. J. P. C. Rodrigues, V. N. G. J. Soares, J. Triay, and C. Cervelló-Pastor, “From delay-tolerant networks to vehicular delay-tolerant networks,” IEEE Commun. Surveys and Tutorials, vol. 14, no. 4, pp. 7–38, Fourthquarter 2012.
[21] E. Bulut, Z. Wang, and B. K. Szymanski, “Cost-effective multiperiod spraying for routing in delay-tolerant networks,” IEEE/ACM Trans. Netw., vol. 18, no. 5, pp. 1530–1543, Oct. 2010.
[22] S. Burleigh, A. Hooke, L. Torgerson, L. Fall, V. Cerf, B. Durst, K. Scott, and H. Weiss, “Delay-tolerant networking: An approach to interplanetary Internet,” IEEE Commun. Mag., vol. 41, no. 6, pp. 128–136, Jun. 2003.
[23] M. Li, F. R. Yu, P. Si, H. Yao, and Y. Zhang, “Software-defined vehicular networks with caching and computing for delay-tolerant data traffic,” in Proc. IEEE Int. Conf. Commun. (ICC). Kansas City, MO, May 2018, pp. 1–6.
[24] P. Si, Y. He, H. Yao, R. Yang, and Y. Zhang, “DaVe: Offloading delay-tolerant data traffic to connected vehicle networks,” IEEE Trans. Veh. Tech., vol. 65, no. 6, pp. 3941–3953, Jun. 2016.
[25] E. Tabane, S. M. Ngwira, and T. Zuva, “Survey of smart city initiatives towards urbanization,” in Proc. IEEE ICACCE. Durban, South Africa, Nov. 2016, pp. 437–440.
[26] S. Nakamoto, “Bitcoin: A peer-to-peer electronic cash system,” http://www.bitcoin.org/bitcoin.pdf, 2009.
[27] J. Xie, H. Tang, T. Huang, F. R. Yu, R. Xie, J. Liu, and Y. Liu, “A survey of blockchain technology applied to smart cities: Research issues and challenges,” IEEE Commun. Surveys and Tutorials, vol. 21, no. 3, pp. 2794–2830, Thirdquarter 2019.
[28] A. Kosba, A. Miller, E. Shi, Z. Wen, and C. Papamanthou, “Hawk: The blockchain model of cryptography and privacy-preserving smart contracts,” in Proc. IEEE SP. San Jose, CA, May 2016, pp. 839–858.
[29] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication perspective,” IEEE Commun. Surveys and Tutorials, vol. 19, no. 4, pp. 2322–2358, Fourthquarter 2017.
[30] M. Dehghan, B. Jiang, A. Seetharam, T. He, T. Salonidis, J. Kurose, D. Towsley, and R. Sitaraman, “On the complexity of optimal request routing and content caching in heterogeneous cache networks,” IEEE/ACM Trans. Netw., vol. 25, no. 3, pp. 1635–1648, Jun. 2017.
[31] Y. Mao, J. Zhang, S. Song, and K. B. Letaief, “Stochastic joint radio and computational resource management for multi-user mobile-edge computing systems,” IEEE Trans. Wire. Commun., vol. 16, no. 9, pp. 5994–6009, Sep. 2017.
[32] Z. Xiong, Y. Zhang, D. Niyato, P. Wang, and Z. Han, “When mobile blockchain meets edge computing,” IEEE Commun. Mag., vol. 56, no. 8, pp. 33–39, Aug. 2018.
[33] A. Kiayias, E. Koutsoupias, M. Kyropoulou, and Y. Tselekounis, “Blockchain mining games,” in Proc. ACM Conf. Econ. Comput. Maastricht, The Netherlands, Jul. 2016, pp. 365–382.
[34] B. A. Fisch, R. Pass, and A. Shelat, “Socially optimal mining pools,” in Proc. 13th Conf. Web Inter. Eco. Bangalore, India, Dec. 2017, pp. 205–218.
[35] M. Liu, F. R. Yu, Y. Teng, V. C. M. Leung, and M. Song, “Computation offloading and content caching in wireless blockchain networks with mobile edge computing,” IEEE Trans. Veh. Tech., vol. 67, no. 11, pp. 11 008–11 021, Nov. 2018.
[36] X. Chen, “Decentralized computation offloading game for mobile cloud computing,” IEEE Trans. Parallel Distrib. Syst., vol. 26, no. 4, pp. 974–983, Apr. 2015.
[37] Y.-S. Chen, C.-H. Cho, I. You, and H.-C. Chao, “A cross-layer protocol of spectrum mobility and handover in cognitive LTE networks,” Simulation Modelling Practice and Theory, vol. 19, no. 8, pp. 1723–1744, Oct. 2010.
[38] Y. He, N. Zhao, and H. Yin, “Integrated networking, caching and computing for connected vehicles: A deep reinforcement learning approach,” IEEE Trans. Veh. Tech., vol. 67, no. 1, pp. 44–55, Jan. 2018.
[39] Y. Wei, F. R. Yu, M. Song, and Z. Han, “Joint optimization of caching, computing, and radio resources for fog-enabled IoT using natural actor-critic deep reinforcement learning,” IEEE Internet of Things Journal, vol. 6, no. 2, pp. 2061–2073, Apr. 2019.
[40] J. Feng, F. R. Yu, Q. Pei, X. Chu, J. Du, and L. Zhu, “Cooperative computation offloading and resource allocation for blockchain-enabled mobile edge computing: A deep reinforcement learning approach,” IEEE Internet of Things Journal, pp. 1–15, 2019, to appear, available online.
[41] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” https://arxiv.org/abs/1312.5602.
[42] V. Mnih, K. Kavukcuoglu, D. Silver, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
[43] F. Guo, F. R. Yu, H. Zhang, H. Ji, M. Liu, and V. C. M. Leung, “Adaptive resource allocation in future wireless networks with blockchain and mobile edge computing,” IEEE Trans. Wire. Commun., vol. 19, no. 3, pp. 1689–1703, Mar. 2020.
[44] C. J. Watkins and P. Dayan, “Q-learning,” Machine Learning, vol. 8, no. 3–4, pp. 279–292, May 1992.
[45] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” https://arxiv.org/abs/1511.05952.
[46] W. Liu, P. Si, E. Sun, M. Li, C. Fang, and Y. Zhang, “Green mobility management in UAV-assisted IoT based on dueling DQN,” in Proc. IEEE Int. Conf. Commun. (ICC). Shanghai, China, May 2019, pp. 1–6.
[47] Y. He, C. Liang, F. R. Yu, and Z. Han, “Trust-based social networks with computing, caching and communications: A deep reinforcement learning approach,” IEEE Trans. Network Science and Engineering, vol. 7, no. 1, pp. 66–79, Mar. 2020.

F. Richard Yu (Fellow, IEEE) received the PhD degree in electrical engineering from the University of British Columbia (UBC) in 2003. From 2002 to 2006, he was with Ericsson (in Lund, Sweden) and a start-up in California, USA. He joined Carleton University in 2007, where he is currently a Professor. He received the IEEE TCGCC Best Journal Paper Award in 2019, Distinguished Service Awards in 2019 and 2016, Outstanding Leadership Award in 2013, Carleton Research Achievement Award in 2012 and 2020, the Ontario Early Researcher Award (formerly Premier's Research Excellence Award) in 2011, the Excellent Contribution Award at IEEE/IFIP TrustCom 2010, the Leadership Opportunity Fund Award from Canada Foundation of Innovation in 2009 and the Best Paper Awards at IEEE ICNC 2018, VTC 2017 Spring, ICC 2014, Globecom 2012, IEEE/IFIP TrustCom 2009 and Int’l Conference on Networking 2005. His research interests include connected/autonomous vehicles, security, artificial intelligence, distributed ledger technology, and wireless cyber-physical systems.
He serves on the editorial boards of several journals, including Co-Editor-in-Chief for Ad Hoc & Sensor Wireless Networks, Lead Series Editor for IEEE Transactions on Vehicular Technology, IEEE Communications Surveys & Tutorials, and IEEE Transactions on Green Communications and Networking. He has served as the Technical Program Committee (TPC) Co-Chair of numerous conferences. Dr. Yu is a registered Professional Engineer in the province of Ontario, Canada, an IEEE Fellow, IET Fellow, and Engineering Institute of Canada (EIC) Fellow. The Web of Science Group has identified him as a Highly Cited Researcher. He is an IEEE Distinguished Lecturer of both Vehicular Technology Society (VTS) and Comm. Society. He is an elected member of the Board of Governors of the IEEE VTS.

Pengbo Si (Senior Member, IEEE) received his B.S. degree and Ph.D. degree from Beijing University of Posts and Telecommunications in 2004 and 2009, respectively. He joined Beijing University of Technology in 2009, where he is currently a Professor. During 2007 and 2008, he visited Carleton University, Ottawa, Canada. During 2014 and 2015, he was a visiting scholar at the University of Florida, Gainesville, FL.
Dr. Si serves as the Associate Editor of International Journal on AdHoc Networking Systems, the Editorial Board Member of Ad Hoc & Sensor Wireless Networks, and the Symposium Chair of IEEE Globecom 2019. He also served as the Guest Editor of Advances in Mobile Cloud Computing, IEEE Transactions on Emerging Topics in Computing Special Issue, TPC Co-Chair of IEEE ICCC’13-GMCN, Program Vice Chair of IEEE GreenCom’13, and TPC member of numerous conferences. His research interests include blockchain, SDN, resource management, cognitive radio networks, etc.