In edge computing, the scheduling problem involves allocating tasks offloaded by edge users to the edge servers. Scheduling tasks at the appropriate time and allocating the right resources can thus be modeled as a multi-objective optimization problem in mobile edge computing (MEC). Moreover, each task has specific requirements, which further adds to the complexity of the optimization problem.
We propose a Q-learning-based task scheduling algorithm, named PBQ, for edge users who want to offload their tasks to edge servers. To this end, we formulate the task scheduling problem in edge computing as a Markov Decision Process (MDP), where the RL agent uses Q-learning to derive the optimal policy.
A key challenge of reinforcement learning (RL) in edge computing environments is the large state space, which results in prolonged learning times due to numerous and frequently changing parameters. This delay conflicts with the edge users' demand for rapid task allocation to edge servers.
To accelerate the RL-agent learning process, we design the following techniques in the PBQ method:
- We prioritize tasks from edge users based on their deadlines, inspired by Earliest Deadline First (EDF) scheduling, so that tasks with the closest deadlines are given the highest priority in PBQ.
- We partition the tasks offloaded by edge users, intended for allocation to edge servers, into smaller subsets. The first subset comprises the tasks with the earliest deadlines, followed sequentially by subsets containing tasks with the next earliest deadlines. Consequently, PBQ formulates an MDP for each subset of tasks individually rather than for the entire set of tasks collectively.
- Regarding 'state': we generate the states for Q-learning, which consist of all possible combinations of binary offloading decisions. Each state is represented as a binary matrix with 'n' rows (the number of tasks from edge users) and 'k' columns (the number of edge servers). Each cell in the matrix indicates whether a given task is allocated to a specific edge server (a value of '1') or not (a value of '0').
- Regarding 'action': at each step, the agent selects an action based on a randomly generated number. If the number is below '0.9', it selects the next state with the highest Q-value from the Q-table; otherwise, it selects the next state at random. The Q-table is initialized with zeros at the start of the learning process.
- Regarding 'reward':
- If the 'next state' causes any edge server to become 'idle', leaving an available edge server with sufficient vacant processor utilization unused for task execution, the agent receives a negative reward.
- If the agent initially assigns t1 to S1 and t2 to S2, but reassigns them as t1 → S2 and t2 → S1 in the next state, a negative reward is incurred due to task transmission between servers. This results in data transfer overhead, increased bandwidth usage, interrupted task execution, and additional caching and queuing.
- If the 'next state' equals the 'current state' while assigned tasks still have remaining execution time, the agent receives a positive reward, as unnecessary task transmission between servers is avoided.
- If the 'next state' results in previously unassigned tasks being allocated to edge servers without causing overloading, the agent receives a positive reward. (A minimal sketch of these state, action, and reward components appears after this list.)
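To make the state, action, and reward descriptions above concrete, here is a minimal Python sketch. It is not the repository's implementation: the `Task` class, the helper names (`prioritize`, `build_states`, `choose_next_state`, `reward`), the assumption that the Q-table is indexed by (current state, next state), the per-server capacity parameter, and the +/-1 reward magnitudes are all illustrative, and the capacity conditions are simplified; only the 0.9 exploitation threshold and the reward rules follow the description above.

```python
import itertools
import random
import numpy as np

class Task:
    """Hypothetical task record: remaining execution time and absolute deadline."""
    def __init__(self, task_id, exec_time, deadline):
        self.task_id = task_id
        self.exec_time = exec_time
        self.deadline = deadline

def prioritize(tasks, m):
    """EDF-style prioritization: keep only the m tasks with the earliest deadlines."""
    return sorted(tasks, key=lambda t: t.deadline)[:m]

def build_states(n_tasks, n_servers):
    """Enumerate all n x k binary allocation matrices (2^(n*k) states).
    Only feasible for the small per-subset matrices PBQ works with."""
    return [np.array(bits).reshape(n_tasks, n_servers)
            for bits in itertools.product([0, 1], repeat=n_tasks * n_servers)]

def choose_next_state(q_table, current_idx, n_states):
    """Action selection: exploit with probability 0.9, otherwise explore randomly."""
    if random.random() < 0.9:
        return int(np.argmax(q_table[current_idx]))  # next state with highest Q-value
    return random.randrange(n_states)                # random next state

def reward(current, nxt, remaining_time, server_capacity):
    """Reward shaping following the rules listed above."""
    remaining_time = np.asarray(remaining_time)
    r = 0.0
    # Negative: an available server is left idle while tasks are still unassigned.
    if np.any(nxt.sum(axis=0) == 0) and np.any(nxt.sum(axis=1) == 0):
        r -= 1.0
    # Negative: a task with remaining execution time is migrated to a different server.
    for i in range(current.shape[0]):
        if (remaining_time[i] > 0 and current[i].sum() == 1 and nxt[i].sum() == 1
                and not np.array_equal(current[i], nxt[i])):
            r -= 1.0
    # Positive: assignments stay unchanged while tasks still have execution time left.
    if np.array_equal(current, nxt) and np.any(remaining_time > 0):
        r += 1.0
    # Positive: previously unassigned tasks are placed without overloading any server.
    newly_assigned = (current.sum(axis=1) == 0) & (nxt.sum(axis=1) == 1)
    if np.any(newly_assigned) and not np.any(nxt.sum(axis=0) > server_capacity):
        r += 1.0
    return r
```

In a full run, the agent would update the Q-table with the standard Q-learning rule using these rewards, building one MDP per deadline-ordered subset as described above.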
Priority-Based Q-Learning (PBQ) for Accelerating Reinforcement Learning-Based Task Scheduling in Edge Computing
Large action-state spaces can slow learning and require substantial memory in Q-learning, since the number of states, the number of actions, and hence the size of the action-state space grow exponentially with each additional task or server. For example, a network with four users and one server (one task per user processed at a time) has a $4\times1$ allocation matrix and therefore $2^{(4\times 1)} = 16$ possible states.
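As a quick arithmetic check of this growth under the $2^{(n\times k)}$ state count implied by the binary allocation matrix above (not code from the repository):

```python
# Number of binary n x k allocation matrices, i.e. 2^(n*k) candidate states.
for n_tasks, n_servers in [(4, 1), (5, 2), (25, 2)]:
    print(f"{n_tasks} tasks x {n_servers} server(s): {2 ** (n_tasks * n_servers):,} states")
# 4 tasks x 1 server(s): 16 states
# 5 tasks x 2 server(s): 1,024 states
# 25 tasks x 2 server(s): 1,125,899,906,842,624 states
```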
The following strategies were implemented to accelerate the learning process of the RL agent:
- States in which all tasks are offloaded to a single server are eliminated when multiple servers are available, promoting load balancing by utilizing all servers rather than overloading some while leaving others idle.
- States in which a single task is partitioned across multiple servers are eliminated, since a task occupying several servers prevents other tasks from being served (both filters are sketched after this list).
- At each scheduling step, only the $m$ tasks with the earliest deadlines are processed ($m$ = number of edge users), reducing the state space. For example, in a network with two servers and five users (each with five tasks), processing only the five earliest-deadline tasks yields a $5\times2$ matrix ($2^{(5\times 2)}$ states) instead of a $25\times2$ matrix ($2^{(25\times 2)}$ states).
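The two elimination rules above can be expressed as simple filters over the enumerated allocation matrices. The sketch below is illustrative only; `prune_states` is a hypothetical helper operating on the binary $n\times k$ state matrices described earlier.

```python
import numpy as np

def prune_states(states):
    """Drop states excluded by the PBQ pruning rules (illustrative filter)."""
    kept = []
    for s in states:                    # s is an n x k binary allocation matrix
        per_server = s.sum(axis=0)      # number of tasks placed on each server
        per_task = s.sum(axis=1)        # number of servers assigned to each task
        # Rule 1: with multiple servers available, skip states that place every
        # offloaded task on a single server (no load balancing).
        if s.shape[1] > 1 and s.sum() > 0 and np.count_nonzero(per_server) == 1:
            continue
        # Rule 2: skip states that split one task across multiple servers.
        if np.any(per_task > 1):
            continue
        kept.append(s)
    return kept
```

Combined with processing only the $m$ earliest-deadline tasks per step, these filters keep the Q-table small enough for the agent to learn quickly.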
The dataset includes user indices, task execution times, and other task attributes.
Kindly cite our paper as follows if you find this code useful:
@inproceedings{avan2023task,
title={A Task Scheduler for Mobile Edge Computing Using Priority-based Reinforcement Learning},
author={Avan, Amin and Kheiri, Farnaz and Mahmoud, Qusay H and Azim, Akramul and Makrehchi, Masoud and Rahnamayan, Shahryar},
booktitle={2023 IEEE Symposium Series on Computational Intelligence (SSCI)},
pages={539--546},
year={2023},
organization={IEEE}
}