Towards Sustainable AI
Monitoring and Analysis of Carbon Emissions in
Machine Learning Algorithms
Supervisor: Prof. Michela Meo
Candidate: Aurora Martiny
December 2023
Summary
Table of Contents
List of Tables
Acronyms
1 Introduction
  1.1 Research Objectives and Novel Contributions
  1.2 Structure of the Thesis
2 Literature Review
  2.1 Carbon Emissions and Global Warming
  2.2 Deep Learning Models
    2.2.1 Recurrent Neural Network
    2.2.2 Long Short-Term Memory
  2.3 Carbon Emissions in Deep Learning
    2.3.1 Theoretical Energy Models
    2.3.2 Energy Consumption and CO2 Relationship
    2.3.3 Carbon Emissions Calculation Tools
    2.3.4 Deep Learning's Environmental Implications
      2.3.4.1 Continuous Training's Impact
  2.4 Existing Energy-Efficient Approaches
3 Problem Statement and Dataset Description
  3.1 Traffic Data from Italian MNO
  3.2 PVWatts Energy Estimate Data in Turin
4 Methodology
  4.1 Monitoring Carbon Emissions
    4.1.1 CodeCarbon
  4.2 LSTM-Based Prediction
    4.2.1 Python Libraries Comparison
  4.3 Handling Negative Predictions in PV Panel Dataset
  4.4 Manipulating Training Hyperparameters
    4.4.1 Epochs and Input Sequence Length Ablation
  4.5 Architectural Model Design
    4.5.1 Layers and Nodes Ablation
  4.6 Tailored Problem Formulations
5 Experimental Results
  5.1 Experimental Setup
  5.2 Results
    5.2.1 Different Environments
    5.2.2 Baseline Results
    5.2.3 Manipulating Training Hyperparameters
    5.2.4 Architectural Model Design
    5.2.5 Tailored Problem Formulations
Bibliography
List of Tables
2.1 Commonly used carbon tracking tools available for online estimation afterward or at runtime in Python scripting.
2.2 Summary of the fundamental features of each tool for measuring energy and CO2 equivalents.
List of Figures
2.1 The increase in worldwide emissions from the middle of the 18th century to 2021 [34].
2.2 A neural network consisting of a single input, one output, and two hidden layers, with each hidden layer comprising three hidden units.
2.3 Difference in information flow between an RNN and a feed-forward neural network.
2.4 Basic instance of an RNN.
2.5 The memory element within the LSTM architecture.
2.6 Overview of energy flow to power DL computations.
2.7 Comparison of global electricity consumption in 2012 with the IT sector's energy usage in billion kilowatt-hours (kWh) [62].
Acronyms
AI Artificial Intelligence
BS Base Station
CNN Convolutional Neural Network
CO2 Carbon Dioxide
DL Deep Learning
GHG Greenhouse Gas
GPU Graphics Processing Unit
IT Information Technology
LSTM Long Short-Term Memory
MAE Mean Absolute Error
ML Machine Learning
NLP Natural Language Processing
NREL National Renewable Energy Laboratory
NN Neural Network
RNN Recurrent Neural Network
Chapter 1
Introduction
The rationale behind assessing and mitigating the environmental impact of Artificial
Intelligence (AI) systems arises from the exponential growth in their usage. In
fact, the rapid and expansive growth of AI has led to big changes across various
sectors, reshaping how we live and work. The term “AI” encompasses a spectrum
of technologies designed to simulate human intelligence, enabling systems to learn,
analyze, and adapt to complex tasks. The recent explosion of AI is indeed largely
attributed to advancements in technology and the unprecedented availability of
data [1]. The exponential growth in computing power, particularly with the advent
of GPUs (Graphics Processing Units) and specialized hardware for AI tasks, has
significantly accelerated the training and deployment of complex AI models [2],
[3]. Moreover, the proliferation of digital data in various forms, including text,
images, videos, and sensor data, has provided the fuel necessary for training AI
algorithms. This abundance of labeled datasets has also been encouraged by the
rise of data-sharing platforms [4].
However, this accelerated development in AI technologies has cast a spotlight on
energy consumption and its environmental implications. The necessity to craft
AI systems that not only offer remarkable performance but also curtail their
carbon footprint has never been more crucial. Achieving this demands a thorough
exploration into energy-efficient Machine Learning (ML) algorithms [5]. This
exploration forms the basis of this thesis, which aims to advance sustainability in
AI through monitoring and analyzing carbon emissions from ML algorithms.
ML is closely related to the broader field of AI due to their overlapping capabil-
ities such as learning and decision making. In early research on computational
intelligence, ML was born as a subset within the domain of AI in the late 1970s
[6]. At this stage, the ML goal of developing systems that could learn and make
judgments in an automated fashion was seen as one of the main objectives under
the AI umbrella.
However, as ML progressed through empirical studies on algorithms derived from
data rather than symbolic representation, it started to differentiate and specialize
compared to logic-based AI approaches [7]. By the 1980s, ML began to establish it-
self as a distinct discipline focusing on creating systems that use statistical inference
to learn from examples, as opposed to expert systems designed using logic-based
rule sets [8]. Nowadays, ML has evolved into a distinct subject with its theoretical
foundations and tools, although it remains tied to AI through shared automation
objectives. ML exemplifies how subfields can emerge within a broader research
domain and mature at different paces. ML is universally defined as a broad set of
algorithms and statistical techniques that enable computer systems to automatically
improve tasks through experience, without needing to be explicitly programmed [9].
At a high level, ML algorithms build mathematical models upon exposure to large
amounts of data. ML’s data-centric paradigm now complements symbol-based AI
methods with the overarching goal of developing intelligent systems.
ML, the field centered around enabling systems to learn and improve from experi-
ence or data ([1, 3]), encompasses diverse techniques leading to different learning
paradigms, as shown in Fig. 1.1. Among these, supervised learning involves training
models on labeled datasets, where each input data is associated not only with
features but also with a corresponding output or target label [10]. In this way, the
model learns to map inputs to outputs based on the provided examples [1]. This
learning paradigm comes in two main types: classification (predicting a category) and regression (predicting a numeric value).
Figure 1.2: Prediction of a target numeric value based on a single feature [9].
Unsupervised learning, on the other hand, delves into uncovering hidden structures
or patterns within unlabeled data [12]. Thus, the algorithm has to extract the
meaning of data without the labels’ help [1]. Key techniques in unsupervised
learning include:
• Clustering: Clustering algorithms group similar data points together based on
their features, but without prior knowledge of categories [9]. These algorithms
identify clusters or groups within the data, where data points within the same
cluster are more similar to each other than to those in other clusters. For
instance, the k-means algorithm defines the similarity based on the distance
between points [1].
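As a sketch of the distance-based similarity used by k-means, here is a minimal 1-D implementation in plain Python (naive initialisation from the first k points; all data values are illustrative):

```python
def kmeans(points, k, iters=10):
    """Minimal 1-D k-means: similarity is the distance between points."""
    centroids = list(points[:k])  # naive initialisation: first k points
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Two well-separated groups: k-means recovers centroids near 1.0 and 8.0.
centroids, clusters = kmeans([1.0, 1.1, 0.9, 8.0, 8.2, 7.8], 2)
```

Real applications would use a library implementation (e.g. scikit-learn) with smarter initialisation, but the assignment/update loop above is the core of the algorithm.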
Reinforcement learning introduces a dynamic aspect, where agents (i.e. the learning
methods) learn optimal decision-making through interactions with their environ-
ment [9]. These agents improve their policies by navigating through trial-and-error
experiences, aiming to maximize rewards in a given environment [13]. This paradigm
finds application in various domains, from robotics to game-playing algorithms
like AlphaGo, where the system learns to make strategic moves through repeated
gameplay [3].
Representation learning emerges as a pivotal subset of methods within ML, allowing
systems to automatically extract meaningful features from raw data [14]. These
features are essential in tasks such as classification or detection. Overall, these
diverse techniques address distinct aspects of learning: supervised for labeled data,
unsupervised for hidden patterns, reinforcement for dynamic decision-making, and
representation for feature extraction.
Deep Learning (DL), a subset of ML, stands out as a pivotal technology driving
the evolution of AI. It harnesses artificial neural networks with multiple layers,
allowing for representation learning through hierarchies of abstraction [14]. These
deep neural networks can progressively derive higher-level representations by com-
posing simple transformations on lower-level inputs [15]. During the years, the
development of robust programming platforms such as TensorFlow and PyTorch
and the advancements in hardware capabilities have facilitated the creation of more
sophisticated deep models [16]. A recent study [2] examines how DL has become
increasingly prevalent in papers presented at the prestigious AI conference ACL¹
between 2010 and 2020: today, deep networks serve as the backbone of virtually
all its publications.
Nowadays, model training requires significant computational time and power. This
is due to models’ increasing complexity, driven by factors such as the number of
parameters. In fact, when there is access to large-scale datasets, the easiest
technique to obtain better performance is to increase the size of the model [2]. As the
training duration grows, the required computational power increases accordingly [17].
Additionally, if the model performs continuous learning, the computing cost may
increase even further [18]. In this context, the realm of ML has largely centered
around the pursuit of highly accurate models without giving due consideration to
energy consumption as a significant factor [19]. Nevertheless, ML algorithms require
vast amounts of computational resources for model training on large datasets. This
data-intensive paradigm of learning also means that ML models demand high
power both for training iteratively over vast datasets as well as inference during
deployment at scale [20].
In classical computer science fields, algorithmic progress is rigorously tracked
through asymptotic analysis of how cost scales with problem size. For example,
quicksort [21] has a clearly more efficient O(n log n) average-case runtime compared
to O(n²) sorting algorithms. DL tasks, however, pose unique challenges: solutions
are approximate, problem difficulty is hard to define, and progress is hard to
assess. Improvements are reported primarily in accuracy metrics rather than in
cost scaling.
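The contrast in cost scaling can be made concrete by counting comparisons; the counters below are rough, illustrative instrumentation, not a formal analysis:

```python
def selection_sort(xs):
    """Theta(n^2) comparisons regardless of input order."""
    xs, comparisons = list(xs), 0
    for i in range(len(xs)):
        best = i
        for j in range(i + 1, len(xs)):
            comparisons += 1
            if xs[j] < xs[best]:
                best = j
        xs[i], xs[best] = xs[best], xs[i]
    return xs, comparisons

def quicksort(xs):
    """O(n log n) comparisons on average (middle-element pivot)."""
    comparisons = 0

    def qs(seq):
        nonlocal comparisons
        if len(seq) <= 1:
            return seq
        pivot = seq[len(seq) // 2]
        comparisons += len(seq)  # roughly one comparison per element per level
        left = [x for x in seq if x < pivot]
        mid = [x for x in seq if x == pivot]
        right = [x for x in seq if x > pivot]
        return qs(left) + mid + qs(right)

    return qs(list(xs)), comparisons

# On 200 reversed elements, quicksort needs far fewer comparisons.
data = list(range(200, 0, -1))
sorted_fast, c_fast = quicksort(data)
sorted_slow, c_slow = selection_sort(data)
```

For DL training there is no analogous comparison counter, which is precisely why carbon and energy metrics are needed alongside accuracy.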
1 https://2023.aclweb.org/
For these reasons, the concept of sustainable AI has emerged. It refers to the
development and application of AI systems that are environmentally conscious,
economically viable, and socially beneficial. It represents a concerted effort to
instigate transformation across every stage of the life cycle of AI products. This en-
compasses the inception of ideas, the training phase, adjustments, implementation,
and overarching governance, all geared towards enhancing ecological sustainability
and promoting social equity. Sustainable AI goes beyond the scope of AI applica-
tions; rather, it encapsulates the entire socio-technical framework of AI [30].
Energy-efficient AI, an integral component of sustainable AI, focuses on reducing
the energy consumption and carbon footprint of AI systems and infrastructure.
Recognizing the environmental implications of AI technologies, there exists a grow-
ing necessity to evaluate and comprehend the energy consumption and associated
carbon emissions of these systems.
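As a rough back-of-the-envelope sketch of such an evaluation (the 250 W figure is an illustrative assumption, not a measured value):

```python
def energy_kwh(power_w, duration_s):
    """Energy in kWh from an (assumed constant) power draw over a duration."""
    return power_w * duration_s / (1000.0 * 3600.0)

# e.g. a GPU drawing 250 W over a two-hour training run
training_energy = energy_kwh(250.0, 2 * 3600)  # 0.5 kWh
```

Real measurement tools (discussed in Chapter 2) sample power over time instead of assuming it constant, but the kWh conversion is the same.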
Fischer et al. [31] proposed an approach to assess the efficiency of any ML ex-
periment by considering it as a composite entity comprising a configuration and
environment. The configuration involves the specifics related to the task at hand,
encompassing aspects such as the type of task (inference, training, robustness
testing), the dataset employed, the model used, and all associated hyperparame-
ters. On the other hand, the environment pertains to the hardware and software
utilized during the execution of the experiment. Patterson et al. [26] suggested
that accounting for carbon emissions within this framework could be achieved by
incorporating the local energy mix as a part of the environment. The concept of an
energy mix delineates the distribution of available production from energy resources
to fulfill the energy requirements within a specific geographic area [32], as depicted
in Fig. 1.3. Primary energy sources encompass fossil fuels such as oil, natural gas,
and coal, alongside nuclear energy, waste utilization, and a diverse array of renew-
able energy sources, including biomass, wind, geothermal, water, and solar power.
Fig. 1.4 shows the global primary energy use from 1800 to 2022. This adjustment
recognizes the impact of the energy sources used during the experiment execution
on carbon emissions. The properties utilized to gauge the efficiency of a task are
termed metrics, such as accuracy, model size, and power draw [33]. These metrics
are specific to the experiment configuration. By conducting experiments with
varying configurations under a fixed task and environment, researchers can compare
their efficiency. Additionally, exploring how a particular model performs across
different environments can also offer insights into its adaptability and efficiency.
However, it is important to note that certain configurations and environments
might not be feasible due to practical constraints. For instance, the choice of a
dataset might impose the usage only of certain models, and, at the same time,
specific models could necessitate particular software or hardware for execution.
This integration of the local energy mix within the environmental parameters
acknowledges its influence on the overall environmental footprint of the experiment.
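This accounting can be sketched numerically: the environment's energy mix yields an average carbon intensity, which converts measured energy into emissions. All shares and intensity figures below are illustrative assumptions, not real grid data:

```python
def grid_carbon_intensity(energy_mix):
    """Average kg CO2eq per kWh of a local energy mix, weighting each
    source's carbon intensity by its share of production."""
    return sum(share * intensity for share, intensity in energy_mix.values())

def experiment_emissions_kg(consumed_kwh, energy_mix):
    """Carbon emissions attributable to an experiment's energy use."""
    return consumed_kwh * grid_carbon_intensity(energy_mix)

# Hypothetical mix: (share of production, kg CO2eq per kWh) per source.
mix = {
    "coal":  (0.30, 0.82),
    "gas":   (0.40, 0.49),
    "solar": (0.20, 0.04),
    "wind":  (0.10, 0.01),
}
```

With this hypothetical mix, the same 10 kWh experiment emits far more carbon than it would on a grid dominated by renewables, which is exactly the effect of including the energy mix in the environment.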
This thesis delves into the realm of sustainable AI, particularly focusing on the crucial aspect of
monitoring and analyzing the carbon emissions produced by specific ML algorithms.
The overarching goal of this thesis is to analyze a critical domain within the realm
of ML model training, specifically focusing on time-series datasets derived from
network data. Understanding how parameters impact carbon emissions during both
the training phases and potentially in the subsequent inference phase is crucial.
Obtaining a comprehensive overview of the parameters to monitor provides a
starting point for similar analyses in comparable contexts.
The novel contributions of this study are the demonstration of the following
statements:
• There usually exists a trade-off between a model’s carbon emissions and its
performance.
• The first step in evaluating carbon emissions should be to use already available
open-source resources.
• Specific training hyperparameters should be considered when the analyses
focus on energy efficiency.
• Adjustments in a model’s architecture can impact its carbon emissions pro-
duction during both training and test phases.
• Key considerations about potential energy consumption should be made when
initially formulating the problem.
• Each setting can have a different impact in terms of emissions during the
training and inference phase.
By delving into these questions, this study aims to shed light on the factors
influencing carbon emissions mainly during the training of ML models, offering
insights into the manipulation of model architecture for reduced energy consumption.
Moreover, it endeavors to scrutinize the implications of the initial problem setup.
This exploration lays the foundation for understanding and potentially mitigating
the environmental impact of ML processes on a broader scale.
Chapter 2
Literature Review
Figure 2.1: The increase in worldwide emissions from the middle of the 18th
century to 2021 [34].
The burning of non-renewable fossil fuels – coal, oil, and natural gas – remains a
primary contributor to the escalating levels of CO2 in the Earth’s atmosphere. This
surge in emissions stands as a consequence of unsustainable human activities that
perpetuate the combustion of finite resources formed over millions of years, unable
to replenish at the rate of consumption [38]. In addition, widespread deforestation
1 https://unfccc.int/
has further exacerbated the release of carbon emissions, disrupting the delicate
balance of the planet’s ecosystems. The repercussions of these actions are profound,
as the surge in GHG has intensified the process of global warming, triggering a
cascade of environmental changes that pose a significant threat to our planet’s
future. The impact is twofold: a surge in emissions directly linked to energy
consumption, and the collateral damage from depleting natural ecosystems.
The surge in carbon emissions has disrupted the delicate balance of the Earth’s
climate systems, leading to a myriad of consequences, not only global warming.
These include a surge in the frequency and intensity of extreme weather events
such as hurricanes, droughts, and heat waves, as confirmed also by the European
Commission². Moreover, the rise in global temperatures has accelerated the melting
of polar ice caps and glaciers, contributing to the alarming rise in sea levels.
These shifts in weather patterns have disrupted the delicate equilibrium of
ecosystems, which, in turn, is driving the extinction of numerous plant and
animal species. The repercussions of global warming and associated
weather changes are far-reaching, affecting not only the environment but also
human health, agriculture, and the economy. Recent records from the National
Center for Environmental Information in 2023 (as of November 8) underscore
the staggering toll of 25 confirmed weather and climate disaster events in the
United States, each causing losses surpassing $1 billion [39]. Moreover, the impact
of global warming transcends geographical boundaries, affecting both developed
and developing nations. The disproportionate burden of climate change is often
borne by marginalized communities and vulnerable populations, exacerbating social
inequalities and economic disparities. In regions prone to extreme weather events,
2 https://climate.ec.europa.eu/climate-change/consequences-climate-change_en
such as coastal areas and small island nations, the rise in sea levels poses an
imminent threat to livelihoods and infrastructure. Furthermore, the disruption of
agricultural patterns and water resources jeopardizes food security and exacerbates
the risk of famine in vulnerable regions.
The consequences of global warming extend beyond environmental and social
dimensions, also influencing economic sectors and geopolitical stability.
The increasing frequency of natural disasters places a significant strain on na-
tional economies and infrastructure, impeding long-term sustainable development.
Moreover, the geopolitical ramifications of climate-induced migration and resource
scarcity may exacerbate tensions and conflicts in regions already grappling with
political instability.
The interconnected nature of these challenges underscores the imperative for collec-
tive action to address the root causes of global warming and mitigate its far-reaching
impact. In general, the influence of human actions is profound: the surge in
greenhouse gas emissions has given rise to a myriad of environmental changes
that threaten the planet’s delicate balance in every respect, underscoring the
urgency of implementing sustainable solutions to mitigate carbon emissions.
Figure 2.2: A neural network consisting of a single input, one output, and two
hidden layers, with each hidden layer comprising three hidden units.
Fig. 2.2 illustrates the architecture of a deep neural network designed with a
specific configuration, featuring an input layer, two hidden layers, and an output
layer. Each layer contains a defined number of units or neurons, showcasing the
intricate connections that enable the network to learn and make predictions.
At the beginning of the network is the input layer. This layer represents the initial
information fed into the neural network. In the context of the example, it can be
imagined as the features of a dataset, where each feature corresponds to a specific
aspect of the input data.
Moving into the heart of the network, there are two hidden layers, each comprising
three hidden units. These hidden layers play a crucial role in learning complex
patterns from the input data. The connections between units in adjacent layers
are associated with weights, which are adjustable parameters learned during the
training process.
At each hidden unit, the network performs a two-step process. First, it calculates
the weighted sum of the inputs, considering the associated weights. This step
involves multiplying each input by its corresponding weight and summing up these
products. It reflects the network’s ability to assign importance to different features
based on their impact on the learning task. The second step involves applying a
non-linear activation function to the weighted sum. This introduces non-linearity
to the model, enabling the network to capture complex relationships in the data.
Common activation functions include sigmoid, tanh, or ReLU (Rectified Linear
Unit).
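The two-step computation just described (weighted sum, then non-linear activation) can be sketched in plain Python; the toy weights below mirror the Fig. 2.2 topology and are illustrative values, not trained parameters:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: for each unit, a weighted sum of the
    inputs plus a bias, passed through a non-linear activation."""
    return [activation(sum(w * x for w, x in zip(unit_w, inputs)) + b)
            for unit_w, b in zip(weights, biases)]

# Toy network mirroring Fig. 2.2: one input, two hidden layers of three
# tanh units each, and a single sigmoid output unit (binary classification).
W1, b1 = [[0.5], [-0.3], [0.8]], [0.0, 0.1, -0.1]
W2, b2 = [[0.2, -0.4, 0.6], [0.7, 0.1, -0.2], [-0.5, 0.3, 0.4]], [0.0, 0.0, 0.0]
W3, b3 = [[1.0, -1.0, 0.5]], [0.0]

def forward(x):
    h1 = dense([x], W1, b1, math.tanh)      # first hidden layer
    h2 = dense(h1, W2, b2, math.tanh)       # second hidden layer
    return dense(h2, W3, b3, sigmoid)[0]    # output layer
```

Frameworks such as TensorFlow and PyTorch implement exactly this computation with matrices and learned weights; the sketch only makes the two steps explicit.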
The final layer, known as the output layer, produces the network’s prediction or
output. The number of units in this layer depends on the nature of the task. For
example, in a binary classification task, there might be one unit with a sigmoid
activation function, while a multiclass classification task could involve multiple
units with softmax activation.
In traditional ML modeling, feature extraction often relies on manual efforts,
requiring domain experts to identify and design relevant features for the model.
However, the landscape shifts in DL, where feature extraction takes on an automated
character [45]. The motivation for adopting DL, particularly in scenarios involving
data such as images represented by low-level features like pixel intensities, stems
from the pursuit of achieving a higher-level understanding of the data. Unlike
traditional ML, DL models leverage their architecture to automatically learn and
extract hierarchical representations of features from raw data, eliminating the need
for explicit manual feature engineering. This shift in paradigm allows DL models to
uncover complex patterns and representations in data, contributing to their efficacy
in various domains. On the other hand, this translates to a strong correlation
between data quantity and performance: the latter is often unsatisfactory when
dealing with limited volumes of data [14].
DL is also driven by the concept of multi-task learning, which posits that an
effective higher-level data representation should be applicable across various tasks
[1]. For this reason, DL models often share initial levels of the network among
different tasks. In a typical deep neural network architecture, layers of logical units
are interconnected. In a fully connected layer, the output of each unit within the
layer is linked to the input of every unit in the subsequent layer.
The ability to build complex Neural Networks (NN), supported by the creation
of larger datasets, has been one of the main drivers behind the incredible advances
in ML technology [4]. Representative NNs today include the Convolutional Neural
Network (CNN) [46] and the Recurrent Neural Network (RNN) [1]. This thesis
will concentrate on RNNs out of all the possible NNs. The choice is justified by
their prevalent use within this research context and their capacity to address the
specific requirements of this study.
far [11]. This is achieved through the incorporation of a Hidden Layer within the
network, where the key feature lies in the RNN’s Hidden State – a ‘memory state’
retaining knowledge from previous inputs. This characteristic bears a striking
resemblance to human cognitive processes, which likewise integrate information
across different time instances. Thus, RNNs represent a significant advancement
in NN architecture, offering a dynamic solution for handling sequential data.
Fig. 2.3 shows the difference between an RNN and a feed-forward neural network.
The recurrent update can be written as

Li = f(Li−1, Xi)    (2.1)

In this formula, Li is the current state, which is a function of the preceding state
Li−1 and the input Xi.
A feed-forward neural network, similar to various deep learning algorithms, employs
a weight matrix for its inputs to generate an output. However, in contrast to
this, RNNs assign weights to both the present and preceding inputs. Moreover,
a recurrent neural network adjusts the weights for both gradient descent and
backpropagation across time.
Incorporating the activation function in formula 2.1, the current hidden state
Li can be expressed as:

Li = tanh(Whh · Li−1 + Wxh · Xi)    (2.2)

Here, W represents the weight, L signifies the single hidden vector, Whh denotes
the weight applied to the previous hidden state, Wxh is the weight associated with
the current input state, and tanh represents the activation function. This activation
function introduces non-linearity and effectively compresses the activations into
the range [−1, 1].
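This recurrent update can be sketched in plain Python with scalar weights (a readability assumption; real RNN layers use weight matrices, and the weight values below are illustrative):

```python
import math

def rnn_step(prev_state, x, w_hh, w_xh):
    """One recurrent update, Li = tanh(Whh * Li-1 + Wxh * Xi),
    with scalar weights for readability."""
    return math.tanh(w_hh * prev_state + w_xh * x)

def run_rnn(inputs, w_hh=0.5, w_xh=1.0, initial_state=0.0):
    """Feed a sequence through the recurrence, collecting hidden states."""
    state, states = initial_state, []
    for x in inputs:
        state = rnn_step(state, x, w_hh, w_xh)
        states.append(state)
    return states
```

Note how each state depends on the previous one: the same shared weights are reused at every time step, which is why the parameter count stays constant as the sequence grows.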
Since RNNs exhibit the ability to remember and utilize preceding information,
where inputs and outputs are not independent, they emerge as the solution to
predict subsequent words in a sentence or sequence. Moreover, the utilization of
shared parameters across all inputs or hidden layers decreases complexity and the
number of parameters remains constant even as the number of time steps in the
sequence expands [3].
On the other hand, as data progresses through an RNN, certain information is
progressively eroded at each time step. Over time, the RNN’s state gradually
loses any trace of the initial inputs, functioning essentially as a short-term memory
network. This issue is primarily associated with the unstable gradient problem,
where the gradient diminishes as it traverses backward through layers, significantly
slowing down learning in earlier layers. In RNNs, this issue is exacerbated since
the gradients not only flow backward through layers but also extend backward
through time. This prolonged network operation increases the instability of the
gradient, making it extremely intricate for effective learning [15]. To mitigate this
issue, long-term memory cells have been introduced to address the limitations
of conventional RNNs. These newer cell architectures have demonstrated such
effectiveness that the base cells are no longer in use.
Similar to many other deep learning algorithms, RNNs are relatively old. The
history of RNNs starts in the 1980s, but it is only in recent years, driven by increased
computational power and abundant data resources, that we have recognized their
remarkable potential. Furthermore, the introduction of Long Short-Term Memory
(LSTM) networks in the 1990s has pushed RNNs to the forefront of deep learning.
In particular, RNNs’ intrinsic memory enables precise prediction and understanding
of sequential data in various domains, including time series, speech, text, financial
data, audio, video, weather, and more. Their unique ability to understand temporal
dynamics surpasses the spatial content focus of other algorithms. In the words of
Lex Fridman from MIT, RNNs shine when ‘the temporal dynamics that connect
the data are more important than the spatial content of each individual frame’.
The LSTM cell comprises three gates: the forget gate, the input gate, and the output gate. Each gate has a specific function in regulating the
flow of information within the cell state.
The forget gate decides what past information to discard, taking into account the
current input data xt and the previous hidden state ht−1 . It filters out less relevant
information, allowing the LSTM to prioritize essential elements.
The input gate processes new input data xt and decides what new information is
essential to remember, creating a “new memory update vector” Ct . This vector is
then integrated into the cell state, modifying it according to the relevance of the
new information.
Lastly, the output gate generates the new hidden state ht by combining the updated
cell state, the current input data, and the previous hidden state.
This way, the LSTM network effectively manages the long-term memory and
short-term working memory, enabling it to process sequences by remembering
and selectively using information, by employing basic addition or multiplication.
For this reason, it offers an enhancement over the conventional RNN method and
can be used for sequence prediction tasks, language modeling, and time-series
analysis where understanding dependencies between elements over time is crucial.
Furthermore, these gating mechanisms help maintain the steepness of the gradients,
preventing them from diminishing significantly as they backpropagate through the
network. By doing so, LSTMs effectively address the vanishing gradient problem,
ensuring that training remains effective and faster, and that the network maintains
high accuracy in learning temporal dependencies over long sequences.
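The gate interactions above can be sketched as a scalar LSTM step in plain Python (bias terms are omitted, the weight names in `p` are hypothetical, and all values are illustrative; real cells use weight matrices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One scalar LSTM step following the gate description above."""
    f = sigmoid(p["wf_x"] * x_t + p["wf_h"] * h_prev)          # forget gate
    i = sigmoid(p["wi_x"] * x_t + p["wi_h"] * h_prev)          # input gate
    c_tilde = math.tanh(p["wc_x"] * x_t + p["wc_h"] * h_prev)  # new memory update vector
    c_t = f * c_prev + i * c_tilde                             # updated cell state
    o = sigmoid(p["wo_x"] * x_t + p["wo_h"] * h_prev)          # output gate
    h_t = o * math.tanh(c_t)                                   # new hidden state
    return h_t, c_t
```

The cell state is updated only by multiplication (forgetting) and addition (new memories), which is what keeps gradients from vanishing as they flow backward through time.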
activities. Overall, the types of fuels employed by power plants within the grid
directly impact the carbon emissions associated with each computing device. This
underlines the criticality of comprehending and accounting for the diversity of
energy sources in assessing the environmental impact of local computation.
3 https://www.green-algorithms.org/
Table 2.1: Commonly used carbon tracking tools available for online estimation afterward or at runtime in Python scripting. The tools compared are:
• CodeCarbon
• Eco2AI
• experiment-impact-tracker
• energyusage
4 https://mlco2.github.io/impact/
address this limitation, users are encouraged to contribute missing TDP values
by submitting a pull request for database updates [55].
3. CodeCarbon⁵ [56]: provides insights into the energy consumed by CPUs
and GPUs during software execution. To measure CPU energy consumption,
CodeCarbon relies on RAPL (Running Average Power Limit) files or on Power
Gadget, specifically for INTEL CPUs, and requires root access. However, if
access is not available, CodeCarbon utilizes the model of the CPU in order
to find the TDP. In case the model is unknown, the tool uses a fixed value
of 85W. For GPU energy consumption estimation, CodeCarbon utilizes the
pynvml library, specifically designed for NVIDIA GPUs. Access to either the
RAPL files or NVIDIA GPUs is necessary for CodeCarbon to accurately track
and report energy consumption [56].
4. Carbontracker6 [57]: specializes in tracking the energy consumption of CPUs
and NVIDIA GPUs during model training. It leverages RAPL files, available for
Intel CPUs with root access, to report CPU energy consumption; without access
to these files, CPU measurement becomes unavailable. For NVIDIA GPUs,
Carbontracker uses the pynvml library to measure energy consumption, and
therefore does not support non-NVIDIA GPUs [57].
5. Eco2AI7 [58]: offers a comprehensive approach to monitoring the energy
consumption of both CPUs and NVIDIA GPUs. It calculates equivalent carbon
emissions by considering regional emission coefficients. For CPU energy consumption
estimation, Eco2AI uses a model-based approach, searching a predefined list for
the corresponding TDP; where the TDP is unavailable, it employs an average
approximation of 100 W. For GPU energy consumption, Eco2AI relies on the
pynvml library, tailored for NVIDIA GPUs, and thus cannot measure
non-NVIDIA GPUs [58].
6. experiment-impact-tracker8 (EIT) [59]: specializes in tracking CPU and
NVIDIA GPU energy consumption during computational tasks. For CPU
energy measurements, EIT relies on RAPL files, accessible for Intel CPUs
with root access on Linux operating systems; it therefore does not offer CPU
energy monitoring for non-Intel CPUs or non-Linux systems.
5 https://codecarbon.io/
6 https://github.com/lfwa/carbontracker
7 https://github.com/sb-ai-lab/Eco2AI
8 https://github.com/Breakend/experiment-impact-tracker
When measuring NVIDIA GPU energy consumption, EIT uses the nvidia-smi
command-line tool, which is exclusive to NVIDIA GPUs; it does not extend its
measurement capabilities to non-NVIDIA GPUs [59].
9 https://pypi.org/project/cumulator/
10 https://pypi.org/project/energyusage/
Tool | CPU support | CPU measurement | GPU support | GPU measurement
CodeCarbon | All CPUs | RAPL and Power Gadget | NVIDIA GPUs only | pynvml
Green Algorithms | All CPUs | – | All GPUs | –
Eco2AI | All CPUs | – | NVIDIA GPUs only | pynvml
ML CO2 Impact | No CPUs | – | All GPUs | –
Table 2.2: Summary of the fundamental features of each tool for measuring energy
and CO2 equivalents.
amount of global electricity, as depicted in Fig. 2.7. This highlights the pressing
requirement for thorough monitoring and analysis to reduce carbon emissions in
ML algorithms, which is essential in managing the environmental impact of this
fast-growing technological field.
Overall, this growth comes at a significant environmental cost, since the sharp
upsurge in computational needs has led to a pronounced increase in energy
consumption. In fact, these recent developments – the availability of larger
amounts of data and the development of more complex model architectures – have
had two main consequences: a rise in energy-intensive data storage solutions,
driven by the continuous need for data retrieval and transmission, data redundancy,
and the maintenance of cooling systems in data centers; and a heightened need for
computational power to sustain long training sessions.
Considering also that the energy sector is the primary source of worldwide
GHG emissions, the huge environmental impact deriving from these advancements
is evident. Together, the aforementioned studies [63, 64] shed light on the
prohibitive expenses linked to this trend, which align with the ‘Red AI’ framework
– one focused on enhancing accuracy by leveraging extensive computational
resources, often overlooking the associated costs or environmental implications [64].
Additionally, since a large portion of the research community cannot afford the nec-
essary resources, the cumulative effect not only has an impact on the environment
but also creates barriers to the development of AI. While hardware improvements
have enabled training larger models with billions of parameters like GPT-3 [65, 66],
optimizing such networks requires immense computing resources that are difficult
and expensive to scale further [67]. This may suggest that the growth rate in DL’s
consumption of computational power may slow down [16]. Moreover, the limited
supply of specialized AI chips also imposes constraints.
Despite international agreements like the UNFCCC and the Kyoto Protocol [68],
GHG emissions surged at a faster rate from 2000 to 2010 compared to the preceding
decade. Specifically, the annual growth of GHG emissions in the global energy
supply sector increased from 1.7% per year between 1990 and 2000 to 3.1% per year
from 2000 to 2010 [27]. Thus, considering the conjunction of the general problem of
GHG emissions and the vast usage of DL methods, addressing the growing energy
demand within this field is even more crucial. Exploring avenues to enhance energy
efficiency in DL models is important in order to mitigate this global environmental
impact. Additionally, raising awareness among practitioners about the energy and
carbon footprint of these models could prompt proactive steps toward reducing
environmental implications.
For instance, in order to continue advancing capabilities while addressing these
sustainability concerns, researchers are exploring alternatives beyond the constant
upscaling of computing. Different works [33, 20] have found algorithmic efficiency
to be a promising avenue, showing reductions in operations needed to match
past baselines. Hardware optimizations also multiply efficiency gains significantly
over time. Researchers are likely to turn to problem-focused methods leveraging
these algorithmic innovations rather than relying primarily on "brute force" scaling
11 https://2021.eacl.org/news/green-and-sustainable-nlp
strategies to understand and address these implications within the NLP field.
A recent study [25] investigates the carbon footprint of developing an NLP model
through its multi-stage training process. The researchers found that the total CO2
emissions from a model’s training iterations could be equivalent to what would
be produced by 5 cars over their usable lifetimes. It also equaled the emissions
from over 300 flights between two major cities. This highlights the environmental
impact of developing even a single sophisticated AI system. While the goal of AI
research is often to optimize for metrics like accuracy, this work underscores the
need to also consider sustainability.
Rohde et al. [70] have profiled the energy demands of tasks in computer vision,
speech recognition, and gaming. Models that handle more complex problems require
much more intensive computations. The computing power is quantified in petaflop/s-
days, i.e., sustaining 10^15 floating-point operations per second for one day (roughly
10^20 operations in total). More intricate AI architectures directly translate to
higher energy usage, GHG emissions, and resource expenditure.
Extending our lens beyond NLP, the domain of time-series forecasting stands as
another critical area with significant considerations. Here, continual training is
pivotal in enhancing models’ predictive abilities over time. However, the ramifi-
cations of energy-intensive training processes transcend mere model development,
encompassing broader environmental and financial implications. Notably, the literature
lacks comprehensive studies addressing the energy consumption and emissions
produced by algorithms applied in this domain, signaling a gap in understanding
the environmental footprint of time-series forecasting models.
Collectively, these studies demonstrate the significance of evaluating new AI tech-
niques based on their carbon footprint in addition to their predictive capabilities.
Energy efficiency must be a priority as models continue increasing in scale and
sophistication.
Chapter 3
Problem Statement and Dataset Description
Table 3.1 displays the considered zones and their corresponding colors as represented
on the map. Each zone is associated with a specific color for visual identification
on the map. The zones serve as microcosms, representing the diverse areas found
within an urban environment. For instance, the Politecnico di Milano (Polimi)
area, marked by the light green square in the figure, denotes an area frequented by
students, experiencing heightened activity levels during specific times of the day.
In contrast, other zones represent a mix of business districts, residential streets,
train station areas, soccer stadiums, university campuses, industrial sectors, and
exhibition venues, each exhibiting its own traffic patterns and behaviors.
For example, the business district (dark green) witnesses traffic peaks during
core business hours, while the residential zone (yellow) observes increased traffic
in the evening. Similarly, the train station area (purple) reflects high activity,
primarily coinciding with the start and end of typical working hours. The San Siro
neighborhood (grey), home to the soccer stadium, presents sporadic and fluctuating
traffic volumes depending on event schedules.
The visualization in Fig. 3.2 depicts the proportional distribution of traffic across
individual hours, considering the entirety of a day within the business area. Each
data point represents the percentage of traffic observed during a specific hour
relative to the total traffic recorded throughout the day. This analysis offers insight
into the hourly traffic patterns within the business area: the data illustrates a
consistent increase in traffic volume from 8 am, reaching its peak around 1 pm,
followed by a gradual decline. This pattern corresponds with typical working hours,
depicting heightened activity during the morning and early afternoon, gradually
tapering off later in the day.
Within each zone, one macro BS along with 6 micro BSs is considered.
Both types of station are components of cellular networks, each serving
distinct coverage areas and functions within the network infrastructure:
• Macro base stations are large cell towers strategically positioned to cover larger
geographic areas (up to 35 km in radius), such as neighborhoods, towns, or urban areas [77].
They provide wide-area coverage and are usually installed at higher elevations
to maximize coverage range. These base stations handle high-capacity data
and voice traffic, serving a large number of mobile devices within their coverage
area.
• Micro base stations are smaller in size and cover more localized areas compared
to macro stations. They are deployed in areas with high user density or where
additional capacity is needed. Micro base stations have lower power and cover
shorter distances compared to macros – ranging from a few meters to one or
two kilometers [77].
They might be used concurrently in the same geographic area, and both are
integral to maintaining an efficient and reliable cellular network. Indeed, they work
together to provide seamless connectivity across different scales and user demands.
This configuration, with one macro and several micro BSs, ensures that the service
area is effectively covered by one macro cell overlapping with smaller cells, thus
enabling comprehensive network coverage across various zones.
The dataset organizes each BS’s data into a format containing several time-varying
metrics for analysis. Each entry is associated with a Unix timestamp, acting
as the temporal reference point for the recorded data. The level of granularity and
extensive temporal coverage (2 months) within the dataset enable multifaceted
analyses and comprehensive evaluations of user behavior and traffic dynamics across
the various BSs. Since the original data was stated in KBytes, the total network
traffic is determined from the dataset and then multiplied by 8,000 to convert
kilobytes into bits (1 KByte = 1,000 bytes = 8,000 bits). This total volume is the
prediction target.
The dataset provided comprises network traffic measurements recorded every 15
minutes. The aim is to forecast the network traffic at specific time intervals by
utilizing the information from previous time steps, as in [76]: predicting the total
network traffic of BS b at hour t of day d, leveraging historical traffic data.
Let Nb,d,t symbolize the total network traffic at BS b during hour t on day d. The
primary aim is to forecast Nb,d,t by leveraging the network traffic data from the
preceding k time steps. Given that the dataset is sampled every 15 minutes, the
objective is to predict the initial 4 values within an hour (t, t + 1, t + 2, t + 3) based
on the k preceding traffic measurements.
Let {Nb,d,t−i }ki=1 denote the sequence of total network traffic samples at BS b for
the past k time steps. The prediction model aims to estimate N̂b,d,t , the predicted
total network traffic at hour t, based on the previous k observations:

$$\hat{N}_{b,d,t} = f\left(\{N_{b,d,t-i}\}_{i=1}^{k}\right)$$
The specific Mean Absolute Error (MAE) formulation used to quantify this dissim-
ilarity is given by:
$$L_{MAE}(N_{b,d,t}, \hat{N}_{b,d,t}) = \frac{1}{n} \sum_{i=1}^{n} \left| N_{b,d,t+i} - \hat{N}_{b,d,t+i} \right| \qquad (3.3)$$
This equation calculates the average absolute differences between the actual and
predicted values over n prediction steps, providing a measure of the model’s accuracy
in forecasting total network traffic at each time step.
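As a concrete sketch of this formulation (illustrative only: the window length k = 8, the four-step horizon, and the toy traffic series are assumptions, not values from the thesis), windows of k past samples can be paired with the next four 15-minute values, and the MAE of Eq. 3.3 computed with NumPy:

```python
import numpy as np

def make_windows(series, k, horizon=4):
    """Pair each window of k past samples with the next `horizon` values."""
    X, y = [], []
    for t in range(k, len(series) - horizon + 1):
        X.append(series[t - k:t])      # the k preceding traffic measurements
        y.append(series[t:t + horizon])  # the 4 values within the next hour
    return np.array(X), np.array(y)

def mae(y_true, y_pred):
    """Eq. 3.3: average absolute difference over the prediction steps."""
    return np.mean(np.abs(y_true - y_pred))

traffic = np.arange(20.0)                 # toy 15-minute traffic series
X, y = make_windows(traffic, k=8)         # X: (9, 8), y: (9, 4)
naive = np.repeat(X[:, -1:], 4, axis=1)   # persistence baseline prediction
err = mae(y, naive)
```

The persistence baseline simply repeats the last observed value; any trained model would be evaluated with the same `mae` call in its place.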
1 https://pvwatts.nrel.gov/
2 https://www.nrel.gov/
crucial for determining the most suitable size, placement, and configuration of
solar systems. NREL, a government-owned research facility based in Golden, CO
and funded by the US Department of Energy, specializes in renewable energy and
energy efficiency research, boasting over two decades of leadership in the sector.
The PVWatts Calculator was conceived as part of NREL’s collaboration with
the Environmental Protection Agency within the RE-Powering America’s Land
initiative [78].
The dataset plays a pivotal role in our research by offering crucial insights into
estimating solar energy production, with a specific focus on the Turin area. Its
significance lies in its ability to provide detailed and region-specific information that
aids in accurately predicting solar energy output. The data is based on realistic
solar irradiation patterns, representing the Typical Meteorological Year (TMY) in
the area with hourly granularity. A TMY refers to a standard collection of weather
data encompassing hourly values throughout a year at a specific geographical
spot [79]. These datasets are curated from long-term records, usually spanning a
decade or more. The selection process involves picking data for each month from the
year that best represents the typical weather patterns for that specific month. For
example, data for January might originate from 2013, while February’s data could
be sourced from 2020, and so forth. PVWatts utilizes weather information derived
from the NREL National Solar Radiation Database3 (NSRDB) where accessible,
supplemented by data gathered from various other sources to cover regions beyond
its availability.
In particular, the data incorporated into this study originates from the INTL
TORINO-CASELLE weather source, located approximately 8.3 miles from Turin
and marked by a latitude of 45.18° N and a longitude of 7.65° E. The training set
is 580 KB in size and contains 8,760 samples – the equivalent of one year of data.
The specifications used in the PVWatts Calculator for Turin include:
• System size: 1 kW direct current (DC). Therefore, the capacity of the system
is 1 kWp since the kWp (kilowatts-peak) refers to the maximum power output
the system can generate under standard test conditions.
• Module type: standard. Thus, it utilizes crystalline silicon cells.
• Array type: fixed with an open rack design. This setting assumes a static
placement of the PV modules without any tracking mechanisms that adjust
with the sun’s movement throughout the day.
• Estimated system losses: 14.08%, which accounts for performance losses
3 https://nsrdb.nrel.gov/data-sets/tmy
3. Wind Dynamics:
• Wind Speed (m/s)
Additionally, the temporal aspect is represented by a general feature:
• Temporal Feature:
– Month
This categorization provides a comprehensive view, distinguishing between solar
radiation, temperature conditions, wind dynamics, and the temporal influence
of the month, all crucial aspects in understanding the patterns of solar energy
generation in the Turin area. The target variable, “AC System Output (W)", was
the focal point, representing the actual power output generated by the system
under these varying meteorological conditions. This selection aimed to capture and
predict the system’s real-world performance, essential for evaluating the algorithm’s
predictive accuracy in forecasting solar energy production.
Figure 3.3: AC System Output in Watt obtained from the first week of January
of the PV panel production data.
In examining the dataset, Fig. 3.3 was extracted to showcase the actual AC
system outputs obtained from the initial week of January. This plot directly
represents the values retrieved from the dataset, offering a tangible representation
of real-time production estimated with the tool. As can be seen from
In order to get a full understanding of the dataset, another plot was generated
showing the typical days for each month, as reported in Fig. 3.4. Each typical
day was derived by averaging the production values across all days within a
specific month. This representation aimed to elucidate how PV production varies
concerning distinct periods, considering both the duration of production hours
(active hours) and the maximum output during the day. Notably, this depiction
unveiled significant fluctuations: the daily maximum varied from about 200 W
to 500 W, more than doubling between seasons. Such variation is expected given
the substantial difference in solar intensity between winter and summer. Comparing
seasons again, the active hours also decreased by almost half from summer to winter.
This reduction demonstrates the seasonal impact on the output: production occurs
solely during daylight hours.
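The typical-day computation described above can be sketched with pandas as follows (a hedged example: the column name `ac_output_w` and the synthetic output values are assumptions, standing in for the PVWatts data):

```python
import numpy as np
import pandas as pd

# One year of hourly timestamps with synthetic AC output values
idx = pd.date_range("2020-01-01", periods=8760, freq="h")
df = pd.DataFrame(
    {"ac_output_w": np.random.default_rng(0).uniform(0, 500, size=8760)},
    index=idx,
)

# Typical day per month: average the output at each hour of day within
# each month, i.e., one 24-value profile per month
typical = (
    df.groupby([df.index.month, df.index.hour])["ac_output_w"]
      .mean()
      .unstack(level=0)   # rows: hour of day (0-23), columns: month (1-12)
)
```

Each column of `typical` is one monthly "typical day", directly plottable as in Fig. 3.4.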
This formulation adopts a sequential approach, employing past data within the
sliding window to forecast the energy output at the current time t. The objective is
to minimize the discrepancy between the predicted values Ŷ and the actual energy
production values Y over the dataset by defining a loss function L:
This optimization process aims to train the model f (·) to accurately predict the
energy production given historical feature data. The specific loss function could be
the MAE, Mean Squared Error (MSE), etc.
To comprehensively assess the model’s performance and facilitate result comparison,
the MAE is used as the evaluation metric. The MAE calculates the average absolute
differences between the predicted Ŷ and actual Y values over n data points, enabling
a robust estimation of the model’s predictive accuracy. The specific formulation is:
$$L_{MAE}(Y, \hat{Y}) = \frac{1}{n} \sum_{i=1}^{n} \left| Y_i - \hat{Y}_i \right| \qquad (3.6)$$
In the context of forecasting network traffic Nb,t at a specific base station b, the
problem is univariate, focusing on predicting a single variable (traffic volume) over
future time steps based on its past values. This univariate approach simplifies
the prediction task, considering only one target variable without incorporating
relationships with other variables.
On the other hand, forecasting the energy production of a PV panel involves a
multivariate problem. It encompasses predicting the energy output considering
various influencing factors such as irradiance, temperature, wind speed, and others.
This multivariate nature involves analyzing and forecasting the relationship between
multiple input variables (features) and the target variable (energy production),
capturing the complex interplay between these factors to predict the panel’s
performance accurately.
Chapter 4
Methodology
This study aims to monitor the carbon emissions of NNs and evaluate how emissions
trends relate to accuracy variations. A critical first step is to establish a methodology
to reliably calculate the carbon footprint of model development. When quantifying
emissions, several factors must be considered, including the hardware platform,
training hyperparameters, and network architecture.
To monitor NN carbon emissions, this work utilizes CodeCarbon1 – an open-source
tool for calculating energy and emissions from code execution (refer to Section 2.3.3).
CodeCarbon samples the power draw of hardware during training and calculates
total energy based on high-frequency measurements. Standard emissions factors
then convert energy values into carbon equivalents.
Through this methodology, the study aims to identify how training hyperparameters
and architectural design patterns differentially impact accuracy and emissions levels.
A series of controlled experiments will vary the number of epochs, the size of both
training and test sets, the network depth and width, and other factors. Analyzing
their effects can provide insight into optimizing networks’ sustainability without
significantly sacrificing predictive capabilities. Repeated monitoring of emissions
across improved model versions will quantify sustainability gains from various techniques.
The overall goal of this research is to formulate an approach for quantifying carbon
emissions from DL-based solutions and correlating emission trends with accuracy,
so as to maintain high accuracy through careful hyperparameter and design choices.
It particularly focuses on the proper handling of time-series data with diverse
characteristics.
1 https://codecarbon.io/
4.1.1 CodeCarbon
CodeCarbon is a Python package designed to estimate the CO2 emissions
from code execution. It considers the computing resources used, whether on cloud
infrastructure or personal devices, and calculates energy usage by accessing the
RAPL files or by searching a list for the TDP associated with the CPU model [56].
Power usage is also incorporated for machines with NVIDIA GPUs supporting the
NVIDIA System Management Interface. Consumption from the CPU package
domain(s) and any GPU power comprise the total measurement.
To determine emissions, the tool uses the GeoJS API2 to get the user’s location
and corresponding energy mix, from which it calculates a CO2 intensity in kg per
kWh. If the location is unknown, the tool defaults to world averages. For the US,
state-level eGRID3 data directly provides emissions rates. Internationally, the tool
reverse-engineers fuel-specific CO2 formulas from eGRID to apply each country’s
energy mix, which improves consistency and accuracy across electricity grids.
Power supply efficiency is also taken into consideration, in order to reflect the
energy lost as heat during usage. Users can specify the efficiency if known;
otherwise, the tool defaults to the minimum 80% efficiency certified by the
80 Plus program.
The eGRID data provides state-level energy production and carbon emissions for
fuels like coal, oil, and natural gas. CodeCarbon’s goal is to calculate kg of CO2
emitted per M W h for each fuel. It converts emissions values from metric tons to
kg and divides by energy production values to determine emission intensities, using
the calculations:
$$\text{kg CO}_2 = \text{metric tons CO}_2 \times 1{,}000 \qquad (4.1)$$

$$\text{Emissions} = \frac{\text{kg CO}_2}{\text{MWh}} \qquad (4.2)$$
This allows for measuring international grid carbon footprints based on reliable
energy production values.
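Applied to a hypothetical fuel entry (the numbers below are made up for illustration and are not actual eGRID values), the two equations reduce to:

```python
def emission_intensity(metric_tons_co2, mwh_generated):
    """Eq. 4.1-4.2: convert metric tons of CO2 to kg, then divide by MWh."""
    kg_co2 = metric_tons_co2 * 1_000     # Eq. 4.1
    return kg_co2 / mwh_generated        # Eq. 4.2: kg CO2 per MWh

# Hypothetical coal entry: 50,000 t of CO2 from 55,000 MWh generated
intensity = emission_intensity(50_000, 55_000)   # ~909 kg CO2 / MWh
```

Repeating this per fuel and weighting by each country's energy mix yields the grid-level intensity used in the emissions conversion.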
2 https://www.geojs.io/
3 https://www.epa.gov/egrid
4 https://pytorch.org/docs/stable/nn.init.html
5 https://keras.io/api/layers/initializers/
6 https://keras.io/
7 https://pytorch.org/
The whole investigation seeks to decode the intricate interplays of different factors
in order to facilitate the optimization of DL models not just in terms of performance
metrics but also in tandem with carbon emissions reduction.
the model, significantly impacts the network’s ability to capture long-range depen-
dencies within the data. Modifying this parameter affects the amount of historical
information fed into the model, subsequently affecting the model’s capability to
learn complex patterns. The investigation’s focus first centers on adjusting the
input sequence length by initially altering the training set size while keeping the test
set size constant. Subsequently, variations in the test set size have been introduced,
consequently impacting the training set size as well. In the first scenario, the
primary emphasis is on adjusting the amount of historical data available for the
model to learn from. By varying the training set size, the model’s exposure to
historical patterns is manipulated, influencing its ability to comprehend complex
temporal dependencies. Conversely, in the second scenario, modifications are in-
troduced to the test set size, consequently impacting the training set size. Here,
the focus extends beyond solely adjusting the historical data available for training.
It involves altering the partitioning between training and testing datasets, which
affects the model’s understanding of unseen data. This shift can provide insights
into how variations in the testing dataset influence the generalization ability of the
model, consequently influencing the temporal aspects of model training and its
corresponding emissions.
These hyperparameters have been scrutinized, emphasizing their intrinsic
relationship with the duration required for model training and, therefore, with its
emissions. This correlation is explored by varying the hyperparameters across a
spectrum of values. By delving into a broad range of parameter values, the objective
is to extract a clear trend highlighting how changes in epochs and input sequence
length directly influence the temporal aspect of model training and, thus, the
associated carbon emissions.
and abstraction within the temporal context. A larger number of nodes within
a layer can potentially enable the model to detect more nuanced and intricate
patterns within the sequential data. This is due to the increased capacity of the
network to process and extract diverse features at a more granular level, allowing
for finer distinctions and more detailed representations of the underlying data.
This exploration was undertaken to comprehend the trade-offs between model
complexity and computational efficiency in terms of energy consumption.
By designing models that are computationally efficient and require fewer resources,
we can significantly lower carbon emissions. It seems reasonable to analyze layer size
and quantity variations and find optimal configurations that balance performance
and sustainability.
In the context of optimizing for reduced carbon emissions, a critical aspect under
examination involves investigating the impacts of altering the number of layers
and nodes within the LSTM architecture. These hyperparameters were selected
based on their fundamental role in shaping the model’s complexity and capabil-
ity to capture intricate temporal patterns, which directly relate to the energy
consumption of the system. The number of layers determines the depth of the
network, influencing its ability to understand temporal dependencies at multiple
hierarchical levels. Similarly, the quantity of nodes within each layer directly
affects the model’s capacity for local feature extraction and abstraction within the
temporal context. By investigating these specific hyperparameters, the objective is
to discern their individual effects on model performance and corresponding trends
in energy consumption. The analysis is strategically designed to separate the
individual impacts of varying the number of layers and of nodes within the LSTM
architecture. To accurately attribute the influence of each parameter, the
investigation was performed separately for the two hyperparameters: while
exploring the impact of altering the number of layers, the number of nodes within
each layer was kept constant; conversely, when examining the effects of adjusting
the number of nodes, the number of layers remained fixed. This segregation allowed
for a focused evaluation, isolating the direct effect of each hyperparameter on
energy consumption.
Overall, this exploration aims to identify optimal configurations that strike a balance
between model efficiency, performance, and sustainability, ultimately contributing
to a reduction in carbon emissions.
influencing solar energy output, potentially resulting in a more refined and accurate
predictive model. Thus, feature selection not only offers a pathway for emission
reduction but also holds the potential to improve the overall accuracy of the
predictive model.
Specifically, this feature selection process is conducted within the framework of the
three distinct categories into which the features are subdivided: solar radiation,
temperature conditions, and wind dynamics. By isolating and incorporating only
the most pertinent features from these categories, this approach aims to refine the
predictive model, concentrating computational resources on the most influential
factors governing energy production. This strategic selection process ensures that
the predictive model focuses specifically on the key aspects within each category,
contributing to a more nuanced and accurate representation of the intricate interplay
between solar radiation, temperature conditions, and wind dynamics in determining
solar energy generation.
In the context of the PV panel dataset, where the data is notably stable, a second
modification of the problem formulation involves manipulating the sliding window
configuration. This approach serves as an alternative to the initial feature selection
strategy. The objective here is to scrutinize the relationship between accuracy and
emissions by varying the historical data considered by
the model. By adjusting the sliding window, the amount of historical information
provided to the model is modified, potentially influencing its understanding of
temporal patterns.
In both scenarios, the adjustment in the problem formulation aids in simplifying
the model structure, focusing computational resources on the most critical aspects
of the prediction tasks, and ultimately contributing to a more emissions-efficient
modeling approach.
Chapter 5
Experimental results
the available dataset, producing a temporal gap between training and test
data. The testing duration remains unaltered, ensuring a fixed evaluation
period across all training durations, allowing a focused exploration of the
model’s learning behavior and performance stability. All other specifications
align with the settings defined in the baseline.
Other experiments are taken into consideration in the context of modifying the
architectural model design:
• Layers Number Variation: The number of layers undergoes variation across a
spectrum of values [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] in the first experiment. This
exploration aims to discern the impact of different layer counts on model
accuracy and emissions.
• Node Count Variation: The second experiment concentrates on varying the
number of nodes within each layer, spanning values of [16, 32, 64, 128, 256].
Here, a single layer is maintained to delve deeper into the specific influence of
node count variation on the model’s performance and emissions.
In the final experiment, the approach shifts from creating an individual LSTM
model for each base station to consolidating data from multiple stations within
the same zone. This consolidation aims to generalize the problem by designing a
single model for each zone, reducing the model’s specificity. The experimentation
also involves modifying the architecture by reducing the number of layers from 6
to 1 and adjusting the training epochs from 500 to 100. The decision to alter the
architecture by reducing the layer count and adjusting the training epochs stems
from the findings derived from earlier experiments and a careful evaluation of model
performance and emission outcomes. This change in approach permits a broader
analysis by shifting the focus from specific base stations to zones, potentially
allowing for a reduction of computational resources.
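The zone-level consolidation can be sketched as follows; the station names and random series are placeholders, and windowing each station separately before concatenating avoids creating samples that straddle two different stations:

```python
import numpy as np

def make_windows(series, window=24):
    """Windowed inputs and next-step targets for one station's series."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    return X, series[window:]

rng = np.random.default_rng(0)
# Placeholder hourly traffic for two base stations of the same zone.
stations = {"bs_1": rng.random(168), "bs_2": rng.random(168)}

# A single zone-level training set: samples from every station are pooled,
# so one model is trained per zone instead of one per base station.
pairs = [make_windows(s) for s in stations.values()]
X_zone = np.concatenate([X for X, _ in pairs])
y_zone = np.concatenate([y for _, y in pairs])
# X_zone.shape == (288, 24): 144 windows from each of the two stations.
```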
PV Panels Configuration
Regarding the PV panels dataset, the baseline settings are determined through a
grid search methodology. The test set size is 25% of the whole dataset. Given that
the dataset spans 365 days with one sample per hour (8760 samples in total), the
test set comprises 0.25 × 8760 = 2190 samples, with the remaining 6570 samples
making up the training set. A 24-hour sliding window is
established by selecting a sequence length of 24 time steps while considering hourly
measurements. The number of epochs is set to 250. Design-wise, in the LSTM
architecture, the number of layers is fixed at 3, and the number of nodes at 256.
Similarly to the previous dataset, the learning rate for the Adam optimizer remains
constant at 0.001.
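The baseline just described can be captured in a small configuration sketch; the dictionary keys are illustrative names, but the values are those stated above, and the split arithmetic reproduces the 6570/2190 figures:

```python
# Baseline settings for the PV panel dataset, as selected by grid search.
BASELINE = {
    "samples": 365 * 24,     # one year of hourly samples (8760)
    "test_fraction": 0.25,
    "window": 24,            # 24-hour sliding window
    "epochs": 250,
    "lstm_layers": 3,
    "nodes_per_layer": 256,
    "learning_rate": 1e-3,   # Adam optimizer
}

def split_sizes(cfg):
    """Derive train/test sample counts from the total size and test fraction."""
    test = round(cfg["samples"] * cfg["test_fraction"])
    return cfg["samples"] - test, test

train_n, test_n = split_sizes(BASELINE)  # (6570, 2190)
```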
During the experimental phase, these baseline settings are varied to understand
their impact. The following tests are conducted in order to modify the training
hyperparameters:
• Testing Size Variation: The test set size is varied as a percentage of the total dataset,
exploring [5%, 10%, 20%, 25%, 30%, 40%]. As the test set size changes, the
training set size is derived as the difference between the total dataset and the
test set size.
• Training Size Variation: In the third experiment, while maintaining a fixed test
set size at 25%, the training set size is explored across [25%, 35%, 45%, 55%,
65%, 75%] to understand its influence on model performance and emissions.
Furthermore, a number of trials are carried out to investigate the complex aspects
of architectural design using LSTM models, within the framework of the PV panels
dataset:
• Node Count Variation: The number of nodes is varied across the range [32,
64, 128, 256, 512, 1024], focusing on a single LSTM layer for this experiment.
This experiment concentrates on exploring the effect of node density within a
single layer.
The investigation into the impact of historical data is executed by manipulating the
sliding window configuration across a range of values: [2, 8, 16, 24, 48, 72, 96]. This
deliberate variation in the sliding window alters the temporal context provided to
the model, influencing the depth of historical data considered during predictions.
Each value within this range corresponds to a distinct temporal span, allowing
for a comprehensive evaluation of how different amounts of historical information
affect both the model’s predictive accuracy and its associated carbon emissions.
5.2 Results
5.2.1 Different Environments
The initial analysis focuses on the experimental environment by comparing the
two primary Python libraries for developing neural networks. This comparative
study is exclusively conducted for the Business zone of the network traffic data,
operating under the assumption that consistent outputs are achievable regardless
of the specific dataset employed. Within this scope, the implementation of the
LSTM network is done using both Keras and PyTorch frameworks and following
the baseline settings. To provide a comprehensive perspective, Table 5.2 offers
detailed insights into emissions and durations during both the training and testing
phases. This focused examination within a singular zone aims to establish a robust
comparison between Keras and PyTorch in terms of their impact on emissions
when implementing the LSTM network. From the results, it is evident that the
training phase is notably longer, leading to higher emissions when employing the
PyTorch library compared to Keras. Conversely, during the testing phase, the
trend reverses. Since the model is evaluated at each epoch, the divergence in
training times between the two libraries might stem from variations in metric
computation, which is integrated into the training process itself. This difference
in metric implementation could result in significantly different training durations.
On the other hand, during the testing phase, where only prediction occurs, the
significant differences in timing could be attributed to the standard functions used
by each library to execute predictions. Consequently, for all subsequent analyses,
Keras is consistently utilized. The focus remains on emissions during the training
phase, where this library demonstrates significantly lower energy consumption.
Table 5.2: Comparative analysis of accuracy and emissions during training and
testing phases using Keras and PyTorch libraries, focusing on the Business zone
within the network traffic dataset.
Specifically, for the network data, each experiment’s result is calculated across
all the zones studied, as shown in Table 5.3. This detailed examination under
the baseline settings provides a holistic view of the model’s performance and the
environmental impact. The initial observation reveals a direct correlation between
duration and the associated emissions. An examination of the outcomes concerning
duration and emissions during the training and testing phases highlights their
parallel growth. As the duration extends, a synchronous increase in emissions
is notable: this is emphasized by the fact that the training phase’s duration is
approximately ten times longer than the testing phase. This discrepancy in time
directly influences emissions, with the prolonged training duration significantly
contributing to a noticeable rise in emitted carbon. Furthermore, the evaluation of
accuracy, computed for both the training and testing phases, indicates a consistent
trend. The observed accuracy values hover around 0.3 for all zones in both the
training and testing phases. This suggests a balanced model performance without
significant signs of overfitting or underfitting across the examined zones.
Figure 5.1: Graph depicting the comparative analysis between predicted and
actual PV panel system outputs, showcasing the model’s performance in forecasting
photovoltaic energy generation against the ground truth measurements.
The outcomes for the PV panel dataset are reported in Table 5.4,
presenting the metrics acquired during the baseline configuration experiments.
Additionally, to provide a visual understanding of the prediction accuracy, Fig. 5.1
displays the actual and predicted values. This representation visually demonstrates
how closely the predicted trend aligns with the actual data. Furthermore, a
Table 5.3: Accuracy, emissions, and duration during both training and testing
phases across all zones under the baseline setup, concerning the network traffic
data.
zoomed-in figure specifically focuses on the initial two days of the dataset (Fig. 5.2),
offering a more precise view of the predictions’ alignment with the true trend.
This remarkable precision in prediction is attributable to the dataset's stability.
Table 5.4: Accuracy, emissions, and duration during both training and testing
phases, considering the PV panel data and the baseline settings.
Figure 5.2: Close-up view detailing the initial two days’ predictions and actual
values of the PV panels system output, offering a more detailed insight into the
model’s performance within this specific timeframe.
Figure 5.3: (a) Accuracy and emissions variation with epochs for the Train
Station zone in the network traffic dataset. (b) Accuracy and emissions variation
with epochs for the PV panel dataset.
Fig. 5.3a encapsulates the findings for the network traffic data, focusing solely on the
Train Station zone for the sake of simplicity, while Fig. 5.3b encapsulates the results
for the PV panel dataset. This visual comparison serves as a foundational overview,
setting the stage for a deeper exploration of the observed distinctions between the
two datasets. In both datasets, emissions demonstrate a linear increase with the
rising number of epochs. However, the behavior of accuracy varies notably. In the
PV panel dataset, accuracy remains almost constant, whereas in the network traffic
data, it consistently improves, although with occasional oscillations. Consequently,
establishing a balance between accuracy and emissions proves challenging in the
latter case. On the other hand, a significant finding is revealed in the PV panel
dataset. The difference in accuracy when varying the number of epochs is considerably
smaller than the difference in emissions. This suggests that extended model
training leads to higher emissions without enhancing performance. This implies
that reducing the number of epochs could result in substantial emission reduction
with only a slight compromise in accuracy.
In summary, while emissions show a linear growth with increasing epochs in
both cases, accuracy exhibits diverse patterns. The network traffic data lacks a
discernible balance between accuracy and emissions, whereas for the PV panel
dataset, reducing epochs could significantly cut emissions with minimal impact on
accuracy.
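A simple way to operationalize this observation is to pick the cheapest run whose accuracy stays within a tolerance of the best one; the numbers below are illustrative, not the measured results:

```python
# (epochs, accuracy, emissions) — illustrative values shaped like the ablation.
runs = [(50, 0.290, 1.0), (100, 0.300, 2.1), (250, 0.302, 5.2), (500, 0.303, 10.5)]

def best_tradeoff(runs, tolerance=0.005):
    """Lowest-emission run whose accuracy is within `tolerance` of the best."""
    best_acc = max(acc for _, acc, _ in runs)
    eligible = [r for r in runs if best_acc - r[1] <= tolerance]
    return min(eligible, key=lambda r: r[2])

chosen = best_tradeoff(runs)
# chosen == (100, 0.300, 2.1): a ~5x emission saving for a 0.003 accuracy gap.
```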
After comparing the results obtained for the two distinct datasets by varying the
number of epochs and analyzing the relationship between accuracy and emissions,
a different type of visualization is proposed in order to visually find different model
behavioral patterns with respect to the number of epochs. In this alternative
visualization, depicted in Fig. 5.4, emissions are plotted on the y-axis, while
accuracy is plotted on the x-axis. Each specific number of total epochs tested
during experiments is represented by plotting the results obtained for each base
station. Two different figures are created for this purpose: Fig. 5.4a and Fig. 5.4b.
Fig. 5.4a employs color differentiation according to zone names, directing attention
towards scrutinizing the enhancement within each specific zone rather than the
dataset as a whole. On the other hand, in Fig. 5.4b the points are colored based on
the number of epochs, allowing for a clear observation that even though accuracy
continues to increase, the absolute improvement is significantly smaller compared
to the growth in emissions. Hence, even when considering this dataset, it appears
plausible to use a reduced number of epochs compared to the one used by the
baseline method, i.e., 500 epochs. Moreover, by combining the insights coming from
both visualizations, deeper considerations can be made. Firstly, the experiments
using a lower number of epochs, i.e., 50 epochs, reveal a higher variance in accuracy
between the zones, while for a total number of epochs equal to or greater than 100,
the variance noticeably decreases.
Figure 5.4: Comparison of accuracy and emissions variations for different epoch
aggregations.
Figure 5.5: (a) Accuracy and emissions variation with the number of testing days
for the Train Station zone in the network traffic dataset. (b) Accuracy and
emissions variation with test size for the PV panel dataset.
In the subsequent investigation, the focus shifts to the impact of varying training
set sizes while maintaining a fixed test set size. These results are reported in
Fig. 5.6a and Fig. 5.6b. The findings from both datasets reveal a consistent trend
where both accuracy and emissions exhibit an upward trajectory as the training size
increases, aligning with the initial hypothesis. This reaffirms and accentuates the
results observed in the previous test size ablation, shedding light on the heightened
significance of this relationship. The increased emphasis may be attributed to the
creation of a temporal gap between the training and test phases when reducing the
training size. Indeed, the temporal link within the data emerges as a pivotal aspect,
especially in the context of time series analysis. In the realm of time series, the
chronological order of data points is fundamental for capturing patterns and trends
inherent in temporal sequences. When varying the training size and introducing a
potential temporal gap between the training and test phases, the intricate temporal
connections within the data may be compromised. The essence of time series
analysis lies in comprehending how past events influence future occurrences. In
the context of machine learning models, maintaining a robust temporal link during
training becomes imperative to enable the model to glean meaningful insights
from historical data and generalize effectively to unseen future instances. The
observed increase in both accuracy and emissions with an expanding training size
underscores the significance of preserving this temporal continuity, highlighting the
nuanced relationship between the temporal structure of data and the performance
of machine learning models, particularly in time series scenarios.
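The temporal-continuity requirement discussed above translates into a strictly chronological split, sketched below (function name illustrative): the training segment ends exactly where the test segment begins, so no gap disrupts the sequence.

```python
def chronological_split(series, train_fraction=0.75):
    """Split a time series without shuffling: the training segment
    immediately precedes the test segment, leaving no temporal gap."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

train, test = chronological_split(list(range(100)))
# train covers indices 0..74, test covers 75..99 — contiguous in time.
```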
Figure 5.6: (a) Accuracy and emissions variation with the number of training days
for the Train Station zone in the network traffic dataset. (b) Accuracy and
emissions variation with train size for the PV panel dataset.
This temporal gap potentially disrupts the coherent evolution of the model, accen-
tuating the importance of maintaining both a sufficient training size and a temporal
continuity of the data used during these critical phases.
Figure 5.7: (a) Accuracy and emissions variation with the number of layers for
the Train Station zone in the network traffic dataset. (b) Accuracy and emissions
variation with the number of layers for the PV panel dataset.
Figure 5.8: (a) Accuracy and emissions variation with the number of nodes for
the Train Station zone in the network traffic dataset. (b) Accuracy and emissions
variation with the number of nodes for the PV panel dataset.
Conversely, in the case of the PV panel dataset, selecting the initial value that
corresponds to the onset of the plateau proves advantageous, steering towards a
configuration where additional layers do not significantly contribute to accuracy
but may contribute to emissions.
Continuing with the analysis of architectural model design, another explored aspect
is varying the number of nodes in a single-layer LSTM model for both datasets.
Visualizing the outcomes through Fig. 5.8 reveals interesting trends. Across both
datasets, increasing the number of nodes within the LSTM model initially boosts
accuracy, showing a positive correlation. However, this improvement in accuracy
plateaus after a certain amount of LSTM units, while the environmental cost –
measured in terms of emissions – continues to rise steadily. This illustrates a trade-
off in which additional accuracy improvements come with a disproportionately
larger environmental cost. The marginal improvements in accuracy beyond a certain
threshold do not justify the steep increase in emissions. Hence, pursuing
maximum accuracy regardless of the accompanying rise in emissions is not justified:
when the increase in emissions does not correspond to substantial accuracy improvements,
it is unreasonable to sacrifice environmental impact for marginal gains in accuracy.
Table 5.5 reports crucial performance indicators such as accuracy, emissions, and
duration for both the training and testing phases. To facilitate a more nuanced
understanding of the gains achieved with this modified configuration, any reduction
in terms of emissions is reported in green within brackets, highlighting positive
advancements. Conversely, instances where emissions increase are denoted in red
within brackets. Moreover, the color scheme is also employed
to highlight gains and losses in the case of accuracy, but clearly with opposite
meanings: the color green signifies an increase, while the color red denotes a
decrease. This color-coded approach aims to accentuate the impact of improvements,
particularly in terms of emissions, providing a visual cue for the reader to discern the
magnitude and direction of the changes in the two considered metrics. The presented
results unequivocally demonstrate the effectiveness of the modified configuration
in substantially lowering emissions across all zones, thanks to the reduction in
both epochs and layers. While there is a slight dip in accuracy, it is crucial
to emphasize that this decline is relatively minor in magnitude. This nuanced
trade-off underscores the success of the streamlined configuration in prioritizing
environmental sustainability without compromising accuracy to a significant extent.
After showcasing the favorable outcomes resulting from adjustments in the num-
ber of epochs and layers, two histograms presented in Fig. 5.9 encapsulate the
comparison between aggregated and non-aggregated values, showcasing the stark
differences in accuracy and emissions. While accuracy, shown in Fig. 5.9a remains
consistently comparable across all zones, emissions undergo a drastic reduction,
exceeding a sixfold decrease in the aggregated approach, underscoring the efficacy
of this tailored aggregation strategy, as depicted in Fig. 5.9b. The effectiveness is
derived not only from maintaining accuracy but also from substantial mitigation of
the environmental impact achieved through the significant reduction of emissions.
Table 5.5: Accuracy, emissions, and duration during both training and testing
phases across all zones under the modified setup, concerning the network traffic
data. Discrepancies from the original baseline are emphasized in green if they
represent improvements and in red if they indicate deterioration.
Figure 5.9: (a) Accuracy comparison between aggregated and separated base station
configurations. (b) Emissions comparison between aggregated and separated base
station configurations.
Table 5.6: Accuracy, emissions, and duration during both training and testing
phases, considering the modified settings with respect to the baseline of PV panel
data. Discrepancies from the original baseline are emphasized in green if they
represent improvements and in red if they indicate deterioration.
The results obtained from feature selection provide valuable insights into the trade-
offs between model performance and environmental impact. Notably, when wind
dynamics-related features are excluded, the model demonstrates even superior
accuracy than that obtained with the baseline. This outcome aligns with the
concept that reducing the number of features allows the model to focus on the
most relevant inputs.
Table 5.7: Comparison between values of training emissions and test accuracy of
the feature selection strategy after adopting the modified number of epochs (50).
Each experiment involves the removal of one specific category of features – solar
radiation, temperature conditions, or wind dynamics.
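The per-category removal used in the feature selection experiments can be sketched as follows; the column names and category groupings are illustrative, not the dataset's exact schema:

```python
# Illustrative grouping of PV input features into the three studied categories.
CATEGORIES = {
    "solar_radiation": ["beam_irradiance", "diffuse_irradiance"],
    "temperature": ["ambient_temp", "cell_temp"],
    "wind": ["wind_speed"],
}

def drop_category(columns, category):
    """Remove one category's columns, mirroring the ablation experiments."""
    excluded = set(CATEGORIES[category])
    return [c for c in columns if c not in excluded]

all_columns = [c for cols in CATEGORIES.values() for c in cols]
no_wind = drop_category(all_columns, "wind")  # train without wind features
```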
Figure 5.10: Accuracy and emissions variation with sliding window configuration
for the PV panel dataset.

The overarching insight gleaned from the analysis is that identifying and leveraging
specific patterns associated with distinct time intervals can provide a means to
optimize accuracy while minimizing emissions. By recognizing the periodicity of
certain production trends and their correlation with time intervals, it becomes
possible to pinpoint the minimum duration necessary to attain nearly maximum
accuracy. This strategic alignment of temporal patterns and emission considerations
opens avenues for designing more resource-efficient models without compromising
predictive performance.
Chapter 6
Conclusions and Future Works
Moreover, the impact of architectural model design is also considered, revealing that
adjustments can significantly affect a model’s carbon emissions production. The
complexity of a model emerges as a crucial factor, influencing learning time, power
requirements, and ultimately, emissions. Additionally, architectural choices in model
design extend their influence beyond emissions during the training phase, paralleling
the significance attributed to training hyperparameters. Their impact is equally
pronounced in the subsequent inference phase, thus imparting a long-term effect on
emissions over an extended operational lifecycle. This comprehensive consideration
of emissions, encompassing both training and inference, gains paramount importance
in real-world scenarios where models are intended for continuous and prolonged
deployment. The examination of emissions across these phases assumes a pivotal
role in understanding and optimizing the environmental footprint of these models,
ensuring sustained efficiency and eco-friendly operations throughout their practical
deployment lifespan.
In future investigations, expanding the scope of this research entails the inclusion
of a broader spectrum of models and datasets. An interesting avenue involves
replicating this exploration using hardware configurations equipped with GPUs or
employing computer clusters. This expansion seeks to ascertain the generalizability
of the findings across diverse hardware scenarios. Assessing whether similar trends
in emissions prevail under varied computational setups will offer invaluable insights
into the reproducibility and applicability of the identified emission patterns.
Another fascinating possibility for future exploration is to observe how a model
affects the environment when it is used in the real world. This pragmatic approach
aims to bridge the gap between theoretical analyses and practical implications,
providing researchers and industries with tangible insights into the actual envi-
ronmental ramifications of employing ML models. Understanding the real-time
impact of model deployment offers an invaluable opportunity to gauge the ecological
footprint of ML systems in practice.
Bibliography
[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http:
//www.deeplearningbook.org. MIT Press, 2016 (cit. on pp. 1–3, 14, 16).
[2] Jingjing Xu, Wangchunshu Zhou, Zhiyi Fu, Hao Zhou, and Lei Li. A Survey
on Green Deep Learning. 2021. arXiv: 2111.05193 [cs.LG] (cit. on pp. 1, 4).
[3] Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola. «Dive into
Deep Learning». In: CoRR abs/2106.11342 (2021). arXiv: 2106.11342. url:
https://arxiv.org/abs/2106.11342 (cit. on pp. 1–4, 18).
[4] C. -C. Jay Kuo and Azad M. Madni. Green Learning: Introduction, Examples
and Outlook. 2022. arXiv: 2210.00965 [cs.LG] (cit. on pp. 1, 16).
[5] Emma Strubell, Ananya Ganesh, and Andrew McCallum. «Energy and Policy
Considerations for Deep Learning in NLP». In: Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics. Florence, Italy:
Association for Computational Linguistics, July 2019, pp. 3645–3650. doi:
10 . 18653 / v1 / P19 - 1355. url: https : / / aclanthology . org / P19 - 1355
(cit. on pp. 1, 10, 31).
[6] A. L. Samuel. «Some Studies in Machine Learning Using the Game of Check-
ers». In: IBM Journal of Research and Development 3.3 (1959), pp. 210–229.
doi: 10.1147/rd.33.0210 (cit. on p. 1).
[7] Tom M. Mitchell. The Need for Biases in Learning Generalizations. Tech. rep.
New Brunswick, NJ: Rutgers University, 1980. url: https://www.cs.cmu.
edu/~tom/pubs/NeedForBias_1980.pdf (cit. on p. 2).
[8] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of
Statistical Learning. Data Mining, Inference, and Prediction, Second Edition.
2nd ed. Springer Series in Statistics. Springer Science+Business Media, LLC,
part of Springer Nature. New York, NY: Springer, 2009, pp. XXII, 745.
isbn: 978-0-387-84857-0. doi: 10.1007/978-0-387-84858-7. url: https:
//doi.org/10.1007/978-0-387-84858-7 (cit. on p. 2).
82
BIBLIOGRAPHY
[9] Aurlien Gron. Hands-On Machine Learning with Scikit-Learn and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems. 1st. O’Reilly
Media, Inc., 2017. isbn: 1491962291 (cit. on pp. 2, 3).
[10] Qiong Liu and Ying Wu. «Supervised Learning». In: (Jan. 2012). doi: 10.
1007/978-1-4419-1428-6_451 (cit. on p. 2).
[11] Simon J.D. Prince. Understanding Deep Learning. MIT Press, 2023. url:
http://udlbook.com (cit. on pp. 3, 15, 17).
[12] H.B. Barlow. «Unsupervised Learning». In: Neural Computation 1.3 (Sept.
1989), pp. 295–311. issn: 0899-7667. doi: 10.1162/neco.1989.1.3.295.
eprint: https://direct.mit.edu/neco/article-pdf/1/3/295/811863/
neco.1989.1.3.295.pdf. url: https://doi.org/10.1162/neco.1989.1.
3.295 (cit. on p. 3).
[13] Yuxi Li. Deep Reinforcement Learning: An Overview. 2018. arXiv: 1701.07274
[cs.LG] (cit. on p. 3).
[14] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. «Deep learning». In:
Nature 521.7553 (May 2015), pp. 436–444. issn: 1476-4687. doi: 10.1038/
nature14539. url: https://doi.org/10.1038/nature14539 (cit. on pp. 4,
14, 16).
[15] Michael A. Nielsen. Neural Networks and Deep Learning. misc. 2018. url:
http://neuralnetworksanddeeplearning.com/ (cit. on pp. 4, 19).
[16] Andrew J. Lohn and Micah Musser. «AI and Compute: How Much Longer
Can Computing Power Drive Artificial Intelligence Progress?» In: Center for
Security and Emerging Technology (Jan. 2022). doi: 10.51593/2021CA009
(cit. on pp. 4, 30).
[17] Eva Garcia-Martin, Crefeda Rodrigues, Graham Riley, and Håkan Grahn.
«Estimation of energy consumption in machine learning». In: Journal of
Parallel and Distributed Computing 134 (Aug. 2019). doi: 10.1016/j.jpdc.
2019.07.007 (cit. on pp. 4, 5, 10, 29).
[18] Anne-Laure Ligozat, Julien Lefevre, Aurélie Bugeau, and Jacques Combaz.
«Unraveling the Hidden Environmental Impacts of AI Solutions for Environ-
ment Life Cycle Assessment of AI Solutions». In: Sustainability 14.9 (2022).
issn: 2071-1050. doi: 10.3390/su14095172. url: https://www.mdpi.com/
2071-1050/14/9/5172 (cit. on p. 4).
83
BIBLIOGRAPHY
[19] Stefanos Georgiou, Maria Kechagia, Tushar Sharma, Federica Sarro, and
Ying Zou. «Green AI: Do Deep Learning Frameworks Have Different Costs?»
In: Proceedings of the 44th International Conference on Software Engineering.
ICSE ’22. Pittsburgh, Pennsylvania: Association for Computing Machinery,
2022, pp. 1082–1094. isbn: 9781450392211. doi: 10.1145/3510003.3510221.
url: https://doi.org/10.1145/3510003.3510221 (cit. on pp. 4, 10).
[20] Alexander E.I Brownlee, Jason Adair, Saemundur O. Haraldsson, and John
Jabbo. «Exploring the Accuracy – Energy Trade-off in Machine Learning».
In: 2021 IEEE/ACM International Workshop on Genetic Improvement (GI).
2021, pp. 11–18. doi: 10.1109/GI52543.2021.00011 (cit. on pp. 4, 10, 29,
30).
[21] Charles AR Hoare. «Quicksort». In: The Computer Journal 5.1 (1962), pp. 10–
16 (cit. on p. 4).
[22] Danny Hernandez and Tom B. Brown. Measuring the Algorithmic Efficiency
of Neural Networks. 2020. arXiv: 2005.04305 [cs.LG] (cit. on p. 5).
[23] Crefeda Rodrigues, Graham Riley, and Mikel Luján. «SyNERGY: An energy
measurement and prediction framework for Convolutional Neural Networks
on Jetson TX1». In: Oct. 2018 (cit. on p. 5).
[24] Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. Esti-
mating the Carbon Footprint of BLOOM, a 176B Parameter Language Model.
2022. arXiv: 2211.02001 [cs.LG] (cit. on p. 5).
[25] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and Policy
Considerations for Deep Learning in NLP. 2019. arXiv: 1906.02243 [cs.CL]
(cit. on pp. 5, 32).
[26] David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel
Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon
Emissions and Large Neural Network Training. 2021. arXiv: 2104.10350
[cs.LG] (cit. on pp. 5, 6, 46).
[27] T. Bruckner et al. «Chapter 7 - Energy systems». In: Climate Change 2014:
Mitigation of Climate Change. IPCC Working Group III Contribution to AR5.
Cambridge University Press, Nov. 2014. url: http://www.ipcc.ch/pdf/
assessment-report/ar5/wg3/ipcc%5C%5fwg3%5C%5far5%5C%5fchapter7.
pdf (cit. on pp. 5, 13, 30).
[28] I.C. Change et al. Mitigation of Climate Change. Contribution of Working
Group III to the Fifth Assessment Report of the Intergovernmental Panel on
Climate Change. 2014, pp. 1454–147 (cit. on p. 5).
84
BIBLIOGRAPHY
[29] Donald E. Knuth. The Art of Computer Programming, Volume 1 (3rd Ed.):
Fundamental Algorithms. USA: Addison Wesley Longman Publishing Co.,
Inc., 1997. isbn: 0201896834 (cit. on p. 5).
[30] Aimee Wynsberghe. «Sustainable AI: AI for sustainability and the sustain-
ability of AI». In: AI and Ethics 1 (Feb. 2021). doi: 10.1007/s43681-021-
00043-6 (cit. on p. 6).
[31] Raphael Fischer, Matthias Jakobs, Sascha Mücke, and Katharina Morik. «A
Unified Framework for Assessing Energy Efficiency of Machine Learning». In:
Jan. 2023, pp. 39–54. isbn: 978-3-031-23617-4. doi: 10.1007/978-3-031-
23618-1_3 (cit. on p. 6).
[32] Hannah Ritchie and Pablo Rosado. «Energy Mix». In: Our World in Data
(2020). https://ourworldindata.org/energy-mix (cit. on pp. 6, 7, 22).
[33] Raphael Fischer, Matthias Jakobs, and Katharina Morik. Energy Efficiency
Considerations for Popular AI Benchmarks. 2023. arXiv: 2304.08359 [cs.LG]
(cit. on pp. 7, 30).
[34] Hannah Ritchie, Max Roser, and Pablo Rosado. «CO2 and Greenhouse Gas
Emissions». In: Our World in Data (2020). https://ourworldindata.org/co2-
and-greenhouse-gas-emissions (cit. on pp. 11, 12).
[35] Dieter Lüthi et al. «High-resolution carbon dioxide concentration record
650,000–800,000years before present». In: Nature 453.7193 (May 2008), pp. 379–
382. issn: 1476-4687. doi: 10.1038/nature06949. url: https://doi.org/
10.1038/nature06949 (cit. on p. 11).
[36] M.R. Allen et al. «Framing and Context». In: Global Warming of 1.5°C.
An IPCC Special Report on the impacts of global warming of 1.5°C above
pre-industrial levels and related global greenhouse gas emission pathways,
in the context of strengthening the global response to the threat of climate
change, sustainable development, and efforts to eradicate poverty. Ed. by V.
Masson-Delmotte et al. Cambridge, UK and New York, NY, USA: Cambridge
University Press, 2018, pp. 49–92. doi: 10.1017/9781009157940.003 (cit. on
p. 11).
[37] UNFCCC. UNFCCC: Report on the Structured Expert Dialogue (SED) on
the 2013–2015 review. Report FCCC/SB/2015/INF.1. 2015 (cit. on p. 11).
[38] Intergovernmental Panel on Climate Change (IPCC). «Summary for Policy-
makers». In: Climate Change 2013: The Physical Science Basis. Contribution
of Working Group I to the Fifth Assessment Report of the Intergovernmen-
tal Panel on Climate Change. Cambridge, United Kingdom and New York,
NY, USA: Cambridge University Press, 2013. Chap. SPM, pp. 1–30. doi:
10.1017/CBO9781107415324.004 (cit. on pp. 11, 46).
85
BIBLIOGRAPHY
[39] NOAA National Centers for Environmental Information (NCEI). U.S. Billion-
Dollar Weather and Climate Disasters. https : / / www . ncei . noaa . gov /
access/billions/. 2023. doi: 10.25921/stkw-7w73 (cit. on p. 12).
[40] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach.
4th. Pearson, 2021 (cit. on p. 14).
[41] M. I. Jordan and T. M. Mitchell. «Machine learning: Trends, perspectives,
and prospects». In: Science 349.6245 (2015), pp. 255–260. doi: 10.1126/
science.aaa8415. eprint: https://www.science.org/doi/pdf/10.1126/
science.aaa8415. url: https://www.science.org/doi/abs/10.1126/
science.aaa8415 (cit. on p. 14).
[42] Christopher M. Bishop. Pattern Recognition and Machine Learning. Infor-
mation Science and Statistics. Springer-Verlag New York, Inc., 2006 (cit. on
p. 14).
[43] Avrim Blum, John Hopcroft, and Ravi Kannan. Foundations of Data Science.
Cambridge: Cambridge University Press, 2020. isbn: 9781108485067. url:
https://www.cambridge.org/core/books/foundations-of-data-science/6A43CE830DE83BED6CC5171E62B0AA9E (cit. on p. 14).
[44] Christian Janiesch, Patrick Zschech, and Kai Heinrich. «Machine learning and
deep learning». In: Electronic Markets 31.3 (Sept. 2021), pp. 685–695. doi:
10.1007/s12525-021-00475-2. url: https://doi.org/10.1007/s12525-
021-00475-2 (cit. on p. 14).
[45] Iqbal H. Sarker. «Deep Learning: A Comprehensive Overview on Techniques,
Taxonomy, Applications and Research Directions». In: SN Computer Science
2.6 (Aug. 2021), p. 420. issn: 2661-8907. doi: 10.1007/s42979-021-00815-1.
url: https://doi.org/10.1007/s42979-021-00815-1 (cit. on p. 16).
[46] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. «Gradient-based learning
applied to document recognition». In: Proceedings of the IEEE 86.11 (1998),
pp. 2278–2324. doi: 10.1109/5.726791 (cit. on p. 16).
[47] Sepp Hochreiter and Jürgen Schmidhuber. «Long Short-Term Memory».
In: Neural Computation 9.8 (Nov. 1997), pp. 1735–1780. issn: 0899-7667.
doi: 10.1162/neco.1997.9.8.1735. eprint: https://direct.mit.edu/neco/article-pdf/9/8/1735/813796/neco.1997.9.8.1735.pdf. url:
https://doi.org/10.1162/neco.1997.9.8.1735 (cit. on p. 19).
[48] Michel Dubois, Murali Annavaram, and Per Stenström. Parallel Computer
Organization and Design. Cambridge University Press, 2012 (cit. on p. 21).
[49] Neil Weste, David Harris, and A. Banerjee. CMOS VLSI Design: A Circuits
and Systems Perspective. 2005 (cit. on p. 21).
[50] Mark Horowitz. «1.1 Computing’s Energy Problem (and What We Can Do
About It)». In: 2014 IEEE International Solid-State Circuits Conference
Digest of Technical Papers (ISSCC). IEEE. 2014, pp. 10–14 (cit. on p. 22).
[51] Monoj Kumar Mondal, Hemant Kumar Balsora, and Prachi Varshney. «Progress
and trends in CO2 capture/separation technologies: A review». In: Energy 46.1
(2012). Energy and Exergy Modelling of Advance Energy Systems, pp. 431–
441. issn: 0360-5442. doi: 10.1016/j.energy.2012.08.006. url:
https://www.sciencedirect.com/science/article/pii/S0360544212006184 (cit. on p. 22).
[52] Raghavendra Selvan, Nikhil Bhagwat, Lasse F. Wolff Anthony, Benjamin
Kanding, and Erik B. Dam. «Carbon Footprint of Selecting and Training Deep
Learning Models for Medical Image Analysis». In: Lecture Notes in Computer
Science. Springer Nature Switzerland, 2022, pp. 506–516. doi: 10.1007/978-3-031-16443-9_49. url: https://doi.org/10.1007/978-3-031-16443-9_49 (cit. on pp. 24, 31).
[53] Anne-Laure Ligozat and Sasha Luccioni. A Practical Guide to Quantify-
ing Carbon Emissions for Machine Learning Researchers and Practitioners.
Research Report hal-03376391. MILA; LISN, 2021. url:
https://hal.science/hal-03376391/document (cit. on p. 24).
[54] Loïc Lannelongue, Jason Grealey, and Michael Inouye. Green Algorithms:
Quantifying the carbon footprint of computation. 2020. arXiv: 2007.07610
[cs.CY] (cit. on pp. 24, 25).
[55] Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres.
Quantifying the Carbon Emissions of Machine Learning. 2019. arXiv: 1910.09700 [cs.CY] (cit. on pp. 25, 26).
[56] Kadan Lottick, Silvia Susai, Sorelle A. Friedler, and Jonathan P. Wilson.
Energy Usage Reports: Environmental awareness as part of algorithmic ac-
countability. 2019. arXiv: 1911.08354 [cs.LG] (cit. on pp. 26, 46).
[57] Lasse F. Wolff Anthony, Benjamin Kanding, and Raghavendra Selvan. Car-
bontracker: Tracking and Predicting the Carbon Footprint of Training Deep
Learning Models. 2020. arXiv: 2007.03051 [cs.CY] (cit. on p. 26).
[58] S.A. Budennyy et al. «eco2AI: Carbon Emissions Tracking of Machine Learning
Models as the First Step Towards Sustainable AI». In: Doklady Mathematics.
Springer, 2023, pp. 1–11 (cit. on p. 26).
[59] Peter Henderson, Jieru Hu, Joshua Romoff, Emma Brunskill, Dan Jurafsky,
and Joelle Pineau. Towards the Systematic Reporting of the Energy and
Carbon Footprints of Machine Learning. 2022. arXiv: 2002.05651 [cs.CY]
(cit. on pp. 26, 27).
[70] Friederike Rohde, Maike Gossen, Josephin Wagner, and Tilman Santarius.
«Sustainability challenges of Artificial Intelligence and Policy Implications».
In: Ökologisches Wirtschaften - Fachzeitschrift 36.O1 (Feb. 2021), pp. 36–40.
doi: 10.14512/OEWO360136. url: https://oekologisches-wirtschaften.
de/index.php/oew/article/view/1792 (cit. on p. 32).
[71] S. Han, H. Mao, and W.J. Dally. Deep Compression: Compressing Deep
Neural Networks with Pruning, Trained Quantization and Huffman Coding.
2015. arXiv: 1510.00149 (cit. on p. 32).
[72] S. Han, J. Pool, J. Tran, and W. Dally. «Learning both weights and con-
nections for efficient neural network». In: Advances in Neural Information
Processing Systems. 2015, pp. 1135–1143 (cit. on p. 32).
[73] Yu-Hsin Chen, Tushar Krishna, Joel S. Emer, and Vivienne Sze. «Eyeriss: An
Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural
Networks». In: IEEE Journal of Solid-State Circuits 52.1 (2017), pp. 127–138.
doi: 10.1109/JSSC.2016.2616357 (cit. on p. 33).
[74] Ermao Cai, Da-Cheng Juan, Dimitrios Stamoulis, and Diana Marculescu. Neu-
ralPower: Predict and Deploy Energy-Efficient Convolutional Neural Networks.
2017. arXiv: 1710.05420 [cs.LG] (cit. on p. 33).
[75] B.D. Rouhani, A. Mirhoseini, and F. Koushanfar. «Delight: Adding energy
dimension to deep neural networks». In: Proceedings of the 2016 International
Symposium on Low Power Electronics and Design. ACM. 2016, pp. 112–117
(cit. on p. 33).
[76] Greta Vallero, Daniela Renga, Michela Meo, and Marco Ajmone Marsan.
«Greener RAN Operation Through Machine Learning». In: IEEE Transactions
on Network and Service Management 16.3 (2019), pp. 896–908. doi: 10.1109/
TNSM.2019.2923881 (cit. on pp. 35, 37, 56).
[77] ITU-R. Framework for the radio interface(s) and radio sub-system functionality
for International Mobile Telecommunications-2000 (IMT-2000) (Question
ITU-R 39/8). ITU-R Recommendation M.1035. 1994 (cit. on p. 37).
[78] Nicholas A DiOrio et al. «Solar System Modeling at NREL». In: (Oct. 2018).
url: https://www.osti.gov/biblio/1477226 (cit. on p. 39).
[79] Tshewang Lhendup and Samten Lhundup. «Comparison of methodologies for
generating a typical meteorological year (TMY)». In: Energy for Sustainable
Development 11.3 (2007), pp. 5–10. issn: 0973-0826. doi: 10.1016/S0973-0826(08)60571-2. url:
https://www.sciencedirect.com/science/article/pii/S0973082608605712 (cit. on p. 39).