The source code for each step can be found in its respective folder. More details on the source code can be found in the following sections.
The raw dataset can be found in the data_html folder, and the processed dataset can be found in the data folder. More details on these datasets can be found in the following sections.
References can be found in the THUD.bib file.
The purpose of this step is to parse the acquired HTML files and crawl additional pricing and date data.
The inputs of this step are the files in the data_html folder.
These files are acquired from PassMark's CPU and GPU datasets after performing the following steps:
- Select all columns and show all entries of the tables
- Sort the data by price so that products without a price can easily be removed
- Copy the HTML files to cpu.html and gpu.html respectively inside the data_html folder
- Remove products without a price
The outputs of this step are the files in the data_csv folder.
These files are produced by running step 1's main.py, which performs the following steps:
- Parse the HTML files into a pandas.DataFrame while also extracting each row's URLs
- Use these URLs to crawl the pricing history, release price, and release date data
- Output the pandas.DataFrame into a CSV file inside the data_csv folder
More details can be found in step 1's main.py.
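As an illustration of this parse-and-crawl pass, below is a minimal sketch assuming a BeautifulSoup-based parser; the table selectors, column handling, and the per-row product link are assumptions and may differ from the actual implementation in step 1's main.py.

```python
# Illustrative sketch only; the real selectors, column names and crawling
# logic live in step 1's main.py and may differ.
import pandas as pd
from bs4 import BeautifulSoup

def parse_table(html_path: str) -> pd.DataFrame:
    """Parse a saved PassMark table into a DataFrame, keeping each row's URL."""
    with open(html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    rows = []
    for tr in soup.select("table tbody tr"):          # assumed table layout
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        link = tr.find("a")
        url = link.get("href") if link else None      # per-row product URL
        rows.append(cells + [url])
    return pd.DataFrame(rows)

if __name__ == "__main__":
    df = parse_table("data_html/cpu.html")
    # Each row's URL would then be crawled for pricing history, release
    # price and release date before writing the result to data_csv.
    df.to_csv("data_csv/cpu.csv", index=False)
```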
The purpose of this step is to process the data acquired from step 1.
The inputs of this step are the files in the data_csv folder.
The outputs of this step are the files in the data folder.
These files are produced by running step 2's main.py, which performs the following steps:
- Drop irrelevant or derived columns
- Process the data types of the columns
- Process null rows
- Remove irrelevant rows
- Recalculate derived columns
More details can be found in step 2's main.py.
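A minimal sketch of what this cleaning pass might look like follows; the column names (price, release_date, mark) and the derived perf_per_dollar column are hypothetical stand-ins for whatever step 2's main.py actually drops, casts, and recalculates.

```python
# Illustrative sketch only; the actual columns and rules are defined in
# step 2's main.py.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Drop irrelevant or derived columns (hypothetical names).
    df = df.drop(columns=["rank", "value"], errors="ignore")
    # Process data types: turn price strings such as "$1,299.99" into floats.
    df["price"] = (
        df["price"].astype(str).str.replace(r"[$,]", "", regex=True).astype(float)
    )
    df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
    df["mark"] = pd.to_numeric(df["mark"], errors="coerce")
    # Process null rows and remove irrelevant ones.
    df = df.dropna(subset=["price", "release_date", "mark"])
    # Recalculate derived columns, e.g. performance per dollar.
    df["perf_per_dollar"] = df["mark"] / df["price"]
    return df

if __name__ == "__main__":
    clean(pd.read_csv("data_csv/cpu.csv")).to_csv("data/cpu.csv", index=False)
```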
The purpose of this step is to analyze the data acquired from step 2.
The inputs of this step are the files in the data folder.
The outputs of this step are the files in the analytics folder, or they appear on screen via matplotlib.
The outputs that appear on screen can be found in the step4_writting folder in the form of PNG files.
These results are produced by running step 3's main.py, which performs the following steps:
- Read and preprocess the data
- Profile the data
- Plot scatter plots
- Plot averages
- Perform regression analysis and calculate its R² value
More details can be found in step 3's main.py.
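A minimal sketch of this analysis pass follows, assuming the cleaned CSV has numeric price and mark columns and a release_date column; these names, and the use of scipy.stats.linregress for the regression, are assumptions rather than necessarily what step 3's main.py does.

```python
# Illustrative sketch only; the real plots and regression are in step 3's
# main.py. Column names here are assumed.
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("data/cpu.csv", parse_dates=["release_date"])

# Profile the data.
print(df.describe())

# Scatter plot of benchmark score against price.
df.plot.scatter(x="price", y="mark")

# Average price per release year.
df.groupby(df["release_date"].dt.year)["price"].mean().plot(kind="bar")

# Linear regression of score on price and its R² value.
res = stats.linregress(df["price"], df["mark"])
print("R² =", res.rvalue ** 2)

plt.show()
```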
The purpose of this step is to write about the results gathered from step 3.
The inputs of this step are:
- The files in the analytics folder
- The PNG files in the step4_writting folder
- The THUD.bib file, which is used for citations
The output of this step is the THUD.pdf file, compiled from the THUD.tex file.