THUD

Overview

Source Code

The source code for each step can be found in its respective folder.

More details on the source code can be found in the following sections.

Datasets

The raw dataset can be found in the data_html folder. The processed dataset can be found in the data folder.

More details on these datasets can be found in the following sections.

References

References can be found in the THUD.bib file.

Step 1 Data Crawling

Purpose

The purpose of this step is to parse the acquired HTML files and crawl additional pricing and release date data.

Input

The inputs of this step are the files in the data_html folder.

These files were obtained from PassMark's CPU and GPU datasets by performing the following steps:

  1. Select all columns and show all entries of the tables
  2. Sort data by price so we can easily remove products without price
  3. Copy the resulting HTML into cpu.html and gpu.html, respectively, inside the data_html folder
  4. Remove products without price

Output

The outputs of this step are the files in the data_csv folder.

These files are produced by running step 1's main.py, which performs the following steps:

  1. Parse the HTML files into a pandas.DataFrame while also extracting each row's URL
  2. Use these URLs to crawl the pricing history, release price, and release date data
  3. Write the pandas.DataFrame to a csv file inside the data_csv folder

More details can be found in step 1's main.py.
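
The parsing and export steps above can be sketched as follows. This is only an illustration, not the code in main.py: it uses the standard library's html.parser and csv modules in place of pandas to stay dependency-free, it skips the crawling step entirely, and the sample table, column names, and URLs are hypothetical.

```python
from html.parser import HTMLParser
import csv
import io


class TableRowParser(HTMLParser):
    """Collect the cell text and the first link of every table row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one dict per parsed data row
        self._cells = None    # cell texts of the row being read, or None
        self._url = None      # first href seen in the current row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._url = [], None
        elif tag == "td" and self._cells is not None:
            self._in_cell = True
            self._cells.append("")
        elif tag == "a" and self._in_cell and self._url is None:
            self._url = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._cells:  # header rows (only <th>) have no cells
                self.rows.append({"cells": self._cells, "url": self._url})
            self._cells = None


def parse_table(html, columns):
    """Turn table rows into records, keeping each row's URL."""
    parser = TableRowParser()
    parser.feed(html)
    records = []
    for row in parser.rows:
        record = dict(zip(columns, row["cells"]))
        record["url"] = row["url"]
        records.append(record)
    return records


def to_csv(records):
    """Serialize the records as csv text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample standing in for a file from data_html.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td><a href="cpu.php?id=1">Example CPU A</a></td><td>$199.99</td></tr>
  <tr><td><a href="cpu.php?id=2">Example CPU B</a></td><td>$349.00</td></tr>
</table>
"""

records = parse_table(SAMPLE_HTML, ["name", "price"])
csv_text = to_csv(records)
```

Keeping the URL next to each row is what makes the later crawl possible: each record carries the link to its own detail page.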

Step 2 Data Processing

Purpose

The purpose of this step is to process the data acquired in step 1.

Input

The inputs of this step are the files in the data_csv folder.

Output

The outputs of this step are the files in the data folder.

These files are produced by running step 2's main.py, which performs the following steps:

  1. Drop irrelevant or derived columns
  2. Process data types of columns
  3. Process null rows
  4. Remove irrelevant rows
  5. Recalculate derived columns

More details can be found in step 2's main.py.
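
A minimal sketch of the five processing steps with pandas is shown below. The column names, the sample rows, and the specific rules (dropping null rows, removing zero-score products, deriving value as score divided by price) are assumptions made for illustration; the actual rules are in main.py.

```python
import pandas as pd

# Hypothetical rows standing in for a file from data_csv.
raw = pd.DataFrame(
    {
        "name": ["Example CPU A", "Example CPU B", "Example CPU C", "Example CPU D"],
        "price": ["$199.99", "$1,349.00", None, "$89.50"],
        "score": ["12,000", "30,500", "8,000", "0"],
        "rank": [2, 1, 3, 4],                # assumed irrelevant
        "value": [60.0, 22.6, None, 0.0],    # assumed derived: score / price
    }
)


def process(df):
    # 1. Drop irrelevant or derived columns.
    df = df.drop(columns=["rank", "value"])
    # 2. Convert string columns such as "$1,349.00" to numbers.
    for column in ["price", "score"]:
        cleaned = df[column].str.replace(r"[$,]", "", regex=True)
        df[column] = pd.to_numeric(cleaned, errors="coerce")
    # 3. Handle null rows (here: drop them).
    df = df.dropna(subset=["price", "score"])
    # 4. Remove irrelevant rows (here: products with a zero score).
    df = df[df["score"] > 0].copy()
    # 5. Recalculate the derived column from the cleaned values.
    df["value"] = df["score"] / df["price"]
    return df.reset_index(drop=True)


processed = process(raw)
```

Recomputing the derived column last ensures it reflects the cleaned numeric values rather than whatever was stored in the raw file.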

Step 3 Data Analyzing

Purpose

The purpose of this step is to analyze the data produced by step 2.

Input

The inputs of this step are the files in the data folder.

Output

The outputs of this step are either written as files to the analytics folder or shown on screen using matplotlib. The on-screen outputs can also be found as png files in step4_writting.

These results are produced by running step 3's main.py, which performs the following steps:

  1. Read and preprocess data
  2. Profile data
  3. Plot scatters
  4. Plot averages
  5. Perform regression analysis and calculate its R² value

More details can be found in step 3's main.py.
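
To illustrate the last step, the sketch below fits an ordinary least-squares line and computes its R² value in plain Python. It assumes a simple linear model and uses hypothetical price and score pairs; the regression model and library used in main.py may differ.

```python
def linear_regression(xs, ys):
    """Return (slope, intercept, r_squared) of an ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares).
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical (price, score) pairs standing in for the processed data.
prices = [100.0, 200.0, 300.0, 400.0]
scores = [1100.0, 1900.0, 3200.0, 3800.0]

slope, intercept, r2 = linear_regression(prices, scores)
```

An R² close to 1 means the fitted line explains most of the variance in the scores, while a value near 0 means the fit is no better than predicting the mean.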

Step 4 Writing

Purpose

The purpose of this step is to write up the results gathered from step 3.

Input

The input of this step are:

Output

The output of this step is the THUD.pdf file compiled from the THUD.tex file.
