The source code for each step can be found in its respective folder. More details on the source code can be found in the following sections.
The raw dataset can be found in the data_html folder, and the processed dataset can be found in the data folder. More details on these datasets can be found in the following sections.
References can be found in the THUD.bib file.
The purpose of this step is to parse the acquired HTML files and crawl additional pricing and date data.
The inputs of this step are the files in the data_html folder.
These files are acquired from PassMark's CPU and GPU datasets after performing the following steps:
- Select all columns and show all entries of the tables
- Sort the data by price so that products without a price can easily be removed
- Copy the HTML files to cpu.html and gpu.html respectively inside the data_html folder
- Remove products without a price
The outputs of this step are the files in the data_csv folder.
These files are produced by running step 1's main.py, which performs the following steps:
- Parse the HTML files into a pandas.DataFrame while also extracting each row's URLs
- Use these URLs to crawl the pricing history, release price, and release date data
- Output the pandas.DataFrame into a CSV file inside the data_csv folder
More details can be found in step 1's main.py.
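As an illustration of this parse-and-crawl pass, below is a minimal sketch assuming a BeautifulSoup-based parser; the table selectors, column handling, and the per-row product link are assumptions and may differ from the actual implementation in step 1's main.py.

```python
# Illustrative sketch only; the real selectors, column names and crawling
# logic live in step 1's main.py and may differ.
import pandas as pd
from bs4 import BeautifulSoup

def parse_table(html_path: str) -> pd.DataFrame:
    """Parse a saved PassMark table into a DataFrame, keeping each row's URL."""
    with open(html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    rows = []
    for tr in soup.select("table tbody tr"):          # assumed table layout
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        link = tr.find("a")
        url = link.get("href") if link else None      # per-row product URL
        rows.append(cells + [url])
    return pd.DataFrame(rows)

if __name__ == "__main__":
    df = parse_table("data_html/cpu.html")
    # Each row's URL would then be crawled for pricing history, release
    # price and release date before writing the result to data_csv.
    df.to_csv("data_csv/cpu.csv", index=False)
```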
The purpose of this step is to process the data acquired from step 1.
The inputs of this step are the files in the data_csv folder.
The outputs of this step are the files in the data folder.
These files are produced by running step 2's main.py, which performs the following steps:
- Drop irrelevant or derived columns
- Process the data types of the columns
- Process null rows
- Remove irrelevant rows
- Recalculate derived columns
More details can be found in step 2's main.py.
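A minimal sketch of what this cleaning pass might look like follows; the column names (price, release_date, mark) and the derived perf_per_dollar column are hypothetical stand-ins for whatever step 2's main.py actually drops, casts, and recalculates.

```python
# Illustrative sketch only; the actual columns and rules are defined in
# step 2's main.py.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Drop irrelevant or derived columns (hypothetical names).
    df = df.drop(columns=["rank", "value"], errors="ignore")
    # Process data types: turn price strings such as "$1,299.99" into floats.
    df["price"] = (
        df["price"].astype(str).str.replace(r"[$,]", "", regex=True).astype(float)
    )
    df["release_date"] = pd.to_datetime(df["release_date"], errors="coerce")
    df["mark"] = pd.to_numeric(df["mark"], errors="coerce")
    # Process null rows and remove irrelevant ones.
    df = df.dropna(subset=["price", "release_date", "mark"])
    # Recalculate derived columns, e.g. performance per dollar.
    df["perf_per_dollar"] = df["mark"] / df["price"]
    return df

if __name__ == "__main__":
    clean(pd.read_csv("data_csv/cpu.csv")).to_csv("data/cpu.csv", index=False)
```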
The purpose of this step is to analyze the data acquired from step 2.
The inputs of this step are the files in the data folder.
The outputs of this step are the files in the analytics folder, or they appear on screen via matplotlib.
The outputs that appear on screen can be found in the step4_writting folder in the form of PNG files.
These results are produced by running step 3's main.py, which performs the following steps:
- Read and preprocess the data
- Profile the data
- Plot scatter plots
- Plot averages
- Perform regression analysis and calculate its R² value
More details can be found in step 3's main.py.
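A minimal sketch of this analysis pass follows, assuming the cleaned CSV has numeric price and mark columns and a release_date column; these names, and the use of scipy.stats.linregress for the regression, are assumptions rather than necessarily what step 3's main.py does.

```python
# Illustrative sketch only; the real plots and regression are in step 3's
# main.py. Column names here are assumed.
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("data/cpu.csv", parse_dates=["release_date"])

# Profile the data.
print(df.describe())

# Scatter plot of benchmark score against price.
df.plot.scatter(x="price", y="mark")

# Average price per release year.
df.groupby(df["release_date"].dt.year)["price"].mean().plot(kind="bar")

# Linear regression of score on price and its R² value.
res = stats.linregress(df["price"], df["mark"])
print("R² =", res.rvalue ** 2)

plt.show()
```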
The purpose of this step is to write about the results gathered from step 3.
The inputs of this step are:
- The files in the analytics folder
- The PNG files in the step4_writting folder
- The THUD.bib file, which is used for citations
The output of this step is the THUD.pdf file, compiled from the THUD.tex file.