Program EDA

The document discusses Python's dataclass feature, introduced in version 3.7, which simplifies the creation of classes that primarily hold data by automatically generating methods like __init__ and __repr__. It also details the creation of a 3-dimensional data cube using a pivot table in pandas to analyze product sales over time across different regions, including steps for data structure definition, data generation, and analysis techniques. The provided code exemplifies how to implement these concepts using a sample dataset of products.


Points to ponder:

- In Python, a dataclass is a class designed primarily to hold data values. Introduced in Python 3.7, the dataclasses module provides a decorator, @dataclass, that automatically generates common boilerplate methods for classes, making them more concise and easier to manage when serving as data containers.
- Automatic Method Generation: The @dataclass decorator automatically generates special methods like __init__, __repr__, and __eq__ (dunder methods). This eliminates the need to write these methods manually, reducing boilerplate code and the potential for errors.
- Default Values: You can easily assign default values to fields within a dataclass, simplifying object creation when certain fields have common initial states.
- Immutability (Frozen Dataclasses): Dataclasses can be made immutable by setting frozen=True in the decorator, preventing modification of field values after object instantiation, which is useful for creating read-only data structures (see the second example below).
- Example:

from dataclasses import dataclass

@dataclass
class Point:
    x: float
    y: float
    z: float = 0.0  # Field with a default value

# Creating instances
p1 = Point(1.0, 2.0)
p2 = Point(3.0, 4.0, 5.0)

# Automatic __repr__
print(p1)  # Output: Point(x=1.0, y=2.0, z=0.0)
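
A short illustrative sketch (the FrozenPoint class is our own example, not part of the program below) showing the generated __eq__ and a frozen dataclass:

from dataclasses import dataclass

@dataclass(frozen=True)
class FrozenPoint:
    x: float
    y: float

a = FrozenPoint(1.0, 2.0)
b = FrozenPoint(1.0, 2.0)
print(a == b)   # True -- the generated __eq__ compares field values
# a.x = 5.0     # would raise dataclasses.FrozenInstanceError because frozen=True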
Pivot table:
A pandas pivot table behaves like a data cube and supports analytical operations such as slicing and dicing.
A pivot table is a data summarization tool used to organize and analyze large amounts
of data in a spreadsheet or database. It allows you to quickly summarize, sort,
reorganize, group, count, total, or average data, providing different perspectives and
insights into your data.

Syntax:

DataFrame.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True)

Parameters:

● values: Columns to aggregate.
● index: Columns to use as the new row index.
● columns: Columns to use as the new column headers.
● aggfunc: Aggregation function such as mean, sum, or count; the default is mean.
● fill_value: Value to replace missing data.
● margins: Whether to add row/column totals; the default is False.
● dropna: Whether to drop columns whose entries are all missing; the default is True.
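
For instance, a minimal sketch of pivot_table on a small hypothetical sales table (the Region/Product/Amount columns here are illustrative and are not the dataset used in the program below):

import pandas as pd

# Hypothetical sales records (illustrative only)
sales = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Product": ["Pen", "Book", "Pen", "Book"],
    "Amount": [10, 20, 15, 25],
})

# Total Amount per Region (rows) and Product (columns)
summary = sales.pivot_table(values="Amount", index="Region",
                            columns="Product", aggfunc="sum", fill_value=0)
print(summary)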
A pivot table of this kind is a common way to track product sales over time across different regions, helping businesses analyze sales performance by product, country, and date.

Aim:
To implement a data cube for a data warehouse on 3-dimensional data for product sales over time across different regions (product, date, country) using Python.
Algorithm:

Dimensions:

1. Product (nested under Category)
2. Date
3. Country

Measure: Price

1. Import libraries & define data structures (use dataclass to model each product record).
2. Define the Product entity (contains name (str), category (str) and price (float)).
3. Instantiate dimension members:
   - Create a list of Product objects (name, category, price).
   - Create lists for all Date values and Country values.
4. Generate the fact table.
5. Load facts into a pandas DataFrame.
6. Create the 3-D data cube (pivot table).
7. Explore / analyze (slice & dice, roll-up, extract).
8. Visualize or integrate downstream:
   - Plot with matplotlib or feed to a BI tool (see the sketch at the end of the code).
   - Store in a data-warehouse table or OLAP engine for large-scale analytics.

Result: The cube is a compact 3-dimensional structure that supports fast analytical queries (slice, dice, roll-up, drill-down) using familiar pandas operations.
Code:
import pandas as pd
from dataclasses import dataclass

# Define the Product class
@dataclass
class Product:
    name: str
    category: str
    price: float

# Sample product data
products = [
    Product("Laptop", "Electronics", 1000),
    Product("T-shirt", "Clothing", 20),
    Product("Book", "Books", 15),
    Product("Headphones", "Electronics", 100),
    Product("Jeans", "Clothing", 50),
    Product("Smartphone", "Electronics", 800),
    Product("Sunglasses", "Accessories", 30),
    Product("Watch", "Accessories", 50),
    Product("Shoes", "Footwear", 80),
]

# Define dates and countries
dates = ["2023-05-01", "2023-05-02", "2023-05-03"]
countries = ["USA", "UK", "Germany"]

# Generate the full dataset
data = []
for product in products:
    for date in dates:
        for country in countries:
            data.append({
                "Category": product.category,
                "Product": product.name,
                "Date": date,
                "Country": country,
                "Price": product.price
            })

# Convert to DataFrame
df = pd.DataFrame(data)

# Create a pivot table (data cube)
data_cube = pd.pivot_table(
    df,
    values="Price",
    index=["Category", "Product"],
    columns=["Date", "Country"],
    aggfunc="first"  # each price is unique and atomic
)
# Display the data cube
print("3D Data Cube (Product x Date x Country):\n")
print(data_cube)

# Slice operation
# Selecting a single value for one dimension, reducing the cube's dimensionality.
# Show Laptop prices across countries and dates
print(data_cube.loc[("Electronics", "Laptop")])

# Rollup operation
# Definition: Aggregating data up a hierarchy or grouping dimension.
# Roll-up by Product Category → Sum of prices for each category.
print(data_cube.groupby(level=0).sum())
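
# Dice operation (illustrative sketch)
# Definition: Selecting a sub-cube by filtering on two or more dimensions.
# Keep only the Electronics category and only the USA and UK columns.
electronics = data_cube.loc["Electronics"]
print(electronics.loc[:, (slice(None), ["USA", "UK"])])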

# Export the cube to CSV (forward slashes avoid backslash-escape issues in the path)
data_cube.to_csv("MyDrive/cube.csv")
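
A minimal visualization sketch for step 8, assuming matplotlib is installed (the chart type and title are illustrative choices):

import matplotlib.pyplot as plt

# Roll the cube up to one total price per category and plot it as a bar chart
category_totals = data_cube.groupby(level=0).sum().sum(axis=1)
category_totals.plot(kind="bar", title="Total price per category")
plt.tight_layout()
plt.show()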
