Points to ponder:
- In Python, a dataclass is a class designed primarily to hold data
values. Introduced in Python 3.7, the dataclasses module provides a
decorator, @dataclass, that automatically generates common
boilerplate methods for classes, making them more concise and easier
to manage when serving as data container
- Automatic Method Generation:
The @dataclass decorator automatically generates special methods like
__init__, __repr__, and __eq__ (dundar methods). This eliminates the
need to manually write these methods, reducing boilerplate code and potential
for errors.
- Default Values:
- You can easily assign default values to fields within a dataclass, simplifying
object creation when certain fields have common initial states
- Immutability (Frozen Dataclasses):
Dataclasses can be made immutable by setting frozen=True in the decorator,
preventing modification of field values after object instantiation, which is useful
for creating read-only data structures.
- Example:
from dataclasses import dataclass
@dataclass
class Point:
x: float
y: float
z: float = 0.0 # Field with a default value
# Creating instances
p1 = Point(1.0, 2.0)
p2 = Point(3.0, 4.0, 5.0)
# Automatic __repr__
print(p1) # Output: Point(x=1.0, y=2.0, z=0.0)
Pivot table:
pivot table, which behaves like a data cube and supports analytical
operations such as slicing, dicing, etc.
A pivot table is a data summarization tool used to organize and analyze large amounts
of data in a spreadsheet or database. It allows you to quickly summarize, sort,
reorganize, group, count, total, or average data, providing different perspectives and
insights into your data.
Syntax:
DataFrame.pivot_table( values=None, index=None, columns=None,
aggfunc='mean', fill_value=None, margins=False, dropna=True)
Parameters:
● values: Columns to aggregate.
● index: Columns to use as the new row index.
● columns: Columns to use as the new column headers.
● aggfunc: Aggregation functions like mean, sum, count etc. By
default it is mean.
● fill_value: Value to replace missing data.
● margins: Whether to add totals, default is false.
● dropna: Whether to exclude missing values from the DataFrame,
default is True.
This table is a common way to track product sales over time across different regions, helping
businesses analyze sales performance by product, country, and date.
Aim
To Implement data cube for datawarehouse on 3-dimensional data for product sales over time
across different regions (product, date, country.) using python
Algorithm:
Dimensions:
1. Product (nested under Category)
2. .Date
3. Country
Measure: Price
1. Import libraries & define data structures (Use dataclass to model each product record.
2. Define the Product entity (contains name (str), category(str) and price(float))
3. Instantiate dimension members (
- Create a list of Product objects (name, category, price).
- Create lists for all Date values and Country values
4. Generate the fact table
5. Load facts into a pandas DataFrame
6. Create the 3‑D data cube (pivot table)
7. Explore / analyze (slice &dice , Rollup, Extract)
8. Visualize or integrate downstream - Plot with matplotlib or feed to a BI tool.
- Store in a data‑warehouse table or OLAP engine for large‑scale analytics.
Result: cube is a compact 3‑dimensional structure that supports fast analytical queries
(slice, dice, roll‑up, drill‑down) using familiar pandas operations.
Code:
import pandas as pd
from dataclasses import dataclass
# Define the Product class
@dataclass
class Product:
name: str
category: str
price: float
# Sample product data
products = [
Product("Laptop", "Electronics", 1000),
Product("T-shirt", "Clothing", 20),
Product("Book", "Books", 15),
Product("Headphones", "Electronics", 100),
Product("Jeans", "Clothing", 50),
Product("Smartphone", "Electronics", 800),
Product("Sunglasses", "Accessories", 30),
Product("Watch", "Accessories", 50),
Product("Shoes", "Footwear", 80),
]
# Define dates and countries
dates = ["2023-05-01", "2023-05-02", "2023-05-03"]
countries = ["USA", "UK", "Germany"]
# Generate the full dataset
data = []
for product in products:
for date in dates:
for country in countries:
data.append({
"Category": product.category,
"Product": product.name,
"Date": date,
"Country": country,
"Price": product.price
})
# Convert to DataFrame
df = pd.DataFrame(data)
# Create a pivot table (data cube)
data_cube = pd.pivot_table(
df,
values="Price",
index=["Category", "Product"],
columns=["Date", "Country"],
aggfunc="first" # each price is unique and atomic
)
# Display the data cube
print("3D Data Cube (Product x Date x Country):\n")
print(data_cube)
# sliceup operation
# Selecting a single value for one dimension, reducing the cube’s
dimensionality.
# how Laptop prices across countries and dates
print(data_cube.loc[("Electronics", "Laptop")])
# Rollup operation
# Definition: Aggregating data up a hierarchy or grouping dimension.
# Roll-up by Product Category → Sum of prices for each category.
print(data_cube.groupby(level=0).sum())
#Export
data_cube.to_csv("MyDrive\cube.csv")