
Introduction to Pandas

Pandas is a powerful Python library for data manipulation and analysis. It provides two primary
data structures:

1. Series: A one-dimensional array-like object (like a list with labeled indices).
2. DataFrame: A two-dimensional table-like structure with labeled rows and columns, similar to Excel or SQL tables.
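
A minimal sketch of each structure:

python

import pandas as pd

# Series: one-dimensional data with labeled indices
ages = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"])
print(ages["Bob"])  # 30

# DataFrame: two-dimensional, with labeled rows and columns
df = pd.DataFrame({"Age": ages})
print(df)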

Why Use Pandas?

 Efficiently handle large datasets.
 Perform data cleaning, filtering, and transformation.
 Easily read and write data from/to files like CSV, Excel, or databases.

Creating DataFrames in Pandas

A DataFrame is essentially a table with rows and columns. Let’s explore how to create one.

1. From a Dictionary

The most common way is using a dictionary where keys are column names, and values are data.

python

import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["New York", "Los Angeles", "Chicago"]
}

df = pd.DataFrame(data)
print(df)

Output:

text

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago

2. From a List of Lists

You can also create a DataFrame from a list of lists.

python

# List of lists
data = [
    ["Alice", 25, "New York"],
    ["Bob", 30, "Los Angeles"],
    ["Charlie", 35, "Chicago"]
]

# Defining column names
columns = ["Name", "Age", "City"]

df = pd.DataFrame(data, columns=columns)
print(df)

3. From a CSV File

If your data is stored in a CSV file, you can read it directly into a DataFrame.

python

df = pd.read_csv("example.csv")
print(df)

4. Manually Create DataFrame

You can create a DataFrame directly by specifying the data in the constructor.

python

df = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
    "City": ["New York", "Los Angeles"]
})
print(df)

Basic Operations

Once you’ve created your DataFrame, you can:

1. Access Columns: df["Age"]
2. Access Rows: df.iloc[0] (by integer position), df.loc[0] (by index label)
3. Filter Data: df[df["Age"] > 28] (a combined sketch of these operations follows below)
4. Add a New Column:

python

df["Gender"] = ["Female", "Male", "Male"]

Next Steps:

1. Learn common data manipulation techniques (head(), tail(), describe(), groupby()).
2. Practice loading and cleaning real-world datasets.
3. Explore visualizations with the DataFrame .plot() method or libraries like Matplotlib/Seaborn (a quick preview follows below).
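
As a quick preview (each of these techniques is covered in detail later in this document), a minimal sketch using the df from above:

python

print(df.head())                         # First rows
print(df.describe())                     # Summary statistics
print(df.groupby("City")["Age"].mean())  # Mean age per city
df["Age"].plot(kind="bar")               # Built-in plotting (requires matplotlib)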

Quandl
Quandl is a platform for financial, economic, and alternative datasets (now part of Nasdaq Data Link). The Quandl Python API allows you to fetch datasets programmatically. Note that some legacy datasets, such as WIKI, are no longer updated, though the API patterns below still apply. Here's a guide to reading and querying Quandl data with examples.

Setup

To use the Quandl Python API:

1. Install the Quandl library:

bash

pip install quandl

2. Get an API key by creating an account on Quandl.

Steps to Work with Quandl

1. Import the Quandl Library

python

import quandl

2. Set Your API Key


python

quandl.ApiConfig.api_key = "your_api_key"

Fetching Quandl Data

1. Retrieve a Single Dataset

Quandl datasets have unique codes (e.g., WIKI/AAPL for Apple stock prices).

python

# Fetching Apple's stock price data
data = quandl.get("WIKI/AAPL")
print(data.head())

This fetches a Pandas DataFrame containing the dataset. Use head() to preview the first few
rows.

2. Specify a Date Range

To fetch data for a specific time range, use the start_date and end_date parameters.

python

data = quandl.get("WIKI/AAPL", start_date="2020-01-01", end_date="2020-12-


31")
print(data.head())

3. Retrieve Multiple Datasets

You can request multiple datasets in one call by passing a list of codes.

python

# Fetch multiple datasets
data = quandl.get(["WIKI/AAPL", "WIKI/GOOGL"])
print(data.head())

4. Fetch Specific Columns

If a dataset has many columns, you can specify which ones to retrieve.

python
data = quandl.get("WIKI/AAPL", column_index=4) # Example: Fetch only the
"Close" column
print(data.head())

5. Querying Data Locally

Once data is loaded into a Pandas DataFrame, you can query it with standard Pandas operations.

python

# Example: filtering rows where the adjusted closing price is above $150
# (assumes the full WIKI/AAPL dataset, which includes an "Adj. Close" column)
filtered_data = data[data["Adj. Close"] > 150]
print(filtered_data)

Advanced Querying

1. Rolling Averages

Calculate moving averages for trends:

python

data['Rolling_Avg'] = data["Adj. Close"].rolling(window=10).mean()
print(data[['Adj. Close', 'Rolling_Avg']].tail())

2. Visualization

Visualize stock trends using Matplotlib.

python

import matplotlib.pyplot as plt

data["Adj. Close"].plot(title="Apple Stock Prices", figsize=(10, 6))


plt.show()

Using Quandl for Economic Data

Quandl also offers economic datasets, e.g., GDP, inflation rates, or unemployment data.

python

# Fetch US GDP data
gdp_data = quandl.get("FRED/GDP")
print(gdp_data.head())

Saving Data Locally

You can save the fetched data for offline use:

python

data.to_csv("apple_stock.csv") # Save as CSV
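
To reload the saved file later (a minimal sketch; index_col=0 and parse_dates restore the date index):

python

data = pd.read_csv("apple_stock.csv", index_col=0, parse_dates=True)
print(data.head())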

Additional Tips

1. Explore the Quandl dataset search on the website to find specific dataset codes.
2. Use advanced APIs for time-series manipulations like resampling or interpolation (a sketch follows below).
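
A minimal resampling sketch, assuming data is the daily WIKI/AAPL DataFrame fetched above (its index is a DatetimeIndex):

python

# Downsample daily prices to monthly means
# (recent pandas versions prefer the "ME" alias over "M" for month-end)
monthly = data["Adj. Close"].resample("M").mean()
print(monthly.head())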

Would you like a specific example or exercises based on Quandl?


describing pandas data frames


When working with Pandas DataFrames, describing the data is crucial for understanding its
structure, content, and statistics. Pandas offers built-in methods to explore DataFrames
efficiently.

Key Methods to Describe DataFrames

1. Preview the Data

 head(): Shows the first few rows of the DataFrame (default: 5 rows).
 tail(): Shows the last few rows of the DataFrame.

python

print(df.head())   # First 5 rows
print(df.tail(3))  # Last 3 rows

2. Basic Information
 info(): Displays a summary of the DataFrame, including column names, data types, and
non-null values.
 shape: Gives the dimensions of the DataFrame (rows, columns).

python

df.info()          # Prints the DataFrame summary (returns None, so no print() needed)
print(df.shape)    # (rows, columns)

3. Descriptive Statistics

 describe(): Provides summary statistics (count, mean, std, min, max, etc.) for
numerical columns by default.
 For all columns (numerical + categorical), use include="all".

python

print(df.describe())               # Summary statistics for numerical columns
print(df.describe(include="all"))  # Summary for all columns

Column-wise Analysis

1. List Column Names

python

print(df.columns)

2. Check Data Types

python

print(df.dtypes)

3. View Unique Values in a Column

python

print(df["ColumnName"].unique())

4. Count Values in a Column

python

print(df["ColumnName"].value_counts())

Data Inspection

1. Missing Values

 isnull(): Identifies missing values.
 sum(): Chained after isnull() to count missing values per column.

python

print(df.isnull().sum()) # Count missing values per column

2. Check for Duplicates

python

print(df.duplicated().sum()) # Count duplicate rows

3. Get Data Sample

python

print(df.sample(5)) # Randomly sample 5 rows

Data Visualization

For quick visualization of column distributions:

python

import matplotlib.pyplot as plt

# Plot histogram for a column
df["ColumnName"].hist()
plt.show()

# Scatter plot of two numerical columns
df.plot.scatter(x="Column1", y="Column2")
plt.show()

Example: Describing a DataFrame

Sample DataFrame

python

import pandas as pd

# Example DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Age": [25, 30, 35, 40, 29],
    "City": ["NY", "LA", "Chicago", "NY", "LA"],
    "Salary": [50000, 60000, 75000, 80000, 58000]
}

df = pd.DataFrame(data)

Description

python

print(df.head())                  # Preview data
df.info()                         # DataFrame structure
print(df.describe())              # Statistics for numerical data
print(df["City"].value_counts())  # Value counts for categorical data

Would you like to explore more advanced insights, such as correlations or group-based statistics?


grouping and joining


In Pandas, grouping and joining are powerful methods for analyzing and combining datasets.
Here's an explanation with examples:

1. Grouping in Pandas
Grouping data is useful for aggregations (e.g., sum, mean) based on one or more columns.

Syntax: groupby()
python

df.groupby("ColumnName").aggregation_function()  # e.g. .mean(), .sum(), .count()

Examples

1.1. Grouping by a Single Column

python
import pandas as pd

# Sample DataFrame
data = {
    "Department": ["HR", "HR", "IT", "IT", "Finance"],
    "Employee": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Salary": [50000, 60000, 75000, 80000, 58000]
}

df = pd.DataFrame(data)

# Group by Department and calculate mean salary
grouped = df.groupby("Department")["Salary"].mean()
print(grouped)

Output:

text

Department
Finance    58000.0
HR         55000.0
IT         77500.0
Name: Salary, dtype: float64

1.2. Grouping by Multiple Columns

python

# Add a column for gender
df["Gender"] = ["F", "M", "M", "M", "F"]

# Group by Department and Gender, calculate sum of salaries
grouped = df.groupby(["Department", "Gender"])["Salary"].sum()
print(grouped)

Output:

Department  Gender
Finance     F          58000
HR          F          50000
            M          60000
IT          M         155000
Name: Salary, dtype: int64

1.3. Applying Multiple Aggregations

python

# Calculate both mean and sum for each department
grouped = df.groupby("Department")["Salary"].agg(["mean", "sum"])
print(grouped)

Output:

text

               mean     sum
Department
Finance     58000.0   58000
HR          55000.0  110000
IT          77500.0  155000

2. Joining in Pandas
Joining is used to combine two or more DataFrames based on a common column or index.

2.1. Types of Joins

1. Inner Join: Keeps only rows that have matching keys in both DataFrames.
2. Left Join: Keeps all rows from the left DataFrame and matches with the right one.
3. Right Join: Keeps all rows from the right DataFrame and matches with the left one.
4. Outer Join: Keeps all rows from both DataFrames, filling NaN where there's no match.

Syntax: merge()
python

pd.merge(left, right, on="common_column", how="join_type")

Examples

2.2. Joining Two DataFrames

python

# Employee information
df1 = pd.DataFrame({
    "Employee": ["Alice", "Bob", "Charlie", "David"],
    "Department": ["HR", "HR", "IT", "IT"]
})

# Salary information
df2 = pd.DataFrame({
    "Employee": ["Alice", "Bob", "Charlie", "Eve"],
    "Salary": [50000, 60000, 75000, 58000]
})

# Perform an inner join
joined = pd.merge(df1, df2, on="Employee", how="inner")
print(joined)

Output:

  Employee Department  Salary
0    Alice         HR   50000
1      Bob         HR   60000
2  Charlie         IT   75000

2.3. Left Join

python

# Perform a left join
left_joined = pd.merge(df1, df2, on="Employee", how="left")
print(left_joined)

Output:

  Employee Department   Salary
0    Alice         HR  50000.0
1      Bob         HR  60000.0
2  Charlie         IT  75000.0
3    David         IT      NaN

2.4. Outer Join

python

# Perform an outer join
outer_joined = pd.merge(df1, df2, on="Employee", how="outer")
print(outer_joined)

Output:

  Employee Department   Salary
0    Alice         HR  50000.0
1      Bob         HR  60000.0
2  Charlie         IT  75000.0
3    David         IT      NaN
4      Eve        NaN  58000.0
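
A right join (listed above but not demonstrated) keeps every row of df2: David drops out and Eve appears with a NaN Department. A minimal sketch:

python

# Perform a right join: all rows from df2 are kept
right_joined = pd.merge(df1, df2, on="Employee", how="right")
print(right_joined)
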
Key Differences Between groupby() and merge()

 groupby() is for aggregations based on column values within a single DataFrame.
 merge() is for combining two or more DataFrames on shared keys.

Would you like exercises or tasks for better understanding of these concepts?


working with missing values


Handling missing values in datasets is essential for maintaining the integrity and quality of
analysis. Pandas provides several methods to identify, handle, and impute missing data
efficiently.

Step 1: Identifying Missing Values

In Pandas, missing values are represented as NaN (Not a Number).

1.1. Check for Missing Values

 isnull(): Identifies missing values (returns True for NaN values).
 notnull(): Opposite of isnull() (returns True for non-missing values).

python

import pandas as pd

# Sample DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie", None],
    "Age": [25, 30, None, 35],
    "Salary": [50000, 60000, 75000, None]
}

df = pd.DataFrame(data)

# Check for missing values
print(df.isnull())

Output:

text

    Name    Age  Salary
0  False  False   False
1  False  False   False
2  False   True   False
3   True  False    True

1.2. Count Missing Values

python

# Count missing values per column
print(df.isnull().sum())

Output:

text

Name 1
Age 1
Salary 1
dtype: int64

Step 2: Handling Missing Values

2.1. Drop Missing Values

 dropna(): Removes rows or columns with missing values.

python

# Drop rows with missing values
df_dropped_rows = df.dropna()
print(df_dropped_rows)

# Drop columns with missing values
df_dropped_cols = df.dropna(axis=1)
print(df_dropped_cols)

2.2. Fill Missing Values

 fillna(): Replaces missing values with a specified value.

python
# Fill missing values with a specific value
df_filled = df.fillna(0)
print(df_filled)

# Fill missing values with the mean of the column
df["Age"] = df["Age"].fillna(df["Age"].mean())
print(df)

2.3. Forward/Backward Fill

These methods propagate non-missing values forward or backward.

python

# Forward fill (fillna(method="ffill") is deprecated in recent pandas)
df_ffill = df.ffill()
print(df_ffill)

# Backward fill
df_bfill = df.bfill()
print(df_bfill)

2.4. Interpolate Missing Values

 interpolate(): Estimates missing values using interpolation.

python

df_interpolated = df[["Age", "Salary"]].interpolate()  # interpolate numeric columns only
print(df_interpolated)

Step 3: Advanced Missing Value Handling

3.1. Replace Missing Values Based on Condition

python

# Replace missing salaries with 50000 if the age is above 30
df.loc[df["Salary"].isnull() & (df["Age"] > 30), "Salary"] = 50000
print(df)

3.2. Drop Rows with Too Many Missing Values

You can drop rows or columns based on a threshold for missing values.

python

# Keep rows that have at least half as many non-missing values as there are columns
# (thresh is the minimum number of non-NaN values required to keep a row)
df_threshold = df.dropna(thresh=int(len(df.columns) / 2))
print(df_threshold)

Example Dataset and Output

Input:

python

data = {
    "Name": ["Alice", "Bob", "Charlie", None],
    "Age": [25, 30, None, 35],
    "Salary": [50000, 60000, 75000, None]
}

df = pd.DataFrame(data)

Output After Filling Missing Values:

python

# Filling missing values with column means
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())
print(df)

Output:

text

      Name   Age        Salary
0    Alice  25.0  50000.000000
1      Bob  30.0  60000.000000
2  Charlie  30.0  75000.000000
3     None  35.0  61666.666667

Tips for Working with Missing Values

1. Analyze Patterns: Use visualizations (e.g., heatmaps) to analyze missing data.

python

import seaborn as sns

sns.heatmap(df.isnull(), cbar=False, cmap="viridis")

2. Domain Knowledge: Use knowledge about the dataset to decide how to handle missing
values (e.g., zero may not be appropriate for some contexts).

Would you like exercises or specific datasets to practice?



creating pivot tables


Creating pivot tables in Pandas allows you to summarize, group, and reorganize data, much like
in Excel. The pivot_table() method in Pandas provides a powerful and flexible way to analyze
datasets.

Syntax of pivot_table()
python

pd.pivot_table(
    data,
    values=None,
    index=None,
    columns=None,
    aggfunc="mean",
    fill_value=None
)

 data: DataFrame to pivot.
 values: Column(s) to aggregate.
 index: Column(s) whose values become the rows of the pivot table.
 columns: Column(s) whose values become the columns.
 aggfunc: Aggregation function (e.g., mean, sum, count).
 fill_value: Value to replace missing values in the pivot table.

Example Dataset
python

import pandas as pd

# Sample data
data = {
    "Employee": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Department": ["HR", "IT", "IT", "HR", "Finance"],
    "Gender": ["F", "M", "M", "M", "F"],
    "Salary": [50000, 60000, 75000, 80000, 58000],
    "Bonus": [5000, 7000, 6000, 8000, 4000]
}

df = pd.DataFrame(data)
print(df)

Output:

text

  Employee Department Gender  Salary  Bonus
0    Alice         HR      F   50000   5000
1      Bob         IT      M   60000   7000
2  Charlie         IT      M   75000   6000
3    David         HR      M   80000   8000
4      Eve    Finance      F   58000   4000

Basic Pivot Table

1. Group by a Single Column

To find the average salary in each department:

python

pivot = pd.pivot_table(df, values="Salary", index="Department",


aggfunc="mean")
print(pivot)

Output:

text

             Salary
Department
Finance     58000.0
HR          65000.0
IT          67500.0

2. Group by Multiple Columns

To group by Department and Gender, and find the total Bonus:

python

pivot = pd.pivot_table(df, values="Bonus", index="Department",


columns="Gender", aggfunc="sum")
print(pivot)

Output:
text

Gender F M
Department
Finance 4000.0 NaN
HR 5000.0 8000.0
IT NaN 13000.0

Advanced Pivot Tables

3. Multiple Aggregations

To calculate both mean and sum of salaries by department:

python

pivot = pd.pivot_table(df, values="Salary", index="Department",


aggfunc=["mean", "sum"])
print(pivot)

Output:

text

               mean     sum
Department
Finance     58000.0   58000
HR          65000.0  130000
IT          67500.0  135000

4. Handling Missing Values

To replace NaN values with 0 in the pivot table:

python

pivot = pd.pivot_table(
    df,
    values="Bonus",
    index="Department",
    columns="Gender",
    aggfunc="sum",
    fill_value=0
)
print(pivot)

Output:

text

Gender         F      M
Department
Finance     4000      0
HR          5000   8000
IT             0  13000

5. Group by Multiple Values

To summarize both Salary and Bonus for each Department:

python

pivot = pd.pivot_table(df, values=["Salary", "Bonus"], index="Department",


aggfunc="mean")
print(pivot)

Output:

text

             Bonus   Salary
Department
Finance     4000.0  58000.0
HR          6500.0  65000.0
IT          6500.0  67500.0

6. Add Margins for Totals

To include totals for rows and columns:

python

pivot = pd.pivot_table(
    df,
    values="Salary",
    index="Department",
    columns="Gender",
    aggfunc="mean",
    margins=True
)
print(pivot)

Output:

Gender            F             M           All
Department
Finance     58000.0           NaN  58000.000000
HR          50000.0  80000.000000  65000.000000
IT              NaN  67500.000000  67500.000000
All         54000.0  71666.666667  64600.000000

Tips for Pivot Tables

1. Explore Your Data First: Use df.describe() and df.info() to understand the data.
2. Use Visualization: After creating pivot tables, visualize the results with libraries like Matplotlib or Seaborn (a sketch follows below).
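
A minimal plotting sketch, assuming the fill_value pivot table from section 4 above and that Matplotlib is installed:

python

import matplotlib.pyplot as plt

# Grouped bar chart: total bonus per department, split by gender
pivot.plot(kind="bar", title="Total Bonus by Department and Gender")
plt.ylabel("Bonus")
plt.show()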

Would you like help visualizing a pivot table or more exercises?


dealing with dates


Dealing with dates in Pandas is essential for analyzing time-series data or performing operations
that depend on time. Pandas has built-in functions for working with dates and times efficiently.

1. Converting to Datetime

1.1. Using pd.to_datetime()

Converts a column or a list of date strings into a datetime object.

python

import pandas as pd

# Sample DataFrame
data = {
    "Date": ["2024-12-01", "2024-12-02", "2024-12-03"],
    "Value": [10, 20, 30]
}

df = pd.DataFrame(data)

# Convert the Date column to datetime
df["Date"] = pd.to_datetime(df["Date"])
print(df)
print(df.dtypes)

Output:

text

        Date  Value
0 2024-12-01     10
1 2024-12-02     20
2 2024-12-03     30

Date     datetime64[ns]
Value             int64
dtype: object

2. Extracting Date Components

After converting to datetime, you can extract components like the year, month, day, etc.

python

# Extract components
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Day"] = df["Date"].dt.day
df["DayOfWeek"] = df["Date"].dt.day_name()

print(df)

Output:

text

        Date  Value  Year  Month  Day DayOfWeek
0 2024-12-01     10  2024     12    1    Sunday
1 2024-12-02     20  2024     12    2    Monday
2 2024-12-03     30  2024     12    3   Tuesday

3. Filtering Data by Date

3.1. Filtering for a Specific Date

python

filtered = df[df["Date"] == "2024-12-02"]


print(filtered)

Output:

text

        Date  Value  Year  Month  Day DayOfWeek
1 2024-12-02     20  2024     12    2    Monday

3.2. Filtering by Date Range

python

# Filter for dates after 2024-12-01
filtered = df[df["Date"] > "2024-12-01"]
print(filtered)

4. Generating Date Ranges

4.1. Using pd.date_range()

Create a range of dates with a specified frequency.

python

# Generate daily dates
date_range = pd.date_range(start="2024-12-01", end="2024-12-07", freq="D")
print(date_range)

Output:

text

DatetimeIndex(['2024-12-01', '2024-12-02', ..., '2024-12-07'],
              dtype='datetime64[ns]', freq='D')

4.2. Custom Frequencies

 "D": Daily
 "W": Weekly
 "M": Month-end
 "H": Hourly

python

# Generate monthly dates (use freq="ME" on pandas 2.2+)
monthly_dates = pd.date_range(start="2024-01-01", periods=5, freq="M")
print(monthly_dates)

5. Handling Missing Dates

If a time series has missing dates, you can fill them in.

5.1. Reindex with a Complete Date Range


python

# Original DataFrame with missing dates
data = {
    "Date": ["2024-12-01", "2024-12-03"],
    "Value": [10, 30]
}
df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"])

# Create a full date range and reindex
full_range = pd.date_range(start="2024-12-01", end="2024-12-03", freq="D")
df = df.set_index("Date").reindex(full_range).reset_index()
df.columns = ["Date", "Value"]
print(df)

Output:

text

        Date  Value
0 2024-12-01   10.0
1 2024-12-02    NaN
2 2024-12-03   30.0

6. Performing Date Arithmetic

6.1. Adding/Subtracting Days

python

# Add 7 days to each date
df["DatePlus7"] = df["Date"] + pd.Timedelta(days=7)
print(df)

6.2. Calculate Differences

python

# Calculate the difference between consecutive dates
df["DateDiff"] = df["Date"].diff()
print(df)

7. Working with Time Periods

7.1. Converting to Periods

python

# Convert dates to monthly periods
df["MonthPeriod"] = df["Date"].dt.to_period("M")
print(df)

7.2. Resampling Data

You can group data by specific time intervals, e.g., monthly totals.

python

# Resample by month and sum values (use "ME" on pandas 2.2+)
df = df.set_index("Date")
resampled = df["Value"].resample("M").sum()
print(resampled)
print(resampled)

8. Handling Time Zones

If working with time zones, Pandas supports timezone-aware datetimes.

python

# Localize to a timezone (assumes df still has a naive "Date" column)
df["Date"] = pd.to_datetime(df["Date"])
df["Date"] = df["Date"].dt.tz_localize("UTC")

# Convert to another timezone
df["Date_EST"] = df["Date"].dt.tz_convert("US/Eastern")
print(df)

Would You Like Exercises?

1. Filter data for weekends only.
2. Create a rolling 7-day sum of a time-series dataset.
3. Handle missing data in an irregular time series.

Let me know if you'd like help with any of these tasks!
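
As a starting point for the first two exercises, a minimal sketch (assumes a DataFrame df with a datetime "Date" column and a numeric "Value" column):

python

# Exercise 1: keep only weekend rows (Monday=0, ..., Saturday=5, Sunday=6)
weekends = df[df["Date"].dt.dayofweek >= 5]
print(weekends)

# Exercise 2: trailing 7-day rolling sum over a date-indexed series
rolling_sum = df.set_index("Date")["Value"].rolling("7D").sum()
print(rolling_sum)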
