Unit-4
Introduction to pandas Data Structures: Series, DataFrame and Essential
Functionality: Dropping Entries - Indexing, Selection, and Filtering - Function Application and
Mapping - Sorting and Ranking.
Summarizing and Computing Descriptive Statistics - Unique Values, Value Counts, and Membership.
Reading and Writing Data in Text Format
Python Pandas
Pandas is an open-source Python library that provides high-performance data
manipulation tools for working with data sets. It has functions
for analyzing, cleaning, exploring, and manipulating data.
The name pandas is derived from "panel data", an econometrics term for
multidimensional structured data sets. It is used for data analysis in Python and was developed by Wes McKinney in 2008.
Several tools are available for fast data processing in Python, such as NumPy, SciPy, Cython,
and pandas. Pandas is built on top of the NumPy package, so NumPy is required for pandas to operate.
Features of Pandas
o Reshaping and pivoting of data sets.
o Group-by operations for aggregations and transformations.
o Data alignment and integrated handling of missing data.
o Time series functionality.
o Processing of data sets in different formats, such as matrix data, heterogeneous tabular data, and time
series.
o Common operations on data sets such as subsetting, slicing, filtering, group-by, reordering,
and reshaping.
o Integration with other libraries such as SciPy and scikit-learn.
Benefits of Pandas
o Data Representation: It represents the data in a form that is suited for data analysis through
its DataFrame and Series.
o Clear code: The clear API of pandas lets you focus on the core part of your code, so programs
stay clear and concise.
Introduction to pandas Data Structures
To get started with pandas, you will need to get comfortable with its two workhorse data
structures: Series and DataFrame. While they are not a universal solution for every problem, they
provide a solid, easy-to-use basis for most applications.
Series
A Series is a one-dimensional array-like object containing an array of data (of any NumPy data
type) and an associated array of data labels, called its index. The simplest Series is formed from only
an array of data:
Program:
import pandas as pd
obj = pd.Series([4, 7, -5, 3])
print(obj)
Output:
0 4
1 7
2 -5
3 3
dtype: int64
The string representation of a Series displayed interactively shows the index on the left and the values
on the right. Since we did not specify an index for the data, a default one consisting of the integers 0
through N - 1 (where N is the length of the data) is created. You can get the array representation and
index object of the Series via its values and index attributes, respectively. Often it is useful to create
a Series with an index that labels each data point:
Program:
import pandas as pd
obj = pd.Series([12, 13, 14, 15], index=['a', 'b', 'c', 'd'])
print(obj)
Output:
a 12
b 13
c 14
d 15
dtype: int64
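The values and index attributes mentioned above are not exercised by the program itself. A minimal sketch, reusing the same Series (the variable names vals and labels are illustrative only):

```python
import pandas as pd

obj = pd.Series([12, 13, 14, 15], index=['a', 'b', 'c', 'd'])

# .values exposes the underlying data as a NumPy array
vals = obj.values
# .index exposes the labels as a pandas Index object
labels = obj.index

print(vals)          # the four integers as an ndarray
print(list(labels))  # the four labels as a plain Python list
```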
NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math
functions, will preserve the index-value link:
Program:
import pandas as pd
obj=pd.Series([12,13,14,15],index=['a','b','c','d'])
print(obj*2)
Output:
a 24
b 26
c 28
d 30
dtype: int64
Program:
import pandas as pd
import numpy as np
obj=pd.Series([12,13,14,15],index=['a','b','c','d'])
print(np.exp(obj))
Output:
a 1.627548e+05
b 4.424134e+05
c 1.202604e+06
d 3.269017e+06
dtype: float64
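The programs above cover scalar multiplication and a math function, but not filtering with a boolean array. A small sketch of that case, assuming the same Series (mask and filtered are illustrative names):

```python
import pandas as pd

obj = pd.Series([12, 13, 14, 15], index=['a', 'b', 'c', 'd'])

# The comparison yields a boolean Series aligned on the same index
mask = obj > 13
# Indexing with the mask keeps only the entries where it is True,
# and each surviving value keeps its original label
filtered = obj[mask]
print(filtered)
```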
Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values
to data values. It can be substituted into many functions that expect a dict:
Program:
import pandas as pd
import numpy as np
obj=pd.Series([12,13,14,15],index=['a','b','c','d'])
print('e' in obj)
Output:
False
Program:
import pandas as pd
import numpy as np
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah':5000}
obj=pd.Series(sdata)
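print(obj)
Output
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
When only a dict is passed, the index of the resulting Series holds the dict's keys in insertion order
(older pandas versions sorted the keys). If an index is passed as well, values are matched to it by label;
labels with no matching key, such as 'California' below, get NaN (missing), which the isnull and notnull
functions detect: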
Program:
import pandas as pd
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
print(obj4)
print(pd.isnull(obj4))
Output
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
California True
Ohio False
Oregon False
Texas False
dtype: bool
Program:
import pandas as pd
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
print(obj4)
print(pd.notnull(obj4))
Output:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
California False
Ohio True
Oregon True
Texas True
dtype: bool
DataFrame
A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of
columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a
row and column index; it can be thought of as a dict of Series, all sharing the same index.
In other words, a DataFrame is a standard way to store data that has two different indexes,
i.e., a row index and a column index.
Parameters
data: The input data; it can take different forms such as ndarray, Series, map (dict), constants, lists, or arrays.
index: The row labels. If no index is passed, the default integers 0 through n - 1 (as in np.arange(n)) are used.
columns: The column labels. If no column labels are passed, the default integers 0 through n - 1 (as in np.arange(n)) are used.
dtype: The data type to force on the columns; if omitted, it is inferred from the data.
copy: Whether to copy the data from the input.
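To make the constructor parameters concrete, here is a minimal sketch. The labels 'r1'..'r3' and 'x', 'y' are made up for illustration:

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)

# Explicit row labels, column labels, and element type
df = pd.DataFrame(values, index=['r1', 'r2', 'r3'],
                  columns=['x', 'y'], dtype='float64')
print(df)

# With no index or columns given, integer labels 0..n-1 are used
default_df = pd.DataFrame(values)
print(list(default_df.index), list(default_df.columns))
```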
Create a DataFrame
A DataFrame can be created from:
o dict
o Lists
o NumPy ndarrays
o Series
Program:
# importing the pandas library
import pandas as pd
df = pd.DataFrame()
print (df)
Output:
Empty DataFrame
Columns: []
Index: []
Create a DataFrame using List:
Program:
# importing the pandas library
import pandas as pd
# a list of strings
x = ['Python', 'Pandas']
# Calling DataFrame constructor on list
df = pd.DataFrame(x)
print(df)
Output:
0
0 Python
1 Pandas
Create a DataFrame from Dict
Program:
import pandas as pd
data = { "calories": [420, 380, 390],"duration": [50, 40, 45]}
myvar = pd.DataFrame(data)
print(myvar)
Output:
calories duration
0 420 50
1 380 40
2 390 45
Create a DataFrame from Dict of Series:
Program:
# importing the pandas library
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f']),
'two' : pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])}
d1 = pd.DataFrame(info)
print (d1)
Output
one two
a 1.0 1
b 2.0 2
c 3.0 3
d 4.0 4
e 5.0 5
f 6.0 6
g NaN 7
h NaN 8
Essential Functionality:
Dropping entries from an axis
Dropping one or more entries from an axis is easy if you have an index array or list without those entries. As
building that list can require a bit of munging and set logic, the drop method does the work for you: it returns
a new object with the indicated value or values deleted from an axis.
Syntax:
DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
Drop Single Column
Program:
import pandas as pd
student_dict = {"name": ["Joe", "Nat"], "age": [20, 21], "marks": [85.10, 77.80]}
# Create DataFrame from dict
student_df = pd.DataFrame(student_dict)
print(student_df)
# drop column
student_df = student_df.drop(columns='age')
print(student_df)
Output
  name  age  marks
0  Joe   20   85.1
1  Nat   21   77.8
  name  marks
0  Joe   85.1
1  Nat   77.8
Program:
import pandas as pd
import numpy as np
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
# drop returns a new Series without the entry labelled 'c'
new_obj = obj.drop('c')
print(new_obj)
Output
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
With DataFrame, index values can be deleted from either axis:
Program:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),index=['Ohio', 'Colorado', 'Utah', 'New York'],
columns=['one', 'two', 'three', 'four'])
print(data.drop(['Colorado', 'Ohio' ]))
print("\n")
print(data.drop('two', axis=1) )
print("\n")
print(data.drop(['two', 'four'], axis=1))
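Output
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15


          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14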
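The syntax line for drop also lists the index, columns, inplace, and errors parameters, which the programs above do not all exercise. A hedged sketch of how they behave, using the same frame (the column name 'missing' is deliberately absent from the data):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# index= and columns= name the axis directly instead of using axis=0/1
trimmed = data.drop(index='Utah', columns=['two', 'four'])

# errors='ignore' skips labels that do not exist instead of raising KeyError
same = data.drop(columns='missing', errors='ignore')

# inplace=True modifies the object itself and returns None
work = data.copy()
result = work.drop('Ohio', inplace=True)

print(trimmed)
print(result)
```

Returning a new object (the default) is usually easier to reason about than inplace=True, since the original frame stays available.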
Indexing, selection, and filtering:
Series indexing (obj[...]) works analogously to NumPy array indexing, except you can
use the Series’s index values instead of only integers.
Program:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
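print(obj['b'])
# positional access goes through iloc; plain obj[1] on a labelled Series is deprecated in recent pandas
print(obj.iloc[1])
print(obj.iloc[2:4])
print(obj[['b', 'a', 'd']])
print(obj.iloc[[1, 3]])
print(obj[obj < 2])
Output
1.0
1.0
c    2.0
d    3.0
dtype: float64
b    1.0
a    0.0
d    3.0
dtype: float64
b    1.0
d    3.0
dtype: float64
a    0.0
b    1.0
dtype: float64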
Indexing into a DataFrame retrieves one or more columns, either with a single
value or a sequence:
Program:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data)
print(data['two'])
print(data[['three', 'one']])
Output
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12
Another use case is indexing with a boolean DataFrame, such as one produced by a scalar
comparison:
Program:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data < 5)
data[data < 5] = 0
print(data)
Output
one two three four
Ohio True True True True
Colorado True False False False
Utah False False False False
New York False False False False
one two three four
Ohio 0 0 0 0
Colorado 0 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
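For label-based indexing on the rows of a DataFrame, use the loc indexer, and iloc for integer positions.
(Older material uses the special indexing field ix for this; ix has been removed from current pandas
releases.) These indexers let you select a subset of the rows and columns from a DataFrame with NumPy-like
notation plus axis labels. This is also a less verbose way to do reindexing.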
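A short sketch of label-based and position-based selection of rows and columns. It uses loc and iloc, the indexers available in current pandas releases, on the same frame as the earlier programs:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Label-based: rows 'Colorado' and 'Utah', columns 'two' and 'three'
by_label = data.loc[['Colorado', 'Utah'], ['two', 'three']]

# Position-based: the same cells selected by integer location
by_pos = data.iloc[1:3, 1:3]

# A single row label returns that row as a Series
row = data.loc['Utah']

# A boolean row filter combined with a column selection
big = data.loc[data['three'] > 5, ['one', 'three']]

print(by_label)
print(big)
```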
So there are many ways to select and rearrange the data contained in a pandas object. For
DataFrame, the indexing options are summarized below.
Type                        Notes
obj[val]                    Select a single column or sequence of columns from the DataFrame. Special-case
                            conveniences: boolean array (filter rows), slice (slice rows), or boolean
                            DataFrame (set values based on some criterion).
obj.loc[val]                Select a single row or subset of rows by label (formerly obj.ix[val]).
obj.loc[:, val]             Select a single column or subset of columns by label (formerly obj.ix[:, val]).
obj.loc[val1, val2]         Select both rows and columns by label (formerly obj.ix[val1, val2]).
obj.iloc[...]               Select rows and/or columns by integer location (replaces the older icol and irow methods).
reindex method              Conform one or more axes to new indexes.
xs method                   Select a single row or column as a Series by label.
obj.at[...], obj.iat[...]   Select a single value by row and column label or position (replace the older
                            get_value and set_value methods).
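The remaining entries of the table can be sketched as follows. The label 'Texas' is deliberately absent from the frame to show what reindex does with unknown labels, and at and iat are the scalar accessors in current pandas releases:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# reindex: conform the rows to a new set of labels; unknown labels get NaN
reordered = data.reindex(['Utah', 'Ohio', 'Texas'])

# xs: a single row (default) or a single column (axis=1) as a Series
utah = data.xs('Utah')
col_two = data.xs('two', axis=1)

# at / iat: a single scalar by label or by integer position
by_label = data.at['Colorado', 'three']
by_pos = data.iat[3, 0]

print(reordered)
print(by_label, by_pos)
```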
Function application and mapping
NumPy functions (element-wise array methods) work fine with pandas objects:
Program:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'), index=['Utah', 'Ohio', 'Texas',
'Oregon'])
print(frame)
print(np.abs(frame))
Output (random values, so the numbers differ on every run):
b d e
Utah -0.092044 0.726726 -1.075978
Ohio -0.308196 -1.055195 -0.549732
Texas 1.625654 -0.981833 -0.311490
Oregon 0.294512 1.585100 -0.275431
b d e
Utah 0.092044 0.726726 1.075978
Ohio 0.308196 1.055195 0.549732
Texas 1.625654 0.981833 0.311490
Oregon 0.294512 1.585100 0.275431
Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame's
apply method does exactly this. Many of the most common array statistics (like sum and mean) are
DataFrame methods, so using apply is not necessary for them. The function passed to apply need not return a
scalar value; it can also return a Series with multiple values.
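A minimal sketch of apply along both axes, and of a function that returns a Series. The frame uses fixed numbers rather than random ones so the results are reproducible; the helper names spread and min_max are illustrative:

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, -2.0, 3.0], 'd': [4.0, 0.0, -1.0]},
                     index=['Utah', 'Ohio', 'Texas'])

def spread(x):
    # x is one column (or one row) passed in as a Series
    return x.max() - x.min()

# Default axis=0: the function is called once per column
per_column = frame.apply(spread)

# axis=1: the function is called once per row
per_row = frame.apply(spread, axis=1)

def min_max(x):
    # Returning a Series yields one output row per returned label
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

summary = frame.apply(min_max)

print(per_column)
print(per_row)
print(summary)
```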
Summarizing and Computing Descriptive Statistics
pandas objects are equipped with a set of common mathematical and statistical methods.
Most of these fall into the category of reductions or summary statistics, methods that
extract a single value (like the sum or mean) from a Series or a Series of values from the
rows or columns of a DataFrame. Compared with the equivalent methods of vanilla
NumPy arrays, they are all built from the ground up to exclude missing data. Consider a
small DataFrame:
Example:
from pandas import DataFrame
import numpy as np
df =DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c',
'd'],columns=['one', 'two'])
print(df)
Output:
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
Calling DataFrame’s sum method returns a Series containing column sums:
from pandas import DataFrame
import numpy as np
df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
               index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
# column sums; missing values are skipped
print(df.sum())
Output:
one    9.25
two   -5.80
dtype: float64
Options for reduction methods
Option   Description
axis     Axis to reduce over: 0 for DataFrame's rows and 1 for columns.
skipna   Exclude missing values; True by default.
level    Reduce grouped by level if the axis is hierarchically indexed (older pandas versions only;
         recent versions use groupby(level=...) instead).
EXAMPLE:
from pandas import DataFrame
import numpy as np
df =DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c',
'd'],columns=['one', 'two'])
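# row sums (axis=1 reduces across the columns)
print(df.sum(axis=1))
print('\nskipna option:')
print(df.mean(axis=1, skipna=False))
Output:
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

skipna option:
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64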
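NA values are excluded by default. When the entire slice (row or column in this case) is NA, mean
returns NaN, while sum returns 0 in recent pandas versions unless min_count is set. The exclusion
can be disabled using the skipna option.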
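The handling of missing values described above can be checked with a short sketch. It assumes a reasonably recent pandas release, where sum accepts the min_count parameter:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

# Default: NaN is skipped, so row 'a' averages its single valid value
row_mean = df.mean(axis=1)

# skipna=False: any NaN in the row makes the result NaN
strict_mean = df.mean(axis=1, skipna=False)

# Sum of an all-NaN row is 0 by default ...
row_sum = df.sum(axis=1)

# ... unless at least one valid value is required via min_count
row_sum_min1 = df.sum(axis=1, min_count=1)

print(row_mean)
print(row_sum_min1)
```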
Descriptive and summary statistics:
Example programs:
Some methods, like idxmin and idxmax, return indirect statistics like the index value
where the minimum or maximum values are attained:
from pandas import DataFrame
import numpy as np
df =DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c',
'd'],columns=['one', 'two'])
print("\n")
print(df.idxmax())
print(df.cumsum())
print("\ndescribe method")
print(df.describe())
OutPut:
Unique Values
While working with a DataFrame in pandas, you often need to find the unique elements
present in a column. The unique() method extracts the unique values from a column.
The unique values are returned in order of appearance; the method does not sort them.
It is based on a hash table.
For long enough sequences it is significantly faster than numpy.unique(), and it also includes null values.
To illustrate, consider this example:
from pandas import DataFrame,Series
import numpy as np
df =DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c',
'd'],columns=['one', 'two'])
print("\n")
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
print(uniques)
output:
['c' 'a' 'd' 'b']
The unique values are not necessarily returned in sorted order, but could be sorted after
the fact if needed (uniques.sort()).
Value Counts
The value_counts() function returns a Series that contains counts of unique values. The
result is in descending order, so its first element is the most frequently occurring
value:
from pandas import DataFrame,Series
import numpy as np
df =DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c',
'd'],columns=['one', 'two'])
print("\n")
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
print(obj.value_counts())
output:
c 3
a 3
b 2
d 1
dtype: int64
The Series is sorted by value in descending order as a convenience.
value_counts is also available as a top-level pandas function, pd.value_counts, that can be used with any
array or sequence. Recent pandas releases deprecate the top-level function, so the example below wraps
the values in a Series instead:
from pandas import Series
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
# equivalent to the older pd.value_counts(obj.values, sort=False)
print(Series(obj.values).value_counts(sort=False))
output:
Membership
isin performs a vectorized set membership check and can be very useful in filtering a data
set down to a subset of values in a Series or column in a DataFrame:
from pandas import DataFrame,Series
import pandas as pd
import numpy as np
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
print("\n")
mask=obj.isin(['b', 'c'])
print(obj.value_counts(sort=False))
print(mask,"\n")
print("\nobj[mask]")
print(obj[mask])
output:
Unique, value counts, and binning methods
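The three methods named in this section can be seen side by side in one small sketch, reusing the Series from the programs above:

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

# unique: distinct values in order of first appearance
uniques = obj.unique()

# value_counts: frequency of each distinct value, indexed by the value
counts = obj.value_counts()

# isin: boolean mask that is True where the value belongs to the given set
mask = obj.isin(['b', 'c'])
subset = obj[mask]

print(uniques)
print(counts)
print(subset)
```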
Reading and Writing Data in Text Format
Python has become a beloved language for text and file munging due to its simple syntax
for interacting with files, intuitive data structures, and convenient features like tuple
packing and unpacking. pandas features a number of functions for reading tabular data as
a DataFrame object.
Parsing functions in pandas:
I’ll give an overview of the mechanics of these functions, which are meant to convert
text data into a DataFrame. The options for these functions fall into a few categories:
• Indexing: can treat one or more columns as the index of the returned DataFrame, and whether
to get column names from the file, the user, or not at all.
• Type inference and data conversion: this includes user-defined value conversions
and custom lists of missing value markers.
• Datetime parsing: includes combining capability, such as combining date and
time information spread over multiple columns into a single column in the result.
• Iterating: support for iterating over chunks of very large files.
• Unclean data issues: skipping rows or a footer, comments, or other minor things
like numeric data with thousands separated by commas.
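The option categories above map onto read_csv keyword arguments. The sketch below is illustrative only: the CSV text is a made-up in-memory stand-in for a file on disk, and 'missing' is an arbitrary custom marker for absent values:

```python
import io
import pandas as pd

# Hypothetical file content, held in memory instead of on disk
text = """# sales export
id,date,amount,note
1,2024-01-05,"1,200",ok
2,2024-01-06,missing,late
3,2024-01-07,"3,450",ok
"""

df = pd.read_csv(io.StringIO(text),
                 comment='#',             # unclean data: skip comment lines
                 index_col='id',          # indexing: use a column as the row index
                 parse_dates=['date'],    # datetime parsing
                 thousands=',',           # unclean data: "1,200" is read as 1200
                 na_values=['missing'])   # data conversion: custom missing marker
print(df)

# Iterating: chunksize returns an iterator of DataFrames instead of one frame
chunks = pd.read_csv(io.StringIO(text), comment='#', chunksize=2)
sizes = [len(chunk) for chunk in chunks]
print(sizes)
```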
Type inference is one of the more important features of these functions; that means you
don’t have to specify which columns are numeric, integer, boolean, or string. Handling
dates and other custom types requires a bit more effort, though. Let’s start with a small
comma-separated (CSV) text file:
Save this file as m1.csv.
from pandas import DataFrame,Series
import pandas as pd
import numpy as np
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
df = pd.read_csv('C:\\Users\\bsc_lab1_18\\Downloads\\m1.csv')
print(df)
output:
Header:
from pandas import DataFrame,Series
import pandas as pd
import numpy as np
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
df = pd.read_csv('C:\\Users\\bsc_lab1_18\\Downloads\\m1.csv',header=None)
print(df)
NAMES:
from pandas import Series,DataFrame
import pandas as pd
df=pd.read_csv('C:\\Users\\bsc_lab1_29\\Documents\\m1.csv', names=['col1', 'col2',
'col3', 'col4', 'Heading'])
print(df)
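The effect of header=None and names can be tried without a file on disk. In this sketch the two-row CSV text is invented for illustration; the column names match the ones used in the programs above:

```python
import io
import pandas as pd

# Hypothetical file content without a header row
text = "1,2,3,4,hello\n5,6,7,8,world\n"
names = ['col1', 'col2', 'col3', 'col4', 'Heading']

# header=None: pandas assigns integer column names 0..4
auto = pd.read_csv(io.StringIO(text), header=None)

# names=...: the supplied list becomes the column names
named = pd.read_csv(io.StringIO(text), names=names)

# index_col: one of the named columns becomes the row index
indexed = pd.read_csv(io.StringIO(text), names=names, index_col='Heading')

print(auto)
print(named)
print(indexed)
```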
Suppose you wanted the Heading column to be the index of the returned DataFrame. You
can either indicate you want the column at index 4 or the one named 'Heading' using the
index_col argument:
from pandas import Series,DataFrame
import pandas as pd
names=['col1', 'col2', 'col3', 'col4', 'Heading']
df=pd.read_csv('C:\\Users\\bsc_lab1_29\\Documents\\m1.csv', names=names,
index_col='Heading')
print(df)
OUTPUT:
read_csv /read_table function arguments
Example programs on read_csv /read_table function arguments
isnull function:
from pandas import Series,DataFrame
import pandas as pd
names=['col1', 'col2', 'col3', 'col4', 'Heading']
result =pd.read_csv('C:\\Users\\bsc_lab1_29\\Documents\\m1.csv')
df=pd.isnull(result)
print(result)
print(df)
output:
The na_values option can take either a list or set of strings to treat as missing values:
from pandas import Series,DataFrame
import pandas as pd
names=['col1', 'col2', 'col3', 'col4', 'Heading']
result =pd.read_csv('C:\\Users\\bsc_lab1_29\\Documents\\m1.csv', na_values=['NULL'])
print(result)
OUTPUT:
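Besides a list, na_values also accepts a dict that gives different missing-value markers per column. A hedged sketch with made-up data, where '-999' and 'unknown' are hypothetical sentinel values:

```python
import io
import pandas as pd

# Hypothetical file content with two different missing-value sentinels
text = "name,score,grade\nJoe,-999,A\nNat,77.8,unknown\n"

# Each column gets its own list of strings to treat as missing
result = pd.read_csv(io.StringIO(text),
                     na_values={'score': ['-999'], 'grade': ['unknown']})

# isnull marks the cells that were read as missing
nulls = pd.isnull(result)

print(result)
print(nulls)
```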
Writing Data Out to Text Format
# importing the module
import pandas as pd
# creating the DataFrame
my_df = {'Name': ['Rutuja', 'Anuja'],
'ID': [1, 2],
'Age': [20, 19]}
df = pd.DataFrame(my_df)
# displaying the DataFrame
print('DataFrame:\n', df)
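# saving the DataFrame as a CSV file (to_csv returns None when a path is given)
df.to_csv('C:\\Users\\bsc_lab1_29\\Documents\\m1.csv', index=True)
# calling to_csv without a path returns the CSV text as a string
gfg_csv_data = df.to_csv(index=True)
print('\nCSV String:\n', gfg_csv_data)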
output:
The output of the above code includes the index.
To write a CSV file without the index, pass False for the index parameter:
# importing the module
import pandas as pd
# creating the DataFrame
my_df = {'Name': ['Rutuja', 'Anuja'],
'ID': [1, 2],
'Age': [20, 19]}
df = pd.DataFrame(my_df)
# displaying the DataFrame
print('DataFrame:\n', df)
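# saving the DataFrame as a CSV file without the index (to_csv returns None when a path is given)
df.to_csv('GfG.csv', index=False)
# calling to_csv without a path returns the CSV text as a string
gfg_csv_data = df.to_csv(index=False)
print('\nCSV String:\n', gfg_csv_data)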
output:
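The difference between writing to a target and getting a string back can be verified without touching the disk. This sketch uses an in-memory buffer as a stand-in for a file; the variable names are illustrative:

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Rutuja', 'Anuja'], 'ID': [1, 2], 'Age': [20, 19]})

# No target given: to_csv returns the CSV text as a string
with_index = df.to_csv(index=True)
without_index = df.to_csv(index=False)

# Target given (here an in-memory buffer): to_csv writes to it and returns None
buffer = io.StringIO()
returned = df.to_csv(buffer, index=False)

# Reading the written text back recovers the original frame
restored = pd.read_csv(io.StringIO(buffer.getvalue()))

print(with_index)
print(without_index)
print(returned)
```

With index=True the header line starts with an empty field, because the index column has no name.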