
Unit-4

Introduction to pandas Data Structures: Series, DataFrame and Essential Functionality: Dropping Entries - Indexing, Selection, and Filtering - Function Application and Mapping - Sorting and Ranking.

Summarizing and Computing Descriptive Statistics - Unique Values, Value Counts, and Membership.
Reading and Writing Data in Text Format

Python Pandas

Python Pandas is an open-source library that provides high-performance data manipulation in Python. Pandas is a Python library used for working with data sets; it has functions for analyzing, cleaning, exploring, and manipulating data.

The name Pandas is derived from "panel data", an econometrics term for multidimensional structured data sets. It is used for data analysis in Python and was developed by Wes McKinney in 2008.

Different tools are available for fast data processing, such as NumPy, SciPy, Cython, and Pandas. Pandas is built on top of the NumPy package, so NumPy is required for Pandas to operate.

Features of Pandas

o Used for reshaping and pivoting of data sets.

o Groups data for aggregations and transformations.

o Provides data alignment and integrated handling of missing data.

o Provides time series functionality.

o Processes a variety of data sets in different formats, such as matrix data, tabular heterogeneous data, and time series.

o Handles multiple operations on data sets such as subsetting, slicing, filtering, groupby, reordering, and reshaping.

o Integrates with other libraries such as SciPy and scikit-learn.

Benefits of Pandas

o Data Representation: It represents the data in a form that is suited for data analysis through
its DataFrame and Series.

o Clear code: The clear API of Pandas lets you focus on the core part of your code, resulting in clear and concise code.

Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data
type) and an associated array of data labels, called its index. The simplest Series is formed from only
an array of data:

Program:

import pandas as pd

obj = pd.Series([4, 7, -5, 3])

print(obj)

Output:

0 4

1 7

2 -5

3 3

dtype: int64

The string representation of a Series displayed interactively shows the index on the left and the values
on the right. Since we did not specify an index for the data, a default one consisting of the integers 0
through N - 1 (where N is the length of the data) is created. You can get the array representation and
index object of the Series via its values and index attributes, respectively:
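As a quick illustration of the values and index attributes (a minimal sketch; the exact printed form may vary slightly across pandas versions):

Program:
import pandas as pd
obj = pd.Series([4, 7, -5, 3])
print(obj.values)   # the underlying array of data
print(obj.index)    # the index object

Output:
[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)

Often it will be desirable to create a Series with an index identifying each data point: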

Program:
import pandas as pd

import numpy as np

obj=pd.Series([12,13,14,15],index=['a','b','c','d'])

print((obj))

Output:

a 12

b 13

c 14

d 15

dtype: int64

NumPy array operations, such as filtering with a boolean array, scalar multiplication, or applying math
functions, will preserve the index-value link:

Program:
import pandas as pd
obj=pd.Series([12,13,14,15],index=['a','b','c','d'])
print(obj*2)

Output:
a 24
b 26
c 28
d 30
dtype: int64

Program:
import pandas as pd
import numpy as np
obj=pd.Series([12,13,14,15],index=['a','b','c','d'])
print(np.exp(obj))

Output:
a 1.627548e+05
b 4.424134e+05
c 1.202604e+06
d 3.269017e+06

dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. It can be substituted into many functions that expect a dict:

Program:
import pandas as pd
import numpy as np
obj=pd.Series([12,13,14,15],index=['a','b','c','d'])
print('e' in obj)

Output:
False

Program:
import pandas as pd
import numpy as np
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah':5000}
obj=pd.Series(sdata)
print(obj)

Output
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When only passing a dict, the index in the resulting Series will follow the dict's key order (insertion order in current pandas versions; older pandas versions sorted the keys).

Program:
import pandas as pd
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
print(obj4)
print(pd.isnull(obj4))

Output
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
California True
Ohio False
Oregon False
Texas False
dtype: bool

Program:
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']

obj4 = pd.Series(sdata, index=states)


print(obj4)
print(pd.notnull(obj4))
Output:
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
California False
Ohio True
Oregon True
Texas True
dtype: bool

DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and a column index; it can be thought of as a dict of Series (all sharing the same index). DataFrame is a standard way to store data that has two different indexes, i.e., a row index and a column index.

Parameters of the DataFrame constructor:
data: It can take different forms such as an ndarray, Series, map, constants, lists, or dict.
index: The default index np.arange(n) is used for the row labels if no index is passed.
columns: The default np.arange(n) is used for the column labels; it is only used if no column labels are passed.
dtype: It refers to the data type of each column.
copy: It is used for copying the data.
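A minimal sketch showing these constructor parameters used together (the labels 'r1', 'r2' and 'a', 'b', 'c' are illustrative; output formatting is approximate):

Program:
import numpy as np
import pandas as pd
# two rows and three columns of data
arr = np.array([[1, 2, 3], [4, 5, 6]])
df = pd.DataFrame(arr, index=['r1', 'r2'], columns=['a', 'b', 'c'], dtype=float)
print(df)

Output:
      a    b    c
r1  1.0  2.0  3.0
r2  4.0  5.0  6.0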

Create a DataFrame
o dict
o Lists
o NumPy ndarrays
o Series

Program:
# importing the pandas library
import pandas as pd
df = pd.DataFrame()
print (df)
Output:

Empty DataFrame
Columns: []
Index: []

Create a DataFrame using List:


Program:
# importing the pandas library
import pandas as pd
# a list of strings
x = ['Python', 'Pandas']
# Calling DataFrame constructor on list
df = pd.DataFrame(x)
print(df)
Output:
0
0 Python
1 Pandas

Create a DataFrame from Dict

Program:
import pandas as pd
data = { "calories": [420, 380, 390],"duration": [50, 40, 45]}
myvar = pd.DataFrame(data)
print(myvar)

Output:
calories duration
0 420 50
1 380 40
2 390 45

Create a DataFrame from Dict of Series:

Program:
# importing the pandas library
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f']),
'two' : pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])}
d1 = pd.DataFrame(info)
print (d1)
Output

one two
a 1.0 1
b 2.0 2
c 3.0 3
d 4.0 4
e 5.0 5
f 6.0 6
g NaN 7
h NaN 8

Essential Functionality:
Dropping entries from an axis
Dropping one or more entries from an axis is easy if you have an index array or list without those entries. As
that can require a bit of munging and set logic, the drop method will return a new object with the indicated
value or values deleted from an axis.
Syntax:
DataFrame.drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Drop Single Column

Program:
import pandas as pd
student_dict = {"name": ["Joe", "Nat"], "age": [20, 21], "marks": [85.10, 77.80]}
# Create DataFrame from dict
student_df = pd.DataFrame(student_dict)
print(student_df)
# drop column
student_df = student_df.drop(columns='age')
print(student_df)
Output
  name  age  marks
0  Joe   20   85.1
1  Nat   21   77.8
  name  marks
0  Joe   85.1
1  Nat   77.8

Program:
from pandas import Series
import pandas as pd
import numpy as np
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
print(new_obj)
Output
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis:

Program:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

print(data.drop(['Colorado', 'Ohio']))
print("\n")
print(data.drop('two', axis=1))
print("\n")
print(data.drop(['two', 'four'], axis=1))
Output
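For reference, the expected output is:

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15


          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14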

Indexing, selection, and filtering:

Series indexing (obj[...]) works analogously to NumPy array indexing, except you can
use the Series’s index values instead of only integers.

Program:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
print(obj['b'])
print(obj[1])
print(obj[2:4])
print(obj[['b', 'a', 'd']])

print(obj[[1, 3]])
print(obj[obj < 2])

Output
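For reference, the expected output is shown below (note that in recent pandas versions, integer-position indexing such as obj[1] on a label-indexed Series emits a FutureWarning; obj.iloc[1] is preferred):

1.0
1.0
c    2.0
d    3.0
dtype: float64
b    1.0
a    0.0
d    3.0
dtype: float64
b    1.0
d    3.0
dtype: float64
a    0.0
b    1.0
dtype: float64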

Indexing into a DataFrame retrieves one or more columns, either with a single value or with a sequence:

Program:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data)
print(data['two'])
print(data[['three', 'one']])

Output
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15

Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32

three one
Ohio 2 0
Colorado 6 4
Utah 10 8
New York 14 12

Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar
comparison

Program:
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data < 5)
data[data < 5] = 0
print(data)

Output

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

For DataFrame label-indexing on the rows, older versions of pandas provided the special indexing field ix. It enabled you to select a subset of the rows and columns from a DataFrame with NumPy-like notation plus axis labels, and was also a less verbose way to do reindexing. In current pandas versions ix has been removed; label-based selection is done with loc and integer-position selection with iloc:
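A minimal sketch of the modern loc/iloc equivalents, built on the same data DataFrame as the examples above (the dtype may display as int64 on some platforms):

Program:
import pandas as pd
import numpy as np
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
print(data.loc['Colorado', ['two', 'three']])   # select by row and column labels
print(data.iloc[2, [3, 0, 1]])                  # select by integer positions
print(data.loc[:'Utah', 'two'])                 # label slicing includes the endpoint

Output:
two      5
three    6
Name: Colorado, dtype: int32
four    11
one      8
two      9
Name: Utah, dtype: int32
Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int32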

So there are many ways to select and rearrange the data contained in a pandas object. The table below summarizes the indexing options with a DataFrame.

Type                           Notes

obj[val]                       Select single column or sequence of columns from the DataFrame. Special
                               case conveniences: boolean array (filter rows), slice (slice rows), or
                               boolean DataFrame (set values based on some criterion).
obj.ix[val]                    Selects single row or subset of rows from the DataFrame.
obj.ix[:, val]                 Selects single column or subset of columns.
obj.ix[val1, val2]             Select both rows and columns.
reindex method                 Conform one or more axes to new indexes.
xs method                      Select single row or column as a Series by label.
icol, irow methods             Select single column or row, respectively, as a Series by integer location.
get_value, set_value methods   Select single value by row and column label.

Function application and mapping


NumPy functions (element-wise array methods) work fine with pandas objects:

Program:
from pandas import Series, DataFrame

import pandas as pd

import numpy as np

frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

print(frame)

print(np.abs(frame))

b d e

Utah -0.092044 0.726726 -1.075978

Ohio -0.308196 -1.055195 -0.549732

Texas 1.625654 -0.981833 -0.311490

Oregon 0.294512 1.585100 -0.275431

b d e

Utah 0.092044 0.726726 1.075978

Ohio 0.308196 1.055195 0.549732

Texas 1.625654 0.981833 0.311490

Oregon 0.294512 1.585100 0.275431


Another frequent operation is applying a function on 1D arrays to each column or row, using DataFrame's apply method. Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary for them. The function passed to apply need not return a scalar value; it can also return a Series with multiple values, as in the sketch below.
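A minimal sketch of apply, reusing a frame like the one above (the random values will differ on every run, so no output is shown):

Program:
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
f = lambda x: x.max() - x.min()
print(frame.apply(f))           # apply f to each column
print(frame.apply(f, axis=1))   # apply f to each row

# the function passed to apply can also return a Series with multiple values
def min_max(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
print(frame.apply(min_max))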

Summarizing and Computing Descriptive Statistics

pandas objects are equipped with a set of common mathematical and statistical methods.
Most of these fall into the category of reductions or summary statistics, methods that
extract a single value (like the sum or mean) from a Series or a Series of values from the
rows or columns of a DataFrame. Compared with the equivalent methods of vanilla
NumPy arrays, they are all built from the ground up to exclude missing data. Consider a
small DataFrame:

Example:

from pandas import DataFrame

import numpy as np

df =DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c',
'd'],columns=['one', 'two'])

print(df)

O/P:
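The expected output (formatting approximate):

    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3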

Calling DataFrame’s sum method returns a Series containing column sums:

from pandas import DataFrame

import numpy as np

df =DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c',
'd'],columns=['one', 'two'])

print('data frame\n',df)

#print('\n')

print('data frame sum\n',df.sum())

output:
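The expected output (formatting approximate):

data frame
     one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
data frame sum
 one    9.25
two   -5.80
dtype: float64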

Options for reduction methods

Method   Description
axis     Axis to reduce over: 0 for the DataFrame's rows and 1 for its columns.
skipna   Exclude missing values; True by default.
level    Reduce grouped by level if the axis is hierarchically indexed (MultiIndex).

EXAMPLE:

from pandas import DataFrame

import numpy as np

df =DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c',
'd'],columns=['one', 'two'])

print('data frame sum with axis\n',df.sum(axis=1))

print('\nskipna option:')

print( df.mean(axis=1, skipna=False))

output:
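With a recent pandas version the output is approximately as follows (for the all-NA row 'c', sum with the default skipna=True returns 0.0 in current pandas, whereas older versions returned NaN):

data frame sum with axis
 a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

skipna option:
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64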

NA values are excluded unless the entire slice (row or column in this case) is NA; this can be disabled using the skipna option. (Note that in current pandas versions, sum over an all-NA slice returns 0.0 by default unless min_count is set, while mean still returns NaN.)

Descriptive and summary statistics:
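A partial list of these methods (summarized here for reference):

count           Number of non-NA values
describe        Compute a set of summary statistics for a Series or each DataFrame column
min, max        Compute minimum and maximum values
idxmin, idxmax  Compute index labels at which the minimum or maximum value is attained
quantile        Compute sample quantile ranging from 0 to 1
sum             Sum of values
mean            Mean of values
median          Arithmetic median (50% quantile) of values
var             Sample variance of values
std             Sample standard deviation of values
cumsum          Cumulative sum of values
diff            Compute first arithmetic difference (useful for time series)
pct_change      Compute percent changes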

Example programs:

Some methods, like idxmin and idxmax, return indirect statistics like the index value
where the minimum or maximum values are attained:

from pandas import DataFrame

import numpy as np

df =DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c',
'd'],columns=['one', 'two'])
print("\n")
print(df.idxmax())
print(df.cumsum())
print("\ndescribe method")
print(df.describe())
Output:
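The expected output (values as pandas prints them, formatting approximate):

one    b
two    d
dtype: object
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8

describe method
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000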

Unique Values

While working with a DataFrame or Series in pandas, you often need to find the unique elements present in a column. For this, pandas provides the unique() method to extract the unique values from a column.

The unique values are returned in the order of their first occurrence; the result is not sorted. The method is based on a hash table, which makes it significantly faster than numpy.unique(), and it also includes NA values.

To illustrate these, consider this example:

from pandas import DataFrame,Series


import numpy as np
df =DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c',
'd'],columns=['one', 'two'])

print("\n")
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
uniques = obj.unique()
print(uniques)

output:

['c' 'a' 'd' 'b']

The unique values are not necessarily returned in sorted order, but could be sorted after
the fact if needed (uniques.sort()).

Value Counts

The value_counts() function returns a Series containing counts of unique values. The resulting object is in descending order, so its first element is the most frequently occurring value.

value_counts computes a Series containing value frequencies:

from pandas import DataFrame,Series

import numpy as np
df =DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c',
'd'],columns=['one', 'two'])
print("\n")
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
print(obj.value_counts())

output:

c 3

a 3

b 2

d 1

dtype: int64

The Series is sorted by value in descending order as a convenience.


value_counts is also available as a top-level pandas method that can be used with any array or sequence (in recent pandas versions this top-level function is deprecated in favor of the Series.value_counts method):

from pandas import DataFrame,Series

import pandas as pd

import numpy as np

df =DataFrame([[1.4, np.nan], [7.1, -4.5],[np.nan, np.nan], [0.75, -1.3]], index=['a', 'b', 'c',
'd'],columns=['one', 'two'])

print("\n")

obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

print(pd.value_counts(obj.values, sort=False))

output:

Membership

isin performs a vectorized set membership check and can be very useful in filtering a data set down to a subset of values in a Series or a column in a DataFrame:

from pandas import DataFrame,Series


import pandas as pd
import numpy as np
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
print("\n")
mask=obj.isin(['b', 'c'])
print(pd.value_counts(obj.values, sort=False))
print(mask,"\n")
print("\nobj[mask]")
print(obj[mask])
output:

Unique, value counts, and binning methods
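A brief summary of these methods (reconstructed here for reference):

Method        Description
isin          Compute a boolean array indicating whether each Series value is contained in the passed sequence of values
unique        Compute an array of unique values in a Series, returned in the order observed
value_counts  Return a Series containing unique values as its index and frequencies as its values, ordered by count in descending order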

Reading and Writing Data in Text Format

Python has become a beloved language for text and file munging due to its simple syntax
for interacting with files, intuitive data structures, and convenient features like tuple
packing and unpacking. pandas features a number of functions for reading tabular data as
a DataFrame object.

Parsing functions in pandas:
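The most commonly used parsing functions include (a partial list):

read_csv        Load delimited data from a file, URL, or file-like object; comma is the default delimiter
read_table      Load delimited data with tab ('\t') as the default delimiter
read_fwf        Read data in fixed-width column format (no delimiters)
read_clipboard  Version of read_csv that reads data from the clipboard
read_excel      Read tabular data from an Excel file
read_json       Read data from a JSON string or file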

I'll give an overview of the mechanics of these functions, which are meant to convert text data into a DataFrame. The options for these functions fall into a few categories:

• Indexing: can treat one or more columns as the returned DataFrame's index, and whether to get column names from the file, from the user, or not at all.

• Type inference and data conversion: this includes the user-defined value conversions
and custom list of missing value markers.

• Datetime parsing: includes combining capability, such as combining date and time information spread over multiple columns into a single column in the result.

• Iterating: support for iterating over chunks of very large files.

• Unclean data issues: skipping rows or a footer, comments, or other minor things

like numeric data with thousands separated by commas.

Type inference is one of the more important features of these functions; that means you
don’t have to specify which columns are numeric, integer, boolean, or string. Handling
dates and other custom types requires a bit more effort, though. Let’s start with a small
comma-separated (CSV) text file:

This file must be saved as m1.csv.

from pandas import DataFrame,Series

import pandas as pd

import numpy as np

obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

df = pd.read_csv('C:\\Users\\bsc_lab1_18\\Downloads\\m1.csv')

print(df)

output:

Header:

from pandas import DataFrame,Series

import pandas as pd

import numpy as np

obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

df = pd.read_csv('C:\\Users\\bsc_lab1_18\\Downloads\\m1.csv',header=None)

print(df)

NAMES:

from pandas import Series,DataFrame

import pandas as pd

df = pd.read_csv('C:\\Users\\bsc_lab1_29\\Documents\\m1.csv',
                 names=['col1', 'col2', 'col3', 'col4', 'Heading'])

print(df)

Suppose you wanted the last column to be the index of the returned DataFrame. You can either indicate that you want the column at index 4 or refer to it by name (here 'Heading') using the index_col argument:

from pandas import Series,DataFrame

import pandas as pd

names=['col1', 'col2', 'col3', 'col4', 'Heading']

df=pd.read_csv('C:\\Users\\bsc_lab1_29\\Documents\\m1.csv', names=names,
index_col='Heading')

print(df)

OUTPUT:

read_csv /read_table function arguments
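Some of the most commonly used arguments (a partial list):

sep          Character or regular expression used to split fields in each row
header       Row number to use as column names; defaults to 0, use None if there is no header row
index_col    Column number(s) or name(s) to use as the row index of the result
names        List of column names for the result; combine with header=None
skiprows     Number of rows at the beginning of the file to skip, or a list of row numbers to skip
na_values    Sequence of values to replace with NA
nrows        Number of rows to read from the beginning of the file
chunksize    For iteration, the size of file chunks
parse_dates  Attempt to parse data to datetime
encoding     Text encoding of the file (for example, 'utf-8')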

Example programs on read_csv /read_table function arguments

isnull function:

from pandas import Series,DataFrame

import pandas as pd

names=['col1', 'col2', 'col3', 'col4', 'Heading']

result =pd.read_csv('C:\\Users\\bsc_lab1_29\\Documents\\m1.csv')

df=pd.isnull(result)

print(result)

print(df)

output:

The na_values option can take either a list or set of strings to consider missing values:

from pandas import Series,DataFrame

import pandas as pd

names=['col1', 'col2', 'col3', 'col4', 'Heading']

result =pd.read_csv('C:\\Users\\bsc_lab1_29\\Documents\\m1.csv', na_values=['NULL'])

print(result)

OUTPUT:

Writing Data Out to Text Format

# importing the module

import pandas as pd

# creating the DataFrame

my_df = {'Name': ['Rutuja', 'Anuja'],

'ID': [1, 2],

'Age': [20, 19]}

df = pd.DataFrame(my_df)

# displaying the DataFrame

print('DataFrame:\n', df)

# saving the DataFrame as a CSV file

gfg_csv_data = df.to_csv('C:\\Users\\bsc_lab1_29\\Documents.m1.csv', index = True)

print('\nCSV String:\n', gfg_csv_data)

output:

Because a file path was passed to to_csv, it writes the file and returns None, so the printed "CSV String" above is None. The saved CSV file itself includes the index, as follows.
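Expected contents of the saved CSV file (with the default index=True):

,Name,ID,Age
0,Rutuja,1,20
1,Anuja,2,19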

Converting to a CSV file without the index: if we do not wish to include the index, we assign the value False to the index parameter.

# importing the module

import pandas as pd

# creating the DataFrame

my_df = {'Name': ['Rutuja', 'Anuja'],

'ID': [1, 2],

'Age': [20, 19]}


df = pd.DataFrame(my_df)

# displaying the DataFrame

print('DataFrame:\n', df)

# saving the DataFrame as a CSV file

gfg_csv_data = df.to_csv('GfG.csv', index = False)

print('\nCSV String:\n', gfg_csv_data)

output:
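Expected contents of the saved GfG.csv file:

Name,ID,Age
Rutuja,1,20
Anuja,2,19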

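Beyond index, to_csv accepts several other commonly used options. A minimal sketch (writing to sys.stdout instead of a file so that the result is printed to the console):

import sys
import pandas as pd

df = pd.DataFrame({'Name': ['Rutuja', 'Anuja'], 'ID': [1, 2], 'Age': [20, 19]})

df.to_csv(sys.stdout, sep='|')                                # use '|' as the field delimiter
df.to_csv(sys.stdout, na_rep='NULL')                          # write missing values as 'NULL'
df.to_csv(sys.stdout, index=False, columns=['Name', 'Age'])   # write a subset of columns, no index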
