DHP Unit - 4 Part2

Uploaded by

8t46ct85bb
Unit-4 Python Interaction with text and CSV

In the same way, to retrieve multiple columns we can provide a list of column names as a subscript to the DataFrame object, as data[[list of column names]].
Example:
>>>data[['Eno', 'Name']]

Output:
Eno Name
0 101 amit
1 102 deep
2 103 jay
3 104 nency
4 105 shree

(f) to_numpy()

The DataFrame.to_numpy() function is applied on a DataFrame and returns a NumPy ndarray. To perform some high-level mathematical functions, we can convert a Pandas DataFrame to a NumPy array using the DataFrame.to_numpy() function.

Syntax:
DataFrame.to_numpy(dtype=None, copy=False)
Parameters
• dtype: An optional parameter giving the dtype to pass to numpy.asarray().
• copy: A boolean value with the default value False. When True, it ensures that the returned value is not a view on another array.

Returns
It returns a numpy.ndarray as output.

Example:
import pandas as pd
# initializing the dataframe
info = pd.DataFrame([[17, 62, 35], [25, 36, 54], [42, 20, 15], [48, 62, 76]],
                    columns=['x', 'y', 'z'])
print('Data Frame\n----------\n', info)
# convert the dataframe to a numpy array
arr = info.to_numpy()
print('Numpy Array\n----------\n', arr)

Output:
Data Frame
----------
     x   y   z
0  17  62  35
1  25  36  54
2  42  20  15
3  48  62  76
Numpy Array
----------
 [[17 62 35]
 [25 36 54]
 [42 20 15]
 [48 62 76]]
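
The two parameters described above can be tried on the same data. The following is a minimal sketch, not part of the original text; the variable names arr_float and arr_copy are chosen here only for illustration.

```python
import numpy as np
import pandas as pd

info = pd.DataFrame([[17, 62, 35], [25, 36, 54], [42, 20, 15], [48, 62, 76]],
                    columns=['x', 'y', 'z'])

# dtype: ask for a specific element type while converting.
arr_float = info.to_numpy(dtype='float64')

# copy=True: the result is an independent array, so editing it
# leaves the original DataFrame unchanged.
arr_copy = info.to_numpy(copy=True)
arr_copy[0, 0] = -1

print(arr_float.dtype)    # float64
print(arr_copy.shape)     # (4, 3)
print(info.loc[0, 'x'])   # 17
```

Requesting a copy is the safer choice when the array will be modified afterwards.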

Finding Maximum and Minimum Values

On a column containing numerical data we can apply the max() method to find the maximum value and the min() method to find the minimum value of that column.

Example: To get the maximum and minimum salary


>>>data['Salary'].max()
Output:
2500

>>>data['Salary'].min()
Output:
1500
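
As a self-contained sketch, the employee data used throughout this unit can be built by hand (the values are taken from the tables in this unit) and queried for its extreme values. The use of idxmax() and idxmin() to find the matching rows is an addition for illustration.

```python
import pandas as pd

# Employee data as shown in the examples of this unit.
data = pd.DataFrame({
    'Eno': [101, 102, 103, 104, 105],
    'Name': ['amit', 'deep', 'jay', 'nency', 'shree'],
    'Salary': [2000, 2500, 2150, 2100, 1500],
    'dept': ['account', 'sales', 'design', 'account', 'sales'],
})

highest = data['Salary'].max()
lowest = data['Salary'].min()

# idxmax()/idxmin() return the row label where the extreme value occurs.
top_name = data.loc[data['Salary'].idxmax(), 'Name']
low_name = data.loc[data['Salary'].idxmin(), 'Name']

print(highest, top_name)   # 2500 deep
print(lowest, low_name)    # 1500 shree
```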

DataFrame.loc[]
The DataFrame.loc[] is used to retrieve a group of rows and columns by labels or a boolean array in the DataFrame. It takes only index labels, and if they exist in the caller DataFrame, it returns the rows, columns, or DataFrame. If a requested label does not exist, a KeyError is raised.

The DataFrame.loc[] is label based, but may also be used with a boolean array.

The allowed inputs for .loc[] are:

o Single label, e.g. 7 or 'a'. Here, 7 is interpreted as the label of the index.
o List or array of labels, e.g. ['x', 'y', 'z'].
o Slice object with labels, e.g. 'x':'f'.
o A boolean array of the same length, e.g. [True, True, False].
o Callable function with one argument.
Syntax:
pandas.DataFrame.loc[]

Example:
import pandas as pd
# Creating the DataFrame
info = pd.DataFrame({'Age': [32, 41, 44, 38, 33],
                     'Name': ['chetan', 'mayank', 'deep', 'raj', 'rahul']})
# Create the index
index_ = ['Row_1', 'Row_2', 'Row_3', 'Row_4', 'Row_5']
# Set the index
info.index = index_
# return the value
final = info.loc['Row_2', 'Name']
# Print the result
print(final)

Output:
mayank
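
Each kind of input allowed by .loc[] can be tried on the same DataFrame. The sketch below is an illustration only; the variable names are chosen here and do not come from the original text.

```python
import pandas as pd

info = pd.DataFrame({'Age': [32, 41, 44, 38, 33],
                     'Name': ['chetan', 'mayank', 'deep', 'raj', 'rahul']},
                    index=['Row_1', 'Row_2', 'Row_3', 'Row_4', 'Row_5'])

single = info.loc['Row_2']                            # single label -> one row as a Series
several = info.loc[['Row_1', 'Row_3'], 'Name']        # list of labels
sliced = info.loc['Row_2':'Row_4', 'Age']             # label slice, both ends included
masked = info.loc[[True, False, False, False, True]]  # boolean list, one flag per row
older = info.loc[lambda df: df['Age'] > 40]           # callable with one argument

print(single['Name'])          # mayank
print(list(several))           # ['chetan', 'deep']
print(list(sliced))            # [41, 44, 38]
print(list(masked.index))      # ['Row_1', 'Row_5']
print(list(older['Name']))     # ['mayank', 'deep']
```

Note that a label slice includes its end label, unlike ordinary Python slicing.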

DataFrame.iloc[]

The DataFrame.iloc[] is used when the index label of the DataFrame is other than the numeric series 0, 1, 2, ..., n, or when the user does not know the index label.

We can extract the rows by using an imaginary index position which is not visible in the DataFrame. It is integer-position based (from 0 to length-1 of the axis), but may also be used with a boolean array.

Syntax:
pandas.DataFrame.iloc[]
It raises an IndexError if a requested index is out of bounds, except for slice indexers, which allow out-of-bounds indexing.

Example:
import numpy as np
import pandas as pd
mydict = [{'p': 2, 'q': 3, 'r': 4, 's': 5},
          {'p': 200, 'q': 300, 'r': 400, 's': 500},
          {'p': 2000, 'q': 3000, 'r': 4000, 's': 5000}]
df = pd.DataFrame(mydict)
print(type(df.iloc[0]))
print(df.iloc[0])

Output:
<class 'pandas.core.series.Series'>
p    2
q    3
r    4
s    5
Name: 0, dtype: int64
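
The difference between slice indexers and single positions at the edge of the data can be sketched as follows. This is an illustrative example; the variable names are chosen here.

```python
import pandas as pd

mydict = [{'p': 2, 'q': 3, 'r': 4, 's': 5},
          {'p': 200, 'q': 300, 'r': 400, 's': 500},
          {'p': 2000, 'q': 3000, 'r': 4000, 's': 5000}]
df = pd.DataFrame(mydict)

first_two = df.iloc[0:2]   # rows at positions 0 and 1
cell = df.iloc[1, 2]       # row position 1, column position 2
clipped = df.iloc[1:10]    # a slice past the end is clipped, no error

try:
    df.iloc[10]            # a single out-of-bounds position raises IndexError
    out_of_bounds = False
except IndexError:
    out_of_bounds = True

print(len(first_two))      # 2
print(cell)                # 400
print(len(clipped))        # 2
print(out_of_bounds)       # True
```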

Retrieving Statistical Information
The describe() method is used for calculating statistical data such as percentiles, mean and std of the numerical values of a Series or DataFrame. It analyzes both numeric and object series and also DataFrame column sets of mixed data types. For a DataFrame of mixed types, only the numeric columns are summarized by default.

Example:
>>>data.describe()

Output:
              Eno       Salary
count    5.000000     5.000000
mean   103.000000  2050.000000
std      1.581139   360.555128
min    101.000000  1500.000000
25%    102.000000  2000.000000
50%    103.000000  2100.000000
75%    104.000000  2150.000000
max    105.000000  2500.000000
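
The result of describe() is itself a DataFrame, so single statistics can be read from it with .loc[]. The sketch below rebuilds the employee data by hand; the include='all' call is an addition for illustration that also summarizes the text columns.

```python
import pandas as pd

data = pd.DataFrame({
    'Eno': [101, 102, 103, 104, 105],
    'Name': ['amit', 'deep', 'jay', 'nency', 'shree'],
    'Salary': [2000, 2500, 2150, 2100, 1500],
    'dept': ['account', 'sales', 'design', 'account', 'sales'],
})

stats = data.describe()                 # numeric columns only
mean_salary = stats.loc['mean', 'Salary']
median_salary = stats.loc['50%', 'Salary']

full = data.describe(include='all')     # numeric and object columns
unique_depts = full.loc['unique', 'dept']

print(mean_salary)     # 2050.0
print(median_salary)   # 2100.0
print(unique_depts)    # 3
```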

Performing Queries on Data


We can retrieve rows based on a query. The query should be given as a subscript to the data frame object. For example, to retrieve all employees whose salary is greater than 2000 rs:
>>>data[data.Salary>2000]
Output:
   Eno   Name  Salary     dept
1  102   deep    2500    sales
2  103    jay    2150   design
3  104  nency    2100  account
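
Conditions can also be combined. The following sketch, built on a hand-made copy of the employee data, shows one possible way; the use of & with parentheses and of the query() method is an addition for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'Eno': [101, 102, 103, 104, 105],
    'Name': ['amit', 'deep', 'jay', 'nency', 'shree'],
    'Salary': [2000, 2500, 2150, 2100, 1500],
    'dept': ['account', 'sales', 'design', 'account', 'sales'],
})

# The comparison produces a boolean Series that selects the matching rows.
high_paid = data[data['Salary'] > 2000]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
sales_high = data[(data['Salary'] > 2000) & (data['dept'] == 'sales')]

# The same filter written as a query string.
same = data.query('Salary > 2000')

print(list(high_paid['Name']))    # ['deep', 'jay', 'nency']
print(list(sales_high['Name']))   # ['deep']
```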

Sorting the Data


Pandas sort_values() function sorts a data frame in ascending or descending order of the passed column. It differs from Python's built-in sorted() function, which cannot sort a data frame and cannot select a particular column to sort by. It returns a sorted DataFrame with the same dimensions as the caller DataFrame.
Syntax:
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

Parameters:
by: Single column name or list of column names to sort the DataFrame by.
axis: 0 or 'index' for rows and 1 or 'columns' for columns.
ascending: Boolean value which sorts the DataFrame in ascending order if True.
inplace: Boolean value. Makes the changes in the passed DataFrame itself if True.
kind: String naming the algorithm used to sort the DataFrame; it can be 'quicksort', 'mergesort' or 'heapsort'.
na_position: Takes the string 'last' or 'first' to set the position of null values. Default is 'last'.

Example:
Sorting the data frame by name
data.sort_values("Name", axis=0, ascending=True, inplace=True, na_position='last')
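
A short sketch of sorting the employee data follows. It rebuilds the data by hand and, unlike the example above, leaves inplace at its default so that the original DataFrame stays untouched; the variable names are chosen here.

```python
import pandas as pd

data = pd.DataFrame({
    'Eno': [101, 102, 103, 104, 105],
    'Name': ['amit', 'deep', 'jay', 'nency', 'shree'],
    'Salary': [2000, 2500, 2150, 2100, 1500],
    'dept': ['account', 'sales', 'design', 'account', 'sales'],
})

# Highest salary first; a new DataFrame is returned.
by_salary = data.sort_values('Salary', ascending=False)

# Sort by department A-Z, then by salary from high to low inside each department.
by_dept = data.sort_values(['dept', 'Salary'], ascending=[True, False])

print(list(by_salary['Name']))   # ['deep', 'jay', 'nency', 'amit', 'shree']
print(list(by_dept['Name']))     # ['nency', 'amit', 'jay', 'deep', 'shree']
```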
Handling Missing Data
The most time consuming part of a data science project is data cleaning and
preparation. However, there are many powerful tools to expedite this process. One of
them is Pandas which is a widely used data analysis library for Python. Handling
missing values is an essential part of data cleaning and preparation process because
almost all data in real life comes with some missing values.

     A     B      C       D
1    Eno   Name   Salary  dept
2    101   amit   2000    account
3    102          2500    sales
4    103   jay    2150
5    104   nency          account
6    105   shree  1500    sales

When we convert this data into a DataFrame, the missing data in the above file is represented by NaN (Not a Number). NaN is the default marker for a missing value.

Example:
import pandas as pd
data= pd.read_excel('demol.xlsx')
print(data)

Output:
   Eno   Name  Salary     dept
0  101   amit  2000.0  account
1  102    NaN  2500.0    sales
2  103    jay  2150.0      NaN
3  104  nency     NaN  account
4  105  shree  1500.0    sales

We can use the fillna() method to replace the NA or NaN values by a specified value.

>>>data.fillna(0)

Output:
   Eno   Name  Salary     dept
0  101   amit  2000.0  account
1  102      0  2500.0    sales
2  103    jay  2150.0        0
3  104  nency     0.0  account
4  105  shree  1500.0    sales
But this is not so useful, as it fills every type of column with zero. We can fill each column with a different value by passing the column names and the value to be used to fill each column.

Example:
import pandas as pd
data = pd.read_excel('demol.xlsx')
print(data.fillna({'Name':'missing', 'Salary':0 ,'dept' : 'admin'}))

Output:
Eno Name Salary dept
0 101 amit 2000.0 account
1 102 missing 2500.0 sales
2 103 jay 2150.0 admin
3 104 nency 0.0 account
4 105 shree 1500.0 sales

In the above code we fill missing Name values with the word 'missing', missing Salary values with 0 and missing dept values with 'admin'.

If we do not want missing data, we can remove all rows containing missing data by using the dropna() method as:
import pandas as pd
data = pd.read_excel('demol.xlsx')
print(data.dropna())
Output:

Eno Name Salary dept


0 101 amit 2000.0 account
4 105 shree 1500.0 sales
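
The same steps can be tried without an Excel file. The sketch below builds the table with its gaps by hand using np.nan; counting gaps with isna(), filling with the column mean, and the subset argument of dropna() are additions for illustration.

```python
import numpy as np
import pandas as pd

# Same table as the spreadsheet above, with np.nan marking the empty cells.
data = pd.DataFrame({
    'Eno': [101, 102, 103, 104, 105],
    'Name': ['amit', np.nan, 'jay', 'nency', 'shree'],
    'Salary': [2000, 2500, 2150, np.nan, 1500],
    'dept': ['account', 'sales', np.nan, 'account', 'sales'],
})

# Count the missing values in each column.
missing_per_column = data.isna().sum()

# Fill a numeric gap with the mean of the known values instead of 0.
filled = data.fillna({'Salary': data['Salary'].mean()})

# Drop a row only when its Salary is missing.
has_salary = data.dropna(subset=['Salary'])

print(missing_per_column['Name'])   # 1
print(filled.loc[3, 'Salary'])      # 2037.5
print(len(has_salary))              # 4
print(len(data.dropna()))           # 2
```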
In this way, filling the necessary data or eliminating the missing data is called
"data cleansing".

Python NumPy
NumPy stands for Numerical Python. It is a Python package for the computation and processing of multidimensional and single dimensional array elements. Travis Oliphant created the NumPy package in 2005 by incorporating the features of the competing module Numarray into the ancestor module Numeric. It is an extension module of Python which is mostly written in C. It provides various functions which are capable of performing numeric computations at high speed.
NumPy provides various powerful data structures, implementing multi-dimensional arrays and matrices. These data structures are used for optimal computations regarding arrays and matrices. Using NumPy, mathematical and logical operations on arrays can be performed.

NumPy has the following advantages for data analysis.

• NumPy performs array-oriented computing.
• It efficiently implements multidimensional arrays.
• It performs scientific computations.
• It is capable of performing Fourier Transform and reshaping the data stored in multidimensional arrays.
• NumPy provides in-built functions for linear algebra and random number generation.
Nowadays, NumPy in combination with SciPy and Matplotlib is used as a replacement for MATLAB, as Python is a more complete and easier programming language than MATLAB.

NumPy Environment Setup


NumPy doesn't come bundled with Python. We have to install it using the Python pip installer. Execute the following command.
$ pip install numpy

The best way to enable NumPy is to use an installable binary package specific to your operating system. These binaries contain the full SciPy stack (inclusive of NumPy, SciPy, matplotlib, IPython, SymPy and nose packages along with core Python).

Windows
Anaconda (from https://www.anaconda.com/products/individual) is a free Python distribution for the SciPy stack. It is also available for Linux and Mac.

Linux
In Linux, different package managers are used to install the SciPy stack. The package managers are specific to the different distributions of Linux.
Execute the following command on the terminal (for Ubuntu).

$ sudo apt-get install python-numpy python-scipy python-matplotlib ipython ipython-notebook python-pandas python-sympy python-nose

NumPy Ndarray
Ndarray is the n-dimensional array object defined in numpy which stores a collection of elements of the same type. In other words, we can define a ndarray as a collection of data type (dtype) objects. The elements of a ndarray object can be accessed by using 0 based indexing. Each element of the array object takes the same size in memory.

Creating a ndarray object


The ndarray object can be created by using the array routine of the numpy module. For this purpose, we need to import numpy.
Syntax:
numpy.array(object, dtype=None, copy=True, order=None, subok=False, ndmin=0)

object: It represents the collection object. It can be a list, a tuple, or another (nested) sequence.

dtype: We can change the data type of the array elements by setting this option to the specified type. The default is None.

copy: It is optional. By default, it is True, which means the object is copied.

order: It specifies the memory layout of the array. It can be C (row-major order), F (column-major order), or A (any).

subok: The returned array will be a base class array by default. We can change this to make subclasses pass through by setting this option to True.

ndmin: It represents the minimum number of dimensions of the resultant array.

Example:
import numpy as np
a = np.array([10, 12, 13])
print(a)

Output:
[10 12 13]

Example: two dimensional array

import numpy as np
a = np.array([[11, 22], [43, 54]])
print(a)
Output:
[[11 22]
 [43 54]]
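
The optional parameters of numpy.array() can be tried in the same way. The following is a small sketch for illustration; the variable names are chosen here.

```python
import numpy as np

# dtype: store the elements as floating-point numbers.
as_float = np.array([10, 12, 13], dtype=float)

# ndmin: force at least two dimensions.
two_d = np.array([10, 12, 13], ndmin=2)

# order='F': store the 2D array column by column in memory.
fortran = np.array([[11, 22], [43, 54]], order='F')

print(as_float)                        # [10. 12. 13.]
print(two_d.shape)                     # (1, 3)
print(fortran.flags['F_CONTIGUOUS'])   # True
```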

Finding the dimensions of the Array


The ndim attribute can be used to find the number of dimensions of the array.

Example:
import numpy as np
arr = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [9, 10, 11, 23]])
print(arr.ndim)
Output:
2
Finding the shape and size of the array
To get the shape and size of the array, the size and shape attributes associated with the numpy array are used.

Example:
import numpy as np
a = np.array([[1, 2, 3, 4, 5]])
print("Array Size:", a.size)
print("Shape:", a.shape)

Output:
Array Size: 5
Shape: (1, 5)
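
The three attributes are related: size is the product of the entries of shape, and ndim is the length of shape. A short sketch for illustration:

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [9, 10, 11, 23]])

print(arr.ndim)    # 2
print(arr.shape)   # (3, 4)
print(arr.size)    # 12

# size equals rows * columns, and ndim equals the number of entries in shape.
rows, cols = arr.shape
print(rows * cols == arr.size)      # True
print(len(arr.shape) == arr.ndim)   # True
```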
Numpy Methods

numpy.mean()
The sum of elements along an axis divided by the number of elements is known as the arithmetic mean. The numpy.mean() function is used to compute the arithmetic mean along the specified axis. This function returns the average of the array elements. By default, the average is taken over the flattened array; otherwise it is taken along the specified axis. For integer inputs, float64 is used for the intermediate and return values.

Syntax:
numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)

These are the parameters of the numpy.mean() function:

a: (array_like) This parameter defines the source array containing the elements whose mean is desired.
axis: (optional) This parameter defines the axis along which the means are computed. By default, the mean is computed over the flattened array.
dtype: (optional) This parameter is used to define the data-type used in computing the mean. For integer inputs, the default is float64, and for floating-point inputs, it is the same as the input dtype.
out: (optional) This parameter defines an alternative output array in which the result will be placed.
keepdims: (bool, optional) When the value is True, the reduced axis is left in the result as a dimension with size one.

Example:

import numpy
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
x = numpy.mean(speed)
print(x)

Output:
89.76923076923077

Example:
import numpy as np
a = np.array([[1, 2], [3, 4]])
b = np.mean(a)
print(b)
Output:
2.5

We have imported numpy with the alias name np. We have created an array 'a' using the np.array() function and passed the array 'a' to the function mean().
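
The axis and keepdims parameters can be seen on the same array. The sketch below is for illustration; the variable names are chosen here.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

col_means = np.mean(a, axis=0)   # mean of each column
row_means = np.mean(a, axis=1)   # mean of each row

# keepdims=True keeps the reduced axis with size one, so the result
# can be subtracted from the original array row by row.
kept = np.mean(a, axis=1, keepdims=True)
centered = a - kept

print(col_means)    # [2. 3.]
print(row_means)    # [1.5 3.5]
print(kept.shape)   # (2, 1)
```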

numpy.median()
Median is defined as the value separating the higher half of a data sample from the lower half. The function numpy.median() is used to calculate the median of multi-dimensional or one-dimensional arrays.

Example:
import numpy
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
x = numpy.median(speed)
print(x)
Output:
87.0

Example:
import numpy as np
a = np.array([[10, 7, 4], [3, 2, 1]])
print("\nMedian of array along axis 0:", np.median(a, 0))

Output:
Median of array along axis 0: [6.5 4.5 2.5]
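
For comparison, the same array can be reduced over all elements or along axis 1. When the number of values is even, the median is the average of the two middle values. A small sketch for illustration:

```python
import numpy as np

a = np.array([[10, 7, 4], [3, 2, 1]])

# Six values in total: sorted they are 1, 2, 3, 4, 7, 10,
# so the median is the average of 3 and 4.
overall = np.median(a)

# Median of each row.
per_row = np.median(a, axis=1)

print(overall)   # 3.5
print(per_row)   # [7. 2.]
```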

mode()
The mode is the value that occurs with the greatest frequency within a data set. The SciPy module has a method for the mode: the scipy.stats.mode() function calculates the mode of the array elements along the specified axis of the array (list in Python).
Syntax:
scipy.stats.mode(array, axis=0)
These are the parameters of the scipy.stats.mode() function:
array: Input array or object having the elements to calculate the mode.
axis: Axis along which the mode is to be computed. By default axis = 0.
Returns: Modal values of the array elements based on the set parameters.
Example:
import numpy as np
from scipy import stats
dataset = [1, 1, 2, 3, 4, 6, 18]
# mode value
mode = stats.mode(dataset)
print("Mode: ", mode)

Output:

Mode: ModeResult(mode=array([1]), count=array([2]))

A ModeResult is returned that has 2 attributes. The first attribute, mode, is the number that is the mode of the data set. The second attribute, count, is the number of times it occurs in the data set. Newer SciPy releases may print these as plain scalars instead of one-element arrays.
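
If SciPy is not installed, the mode can also be found with NumPy alone. The sketch below is an alternative technique, not the scipy.stats.mode() function itself: it counts each distinct value with np.unique() and picks the most frequent one. The data set values are assumed for illustration.

```python
import numpy as np

dataset = [1, 1, 2, 3, 4, 6, 18]   # assumed sample values

# np.unique returns the sorted distinct values and how often each occurs.
values, counts = np.unique(dataset, return_counts=True)

# argmax picks the first largest count, so ties go to the smallest value.
mode_value = values[np.argmax(counts)]
mode_count = counts.max()

print(mode_value, mode_count)   # 1 2
```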

Standard Deviation (numpy.std())

The method std() is used to compute the standard deviation along the specified axis. This function returns the standard deviation of the array elements.

Syntax:
numpy.std(a, axis=None, dtype=None, out=None)

These are the following parameters:

a: [array_like] Input array.
axis: [int or tuple of ints] Axis along which we want to calculate the standard deviation.
out: [ndarray, optional] Different array in which we want to place the result.
dtype: [data-type, optional] Type we desire while computing the standard deviation.

Example: one dimensional array

import numpy as np
arr = [20, 2, 7, 1, 34]
print("arr : ", arr)
print("std of arr : ", np.std(arr))
print("\nMore precision with float32")
print("std of arr : ", np.std(arr, dtype=np.float32))
print("\nMore accuracy with float64")
print("std of arr : ", np.std(arr, dtype=np.float64))

Output:
arr :  [20, 2, 7, 1, 34]
std of arr :  12.576167937809991

More precision with float32
std of arr :  12.576168

More accuracy with float64
std of arr :  12.576167937809991

Example: two dimensional array


import numpy as np
# 2D array
arr= [[2, 2, 2, 2, 2],
[15, 6, 27, 8, 2],
[23, 2, 54, 1, 2, ],
[11, 44, 34, 7, 2]]
# std of the flattened array
print("\nstd of arr, axis= None : ", np.std(arr))
# std along the axis = 0
print("\nstd of arr, axis= 0 : ", np.std(arr, axis= 0))
# std along the axis = 1
print("\nstd of arr, axis= 1 : ", np.std(arr, axis= 1))

Output:
std of arr, axis= None :  15.3668474320532

std of arr, axis= 0 :  [ 7.56224173 17.68473918 18.59267329  3.04138127  0.        ]

std of arr, axis= 1 :  [ 0.          8.7772433  20.53874388 16.40243884]
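
The value returned by np.std() can be checked against the definition: the square root of the average squared deviation from the mean. The sketch below does this by hand; the ddof=1 call, which divides by n - 1 to give the sample standard deviation, is an addition for illustration.

```python
import numpy as np

arr = np.array([20, 2, 7, 1, 34])

# Standard deviation from its definition.
mean = arr.mean()                                  # 12.8
manual_std = np.sqrt(((arr - mean) ** 2).mean())

# ddof=1 divides by (n - 1) instead of n.
sample_std = np.std(arr, ddof=1)

print(np.isclose(manual_std, np.std(arr)))   # True
print(sample_std > manual_std)               # True
```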


Variance (numpy.var())

The variance is the average of the squared deviations from the mean. numpy.var(arr, axis=None) is used to compute the variance of the given data (array elements) along the specified axis (if any).

Syntax:
numpy.var(a, axis=None, dtype=None, out=None)

These are the following parameters:

a: [array_like] Input array.
axis: [int or tuple of ints] Axis along which we want to calculate the variance.
out: [ndarray, optional] Different array in which we want to place the result.
dtype: [data-type, optional] Type we desire while computing the variance.

Example:

import numpy as np
a = np.array([[1, 2], [3, 4]])
print(np.var(a))
Output:
1.25

Example:
import numpy as np

# 2D array
arr= [[2, 2, 2, 2, 2],
[15, 6, 27, 8, 2],
[23, 2, 54, 1, 2, ],
[11, 44, 34, 7, 2]]
# var of the flattened array
print("\nvar of arr, axis= None: ", np.var(arr))
# var along the axis = 0
print("\nvar of arr, axis = 0: ", np.var(arr, axis = 0))

# var along the axis= 1


print("\nvar of arr, axis = 1 : ", np.var(arr, axis= 1))

Output:

var of arr, axis= None:  236.14000000000004

var of arr, axis = 0:  [ 57.1875 312.75   345.6875   9.25     0.    ]

var of arr, axis = 1 :  [  0.    77.04 421.84 269.04]

For floating-point input, the variance is computed using the same precision the input has. Depending on the input data, this can cause the results to be inaccurate, especially for float32. Specifying a higher-accuracy accumulator using the dtype keyword can alleviate this issue.
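
The effect of the accumulator type can be sketched with a large float32 array. The array size and values below are chosen only for illustration; the exact float32 result may differ between NumPy versions, so only the float64 result is compared with the expected value.

```python
import numpy as np

# Half of the values are 1.0 and half are 0.1, stored as float32.
a = np.zeros((2, 512 * 512), dtype=np.float32)
a[0, :] = 1.0
a[1, :] = 0.1

var32 = np.var(a)                     # accumulated in float32
var64 = np.var(a, dtype=np.float64)   # accumulated in float64

# The exact variance of these two values is 0.45 ** 2 = 0.2025.
print(var32, var64)
print(abs(var64 - 0.2025) < 1e-6)     # True
```

The same dtype keyword is accepted by numpy.std() and numpy.mean().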

Exercise

Short Question
1. Explain close() method.
2. List file modes.
3. Define CSV module.
4. Define the Python pandas.
5. Explain DataFrame in Pandas.
6. Define Series in Pandas.
7. How can we calculate the standard deviation from the Series?
Long Question
1. Explain open() in Python to work with files.
2. Explain reader(), writer(), writerows(), DictReader(), DictWriter() with example.
3. Explain the different types of Data Structures in Pandas.
4. Define the different ways a DataFrame can be created in pandas.
5. How to extract specific attributes and rows from a DataFrame?
6. Explain the following functions of DataFrame: head, tail, loc, iloc, value, to_numpy(), describe().
7. Explain NumPy methods: mean(), median(), mode(), std(), var().
