DHP Unit - 4 Part2
Output:
Eno Name
0 101 amit
1 102 deep
2 103 jay
3 104 nency
4 105 shree
( f ) to_numpy()
Syntax:
DataFrame.to_numpy(dtype=None, copy=False)
Parameters
• dtype: An optional parameter that passes the dtype to numpy.asarray().
• copy: A boolean value with the default value False. When True, it
ensures that the returned value is not a view on another array.
Returns
It returns the numpy.ndarray as an output.
Example:
import pandas as pd
#initializing the dataframe
info= pd.DataFrame([[17, 62, 35],[25, 36, 54],[42, 20, 15],[48, 62, 76]],
columns=['x', 'y', 'z'])
print(' Data Frame\n----------\n', info)
#convert the dataframe to a numpy array
arr= info.to_numpy()
print('\nNumpy Array\n----------\n', arr)
Unit-4 Python Interaction with text and CSV
Output:
Data Frame
x y z
0 17 62 35
1 25 36 54
2 42 20 15
3 48 62 76
Numpy Array
[[17 62 35]
[25 36 54]
[42 20 15]
[48 62 76]]
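The two parameters described above can be tried on the same data. This is a minimal sketch; the variable names arr_float and arr_copy are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

info = pd.DataFrame([[17, 62, 35], [25, 36, 54], [42, 20, 15], [48, 62, 76]],
                    columns=['x', 'y', 'z'])

# dtype converts the values while the array is built
arr_float = info.to_numpy(dtype='float64')
print(arr_float.dtype)

# copy=True requests an array that does not share memory with the DataFrame
arr_copy = info.to_numpy(copy=True)
arr_copy[0, 0] = -1        # changing the copy ...
print(info.loc[0, 'x'])    # ... leaves the value in the DataFrame untouched
```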
>>>data['Salary'].min()
Output:
1500
DataFrame.loc[]
DataFrame.loc[] is used to retrieve a group of rows and columns by labels or a
boolean array in the DataFrame. It takes only index labels, and if a label exists
in the caller DataFrame, it returns the rows, columns, or DataFrame.
DataFrame.loc[] is label based but may also be used with a boolean array.
Example:
import pandas as pd
# Creating the DataFrame
info = pd.DataFrame({'Age': [32, 41, 44, 38, 33]
...
Output:
mayank
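Label-based selection with loc[] can be sketched as follows. The names, ages and cities here are made up for illustration and are not taken from the example above.

```python
import pandas as pd

# Hypothetical data: index labels are names instead of 0, 1, 2
people = pd.DataFrame({'Age': [32, 41, 44],
                       'City': ['Surat', 'Rajkot', 'Vadodara']},
                      index=['amit', 'deep', 'jay'])

row = people.loc['deep']                 # one row by label, returned as a Series
cell = people.loc['jay', 'Age']          # one value by row label and column label
older = people.loc[people['Age'] > 40]   # boolean array, returned as a DataFrame

print(row)
print(cell)
print(older)
```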
DataFrame.iloc[]
The DataFrame.iloc[] is used when the index label of the DataFrame is other than
the numeric series of 0,1,2,....,n or in the case when the user does not know
the index label.
We can extract the rows by using an imaginary index position which is not visible in
the DataFrame. It is an integer-based position (from 0 to length-1 of the axis), but may
also be used with a boolean array.
Syntax:
pandas.DataFrame.iloc[]
It raises an IndexError if the requested index is out of bounds, except for slice
indexers, which allow out-of-bounds indexing.
Example:
import numpy as np
import pandas as pd
mydict = [{'p': 2, 'q': 3, 'r': 4, 's': 5},
          {'p': 200, 'q': 300, 'r': 400, 's': 500},
          {'p': 2000, 'q': 3000, 'r': 4000, 's': 5000}]
df = pd.DataFrame(mydict)
print(type(df.iloc[0]))
print(df.iloc[0])
Output:
<class 'pandas.core.series.Series'>
p    2
q    3
r    4
s    5
Name: 0, dtype: int64
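The difference between a single out-of-bounds position and an out-of-bounds slice can be checked with a short sketch. The DataFrame is a cut-down version of the one above, and the names tail_rows and raised are illustrative.

```python
import pandas as pd

df = pd.DataFrame([{'p': 2, 'q': 3},
                   {'p': 200, 'q': 300},
                   {'p': 2000, 'q': 3000}])

# A slice may run past the end; it is simply cut at the last row
tail_rows = df.iloc[1:10]
print(len(tail_rows))

# A single position that does not exist raises IndexError
try:
    df.iloc[10]
    raised = False
except IndexError:
    raised = True
print(raised)
```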
Retrieving Statistical Information
The describe() method is used for calculating some statistical data like percentile,
mean and std of the numerical values of the Series or DataFrame. It analyzes both
numeric and object series and also the DataFrame column sets of mixed data types.
Example:
>>>data.describe()
Output:
Eno Salary
count 5.000000 5.000000
mean 103.000000 2050.000000
...
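A small sketch of describe() on a DataFrame built in memory. The Salary values below are assumed for illustration, since the original data file is not shown here.

```python
import pandas as pd

data = pd.DataFrame({'Eno': [101, 102, 103, 104, 105],
                     'Salary': [2000, 2500, 2150, 2100, 1500]})  # assumed values

summary = data.describe()   # count, mean, std, min, quartiles and max per column
print(summary)

# Individual statistics can be read by row label and column name
print(summary.loc['mean', 'Salary'])
```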
Parameters:
by: Single column name or list of column names to sort the DataFrame by.
axis: 0 or 'index' for rows and 1 or 'columns' for columns.
ascending: Boolean value which sorts the DataFrame in ascending order if True.
inplace: Boolean value. Makes the changes in the passed DataFrame itself if True.
kind: String which can have three inputs ('quicksort', 'mergesort' or 'heapsort') for the
algorithm used to sort the DataFrame.
na_position: Takes one of two strings, 'last' or 'first', to set the position of null values.
Default is 'last'.
Example:
# sorting data frame by name
data.sort_values("Name", axis=0, ascending=True, inplace=True,
                 na_position='last')
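The effect of na_position and of leaving inplace at its default can be sketched on a tiny DataFrame. The rows below are made up for illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'Eno': [103, 101, 102],
                     'Name': ['jay', np.nan, 'amit']})   # made-up rows

# Without inplace=True a new, sorted DataFrame is returned
by_name = data.sort_values('Name', ascending=True, na_position='first')

print(by_name['Eno'].tolist())   # row with the missing name comes first
print(data['Eno'].tolist())      # the original order is unchanged
```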
Handling Missing Data
The most time-consuming part of a data science project is data cleaning and
preparation. However, there are many powerful tools to expedite this process. One of
them is Pandas, which is a widely used data analysis library for Python. Handling
missing values is an essential part of the data cleaning and preparation process because
almost all data in real life comes with some missing values.
     A     B       C       D
1    Eno   Name    Salary  dept
2    101   amit    2000    account
3    102           2500    sales
4    103   jay     2150
5    104   nency           account
6    105   shree   1500    sales
Now we convert the data into a DataFrame. The missing data in the above file is represented by
NaN (Not a Number); NaN is the default marker for a missing value.
Example:
import pandas as pd
data= pd.read_excel('demol.xlsx')
print(data)
Output:
   Eno   Name  Salary     dept
0  101   amit  2000.0  account
1  102      0  2500.0    sales
2  103    jay  2150.0        0
3  104  nency     0.0  account
4  105  shree  1500.0    sales
But filling every missing value with zero, as in the output above, is not so useful,
because it fills every type of column with zero. We can fill each column with a
different value by passing the column names and the value to be used to fill in
each column.
Example:
import pandas as pd
data= pd.read_excel('demol.xlsx')
print(data.fillna({'Name':'missing', 'Salary':0 ,'dept' : 'admin'}))
Output:
Eno Name Salary dept
0 101 amit 2000.0 account
1 102 missing 2500.0 sales
2 103 jay 2150.0 admin
3 104 nency 0.0 account
4 105 shree 1500.0 sales
In the above code we fill the missing value of Name with the word 'missing', Salary with 0
and dept with 'admin'.
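The same steps can be tried without the Excel file. The sketch below builds the table in memory (an assumption made so that the code runs on its own) and counts the missing values before filling them.

```python
import numpy as np
import pandas as pd

# In-memory stand-in for the spreadsheet; np.nan marks the empty cells
data = pd.DataFrame({
    'Eno': [101, 102, 103, 104, 105],
    'Name': ['amit', np.nan, 'jay', 'nency', 'shree'],
    'Salary': [2000, 2500, 2150, np.nan, 1500],
    'dept': ['account', 'sales', np.nan, 'account', 'sales'],
})

print(data.isna().sum())   # number of missing values in each column

# One fill value per column, passed as a dictionary
filled = data.fillna({'Name': 'missing', 'Salary': 0, 'dept': 'admin'})
print(filled)
```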
If we do not want missing data, we can remove every row that contains a missing value by
using the dropna() method:
import pandas as pd
data= pd.read_excel('demol.xlsx')
print(data.dropna())
Output:
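As a rough check of what dropna() keeps, the sketch below rebuilds the five-row table in memory (an assumption, so that no Excel file is needed). The subset argument is shown as one optional variation.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'Eno': [101, 102, 103, 104, 105],
    'Name': ['amit', np.nan, 'jay', 'nency', 'shree'],
    'Salary': [2000, 2500, 2150, np.nan, 1500],
    'dept': ['account', 'sales', np.nan, 'account', 'sales'],
})

# Rows with at least one missing value are dropped
complete = data.dropna()
print(complete)

# Only rows with a missing Salary are dropped
with_salary = data.dropna(subset=['Salary'])
print(len(with_salary))
```

Only the rows for amit and shree have no missing value, so those are the rows that remain in complete.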
Python NumPy
NumPy stands for Numerical Python. It is a Python package for the computation and
processing of multidimensional and single-dimensional array elements. Travis
Oliphant created the NumPy package in 2005 by incorporating the features of the module
Numarray into its ancestor module Numeric. It is an extension module of Python
There are the following advantages of using NumPy for data analysis.
The best way to enable NumPy is to use an installable binary package specific to your
operating system. These binaries contain the full SciPy stack (inclusive of NumPy, SciPy,
matplotlib, IPython, SymPy and nose packages along with core Python).
Windows
Anaconda (from https://www.anaconda.com/products/individual) is a free Python
distribution for the SciPy stack. It is also available for Linux and Mac.
Linux
In Linux, the different package managers are used to install the SciPy stack. The
package managers are specific to the different distributions of Linux. Let's look at each
one of them.
Execute the following command on the terminal (for Ubuntu).
NumPy Ndarray
Ndarray is the n-dimensional array object defined in numpy which stores a
collection of elements of a similar type. In other words, we can define a ndarray as
a collection of data type (dtype) objects. The ndarray object can be accessed by
using 0-based indexing. Each element of the array object occupies the same size in
memory.
Object: It represents the collection object. It can be a list, tuple, dictionary, set, etc.
Dtype: We can change the data type of the array elements by setting this option
to the specified type. The default is None.
Order: It sets the memory layout of the array. Common values are C (row-major
order), F (column-major order), or A (any).
Subok: The returned array will be a base class array by default. We can let
subclasses pass through by setting this option to True.
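The dtype and order options can be sketched as follows. The variable names are illustrative; the flags attribute is used only to show which memory layout was chosen.

```python
import numpy as np

values = [[1, 2, 3], [4, 5, 6]]

# dtype forces the element type of the new array
as_float = np.array(values, dtype=np.float64)
print(as_float.dtype)

# order selects the memory layout; the values themselves are the same
c_order = np.array(values, order='C')   # row-major
f_order = np.array(values, order='F')   # column-major
print(c_order.flags['C_CONTIGUOUS'], f_order.flags['F_CONTIGUOUS'])
print(np.array_equal(c_order, f_order))
```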
Example:
import numpy as np
a = np.array([10, 12, 13])
print(a)
Output:
[10 12 13]
Example:
import numpy as np
arr = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [9, 10, 11, 23]])
print(arr.ndim)
Output:
2
Finding the shape and size of the array
To get the shape and size of the array, the size and shape attributes associated with
the numpy array are used.
Example:
import numpy as np
a = np.array([[1, 2, 3, 4, 5]])
print("Array Size:", a.size)
print("Shape:", a.shape)
Output:
Array Size: 5
Shape: (1, 5)
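The attributes discussed so far can be read together from one array. This is a small sketch that reuses the 3 x 4 array from the ndim example; itemsize and nbytes are added here to illustrate that every element takes the same number of bytes.

```python
import numpy as np

arr = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [9, 10, 11, 23]])

print(arr.ndim, arr.shape, arr.size)   # dimensions, shape and element count
print(arr[0, 0], arr[2, 3])            # 0-based indexing: first and last element
print(arr.itemsize, arr.nbytes)        # bytes per element and bytes in total
```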
Numpy Methods
numpy.mean()
The sum of the elements along an axis divided by the number of elements is known
as the arithmetic mean. The numpy.mean() function is used to compute the arithmetic
mean along the specified axis. This function returns the average of the array elements.
By default, the average is taken over the flattened array; otherwise it is taken along the
specified axis. For integer inputs, float64 is used for the intermediate and return values.
Syntax:
numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)
Example:
import numpy
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
x = numpy.mean(speed)
print(x)
Output:
89.76923076923077
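The effect of the axis argument can be sketched on a small 2 x 2 array. The array values are chosen here for illustration.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])

overall = np.mean(a)           # flattened array: (1 + 2 + 3 + 4) / 4
by_column = np.mean(a, axis=0) # one mean per column
by_row = np.mean(a, axis=1)    # one mean per row

print(overall)
print(by_column)
print(by_row)
print(overall.dtype)           # integer input, float64 result
```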
Example:
import numpy as np
We have imported numpy with the alias name np. We have created an array 'a' using the
np.array() function and passed the array 'a' to the function mean().
Example:
import numpy
speed = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
x = numpy.median(speed)
print(x)
Output:
87.0
Example:
import numpy as np
a = np.array([[10, 7, 4], [3, 2, 1]])
print("\nMedian of array along axis 0:", np.median(a, 0))
Output:
Median of array along axis 0: [ 6.5 4.5 2.5]
mode()
The mode value is the value that appears the most number of times. The mode is the
number that occurs with the greatest frequency within a data set. The SciPy module
has a method for mode. The scipy.stats.mode() function calculates the mode of the array
elements along the specified axis of the array (list in Python).
Syntax:
scipy.stats.mode(array, axis=0)
These are the parameters of the scipy.stats.mode() function:
array: Input array or object having the elements to calculate the mode.
axis: Axis along which the mode is to be computed. By default axis = 0.
Returns: Modal values of the array elements based on the set parameters.
Example:
import numpy as np
from scipy import stats
dataset = [1, 1, 2, 3, 4, 6, 18]
# mode value
mode = stats.mode(dataset)
print("Mode: ", mode)
Output:
A ModeResult is returned that has 2 attributes. The first attribute, mode, is the number
that is the mode of the data set. The second attribute, count, is the number of times it
appears in the data set.
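The mode and its count can be cross-checked with plain NumPy. Note that this sketch uses numpy.unique() rather than scipy.stats.mode(), so it is a different technique that should agree with the result for a simple list; the data values are assumed for illustration.

```python
import numpy as np

dataset = [1, 1, 2, 3, 4, 6, 18]   # assumed values

# Distinct values (sorted) and how often each one occurs
values, counts = np.unique(dataset, return_counts=True)

mode_value = values[np.argmax(counts)]   # value with the highest count
mode_count = counts.max()                # how many times it occurs

print(mode_value, mode_count)
```

If several values share the highest count, argmax picks the smallest of them, because numpy.unique() returns the values in sorted order.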
Syntax:
numpy.std(a, axis=None, dtype=None, out=None)
Output:
arr: [20, 2, 7, 1, 34]
std of arr : 12.576167937809991
Output:
std of arr, axis = None : 15.3668474320532
std of arr, axis = 0 : [ 7.56224173 17.68473918 18.59267329  3.04138127  0.        ]
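The first result above can be reproduced by hand, which shows what numpy.std() computes by default. This is a minimal sketch; the names mean and by_hand are illustrative, and the ddof argument is shown as an optional variation.

```python
import numpy as np

arr = [20, 2, 7, 1, 34]

# Square root of the average squared deviation from the mean
mean = sum(arr) / len(arr)
by_hand = (sum((x - mean) ** 2 for x in arr) / len(arr)) ** 0.5

print(by_hand)
print(np.std(arr))          # same value: the default divides by n

# ddof=1 divides by n - 1 instead (sample standard deviation)
print(np.std(arr, ddof=1))
```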
The variance is the average of the squared deviations from the mean. numpy.var(arr,
axis=None) is used to compute the variance of the given data (array elements) along
the specified axis (if any).
numpy.var(a, axis=None, dtype=None, out=None)
axis: [int or tuple of ints] axis along which we want to calculate the variance.
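The definition can be checked against numpy.var() directly, and the variance is also the square of the standard deviation. This is a sketch on a small 2D array; the names flat_var, by_hand and col_var are illustrative.

```python
import numpy as np

arr = np.array([[2, 2, 2, 2, 2],
                [15, 6, 27, 8, 2],
                [23, 2, 54, 1, 2],
                [11, 44, 34, 7, 2]])

flat_var = np.var(arr)                       # variance of all 20 values
by_hand = np.mean((arr - arr.mean()) ** 2)   # average squared deviation
col_var = np.var(arr, axis=0)                # one variance per column

print(np.isclose(flat_var, by_hand))
print(np.allclose(np.sqrt(col_var), np.std(arr, axis=0)))
print(col_var[4])   # the last column is constant, so its variance is 0
```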
Example:
import numpy as np
# 2D array
arr = [[2, 2, 2, 2, 2],
       [15, 6, 27, 8, 2],
       [23, 2, 54, 1, 2],
       [11, 44, 34, 7, 2]]
# var of the flattened array
print("\nvar of arr, axis= None: ", np.var(arr))
# var along the axis = 0
print("\nvar of arr, axis = 0: ", np.var(arr, axis = 0))
Output:
Exercise
Short Question
1. Explain close() method.
2. List File modes.
3. Define CSV module.
4. Define the Python pandas?
5. Explain Dataframe in Pandas.
6. Define Series in Pandas?
7. How can we calculate the standard deviation from the Series?
Long Question
1. Explain open() in python to work with file.
2. Explain reader(), writer(), writerows(), DictReader(), DictWriter() with example.
3. Explain the different types of Data Structures in Pandas?
4. Define the different ways a DataFrame can be created in pandas?
5. How to extract specific attributes and rows from a dataframe?
6. Explain the following functions of Dataframe: head, tail, loc, iloc, value, to_numpy(),
describe()
7. Explain Numpy methods: mean(), median(), mode(), std(), var()