LO5
Prepare Data for Modeling
Topics covered in LO5 & LO6
• Introduction to various Python libraries
• Filtering and selecting data
• Concatenating and transforming data
• Data visualization best practices
• Visualizing data
• Creating a plot
• Creating statistical data graphics
• Performing basic math and linear algebra
• Correlation analysis
• Multivariate analysis
• Data sourcing via web scraping
Objective of today’s session
After attending this session, you should know
• Introduction to the pandas library
• Filtering and selecting
• Treating missing values
Coding languages for data science
• Python
• R
• Julia
• Go
• Python is a high-level, interpreted programming language that is useful for a
wide variety of applications.
• It is one of the languages Google officially uses for development.
Benefits of using Python
• It is easy to learn and human-readable.
• It has an extensive array of well-supported data science libraries.
• It has the largest user base of all data science languages.
• It can also be used to build predictive web applications and for many other
purposes, not just data science.
Python is a popular language
Python is the most popular language in data science
Why use Python for working with data
Python is useful for:
• Data science, data analytics, and data engineering
• Both professional and academic environments
• Web development
• Application development
• Game development
Python is also an open-source programming language.
Main Python libraries for data science
Pandas library introduction
• Pandas is useful for fast data cleaning and preparation, powerful analysis
capabilities, and ease of use for data visualization and machine learning.
• It is compatible with NumPy arrays and matrices.
• It is built on top of NumPy.
• The pandas counterparts of one-dimensional arrays and two-dimensional
matrices are Series and DataFrame objects.
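The relationship between NumPy and pandas described above can be sketched as follows. This is a minimal illustration, not part of the original slides; the values, column names, and variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# A 2-D NumPy array (matrix) with made-up values
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Wrap the matrix in a DataFrame; the column names are illustrative
df = pd.DataFrame(matrix, columns=["a", "b", "c"])

# A 1-D NumPy array becomes a Series
series = pd.Series(np.array([1, 4, 9]), name="values")

# Convert a DataFrame back to a NumPy array when needed
back_to_numpy = df.to_numpy()

# NumPy element-wise functions accept a Series and return a Series
roots = np.sqrt(series)

print(df)
print(series)
print(back_to_numpy)
print(roots)
```

Because pandas stores its data in NumPy-compatible form, moving between the two libraries is usually a one-line conversion.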
Shortcuts in Jupyter: https://yoursdata.net/jupyter-lab-shortcut-and-magic-functions-tips/
Indexing in pandas
• An index is a list of integers or labels used to uniquely identify rows
and columns.
To index, we use:
• A set of square brackets [...]
• The .loc[] indexer
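The two indexing styles listed above might look like this in practice. The labels ("row 1", "col 1", and so on) and the values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# A Series with a label index (labels and values are illustrative)
s = pd.Series([10, 20, 30, 40],
              index=["row 1", "row 2", "row 3", "row 4"])

# Square brackets: one label returns a single value
first = s["row 1"]

# Square brackets: a list of labels returns a smaller Series
subset = s[["row 1", "row 4"]]

# A small DataFrame with row and column labels
df = pd.DataFrame(
    {"col 1": [1, 2, 3], "col 2": [4, 5, 6]},
    index=["row 1", "row 2", "row 3"],
)

# Square brackets on a DataFrame select a column by label
column = df["col 1"]

# .loc[] takes a row label and a column label
cell = df.loc["row 2", "col 2"]

# .loc[] also accepts lists of labels for rows and columns
block = df.loc[["row 1", "row 3"], ["col 2"]]

print(first)
print(subset)
print(column)
print(cell)
print(block)
```

Square brackets are the shortest form for simple lookups, while .loc[] makes it explicit that selection is by label and allows rows and columns to be chosen together.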
Introducing the pandas library
• A DataFrame object is essentially a spreadsheet of rows and columns.
• Each row and each column is itself a Series object in the pandas library.
• DataFrames are indexable.
• A Series object is a single row or column, and it is always indexed.
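As a small sketch of the points above, the following pulls one column and one row out of a DataFrame and checks that each is an indexed Series. The data and the labels ("person A", "height", and so on) are invented for the example.

```python
import pandas as pd

# A small DataFrame; names and numbers are illustrative only
df = pd.DataFrame(
    {"height": [1.5, 1.8], "weight": [50, 80]},
    index=["person A", "person B"],
)

# A single column is a Series, indexed by the row labels
col = df["height"]

# A single row is also a Series, indexed by the column labels
row = df.loc["person A"]

print(type(col).__name__, col.index.tolist())
print(type(row).__name__, row.index.tolist())
```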
Comparison operators in pandas
Code demonstration
• Introduce Jupyter notebook
• Plain indexing
• Data slicing
• Arithmetic comparisons
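A minimal sketch of the slicing and comparison steps from the demonstration list, assuming a small Series with made-up values and single-letter labels; the notebook used in class may differ.

```python
import pandas as pd

# Illustrative Series with label index
s = pd.Series([5, 12, 7, 20, 3], index=["a", "b", "c", "d", "e"])

# Data slicing: label slices with .loc[] include both endpoints
sliced = s.loc["b":"d"]

# Arithmetic comparison: produces a Series of True/False values
mask = s > 6

# Use the boolean Series to filter the original data
filtered = s[mask]

# Equality comparison works the same way
equal_seven = s[s == 7]

print(sliced)
print(mask)
print(filtered)
print(equal_seven)
```

Comparison operators return a boolean Series of the same length as the original, which is what makes filtering with square brackets possible.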
Packages vs. modules: https://ajaytech.co/2020/04/21/modules-vs-packages-vs-libraries-vs-frameworks/
Random seed: https://www.youtube.com/watch?v=8B1z3xwNy2s
Summary
• Introduction to the pandas library
• Filtering and selecting
• Treating missing values
Himanshu Patel, Instructor
Saskatchewan Polytechnic
email: patelh@saskpolytech.ca
Mining building, Saskatoon