Data Science Overview for TYCS VI

Data Science Notes

Uploaded by

Srushti Dewalekar

DATA SCIENCE

TYCS SEMESTER VI
UNIT 1
By Asst Prof. Bindy Wilson

Introduction
• Statistical Learning - the field of statistics that largely
involves computational considerations
• Machine Learning - building computer systems that
automatically improve with experience
• Artificial Intelligence - the science and engineering of
making intelligent machines, drawing on
linguistics, philosophy, psychology, neuroscience,
mathematics, computer science and so on
• Data Mining - the science of knowledge discovery using
database systems, ML and statistics
• Data Science - a big umbrella that brings everything
together, with the potential to draw insight from data
and build intelligent systems on it
Data Science
⚫ An interdisciplinary field that uses techniques
from computer science, databases, statistics & machine
learning
⚫ Involves collection, preparation, analysis,
visualization, management and preservation
of data
⚫ Extracts meaningful information from data
sources to be used for business purposes
⚫ Uses that knowledge to
• Make decisions
• Predict the future

Types of Data
⚫ Data is a collection of facts in a format
that can be processed by a computer.
⚫ Two data types
⚫ Quantitative data (Numerical
Variables): Data that can be described using
numbers and basic mathematical
procedures. E.g. the salary of employees
⚫ Qualitative data (Categorical
Variables): Data that cannot be described
using numbers and basic mathematics. E.g.
gender or country name
Categorical or Qualitative data
⚫ Classified into Nominal, Ordinal, and Binary
⚫ Nominal - variables whose categories have no
ordering. For example, candidate
names in polling data from a survey.
⚫ Ordinal - variables with two or more categories
that follow a defined order. For example, a customer
rating for a movie on a scale of 1 to 5, where a
higher rating ranks above a lower one
⚫ Binary - variables with exactly two
categories, such as gender or the possible
outcomes of a single coin toss
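The difference between nominal and ordinal data can be sketched in a few lines of Python. This is only an illustration: the rating labels, their order, and the candidate names are made-up values, not part of the notes.

```python
from collections import Counter

# Ordinal: categories have an order, so values can be ranked.
rating_order = ["poor", "fair", "good", "excellent"]   # hypothetical scale
rank = {label: position for position, label in enumerate(rating_order)}
reviews = ["good", "poor", "excellent", "fair", "good"]
sorted_reviews = sorted(reviews, key=lambda label: rank[label])

# Nominal: categories have no order, so we can only count them.
votes = ["Asha", "Ravi", "Asha"]                       # hypothetical candidate names
most_common_vote = Counter(votes).most_common(1)

print(sorted_reviews)
print(most_common_vote)
```

Sorting the reviews is meaningful only because an order was supplied; for the candidate names, counting is the only sensible summary.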
Numerical or Quantitative data
⚫ Measurable and represented as numbers, not
words or text
⚫ Divided into discrete and continuous
⚫ Discrete variables take countable values,
e.g. days in a month. Continuous variables
can take any value within a range; they are
subdivided into Interval and Ratio
⚫ Interval - measured along a continuous
range with no true zero. 0 °C is still a temperature,
not the absence of temperature
⚫ Ratio - includes distance, mass, and height. A
value of 0 for a ratio variable means none of
the quantity is present.
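A short sketch of why the interval/ratio distinction matters: ratios are meaningful only on a scale with a true zero, while differences are meaningful on both. The two temperatures are illustrative values chosen for this example.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0   # illustrative temperatures in Celsius

# On the interval scale the ratio is misleading: 20 °C is not "twice as hot" as 10 °C.
naive_ratio = t2_c / t1_c

# On the ratio scale (Kelvin) the ratio is physically meaningful.
true_ratio = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

# Differences agree on both scales.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

print(naive_ratio, round(true_ratio, 3), diff_c, round(diff_k, 2))
```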
Traditional vs Big Data
⚫ Traditional data - structured & stored in
databases in table format, contains numeric or
text values, usually managed in a single system
⚫ Big data - distributed across a network of
computers and larger along the 5 V's
• Volume - enormous amount of data
• Variety - many sources & types: photos, videos,
audio, PDF, data from sensors, monitoring
devices
• Velocity - massive & continuous data flow
• Veracity - data may be uncertain, imprecise or abnormal
• Validity - whether the data is accurate for its intended use
Different types of data sources
1) Structured - the easiest to understand,
represent, store, query, and process
⚫ data has rows and columns stored in a
tabular manner
⚫ e.g. data coming from CSV and Excel files
2) Semi-Structured - mostly web data in
formats such as XML, HTML etc.
⚫ e.g. data generated from Twitter and Facebook
⚫ often stored in NoSQL databases like MongoDB and
Cassandra
3) Unstructured - data like images, videos, web
logs, and click streams, and also non-digitized data
from newspapers and books
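To make the first two categories concrete, here is a minimal sketch that reads a structured CSV snippet and a semi-structured JSON snippet with Python's standard library. JSON is used here as a stand-in for semi-structured formats in general; the inline text, field names and values are invented for illustration.

```python
import csv
import io
import json

# Structured: every row has the same columns (inline CSV text for illustration).
csv_text = "name,salary\nAsha,52000\nRavi,48000\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total_salary = sum(int(row["salary"]) for row in rows)

# Semi-structured: each record carries its own keys; fields may be nested or missing.
json_text = '[{"user": "asha", "tags": ["ml", "stats"]}, {"user": "ravi"}]'
records = json.loads(json_text)
tag_counts = [len(record.get("tags", [])) for record in records]

print(total_salary)
print(tag_counts)
```

The CSV rows can be processed column by column without checks, whereas the JSON records need a default (`record.get("tags", [])`) because not every record has every field.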
The Five Steps of Data Science
⚫ 1. Asking an interesting question
⚫ 2. Obtaining the data - finding the right data
set
⚫ 3. Exploring the data - understanding the data
⚫ 4. Modeling the data - involves the use of
statistical and machine learning models
⚫ 5. Communicating and visualizing the results -
presenting your conclusions in a digestible format

Data Collection
⚫ Primary and Secondary data
⚫ Primary data is data originated for the first time by
the researcher through direct efforts and experience
⚫ Also known as first-hand or raw data
⚫ Collected through surveys, observations, physical
testing, mailed questionnaires, interviews
⚫ Secondary data is second-hand information that has
already been collected and recorded by someone else
⚫ Readily available from various sources like censuses,
government publications, internal records of the
organisation, reports, books, journal articles, websites
Various types of data collection methods
⚫ 1) Companies and Proprietary Data Sources
⚫ 2) Government Data Sources
⚫ 3) Academic Data Sets
⚫ 4) Sweat Equity
⚫ 5) Scraping

⚫ A) Casual & Scientific
⚫ B) Simple & Systematic
⚫ C) Subjective & Objective
⚫ D) Factual & Inferential
⚫ E) Direct & Indirect
⚫ F) Behavioral & Non-behavioral
Web scraping
⚫ Used for extracting data from websites
⚫ Scraping a web page involves fetching it and
extracting data from it
⚫ The content of a page may be parsed, searched,
reformatted, or its data copied into a spreadsheet
⚫ Techniques include:
• Human Copy-Paste
• Text pattern matching - using the regular expression
matching facilities of programming languages
• API Interface
• DOM Parsing
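Two of the techniques above can be sketched with Python's standard library. This is a minimal illustration, not a complete scraper: the HTML is an inline string standing in for a fetched page, its contents are invented, and the event-based `html.parser` module is used as a simple stand-in for full DOM parsing.

```python
import re
from html.parser import HTMLParser

# Inline HTML stands in for a fetched page (hypothetical content).
page = """
<html><body>
  <h1>Price list</h1>
  <a href="/items/1">Pen</a> <span class="price">Rs. 10</span>
  <a href="/items/2">Notebook</a> <span class="price">Rs. 45</span>
</body></html>
"""

# Text pattern matching: pull out the numbers that follow "Rs."
prices = [int(p) for p in re.findall(r"Rs\.\s*(\d+)", page)]

# Parsing the markup: collect the href attribute of every <a> tag.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

collector = LinkCollector()
collector.feed(page)

print(prices)
print(collector.links)
```

Pattern matching is quick but breaks easily when the page layout changes; parsing the markup is usually more robust because it follows the document structure.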
Data wrangling or Data cleaning
⚫ Initial Data Analysis (IDA)
⚫ Removing inconsistencies from the data,
like missing values, and enforcing a standard
format
⚫ Correcting Factor Variables
⚫ Dealing with NAs - imputation is the process of
filling in the missing values
⚫ Dealing with Dates and Times
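As one example of enforcing a standard format, the sketch below converts dates written in different styles into ISO format (YYYY-MM-DD). The raw values and the list of accepted input formats are assumptions made for this illustration.

```python
from datetime import datetime

# Hypothetical raw values: the same date written three ways, plus one bad entry.
raw_dates = ["2023-01-15", "15/01/2023", "15.01.2023", "n/a"]
known_formats = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y"]   # assumed input formats

def to_iso(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in known_formats:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

cleaned = [to_iso(d) for d in raw_dates]
print(cleaned)
```

Entries that cannot be parsed come back as `None`, so they can then be treated like any other missing value.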

Handling missing data
⚫ Heuristic-based imputation - make a reasonable
guess
⚫ Mean value imputation - use the mean value of
a variable
⚫ Random value imputation - select a random
value from the column
⚫ Imputation by nearest neighbor - identify the
record which matches most closely on all fields,
and use this nearest neighbor to infer the values
⚫ Imputation by interpolation - use a method like
linear regression to predict the value
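Mean value imputation and random value imputation from the list above can be sketched as follows. The ages are made-up values, `None` marks a missing entry, and the fixed random seed is only there to make the run repeatable.

```python
import random
import statistics

ages = [23, None, 31, 27, None, 35]            # None marks a missing value
observed = [a for a in ages if a is not None]

# Mean value imputation: replace each missing value with the mean of the observed ones.
mean_age = statistics.mean(observed)
mean_imputed = [mean_age if a is None else a for a in ages]

# Random value imputation: replace each missing value with a randomly chosen observed value.
rng = random.Random(0)                          # seeded for repeatability
random_imputed = [rng.choice(observed) if a is None else a for a in ages]

print(mean_imputed)
print(random_imputed)
```

Mean imputation keeps the column mean unchanged but shrinks its variance; random imputation preserves the spread better at the cost of adding noise.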

Exploratory Data Analysis (EDA)
⚫ Fundamental step after data collection and
pre-processing
⚫ Most EDA techniques are graphical in nature
⚫ Objectives of EDA:
⚫ 1) Maximize insight into the data set
⚫ 2) Visualize relationships between exposure
and outcome variables
⚫ 3) Detect outliers and anomalies
⚫ 4) Extract and create relevant variables
⚫ 5) Find a suitable model
⚫ EDA methods - graphical or non-graphical
⚫ Non-graphical - summary statistics, including
frequency, mean, median, mode, range,
interquartile range, maximum and minimum
values
⚫ Graphical - data visualization using multiple types
of charts and graphs

Summary Statistics
⚫ Mean - the sum of the values divided by the
number of observations
⚫ Median - the middle value in a sorted data set
⚫ Variance - a measure of spread: the average
squared deviation from the mean
⚫ Interquartile range (IQR) - the spread of the data situated
between the 1st and the 3rd quartiles (Q3 - Q1)
⚫ Skewness - measures asymmetry about the
mean
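These statistics can be computed with Python's `statistics` module, as in the sketch below. The data set is illustrative. Two choices here are assumptions rather than something the notes specify: the population variance is used, and quartiles come from `statistics.quantiles` with its default method (other quartile conventions give slightly different values). Skewness is computed by hand as the third central moment divided by the variance raised to the power 1.5.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]    # illustrative values

mean = statistics.mean(data)
median = statistics.median(data)
variance = statistics.pvariance(data)          # population variance (divides by n)

# Quartiles split the sorted data into four parts; IQR = Q3 - Q1.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Skewness: third central moment scaled by variance ** 1.5.
third_moment = sum((x - mean) ** 3 for x in data) / len(data)
skewness = third_moment / variance ** 1.5

print(mean, median, variance, iqr, skewness)
```

A positive skewness, as here, indicates a longer tail on the right side of the distribution.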
Summary Statistics (contd)
⚫ Kurtosis - a measure of the peakedness and tailedness
of the probability distribution of a random
variable
⚫ Covariance and correlation - measure the
degree of the relationship between two random
variables
⚫ A correlation close to -1 means that as one
variable increases, the other decreases. A correlation
close to +1 means that the variables move
together in the same direction. A correlation near 0
means little or no linear relationship
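The behaviour of correlation at the two extremes can be checked with a small hand-written sketch. It uses the sample covariance (dividing by n - 1) and the Pearson correlation; the variable names and values are invented for illustration.

```python
import math

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to lie between -1 and +1."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]      # rises as hours rise
errors = [10, 8, 6, 4, 2]     # falls as hours rise

r_up = correlation(hours, score)
r_down = correlation(hours, errors)
print(covariance(hours, score), r_up, r_down)
```

Covariance depends on the units of the variables, whereas correlation is unit-free, which is why correlation is the easier of the two to compare across pairs of variables.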

Data visualization
⚫ The process of creating and studying visual
representations of data to bring out meaningful
insights
⚫ Deals with visualizing the information in a given data set
⚫ Benefits
• Identifying problem areas (red spots) in data
• Tracking and identifying relations among different
attributes
• Seeing the trend
• Summarizing complicated long spreadsheets and
databases visually
• An easy-to-use and very impactful way to store and
present information
Data visualization
⚫ Four types of presentation in data visualization:
Comparison, Relationship, Distribution, and
Composition
⚫ Comparison is used to see the differences
between multiple items at a given point in time. E.g.
line chart
⚫ Relationship helps in finding correlation between
two or more variables. E.g. scatter and bubble charts
⚫ Distribution charts like column and line
histograms show the spread of data. Skewness
toward the left or right can be easily spotted.
⚫ Composition shows how a whole is made up of
multiple components, as in a pie chart or stacked
column chart
Boxplots
⚫ Boxplots are a compact way of representing
the five-number summary, namely the minimum, first
quartile (25th percentile), median, third quartile
(75th percentile) and maximum.
⚫ The upper side of the vertical rectangular box
represents the third quartile and the lower side, the
first quartile. The difference between the two
points is known as the interquartile range,
which contains the middle 50% of the data.
⚫ A line dividing the rectangle represents the
median.
⚫ Lines extending from both sides of the rectangle
are known as whiskers
⚫ Points plotted beyond the ends of the
whiskers are called outliers.
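The numbers behind a boxplot can be computed directly, as in this sketch. The data values are illustrative. The rule that whiskers reach at most 1.5 × IQR beyond the box is a common convention assumed here; the notes do not state which rule is used, and plotting tools may differ.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]    # illustrative values; 50 is an extreme point

# Quartiles of the data (the "inclusive" method treats the data as the full population).
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points more than 1.5 * IQR outside the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

five_number = (min(data), q1, median, q3, max(data))
print(five_number)
print(outliers)
```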

Line Chart
⚫ A line chart is a basic chart type in which
information is displayed as a series of data points
connected by line segments. Line charts are used for
showing trends

Scatterplots
⚫ A scatterplot is a graph that helps identify if there is
a relationship between two variables.
⚫ Scatterplots use Cartesian coordinates to show two
variables on an x- and y-axis. Higher dimensional
scatterplots are also possible
⚫ By adding dimensions such as color, shape or size, we
can present more than two variables

Correlation Plots
⚫ A good way to show how strongly one indicator relates
to another is to compute the correlation.
⚫ A correlation plot uses a combination of color, size, and
position to turn each numeric value into a visual
representation
⚫ The direction of the ellipse represents a positive or negative
correlation & its size represents the value

⚫ Stacked column charts are an elegant way of showing
the composition of various categories that make up a
particular variable
⚫ A histogram is one of the most basic and easy to
understand graphical representations of numerical data.
⚫ It consists of rectangular bars. The width of each
bar covers a range of values (a bin) and the height gives the
number of data points within that range.
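The counting behind a histogram can be sketched without any plotting library. The function name `histogram_counts`, the marks, and the choice of five equal-width bins are all assumptions made for this illustration.

```python
def histogram_counts(values, low, high, bins):
    """Count how many values fall into each of `bins` equal-width ranges over [low, high]."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        if low <= v <= high:
            # A value equal to `high` is placed in the last bin.
            index = min(int((v - low) / width), bins - 1)
            counts[index] += 1
    return counts

marks = [35, 42, 47, 51, 55, 58, 63, 64, 71, 88]    # illustrative marks out of 100
counts = histogram_counts(marks, low=0, high=100, bins=5)   # bins of width 20

# Text rendering: one row per bin, one '#' per data point.
for i, c in enumerate(counts):
    print(f"{i * 20:3d}-{(i + 1) * 20:3d} | {'#' * c}")
```

Changing the number of bins changes the shape of the histogram, so it is worth trying a few bin widths before drawing conclusions about the distribution.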

⚫ A Pie Chart is a type of graph that uses pie slices to
show relative sizes of data.
⚫ Heatmaps are visualizations of data where values are
represented as different shades of color: the darker the
shade, the higher the value.
⚫ Dendrograms are visual representations specifically
useful in clustering analysis. They are tree diagrams
frequently used to illustrate the formation of clusters
⚫ The y-axis in dendrograms measures the closeness (or
similarity) of clusters.

High level Programming language
⚫ A high-level language (HLL) is a programming language that
enables a programmer to write programs that are independent
of a particular type of computer.
⚫ They are closer to human languages and further from machine
languages.
⚫ Assembly languages are considered low-level because they are
very close to machine languages.
⚫ Advantages of high-level languages
• High-level languages are programmer-friendly. They are easy to
write, debug and maintain.
• They provide a higher level of abstraction from machine languages.
• They are machine-independent.
• Easy to learn.
• Less error-prone; errors are easier to find and debug.
• High-level programming results in better programming
productivity.

Integrated Development Environment (IDE)
⚫ An IDE enables programmers to consolidate the different
aspects of writing a computer program.
⚫ It is software for building applications that combines common
developer tools into a single graphical user interface (GUI)
⚫ Development tools include text editors, code libraries,
compilers and test platforms
⚫ An IDE typically offers
• a text editor,
• automated code validation,
• syntax highlighting,
• auto completion,
• contextual suggestions,
• easy access to help, and
• debugging tools
