DATA SCIENCE
TYCS SEMESTER VI
UNIT 1
By Asst Prof. Bindy Wilson
Introduction
•Statistical Learning - field of statistics that largely
 involves computational considerations
•Machine Learning - build computer systems that
 automatically improve with experience
•Artificial Intelligence - the science and engineering of
 making intelligent machines, drawing on
 linguistics, philosophy, psychology, neuroscience,
 mathematics, computer science and so on
•Data Mining – science of knowledge discovery using
 database systems, ML and statistics
•Data Science - big umbrella that brings everything
 together, with the potential to draw insight from data
 and build intelligent systems from it
Data Science
⚫   An interdisciplinary field that uses techniques
    from computer science, databases, statistics & machine
    learning
⚫   Involves collection, preparation, analysis,
    visualization, management and preservation
    of data
⚫   Extracts meaningful information from data
    sources to be used for business purposes
⚫   Uses that knowledge to
    • Make decisions
    • Predict the future
Types of Data
⚫  Data is a collection of facts in a format
  that can be processed by a computer.
⚫ Two types of data
⚫ Quantitative data (Numerical
  Variables): Data that can be described using
  numbers and basic mathematical
  procedures. E.g. the salary of employees
⚫ Qualitative data (Categorical
  Variables): Data that cannot be described
  using numbers and basic mathematics. E.g.
  gender or country name
    Categorical or Qualitative data
⚫ Classified into Nominal, Binary, and Ordinal
⚫ Nominal - variables whose categories have no
  inherent order. For example, candidate
  names in polling data from a survey.
⚫ Ordinal - variables with two or more categories
  that have a meaningful order. For example, in a
  customer rating for a movie, the rating variable
  has a relative importance on a scale of 1 to 5.
⚫ Binary - variables with exactly two
  categories, such as gender or the possible
  outcomes of a single coin toss
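A minimal Python sketch of how the three categorical types behave differently in practice. The sample values and the `RATING_ORDER` list are made up for illustration; the point is only that ordinal categories can be sorted by rank, while nominal ones can only be counted.

```python
from collections import Counter

# Nominal: categories with no order; counting is meaningful, ranking is not.
votes = ["Asha", "Ravi", "Asha", "Meera", "Asha"]   # hypothetical candidate names
vote_counts = Counter(votes)

# Ordinal: categories with a defined order (hypothetical rating labels).
RATING_ORDER = ["poor", "fair", "good", "very good", "excellent"]
ratings = ["good", "poor", "excellent", "fair"]
sorted_ratings = sorted(ratings, key=RATING_ORDER.index)

# Binary: exactly two categories, e.g. the outcome of a single coin toss.
tosses = ["heads", "tails", "heads"]
is_binary = len(set(tosses)) <= 2

print(vote_counts.most_common(1))  # most frequent candidate with its count
print(sorted_ratings)              # ratings from lowest to highest
print(is_binary)
```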
Numerical or Quantitative data
⚫ Measurable and represented as numbers, not
  words or text
⚫ Divided into discrete and continuous
⚫ Discrete variables take countable values,
  e.g. days in a month. Continuous variables
  can take any value within a range and are
  subdivided into Interval and Ratio
⚫ Interval - measured along a continuous
  scale with no true zero. 0 °C does not mean
  the absence of temperature
⚫ Ratio - includes distance, mass, and height. A
  value of 0 for a ratio variable means none, or
  no measure.
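A small worked example of why the interval/ratio distinction matters, assuming the usual Celsius-to-Kelvin conversion. The two temperatures are arbitrary sample values. Differences are meaningful on an interval scale, but ratios only make sense on a scale with a true zero.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert Celsius (interval scale) to Kelvin (ratio scale, true zero)."""
    return c + 273.15

a_c, b_c = 10.0, 20.0   # arbitrary sample temperatures in Celsius

# Differences are meaningful on an interval scale.
diff_c = b_c - a_c

# The naive ratio on Celsius suggests "twice as hot", which is misleading,
# because 0 degrees Celsius is not "no temperature".
naive_ratio = b_c / a_c

# On Kelvin, which has a true zero, the ratio is physically meaningful.
true_ratio = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print(diff_c, naive_ratio, round(true_ratio, 3))
```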
    Traditional vs Big Data
⚫ Traditional data – structured & stored in
  databases in table format, contains numeric or
  text values, usually managed in a single system
⚫ Big data – distributed across a network of
  computers and characterized by the 5 V’s
⚫ Volume – enormous volume
⚫ Variety – many sources & types: photos, videos,
  audio, PDF, data from sensors, monitoring
  devices, etc.
⚫ Velocity – massive & continuous data flow
⚫ Veracity – uncertain, imprecise, abnormal data
⚫ Validity – whether the data is accurate for its intended use
  Different types of data sources
1) Structured - the easiest to understand,
  represent, store, query, and process
⚫ Data has rows and columns stored in a
  tabular manner
⚫ E.g. data coming from CSV and Excel files
2) Semi-Structured - web data that
  consists of XML, HTML, etc.
⚫ E.g. data generated from Twitter and Facebook
⚫ Stored in NoSQL databases like MongoDB and
  Cassandra
3) Unstructured - data like images, videos, web
  logs, and click streams, as well as non-digitized
  data from newspapers and books.
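A minimal sketch contrasting structured and semi-structured sources, using only the Python standard library. The CSV and XML snippets are inline, made-up samples rather than real files. Note how every CSV row has the same columns, while the XML records may carry different fields.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: rows and columns, as in a CSV file (inline sample, not a real file).
csv_text = "name,salary\nAsha,50000\nRavi,42000\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
salaries = [int(row["salary"]) for row in rows]

# Semi-structured: tagged, nested data such as XML; fields may vary per record.
xml_text = """
<posts>
  <post user="asha"><text>Hello</text><tag>intro</tag></post>
  <post user="ravi"><text>Data science!</text></post>
</posts>
""".strip()
root = ET.fromstring(xml_text)
posts = [
    {
        "user": post.get("user"),
        "text": post.findtext("text"),
        "tags": [t.text for t in post.findall("tag")],  # may be empty
    }
    for post in root.findall("post")
]

print(salaries)
print(posts)
```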
    The Five Steps of Data Science
⚫ 1. Asking an interesting question
⚫ 2. Obtaining the data - finding the right data
  set
⚫ 3. Exploring the data - understanding the data
⚫ 4. Modeling the data - involves the use of
  statistical and machine learning models
⚫ 5. Communicating and visualizing the results -
  present your conclusions in a digestible format
        Data Collection
⚫   Primary and Secondary data
⚫   Primary data is data originated for the first time by
    the researcher through direct efforts and experience
⚫   Also known as first-hand or raw data
⚫   Collected through surveys, observations, physical
    testing, mailed questionnaires, interviews
⚫   Secondary data is second-hand information that has
    already been collected and recorded by another person
⚫   Readily available from various sources like censuses,
    government publications, internal records of the
    organisation, reports, books, journal articles, websites
Various types of data collection methods
⚫   1) Companies and Proprietary Data Sources
⚫   2) Government Data Sources
⚫   3) Academic Data Sets
⚫   4) Sweat Equity
⚫   5) Scraping
⚫   A) Casual & Scientific
⚫   B) Simple & Systematic
⚫   C) Subjective & Objective
⚫   D) Factual & Inferential
⚫   E) Direct & Indirect
⚫   F) Behavioral & Non-behavioral
     Web scraping
⚫   Used for extracting data from websites
⚫   Scraping a web page involves fetching it and
    extracting data from it
⚫   The content of a page may be parsed, searched,
    reformatted, or its data copied into a spreadsheet
⚫   Human Copy-Paste
⚫   Text pattern matching - using regular expression
    matching facilities of programming languages
⚫   API Interface
⚫   DOM Parsing
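A minimal sketch of two of the techniques listed above, using only the Python standard library. The `PAGE` string is a made-up sample standing in for fetched HTML, and `LinkCollector` is a hypothetical helper name. The event-based `HTMLParser` is used here as a lightweight stand-in for full DOM parsing, not as a DOM implementation.

```python
import re
from html.parser import HTMLParser

# Inline sample page standing in for fetched HTML (no network access here).
PAGE = """
<html><body>
  <h1>Price list</h1>
  <ul>
    <li class="item">Pen - Rs. 10</li>
    <li class="item">Notebook - Rs. 45</li>
  </ul>
  <a href="https://example.com/more">More</a>
</body></html>
"""

# Text pattern matching: a regular expression pulls out the prices.
prices = [int(p) for p in re.findall(r"Rs\.\s*(\d+)", PAGE)]

class LinkCollector(HTMLParser):
    """Parser-based extraction: collect href attributes of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

collector = LinkCollector()
collector.feed(PAGE)

print(prices)
print(collector.links)
```

Regular expressions are quick for flat text patterns but brittle against markup changes; a parser is usually the safer choice once the structure of the page matters.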
 Data wrangling or Data cleaning
⚫ Initial Data Analysis (IDA)
⚫ Removing inconsistencies from the data,
  like missing values, and making it follow a
  standard format
⚫ Correcting Factor Variables
⚫ Dealing with NAs - imputation is the process of
  filling in the missing values
⚫ Dealing with Dates and Times
    Handling missing data
⚫   Heuristic-based imputation - make a reasonable
    guess
⚫   Mean value imputation - use the mean value of
    the variable
⚫   Random value imputation - select a random
    value from the column
⚫   Imputation by nearest neighbor - identify the
    record which matches most closely on all fields,
    and use this nearest neighbor to infer the values
⚫   Imputation by interpolation - use a method like
    linear regression to predict the value
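A minimal sketch of mean value and random value imputation in plain Python. The `ages` column and the helper names `impute_mean` and `impute_random` are hypothetical; missing entries are represented as `None`, which is an assumption about how the data is stored.

```python
import random
from statistics import mean

def impute_mean(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def impute_random(values, rng):
    """Replace None with a value drawn at random from the observed values."""
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

ages = [20, None, 30, 40, None]   # hypothetical column with missing entries
mean_filled = impute_mean(ages)
random_filled = impute_random(ages, random.Random(0))  # seeded for repeatability

print(mean_filled)
print(random_filled)
```

Mean imputation keeps the column average unchanged but shrinks its variance; random imputation preserves the spread of the observed values at the cost of adding noise.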
    Exploratory Data Analysis (EDA)
⚫ Fundamental step after data collection and
  pre-processing
⚫ Most EDA techniques are graphical in nature
⚫ Objectives of EDA :
⚫ 1) Maximize insight into the data set
⚫ 2) Visualize relationships between exposure
  and outcome variables
⚫ 3) Detect outliers and anomalies
⚫ 4) Extract and create relevant variables
⚫ 5) Find a suitable model
⚫ EDA methods - Graphical or non-graphical
⚫ Non-graphical - summary statistics, including
  frequency, mean, median, mode, range,
  interquartile range, maximum and minimum
  values
⚫ Graphical - Data visualization, multiple types
  of charts, graphs
    Summary Statistics
⚫ Mean - the sum of the values divided by the
  number of observations
⚫ Median - middle value in a sorted data set
⚫ Variance - measure of the spread of the
  given set of numbers around the mean
⚫ Interquartile range (IQR) - the difference
  between the 3rd and the 1st quartiles, which
  spans the middle 50% of the data
⚫ Skewness - measures asymmetry about the
  mean
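The statistics above can be sketched with Python's standard `statistics` module. The sample `data` is made up, the quartiles use the module's `"inclusive"` method (one of several conventions), and skewness is computed by hand as the third standardized moment, which is one common definition among several.

```python
from statistics import mean, median, quantiles, variance

data = [2, 4, 4, 5, 7, 9, 11]   # made-up sample

avg = mean(data)       # sum of values / number of observations
mid = median(data)     # middle value of the sorted data
var = variance(data)   # sample variance (divides by n - 1)

# Quartiles; method="inclusive" treats the data as the whole population.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Skewness as the third standardized moment (population form).
n = len(data)
m2 = sum((x - avg) ** 2 for x in data) / n
m3 = sum((x - avg) ** 3 for x in data) / n
skewness = m3 / m2 ** 1.5

print(avg, mid, var, iqr, round(skewness, 2))
```

A positive skewness, as in this sample, indicates a longer tail toward larger values.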
    Summary Statistics (contd)
⚫ Kurtosis - measure of peakedness and tailedness
  of the probability distribution of a random
  variable
⚫ Covariance and correlation - measure the
  degree of the relationship between two random
  variables
⚫ A correlation close to -1 means that as one
  variable increases, the other decreases; a
  correlation close to +1 means that the
  variables move together in the same
  direction
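A minimal sketch of covariance and Pearson correlation written out by hand, so the relationship between the two is visible. The `hours`, `score`, and `errors` lists are made-up data chosen to give perfectly positive and perfectly negative relationships.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled to the range [-1, +1]."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

hours = [1, 2, 3, 4, 5]
score = [2, 4, 6, 8, 10]    # rises with hours
errors = [10, 8, 6, 4, 2]   # falls as hours rise

r_up = correlation(hours, score)
r_down = correlation(hours, errors)

print(r_up, r_down)
```

Covariance depends on the units of the variables, while correlation is unit-free, which is why correlation is the easier of the two to compare across pairs of variables.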
    Data visualization
⚫   the process of creating and studying the visual
    representation of data to bring some meaningful
    insights
⚫   deals with visualizing the information in a given data
⚫   Benefits
    • Identifying red spots in data
    • Tracking and identifying relations among different
      attributes
    • Seeing the trend
    • Summarizing long, complicated spreadsheets and
      databases into visuals
    • An easy-to-use and impactful way to store and
      present information
    Data visualization (contd)
⚫   Four types of presentation in data visualization:
    Comparison, Relationship, Distribution, and
    Composition
⚫   Comparison is used to see the differences
    between multiple items at a given point in time. E.g.
    line chart
⚫   Relationship helps in finding correlation between
    two or more variables. E.g. scatter and bubble charts
⚫   Distribution charts like column and line
    histograms show the spread of data. Skewness
    toward the left or right can be easily spotted.
⚫   Composition shows how multiple components
    make up a whole, e.g. a pie chart or stacked
    column chart
       Boxplots
⚫   Boxplots are a compact way of representing
    the five-number summary, namely the median, first
    and third quartiles (25th and 75th percentiles),
    and the min and max.
⚫   The upper side of the vertical rectangular box
    represents the third quartile and the lower side, the
    first quartile. The difference between the two
    points is known as the interquartile range,
    which contains the middle 50% of the data.
⚫   A line dividing the rectangle represents the
    median.
⚫   Lines (known as whiskers) extend from both
    sides of the rectangle
⚫   Points plotted beyond the ends of the
    whiskers are called outliers.
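A minimal sketch of the numbers behind a boxplot, computed with the standard library. The sample `data` is made up. The slides do not say how outliers are decided; the 1.5 × IQR fences used here are an assumed, widely used convention, and the quartiles use the `"inclusive"` method, which is one of several.

```python
from statistics import quantiles

data = [2, 4, 4, 5, 7, 9, 11, 30]   # made-up sample with one extreme value

q1, med, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention (not from the slides): fences at 1.5 * IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]

five_number = (min(data), q1, med, q3, max(data))
whiskers = (min(inside), max(inside))   # whiskers end at the most extreme non-outliers

print(five_number)
print(whiskers, outliers)
```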
     Line Chart
⚫   A line chart is a basic visualization chart type in which
    information is displayed in a series of data points
    connected by line segments. Line charts are used for
    showing trends
    Scatterplots
⚫   A scatterplot is a graph that helps identify if there is
    a relationship between two variables.
⚫   Scatterplots use Cartesian coordinates to show two
    variables on an x- and y-axis. Higher dimensional
    scatterplots are also possible
⚫   If we add dimensions of color, shape, or size, we
    can present more than two variables
    Correlation Plots
⚫   The best way to show how much one indicator relates
    to another is by computing the correlation.
⚫   The combination of color, size, and position
    encapsulates a numeric value into a visual
    representation in a Correlation plot
⚫   The direction of the ellipse represents a positive or negative
    correlation, and its size represents the value
⚫ Stacked column charts are an elegant way of showing
  the composition of various categories that make up a
  particular variable
⚫ A histogram is one of the most basic and easy to
  understand graphical representations of numerical data.
⚫ It consists of rectangular bars. The width of each
  rectangle spans a range of values (a bin) and the height
  signifies the number of data points within that range.
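A minimal sketch of how a histogram is built: each value is assigned to a fixed-width bin, and the bar height is the count per bin. The data, the bin width of 3, and the helper name `histogram` are made up for illustration, and the bars are drawn as text rather than with a plotting library.

```python
from collections import Counter

def histogram(values, bin_width, start=0):
    """Count values per bin; bin i covers [start + i*w, start + (i+1)*w)."""
    return Counter(int((v - start) // bin_width) for v in values)

values = [1, 2, 2, 3, 5, 6, 6, 7, 9]   # made-up data
width = 3
counts = histogram(values, width)

# Draw one text bar per bin: the bar length is the number of points in the bin.
for i in sorted(counts):
    low, high = i * width, (i + 1) * width
    print(f"[{low:>2}, {high:>2}) {'#' * counts[i]}")
```

Changing the bin width changes the shape of the histogram, so it is worth trying a few widths before drawing conclusions about the distribution.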
⚫   A Pie Chart is a type of graph that uses pie slices to
    show relative sizes of data.
⚫   Heatmaps are visualizations of data where values are
    represented as different shades of color; the darker the
    shade, the higher the value.
⚫   Dendrograms are visual representations specifically
    useful in clustering analysis. They are tree diagrams
    frequently used to illustrate the formation of clusters
⚫   The y-axis in dendrograms measures the closeness (or
    similarity) of clusters.
    High level Programming language
⚫   A high-level language (HLL) is a programming language that
    enables a programmer to write programs that are independent
    of a particular type of computer.
⚫   They are closer to human languages and further from machine
    languages.
⚫   Assembly languages are considered low-level because they are
    very close to machine languages.
⚫   Advantages of high-level language
⚫   High-level languages are programmer-friendly. They are easy to
    write, debug and maintain.
⚫   It provides a higher level of abstraction from machine languages.
⚫   It is a machine-independent language.
⚫   Easy to learn.
⚫   Less error-prone, easy to find and debug errors.
⚫   High-level programming results in better programming
    productivity.
    Integrated Development
    Environment (IDE )
⚫   An IDE enables programmers to consolidate the different
    aspects of writing a computer program.
⚫   It is software for building applications that combines common
    developer tools into a single graphical user interface (GUI)
⚫   Development tools include text editors, code libraries,
    compilers and test platforms
⚫   An IDE typically offers
    • a text editor,
    • automated code validation,
    • syntax highlighting,
    • auto completion,
    • contextual suggestions,
    • easy access to help, and
    • debugging tools