Unit 1.
INTRODUCTION TO DATA SCIENCE
  Learning Objectives
  •Introduction to core concepts and technologies
  •Familiarity with terminology related to data science
  •Working through the Data Science process
  •Getting acquainted with various popular Data Science toolkits
  •Types of data dealt with in Data Science
  •Familiarity with example applications
              1.1 DATA SCIENCE
• Interdisciplinary field of scientific methods, processes
  and systems to extract knowledge or insights from data
  in structured or unstructured form, similar to data mining
• Data Science is the science which uses computer
  science, statistics, machine learning, visualization
  and human-computer interaction to collect, clean,
  integrate, analyze, visualize and interact with data to
  create data products
Data Science as convergence of various knowledge
                    domains
Discipline of using quantitative
methods from Statistics and
Mathematics with Technology
Broad Canvas of Data Science Dealing with
                Big Data
    1.2 TERMINOLOGY RELATED TO DATA
                 SCIENCE
Big Data is a term used to describe a collection of
  data that is huge in size and still growing
  exponentially with time.
  – Characteristics of Big Data
       Volume, Velocity, Variety, Veracity, Value
• Structured data –
  Structured data is data whose elements are addressable for effective
  analysis. It has been organized into a formatted repository, typically
  a database.
• Semi-Structured data –
  Semi-structured data is information that does not reside in a relational
  database but has some organizational properties that make it
  easier to analyze. With some processing, it can be stored in a relational
  database.
• Unstructured data –
  Unstructured data is data that is not organized in a predefined
  manner or does not have a predefined data model, so it is not a good
  fit for a mainstream relational database.
Property                Structured data               Semi-structured data            Unstructured data
Technology              Based on relational           Based on XML/RDF (Resource      Based on character and
                        database tables               Description Framework)          binary data
Transaction management  Mature transaction            Transaction management          No transaction management
                        management and various        adapted from DBMS, not          and no concurrency control
                        concurrency techniques        mature
Version management      Versioning over tuples,       Versioning over tuples or       Versioned as a whole
                        rows and tables               graphs is possible
Flexibility             Schema dependent and          More flexible than structured   More flexible; there is
                        less flexible                 data but less flexible than     no schema
                                                      unstructured data
Scalability             Very difficult to scale       Scaling is simpler than for     More scalable
                        the DB schema                 structured data
Robustness              Very robust                   New technology, not very        -
                                                      widespread
Query performance       Structured queries allow      Queries over anonymous nodes    Only textual queries are
                        complex joins                 are possible                    possible
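The definition of semi-structured data above notes that, with some processing, it can be stored in a relational database. The sketch below illustrates one way this might look, using Python's standard `json` and `sqlite3` modules. The records, the `person` table, and the `to_row` helper are hypothetical names chosen for illustration only.

```python
import json
import sqlite3

# Hypothetical semi-structured records: the fields vary from record to record.
raw_records = [
    '{"id": 1, "name": "Asha", "email": "asha@example.com"}',
    '{"id": 2, "name": "Ravi", "phones": ["555-0101", "555-0102"]}',
]

def to_row(record_text):
    """Map one JSON record onto the fixed (id, name, email) schema."""
    record = json.loads(record_text)
    # A field missing from the record becomes NULL; extra fields are dropped.
    return (record["id"], record["name"], record.get("email"))

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?, ?)", [to_row(r) for r in raw_records])

rows = conn.execute("SELECT id, name, email FROM person ORDER BY id").fetchall()
print(rows)
```

The fixed schema is what makes the data "structured": the second record loses its `phones` field and gains a NULL `email`, which is the flexibility trade-off listed in the table.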
Business Intelligence-technology that uses
  transformed and loaded historical data to
  generate reports.
-Helps executives, managers & other corporate
  end users make informed business decisions
Business Intelligence & Big Data
• Data Analytics-collecting, processing and performing statistical
  analysis of data
• Data Wrangling-Data wrangling is the process of cleaning
  and unifying messy and complex data sets for easy
  access and analysis.
The key steps to data wrangling:
• Data Acquisition: Identify and obtain access to the data
  within your sources
• Joining Data : Combine the edited data for further use
  and analysis
• Data Cleansing: Redesign the data into a
  usable/functional format and correct/remove any bad
  data
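The three wrangling steps above can be sketched in a few lines of standard-library Python. This is only an illustration under assumptions: the two CSV sources, their column names, and the helper functions are hypothetical, and the data is cleansed before it is joined so that only edited data is combined.

```python
import csv
import io

# Hypothetical raw sources; in practice these would be files, APIs, or databases.
CUSTOMERS_CSV = "customer_id,name\n1, Asha \n2,Ravi\n3,\n"
ORDERS_CSV = "customer_id,amount\n1,250.0\n2,n/a\n1,100.5\n"

def acquire(text):
    """Data acquisition: read one source into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def cleanse_customers(rows):
    """Data cleansing: trim whitespace and drop rows without a name."""
    cleaned = {}
    for row in rows:
        name = row["name"].strip()
        if name:
            cleaned[row["customer_id"]] = name
    return cleaned

def cleanse_orders(rows):
    """Data cleansing: convert amounts to float and drop unparseable values."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append((row["customer_id"], float(row["amount"])))
        except ValueError:
            continue  # bad value such as "n/a"
    return cleaned

def join(customers, orders):
    """Joining data: total order amount per known customer name."""
    totals = {}
    for customer_id, amount in orders:
        if customer_id in customers:
            name = customers[customer_id]
            totals[name] = totals.get(name, 0.0) + amount
    return totals

customers = cleanse_customers(acquire(CUSTOMERS_CSV))
orders = cleanse_orders(acquire(ORDERS_CSV))
totals = join(customers, orders)
print(totals)
```

Real projects would more likely use a library such as pandas for these steps; plain Python is used here only to keep each step visible.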
           Goals of data wrangling:
• Reveal a “deeper intelligence” within your data,
  by gathering data from multiple sources
• Put accurate, actionable data in the hands
  of business analysts in a timely manner
• Reduce the time spent collecting and
  organizing unruly data before it can be utilized
• Enable data scientists and analysts to focus on
  the analysis of data, rather than the wrangling
• Drive better decision-making by senior
  leaders in an organization
• Algorithm-Series of repeatable steps for
  carrying out a certain type of task with data.
• Machine Learning-Machine learning is an
  application of artificial intelligence (AI) that
  provides systems the ability to automatically
  learn and improve from experience without
  being explicitly programmed
• Web Analytics-Web analytics is the process of
  analyzing the behaviour of visitors to a
  Web site.
 The use of Web analytics is said to enable a
  business to attract more visitors, retain or
  attract new customers for goods or services,
  or to increase the dollar volume each
  customer spends.
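To make the machine learning definition above concrete, the following minimal sketch "learns" a straight line from example data by ordinary least squares instead of having the rule programmed by hand. The data (hours studied versus exam score) and the function name `fit_line` are made up for illustration.

```python
def fit_line(xs, ys):
    """Learn slope and intercept from examples by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical experience: hours studied vs. exam score.
hours = [1.0, 2.0, 3.0, 4.0]
scores = [52.0, 54.0, 56.0, 58.0]

slope, intercept = fit_line(hours, scores)
predicted = slope * 5.0 + intercept  # prediction for an unseen input
print(slope, intercept, predicted)
```

The relationship between hours and score is never written into the program; it is estimated from the examples, and adding more examples would refine the estimate.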
  1.3 METHODS OF DATA REPOSITORY
• Data Repository refers to an enterprise data
  storage entity into which data has been
  partitioned for an analytical or reporting
  purpose.
• Data Lakes
• Data Marts
• Data Warehouses (DW)
• Big Data and Hadoop
• Data Lake-is a storage repository that holds a vast
  amount of raw data in its native format until it is
  needed and refined elsewhere
 Characteristics-
1. All data is loaded from source systems. No data is
   turned away
2. Data is stored at the leaf level in an untransformed
   or nearly untransformed state
3. Data is transformed and a schema is applied to fulfil
   the needs of analysis
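The three data lake characteristics above can be illustrated with a small sketch: everything is ingested as-is, and a schema is applied only when a particular analysis reads the data. The in-memory list `lake`, the source names, and the field names `id` and `amt` are assumptions made for this example, not part of any real data lake product.

```python
import json

# Hypothetical lake: raw payloads kept exactly as the source systems sent them.
lake = []

def ingest(source, payload):
    """Load data untransformed; no record is rejected at write time."""
    lake.append({"source": source, "raw": payload})

def read_orders():
    """Apply a schema only at read time, for one specific analysis."""
    orders = []
    for item in lake:
        if item["source"] != "orders":
            continue
        record = json.loads(item["raw"])
        orders.append({"order_id": int(record["id"]), "amount": float(record["amt"])})
    return orders

ingest("orders", '{"id": "17", "amt": "99.50", "note": "gift"}')
ingest("clickstream", "GET /home 200")  # free text is stored as well
ingest("orders", '{"id": "18", "amt": "12"}')

orders = read_orders()
print(orders)
```

A different analysis could read the same raw payloads with a different schema, which is the main contrast with a warehouse where the schema is fixed before loading.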
Data Warehouse-is constructed by integrating data from
  multiple heterogeneous sources to support analytical
  reporting, structured or ad hoc queries and decision
  making.
Understanding a DW-
1. Kept separate from the organization's operational database
2. Not updated frequently
3. Contains historical data
4. Helps integrate a diversity of application systems
5. Helps in consolidated historical data analysis
                        Data Warehouse Models
Virtual Warehouse-a view over operational databases.
   Easy to build, but requires excess capacity on operational database servers.
Data Mart-subset of organization-wide data
Characteristics-
1.    Windows-based or Unix/Linux-based servers are used to implement data marts
2.    The data mart cycle is measured in short periods of time
3.    Small in size
4.    Customized by department
5.    The source of a data mart is a departmentally structured DW
6.    It is flexible
Enterprise Warehouse-collects all the information and the subjects spanning an entire
      organization
Provides enterprise-wide data integration
Integrated from operational systems and external information providers
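The relationship between an enterprise warehouse and a data mart can be sketched with Python's built-in `sqlite3` module. The table `warehouse_sales`, the view `marketing_mart`, and the figures are hypothetical. A SQL view is used for the mart only to keep the example short; a real data mart is often a separately stored subset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Enterprise warehouse: historical data spanning the whole organization.
conn.execute("CREATE TABLE warehouse_sales (department TEXT, year INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [
        ("marketing", 2022, 120.0),
        ("marketing", 2023, 150.0),
        ("finance", 2022, 300.0),
        ("finance", 2023, 310.0),
    ],
)

# Data mart: a department-specific subset of the organization-wide data.
conn.execute(
    "CREATE VIEW marketing_mart AS "
    "SELECT year, revenue FROM warehouse_sales WHERE department = 'marketing'"
)

mart_rows = conn.execute("SELECT year, revenue FROM marketing_mart ORDER BY year").fetchall()
total = conn.execute("SELECT SUM(revenue) FROM warehouse_sales").fetchone()[0]
print(mart_rows, total)
```

The marketing team queries only its own small slice, while enterprise-wide questions such as total revenue are still answered from the full warehouse table.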
              1.5 Types of Data
• Unstructured data: Heterogeneous data sources
  containing a combination of simple text files, images,
  videos etc., e.g. Word, PDF, text, media logs
• Semi-structured data: e.g. web pages written in
  HTML, and XML data
• Structured data: stored, accessed and processed in a
  fixed format, e.g. relational data
1.6 Data Science Process
• Frame or define the (domain) problem
• Collect the raw data needed for your problem
• Prepare the data: process it for analysis
• Explore the data
• Perform in-depth analysis and produce
  prescriptive business insights
• Evaluation
• Visualization and Communication of Result of the
  analysis
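The process steps above can be traced end to end on a toy problem. The sketch below is deliberately simplistic and rests on assumptions: the sales figures are invented, the "model" is just the mean of the first three observations, and mean absolute error is chosen as the evaluation measure.

```python
import statistics

# 1. Frame the problem (hypothetical): estimate typical daily sales.
# 2. Collect the raw data (hard-coded here; None marks a missing reading).
raw_sales = [100.0, None, 110.0, 90.0, 105.0, None, 95.0]

# 3. Prepare: drop missing values.
sales = [value for value in raw_sales if value is not None]

# 4. Explore: basic summary statistics.
summary = {
    "count": len(sales),
    "mean": statistics.mean(sales),
    "stdev": statistics.stdev(sales),
}

# 5. Analyze: use the mean of the first three days as a naive forecast.
train, test = sales[:3], sales[3:]
forecast = statistics.mean(train)

# 6. Evaluate: mean absolute error on the held-out days.
mae = statistics.mean(abs(actual - forecast) for actual in test)

# 7. Communicate the result.
report = f"Forecast {forecast:.1f} per day, mean absolute error {mae:.1f}"
print(report)
```

In a real project each numbered step would be far larger, but the order and the hand-off between steps stay the same.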
               1.7 Data Science Project’s Life Cycle
High-level phases of CRISP-DM suggested for Data Science
                             Projects
•   Business Understanding
•   Data Acquisition and Understanding
•   Modeling
•   Deployment
•   Customer acceptance
Life cycle of Data Science Process
         1.8 Popular Data Science Toolkit
• R Programming Language
• Python
• KNIME-open-source analytics platform for data reporting, mining and
  predictive analysis
• SQL
• Apache Hadoop and Big Data tools
   Apache Mahout-an environment for building scalable machine learning
   algorithms
   Apache Spark-cluster computing framework for data analysis
   Impala-MPP (massively parallel processing) database for Apache Hadoop
   Apache Storm-computational platform for real-time analytics
   MongoDB-NoSQL database offering scalability and high performance
• TensorFlow-open-source library for dataflow programming across a range of tasks.
                    1.9 Familiarity with Example Applications
1. Airline Route Planning-predict flight delays,
   decide which class of airplane to buy, decide whether to
   fly directly to the destination or take a halt in
   between, and effectively drive customer loyalty
   programs
2. Fraud & Risk Detection-customer profiling,
   past expenditure
3. Delivery Logistics-improve operational
   efficiency.
4. Uber’s Taxi Service-smartphone app-based taxi booking service
5. Price Comparison Websites-compare the price of a product from
    multiple vendors in one place.
6. People Analytics-application of analytics that helps companies
    manage human resources
7. Portfolio Analytics-make decisions on when to lend money
8. Risk Analytics-risk scores for individual customers
9. Digital Analytics-business & technical activities that define, create,
    collect, verify or transform digital data into reporting, research,
    analysis, recommendations, optimizations, predictions and
    automations
10. Security Analytics-event management and user behavior
    analytics
Marketing, Finance, Human Resources, Health Care, Government