
BCA SEM VI Unit 1 Elective 2: Introduction to Data Science

Data Evolution - Data to Data Science


In general, data is a collection of characters, numbers, and other symbols that represents values of some situation or variable. "Data" is plural; the singular form is "datum". Using computers, data are stored in electronic form because electronic data processing is faster and easier than manual data processing done by people. The Information and Communication Technology (ICT) revolution, led by computers, mobile devices and the Internet, has resulted in the generation of large volumes of data at a very fast pace. The following list contains some examples of data that
we often come across.
 Name, age, gender, contact details, etc., of a person
 Transactions data generated through banking, ticketing, shopping, etc. whether online or
offline
 Images, graphics, animations, audio, video
 Documents and web pages
 Online posts, comments and messages
 Signals generated by sensors
 Satellite data including meteorological data, communication data, earth observation data,
etc.
Data is usually organized into structures such as tables that provide additional context and meaning,
and which may themselves be used as data in larger structures. Data may be used as variables in a
computational process.
Data may represent abstract ideas or concrete measurements. Data is commonly used in scientific
research, economics, and in virtually every other form of human organizational activity. Examples
of data sets include price indices (such as consumer price index), unemployment rates, literacy rates,
and census data. In this context, data represents the raw facts and figures which can be processed to extract useful information.
Data can be seen as the smallest units of factual information that can be used as a basis for
calculation, reasoning, or discussion. Data can range from abstract ideas to concrete measurements,
including but not limited to, statistics. Thematically connected data presented in some relevant
context can be viewed as information. Contextually connected pieces of information can then be
described as data insights or intelligence. The stock of insights and intelligence that accumulates
over time resulting from the synthesis of data into information, can then be described as knowledge.

Type of Data
Data is broadly classified into four categories:

 Nominal data
 Ordinal data
 Discrete data
 Continuous data

Qualitative or Categorical Data


Qualitative data, also known as categorical data, describes data that fits into categories. Qualitative data is not numerical. Categorical information involves categorical variables that describe features such as a person’s gender, home town, etc. Categorical measures are defined in terms of natural language specifications, not in terms of numbers.

Nominal Data
Nominal data is a type of qualitative information which labels variables without providing any numerical value. Nominal data is also called the nominal scale. It cannot be ordered or measured, although sometimes the same data can be treated as both qualitative and quantitative. Examples of nominal data are letters, symbols, words, gender, etc.
Nominal data is examined using the grouping method. In this method, the data is grouped into categories, and then the frequency or percentage of each category can be calculated. Such data is visually represented using pie charts.

Ordinal Data
An ordinal data/variable is a type of data that follows a natural order. The significant feature of ordinal data is that the difference between the data values is not determined: the categories can be ranked, but the gaps between them are not meaningful. This type of variable is mostly found in surveys, finance, economics, questionnaires, and so on.
The ordinal data is commonly represented using a bar chart. These data are investigated and
interpreted through many visualisation tools. The information may be expressed using tables in
which each row in the table shows the distinct category.

Quantitative or Numerical Data


Quantitative data is also known as numerical data which represents the numerical value (i.e., how
much, how often, how many). Numerical data gives information about the quantities of a specific
thing. Some examples of numerical data are height, length, size, weight, and so on. The quantitative
data can be classified into two different types based on the data sets. The two different classifications
of numerical data are discrete data and continuous data.

Discrete Data
Discrete data can take only discrete values. Discrete information contains only a finite number of
possible values. Those values cannot be subdivided meaningfully. Here, things can be counted in
whole numbers.
Example: Number of students in the class

Continuous Data
Continuous data is data that can be measured on a scale rather than counted. It has an infinite number of possible values that can be selected within a given specific range.
Example: Temperature range
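
To make the four categories concrete, the following is a small illustrative sketch in Python using pandas; the column names and values are invented for the example and are not from the text above.

import pandas as pd

# Hypothetical sample records illustrating the four data categories
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female"],       # nominal: labels with no order
    "satisfaction": ["Low", "High", "Medium"],    # ordinal: ordered categories
    "num_courses": [3, 5, 4],                     # discrete: countable whole numbers
    "temperature_c": [36.6, 37.2, 36.9],          # continuous: measurable within a range
})

# Nominal data is summarised with frequencies (the grouping method)
print(df["gender"].value_counts())

# Ordinal data keeps a natural order
df["satisfaction"] = pd.Categorical(
    df["satisfaction"], categories=["Low", "Medium", "High"], ordered=True
)
print(df["satisfaction"].sort_values())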

The Evolution of Data


Change is the only constant in life, and technology, civilizations, and culture have evolved over
history. What has not changed are facts.

With the passage of time and the evolution of technologies, civilizations, and culture, the
methodologies used to capture, store, process, and use facts have evolved. Similarly, data (a
representation of facts) and data management have had their own evolution cycles and they continue
to evolve.

Until the advent of computers, limited facts were documented, given the expense and scarcity of
resources and effort to store and maintain them. In ancient times, it was not uncommon for
knowledge to be transferred from one generation to another by the process of oral learning. The
oral tradition is a contrast to the current digital age, which has elaborate document and content
management systems that store knowledge in the form of documents and records.

Different Sources of Data for Data Analysis


Data collection is the process of acquiring, collecting, extracting, and storing voluminous amounts of data, which may be in structured or unstructured form such as text, video, audio, XML files, records, or image files, to be used in later stages of data analysis.
In the process of big data analysis, “Data collection” is the initial step before starting to analyze the
patterns or useful information in data. The data which is to be analyzed must be collected from
different valid sources.

The data which is collected is known as raw data. Raw data is not useful on its own, but after cleaning out the impurities and analyzing it further, it becomes information; the insight obtained from that information is known as “knowledge”. Knowledge can take many forms, such as business knowledge about sales of enterprise products, knowledge about disease treatment, etc. The main goal of data collection is to collect information-rich data.

Data collection starts with asking some questions, such as what type of data is to be collected and what the source of collection is. Most of the data collected is of two types: “qualitative data”, which is non-numerical data such as words and sentences and mostly focuses on the behaviour and actions of a group, and “quantitative data”, which is in numerical form and can be calculated using different scientific tools and sampling methods.

The actual data is then further divided mainly into two types known as:
Primary data
The data which is raw, original, and extracted directly from official sources is known as primary data. This type of data is collected directly by performing techniques such as questionnaires, interviews, and surveys. The data collected must be according to the demand and requirements of the target audience on which the analysis is performed; otherwise it becomes a burden in data processing.
1. Interview method:
2. Survey method:
3. Observation method:
4. Experimental method:
Secondary data:
Secondary data is data which has already been collected and is reused for some valid purpose. This type of data is previously derived from primary data, and it has two types of sources, namely internal sources and external sources.
Internal source:
External source:
Other sources:
Sensors data
Satellites data
Web traffic

Preparing and gathering data and knowledge


What is data preparation?
Data preparation is the process of cleaning and transforming raw data prior to processing and
analysis. It is an important step prior to processing and often involves reformatting data, making
corrections to data, and combining datasets to enrich data.
Data preparation is often a lengthy undertaking for data engineers or business users, but it is
essential as a prerequisite to put data in context in order to turn it into insights and eliminate bias
resulting from poor data quality.

Data preparation steps


The specifics of the data preparation process vary by industry, organization, and need, but the workflow remains largely the same.

1. Gather data
The data preparation process begins with finding the right data. This can come from an existing data
catalog or data sources can be added ad-hoc.
2. Discover and assess data
After collecting the data, it is important to discover each dataset. This step is about getting to know
the data and understanding what has to be done before the data becomes useful in a particular
context.
3. Cleanse and validate data
Cleaning up the data is traditionally the most time-consuming part of the data preparation process,
but it’s crucial for removing faulty data and filling in gaps. Important tasks here include:

 Removing extraneous data and outliers


 Filling in missing values
 Conforming data to a standardized pattern
 Masking private or sensitive data entries
Once data has been cleansed, it must be validated by testing for errors in the data preparation process
up to this point. Often, an error in the system will become apparent during this validation step and
will need to be resolved before moving forward.
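
As a rough illustration of the cleansing tasks listed above, here is a minimal pandas sketch; the dataframe, column names, age limits and masking rule are assumptions made for the example, not part of the original text.

import pandas as pd

df = pd.DataFrame({
    "age": [25, -3, 41, None, 200],
    "city": ["Pune", "pune ", "Mumbai", "Pune", "Mumbai"],
    "email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com", "e@x.com"],
})

# Remove extraneous data and outliers: keep plausible ages only
df = df[df["age"].isna() | df["age"].between(0, 120)].copy()

# Fill in missing values with the median age
df["age"] = df["age"].fillna(df["age"].median())

# Conform data to a standardized pattern
df["city"] = df["city"].str.strip().str.title()

# Mask private or sensitive data entries
df["email"] = df["email"].str.replace(r"^[^@]+", "***", regex=True)

# Validate: fail loudly if any rule is still violated
assert df["age"].between(0, 120).all()
print(df)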

4. Transform and enrich data


Data transformation is the process of updating the format or value entries in order to reach a well-
defined outcome, or to make the data more easily understood by a wider audience. Enriching data
refers to adding and connecting data with other related information to provide deeper insights.
5. Store data
Once prepared, the data can be stored or channeled into a third party application — such as a
business intelligence tool — clearing the way for processing and analysis to take place.

Soft Skills for Data Scientists


Soft skills are essential for data scientists to succeed in their careers, especially in the early stages. We want to introduce soft skills for data scientists before discussing technical components.

Comparison between Statistician and Data Scientist



Data and Analytics

Data scientists usually have a good sense of data and analytics, but data science projects are much
more than that. A data science project may involve people with different roles, especially in a large
company:
• The business owner or leader who identifies business problem and value;
• The data owner and computation resource/infrastructure owner from the IT department;
• A dedicated policy owner to make sure the data and model are under model governance, security
and privacy guidelines and laws;
• A dedicated engineering team to implement, maintain and refresh the model;
The entire team usually will have multiple rounds of discussion of resource allocation among groups
at the beginning of the project and during the project.

Effective communication and in-depth domain knowledge about the business problem are essential
requirements for a successful data scientist. A data scientist may interact with people at various
levels, from senior leaders who set the corporate strategies to front-line employees who do the daily
work.

Three Pillars of Knowledge


It is well known there are three pillars of essential knowledge for a successful data scientist.
(1) Analytics knowledge and toolsets
A successful data scientist needs to have a strong technical background in data mining, statistics,
and machine learning.
The in-depth understanding of modelling with insight about data enables a data scientist to convert
a business problem to a data science problem.
Many chapters of this book focus on analytics knowledge and toolsets.

(2) Domain knowledge and collaboration


A successful data scientist needs in-depth domain knowledge to understand the business problem
well. For any data science project, the data scientist needs to collaborate with other team members.
Communication and leadership skills are critical for data scientists during the entire project cycle,
especially when there is only one scientist in the project. The scientist needs to decide the timeline

and impact with uncertainty.

(3) (Big) data management and (new) IT skills


The last pillar is about computation environment and model implementation in a big data platform.
It used to be the most difficult one for a data scientist with a statistics background.

Data Science Project Cycle


A data science project has various stages. Many textbooks and blogs focus on one or two specific
stages, and it is rare to see an end-to-end life cycle of a data science project.
To get a good grasp of the end-to-end process requires years of real-world experience.

Types of Data Science Projects


People often use data science projects to describe any project that uses data to solve a business
problem, including traditional business analytics, data visualization, or machine learning modelling.
The types of data used and the final model development define the different kinds of data science
projects.

Offline and Online Data


There are offline and online data.
Offline data are historical data stored in databases or data warehouses. With the development of
data storage techniques, the cost to store a large amount of data is low. Offline data are versatile
and rich in general.
Online data are real-time information that flows to models to make automatic actions. Real-time
data can frequently change (for example, the keywords a customer is searching for can change at
any given time). Capturing and using real-time online data requires integrating a machine learning model into the production infrastructure. This used to be a steep learning curve for data scientists not familiar with computer engineering, but cloud infrastructure makes it much more manageable. Based on the offline/online nature of the data and the model properties, data science projects fall into the types described below.

Offline Training and Offline Application


This type of data science project is for a specific business problem that needs to be solved once or
multiple times. But the dynamic and disruptive nature of this type of business problem requires
substantial work every time. One example of such a project is “whether a brand-new business
workflow is going to improve efficiency.” In this case, we often use internal/external offline data
and business insight to build models. The final results are delivered as a report to answer the specific
business question. It is similar to the traditional business intelligence project but with more focus
on data and models. Sometimes the data size and model complexity are beyond the capacity of a
single computer. Then we need to use distributed storage and computation.

Offline Training and Online Application


Another type of data science project uses offline data for training and applies the trained model to
real-time online data in the
production environment. For example, we can use historical data to train a personalized
advertisement recommendation model that
provides real-time ad recommendations. The model training uses historical offline data. The trained model then takes customers’ online real-time data as input features and runs in real time to produce an automatic action. The model training is very similar to the “offline training, offline application” project, but putting the trained model into production imposes specific requirements. For example, the features used in offline training have to be available online in real time, and the model’s online run time has to be short enough not to impact the user experience.

Online Training and Online Application


Some business problems are so dynamic that even yesterday’s data is out of date. In this case,
we can use online data to train the model and apply it in real-time. We call this type of data science
project “online training, online application.” This type of data science project requires high
automation and low latency.

Common Mistakes in Data Science


Data science projects can go wrong at different stages in many ways. Most textbooks and online
blogs focus on technical mistakes about machine learning models, algorithms, or theories, such as
detecting outliers and overfitting. It is important to avoid these technical mistakes; however, mistakes can also occur at the non-technical stages of a project:
 Problem Formulation Stage
 Project Planning Stage
 Project Modeling Stage
 Model Implementation and Post-Production Stage

Summary of Common Mistakes


The data science project is a combination of art, science, and engineering. A data science project
may fail in different ways. However, the data science project can provide significant business value
if we put data and business context at the center of the project, get familiar with the data science
project cycle and proactively identify and avoid these potential mistakes. Here is the summary of
the mistakes:
• Solving the wrong problem
• Overpromise on business value
• Too optimistic about the timeline
• Too optimistic about data availability and quality
• Unrepresentative data
• Overfitting and obsession with complicated models
• Take too long to fail
• Missing A/B testing
• Fail to scale in real-time applications
• Missing necessary online checkup

Big Data Cloud Platform


People used to store data on paper, tapes, diskettes, or hard drives. Only recently, with the development of computer hardware and software, have the volume, variety, and speed of data exceeded the capacity of a traditional statistician or analyst.

In the past few years, by utilizing commodity hardware and open-source software, people created a
big data ecosystem for data storage, data retrieval, and parallel computation. Hadoop and Spark

have become a popular platform that enables data scientists, statisticians, and analysts to access the
data and to build models. Programming skills in the big data platform have been an obstacle for a
traditional statistician or analyst to become a successful data scientist.

Cloud computing reduces the difficulty significantly. The user interface of the data platform is much more friendly today, and much of the technical detail is pushed to the background. Today’s cloud systems also enable quick implementation of the production environment. Data science now emphasizes the data itself, and the models and algorithms on top of the data, rather than the platform, infrastructure and low-level programming such as Java.

Power of Cluster of Computers


We are familiar with our laptop/desktop computers which have three main components to do data
computation:
(1) Hard disk,
(2) Memory, and
(3) CPU

The data and code stored on the hard disk are slow to read and write, but the hard disk has a large capacity of around a few TB in today’s market. Memory is fast to read and write, but with a small capacity on the order of a few dozen GB in today’s market.

Evolution of Cluster Computing


Using computer clusters to solve general-purpose data and analytics problems needs a lot of effort if we have to specifically control every element and step, such as data storage, memory allocation, and parallel computation. Fortunately, high-tech companies and open-source communities have
developed the entire ecosystem based on Hadoop and Spark. Users need only to know high-level
scripting languages such as Python and R to leverage computer clusters’ distributed storage,
memory and parallel computation power.

Hadoop
The very first problem internet companies faced was how to better store the large amount of data they had collected for future analysis. Google developed its own file system to provide efficient, reliable access to data using large clusters of commodity hardware.

The open-source version is known as the Hadoop Distributed File System (HDFS). Both systems use MapReduce to allocate computation across computation nodes on top of the file system. Hadoop is written in Java, and writing MapReduce jobs in Java is the most direct way to interact with Hadoop, but Java is not familiar to many in the data and analytics community.

Spark
Spark works on top of a distributed file system including HDFS with better data and analytics
efficiency by leveraging in-memory operations. Spark is more tailored for data processing and
analytics and the need to interact with Hadoop directly is greatly reduced.
The Spark system includes an SQL-like framework called Spark SQL and a parallel machine learning library called MLlib. Fortunately for many in the analytics community, Spark also supports R and Python. We can interact with data stored in a distributed file system using parallel computing across nodes easily with R and Python through the Spark API, and do not need to worry about lower-level details of distributed computing. We will introduce how to use an R notebook to drive Spark computations.
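
Since Spark also supports Python, the following is a minimal PySpark sketch of the same idea; the file path, view name and column names are hypothetical and not from the text.

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; on a cluster this connects to distributed workers
spark = SparkSession.builder.appName("intro-example").getOrCreate()

# Read a file from a distributed file system such as HDFS
df = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

# Use the Spark SQL framework mentioned above
df.createOrReplaceTempView("sales")
result = spark.sql(
    "SELECT region, AVG(amount) AS avg_amount FROM sales GROUP BY region"
)
result.show()

spark.stop()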

Introduction of Cloud Environment


Even though Spark provides a solution for big data analytics, the maintenance of the computing
cluster and Spark system requires a dedicated team.

There are many cloud computing environments such as Amazon’s AWS, Google cloud and
Microsoft Azure which provide a complete list of functions for heavy-duty enterprise applications.
For example, Netflix runs its business entirely on AWS without owning any data centers. For
beginners, however, Databricks provides an easy to use cloud system for learning purposes.
Databricks is a company founded by the creators of Apache Spark, and it provides a user-friendly web-based notebook environment that can create a Spark cluster on the fly to run R / Python / Scala / SQL scripts.

Databases and SQL


Databases have been around for many years to efficiently organize, store, retrieve, and update data
systematically. In the past, statisticians and analysts usually dealt with small datasets stored in text
or spreadsheet files and often did not interact with database systems. Students from the traditional
statistics department usually lack the necessary database knowledge.

As data grows bigger, database knowledge becomes essential for statisticians, analysts and data scientists in an enterprise environment, where data are stored in some form of database system.

Databases often contain a collection of tables and the relationship among these tables (i.e. schema).
The table is the fundamental structure for databases that contain rows and columns similar to data
frames in R or Python. Database management systems (DBMS) ensure data integrity and security in real-time operations.

There are many different DBMS such as Oracle, SQL Server, MySQL, Teradata, Hive, Redshift
and Hana. The majority of database operations are very similar among different DBMS, and
Structured Query Language (SQL) is the standard language to use these systems.

Database, Table and View


A database is a collection of tables that are related to each other. A database has its own database name, and each table has its name as well. We can think of a database as a “folder”, where the tables within the database are “files” within the folder. A table has rows and columns exactly like an R or
Python pandas data frame. Each row (also called record) represents a unique instance of the subject
and each column (also called field or attribute) represents a characteristic of the subject on the table.
For each table, there is a special column called the primary key which uniquely identifies each of
its records.

Basic SQL Statement


There are a few very easy SQL statements that help us understand database and table structure:

show databases: show the current databases in the system
create database db_name: create a new database with the name db_name
drop database db_name: delete the database db_name (be careful when using it!)
use db_name: set the current database to be used
show tables: show all the tables within the currently used database
describe tbl_name: show the structure of the table named tbl_name (i.e. the list of column names and data types)
drop table tbl_name: delete the table named tbl_name (be careful when using it!)
select * from metrics limit 10: show the first 10 rows of the table metrics
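
As a small, hedged illustration of issuing such statements from a program, here is a Python sketch using the built-in sqlite3 module. The database file and the metrics table are hypothetical, and SQLite does not support every statement listed above (for example, show databases), but the SELECT works the same way.

import sqlite3

conn = sqlite3.connect("example.db")   # hypothetical database file
cur = conn.cursor()

# Create and populate a small table named "metrics"
cur.execute("CREATE TABLE IF NOT EXISTS metrics (id INTEGER PRIMARY KEY, value REAL)")
cur.executemany("INSERT INTO metrics (value) VALUES (?)", [(1.5,), (2.0,), (3.7,)])
conn.commit()

# Rough equivalent of "describe metrics" in SQLite
print(cur.execute("PRAGMA table_info(metrics)").fetchall())

# select * from metrics limit 10: show the first 10 rows of the table
for row in cur.execute("SELECT * FROM metrics LIMIT 10"):
    print(row)

conn.close()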

Data Pre-processing
Data preprocessing is the process of converting raw data into clean data that is proper for modeling.
A model can fail for various reasons. One is that the modeller doesn’t correctly pre-process the data before modelling. Data pre-processing can significantly impact model results, for example through imputing missing values and handling outliers, so data pre-processing is a very critical step.

In real life, depending on the stage of data cleanup, data has the following types:
1. Raw data
2. Technically correct data
3. Data that is proper for the model
4. Summarized data
5. Data with fixed format

The raw data is the first-hand data that analysts pull from the database, market survey responses from your clients, the experimental results collected by the research and development department, and so on.

Technically correct data is data that, after preliminary cleaning or format conversion, R (or another tool you use) can successfully import.

Once we have loaded the data into R with reasonable column names, variable formats and so on, that does not mean the data is entirely correct. There may be some observations that do not make sense, such as a negative age, a discount percentage greater than 1, or missing data. Depending on the situation, there may be a variety of problems with the data. It is necessary to clean the data before modeling.
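
The checks described above (negative age, discount greater than 1, missing values) can be written as a short pandas sketch; the column names and the chosen clean-up policy are assumptions for illustration only.

import pandas as pd

df = pd.DataFrame({
    "age": [34, -2, 51, None],
    "discount": [0.10, 0.25, 1.40, 0.05],
})

# Flag observations that do not make sense
problems = pd.DataFrame({
    "negative_age": df["age"] < 0,
    "discount_above_1": df["discount"] > 1,
    "missing_age": df["age"].isna(),
})
print(problems)

# One simple policy: drop rows with impossible values, keep missing values for later imputation
clean = df[~(problems["negative_age"] | problems["discount_above_1"])]
print(clean)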

Data aggregation is also necessary for presentation, or for data visualization.

Data analysts will take the results from data scientists and adjust the format, such as labels, cell colors, and highlighting. It is important for a data scientist to make sure the results look consistent, which makes the next step easier for data analysts.
It is highly recommended to store each step of the data and the R code, making the whole process
as repeatable as possible. The R markdown reproducible report will be extremely helpful for that.
If the data changes, it is easy to rerun the process.

What Is Data Wrangling?


Data wrangling is the process of removing errors and combining complex data sets to make them
more accessible and easier to analyze. Due to the rapid expansion of the amount of data and data
sources available today, storing and organizing large quantities of data for analysis is becoming
increasingly necessary.

A data wrangling process, also known as a data munging process, consists of reorganizing,
transforming and mapping data from one "raw" form into another in order to make it more usable
and valuable for a variety of downstream uses including analytics.

Data wrangling can be defined as the process of cleaning, organizing, and transforming raw data
into the desired format for analysts to use for prompt decision-making. Also known as data
cleaning or data munging, data wrangling enables businesses to tackle more complex data in less
time, produce more accurate results, and make better decisions. The exact methods vary from
project to project depending upon your data and the goal you are trying to achieve. More and
more organizations are increasingly relying on data wrangling tools to make data ready for
downstream analytics.

Importance of Data Wrangling


The primary importance of using data wrangling tools can be described as:

 Making raw data usable. Accurately wrangled data guarantees that quality data is entered into
the downstream analysis.
 Getting all data from various sources into a centralized location so it can be used.
 Automated data integration tools are used as data wrangling techniques that clean and convert
source data into a standard format that can be used repeatedly according to end requirements.
 Cleansing the data of noise and flawed or missing elements
 Data wrangling acts as a preparation stage for the data mining process, which involves gathering
data and making sense of it.

Data Wrangling Tools


There are different tools for data wrangling that can be used for gathering, importing, structuring,
and cleaning data before it can be fed into analytics and BI apps.
You can use automated tools for data wrangling, where the software allows you to validate data
mappings and scrutinize data samples at every step of the transformation process. This helps to
quickly detect and correct errors in data mapping. Automated data cleaning becomes necessary in
businesses dealing with exceptionally large data sets.

Some examples of basic data Wrangling tools are:


Spreadsheets / Excel Power Query - It is the most basic manual data wrangling tool
OpenRefine - An automated data cleaning tool that requires programming skills
Tabula – It is a tool suited for all data types
Google DataPrep – It is a data service that explores, cleans, and prepares data
Data wrangler – It is a data cleaning and transforming tool

What are the Steps to Perform Data Wrangling?

Step 1: Data Discovery


Step 2: Data Structuring
Step 3: Data Cleaning
Step 4: Data Enriching
Step 5: Data Validating
Step 6: Data Publishing

Step 1: Data Discovery


The first step in the Data Wrangling process is Discovery. This is an all-encompassing term for
understanding or getting familiar with your data. You must take a look at the data you have and
think about how you would like it organized to make it easier to consume and analyze.

Step 2: Data Structuring


When raw data is collected, it’s in a wide range of formats and sizes. It has no definite structure,
which means that it lacks an existing model and is completely disorganized. It needs to be
restructured to fit in with the Analytical Model deployed by your business, and giving it a
structure allows for better analysis.

Unstructured data is often text-heavy and contains things such as Dates, Numbers, ID codes, etc.
At this stage of the Data Wrangling process, the dataset needs to be parsed.

Step 3: Data Cleaning


Data Cleaning involves Tackling Outliers, Making Corrections, Deleting Bad Data completely,
etc. This is done by applying algorithms to tidy up and sanitize the dataset.

Cleaning the data does the following:


 It removes outliers from your dataset that can potentially skew your results when analyzing
the data.
 It changes any null values and standardizes the data format to improve quality and
consistency.
 It identifies duplicate values and standardizes systems of measurements, fixes structural
errors and typos, and validates the data to make it easier to handle.

Step 4: Data Enriching


Enriching the data is an optional step that you only need to take if your current data doesn’t meet
your requirements.

Step 5: Data Validating


Validating the data is an activity that surfaces any issues in the quality of your data so they can be addressed with the appropriate transformations.
The rules of data validation require repetitive programming processes that help to verify the
following:
 Quality
 Consistency
 Accuracy
 Security
 Authenticity

Step 6: Data Publishing


All the steps are completed and the data is ready for analytics. All that’s left is to publish the
newly Wrangled Data in a place where it can be easily accessed and used by you and other
stakeholders.
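
To tie the six steps together, here is a compact, hedged pandas sketch of one pass through structuring, cleaning, enriching, validating and publishing; the file name, columns and the lookup table are all hypothetical.

import pandas as pd

# Steps 1-2: discover/structure raw records (an in-memory stand-in for a raw file)
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "order_date": ["2024-01-03", "2024-01-05", "2024-01-05", "not a date"],
    "customer_id": [10, 11, 11, 12],
    "amount": [120.0, 80.0, 80.0, None],
})
raw["order_date"] = pd.to_datetime(raw["order_date"], errors="coerce")  # parse dates

# Step 3: clean - drop duplicates and rows with unusable values
clean = raw.drop_duplicates().dropna(subset=["order_date", "amount"])

# Step 4: enrich - join with another (hypothetical) dataset
customers = pd.DataFrame({"customer_id": [10, 11, 12],
                          "segment": ["retail", "retail", "corporate"]})
enriched = clean.merge(customers, on="customer_id", how="left")

# Step 5: validate - simple repeatable checks
assert enriched["order_id"].is_unique
assert enriched["amount"].gt(0).all()

# Step 6: publish - write the wrangled data where analysts can use it
enriched.to_csv("orders_wrangled.csv", index=False)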
BCA SEM VI Unit 2 Elective 2: Introduction to Data Science

What Does Digital Data Mean?


Digital data is data that represents other forms of data using specific machine language systems that
can be interpreted by various technologies. The most fundamental of these systems is a binary
system, which simply stores complex audio, video or text information in a series of binary
characters, traditionally ones and zeros, or "on" and "off" values.

Digital data, in information theory and information systems, is information represented as a string
of discrete symbols, each of which can take on one of only a finite number of values from some
alphabet, such as letters or digits. An example is a text document, which consists of a string of
alphanumeric characters.

What are types of digital data?


It can be classified into two broad categories:
• Bitmap objects: for example, image, video, or audio files.
• Textual objects: for example, Microsoft Word documents, emails, or Microsoft Excel spreadsheets.

What is Data?
The quantities, characters, or symbols on which operations are performed by a computer, which
may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical,
or mechanical recording media.

What is Big Data?


Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that none of the traditional data management tools can store or process it efficiently. In short, big data is also data, but of huge size.

What is an Example of Big Data?


Following are some of the Big Data examples-

1. The New York Stock Exchange is an example of Big Data that generates about one terabyte
of new trade data per day.

2. Social Media: Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.

3. A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.

Types of Big Data


Following are the types of Big Data:
 Structured
 Unstructured
 Semi-structured

Structured
Any data that can be stored, accessed and processed in a fixed format is termed ‘structured’ data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value out of it. However, nowadays we foresee issues when the size of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.

Examples of Structured Data:


An ‘Employee’ table in a database is an example of Structured Data.

Employee_ID Employee_Name Gender Department Salary_In_lacs


2365 Rajesh Kulkarni Male Finance 650000
3398 Pratibha Joshi Female Admin 650000
7465 Shushil Roy Male Admin 500000
7500 Shubhojit Das Male Finance 500000
7699 Priya Sane Female Finance 550000

Unstructured
Any data with unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of data available with them, but unfortunately they don’t know how to derive value out of it, since this data is in its raw, unstructured form.

Examples of Un-structured Data


The output returned by ‘Google Search’

Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, for example, a table definition as in a relational DBMS. An example of semi-structured data is data represented in an XML file.

Examples of Semi-structured Data


Personal data stored in an XML file-

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>

<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>

<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>

<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>

<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>

Characteristics of Big Data


Big data can be described by the following characteristics:

 Volume
 Variety
 Velocity
 Variability

(i) Volume – The name Big Data itself is related to a size which is enormous. Size of data plays a
very crucial role in determining value out of data. Also, whether a particular data can actually be
considered as a Big Data or not, is dependent upon the volume of data. Hence, ‘Volume’ is one
characteristic which needs to be considered while dealing with Big Data solutions.

(ii) Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
During earlier days, spreadsheets and databases were the only sources of data considered by most
of the applications. Nowadays, data in the form of emails, photos, videos, monitoring devices,
PDFs, audio, etc. are also being considered in the analysis applications. This variety of unstructured
data poses certain issues for storage, mining and analyzing data.

(iii) Velocity – The term ‘velocity’ refers to the speed of generation of data. How fast the data is
generated and processed to meet the demands, determines real potential in the data.

Big Data Velocity deals with the speed at which data flows in from sources like business processes,
application logs, networks, and social media sites, sensors, Mobile devices, etc. The flow of data is
massive and continuous.

(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.

Advantages of Big Data Processing


Ability to process Big Data in DBMS brings in multiple benefits, such as-

 Businesses can utilize outside intelligence while taking decisions


 Improved customer service
 Early identification of risk to the product/services, if any
 Better operational efficiency

Big Data technologies can be used for creating a staging area or landing zone for new data before
identifying what data should be moved to the data warehouse. In addition, such integration of Big
Data technologies and data warehouse helps an organization to offload infrequently accessed data.

Big Data Challenges of conventional systems


 Big data is the storage and analysis of large data sets.
 These are complex data sets that can be both structured or unstructured.
 They are so large that it is not possible to work on them with traditional analytical tools.
 One of the major challenges of conventional systems was the uncertainty of the Data
Management Landscape.
 Big data is continuously expanding, there are new companies and technologies that are being
developed every day.
 A big challenge for companies is to find out which technology works bests for them without
the introduction of new risks and problems.
 These days, organizations are realising the value they get out of big data analytics and hence
they are deploying big data tools and processes to bring more efficiency in their work
environment.

Explain the drawbacks / Pitfalls of Conventional File Processing System

There are some disadvantages of the File Processing system, which are mentioned below.

1. Data Redundancy: Data redundancy means, the same information is repeated in several
files.

2. Data Inconsistency: Data Inconsistency arises when there is Data Redundancy. It means that the various copies of the same data in different files do not get updated when changes are made in one place. Thus the required information cannot be obtained by an application program, either because no such program exists in the list of application programs or because the fields of the file differ from what was assumed at the time of application design.

3. Data Isolation: The data is scattered in various files with different formats. Therefore, it is
difficult to write a new application program and hence difficult to retrieve appropriate data from the
files.

4. Integrity Problems: The data values stored in a file must satisfy certain data integrity constraints. The programmers need to provide the integrity constraints, and the data must be validated from time to time; the file system has limitations in enforcing this.

5. Concurrent Access: The system needs to allow multiple users to access and update the data simultaneously, instead of being a single-user system. Concurrent interaction may result in inconsistency.

6. Security Problems: The system should not give unauthorized users access, as the data is important and sensitive. It should allow only those users who have been given privileges to access and manipulate the data.

BIG DATA ECOSYSTEM


 With the advances in technology and the rapid evolution of computing technology, it is becoming very tedious to process and manage huge amounts of information without the use of supercomputers.
 There are some tools and techniques that are available for data management like Google
BigTable, Data Stream Management System (DSMS), NoSQL amongst others.
 However, there is an urgent need for companies to deploy special tools and technologies that can be used to store, access, analyse and process large amounts of data in near-real time.
 Big Data cannot be stored in a single machine and thus, several machines are required.
 Common tools that are used to manipulate Big Data are Hadoop, MapReduce, and BigTable.

What is Data Science? Introduction, Basic Concepts & Process


What is Data Science?

 Data Science is the area of study which involves extracting insights from vast amounts of
data using various scientific methods, algorithms, and processes.
 It helps you to discover hidden patterns from the raw data.
 The term Data Science has emerged because of the evolution of mathematical statistics, data
analysis, and big data.

 Data Science is an interdisciplinary field that allows you to extract knowledge from
structured or unstructured data.
 Data science enables you to translate a business problem into a research project and then
translate it back into a practical solution.

Why Data Science?


Here are significant advantages of using Data Analytics Technology:

 Data is the oil for today’s world. With the right tools, technologies, algorithms, we can use
data and convert it into a distinct business advantage
 Data Science can help you to detect fraud using advanced machine learning algorithms
 It helps you to prevent any significant monetary losses
 Allows building intelligent capabilities into machines
 You can perform sentiment analysis to gauge customer brand loyalty
 It enables you to take better and faster decisions
 It helps you to recommend the right product to the right customer to enhance your business

Data Science Components

1. Statistics:

Statistics is the most critical unit of Data Science basics, and it is the method or science of collecting
and analyzing numerical data in large quantities to get useful insights.

2. Visualization:

Visualization technique helps you access huge amounts of data in easy to understand and digestible
visuals.

3. Machine Learning:

Machine Learning explores the building and study of algorithms that learn to make predictions
about unforeseen/future data.

4. Deep Learning:

Deep Learning method is new machine learning research where the algorithm selects the analysis
model to follow.

Data Science Process

1. Discovery:

Discovery step involves acquiring data from all the identified internal & external sources, which
helps you answer the business question.

The data can be:

 Logs from webservers


 Data gathered from social media
 Census datasets
 Data streamed from online sources using APIs

2. Preparation:

Data can have many inconsistencies like missing values, blank columns, an incorrect data format,
which needs to be cleaned. You need to process, explore, and condition data before modelling. The
cleaner your data, the better are your predictions.

3. Model Planning:

In this stage, you need to determine the method and technique to draw the relation between input
variables. Planning for a model is performed by using different statistical formulas and visualization
tools. SQL analysis services, R, and SAS/access are some of the tools used for this purpose.

4. Model Building:

In this step, the actual model building process starts. Here, the data scientist splits the dataset into training and testing sets. Techniques like association, classification, and clustering are applied to the training data set. The model, once prepared, is tested against the “testing” dataset (see the sketch below).
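
A minimal sketch of this split-train-test workflow using scikit-learn is shown below; the bundled iris dataset stands in for the project data, and the k-nearest-neighbours classifier is only a placeholder model chosen for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Distribute the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Apply a classification technique to the training set
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Test the prepared model against the "testing" dataset
predictions = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))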

5. Operationalize:

You deliver the final baselined model with reports, code, and technical documents in this stage.
Model is deployed into a real-time production environment after thorough testing.

6. Communicate Results

In this stage, the key findings are communicated to all stakeholders. This helps you decide if the
project results are a success or a failure based on the inputs from the model.

Data Science Jobs Roles


Most prominent Data Scientist job titles are:

 Data Scientist
 Data Engineer
 Data Analyst
 Statistician
 Data Architect
 Data Admin
 Business Analyst
 Data/Analytics Manager

Tools for Data Science

Difference between Data Science with BI (Business Intelligence)


Parameters   | Business Intelligence                                    | Data Science
Perception   | Looking Backward                                         | Looking Forward
Data Sources | Structured data (mostly SQL, sometimes a Data Warehouse) | Structured and unstructured data, like logs, SQL, NoSQL, or text
Approach     | Statistics & Visualization                               | Statistics, Machine Learning, and Graph
Emphasis     | Past & Present                                           | Analysis & Neuro-linguistic Programming
Tools        | Pentaho, Microsoft BI, QlikView                          | R, TensorFlow

Applications of Data Science


Some application of Data Science are:

1. Internet Search:

Google search uses Data science technology to search for a specific result within a fraction of a
second

2. Recommendation Systems:

To create recommendation systems. For example, “suggested friends” on Facebook or “suggested videos” on YouTube are generated with the help of Data Science.

3. Image & Speech Recognition:



Speech recognition systems like Siri, Google Assistant, and Alexa run on data science techniques. Moreover, Facebook recognizes your friend when you upload a photo with them, with the help of Data Science.

4. Gaming world:

EA Sports, Sony, Nintendo are using Data science technology. This enhances your gaming
experience. Games are now developed using Machine Learning techniques, and they can update
themselves when you move to higher levels.

5. Online Price Comparison:

PriceRunner, Junglee, Shopzilla work on the Data science mechanism. Here, data is fetched from
the relevant websites using APIs.

Challenges of Data Science Technology


 A high variety of information & data is required for accurate analysis
 Not adequate data science talent pool available
 Management does not provide financial support for a data science team
 Unavailability of/difficult access to data
 Business decision-makers do not effectively use data Science results
 Explaining data science to others is difficult
 Privacy issues
 Lack of significant domain expert
 If an organization is very small, it can’t have a Data Science team

Summary
 Data Science is the area of study that involves extracting insights from vast amounts of data
by using various scientific methods, algorithms, and processes.
 Statistics, Visualization, Deep Learning, Machine Learning are important Data Science
concepts.
 Data Science Process goes through Discovery, Data Preparation, Model Planning, Model
Building, Operationalize, Communicate Results.
 Important Data Scientist job roles are: 1) Data Scientist 2) Data Engineer 3) Data Analyst
4) Statistician 5) Data Architect 6) Data Admin 7) Business Analyst 8) Data/Analytics
Manager.
 R, SQL, Python, SAS are essential Data Science tools.
 The perspective of Business Intelligence is looking backward, while that of Data Science is looking forward.

 Important applications of Data science are 1) Internet Search 2) Recommendation Systems


3) Image & Speech Recognition 4) Gaming world 5) Online Price Comparison.
 The high variety of information & data is the biggest challenge of Data science technology.
BCA SEM VI Unit 3 Elective 2: Introduction to Data Science

Machine Learning Models

 A machine learning model is defined as a mathematical representation of the output of the


training process.
 Machine learning is the study of different algorithms that can improve automatically through
experience & old data and build the model.
 A machine learning model is similar to computer software designed to recognize patterns or
behaviours based on previous experience or data.
 The learning algorithm discovers patterns within the training data, and it outputs an ML
model which captures these patterns and makes predictions on new data.

Classification of Machine Learning Models:


 Based on different business goals and data sets, there are three learning models for
algorithms.
 Each machine learning algorithm settles into one of the three models:

 Supervised Learning
 Unsupervised Learning
 Reinforcement Learning

Supervised Learning is further divided into two categories:


o Classification
o Regression

Unsupervised Learning is also divided into below categories:


o Clustering
o Association Rule
o Dimensionality Reduction

1. Supervised Machine Learning Models


 Supervised Learning is the simplest machine learning model to understand in which input
data is called training data and has a known label or result as an output.
 So, it works on the principle of input-output pairs.
 It requires creating a function that can be trained using a training data set, and then it is
applied to unknown data and makes some predictive performance.
 Supervised learning is task-based and tested on labelled data sets.

Regression
In regression problems, the output is a continuous variable. Some commonly used Regression
models are as follows:

a) Linear Regression
 Linear regression is the simplest machine learning model in which we try to predict one
output variable using one or more input variables.
 The representation of linear regression is a linear equation, which combines a set of input
values(x) and predicted output(y) for the set of those input values.
 It is represented in the form of a line: y = bx + c, where b is the slope of the line and c is the intercept.

 The main aim of the linear regression model is to find the best fit line that best fits the data
points.
 Linear regression is extended to multiple linear regression (find a plane of best fit) and
polynomial regression (find the best fit curve).
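
A minimal scikit-learn sketch of fitting the line y = bx + c is shown below; the data points are made up to lie roughly on y = 2x + 1.

import numpy as np
from sklearn.linear_model import LinearRegression

# One input variable x and one output variable y (invented points near y = 2x + 1)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

model = LinearRegression().fit(X, y)
print("slope b:", model.coef_[0])        # close to 2
print("intercept c:", model.intercept_)  # close to 1
print("prediction for x=6:", model.predict([[6.0]])[0])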

b) Decision Tree

 Decision trees are the popular machine learning models that can be used for both regression
and classification problems.
 A decision tree uses a tree-like structure of decisions along with their possible consequences
and outcomes.
 In this, each internal node is used to represent a test on an attribute; each branch is used to
represent the outcome of the test.
 The more nodes a decision tree has, the more accurate the result will be.
 The advantage of decision trees is that they are intuitive and easy to implement, but they
lack accuracy.
 Decision trees are widely used in operations research, specifically in decision analysis,
strategic planning, and mainly in machine learning.
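
The following is a small, hedged sketch of a decision tree classifier that prints the learned tree structure; scikit-learn's bundled iris sample is used purely for illustration.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each internal node tests an attribute; each branch is an outcome of that test
print(export_text(tree, feature_names=list(iris.feature_names)))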

c) Random Forest
 Random Forest is the ensemble learning method, which consists of a large number of
decision trees.
 Each decision tree in a random forest predicts an outcome, and the prediction with the
majority of votes is considered as the outcome.
 A random forest model can be used for both regression and classification problems.
 For the classification task, the outcome of the random forest is taken from the majority of
votes.
 Whereas in the regression task, the outcome is taken from the mean or average of the
predictions generated by each tree.
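
A minimal sketch of the majority-vote idea using scikit-learn's RandomForestClassifier is given below; the iris sample data is used only for illustration.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# An ensemble of many decision trees; the forest's prediction is the majority vote
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

sample = X[:1]
votes = [int(tree.predict(sample)[0]) for tree in forest.estimators_]
print("first ten individual tree votes:", votes[:10])
print("majority-vote prediction:", forest.predict(sample)[0])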

d) Neural Networks
 Neural networks are the subset of machine learning and are also known as artificial neural
networks.
 Neural networks are made up of artificial neurons and designed in a way that resembles the
human brain structure and working.
 Each artificial neuron connects with many other neurons in a neural network, and such
millions of connected neurons create a sophisticated cognitive structure.

 Neural networks consist of a multilayer structure, containing one input layer, one or more
hidden layers, and one output layer.
 As each neuron is connected with another neuron, it transfers data from one layer to the
other neuron of the next layers.
 Finally, data reaches the last layer or output layer of the neural network and generates output.
 Neural networks depend on training data to learn and improve their accuracy. However, a
perfectly trained & accurate neural network can cluster data quickly and become a powerful
machine learning and AI tool.
 One of the best-known neural networks is Google's search algorithm.
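
As a rough illustration, here is a minimal multilayer network using scikit-learn's MLPClassifier with a single hidden layer; the digits sample dataset is an assumption made for the example.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Input layer -> one hidden layer of 32 neurons -> output layer
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))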

Classification
 Classification models are the second type of Supervised Learning techniques, which are
used to generate conclusions from observed values in the categorical form.
 For example, the classification model can identify if the email is spam or not; a buyer will
purchase the product or not, etc.
 Classification algorithms are used to predict discrete class labels and categorize the output into
different groups.
 In classification, a classifier model is designed that classifies the dataset into different
categories, and each category is assigned a label.

There are two types of classifications in machine learning:

 Binary classification: If the problem has only two possible classes, it is called binary
classification. For example, cat or dog, Yes or No.

 Multi-class classification: If the problem has more than two possible classes, it is a multi-
class classifier.

Some popular classification algorithms are as below:

a) Logistic Regression
 Logistic Regression is used to solve the classification problems in machine learning.
 It is similar to linear regression but is used to predict categorical variables.
 It can predict the output in either Yes or No, 0 or 1, True or False, etc. However, rather than
giving the exact values, it provides the probabilistic values between 0 & 1.
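
A minimal sketch of logistic regression for a yes/no style problem, using scikit-learn's built-in breast cancer data set purely as an example of binary labels.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)   # labels are 0 or 1
clf = LogisticRegression(max_iter=5000).fit(X, y)

print(clf.predict(X[:3]))         # predicted classes (0 or 1)
print(clf.predict_proba(X[:3]))   # probabilities between 0 and 1 for each class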

b) Support Vector Machine


 Support vector machine or SVM is the popular machine learning algorithm, which is widely
used for classification and regression tasks. However, specifically, it is used to solve
classification problems.
 The main aim of SVM is to find the best decision boundaries in an N-dimensional space,
which can segregate data points into classes, and the best decision boundary is known as
Hyperplane.
 SVM selects the extreme vectors (the data points closest to the boundary) to find the
hyperplane, and these vectors are known as support vectors.
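
A minimal sketch of a support vector machine classifier; the iris data set and the linear kernel are arbitrary illustrative choices.

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# fit a linear hyperplane that separates the classes
clf = SVC(kernel='linear').fit(X, y)

print(len(clf.support_vectors_))   # number of support vectors chosen
print(clf.predict(X[:5]))          # predicted class labels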

c) Naïve Bayes
 Naïve Bayes is another popular classification algorithm used in machine learning.
 It is called so as it is based on Bayes' theorem and follows the naive (independence)
assumption between the features. Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

 Each naïve Bayes classifier assumes that the value of a specific feature is independent of
the value of any other feature. For example, if a fruit needs to be classified based on color,
shape, and taste,
 then a yellow, oval, and sweet fruit will be recognized as a mango. Here each feature
contributes independently of the other features.
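
A minimal sketch of a Gaussian naïve Bayes classifier with scikit-learn, using the built-in wine data set only for illustration.

from sklearn.datasets import load_wine
from sklearn.naive_bayes import GaussianNB

X, y = load_wine(return_X_y=True)

# each feature is treated as independent of the others, given the class
clf = GaussianNB().fit(X, y)

print(clf.predict(X[:5]))   # predicted class labels
print(clf.score(X, y))      # accuracy on the training data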

2. Unsupervised Machine learning models


 Unsupervised Machine learning models implement the learning process opposite to
supervised learning, which means it enables the model to learn from the unlabelled training
dataset.
 Based on the unlabelled dataset, the model predicts the output. Using unsupervised learning,
the model learns hidden patterns from the dataset by itself without any supervision.

Unsupervised learning models are mainly used to perform three tasks, which are as follows:
Clustering
 Clustering is an unsupervised learning technique that involves grouping the data
points into different clusters based on similarities and differences.
 The objects with the most similarities remain in the same group, and they have no or very
few similarities with objects in other groups.
 Clustering algorithms are widely used in different tasks such as image segmentation,
statistical data analysis, market segmentation, etc.
 Some commonly used clustering algorithms are K-means clustering, hierarchical clustering,
DBSCAN, etc.
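
A minimal sketch of K-means clustering; the six 2-D points are made up so that they fall into two obvious groups.

import numpy as np
from sklearn.cluster import KMeans

# made-up points forming two rough groups, around x=1 and x=10
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assigned to each point
print(kmeans.cluster_centers_)   # coordinates of the two cluster centres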

Association Rule Learning


 Association rule learning is an unsupervised learning technique, which finds interesting
relations among variables within a large dataset.
 The main aim of this learning algorithm is to find the dependency of one data item on another
data item and map those variables accordingly so that it can generate maximum profit.
 This algorithm is mainly applied in Market Basket analysis, Web usage mining, continuous
production, etc.
 Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat, FP-
growth algorithm.
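
A minimal sketch of the Apriori algorithm for market basket analysis, assuming the third-party mlxtend library (pip install mlxtend) is available; the tiny one-hot encoded basket table is invented for illustration.

import pandas as pd
from mlxtend.frequent_patterns import apriori

# each row is a shopping basket; True means the item was bought
baskets = pd.DataFrame({
    'bread':  [True, True, False, True],
    'butter': [True, True, False, False],
    'milk':   [False, True, True, True],
})

# find item sets that appear in at least 50% of the baskets
frequent_itemsets = apriori(baskets, min_support=0.5, use_colnames=True)
print(frequent_itemsets)

From these frequent item sets, association rules such as "bread implies butter" can then be derived and ranked by measures like confidence or lift.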

Dimensionality Reduction
 The number of features/variables present in a dataset is known as the dimensionality of the
dataset, and the technique used to reduce the dimensionality is known as the dimensionality
reduction technique.
 Although more data provides more accurate results, it can also affect the performance of the
model/algorithm, such as overfitting issues. In such cases, dimensionality reduction
techniques are used.
 "It is a process of converting the higher dimensions dataset into lesser dimensions dataset
ensuring that it provides similar information."
 Commonly used dimensionality reduction methods include PCA (Principal Component Analysis),
Singular Value Decomposition (SVD), etc.
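
A minimal sketch of dimensionality reduction with PCA, compressing scikit-learn's 64-feature digits data set down to two components purely for illustration.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)     # 64 features per sample
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)        # now only 2 features per sample

print(X.shape, '->', X_reduced.shape)
print(pca.explained_variance_ratio_)    # share of the variance kept by each component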

3. Reinforcement Learning
 In reinforcement learning, the algorithm learns actions for a given set of states that lead to a
goal state.
 It is a feedback-based learning model that takes feedback signals after each state or action
by interacting with the environment.
 This feedback works as a reward (positive for each good action and negative for each bad
action), and the agent's goal is to maximize the total reward to improve its performance.
 The behaviour of the model in reinforcement learning is similar to human learning, as
humans learn by interacting with the environment and using experience as feedback.

 Below are some popular algorithms that come under reinforcement learning:

 Q-learning: Q-learning is one of the popular model-free algorithms of reinforcement
learning, which is based on the Bellman equation.
 It aims to learn the policy that can help the AI agent take the best action for maximizing
the reward under a specific circumstance.
 It maintains a Q-value for each state-action pair that indicates the reward for following a
given state path, and it tries to maximize the Q-value (see the sketch after this list).

 State-Action-Reward-State-Action (SARSA): SARSA is an on-policy algorithm based
on the Markov decision process.
 It uses the action performed by the current policy to learn the Q-value.
 The SARSA algorithm stands for State Action Reward State Action, which symbolizes the
tuple (s, a, r, s', a').

 Deep Q Network (DQN): DQN or Deep Q Neural Network is Q-learning within a neural network.
 It is mainly employed in large state-space environments where defining a Q-table would be
a complex task. In such a case, rather than using a Q-table, a neural network estimates the
Q-value for each action based on the state.
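
A minimal sketch of the Q-learning update rule follows; the 3-state, 2-action layout, the learning rate and the discount factor are all made-up illustrative values, and a real agent would also need an environment and an exploration strategy.

import numpy as np

n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))   # Q-value for every state-action pair

alpha, gamma = 0.1, 0.9               # learning rate and discount factor

def q_update(state, action, reward, next_state):
    # Bellman-style update: move Q(s, a) towards reward + gamma * max over a' of Q(s', a')
    target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

q_update(state=0, action=1, reward=1.0, next_state=2)
print(Q)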

Difference between Machine learning model and Algorithms


 The machine learning model is not the same as an algorithm.
 In a simple way, an ML algorithm is like a procedure or method that runs on data to discover
patterns from it and generate the model.
 At the same time, a machine learning model is like a computer program that generates output
or makes predictions.
 More specifically, when we train an algorithm with data, it becomes a model.

Machine Learning Model = Model Data + Prediction Algorithm

Supervised Learning vs. Unsupervised Learning

1. Supervised learning algorithms are trained using labeled data, whereas unsupervised learning algorithms are trained using unlabeled data.

2. A supervised learning model takes direct feedback to check whether it is predicting the correct output or not, whereas an unsupervised learning model does not take any feedback.

3. A supervised learning model predicts the output, whereas an unsupervised learning model finds the hidden patterns in data.

4. In supervised learning, input data is provided to the model along with the output, whereas in unsupervised learning, only input data is provided to the model.

5. The goal of supervised learning is to train the model so that it can predict the output when it is given new data, whereas the goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.

6. Supervised learning needs supervision to train the model, whereas unsupervised learning does not need any supervision.

7. Supervised learning can be categorized into Classification and Regression problems, whereas unsupervised learning can be categorized into Clustering and Association problems.

8. Supervised learning can be used for cases where we know the input as well as the corresponding outputs, whereas unsupervised learning can be used for cases where we have only input data and no corresponding output data.

9. A supervised learning model produces a more accurate result, whereas an unsupervised learning model may give a less accurate result compared to supervised learning.

10. Supervised learning is not close to true Artificial Intelligence, because we first train the model on each data point and only then can it predict the correct output, whereas unsupervised learning is closer to true Artificial Intelligence, as it learns in the way a child learns daily routine things from experience.

11. Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machine, Multi-class Classification, Decision Tree, Bayesian Logic, etc., whereas unsupervised learning includes algorithms such as Clustering, KNN, and the Apriori algorithm.

What is Exploratory Data Analysis?

 Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data
sets and summarize their main characteristics, often employing data visualization methods.

 Exploratory data analysis (EDA) is an approach of analysing data sets to summarize their
main characteristics, often using statistical graphics and other data visualization methods.

 A statistical model can be used or not, but primarily EDA is for seeing what the data can
tell us beyond the formal modelling and thereby contrasts traditional hypothesis testing.
 Exploratory data analysis has been promoted by John Tukey since 1970 to encourage
statisticians to explore the data, and possibly formulate hypotheses that could lead to new
data collection and experiments.
 EDA is different from initial data analysis (IDA), which focuses more narrowly on checking
the assumptions required for model fitting and hypothesis testing, and on handling missing values.
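
A minimal sketch of a first EDA pass with pandas is shown below (assuming pandas, and matplotlib for the histogram, are installed); 'data.csv' is a hypothetical file name, and the exact checks always depend on the data set.

import pandas as pd

df = pd.read_csv('data.csv')     # hypothetical data set

print(df.shape)          # number of rows and columns
print(df.head())         # first few records
print(df.describe())     # summary statistics for the numeric columns
print(df.isna().sum())   # missing values per column
df.hist(figsize=(10, 8)) # quick visual look at each numeric column's distribution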

FIRST STEPS IN BIG DATA

 Big data is a phenomenon that is characterized by the rapid expansion of raw data.

 This data is being collected and generated so quickly that it is inundating government

and society. Therefore, it represents both a challenge and an opportunity.

 The challenge is related to how this volume of data is harnessed (brought together and put to use), and the

opportunity is related to how the effectiveness of society’s institutions is enhanced by

properly analysing this information.

 It is now commonplace to distinguish big data solutions from conventional IT solutions by

considering the following four dimensions:

1. Volume. Big data solutions must manage and process larger amounts of data.

2. Velocity. Big data solutions must process more rapidly arriving data.

3. Variety. Big data solutions must deal with more kinds of data, both structured and

unstructured.

4. Veracity. Big data solutions must validate the correctness of the large amount of rapidly

arriving data.

Enterprise Information Management For Big Data:

 Enterprise Information Management strategies are the evolution of traditional information

management practices due to the explosion of data and the rise of the Information Enterprise.

 EIM enables businesses to secure their information across the diverse and complex landscapes

of organizational departments, legacy systems, corporate and regulatory policies, business

content and unstructured big data.

 Enterprise Information Management software helps businesses attain 360 degree views of their

big data and analytics by streamlining organizational workflows, increasing the quality of

information and creating integrated user interfaces for end users within a single source

platform.

 OpenText offers EIM software systems and services that let you build a cohesive information

management strategy that leverages existing assets, meets urgent needs and establishes a fast

path to the future.



 We believe that good information strategy is good business strategy, and that an effective EIM

strategy requires three things: Information Readiness, Information Capabilities, and

Information Confidence.

 Master Data Management: enabling flexible, efficient and effective business processes with

high master data quality

 Analytics: enabling big data related skills to leverage the full potential of data

 Governance and organization: defining and setting up EIM related organizations including

relevant roles and responsibilities

 Enterprise architecture: the big picture – ensuring that processes, data, technology and

organizational architecture are aligned as far as information is concerned

 Risk and compliance: safeguarding information/data security, privacy and compliance with

legal frameworks

 Transformation: enabling organizational change with innovative and efficient methods



Capabilities needed for Big Data:

 Big data has brought game-changing shifts to the way data is acquired, analyzed, stored, and

used. Solutions can be more flexible, more scalable, and more cost-effective than ever before.

 Instead of building one-off systems designed to address specific problems for specific business

units, companies can create a common platform leveraged in different ways by different parts of

the business and all kinds of data — structured and unstructured, internal and external can be

incorporated.

Data Usage:

 Identifying Opportunities and Building Trust. Companies must create a culture that encourages

experimentation and supports a data-driven ideation process.

 They need to focus on trust, too—not just building it with consumers but wielding it as a

competitive weapon.

 Businesses that use data in transparent and responsible ways will ultimately have more access

to more information than businesses that don’t.

The Data Engine:

 Laying the Technical Foundation and Shaping the Organization.

 Technical platforms that are fast, scalable, and flexible enough to handle different types of

applications are critical. So, too, are the skill sets required to build and manage them.

 In general, these new platforms will prove remarkably cost-effective, using commodity hardware

and leveraging cloud-based and open-source technologies. But their all-purpose nature means

that they will often be located outside individual business units.

 It’s crucial, therefore, to link them back to those businesses and their goals, priorities, and

expertise.

 Companies will also need to put the insights they gain from big data to use—embedding them in

operational processes, in or near real time.

The Data Ecosystem:

 Participating in a Big-Data Ecosystem and Making Relationships Work.

 Big data is creating opportunities that are often outside a company’s traditional business or

markets.

 Partnerships will be increasingly necessary to obtain required data, expertise, capabilities, or

customers. Businesses must be able to identify the right relationships—and successfully maintain

them.

 In a world where information moves fast, businesses that are quick to see, and pursue, the new

ways to work with data are the ones that will get ahead and stay ahead.

Application Architectures for Big Data and Analytics

Big Data Warehouse & Analytics

In order to examine how big data solutions relate to the data warehouse, we need to start with the

basics.

First, what is big data? There are actually many different forms of big data. But the most widely

understood form of big data is the form found in Hadoop, Cloudera, etc.

A good working definition of big data solutions is:

• Technology capable of holding very large amounts of data.

• Technology that can hold the data in inexpensive storage devices.

• Technology where processing is done by the “Roman census” method.

• Technology where the data is stored in an unstructured format.

There are probably other ramifications and features, but these basic characteristics are a good

working description of what most people mean when they talk about a big data solution

Comparing Big Data Solutions to a Data Warehouse

 When we compare a big data solution to a data warehouse, what do we find?

 We find that a big data solution is a technology and that data warehousing is an architecture.

 They are two very different things.

 A technology is just that – a means to store and manage large amounts of data.

 A data warehouse is a way of organizing data so that there is corporate credibility and integrity.

 When someone takes data from a data warehouse, that person knows that other people are using

the same data for other purposes.

 There is a basis for reconcilability of data when there is a data warehouse.



Data Modelling Approaches for Big Data and Analytics Solutions

 Data integration, in effect is the acquisition of data from diverse source systems (like

operational applications for ERP, CRM, supply chain, where most enterprise data originates

and a host of external sources of data like social networks, external third party data sources,

etc.) through multiple transformations of the data to get it ready for loading into target systems

(like data warehouses, customer data hubs, and product catalogs).

 Heterogeneity is the norm for both data sources and targets, since there are various types of

applications, databases, file types, and so on.

 All these have different data models, so the data must be transformed in the middle of the

process, and the transformations themselves vary widely.

 Then there are the interfaces that connect these pieces, which are equally diverse and the data

doesn’t flow uninterrupted or in a straight line, so you need data staging areas.

 Simply put, that’s a lot of complex and diverse activities that you must perform to organize

data to make it useful.

 Eventually the data integration processes and approaches influence the data model development

as well.

Understanding Data Integration Patterns

 Data integration approaches can become highly complex especially when you are dealing with

big data types. Below is an attempt to outline the complexities of data integration processes.

 Level 0: Simple point to point data integration with little or no transformation. This just means

information is flowing from one system to another.

 Level 1: Simple data integration processes, transforming one schema to another, without

applying any data manipulation functions like “if,” “then,” “else,” etc.

 Level 2: Simple data integration processes, transforming one schema to another, with

application of data manipulation functions like “if,” “then,” “else,” etc.

 Level 3: Complex data integration patterns, transforming the subject data dealing with complex

schemas and semantic management involving both structured and unstructured data.

 In this scenario there could be one or more data sources (data could be also at rest or in motion)

and one or more schema targets.

 These design patterns (and there could be many more depending on the applications you are

trying to develop and the nature of data sources) need to be aligned with the right integration

architectures and influence the resulting data model to a great extent.

 We purposefully stayed away from discussing the granularity of data, state of data changes,

and governance processes around data: if you add those aspects to the data integration patterns

you can realize the complexity of the solution.



NOSQL Data Modelling Technique

 NoSQL databases are document-oriented.

 This way, non-structured data (such as articles, photos, social media data, videos, or content

within a blog post) can be stored in a single document that can be easily found but isn’t

necessarily categorized into fields like a relational database does.

 It’s more intuitive, but note that storing data in bulk like this requires extra processing effort and

more storage than highly organized SQL data. That’s why Hadoop, an open-source computing

and data analysis platform capable of processing huge amounts of data in the cloud, is so popular

in conjunction with NoSQL database stacks.

 NoSQL databases offer another major advantage, particularly to app developers: ease of access.

 Relational databases have a fraught relationship with applications written in object-oriented

programming languages like Java, PHP, and Python.

 NoSQL databases are often able to sidestep this problem through APIs, which allow developers

to execute queries without having to learn SQL or understand the underlying architecture of their

database system.

There are four main types of NoSQL data models:

1. Key-value model: the least complex NoSQL option, which stores data in a schema-less way that
consists of indexed keys and values. Examples: Cassandra, Azure, LevelDB, and Riak.

2. Column store or wide-column store: stores data tables as columns rather than rows. It's more
than just an inverted table; sectioning out columns allows for excellent scalability and high
performance.

3. Document database: takes the key-value concept and adds more complexity; each document in
this type of database has its own data and its own unique key, which is used to retrieve it.
It's a great option for storing, retrieving and managing data that's document-oriented but still
somewhat structured.

4. Graph database: used when data is interconnected and best represented as a graph. This method
is capable of handling lots of complexity.



What is ACID?

 Ask any data professional and they could probably explain the ACID (Atomicity, Consistency,

Isolation, and Durability) acronym quite well.

 The concept has been around for decades and until recently was the primary benchmark that all

databases strive to achieve – without the ACID requirement in place within a given system,

reliability was suspect.

ACID, which some two decades later has come to mean:

• Atomicity: Either the task (or all tasks) within a transaction are performed or none of them are.

This is the all-or-none principle. If one element of a transaction fails the entire transaction fails.

• Consistency: The transaction must meet all protocols or rules defined by the system at all times.

The transaction does not violate those protocols and the database must remain in a consistent state

at the beginning and end of a transaction; there are never any half-completed transactions.

• Isolation: No transaction has access to any other transaction that is in an intermediate or

unfinished state. Thus, each transaction is independent unto itself.

This is required for both performance and consistency of transactions within a database.

• Durability: Once the transaction is complete, it will persist as complete and cannot be undone;

it will survive system failure, power loss and other types of system breakdowns.

There are of course many facets to those definitions and within the actual ACID requirement of

each particular database, but overall in the RDBMS world, ACID is overlord and without ACID

reliability is uncertain.

The Evolution of NoSQL

 The SQL scalability issue was recognized by some Web 2.0 companies with huge

environments, growing data and big infrastructure needs, like Google, Amazon, or Facebook.

 They presented their own solutions to the problem – technologies like BigTable, DynamoDB,

and Cassandra.

 This interest for alternatives resulted in a number of NoSQL Database Management Systems

(DBMS’s), with a clear direction to performance, reliability, and consistency.

 A number of existing indexing structures were reused and improved trying to enhance

searching and reading performance.

 The first solutions of NoSQL database types were developed by big companies to meet their

specific needs, like Google’s BigTable, maybe the first NoSQL system, and Amazon’s

DynamoDB.

 The success of these proprietary systems generated a big interest and there appeared a number

of similar open-source and proprietary database systems, some of the most popular ones being

Hypertable, Cassandra, MongoDB, DynamoDB, HBase, and Redis.



What Makes NoSQL Different?

 One important difference between NoSQL databases and common relational databases is the

fact that NoSQL is a form of unstructured storage.

 This means that NoSQL databases do not have a fixed table structure like the ones found in

relational databases.

Advantages and Disadvantages of NoSQL Databases

 NoSQL databases have many advantages compared to SQL relational databases.

 One important, underlying difference is that NoSQL databases have a simple and flexible

structure, without a schema.

 NoSQL databases are based on key-value pairs.

 NoSQL databases may include column store, document store, key value store, graph store,

object store, XML store, and other data store modes.

 Each value in the database will have a key usually.



 Some NoSQL database models also allow developers to store serialized objects into the

database, not only simple string values.

 Open-source NoSQL databases don’t require expensive licensing fees and can run on low-

resources hardware, rendering their deployment cost-effective.

 Also, when working with NoSQL databases, either open-source or proprietary, scaling is

easier and cheaper than when working with relational databases.

 This is because it is done by scaling horizontally and distributing the load over all nodes, rather

than the usual vertical scaling done with relational database systems, which means replacing the main host

with a more powerful one.

Disadvantages

 First, most NoSQL databases do not support reliability features that are natively supported by

relational databases.

 These reliability features can be atomicity, consistency, isolation, and durability.

 This also means that NoSQL databases, which don’t support those features, trade consistency

for performance and scalability.

 In order to support reliability and consistency features, developers must implement their own

personal code, which makes the system more complex.

 This might limit the number of applications that can rely on NoSQL databases for secure and

reliable transactions, like banking systems or personal data management.

 Another problem found in most NoSQL databases is their incompatibility with SQL queries.

 This means that a manual or proprietary querying language is needed, adding more time and

complexity.

NoSQL vs. Relational Databases

This table provides a quick feature comparison between NoSQL and relational databases:

 It should be noted that the table shows a comparison on the database level, not the

various database management systems that implement both models.

 These systems provide their own proprietary techniques to overcome some of the problems

and shortcomings in both systems, and in some cases, significantly improve performance and

reliability.

NoSQL Data Store Types

Key Value Store

 In the Key Value store type, a hash table is used in which a unique key points to a specific

item.

 Keys can be organized into logical groups, only requiring keys to be unique on their own group.

 This allows the existence of identical keys in different logical groups.

 As an example of a key-value store, the key could be the name of a city and the value the

address of Ulster University in that city (see the sketch after this list).

 The most famous NoSQL database that uses a key value store is Amazon’s DynamoDB.
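
A minimal sketch of the key-value idea using a plain Python dictionary; the city names and address strings are placeholders rather than real data.

# each unique key (a city) points directly to one value (an address string)
store = {
    "Belfast":     "Ulster University, Belfast campus address",
    "Coleraine":   "Ulster University, Coleraine campus address",
    "Jordanstown": "Ulster University, Jordanstown campus address",
}

print(store["Belfast"])   # a value is looked up directly by its key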

Document Store

 Document stores are similar to key value stores in that they are schema-less and based on a

key-value model. Also both share many of the same advantages and disadvantages.

 They lack consistency on the database level, which makes way for applications to provide more

reliability and consistency features.

 There are however, some important differences between the two.

 In Document Stores, the values (documents) provide encoding for the data stored. Those

encodings can be XML, JSON, or BSON (Binary encoded JSON).

 Also, querying based on data is possible.

 The most popular database application that relies on a Document Store is MongoDB.

Column Store

 In a Column Store database, data is stored in columns, in contrast to being stored in rows as is

done in most relational database management systems.

 A Column Store is comprised of one or more Column Families that logically group specific

columns of the database.

 A key is used to identify and point to a number of columns, with a keyspace attribute that

defines the scope of this key.

 Each column contains tuples of names-values, ordered and comma separated.

 Column Stores have fast read/write access to the information. Rows that correspond to a single

column are stored as a single disk entry. This means faster access during read/write operations.

 The most popular databases that use the column store include Google’s BigTable, HBase, and

Cassandra.

Graph Base

 In a Graph Base model, a directed graph structure is used to represent the data. The graph is

comprised of edges and nodes.

 Formally, a graph is a representation of a pack of objects, where some pairs of objects are

connected by links.

 The interconnected objects are represented by mathematical abstractions, called vertices, and

the links that connect some pairs of vertices are called edges.

 A set of vertices and the edges that connect them is what is called graph.

 The structure of a graph-based database is thus built from edges and nodes.

 These nodes are organized by some relationships with other nodes, which are represented by

edges between the nodes. Both the nodes and the relationships have some defined properties.

 Graph databases are used typically in social networking applications.

 They allow developers to focus more on relations between objects rather than on the objects

themselves.

 In this context, they indeed allow for a scalable and easy-to-use environment.

ETHICS AND DATA SCIENCE

What are ethics in data science?

 Data ethics is a branch of ethics that evaluates data practices collecting, generating,
analysing and disseminating data, both structured and unstructured that have the potential
to adversely impact people and society.

Why ethics is an important term in data science?

 The way data scientists build models can have real implications for justice, health, and
opportunity in people's lives and we have an obligation to consider the ethics of our
discipline each and every day. When built correctly, algorithms can have massive power to
do good in the world.

Basic Ethical Principles


 Three basic principles, among those generally accepted in our cultural tradition, are
particularly relevant to the ethics of research involving human subjects:
 The principles of respect for
o persons,
o beneficence and
o justice.

What is scope of ethics?

 The scope of ethics is wide and is mainly concerned with the principles or causes of action, such as:
o What obligation is common to all?
o What is good in all good acts?
o The sense of duty and responsibility.
o The individual and society.
All of these questions fall under the scope of ethics.
 Data ownership refers to both the possession of and responsibility for information.
Ownership implies power as well as control.
 The control of information includes not just the ability to access, create, modify, package,
derive benefit from, sell or remove data, but also the right to assign these access privileges
to others
 According to Garner (1999), individuals having intellectual property have rights to control
intangible objects that are products of human intellect.
 The range of these products encompasses the fields of art, industry, and science.
 Research data is recognized as a form of intellectual property and subject to protection by
government.

Importance of data ownership:

 According to Loshin (2002), data has intrinsic value as well as having added value as a
byproduct of information processing, “at the core, the degree of ownership (and by corollary,
the degree of responsibility) is driven by the value that each interested party derives from
the use of that information”.

Considerations/issues in data ownership

 Researchers should have a full understanding of various issues related to data ownership to
be able to make better decisions regarding data ownership.
 These issues include paradigm of ownership, data hoarding, data ownership policies,
balance of obligations, and technology.
 Each of these issues gives rise to a number of considerations that impact decisions
concerning data ownership

The 5 Cs of Data Ethics

To ensure that there is a mechanism to foster a dialog, the following guidelines have been suggested
for building data products:

1. Consent

2. Clarity

3. Consistency and trust

4. Control (and Transparency)

5. Consequences (and Harm)

 Consent doesn't mean anything unless the user has clarity on the terms and conditions of the
contract.
 Usually contracts are a series of negotiations, but in all our online transactions it’s always a
binary condition.

 The user either accepts the terms or rejects them. Developers of data products should not only
ensure that they take consent from the user, but also that users have clarity on:

1) What data they are providing;

2) How their data will be used and

3) What are the downstream consequences of using the data.

 Clarity is closely related to consent. You can’t really consent to anything unless you’re told
clearly what you’re consenting to.
 Users must have clarity about what data they are providing, what is going to be done with
the data, and any downstream consequences of how their data is used.
 Consistency and trust: Trust requires consistency over time. You can’t trust someone who
is unpredictable.
 They may have the best intentions, but they may not honour those intentions when you need
them to. Or they may interpret their intentions in a strange and unpredictable way. And once
broken, rebuilding trust may take a long time.
 Restoring trust requires a prolonged period of consistent behaviour.
 Consistency, and therefore trust, can be broken either explicitly or implicitly.
 An organization that exposes user data can do so intentionally or unintentionally.
 In the past years, we’ve seen many security incidents in which customer data was stolen:
Yahoo!, Target, Anthem, local hospitals, government data, and data brokers like Experian,
the list grows longer each day. Failing to safeguard customer data breaks trust and
safeguarding data means nothing if not consistency over time.
 Control: Once you have given your data to a service, you must be able to understand what
is happening to your data.
 Can you control how the service uses your data? For example, Facebook asks for (but
doesn’t require) your political views, religious views, and gender preference.
 What happens if you change your mind about the data you've provided? If you decide you'd
rather keep your political affiliation quiet, do you know whether Facebook actually deletes
that information? Do you know whether Facebook continues to use that information in ad
placement?
 All too often, users have no effective control over how their data is used.
 They are given all-or-nothing choices, or a convoluted set of options that make controlling
access overwhelming and confusing.

 Data products are designed to add value for a particular user or system.
 As these products increase in sophistication, and have broader societal implications, it is
essential to ask whether the data that is being collected could cause harm to an individual or
a group.
 Consequences: We continue to hear about unforeseen consequences and the “unknown
unknowns” about using data and combining data sets.
 Risks can never be eliminated completely.
 However, many unforeseen consequences and unknown unknowns could be foreseen and
known, if only people had tried.
 All too often, unknown unknowns are unknown because we don’t want to know.

Implementing the 5 Cs

 Responsibility for the 5 Cs can’t be limited to the designers.


 It’s the responsibility of the entire team.
 The data scientists need to approach the problem asking “what if” scenarios that get to all
of the five C’s.
 The same is true for the product managers, business leaders, sales, marketing, and also
executives.
 The five C’s need to be part of every organization’s culture.
 Product and design reviews should go over the five Cs regularly.
 They should consider developing a checklist before releasing a project to the public. All too
often, we think of data products only as minimum viable products.

What is ethics and security?

 It's the knowledge of right and wrong, and the ability to adhere to ethical principles while
on the job.
 Simply put, actions that are technically compliant may not be in the best interest of the
customer or the company, and security professionals need to be able to judge these matters
accordingly.

What defines ethics in information security?

 Ethics can be defined as a moral code by which a person lives.


 For corporations, ethics can also include the framework you develop for what is or isn’t
acceptable behaviour within your organization.
 In computer security, cyber-ethics is what separates security personnel from the hackers.
It’s the knowledge of right and wrong, and the ability to adhere to ethical principles while
on the job.

 Simply put, actions that are technically compliant may not be in the best interest of the
customer or the company, and security professionals need to be able to judge these matters
accordingly.

Why is ethics significant to information security?

 The data targeted in cyber-attacks is often personal and sensitive.


 Loss of that sensitive data can be potentially devastating for your customers, and it’s crucial
that you have the full trust of the individuals you’ve hired to protect it.
 Cybersecurity professionals have access to the sensitive personal data they were hired to
protect.
 So it’s imperative that employees in these fields have a strong sense of ethics and respect
for the privacy of your customers.
 The field of information technology also expands and shifts so frequently that a strong
ethical core is necessary to navigate it.
 It’s important that your staff can determine what’s in the best interest of your customers and
the company as a whole.
 Specific scenarios that your employees might confront can sometimes be impossible to
foresee, so a strong ethical core can be the foundation that lets employees act in those best
interests even in difficult, unpredictable circumstances.

Building Ethics into a Data-Driven Culture

What is a data-driven culture?

 A data-driven culture is one where the workforce uses analytics and statistics to optimize
their processes and accomplish their tasks.
 Team members and company leaders collect information to learn insights into the impact
of their decisions before implementing new policies or making significant changes in the
workplace.
 People within a data-driven workplace value the insights they can learn from different types
of company analytics, such as data about finances or productivity.
 Having a data-driven environment also involves making information easily accessible
through systems such as databases or reporting software.

What goes into a data-driven culture?

There are several components that contribute to a data-driven culture:

Data maturity

 Data maturity is how the information you store and retrieve improves over time.
 This means having data with important metadata, limited duplicates and accurate
information.
 Achieving data maturity requires a company to have governance over the processes and
maintenance of its information.
 This can then provide valuable guidance for team members executing their tasks based on
the information they have.

Data leadership

 Data leadership means managers and leadership ensure accurate storage and maintenance of
data.
 They understand the importance of this information to help teams make the right decision.
 They also lead this type of culture by making decisions based on the information they have,
showing how this can be most effective.

Data literacy

 Data literacy means the information a company stores is accessible, readable and usable to
all people.
 This often means storing data in a structured way.
 An important part of this can also be training employees on how to understand and use data
so that they can make decisions and evaluate information effectively.

Ethics into a Data-Driven Culture

 The main aim is to empower all employees to actively use data to enhance their daily work
and to reach their potential by making decisions easier, customer conversations more useful
and to be more strategic.

 To become a data-driven organization, start by taking these steps:

1. Start at the top

 Data-driven culture starts at the top.


 Companies with strong data-driven cultures tend to have managers who set an expectation that
decisions must be anchored in data all the time.
 They lead through example and board reports refer to data-driven decisions made throughout
the organization.

2. Improve data transparency and organization

 A common struggle, for companies of all sizes, is keeping data organized.


 New solutions are often implemented without consideration of whether they’re compatible
with existing software.
 One functional area may prefer the features of one kind of software, while another team opts
for a different legacy product.
 The lack of organization leads to confusion and subsequently, poor data transparency.
 Data-driven organizations benefit from more precise methods, which improve data
transparency.

3. Standardize processes

 Your team may be comfortable conducting business one way; another team may prefer
different methods.
 While each team’s processes may work fine individually, enough differences exist to cause
hiccups when forced to merge.
 Data-driven organizations rely on standardized processes, which let data flow with routine
and predictability.

4. Measure with variety

 Rather than measuring the same things each year, it pays to mix it up.
 Organizations that are data-driven are in an advantageous position – having data at their
fingertips - because they can identify a mix of measurement criteria.
 The variety of that approach offers greater insight into data and a richer set of predictive
tools.

5. Invest in analytics tools

 Data-driven organizations appreciate the value of a business intelligence (BI) solution that
delivers analytics functionality.
 Cloud BI tools like Phocas provide real-time value while being scalable, secure, and
available on-demand.
 Mobile functionality adds an additional layer of opportunity, allowing users in data-driven
cultures to access information from anywhere, allowing remote productivity.

Benefits of creating a data-driven culture

There are many advantages to promoting data analysis on your team:

Encouraging teamwork

 When you work in an environment that values evidence-based projects, you can improve
collaboration between team members in an organization.
 People rely on the IT department to share and manage data for their projects.
 They produce reports to share with other teams and develop models and projections that
anyone in the company can use.
 Anyone who conducts research with company data can collaborate with others throughout
the process and share the outcomes with their team and other departments.

Remaining Competitive

 Using research and quantitative data as a tool allows you to remain competitive with other
companies as your industry develops.
 Applying data-driven insights helps you adapt and determine which trends are most relevant
to your particular environment.
 By consistently monitoring changes in data, you align with the needs of modern customers
and can incorporate useful innovations into your workflows.


Refining strategies

 When a team has a data-driven mindset, it's easier for them to identify when a strategy works
well and when it needs improvement.
 By analyzing data and truly valuing the insights you learn, you can constantly adjust the
methods you use to complete your tasks.
 You can apply this mindset to everything you do in the workplace, allowing you to optimize
efficiency in all aspects of your individual work and organizational operations.


Identifying cause and effect

 Having an extensive record of company data gives you the ability to identify trends and
patterns to determine potential causes of successes and failures.
 Using data to recognize cause and effect can inform your long-term choices.
 For example, you can review financial data for the year and compare it to your product
releases to determine if offering a new product increased the popularity of your brand.

Improving outcomes

 When you use facts to make decisions, you can improve the overall outcomes of your
choices.
 From sales numbers to productivity to customer satisfaction, you can increase your metrics
in the workplace.

What will be the future of data science?

 The Data Science future is studded with career opportunities.


 The future of Data Science by 2030 is estimated to bring opportunities in areas such as banking,
finance, insurance, entertainment, telecommunication, automobiles, etc.
 A data scientist will help grow an organization by assisting them in making better decisions.

There are three types of Data Science careers:

Data Analyst - A data analyst collects data from a database. They also summarize results after data
processing.

Data Scientists - They manage, mine, and clean the data. They are also responsible for building
models to interpret big data and analyze the results.

Data Engineer - This person mines data to get insights from it. They are also responsible for
maintaining data design and architecture, and for developing large data warehouses with the help of
extract, transform, load (ETL) processes.


Scope of Data Science in India

 The field of Data Science is one of the fastest growing in India.


 In recent years, there has been a surge in the amount of data available, and businesses are
increasingly looking for ways to make use of this data. As a result, data scientists are in high
demand.
 Data Science is a relatively new field, covering a wide range of topics, from machine
learning and artificial intelligence to statistics and cloud computing.
 Data Science is a relatively new field in India, so there is still a lot of excitement and interest
surrounding it.
 The potential applications of data science are vast, and Indian businesses are just beginning
to scratch the surface of what is possible.
 Many Indian companies are investing heavily in Data Science as they realize the competitive
advantage that it can provide.
 The Indian government also supports Data Science careers in India, investing in
infrastructure and initiatives to promote adopting data-driven practices.
 The talent pool of data scientists in India is rapidly growing as more people see data science's
future scope in India.
 There are already many success stories of Data Science applications in India, and this will
only likely continue in the future.
