ITDS Unit 1 - Merged
Type of Data
Data is classified into four major categories:
Nominal data
Ordinal data
Discrete data
Continuous data
Nominal Data
Nominal data is a type of qualitative information that labels variables without providing any numerical value. Nominal data is also called the nominal scale. It cannot be ordered or measured (although data can sometimes be both qualitative and quantitative). Examples of nominal data are letters, symbols, words, gender, etc.
Nominal data is examined using the grouping method: the data is grouped into categories, and then the frequency or the percentage of each category is calculated. This data is usually represented visually using pie charts.
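As a quick illustration, the grouping method can be sketched in Python with pandas (the gender values below are made-up sample data):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical nominal variable: labels with no order or numeric meaning.
gender = pd.Series(["Male", "Female", "Female", "Male", "Female"])

counts = gender.value_counts()                        # frequency of each category
percentages = gender.value_counts(normalize=True) * 100
print(counts)
print(percentages)

counts.plot.pie(autopct="%.0f%%")                     # nominal data is often shown as a pie chart
plt.show()
```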
Ordinal Data
Ordinal data is a type of data that follows a natural order. The significant feature of ordinal data is that the differences between the data values are not determined. This kind of variable is mostly found in surveys, finance, economics, questionnaires, and so on.
Ordinal data is commonly represented using a bar chart. It is investigated and interpreted through many visualisation tools. The information may also be expressed using tables, in which each row shows a distinct category.
Discrete Data
Discrete data can take only discrete values. Discrete information contains only a finite number of
possible values. Those values cannot be subdivided meaningfully. Here, things can be counted in
whole numbers.
Example: Number of students in the class
Continuous Data
Continuous data is data that can be measured. It has an infinite number of possible values that can occur within a given range.
Example: Temperature range
With the passage of time and the evolution of technologies, civilizations, and culture, the
methodologies used to capture, store, process, and use facts have evolved. Similarly, data (a
representation of facts) and data management have had their own evolution cycles and they continue
to evolve.
Until the advent of computers, limited facts were documented, given the expense and scarcity of
resources and effort to store and maintain them. In ancient times, it was not uncommon for
knowledge to be transferred from one generation to another by the process of oral learning. The
oral tradition is a contrast to the current digital age, which has elaborate document and content
management systems that store knowledge in the form of documents and records.
The data that is collected is known as raw data, which is not useful on its own; cleaning out the impurities and using the data for further analysis turns it into information, and the insight obtained from that information is known as "knowledge". Knowledge can take many forms, such as business knowledge about sales of enterprise products, knowledge about disease treatment, and so on. The main goal of data collection is to collect information-rich data.
Data collection starts with asking questions such as what type of data is to be collected and what the source of collection is. Most collected data is of two types: "qualitative data", which is non-numerical data such as words and sentences and mostly focuses on the behaviour and actions of a group, and "quantitative data", which is in numerical form and can be calculated using different scientific tools and sampling methods.
The actual data is then further divided mainly into two types known as:
Primary data
Data that is raw, original, and extracted directly from official sources is known as primary data. This type of data is collected directly by performing techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise it becomes a burden in data processing.
1. Interview method:
2. Survey method:
3. Observation method:
4. Experimental method:
Secondary data:
Secondary data is data that has already been collected and is reused for another valid purpose. This type of data is derived from previously recorded primary data, and it comes from two types of sources: internal sources and external sources.
Internal source:
External source:
Other sources:
Sensors data
Satellites data
Web traffic
1. Gather data
The data preparation process begins with finding the right data. This can come from an existing data
catalog or data sources can be added ad-hoc.
2. Discover and assess data
After collecting the data, it is important to discover each dataset. This step is about getting to know
the data and understanding what has to be done before the data becomes useful in a particular
context.
3. Cleanse and validate data
Cleaning up the data is traditionally the most time-consuming part of the data preparation process, but it's crucial for removing faulty data and filling in gaps. Important tasks here include removing duplicates, correcting errors, handling missing values, and standardizing formats.
Data scientists usually have a good sense of data and analytics, but data science projects are much
more than that. A data science project may involve people with different roles, especially in a large
company:
• The business owner or leader who identifies business problem and value;
• The data owner and computation resource/infrastructure owner from the IT department;
• A dedicated policy owner to make sure the data and model are under model governance, security
and privacy guidelines and laws;
• A dedicated engineering team to implement, maintain and refresh the model;
The entire team will usually have multiple rounds of discussion about resource allocation among groups, both at the beginning of the project and while it is under way.
Effective communication and in-depth domain knowledge about the business problem are essential
requirements for a successful data scientist. A data scientist may interact with people at various
levels, from senior leaders who set the corporate strategies to front-line employees who do the daily
work.
In the past few years, by utilizing commodity hardware and open-source software, people created a big data ecosystem for data storage, data retrieval, and parallel computation. Hadoop and Spark have become popular platforms that enable data scientists, statisticians, and analysts to access the data and to build models. Programming skills on these big data platforms have been an obstacle for a traditional statistician or analyst trying to become a successful data scientist.
Fortunately, cloud computing reduces this difficulty significantly. The user interface of the data platform is much more friendly today, and much of the technical detail is pushed to the background. Today's cloud systems also enable quick implementation of the production environment. Data science now places more emphasis on the data itself, and on the models and algorithms on top of the data, than on the platform, infrastructure, and low-level programming in languages such as Java.
Data and code stored on a hard disk are slow to read and write, but disks offer large capacity, around a few TB in today's market. Memory is fast to read and write but has a small capacity, on the order of a few dozen GB in today's market.
Hadoop
The very first problem internet companies face is that a lot of data has been collected, and it must be stored well for future analysis. Google developed its own file system to provide efficient, reliable access to data using large clusters of commodity hardware.
The open-source version is known as the Hadoop Distributed File System (HDFS). Both systems use MapReduce to allocate computation across computation nodes on top of the file system. Hadoop is written in Java, and writing a MapReduce job in Java is the most direct way to interact with Hadoop, which is not familiar to many in the data and analytics community.
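As a conceptual sketch only (not the actual Hadoop Java API), the map and reduce phases of a word count can be mimicked in plain Python:

```python
from collections import defaultdict

# Toy input records; in Hadoop these would be blocks of a file in HDFS.
documents = ["big data is big", "data science uses big data"]

# Map phase: emit (word, 1) pairs from each input record.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word.
reduced = {word: sum(counts) for word, counts in grouped.items()}
print(reduced)  # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}
```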
Spark
Spark works on top of a distributed file system including HDFS with better data and analytics
efficiency by leveraging in-memory operations. Spark is more tailored for data processing and
analytics and the need to interact with Hadoop directly is greatly reduced.
The Spark system includes an SQL-like framework called Spark SQL and a parallel machine learning library called MLlib. Fortunately for many in the analytics community, Spark also supports R and Python. We can easily interact with data stored in a distributed file system, using parallel computing across nodes, with R and Python through the Spark API, and we do not need to worry about lower-level details of distributed computing. We will introduce how to use an R notebook to drive Spark computations.
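For a quick feel of the API, here is a minimal PySpark sketch; it assumes the pyspark package and a Spark runtime are available, and the file name metrics.csv is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a CSV file from the (distributed) file system into a Spark DataFrame.
df = spark.read.csv("metrics.csv", header=True, inferSchema=True)

# Spark SQL: register the DataFrame as a table and query it in parallel.
df.createOrReplaceTempView("metrics")
spark.sql("SELECT * FROM metrics LIMIT 10").show()

spark.stop()
```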
There are many cloud computing environments such as Amazon’s AWS, Google cloud and
Microsoft Azure which provide a complete list of functions for heavy-duty enterprise applications.
For example, Netflix runs its business entirely on AWS without owning any data centers. For
beginners, however, Databricks provides an easy to use cloud system for learning purposes.
Databricks is a company founded by the creators of Apache Spark, and it provides a user-friendly web-based notebook environment that can create a Spark cluster on the fly to run R/Python/Scala/SQL scripts.
As data grows bigger, database knowledge becomes essential for statisticians, analysts and data scientists in an enterprise environment, where data is stored in some form of database system.
Databases often contain a collection of tables and the relationship among these tables (i.e. schema).
The table is the fundamental structure for databases that contain rows and columns similar to data
frames in R or Python. Database management systems (DBMS) ensure data integrity and security in real-time operations.
There are many different DBMS such as Oracle, SQL Server, MySQL, Teradata, Hive, Redshift
and Hana. The majority of database operations are very similar among different DBMS, and
Structured Query Language (SQL) is the standard language to use these systems.
show databases: show the current databases in the system
create database db_name: create a new database with name db_name
drop database db_name: delete database db_name (be careful when using it!)
use db_name: set up the current database to be used
show tables: show all the tables within the currently used database
describe tbl_name: show the structure of the table with name tbl_name (i.e. the list of column names and data types)
drop table tbl_name: delete the table with name tbl_name (be careful when using it!)
select * from metrics limit 10: show the first 10 rows of the table named metrics
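The same kinds of operations can be exercised from Python. The sketch below uses the built-in sqlite3 module, so the SHOW/DESCRIBE commands above (which are MySQL/Hive-style) are replaced with their SQLite equivalents, and the table contents are invented:

```python
import sqlite3

# In-memory database for illustration; a real DBMS would be reached
# through a driver such as mysql-connector or pyodbc instead.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table named "metrics" and insert a few rows.
cur.execute("CREATE TABLE metrics (id INTEGER, name TEXT, value REAL)")
cur.executemany("INSERT INTO metrics VALUES (?, ?, ?)",
                [(1, "clicks", 120.0), (2, "views", 450.0), (3, "sales", 37.5)])

# SQLite equivalent of "show tables".
cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
print(cur.fetchall())

# SQLite equivalent of "describe metrics" (column names and types).
cur.execute("PRAGMA table_info(metrics)")
print(cur.fetchall())

# Show the first 10 rows of the table.
cur.execute("SELECT * FROM metrics LIMIT 10")
print(cur.fetchall())

conn.close()
```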
Data Pre-processing
Data preprocessing is the process of converting raw data into clean data that is proper for modeling. A model can fail for various reasons; one is that the modeller doesn't correctly pre-process the data before modelling. Data pre-processing can significantly impact model results, for example through how missing values are imputed and how outliers are handled. So data pre-processing is a very critical step.
In real life, depending on the stage of data cleanup, data has the following types:
1. Raw data
2. Technically correct data
3. Data that is proper for the model
4. Summarized data
5. Data with fixed format
Raw data is the first-hand data that analysts pull from the database, market survey responses from your clients, the experimental results collected by the research and development department, and so on.
Technically correct data is data that, after preliminary cleaning or format conversion, R (or another tool you use) can successfully import.
Once we have loaded the data into R with reasonable column names, variable formats and so on, that does not mean the data is entirely correct. There may be some observations that do not make sense, such as a negative age, a discount percentage greater than 1, or missing data. Depending on the situation, there may be a variety of problems with the data. It is necessary to clean the data before modeling.
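A minimal pandas sketch of this kind of cleaning (the column names and values below are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with the kinds of problems described above.
raw = pd.DataFrame({
    "age": [25, -3, 41, np.nan],
    "discount": [0.10, 1.50, 0.25, 0.05],
    "income": [52000, 61000, np.nan, 48000],
})

clean = raw.copy()
# Treat impossible values as missing rather than silently keeping them.
clean.loc[clean["age"] < 0, "age"] = np.nan
clean.loc[clean["discount"] > 1, "discount"] = np.nan
# Impute missing numeric values with the column median (one simple choice).
clean = clean.fillna(clean.median(numeric_only=True))
print(clean)
```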
Data analysts will take the results from data scientists and adjust the format, such as labels, cell colors, and highlighting. It is important for a data scientist to make sure the results look consistent, which makes the next step easier for data analysts.
It is highly recommended to store each step of the data and the R code, making the whole process as repeatable as possible. An R Markdown reproducible report is extremely helpful for that. If the data changes, it is easy to rerun the process.
A data wrangling process, also known as a data munging process, consists of reorganizing,
transforming and mapping data from one "raw" form into another in order to make it more usable
and valuable for a variety of downstream uses including analytics.
Data wrangling can be defined as the process of cleaning, organizing, and transforming raw data
into the desired format for analysts to use for prompt decision-making. Also known as data
cleaning or data munging, data wrangling enables businesses to tackle more complex data in less
time, produce more accurate results, and make better decisions. The exact methods vary from
project to project depending upon your data and the goal you are trying to achieve. More and
more organizations are increasingly relying on data wrangling tools to make data ready for
downstream analytics.
Data wrangling serves several purposes:
- Making raw data usable: accurately wrangled data guarantees that quality data enters the downstream analysis.
- Getting all data from various sources into a centralized location so it can be used.
- Automated data integration tools are used as data wrangling techniques that clean and convert source data into a standard format that can be reused according to end requirements.
- Cleansing the data of noise and of flawed or missing elements.
- Data wrangling acts as a preparation stage for the data mining process, which involves gathering data and making sense of it.
Unstructured data is often text-heavy and contains things such as Dates, Numbers, ID codes, etc.
At this stage of the Data Wrangling process, the dataset needs to be parsed.
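Parsing often means pulling dates, numbers, and ID codes out of free text with patterns. A small sketch using Python's re module (the record format below is invented):

```python
import re

# Hypothetical free-text record containing an ID code, a date, and a number.
record = "Order ORD-2291 placed on 2023-04-17, total 149.99"

parsed = {
    "order_id": re.search(r"ORD-\d+", record).group(),
    "date": re.search(r"\d{4}-\d{2}-\d{2}", record).group(),
    "total": float(re.search(r"total (\d+\.\d+)", record).group(1)),
}
print(parsed)  # {'order_id': 'ORD-2291', 'date': '2023-04-17', 'total': 149.99}
```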
Digital data, in information theory and information systems, is information represented as a string
of discrete symbols, each of which can take on one of only a finite number of values from some
alphabet, such as letters or digits. An example is a text document, which consists of a string of
alphanumeric characters.
What is Data?
The quantities, characters, or symbols on which operations are performed by a computer, which
may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical,
or mechanical recording media.
1. The New York Stock Exchange is an example of Big Data that generates about one terabyte
of new trade data per day.
2. Social Media: Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, posting comments, etc.
3. A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, the data generated reaches many petabytes.
Structured
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data. Over time, talent in computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and deriving value out of it. However, nowadays we are foreseeing issues as the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.
Unstructured
Any data with unknown form or the structure is classified as unstructured data. In addition to the
size being huge, un-structured data poses multiple challenges in terms of its processing for deriving
value out of it. A typical example of unstructured data is a heterogeneous data source containing a
combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of data available to them but, unfortunately, they don't know how to derive value from it since this data is in its raw, unstructured form.
Semi-structured
Semi-structured data can contain both forms of data. Semi-structured data appears structured in form, but it is not actually defined by, for example, a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
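Such records can be parsed programmatically. A minimal sketch using Python's standard xml.etree.ElementTree (a wrapping <people> root is added so the snippet forms a valid XML document):

```python
import xml.etree.ElementTree as ET

xml_data = """<people>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
</people>"""

root = ET.fromstring(xml_data)
for rec in root.findall("rec"):
    # Each record carries its own tags, so fields are read by name.
    print(rec.find("name").text, rec.find("sex").text, int(rec.find("age").text))
```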
Volume
Variety
Velocity
Variability
(i) Volume – The name Big Data itself is related to an enormous size. The size of data plays a very crucial role in determining the value that can be derived from it. Whether particular data can actually be considered Big Data or not also depends upon its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data solutions.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
During earlier days, spreadsheets and databases were the only sources of data considered by most
of the applications. Nowadays, data in the form of emails, photos, videos, monitoring devices,
PDFs, audio, etc. are also being considered in the analysis applications. This variety of unstructured
data poses certain issues for storage, mining and analyzing data.
(iii) Velocity – The term ‘velocity’ refers to the speed of generation of data. How fast the data is
generated and processed to meet the demands, determines real potential in the data.
Big Data Velocity deals with the speed at which data flows in from sources like business processes,
application logs, networks, and social media sites, sensors, Mobile devices, etc. The flow of data is
massive and continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the data at times, thus
hampering the process of being able to handle and manage the data effectively.
Big Data technologies can be used for creating a staging area or landing zone for new data before
identifying what data should be moved to the data warehouse. In addition, such integration of Big
Data technologies and data warehouse helps an organization to offload infrequently accessed data.
1. Data Redundancy: Data redundancy means, the same information is repeated in several
files.
2. Data Inconsistency: Data inconsistency arises when there is data redundancy. It means that the various copies of the same data in different files do not all get updated when a change is made in one of them. As a result, an application program may be unable to obtain the required information, either because no suitable program exists in the list of application programs or because the fields of the file differ from what was assumed at the time of application design.
3. Data Isolation: The data is scattered in various files with different formats. Therefore, it is
difficult to write a new application program and hence difficult to retrieve appropriate data from the
files.
4. Integrity Problems: The data values stored in a file must satisfy certain data integrity constraints. The programmers need to enforce these constraints and the data must be validated from time to time; the file system has limitations here.
5. Concurrent Access: The system is required to allow multiple users to access and update the data simultaneously, instead of being a single-user system. Concurrent interaction may result in inconsistency.
6. Security Problems: The system should not give unauthorized users access to the data, as it is important and sensitive. It should allow only those users who have been given privileges to access and manipulate the data.
Data Science is the area of study which involves extracting insights from vast amounts of
data using various scientific methods, algorithms, and processes.
It helps you to discover hidden patterns from the raw data.
The term Data Science has emerged because of the evolution of mathematical statistics, data
analysis, and big data.
Data Science is an interdisciplinary field that allows you to extract knowledge from
structured or unstructured data.
Data science enables you to translate a business problem into a research project and then
translate it back into a practical solution.
Data is the oil of today's world. With the right tools, technologies, and algorithms, we can use data and convert it into a distinct business advantage.
Data Science can help you to detect fraud using advanced machine learning algorithms.
It helps you to prevent significant monetary losses.
It allows you to build intelligent capabilities into machines.
You can perform sentiment analysis to gauge customer brand loyalty
It enables you to take better and faster decisions
It helps you to recommend the right product to the right customer to enhance your business
1. Statistics:
Statistics is the most critical unit of Data Science basics, and it is the method or science of collecting
and analyzing numerical data in large quantities to get useful insights.
2. Visualization:
Visualization techniques help you present huge amounts of data as easy-to-understand and digestible visuals.
3. Machine Learning:
Machine Learning explores the building and study of algorithms that learn to make predictions
about unforeseen/future data.
4. Deep Learning:
Deep Learning is a newer area of machine learning research in which the algorithm itself selects the analysis model to follow.
1. Discovery:
Discovery step involves acquiring data from all the identified internal & external sources, which
helps you answer the business question.
2. Preparation:
Data can have many inconsistencies, like missing values, blank columns, and incorrect data formats, which need to be cleaned. You need to process, explore, and condition data before modelling. The cleaner your data, the better your predictions.
3. Model Planning:
In this stage, you need to determine the method and technique to draw the relation between input
variables. Planning for a model is performed by using different statistical formulas and visualization
tools. SQL Analysis Services, R, and SAS/ACCESS are some of the tools used for this purpose.
4. Model Building:
In this step, the actual model building process starts. Here, the data scientist splits the dataset into training and testing sets. Techniques like association, classification, and clustering are applied to the training data set. The model, once prepared, is tested against the "testing" dataset.
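A minimal sketch of this split-and-evaluate workflow with scikit-learn (the dataset and classifier below are chosen purely for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Split the data into a training set and a held-out testing set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Build the model on the training data, then test it on unseen data.
model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```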
5. Operationalize:
You deliver the final baselined model with reports, code, and technical documents in this stage. The model is deployed into a real-time production environment after thorough testing.
6. Communicate Results
In this stage, the key findings are communicated to all stakeholders. This helps you decide if the
project results are a success or a failure based on the inputs from the model.
Data Scientist
Data Engineer
Data Analyst
Statistician
Data Architect
Data Admin
Business Analyst
Data/Analytics Manager
1. Internet Search:
Google search uses Data science technology to search for a specific result within a fraction of a
second
2. Recommendation Systems:
Data Science drives recommendation systems, which suggest the right product or content to the right customer, for example the product suggestions shown on e-commerce sites.
3. Speech and Image Recognition:
Speech recognition systems like Siri, Google Assistant, and Alexa run on Data Science techniques. Moreover, Facebook recognizes your friend when you upload a photo with them, with the help of Data Science.
4. Gaming world:
EA Sports, Sony, Nintendo are using Data science technology. This enhances your gaming
experience. Games are now developed using Machine Learning techniques, and they can update
themselves when you move to higher levels.
5. Online Price Comparison:
PriceRunner, Junglee, and Shopzilla work on the Data Science mechanism. Here, data is fetched from the relevant websites using APIs.
Summary
Data Science is the area of study that involves extracting insights from vast amounts of data
by using various scientific methods, algorithms, and processes.
Statistics, Visualization, Deep Learning, Machine Learning are important Data Science
concepts.
Data Science Process goes through Discovery, Data Preparation, Model Planning, Model
Building, Operationalize, Communicate Results.
Important Data Scientist job roles are: 1) Data Scientist 2) Data Engineer 3) Data Analyst
4) Statistician 5) Data Architect 6) Data Admin 7) Business Analyst 8) Data/Analytics
Manager.
R, SQL, Python, and SAS are essential Data Science tools.
Business Intelligence looks backwards at what has already happened, while Data Science looks forward and makes predictions.
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Regression
In regression problems, the output is a continuous variable. Some commonly used Regression
models are as follows:
a) Linear Regression
Linear regression is the simplest machine learning model in which we try to predict one
output variable using one or more input variables.
The representation of linear regression is a linear equation, which combines a set of input values (x) and the predicted output (y) for that set of input values.
It is represented in the form of a line: y = bx + c.
The main aim of the linear regression model is to find the best-fit line through the data points.
Linear regression is extended to multiple linear regression (find a plane of best fit) and
polynomial regression (find the best fit curve).
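A minimal scikit-learn sketch of fitting the line y = bx + c (the toy data below is made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data roughly following y = 2x + 1 with a little noise.
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([3.1, 4.9, 7.2, 9.0, 11.1])

model = LinearRegression().fit(x, y)
print("slope b:", model.coef_[0], "intercept c:", model.intercept_)
print("prediction at x = 6:", model.predict([[6]])[0])
```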
b) Decision Tree
Decision trees are the popular machine learning models that can be used for both regression
and classification problems.
A decision tree uses a tree-like structure of decisions along with their possible consequences
and outcomes.
In this, each internal node is used to represent a test on an attribute; each branch is used to
represent the outcome of the test.
The more nodes a decision tree has, the more accurate the result will be.
The advantage of decision trees is that they are intuitive and easy to implement, but they
lack accuracy.
Decision trees are widely used in operations research, specifically in decision analysis,
strategic planning, and mainly in machine learning.
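As an illustrative sketch, scikit-learn can fit a small decision tree and print the learned tests and branches (dataset chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow classification tree and print its decision rules:
# each internal node tests an attribute, each branch is a test outcome.
X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=load_iris().feature_names))
```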
c) Random Forest
Random Forest is an ensemble learning method which consists of a large number of decision trees.
Each decision tree in a random forest predicts an outcome, and the prediction with the
majority of votes is considered as the outcome.
A random forest model can be used for both regression and classification problems.
For the classification task, the outcome of the random forest is taken from the majority of
votes.
Whereas in the regression task, the outcome is taken from the mean or average of the
predictions generated by each tree.
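A short scikit-learn sketch showing the majority-vote idea (dataset chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# An ensemble of decision trees; the classification output is the
# majority vote of the individual trees (the mean for regression).
X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

sample = X[:1]
votes = [int(t.predict(sample)[0]) for t in forest.estimators_]
print("first ten tree votes:", votes[:10])
print("forest prediction (majority vote):", forest.predict(sample)[0])
```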
d) Neural Networks
Neural networks are the subset of machine learning and are also known as artificial neural
networks.
Neural networks are made up of artificial neurons and designed in a way that resembles the
human brain structure and working.
Each artificial neuron connects with many other neurons in a neural network, and such
millions of connected neurons create a sophisticated cognitive structure.
Neural networks consist of a multilayer structure, containing one input layer, one or more
hidden layers, and one output layer.
As each neuron is connected with other neurons, it transfers data from one layer to the neurons of the next layer.
Finally, data reaches the last layer or output layer of the neural network and generates output.
Neural networks depend on training data to learn and improve their accuracy. However, a
perfectly trained & accurate neural network can cluster data quickly and become a powerful
machine learning and AI tool.
One of the best-known neural networks is Google's search algorithm.
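A minimal multilayer network can be sketched with scikit-learn's MLPClassifier (the dataset and layer sizes below are arbitrary illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# One input layer, two hidden layers (64 and 32 neurons), one output layer.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))
```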
Classification
Classification models are the second type of Supervised Learning techniques, which are
used to generate conclusions from observed values in the categorical form.
For example, the classification model can identify if the email is spam or not; a buyer will
purchase the product or not, etc.
Classification algorithms are used to predict discrete classes and categorize the output into different groups.
In classification, a classifier model is designed that classifies the dataset into different
categories, and each category is assigned a label.
Binary classification: if the problem has only two possible classes, the model is called a binary classifier. For example: cat or dog, Yes or No.
Multi-class classification: If the problem has more than two possible classes, it is a multi-
class classifier.
a) Logistic Regression
Logistic Regression is used to solve the classification problems in machine learning.
It is similar to linear regression but is used to predict categorical variables.
It can predict the output as Yes or No, 0 or 1, True or False, etc. However, rather than giving exact values, it provides probabilistic values between 0 and 1.
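A tiny scikit-learn sketch showing how the model returns a probability between 0 and 1 (the hours-studied data is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary problem: hours studied -> pass (1) or fail (0).
hours = np.array([[1], [2], [3], [4], [5], [6]])
passed = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(hours, passed)
print("predicted class at 3.5 hours:", clf.predict([[3.5]])[0])
print("probability of passing:", clf.predict_proba([[3.5]])[0, 1])  # between 0 and 1
```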
c) Naïve Bayes
Naïve Bayes is another popular classification algorithm used in machine learning.
It is called so because it is based on Bayes' theorem and makes the naive assumption that the features are independent of one another. Bayes' theorem is given as:
P(A|B) = P(B|A) P(A) / P(B)
Each naïve Bayes classifier assumes that the value of a specific feature is independent of any other feature. For example, if a fruit needs to be classified based on colour, shape, and taste, then a yellow, oval, and sweet fruit will be recognized as a mango. Here each feature is treated independently of the other features.
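A hedged sketch of this fruit example with scikit-learn's CategoricalNB (the tiny training set is invented):

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Made-up fruit data: colour, shape, taste -> fruit label.
features = [["yellow", "oval", "sweet"],
            ["red", "round", "sweet"],
            ["green", "round", "sour"],
            ["yellow", "oval", "sweet"]]
labels = ["mango", "apple", "lime", "mango"]

enc = OrdinalEncoder()
X = enc.fit_transform(features).astype(int)   # encode categories as integers
clf = CategoricalNB().fit(X, labels)

query = enc.transform([["yellow", "oval", "sweet"]]).astype(int)
print(clf.predict(query)[0])                  # -> 'mango'
```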
Unsupervised learning models are mainly used to perform three tasks, which are as follows:
Clustering
Clustering is an unsupervised learning technique that involves grouping the data points into different clusters based on their similarities and differences.
The objects with the most similarities remain in the same group, and they have no or very few similarities with objects in other groups.
Clustering algorithms are widely used in different tasks such as image segmentation, statistical data analysis, market segmentation, etc.
Some commonly used clustering algorithms are K-means clustering, hierarchical clustering, DBSCAN, etc.
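A minimal K-means sketch on synthetic points (generated with make_blobs purely for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points grouped into three clusters by similarity.
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("cluster labels of the first 10 points:", kmeans.labels_[:10])
print("cluster centres:\n", kmeans.cluster_centers_)
```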
Dimensionality Reduction
The number of features/variables present in a dataset is known as the dimensionality of the
dataset, and the technique used to reduce the dimensionality is known as the dimensionality
reduction technique.
Although more features can provide more information, they can also hurt the performance of the model/algorithm, for example through overfitting. In such cases, dimensionality reduction techniques are used.
"It is a process of converting the higher dimensions dataset into lesser dimensions dataset
ensuring that it provides similar information."
Common dimensionality reduction methods include PCA (Principal Component Analysis), Singular Value Decomposition (SVD), etc.
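A short PCA sketch reducing the 4-dimensional iris data to 2 dimensions (dataset chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Convert the higher-dimensional dataset into a lower-dimensional one
# while keeping most of the information (variance).
X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print("original shape:", X.shape, "-> reduced shape:", X_2d.shape)
print("variance explained by 2 components:", pca.explained_variance_ratio_.sum())
```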
3. Reinforcement Learning
In reinforcement learning, the algorithm learns actions for a given set of states that lead to a
goal state.
It is a feedback-based learning model that takes feedback signals after each state or action
by interacting with the environment.
This feedback works as a reward (positive for each good action and negative for each bad action), and the agent's goal is to maximize the positive rewards to improve its performance.
The behaviour of the model in reinforcement learning is similar to human learning, as humans learn from experience, using feedback from their interactions with the environment.
Popular algorithms that come under reinforcement learning include Q-Learning and SARSA.
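A minimal tabular Q-learning sketch on a made-up five-state corridor (states, rewards, and hyperparameters are all invented for illustration):

```python
import random

# The agent starts in state 0 and gets a reward of +1 only on reaching state 4.
n_states, actions = 5, (-1, +1)                 # step left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.2           # learning rate, discount, exploration

for episode in range(300):
    state = 0
    while state != n_states - 1:
        # Epsilon-greedy action selection (ties broken randomly).
        if random.random() < epsilon or Q[(state, -1)] == Q[(state, +1)]:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state = min(max(state + action, 0), n_states - 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0
        # Update Q(s, a) toward reward + gamma * max_a' Q(s', a').
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# The learned greedy policy should be "move right" (+1) in every state.
print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)})
```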
A quick comparison of supervised and unsupervised learning:
- A supervised learning model takes direct feedback to check whether it is predicting the correct output or not; an unsupervised learning model does not take any feedback.
- A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in data.
- Supervised learning can be used for cases where we know the inputs as well as the corresponding outputs; unsupervised learning can be used for cases where we have only input data and no corresponding output data.
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data
sets and summarize their main characteristics, often employing data visualization methods.
Exploratory data analysis (EDA) is an approach of analysing data sets to summarize their
main characteristics, often using statistical graphics and other data visualization methods.
A statistical model can be used or not, but primarily EDA is for seeing what the data can
tell us beyond the formal modelling and thereby contrasts traditional hypothesis testing.
Exploratory data analysis has been promoted by John Tukey since 1970 to encourage
statisticians to explore the data, and possibly formulate hypotheses that could lead to new
data collection and experiments.
EDA is different from initial data analysis (IDA), which focuses more narrowly on checking the assumptions required for model fitting and hypothesis testing, and on handling missing values and making transformations of variables as needed.
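A quick EDA sketch in Python with pandas (the small DataFrame below is invented; any dataset with numeric columns would do):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"age": [23, 35, 41, 29, 52, 47],
                   "income": [30000, 52000, 61000, 45000, 80000, 72000]})

print(df.describe())   # summary statistics (mean, std, quartiles, ...)
print(df.corr())       # pairwise correlations between numeric columns
df.hist()              # quick distribution plots for each column
plt.show()
```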
Big data is a phenomenon that is characterized by the rapid expansion of raw data.
This data is being collected and generated so quickly that it is inundating governments and organizations alike.
The challenge is related to how this volume of data is harnessed (tied together), and the opportunity is related to how value and insight can be derived from it.
1. Volume. Big data solutions must manage and process larger amounts of data.
2. Velocity. Big data solutions must process more rapidly arriving data.
3. Variety. Big data solutions must deal with more kinds of data, both structured and
unstructured.
4. Veracity. Big data solutions must validate the correctness of the large amount of rapidly
arriving data.
Enterprise Information Management (EIM) has grown out of traditional information management practices due to the explosion of data and the rise of the Information Enterprise.
EIM enables businesses to secure their information across diverse and complex landscapes of systems and sources.
Enterprise Information Management software helps businesses attain 360-degree views of their big data and analytics by streamlining organizational workflows, increasing the quality of information and creating integrated user interfaces for end users within a single-source platform.
OpenText offers EIM software systems and services that let you build a cohesive information management strategy that leverages existing assets and meets urgent needs. OpenText's view is that good information strategy is good business strategy, and that an effective EIM programme delivers Information Confidence.
The building blocks of EIM include:
- Master Data Management: enabling flexible, efficient and effective business processes built on consistent, trusted data
- Analytics: enabling big-data-related skills to leverage the full potential of data
- Governance and organization: defining and setting up EIM-related organizations, roles and responsibilities
- Enterprise architecture: the big picture – ensuring that processes, data, technology and applications fit together
- Risk and compliance: safeguarding information/data security, privacy and compliance with legal frameworks
Big data has brought game-changing shifts to the way data is acquired, analyzed, stored, and
used. Solutions can be more flexible, more scalable, and more cost-effective than ever before.
Instead of building one-off systems designed to address specific problems for specific business
units, companies can create a common platform leveraged in different ways by different parts of
the business and all kinds of data — structured and unstructured, internal and external can be
incorporated.
Data Usage: Identifying Opportunities and Building Trust
Companies must create a culture that encourages the use of data to identify new opportunities. They need to focus on trust, too: not just building it with consumers but wielding it as a competitive weapon.
Businesses that use data in transparent and responsible ways will ultimately have more access to it.
Technical platforms that are fast, scalable, and flexible enough to handle different types of applications are critical. So, too, are the skill sets required to build and manage them.
In general, these new platforms will prove remarkably cost-effective, using commodity hardware and leveraging cloud-based and open-source technologies. But their all-purpose nature means they can become disconnected from specific businesses. It's crucial, therefore, to link them back to those businesses and their goals, priorities, and expertise.
Companies will also need to put the insights they gain from big data to use, embedding them in their processes and day-to-day operations.
Big data is creating opportunities that are often outside a company's traditional business or markets, and capturing them may require new relationships with partners, suppliers, and customers. Businesses must be able to identify the right relationships and successfully maintain them.
In a world where information moves fast, businesses that are quick to see, and pursue, the new
ways to work with data are the ones that will get ahead and stay ahead.
In order to examine the truth (or lack thereof) in this line of thinking, we need to start with the
basics.
First, what is big data? There are actually many different forms of big data. But the most widely
understood form of big data is the form found in Hadoop, Cloudera, etc.
There are probably other ramifications and features, but these basic characteristics are a good working description of what most people mean when they talk about a big data solution.
We find that a big data solution is a technology and that data warehousing is an architecture.
A technology is just that – a means to store and manage large amounts of data.
A data warehouse is a way of organizing data so that there is corporate credibility and integrity.
When someone takes data from a data warehouse, that person knows that other people are using the same data.
Data integration, in effect, is the acquisition of data from diverse source systems (like operational applications for ERP, CRM, or the supply chain, where most enterprise data originates, and a host of external sources of data like social networks, external third-party data sources, etc.) through multiple transformations of the data to get it ready for loading into target systems.
Heterogeneity is the norm for both data sources and targets, since there are various types of databases, applications, and file formats involved.
All these have different data models, so the data must be transformed somewhere in the middle of the process.
Then there are the interfaces that connect these pieces, which are equally diverse, and the data doesn't flow uninterrupted or in a straight line, so you need data staging areas.
Simply put, that's a lot of complex and diverse activities that you must perform to organize the data.
Eventually the data integration processes and approaches influence the data model development as well.
Data integration approaches can become highly complex, especially when you are dealing with big data types. Below is an attempt to outline the complexities of data integration processes.
Level 0: Simple point-to-point data integration with little or no transformation; data is simply moved from source to target.
Level 1: Simple data integration processes, transforming one schema to another, without applying any data manipulation functions like "if," "then," "else," etc.
Level 2: Simple data integration processes, transforming one schema to another, with data manipulation functions applied.
Level 3: Complex data integration patterns, transforming the subject data, dealing with complex schemas and semantic management involving both structured and unstructured data.
In this scenario there could be one or more data sources (the data could also be at rest or in motion) and one or more targets.
These design patterns (and there could be many more depending on the applications you are trying to develop and the nature of the data sources) need to be aligned with the right integration technologies.
We purposefully stayed away from discussing the granularity of data, the state of data changes, and governance processes around data; adding those aspects to the data integration patterns increases the complexity further.
This way, non-structured data (such as articles, photos, social media data, videos, or content within a blog post) can be stored in a single document that can be easily found but isn't necessarily categorized into fixed fields the way a relational database requires.
It's more intuitive, but note that storing data in bulk like this requires extra processing effort and more storage than highly organized SQL data. That's why Hadoop, an open-source computing and data analysis platform capable of processing huge amounts of data in the cloud, is so popular.
NoSQL databases offer another major advantage, particularly to app developers: ease of access. NoSQL databases are often able to sidestep the friction of working through SQL by exposing APIs, which allow developers to execute queries without having to learn SQL or understand the underlying architecture of their database system.
1. Key-value model: the least complex NoSQL option, which stores data in a schema-less way that consists of indexed keys and values. Examples: Cassandra, Azure, LevelDB, and Riak.
2. Column store or wide-column store: stores data tables as columns rather than rows. It's more than just an inverted table; sectioning out columns allows for excellent scalability and high performance.
3. Document database: taking the key-value concept and adding more complexity, each document in this type of database has its own data and its own unique key, which is used to retrieve it. It's a great option for storing, retrieving and managing data that's document-oriented but still somewhat structured.
4. Graph database: have data that's interconnected and best represented as a graph? This model, which stores data as nodes and relationships, is built for exactly that case.
What is ACID?
Ask any data professional and they could probably explain the ACID (Atomicity, Consistency, Isolation, Durability) properties.
The concept has been around for decades and until recently was the primary benchmark that all databases strove to achieve; without the ACID requirements in place within a given system, reliability is uncertain.
• Atomicity: Either the task (or all tasks) within a transaction are performed or none of them are.
This is the all-or-none principle. If one element of a transaction fails the entire transaction fails.
• Consistency: The transaction must meet all protocols or rules defined by the system at all times.
The transaction does not violate those protocols and the database must remain in a consistent state
at the beginning and end of a transaction; there are never any half-completed transactions.
• Isolation: No transaction has access to any other transaction that is in an intermediate or unfinished state; each transaction is independent. This is required for both performance and consistency of transactions within a database.
• Durability: Once the transaction is complete, it will persist as complete and cannot be undone;
it will survive system failure, power loss and other types of system breakdowns.
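A small sketch of the all-or-none (atomicity) behaviour using Python's built-in sqlite3 module; the accounts table and the simulated failure are invented for illustration:

```python
import sqlite3

# The two legs of a transfer either commit together or roll back together.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100.0), ("bob", 50.0)])
conn.commit()

def transfer(amount, fail=False):
    try:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
        if fail:
            raise RuntimeError("simulated crash in the middle of the transaction")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))
        conn.commit()                 # both updates become durable together
    except RuntimeError:
        conn.rollback()               # all-or-none: the partial update is undone

transfer(30, fail=True)
print(conn.execute("SELECT * FROM accounts").fetchall())   # balances unchanged
conn.close()
```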
There are of course many facets to those definitions and within the actual ACID requirement of
each particular database, but overall in the RDBMS world, ACID is overlord and without ACID
reliability is uncertain.
The SQL scalability issue was recognized by some Web 2.0 companies with huge
environments, growing data and big infrastructure needs, like Google, Amazon, or Facebook.
They presented their own solutions to the problem – technologies like BigTable, DynamoDB,
and Cassandra.
This interest in alternatives resulted in a number of NoSQL Database Management Systems.
A number of existing indexing structures were reused and improved in order to enhance search and retrieval performance.
The first NoSQL database types were developed by big companies to meet their specific needs, like Google's BigTable, maybe the first NoSQL system, and Amazon's DynamoDB.
The success of these proprietary systems generated big interest, and a number of similar open-source and proprietary database systems appeared, some of the most popular ones being Cassandra, MongoDB, HBase, and Redis.
One important difference between NoSQL databases and common relational databases is that NoSQL databases are schema-less.
This means that NoSQL databases do not have a fixed table structure like the ones found in relational databases.
One important, underlying difference is that NoSQL databases have a simple and flexible structure.
NoSQL databases may follow a column store, document store, key-value store, or graph store model.
Some NoSQL database models also allow developers to store serialized objects into the database.
Open-source NoSQL databases don't require expensive licensing fees and can run on low-cost commodity hardware.
Also, when working with NoSQL databases, either open-source or proprietary, scaling is easier and cheaper.
This is because it's done by scaling horizontally and distributing the load across all nodes, rather than the vertical scaling usually done with relational database systems, which means replacing the main host with a more powerful one.
Disadvantages
First, most NoSQL databases do not support reliability features that are natively supported by
relational databases.
This also means that NoSQL databases, which don't support those features, trade consistency for performance and scalability.
In order to support reliability and consistency, developers must implement their own code, which adds complexity to the application.
This might limit the number of applications that can rely on NoSQL databases for secure and reliable transactions.
Another problem found in most NoSQL databases is incompatibility with SQL queries.
This means that a manual or proprietary querying language is needed, adding more time and
complexity.
This table provides a quick feature comparison between NoSQL and relational databases:
It should be noted that the table shows a comparison on the database level, not on the level of particular database management system implementations.
These systems provide their own proprietary techniques to overcome some of the problems
and shortcomings in both systems, and in some cases, significantly improve performance and
reliability.
The Key-Value store type uses a hash table in which a unique key points to a specific item.
Keys can be organized into logical groups, only requiring keys to be unique on their own group.
The following table contains an example of a key-value store, in which the key is the name of
the city, and the value is the address for Ulster University in that city.
The most famous NoSQL database that uses a key value store is Amazon’s DynamoDB.
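The access pattern of a key-value store can be sketched with a plain Python dict; the city/address values below are made up, and real stores such as DynamoDB or Riak add persistence, replication and partitioning on top of this idea:

```python
# A key-value store behaves like a persistent hash table.
store = {
    "Belfast": "Ulster University, Belfast campus",
    "Coleraine": "Ulster University, Coleraine campus",
}

store["Derry"] = "Ulster University, Derry campus"   # put
print(store.get("Belfast"))                          # get by key
del store["Coleraine"]                               # delete
print(store)
```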
Document Store
Document stores are similar to key value stores in that they are schema-less and based on a
key-value model. Also both share many of the same advantages and disadvantages.
They lack consistency on the database level, which makes way for applications to provide more reliability and consistency features themselves.
In Document Stores, the values (documents) provide the encoding for the data stored; common encodings include XML, JSON, and BSON.
The most popular database application that relies on a Document Store is MongoDB.
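A dict-of-documents sketch of the document-store access pattern (keys and document contents are invented; MongoDB-style stores work with JSON/BSON documents like these):

```python
import json

# Each record is a self-describing document retrieved by its unique key,
# and documents in the same store may have different structures.
documents = {
    "user:101": {"name": "Prashant Rao", "sex": "Male", "age": 35,
                 "interests": ["cricket", "data science"]},
    "user:102": {"name": "Seema R.", "sex": "Female", "age": 41},
}

doc = documents["user:101"]          # fetch a document by its key
print(json.dumps(doc, indent=2))
```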
Column Store
In a Column Store database, data is stored in columns, in contrast to being stored in rows as is the case in most relational database systems.
A Column Store is comprised of one or more Column Families that logically group specific columns together.
A key is used to identify and point to a number of columns, with a keyspace attribute that defines the scope of the key.
Column Stores have fast read/write access to the information. Rows that correspond to a single
column are stored as a single disk entry. This means faster access during read/write operations.
The most popular databases that use the column store include Google’s BigTable, HBase, and
Cassandra.
Graph Base
In a Graph Base model, a directed graph structure is used to represent the data. The graph is comprised of nodes and the edges connecting them.
connected by links.
The interconnected objects are represented by mathematical abstractions, called vertices, and
the links that connect some pairs of vertices are called edges.
A set of vertices and the edges that connect them is what is called graph.
The structure of a graph-based database is therefore built from edges and nodes.
These nodes are organized by some relationships with other nodes, which are represented by
edges between the nodes. Both the nodes and the relationships have some defined properties.
They allow developers to focus more on relations between objects rather than on the objects
themselves.
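A tiny property-graph sketch in plain Python, with nodes and relationship edges that both carry properties (all names and relationships are made up; real graph databases add storage, indexing and query languages on top):

```python
nodes = {
    "alice": {"label": "Person", "age": 30},
    "bob":   {"label": "Person", "age": 35},
    "acme":  {"label": "Company"},
}
edges = [
    ("alice", "KNOWS", "bob", {"since": 2015}),
    ("alice", "WORKS_AT", "acme", {"role": "analyst"}),
]

# Traverse the relationships: who does alice know?
print([dst for src, rel, dst, props in edges if src == "alice" and rel == "KNOWS"])
```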
In this context, they indeed allow for a scalable and easy-to-use environment.
Data ethics is a branch of ethics that evaluates data practices (collecting, generating, analysing and disseminating data, both structured and unstructured) that have the potential to adversely impact people and society.
The way data scientists build models can have real implications for justice, health, and
opportunity in people's lives and we have an obligation to consider the ethics of our
discipline each and every day. When built correctly, algorithms can have massive power to
do good in the world.
The scope of ethics is wide; it is mainly concerned with the principles or causes of action, such as:
- What obligation is common to all?
- What is good in all good acts?
- The sense of duty and responsibility.
- The individual and society.
All of these questions fall under the scope of ethics.
Data ownership refers to both the possession of and responsibility for information.
Ownership implies power as well as control.
The control of information includes not just the ability to access, create, modify, package,
derive benefit from, sell or remove data, but also the right to assign these access privileges
to others
According to Garner (1999), individuals having intellectual property have rights to control
intangible objects that are products of human intellect.
The range of these products encompasses the fields of art, industry, and science.
Research data is recognized as a form of intellectual property and subject to protection by
government.
According to Loshin (2002), data has intrinsic value as well as having added value as a
byproduct of information processing, “at the core, the degree of ownership (and by corollary,
the degree of responsibility) is driven by the value that each interested party derives from
the use of that information”.
Researchers should have a full understanding of various issues related to data ownership to
be able to make better decisions regarding data ownership.
These issues include paradigm of ownership, data hoarding, data ownership policies,
balance of obligations, and technology.
Each of these issues gives rise to a number of considerations that impact decisions
concerning data ownership
To ensure that there is a mechanism to foster a dialog, the following guidelines have been suggested
for building data products:
1. Consent
2. Clarity
3. Consistency and trust
4. Control
5. Consequences
Consent doesn't mean anything unless the user has clarity on the terms and conditions of the contract.
Usually contracts are a series of negotiations, but in all our online transactions it’s always a
binary condition.
The user either accepts the terms or rejects them. Developers of data products should not only ensure that they take consent from the user, but also that the user has clarity about what they are consenting to.
Clarity is closely related to consent. You can’t really consent to anything unless you’re told
clearly what you’re consenting to.
Users must have clarity about what data they are providing, what is going to be done with
the data, and any downstream consequences of how their data is used.
Consistency and trust: Trust requires consistency over time. You can’t trust someone who
is unpredictable.
They may have the best intentions, but they may not honour those intentions when you need
them to. Or they may interpret their intentions in a strange and unpredictable way. And once
broken, rebuilding trust may take a long time.
Restoring trust requires a prolonged period of consistent behaviour.
Consistency, and therefore trust, can be broken either explicitly or implicitly.
An organization that exposes user data can do so intentionally or unintentionally.
In the past years, we've seen many security incidents in which customer data was stolen: Yahoo!, Target, Anthem, local hospitals, government data, and data brokers like Experian; the list grows longer each day. Failing to safeguard customer data breaks trust, and safeguarding data means nothing without consistency over time.
Control: Once you have given your data to a service, you must be able to understand what
is happening to your data.
Can you control how the service uses your data? For example, Facebook asks for (but
doesn’t require) your political views, religious views, and gender preference.
What happens if you change your mind about the data you've provided? If you decide you'd rather keep your political affiliation quiet, do you know whether Facebook actually deletes that information? Do you know whether Facebook continues to use that information in ad placement?
All too often, users have no effective control over how their data is used.
They are given all-or-nothing choices, or a convoluted set of options that make controlling
access overwhelming and confusing.
Data products are designed to add value for a particular user or system.
As these products increase in sophistication, and have broader societal implications, it is
essential to ask whether the data that is being collected could cause harm to an individual or
a group.
Consequences: We continue to hear about unforeseen consequences and the “unknown
unknowns” about using data and combining data sets.
Risks can never be eliminated completely.
However, many unforeseen consequences and unknown unknowns could be foreseen and
known, if only people had tried.
All too often, unknown unknowns are unknown because we don’t want to know.
Implementing the 5 Cs
Ethics is the knowledge of right and wrong, and the ability to adhere to ethical principles while on the job.
Simply put, actions that are technically compliant may not be in the best interest of the
customer or the company, and security professionals need to be able to judge these matters
accordingly.
A data-driven culture is one where the workforce uses analytics and statistics to optimize
their processes and accomplish their tasks.
Team members and company leaders collect information to gain insight into the impact of their decisions before implementing new policies or making significant changes in the workplace.
People within a data-driven workplace value the insights they can learn from different types
of company analytics, such as data about finances or productivity.
Having a data-driven environment also involves making information easily accessible
through systems such as databases or reporting software.
Data maturity
Data maturity is how the information you store and retrieve improves over time.
This means having data with important metadata, limited duplicates and accurate
information.
Achieving data maturity requires a company to have governance over the processes and maintenance of its information.
This can then provide valuable guidance for team members executing their tasks based on
the information they have.
Data leadership
Data leadership means managers and leadership ensure accurate storage and maintenance of
data.
They understand the importance of this information to help teams make the right decision.
They also lead this type of culture by making decisions based on the information they have,
showing how this can be most effective.
Data literacy
Data literacy means the information a company stores is accessible, readable and usable to
all people.
This often means storing data in a structured way.
An important part of this can also be training employees on how to understand and use data
so that they can make decisions and evaluate information effectively.
The main aim is to empower all employees to actively use data to enhance their daily work and reach their potential, by making decisions easier, making customer conversations more useful, and helping them be more strategic.
3. Standardize processes
Your team may be comfortable conducting business one way; another team may prefer
different methods.
While each team’s processes may work fine individually, enough differences exist to cause
hiccups when forced to merge.
Data-driven organizations rely on standardized processes, which let data flow with routine
and predictability.
Rather than measuring the same things each year, it pays to mix it up.
Organizations that are data-driven are in an advantageous position – having data at their
fingertips - because they can identify a mix of measurement criteria.
The variety of that approach offers greater insight into data and a richer set of predictive
tools.
Data-driven organizations appreciate the value of a business intelligence (BI) solution that
delivers analytics functionality.
Cloud BI tools like Phocas provide real-time value while being scalable, secure, and
available on-demand.
Mobile functionality adds an additional layer of opportunity, allowing users in data-driven
cultures to access information from anywhere, allowing remote productivity.
Encouraging teamwork
When you work in an environment that values evidence-based projects, you can improve
collaboration between team members in an organization.
People rely on the IT department to share and manage data for their projects.
They produce reports to share with other teams and develop models and projections that
anyone in the company can use.
Anyone who conducts research with company data can collaborate with others throughout
the process and share the outcomes with their team and other departments.
Remaining Competitive
Using research and quantitative data as a tool allows you to remain competitive with other
companies as your industry develops.
Applying data-driven insights helps you adapt and determine which trends are most relevant
to your particular environment.
By consistently monitoring changes in data, you align with the needs of modern customers
and can incorporate useful innovations into your workflows.
Refining strategies
When a team has a data-driven mindset, it's easier for them to identify when a strategy works
well and when it needs improvement.
By analyzing data and truly valuing the insights you learn, you can constantly adjust the
methods you use to complete your tasks.
You can apply this mindset to everything you do in the workplace, allowing you to optimize
efficiency in all aspects of your individual work and organizational operations.
Having an extensive record of company data gives you the ability to identify trends and
patterns to determine potential causes of successes and failures.
Using data to recognize cause and effect can inform your long-term choices.
For example, you can review financial data for the year and compare it to your product
releases to determine if offering a new product increased the popularity of your brand.
Improving outcomes
When you use facts to make decisions, you can improve the overall outcomes of your
choices.
From sales numbers to productivity to customer satisfaction, you can increase your metrics
in the workplace.
Data Analyst - A data analyst collects data from a database. They also summarize results after data processing.
Data Scientists - They manage, mine, and clean the data. They are also responsible for building
models to interpret big data and analyze the results.
Data Engineer - This person mines data to get insights from it. They are also responsible for maintaining data design and architecture, and they develop large data warehouses with the help of extract, transform, load (ETL) processes.
Other Options: