Royal Education Society’s
College of Computer Science and Information Technology, Latur
Department of Computer Science
Academic Year (2021-22)
Choice Based Credit System (CBCS Pattern)
Class / Semester: BSC CS TY/ V Name of Paper: Data Science (BCS-503)
UNIT I
TITLE OF UNIT: Introduction to Data Science
1.1 Data Mining, classification, regression
1.2 Essentials of algorithms and data structures
1.3 Data Visualization
1.4 Software Engineering trends and techniques
1.1 Data Mining, classification, regression
Data science is a multidisciplinary field that uses statistics, scientific methods, artificial
intelligence (AI), data analysis and, of course, data mining to extract useful information from
massive volumes of data. Data scientists also work with programming languages such as R
and Python.
Data Mining:
The process of extracting information from huge sets of data to identify patterns, trends, and
useful data that allow a business to take data-driven decisions is called Data Mining.
In other words, Data Mining is the process of investigating hidden patterns in data from various
perspectives and categorizing them into useful information. This information is collected and
assembled in particular areas such as data warehouses, analyzed efficiently with data mining
algorithms, and used to support decision making and other data requirements, eventually cutting
costs and generating revenue.
Data mining is the act of automatically searching large stores of information to find trends
and patterns that go beyond simple analysis procedures. Data mining utilizes complex
mathematical algorithms to segment data and evaluate the probability of future events. Data
Mining is also called Knowledge Discovery in Databases (KDD).
Data Mining is a process used by organizations to extract specific data from huge databases to
solve business problems. It primarily turns raw data into useful information.
Classification: It is a data analysis task, i.e. the process of finding a model that describes and
distinguishes data classes and concepts. Classification is the problem of identifying which of a
set of categories (subpopulations) a new observation belongs to, on the basis of a training set of
data containing observations whose category membership is known.
Example: Before starting any project, we need to check its feasibility. In this case, a classifier is
required to predict class labels such as ‘Safe’ and ‘Risky’ for adopting the Project and to further
approve it. It is a two-step process:
1. Learning Step (Training Phase): Construction of Classification Model
Different Algorithms are used to build a classifier by making the model learn using the training
set available. The model has to be trained for the prediction of accurate results.
2. Classification Step: The model is used to predict class labels for the test data, and the constructed
model is tested on that data to estimate the accuracy of the classification rules. A minimal sketch of
both steps is given below.
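As a minimal sketch of these two steps, assuming scikit-learn and a small made-up "Safe"/"Risky" project dataset (the feature values and the choice of a decision tree are illustrative assumptions, not part of the syllabus example):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical project data: [budget_in_lakhs, duration_in_months]
X = [[10, 3], [50, 12], [8, 2], [60, 18], [12, 4], [45, 10]]   # features (assumed)
y = ["Safe", "Risky", "Safe", "Risky", "Safe", "Risky"]        # class labels (assumed)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Step 1: Learning step -- build the classifier from the training set
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: Classification step -- predict labels for test data and estimate accuracy
y_pred = model.predict(X_test)
print("Predicted labels:", y_pred)
print("Accuracy:", accuracy_score(y_test, y_pred))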
Training and Testing:
Suppose a person is sitting under a fan and the fan starts falling on him; he should
move aside in order not to get hurt. This is his training part: learning to move away. During testing,
if the person sees any heavy object coming towards him or falling on him and moves aside, the
system is tested positively; if the person does not move aside, the system is tested
negatively.
The same is the case with data: it should be trained in order to get accurate and reliable results.
There are certain data types associated with data mining that tell us the format of the
data (whether it is in text format or in numerical format).
Attributes – Represents different features of an object. Different types of attributes are:
1. Binary: Possesses only two values i.e. True or False
Example: Suppose there is a survey evaluating some products. We need to check whether
it’s useful or not. So, the Customer has to answer it in Yes or No.
Product usefulness: Yes / No
• Symmetric: Both values are equally important in all aspects
• Asymmetric: When the two values are not equally important.
2. Nominal: When more than two outcomes are possible. The values are names or labels (in
alphabetic form) rather than integers.
Example: One needs to choose some material but of different colors. So, the color might be
Yellow, Green, Black, Red.
Different Colors: Red, Green, Black, Yellow
• Ordinal: Values that must have some meaningful order.
Example: Suppose there are grade sheets of a few students which might contain different
grades as per their performance, such as A, B, C, D
Grades: A, B, C, D
• Continuous: May have an infinite number of values; typically stored as a float type.
Example: Measuring the weights of a few students in a sequence or orderly manner, i.e. 50,
51, 52, 53
Weight: 50, 51, 52, 53
• Discrete: Finite number of values.
Example: Marks of a Student in a few subjects: 65, 70, 75, 80, 90
Marks: 65, 70, 75, 80, 90
Syntax:
• Mathematical Notation: Classification is based on building a function that takes an input feature
vector X and predicts its outcome Y (a qualitative response taking values in a set C).
• Here a classifier (or model) is used, which is a supervised function and can be designed
manually based on an expert's knowledge. It is constructed to predict class labels
(Example: label "Yes" or "No" for the approval of some event).
Regression:
The process of identifying the relationship and the effects of this relationship on the outcome of
future values of objects is defined as regression. Regression helps in identifying the behavior of a
variable when other variable(s) are changed in the process. Regression analysis is used for
prediction and forecasting applications.
In short, when the intention is to assign objects to different categories then we use classification
algorithms and when we want to predict future values then we use regression algorithms.
For example, regression would be used to predict a home's value based on its location, square feet,
price when last sold, the price of similar homes, and other factors.
Linear Regression
Linear regression is such a useful and established algorithm that it is both a statistical model and
a machine learning model. Linear regression tries to draw a best-fit line that is close to the data by
finding the slope and intercept.
The linear regression equation is:
y = a + bx
In this equation:
• y is the output variable. It is also called the target variable in machine learning or the
dependent variable.
• x is the input variable. It is also referred to as the feature in machine learning or it is called
the independent variable.
• a is the constant
• b is the coefficient of independent variable
As an example, suppose the temperature is given in degrees Fahrenheit and the electricity bill is in
dollars. We can plot the points on a graph and find the best-fit line, as in the sketch below.
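A minimal sketch, assuming scikit-learn and hypothetical temperature/bill values:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: temperature (°F) vs. monthly electricity bill ($)
temperature = np.array([60, 65, 70, 75, 80, 85, 90]).reshape(-1, 1)  # x (independent variable)
bill = np.array([95, 100, 108, 115, 125, 134, 140])                  # y (dependent variable)

reg = LinearRegression().fit(temperature, bill)
print("intercept a =", reg.intercept_)   # the constant a
print("coefficient b =", reg.coef_[0])   # slope of the best-fit line
print("predicted bill at 95°F:", reg.predict([[95]])[0])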
Multiple linear regression
Multiple linear regression refers to a statistical technique that is used to predict the outcome of a
variable based on the value of two or more variables. It is sometimes known simply as multiple
regression, and it is an extension of linear regression. The variable that we want to predict is
known as the dependent variable, while the variables we use to predict the value of the dependent
variable are known as independent or explanatory variables.
[Figure 1: Multiple linear regression model predictions for individual observations]
• Multiple linear regression refers to a statistical technique that uses two or more independent
variables to predict the outcome of a dependent variable.
• The technique enables analysts to determine the variation of the model and the relative
contribution of each independent variable in the total variance.
• Multiple regression can take two forms, i.e., linear regression and non-linear regression.
Multiple Linear Regression Formula
yi = β0 + β1xi1 + β2xi2 + … + βpxip + ϵ
Where:
• yi is the dependent or predicted variable
• β0 is the y-intercept, i.e., the value of y when both xi1 and xi2 are 0.
• β1 and β2 are the regression coefficients representing the change in y relative to a one-unit
change in xi1 and xi2, respectively.
• βp is the slope coefficient for each independent variable
• ϵ is the model’s random error (residual) term.
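A brief sketch of fitting this equation with statsmodels on synthetic data (the two predictors x1 and x2 and their coefficients are invented for illustration):

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
# Hypothetical data generated from y = 2 + 1.5*x1 - 0.8*x2 + noise
y = 2 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.5, size=100)

X = sm.add_constant(np.column_stack([x1, x2]))  # adds the column for beta_0
model = sm.OLS(y, X).fit()
print(model.params)    # estimated beta_0, beta_1, beta_2
print(model.rsquared)  # proportion of variance explained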
Understanding Multiple Linear Regression
Simple linear regression enables statisticians to predict the value of one variable using the
available information about another variable. Linear regression attempts to establish the
relationship between the two variables along a straight line.
Multiple regression is a type of regression where the dependent variable shows
a linear relationship with two or more independent variables. It can also be non-linear, where the
dependent and independent variables do not follow a straight line.
Both linear and non-linear regression track a particular response using two or more variables
graphically. However, non-linear regression is usually difficult to execute since it is created from
assumptions derived from trial and error.
Assumptions of Multiple Linear Regression
Multiple linear regression is based on the following assumptions:
1. A linear relationship between the dependent and independent variables
The first assumption of multiple linear regression is that there is a linear relationship between the
dependent variable and each of the independent variables. The best way to check the linear
relationships is to create scatterplots and then visually inspect the scatterplots for linearity. If the
relationship displayed in the scatterplot is not linear, then the analyst will need to run a non-linear
regression or transform the data using statistical software, such as SPSS.
2. The independent variables are not highly correlated with each other
The data should not show multicollinearity, which occurs when the independent variables
(explanatory variables) are highly correlated. When independent variables show multicollinearity,
there will be problems figuring out the specific variable that contributes to the variance in the
dependent variable. The best method to test for the assumption is the Variance Inflation Factor
method.
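A short sketch of the VIF check, assuming statsmodels and two deliberately correlated hypothetical predictors (VIF values well above about 5-10 are usually treated as a warning sign):

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)   # deliberately correlated with x1 (hypothetical)

X = sm.add_constant(np.column_stack([x1, x2]))
# VIF for each predictor column (index 0 is the constant, so start at 1)
for i, name in enumerate(["x1", "x2"], start=1):
    print(name, variance_inflation_factor(X, i))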
3. The variance of the residuals is constant
Multiple linear regression assumes that the amount of error in the residuals is similar at each point
of the linear model. This scenario is known as homoscedasticity. When analyzing the data, the
analyst should plot the standardized residuals against the predicted values to determine if the
points are distributed fairly across all the values of independent variables. To test the assumption,
the data can be plotted on a scatterplot or by using statistical software to produce a scatterplot that
includes the entire model.
4. Independence of observation
The model assumes that the observations should be independent of one another. Simply put, the
model assumes that the values of residuals are independent. To test for this assumption, we use
the Durbin Watson statistic.
The statistic takes values from 0 to 4: values below 2 indicate positive autocorrelation, values
above 2 indicate negative autocorrelation, and the mid-point, i.e., a value of 2, indicates that
there is no autocorrelation.
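A small sketch of computing the Durbin-Watson statistic from a fitted model's residuals, assuming statsmodels and synthetic data:

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 1 + 2 * x + rng.normal(size=100)          # hypothetical data with independent errors

model = sm.OLS(y, sm.add_constant(x)).fit()
print("Durbin-Watson:", durbin_watson(model.resid))  # a value near 2 suggests no autocorrelation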
5. Multivariate normality
Multivariate normality occurs when residuals are normally distributed. To test this assumption,
look at how the values of residuals are distributed. It can also be tested using two main methods,
i.e., a histogram with a superimposed normal curve or the Normal Probability Plot method.
Logistic regression:
Logistic regression is another powerful supervised ML algorithm used for
binary classification problems (when the target is categorical). The best way to think about logistic
regression is that it is a linear regression but for classification problems. Logistic regression
essentially uses a logistic function defined below to model a binary output variable (Tolles &
Meurer, 2016). The primary difference between linear regression and logistic regression is that
logistic regression's range is bounded between 0 and 1. In addition, as opposed to linear
regression, logistic regression does not require a linear relationship between inputs and output
variables. This is due to applying a nonlinear log transformation to the odds ratio (will be defined
shortly).
Logistic function = 1 / (1 + e^(−x))
In the logistic function equation, x is the input variable. If we feed values from −20 to 20 into the
logistic function, every input is mapped to an output between 0 and 1, as in the sketch below.
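A small sketch of this mapping, assuming NumPy:

import numpy as np

def logistic(x):
    # The logistic function 1 / (1 + e^(-x))
    return 1 / (1 + np.exp(-x))

x = np.arange(-20, 21)                 # inputs from -20 to 20
outputs = logistic(x)
print(outputs)                         # every output lies between 0 and 1
print(outputs.min(), outputs.max())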
Logistic regression is a statistical model that in its basic form uses a logistic
function to model a binary dependent variable, although many more complex extensions exist.
In regression analysis, logistic regression (or logit regression) estimates the parameters of a
logistic model (a form of binary regression). Mathematically, a binary logistic model has a
dependent variable with two possible values, such as pass/fail, represented by an indicator
variable whose two values are labeled "0" and "1".
Consider a scenario where we need to classify whether an email is spam or not. If we use linear
regression for this problem, we need to set up a threshold on which the classification can be
based. Say the actual class is spam, the predicted continuous value is 0.4 and the threshold
value is 0.5; the data point would then be classified as not spam, which can lead to serious
consequences in real time.
From this example, it can be inferred that linear regression is not suitable for classification
problems. Linear regression is unbounded, and this brings logistic regression into the picture: its
output strictly ranges from 0 to 1.
Sigmoid Function: the sigmoid (logistic) function, σ(x) = 1 / (1 + e^(−x)), is the S-shaped curve
that maps any real-valued input to the interval (0, 1).
1.2 Essentials of algorithms and data structures
Data structures and algorithms determine how a problem is represented internally, how the actual
storage pattern works, and what is happening under the hood when a program runs.
Essentials of algorithms:
In data science, computer science and statistics converge. As data scientists, we use statistical
principles to write code such that we can effectively explore the problem at hand.
This necessitates at least a basic understanding of data structures, algorithms, and time-space
complexity so that we can program more efficiently and understand the tools that we use. With
larger datasets, this becomes particularly important. The way that we write our code influences the
speed at which our data is analyzed and conclusions can be reached accordingly.
Algorithms are everywhere in programming, and with good reason. They provide a set of
instructions for solving all kinds of software problems, making life easier for developers. There are
thousands of programming algorithms out there today, so good software developers and engineers
need to know the different ones that are available and when it is most appropriate to use them. A
good algorithm will find the most efficient way to perform a function or solve a problem, both in
terms of speed and by minimizing the use of computer memory.
Many problems can be resolved by starting with some of the most popular algorithms according to
the function required. For example, sorting algorithms are instructions for arranging the items of an
array or list into a particular order, while searching algorithms are used to find and retrieve an
element from wherever it is stored in a data structure.
Some of the most common families of algorithms, along with a few principles that are important to
understand before discussing them, are described below.
Sorting Algorithms:
Sorting raw data sets is a simple but crucial step in computer/data science, and it is increasingly
important in the age of big data. Sorting typically involves arranging items in numerical or
alphabetical order (ascending or descending), as in the sketch below.
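As an illustration, a simple insertion sort (one of many possible sorting algorithms) arranging numbers in ascending order:

def insertion_sort(values):
    # Sort a list in ascending order, in place
    for i in range(1, len(values)):
        current = values[i]
        j = i - 1
        while j >= 0 and values[j] > current:
            values[j + 1] = values[j]   # shift larger items one position to the right
            j -= 1
        values[j + 1] = current
    return values

print(insertion_sort([42, 7, 19, 3, 25]))   # [3, 7, 19, 25, 42]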
Searching Algorithms
Searching is such a basic IT function, but it is important to get right when programming. It could
involve searching for something within an internal database or trawling virtual spaces for a specific
piece of information, and there are two standard approaches used today, sketched below.
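A sketch of the two usual approaches, commonly linear search and binary search (binary search assumes the list is already sorted):

def linear_search(items, target):
    # Check every element in turn
    for index, value in enumerate(items):
        if value == target:
            return index
    return -1

def binary_search(sorted_items, target):
    # Repeatedly halve the search interval of a sorted list
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        elif sorted_items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1

data = [3, 7, 19, 25, 42]
print(linear_search(data, 25), binary_search(data, 25))   # 3 3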
Recursion: Recursion is when a function calls itself. Perhaps the quintessential example of
recursion is the implementation of a factorial function:
def factorial(n):
    if n <= 1:  # base case: 0! and 1! are both 1
        return 1
    else:  # recursive case
        return n * factorial(n - 1)
The function calls itself and continues doing so until the base case
(in this case, when n is 1 or less) is reached.
Divide and Conquer (D&C): A recursive approach to problem-solving, D&C (1) determines
the simplest case for the problem (the base case) and (2) reduces the problem until it
becomes the base case. That is, a complex problem is broken down into simpler sub-problems;
these sub-problems are solved and their solutions are then combined to solve the original, larger
problem, as in the merge sort sketch below.
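A sketch of D&C using merge sort: a list of one element is the base case, the list is split into halves (the problem is reduced), and the sorted halves are merged (the solutions are combined):

def merge_sort(values):
    if len(values) <= 1:                 # base case: a single element is already sorted
        return values
    mid = len(values) // 2
    left = merge_sort(values[:mid])      # divide and solve the sub-problems
    right = merge_sort(values[mid:])
    # combine: merge the two sorted halves
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([42, 7, 19, 3, 25]))    # [3, 7, 19, 25, 42]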
Essentials of data structures:
In computer science, a data structure is a data organization, management, and storage format
that enables efficient access and modification. More precisely, a data structure is a collection of
data values, the relationships among them, and the functions or operations that can be applied to
the data, i.e., it is an algebraic structure about data.
Data structures are generally based on the ability of a computer to fetch and store data at any place
in its memory, specified by a pointer—a bit string, representing a memory address, that can be
itself stored in memory and manipulated by the program. Thus, the array and record data
structures are based on computing the addresses of data items with arithmetic operations, while
the linked data structures are based on storing addresses of data items within the structure itself.
The implementation of a data structure usually requires writing a set of procedures that create and
manipulate instances of that structure. The efficiency of a data structure cannot be analyzed
separately from those operations. This observation motivates the theoretical concept of an abstract
data type, a data structure that is defined indirectly by the operations that may be performed on it,
and the mathematical properties of those operations (including their space and time cost).
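As a small illustration of an abstract data type, here is a stack defined only by its operations (push, pop, peek, is_empty), sketched on top of a Python list; the storage choice is an implementation detail hidden behind those operations:

class Stack:
    """A stack ADT: defined by its operations, not by how the data is stored."""

    def __init__(self):
        self._items = []          # storage detail hidden from the user

    def push(self, item):
        self._items.append(item)  # add on top

    def pop(self):
        return self._items.pop()  # remove and return the most recently pushed item

    def peek(self):
        return self._items[-1]    # look at the top item without removing it

    def is_empty(self):
        return not self._items

s = Stack()
s.push(1); s.push(2); s.push(3)
print(s.pop(), s.peek())          # 3 2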
1.3 Data Visualization
Data visualization is the practice of translating information into a visual context, such as a map or
graph, to make data easier for the human brain to understand and pull insights from. The main
goal of data visualization is to make it easier to identify patterns, trends and outliers in large data
sets. The term is often used interchangeably with others, including information graphics,
information visualization and statistical graphics.
Data visualization is one of the steps of the data science process, which states that after data has
been collected, processed and modeled, it must be visualized for conclusions to be made. Data
visualization is also an element of the broader data presentation architecture (DPA) discipline,
which aims to identify, locate, manipulate, format and deliver data in the most efficient way
possible.
Data visualization is important for almost every career. It can be used by teachers to display
student test results, by computer scientists exploring advancements in artificial intelligence (AI) or
by executives looking to share information with stakeholders. It also plays an important role in big
data projects. As businesses accumulated massive collections of data during the early years of the
big data trend, they needed a way to quickly and easily get an overview of their data.
Visualization tools were a natural fit.
Visualization is central to advanced analytics for similar reasons. When a data scientist is writing
advanced predictive analytics or machine learning (ML) algorithms, it becomes important to
visualize the outputs to monitor results and ensure that models are performing as intended. This is
because visualizations of complex algorithms are generally easier to interpret than numerical
outputs.
[Figure: A timeline depicting the history of data visualization]
Why is data visualization important?
Data visualization provides a quick and effective way to communicate information in a universal
manner using visual information. The practice can also help businesses identify which factors
affect customer behavior; pinpoint areas that need to be improved or need more attention; make
data more memorable for stakeholders; understand when and where to place specific products;
and predict sales volumes.
Other benefits of data visualization include the following:
• the ability to absorb information quickly, improve insights and make faster decisions;
• an increased understanding of the next steps that must be taken to improve the organization;
• an improved ability to maintain the audience's interest with information they can understand;
• an easy distribution of information that increases the opportunity to share insights with
everyone involved;
• a reduced reliance on data scientists, since data becomes more accessible and understandable; and
• an increased ability to act on findings quickly and, therefore, achieve success with greater
speed and fewer mistakes.
Data visualization and big data
The increased popularity of big data and data analysis projects has made visualization more
important than ever. Companies are increasingly using machine learning to gather massive
amounts of data that can be difficult and slow to sort through, comprehend and explain.
Visualization offers a means to speed this up and present information to business owners and
stakeholders in ways they can understand.
Big data visualization often goes beyond the typical techniques used in normal visualization, such
as pie charts, histograms and corporate graphs. It instead uses more complex representations, such
as heat maps and fever charts. Big data visualization requires powerful computer systems to
collect raw data, process it and turn it into graphical representations that humans can use to
quickly draw insights.
While big data visualization can be beneficial, it can pose several disadvantages to organizations.
They are as follows:
• To get the most out of big data visualization tools, a visualization specialist must be hired.
This specialist must be able to identify the best data sets and visualization styles to guarantee
organizations are optimizing the use of their data.
• Big data visualization projects often require involvement from IT, as well as management,
since the visualization of big data requires powerful computer hardware, efficient storage
systems and even a move to the cloud.
• The insights provided by big data visualization will only be as accurate as the information
being visualized. Therefore, it is essential to have people and processes in place to govern and
control the quality of corporate data, metadata and data sources.
Examples of data visualization
In the early days of visualization, the most common visualization technique was using a Microsoft
Excel spreadsheet to transform the information into a table, bar graph or pie chart. While these
visualization methods are still commonly used, more intricate techniques are now available,
including the following:
• infographics
• bubble clouds
• bullet graphs
• heat maps
• fever charts
• time series charts
Some other popular techniques are as follows.
Line charts. This is one of the most basic and common techniques used. Line charts display how
variables can change over time.
Area charts. This visualization method is a variation of a line chart; it displays multiple values in
a time series -- or a sequence of data collected at consecutive, equally spaced points in time.
Scatter plots. This technique displays the relationship between two variables. A scatter plot takes
the form of an x- and y-axis with dots to represent data points.
Treemaps. This method shows hierarchical data in a nested format. The size of the rectangles
used for each category is proportional to its percentage of the whole. Treemaps are best used when
multiple categories are present, and the goal is to compare different parts of a whole.
Population pyramids. This technique uses a stacked bar graph to display the complex social
narrative of a population. It is best used when trying to display the distribution of a population.
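A minimal matplotlib sketch of two of the techniques above, a line chart and a scatter plot, on made-up monthly figures:

import matplotlib.pyplot as plt

months = [1, 2, 3, 4, 5, 6]
sales = [10, 12, 9, 15, 18, 17]            # hypothetical values
ad_spend = [2, 3, 2.5, 4, 5, 4.5]          # hypothetical values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(months, sales, marker="o")        # line chart: how a variable changes over time
ax1.set(title="Line chart", xlabel="Month", ylabel="Sales")
ax2.scatter(ad_spend, sales)               # scatter plot: relationship between two variables
ax2.set(title="Scatter plot", xlabel="Ad spend", ylabel="Sales")
plt.tight_layout()
plt.show()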
Common data visualization use cases
Common use cases for data visualization include the following:
Sales and marketing. Research from the media agency Magna predicted that half of all global
advertising dollars would be spent online by 2020. As a result, marketing teams must pay close
attention to their sources of web traffic and how their web properties generate revenue. Data
visualization makes it easy to see traffic trends over time as a result of marketing efforts.
Politics. A common use of data visualization in politics is a geographic map that displays the
party each state or district voted for.
Healthcare. Healthcare professionals frequently use choropleth maps to visualize important
health data. A choropleth map displays divided geographical areas or regions that are assigned a
certain color in relation to a numeric variable. Choropleth maps allow professionals to see how a
variable, such as the mortality rate of heart disease, changes across specific territories.
Scientists. Scientific visualization, sometimes referred to in shorthand as SciVis, allows scientists
and researchers to gain greater insight from their experimental data than ever before.
Finance. Finance professionals must track the performance of their investment decisions when
choosing to buy or sell an asset. Candlestick charts are used as trading tools and help finance
professionals analyze price movements over time, displaying important information, such as
securities, derivatives, currencies, stocks, bonds and commodities. By analyzing how the price has
changed over time, data analysts and finance professionals can detect trends.
Logistics. Shipping companies can use visualization tools to determine the best global shipping
routes.
Data scientists and researchers. Visualizations built by data scientists are typically for the
scientist's own use, or for presenting the information to a select audience. The visual
representations are built using visualization libraries of the chosen programming languages and
tools. Data scientists and researchers frequently use open source programming languages -- such
as Python -- or proprietary tools designed for complex data analysis. The data visualization
performed by these data scientists and researchers helps them understand data sets and identify
patterns and trends that would have otherwise gone unnoticed.
1.4 Software Engineering trends and techniques
IEEE, in its standard 610.12-1990, defines software engineering as:
"Software Engineering is the application of a systematic, disciplined, quantifiable approach to the
development, operation, and maintenance of software."
In simple words, Software Engineering is the process of analyzing user requirements and then
designing, building, and testing software application which will satisfy those requirements.
In software engineering, there is not a single rigid process but rather a general approach to
developing software. This process is divided mainly into five tasks: Communication, Planning,
Modelling, Construction, and Deployment.
Communication: It is mainly about communicating with your customers to understand their
requirements.
Task 1. Communication:
As stated above, this task is mainly about getting customer requirements. The requirements here
may come from customers, a supervisor, etc. But suppose you already have a dataset on which you
want to do a data science project; where, then, are the requirements?
Requirements are the questions you start the project with, such as: What do you gain from doing
this project? How applicable is this project in the real world? and so on.
Task 2. Planning:
This step is mostly about planning your data science project: how much time you are going to
need, what dataset you require (if you do not have it yet), and whether you are going to use
supervised, unsupervised, or reinforcement learning.
Decide how much time you are going to spend on data pre-processing, model evaluation, etc.
Making a rough timeline chart helps because it gives you a rough deadline for the
completion of your project.
Task 3. Modelling:
In this step, you would be doing data preparation and gaining insights from the data which you
are using. In short, Data Pre-processing and EDA or Exploratory Data Analysis come in this
step.
The two main challenges here are proper data pre-processing and EDA. For data pre-
processing, be thorough, because the cleaner your data is, the better your
model is going to be.
There is not much challenge in EDA, but in my experience a more in-depth EDA
leads to a better data science project. As a suggestion, you could do a basic, quick EDA
in Excel, Tableau, or Power BI to understand some trends in the data, and a more in-depth one in
Python or R, as in the sketch below.
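A small pandas sketch of the kind of quick pre-processing and EDA this step describes (the data values and cleaning choices are hypothetical):

import numpy as np
import pandas as pd

# Hypothetical raw data standing in for a real project dataset
df = pd.DataFrame({
    "age":    [25, 32, 47, np.nan, 32],
    "income": [40000, 52000, np.nan, 61000, 52000],
    "city":   ["Latur", "Pune", "Latur", "Mumbai", "Pune"],
})

# Quick EDA
print(df.shape)
print(df.describe())        # summary statistics of numeric columns
print(df.isna().sum())      # missing values per column

# Basic pre-processing
df = df.drop_duplicates()
df = df.fillna(df.median(numeric_only=True))   # fill numeric gaps with the column median
print(df)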
Task 4. Construction:
Your actual data science project happens in this step.
For example: What model are you using? How accurate should the model be? and so on.
What could be challenging in this step?
There are many challenges and errors in this step, but the most important one is
choosing the right algorithm.
Most of the small challenges and errors can be resolved by searching Google or Stack
Overflow (which we all do when in doubt). The right algorithm mostly depends on what type of
relationship your data has between the feature and the target variables, and it helps to try
various models to find out which works best, as in the sketch below.
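Trying several models can be sketched with scikit-learn's cross_val_score, as below; the candidate models and the iris dataset are stand-ins for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)          # stand-in dataset for illustration

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(),
    "k-nearest neighbours": KNeighborsClassifier(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validation accuracy
    print(f"{name}: mean accuracy = {scores.mean():.3f}")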
Task 5. Deployment:
You have now finished your project and wish to deliver it to the client. This step mainly
involves showing the client what you have done to improve their business.
Based on their feedback, you then decide whether or not to revise your model.