Unit - One QB
QUESTION BANK
FOR IV SEMESTER
[REGULATION - 2021]
AKSHAYA COLLEGE OF ENGINEERING AND TECHNOLOGY
QUESTION BANK
REGULATION : 2021
YEAR/SEMESTER : II / IV
Approved by:
Name:
Date:
MISSION
DM 1 : To develop and implement AI solutions that prioritize human values, ethics and social
responsibilities.
DM 2 : To provide cutting-edge AI research, driving innovation and advancing the state-of-the-art
in AI technology.
DM 3 : To provide industrial standards by means of collaborations for artificial intelligence and
data science.
DM 4 :To provide an excellent infrastructure that keeps up with modern trends and technologies
for professional entrepreneurship.
PROGRAM OUTCOMES (POs)
1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
2 Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
3 Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
4 Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of
the information to provide valid conclusions.
5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities
with an understanding of the limitations.
6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to
the professional engineering practice.
7 Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development.
8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9 Individual and team work: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings.
10 Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.
SYLLABUS
AD3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS    L T P C
3 0 0 3
COURSE OBJECTIVES:
To understand the techniques and processes of data science
To apply descriptive data analytics
To visualize data for various applications
To understand inferential data analytics
To analyze and build predictive models from data
UNIT I INTRODUCTION TO DATA SCIENCE 08
Need for data science – benefits and uses – facets of data – data science process – setting the research
goal – retrieving data – cleansing, integrating, and transforming data – exploratory data analysis – build
the models – presenting and building applications.
UNIT II DESCRIPTIVE ANALYTICS 10
Frequency distributions – Outliers –interpreting distributions – graphs – averages – describing variability
– interquartile range – variability for qualitative and ranked data - Normal distributions – z scores
–correlation – scatter plots – regression – regression line – least squares regression line – standard error of
estimate – interpretation of r2 – multiple regression equations – regression toward the mean.
UNIT III INFERENTIAL STATISTICS 09
Populations – samples – random sampling – Sampling distribution- standard error of the mean -
Hypothesis testing – z-test – z-test procedure –decision rule – calculations – decisions – interpretations -
one-tailed and two-tailed tests – Estimation – point estimate – confidence interval – level of confidence –
effect of sample size.
UNIT IV ANALYSIS OF VARIANCE 09
t-test for one sample – sampling distribution of t – t-test procedure – t-test for two independent samples –
p-value – statistical significance – t-test for two related samples. F-test – ANOVA – Two-factor
experiments – three f-tests – two-factor ANOVA –Introduction to chi-square tests.
UNIT V PREDICTIVE ANALYTICS 09
Linear least squares – implementation – goodness of fit – testing a linear model – weighted resampling.
Regression using StatsModels – multiple regression – nonlinear relationships – logistic regression –
estimating parameters – Time series analysis – moving averages – missing values – serial correlation –
autocorrelation. Introduction to survival analysis.
TOTAL: 45 PERIODS
COURSE OUTCOMES:
Upon successful completion of this course, the students will be able to:
CO1: Explain the data analytics pipeline
CO2: Describe and visualize data
CO3: Perform statistical inferences from data
CO4: Analyze the variance in the data
CO5: Build models for predictive analytics
TEXT BOOKS
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning
Publications, 2016. (first two chapters for Unit I).
2. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.
3. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016.
REFERENCES
1. Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press, 2014.
2. Sanjeev J. Wagh, Manisha S. Bhende, Anuradha D. Thakare, “Fundamentals of Data Science”,
CRC Press, 2022.
3. Chirag Shah, “A Hands-On Introduction to Data Science”, Cambridge University Press, 2020.
4. Vineet Raina, Srinath Krishnamurthy, “Building an Effective Data Science Practice: A Framework
to Bootstrap and Manage a Successful Data Science Practice”, Apress, 2021.
In this case, the numbers are steadily decreasing decade by decade, so this is a downward trend.
US life expectancy from 1920-2000:
In this case, the numbers are steadily increasing decade by decade, so this is an upward trend.
classifications etc.
4. Difference between population and sample with an example CO1
Population: Advertisements for IT jobs in the Netherlands
Sample: The top 50 search results for advertisements for IT jobs in the Netherlands on May 1, 2020
Population: Songs from the Eurovision Song Contest
Sample: Winning songs from the Eurovision Song Contest that were performed in English
Population: Undergraduate students in the Netherlands
Sample: 300 undergraduate students from three Dutch universities who volunteer for our psychology research study
Population: All countries of the world
Sample: Countries with published data available on birth rates and GDP since 2000
5. What are the main phases of data science life cycle? CO1
Discovery
Data preparation
Model planning
Model building
Operationalize
Communicate results
6. What are the tools required for data science? CO1
Data Analysis tools: R, Python, Statistics, SAS, Jupyter, RStudio, MATLAB, Excel
Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift
Data Visualization tools: R, Jupyter, Tableau, Cognos
Machine learning tools: Spark, Mahout, Azure ML Studio
7. What is Data modeling? CO1
Data modeling is the step in which machine learning and statistical techniques are used to achieve the project goal and predict future trends. By working with clustering algorithms, for example, we can build models that uncover trends in the data that were not distinguishable in graphs and summary statistics. These algorithms create groups of similar events (clusters) and more or less explicitly express which features are decisive in these results.
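A minimal clustering sketch of this kind of modeling, assuming scikit-learn and NumPy are available; the two "customer" groups and their (age, spend) values are synthetic and invented only for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two synthetic groups of customers described by (age, annual spend)
data = np.vstack([
    rng.normal([25, 300], [3, 50], size=(50, 2)),
    rng.normal([55, 900], [4, 80], size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
print(model.cluster_centers_)   # cluster centres reveal what separates the groups
print(model.labels_[:10])       # cluster assigned to each observation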
8. What are the facets of data science? CO1
Identifying the structure of data
Cleaning, filtering, reorganizing, augmenting, and aggregating data
Visualizing data
Data analysis, statistics, and modeling
Machine Learning
9. List the process steps in a data science project. CO1
Ideate
Explore
Model
Validate
Display
Operate
23. How does a confusion matrix define the performance of a classification algorithm? CO1 Apr/May 2024
A confusion matrix is a matrix that summarizes the performance of a machine learning model on a set of test data. It is a means of displaying the number of accurate and inaccurate instances based on the model's predictions. It is often used to measure the performance of classification models, which aim to predict a categorical label for each input instance.
A 2x2 confusion matrix can be formed, for example, for an image-recognition task that classifies images as Dog or Not Dog.
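A minimal sketch of computing such a 2x2 matrix, assuming scikit-learn is available; the Dog / Not Dog labels below are invented for illustration.
from sklearn.metrics import confusion_matrix

y_true = ["Dog", "Dog", "Not Dog", "Dog", "Not Dog", "Not Dog", "Dog", "Not Dog"]
y_pred = ["Dog", "Not Dog", "Not Dog", "Dog", "Dog", "Not Dog", "Dog", "Not Dog"]

cm = confusion_matrix(y_true, y_pred, labels=["Dog", "Not Dog"])
print(cm)
# rows are actual classes, columns are predicted classes:
# cm[0, 0] = true positives (Dog predicted as Dog), cm[0, 1] = false negatives
# cm[1, 0] = false positives, cm[1, 1] = true negatives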
24. What is brushing and linking in Exploratory Data Analysis? CO1 Apr/May 2023
Brushing and linking are interactive tools used in exploratory data analysis (EDA) to visualize data and explore relationships between different visualizations:
Brushing
Interactively select a subset of data in one visualization by dragging a mouse or
using a bounding shape. This can highlight the selected data in other
visualizations.
Linking
Apply user interactions in one visualization to other visualizations. For example,
selecting observations in one visualization will highlight the same observations in
other visualizations.
PART B
1. Explain the data science process life cycle. CO1 Nov/Dec 2022
The life cycle of data science is explained in the diagram below.
The main phases of data science life cycle are given below:
1. Discovery: The first phase is discovery, which involves asking the right questions. When we start any data science project, we need to determine the basic requirements, priorities, and project budget. In this phase, we determine all the requirements of the project, such as the number of people, technology, time, data, and the end goal, and then we can frame the business problem at a first hypothesis level.
2. Data preparation: In this phase, the required data is collected, cleaned, and transformed into a usable form. After performing these tasks, we can easily use the data in the further phases.
3. Model Planning: In this phase, we need to determine the various methods and techniques to establish the relations between input variables. We apply exploratory data analysis (EDA) using various statistical formulas and visualization tools to understand the relations between variables and to see what the data can tell us. Common tools used for model planning are:
SQL Analysis Services
R
SAS
Python
4. Model building: In this phase, datasets are prepared for training and testing, and the model is built and executed using the techniques selected during model planning.
5. Operationalize: In this phase, we deliver the final reports of the project, along with briefings, code, and technical documents. This phase provides a clear overview of the complete project performance and other components on a small scale before full deployment.
6. Communicate results: In this phase, we check whether we have reached the goal that was set in the initial phase. We communicate the findings and final result to the business team.
2. Elaborate the data science components. CO1
The main components of Data Science are given below:
1. Statistics: Statistics is one of the most important components of data science.
Statistics is a way to collect and analyze numerical data in large amounts and find meaningful insights from it.
2. Domain Expertise: In data science, domain expertise binds data science
together. Domain expertise means specialized knowledge or skills of a particular
area. In data science, there are various areas for which we need domain experts.
3. Data engineering: Data engineering is a part of data science, which involves acquiring, storing, retrieving, and transforming the data. Data engineering also adds metadata (data about data) to the data.
4. Visualization: Data visualization means representing data in a visual context so that people can easily understand the significance of the data. Data visualization makes it easy to grasp huge amounts of data through visuals.
5. Advanced computing: Advanced computing does the heavy lifting of data science. It involves designing, writing, debugging, and maintaining the source code of computer programs.
A very large amount of data is generated in big data and data science. This data is of various types, and the main categories of data are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Structured Data
• Structured data is arranged in a rows-and-columns format. It helps applications retrieve and process data easily. A database management system is used for storing structured data.
• The term structured data refers to data that is identifiable because it is organized
in a structure. The most common form of structured data or records is a database
where specific information is stored based on a methodology of columns and
rows.
• Structured data is also searchable by data type within content. Structured data is
understood by computers and is also efficiently organized for human readers.
• An Excel table is an example of structured data.
Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data; therefore, it is difficult to retrieve the required information. Unstructured data has no identifiable structure.
• The unstructured data can be in the form of Text: (Documents, email messages,
customer feedbacks), audio, video, images. Email is an example of unstructured
data.
• Even today, in most organizations more than 80% of the data is in unstructured form. It carries lots of information, but extracting information from these various sources is a very big challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restriction or sequence for unstructured data.
5. Since there is no structural binding for unstructured data, it is unpredictable in
nature.
Natural Language
• Natural language is a special type of unstructured data. Natural language processing enables machines to recognize characters, words and sentences, then apply meaning and understanding to that information.
Graph-Based or Network Data
• Nodes represent entities, which can be of any object type that is relevant to our
problem domain. By connecting nodes with edges, we will end up with a graph
(network) of nodes.
• A graph database stores nodes and relationships instead of tables or documents.
Data is stored just like we might sketch ideas on a whiteboard. Our data is stored
without restricting it to a predefined model, allowing a very flexible way of
thinking about and using it.
• Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.
• Graph databases are capable of sophisticated fraud prevention. With graph
databases, we can use relationships to process financial and purchase transactions
in near-real time. With fast graph queries, we are able to detect that, for example,
a potential purchaser is using the same email address and credit card as included in
a known fraud case.
• Graph databases can also help users easily detect relationship patterns such as multiple people associated with a personal email address or multiple people sharing the same IP address but residing at different physical addresses.
• Graph databases are a good choice for recommendation applications. With graph
databases, we can store in a graph relationships between information categories
such as customer interests, friends and purchase history. We can use a highly
available graph database to make product recommendations to a user based on
which products are purchased by others who follow the same sport and have
similar purchase history.
• Graph theory is probably the main method in social network analysis in the early
history of the social network concept. The approach is applied to social network
analysis in order to determine important features of the network such as the nodes
and links (for example influencers and the followers).
• Influencers on social network have been identified as users that have impact on
the activities or opinion of other users by way of followership or influence on
decision made by other users on the network as shown in Fig. 1.2.1.
Audio, Video and Images
• Multimedia data usually contains various forms of media, such as text, image, video, geographic coordinates and even pulse waveforms, which come from multiple sources. Data Science can be a key instrument covering big data, machine learning and data mining solutions to store, handle and analyze such heterogeneous data.
Streaming Data
Streaming data is data that is generated continuously by thousands of data sources,
which typically send in the data records simultaneously and in small sizes (order
of Kilobytes).
• Streaming data includes a wide variety of data such as log files generated by
customers using your mobile or web applications, ecommerce purchases, in-game
player activity, information from social networks, financial trading floors or
geospatial services and telemetry from connected devices or instrumentation in
data centers.
Process in detail
The amount of time needed for each stage varies, but as a whole, information retrieval is a time-consuming process, so we should start it as early as possible. To avoid doing needless work and unnecessary mistakes, we should plan our information retrieval well and make notes of our plan, including the different stages of the process. A good information retrieval plan is especially important when the area of our research is broad. A well-documented plan is easy to return to later on when needed.
7. Elucidate data cleaning, integrating and transforming data in detail. CO1 April/May 2023
Cleaning:
Data cleansing is a subprocess of the data science process that focuses on removing errors in our data so that the data becomes a true and consistent representation of the processes it originates from.
The first type is the interpretation error, such as incorrect use of
terminologies, like saying that a person’s age is greater than 300 years.
The second type of error points to inconsistencies between data sources or against our company's standardized values. An example of this class of errors is putting "Female" in one table and "F" in another when they represent the same thing: that the person is female.
Steps to Perform Data Cleaning
Performing data cleaning involves a systematic process to identify and rectify errors, inconsistencies, and inaccuracies in a dataset. The following are the essential steps to perform data cleaning (a short pandas sketch follows the list).
Data Cleaning
Removal of Unwanted Observations: Identify and eliminate irrelevant or
redundant observations from the dataset. The step involves scrutinizing
data entries for duplicate records, irrelevant information, or data points
that do not contribute meaningfully to the analysis. Removing unwanted
observations streamlines the dataset, reducing noise and improving the
overall quality.
Fixing Structure errors: Address structural issues in the dataset, such as
inconsistencies in data formats, naming conventions, or variable types.
Standardize formats, correct naming discrepancies, and ensure uniformity
in data representation. Fixing structure errors enhances data consistency
and facilitates accurate analysis and interpretation.
Managing Unwanted outliers: Identify and manage outliers, which are
data points significantly deviating from the norm. Depending on the
context, decide whether to remove outliers or transform them to minimize
their impact on analysis. Managing outliers is crucial for obtaining more
accurate and reliable insights from the data.
Handling Missing Data: Devise strategies to handle missing data
effectively. This may involve imputing missing values based on statistical
methods, removing records with missing values, or employing advanced
imputation techniques. Handling missing data ensures a more complete
dataset, preventing biases and maintaining the integrity of analyses.
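A minimal pandas sketch of the cleaning steps listed above; the table, column names, and values are hypothetical and chosen only for illustration.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "name": ["Asha", "Asha", "Ravi", "Mira"],
    "gender": ["Female", "Female", "M", "F"],
    "age": [29, 29, 345, np.nan],       # 345 is an interpretation error / outlier
})

df = df.drop_duplicates()                                           # unwanted observations
df["gender"] = df["gender"].replace({"F": "Female", "M": "Male"})   # structure errors
df.loc[df["age"] > 120, "age"] = np.nan                             # treat impossible ages as missing
df["age"] = df["age"].fillna(df["age"].median())                    # handle missing data
print(df)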
Integrating:
Combining Data from different Data Sources.
The data comes from several different places, and in this sub step we
focus on integrating these different sources.
We can perform two operations to combine information from different datasets: the first operation is joining and the second operation is appending (or stacking).
Joining Tables:
Joining tables allows us to combine the information of one observation found in one table with the information that we find in another table.
Appending Tables:
Appending or stacking tables is effectively adding observations from one table to another table. (A short pandas sketch of both operations follows.)
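A minimal pandas sketch of joining and appending; the tables and column names are hypothetical.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "city": ["Coimbatore", "Chennai"]})
orders    = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [250, 100, 400]})

joined = orders.merge(customers, on="customer_id", how="left")   # joining tables

jan = pd.DataFrame({"customer_id": [1], "amount": [250]})
feb = pd.DataFrame({"customer_id": [2], "amount": [400]})
appended = pd.concat([jan, feb], ignore_index=True)              # appending / stacking
print(joined)
print(appended)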
Transforming Data
Data transformation is the process of converting, cleansing, and structuring data
into a usable format that can be analyzed to support decision making processes,
and to propel the growth of an organization.
Data transformation techniques
There are several data transformation techniques that are used to clean data and
structure it before it is stored in a data warehouse or analyzed for business
intelligence. Not all of these techniques work with all types of data, and
sometimes more than one technique may be applied. Nine of the most common
techniques are:
1. Revising
Revising ensures the data supports its intended use by organizing it in the required
and correct way. It does this in a range of ways.
Dataset normalization revises data by eliminating redundancies in the data set. The
data model becomes more precise and legible while also occupying less space.
This process, however, does involve a lot of critical thinking, investigation and
reverse engineering.
Data cleansing ensures the formatting capability of data.
Format conversion changes the data types to ensure compatibility.
Key structuring converts values with built-in meanings to generic identifiers to be
used as unique keys.
Deduplication identifies and removes duplicates.
Data validation validates records and removes the ones that are incomplete.
Repeated and unused columns can be removed to improve overall performance
and legibility of the data set.
2. Manipulation
This involves the creation of new values from existing ones or changing current data through computation. Manipulation is also used to convert unstructured data into structured data that can be used by machine learning algorithms. Common operations include (see the short sketch after this list):
Derivation, which is cross column calculations
Summarization that aggregates values
Pivoting which involves converting columns values into rows and vice versa
Sorting, ordering and indexing of data to enhance search performance
Scaling, normalization and standardization that helps in comparing dissimilar
numbers by putting them on a consistent scale
Vectorization which helps convert non-numerical data into number arrays that are
often used for machine learning applications
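A small pandas sketch of two of the manipulations listed above (pivoting and scaling); the sales table is invented for illustration.
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 150, 80, 120],
})

pivoted = sales.pivot(index="region", columns="quarter", values="revenue")   # pivoting
scaled = (sales["revenue"] - sales["revenue"].min()) / (
    sales["revenue"].max() - sales["revenue"].min()
)                                                                            # scaling to [0, 1]
print(pivoted)
print(scaled)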
3. Separating
This involves dividing data values into their parts for granular analysis.
Splitting involves dividing up a single column with several values into separate
columns with each of those values. This allows for filtering on the basis of certain
values.
4. Combining/ integrating
Records from across tables and sources are combined to acquire a more holistic
view of activities and functions of an organization. It couples data from multiple
tables and datasets and combines records from multiple tables.
5. Data smoothing
This process removes meaningless, noisy, or distorted data from the data set. By removing outliers, trends are more easily identified.
6. Data aggregation
This technique gathers raw data from multiple sources and turns it into a summary form which can be used for analysis, for example turning raw transaction records into statistics such as averages and sums.
7. Discretization
With the help of this technique, interval labels are created from continuous data to make it more efficient and easier to analyze. Decision tree algorithms can be used in this process to transform large datasets into categorical data. A short sketch follows.
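A minimal sketch of discretization with pandas; the bin edges and labels are arbitrary choices for illustration.
import pandas as pd

ages = pd.Series([5, 17, 23, 41, 67, 80])
age_group = pd.cut(ages, bins=[0, 18, 40, 60, 100],
                   labels=["child", "young adult", "middle-aged", "senior"])
print(age_group)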
8. Generalization
Low level data attributes are transformed into high level attributes by using the
concept of hierarchies and creating layers of successive summary data. This helps
in creating clear data snapshots.
9. Attribute construction
In this technique, a new set of attributes is created from an existing set to facilitate
the mining process.
8. Detail the term Exploratory data analysis. CO1
During exploratory data analysis we take a deep dive into the data. Information becomes much easier to grasp when shown in a picture, therefore we mainly use graphical techniques to gain an understanding of the data and the interactions between variables. This phase is about exploring data, so keeping our mind open and our eyes peeled is essential during the exploratory data analysis phase. The goal isn't to cleanse the data, but it's common that we'll still discover anomalies we missed before, forcing us to take a step back and fix them.
The visualization techniques we use in this phase range from simple line graphs or histograms to more complex diagrams such as Sankey and network graphs. Sometimes it's useful to compose a composite graph from simple graphs to get even more insight into the data. Other times the graphs can be animated or made interactive to make exploration easier and, let's admit it, way more fun.
A bar chart, a line plot, and a distribution are some of the graphs used in
exploratory analysis.
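A minimal matplotlib sketch of these three basic EDA graphs, drawn on synthetic data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
values = rng.normal(loc=50, scale=10, size=500)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(["A", "B", "C"], [10, 25, 17]); axes[0].set_title("Bar chart")
axes[1].plot(np.arange(12), rng.integers(10, 40, 12)); axes[1].set_title("Line plot")
axes[2].hist(values, bins=30); axes[2].set_title("Distribution")
plt.tight_layout()
plt.show()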
Figure 2.16. Drawing multiple plots together can help us understand the structure of our data over multiple variables.
The above figure shows another technique: brushing and linking. With brushing and linking we combine and link different graphs and tables (or views) so that changes in one graph are automatically transferred to the other graphs. This interactive exploration of data facilitates the discovery of new insights.
Link and brush allows us to select observations in one plot and highlight the same observations in the other plots.
The above figure shows the average score per country for questions. Not only does this indicate a high correlation between the answers, but it's easy to see that when we select several points on a subplot, the points will correspond to similar points on the other graphs. In this case the selected points on the left graph correspond to points on the middle and right graphs, although they correspond better in the middle and right graphs.
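True brushing and linking requires an interactive environment (for example Plotly or Bokeh); the static matplotlib sketch below only illustrates the linking idea by highlighting the same (pretend) selected observations in two different views, using synthetic data.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x, y, z = rng.normal(size=(3, 100))
selected = x > 1.0                      # pretend the user "brushed" these points

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(x, y, c="lightgrey"); ax1.scatter(x[selected], y[selected], c="red")
ax2.scatter(x, z, c="lightgrey"); ax2.scatter(x[selected], z[selected], c="red")
ax1.set_title("View 1 (x vs y)"); ax2.set_title("View 2 (x vs z)")
plt.show()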
Figure 2.20. Example boxplot: each user category has a distribution of the
appreciation each has for a certain picture on a photography website.
Model building is a crucial phase in the data science process, where data is transformed into actionable insights and predictions. The following is a step-by-step guide to model building, covering the essential techniques needed to develop accurate and reliable predictive models.
Clearly define the problem you aim to solve and establish measurable objectives.
Understand the scope, constraints, and desired outcomes of your model. This step
ensures that your model aligns with the problem at hand and provides meaningful
insights.
Collect the relevant data required for model building. Clean and preprocess the
data, handling missing values, outliers, and inconsistencies. Perform feature
engineering and selection to extract meaningful predictors and ensure data quality.
Split your data into training and testing sets. The training set is used to train the
model, while the testing set serves as an unseen dataset for evaluating the model’s
performance. Consider techniques like cross-validation for robust model
assessment.
Select the appropriate machine learning algorithm based on your problem type
(e.g., classification, regression) and data characteristics. Consider popular
algorithms like linear regression, decision trees, random forests, support vector
machines, or deep learning models.
Fit the selected algorithm to the training data. Adjust the model’s parameters and
hyperparameters to optimize its performance. Use techniques like grid search or
Bayesian optimization to find the best parameter settings.
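A hedged sketch of the splitting and grid-search steps described above, assuming scikit-learn; the iris dataset and the random forest parameter grid are stand-ins chosen only for illustration.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,                       # 5-fold cross-validation on the training set
)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))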
Understand and interpret the model’s output to gain insights into the underlying
patterns and relationships in the data. Analyze feature importances, coefficients, or
decision boundaries to explain the model’s behavior. Communicate the results
effectively to stakeholders.
1. Improved Decision-Making
o Use: Data science helps organizations make better decisions by analyzing historical data to predict future trends. For example, companies can use customer data to anticipate demand and optimize their inventory.
o Benefit: Companies like Amazon use predictive analytics to recommend products to users based on past behavior, leading to increased sales and customer satisfaction.
o Counterexample: Over-reliance on data-driven decisions without considering external factors or human insights can lead to poor outcomes. For instance, a stock trading algorithm that solely focuses on historical trends might miss critical political events, leading to massive losses.
2. Enhanced Customer Experiences
o Use: Data science enables personalized recommendations and targeted marketing strategies. Retailers can use customer data to create personalized shopping experiences.
o Benefit: Netflix uses data to recommend shows and movies tailored to individual preferences, significantly boosting user engagement.
o Counterexample: If the data is biased or incomplete, personalization can fail. For example, a recommendation system might suggest irrelevant products or content, annoying users and driving them away.
3. Operational Efficiency
o Use: Companies can optimize their operations by identifying bottlenecks or inefficiencies through data analysis. For example, manufacturers can use data to improve production schedules and reduce downtime.
o Benefit: UPS uses data science to optimize delivery routes, saving millions of gallons of fuel annually.
o Counterexample: A purely data-driven focus might lead to short-term gains but long-term problems. For example, a company might use data to cut costs excessively, leading to a decline in employee satisfaction and productivity over time.
4. Fraud Detection and Risk Management
o Use: Data science is used extensively in finance to detect fraudulent activities and assess credit risks. Machine learning algorithms can spot patterns that humans might miss.
o Benefit: Banks use data science to detect unusual transactions and prevent credit card fraud, saving millions in potential losses.
o Counterexample: Relying solely on algorithms for fraud detection might lead to false positives. Genuine customers might have their accounts flagged or frozen erroneously, leading to dissatisfaction.
5. Predictive Maintenance
o Use: In industries such as manufacturing and aviation, data science can predict equipment failures before they happen, allowing for timely maintenance.
o Benefit: Airlines use predictive maintenance to keep planes running safely and efficiently by analyzing data from engine sensors.
o Counterexample: Predictive models might be inaccurate if the input data is poor or incomplete, potentially leading to unnecessary maintenance costs or missed critical repairs.
6. Advances in Healthcare
o Use: Data science allows for personalized medicine, predictive diagnostics, and drug discovery. By analyzing patient data, doctors can provide better, more personalized treatments.
o Benefit: Cancer treatment can be tailored to a patient's genetic profile, improving survival rates and reducing side effects.
o Counterexample: Misinterpretation of medical data or incorrect use of algorithms can have dire consequences. For instance, over-reliance on AI-driven diagnostic tools might lead to misdiagnoses, putting patients at risk.
7. Cost Reduction
o Use: Organizations can identify areas to reduce costs by analyzing data on operational inefficiencies, customer behavior, and more.
o Benefit: Data science helps optimize supply chains, reduce waste, and streamline operations, reducing overall costs for companies like Walmart.
o Counterexample: Cost-cutting driven purely by data can sometimes be shortsighted. For example, a company might focus on reducing employee benefits to cut costs, only to face high turnover and decreased employee morale.
8. Scientific Discovery and Innovation
o Use: Data science facilitates advancements in fields like genomics, climate science, and physics by analyzing large datasets that were previously too vast for human analysis.
o Benefit: Climate scientists use data models to predict future environmental conditions and plan mitigation strategies.
o Counterexample: Over-reliance on models can lead to inaccurate conclusions. For instance, predictions based on incomplete or flawed climate data could lead to ineffective or misdirected policy decisions.
Counterexamples and limitations of data science:
1. Bias in Algorithms
o Example: In hiring, some companies have used data-driven models to screen candidates. If the data is biased (e.g., historical data that favored a particular demographic), the algorithm will perpetuate that bias, resulting in discriminatory hiring practices.
2. Data Privacy Issues
o Example: Companies using data science for targeted advertising may inadvertently violate user privacy by collecting too much personal information. The Cambridge Analytica scandal, where user data from Facebook was used without consent, is a prime example of data misuse.
3. Ethical Concerns
o Example: Predictive policing, where law enforcement uses data to anticipate where crimes might occur, raises concerns about racial profiling and the invasion of privacy. In some cases, communities have been unfairly targeted based on biased data.
4. Overfitting Models
o Example: In machine learning, if a model is too closely tailored to the training data, it may perform poorly on new data (overfitting). For instance, a stock market model that perfectly predicts past performance may fail when applied to future data because the market conditions have changed.
5. Inability to Capture Complex Human Behavior
o Example: Customer behavior is often influenced by emotional, social, and contextual factors that data alone cannot capture. A retail company might use past purchasing data to predict future behavior, but it might miss major shifts in preferences due to social trends, rendering the model inaccurate.
Illustrate in detail the different facets of data with examples. CO1 Apr/May 2024
Facets of Data
• A very large amount of data is generated in big data and data science. This data is of various types, and the main categories of data are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
Structured Data
• Structured data is arranged in a rows-and-columns format. It helps applications retrieve and process data easily. A database management system is used for storing structured data.
• The term structured data refers to data that is identifiable because it is organized in a structure. The most
common form of structured data or records is a database where specific information is stored based on a
methodology of columns and rows.
• Structured data is also searchable by data type within content. Structured data is understood by computers and
is also efficiently organized for human readers.
Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data; therefore, it is difficult to retrieve the required information. Unstructured data has no identifiable structure.
• The unstructured data can be in the form of Text: (Documents, email messages, customer feedbacks), audio,
video, images. Email is an example of unstructured data.
• Even today, in most organizations more than 80% of the data is in unstructured form. It carries lots of information, but extracting information from these various sources is a very big challenge.
Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize characters, words and sentences, then apply
meaning and understanding to that information. This helps machines to understand language as humans do.
• Natural language processing is the driving force behind machine intelligence in many modern real-world
applications. The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion and sentiment analysis.
•For natural language processing to help machines understand human language, it must go through speech
recognition, natural language understanding and machine translation. It is an iterative process comprised of
several layers of text analysis.
Machine-Generated Data
• Machine data contains a definitive record of all activity and behavior of our customers, users, transactions, applications, servers, networks, factory machinery and so on.
• It's configuration data, data from APIs and message queues, change events, the output of diagnostic
commands and call detail records, sensor data from remote equipment and more.
• Examples of machine data are web server logs, call detail records, network event logs and telemetry.
• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate machine data.
Machine data is generated continuously by every processor-based system, as well as many consumer-oriented
systems.
• It can be either structured or unstructured. In recent years, the increase of machine data has surged. The
expansion of mobile devices, virtual servers and desktops, as well as cloud- based services and RFID
technologies, is making IT infrastructures more complex.
Graph-Based or Network Data
• Nodes represent entities, which can be of any object type that is relevant to our problem domain. By connecting nodes with edges, we will end up with a graph (network) of nodes.
• A graph database stores nodes and relationships instead of tables or documents. Data is stored just like we
might sketch ideas on a whiteboard. Our data is stored without restricting it to a predefined model, allowing a very flexible way of thinking about and using it.
• Graph databases are used to store graph-based data and are queried with specialized query languages such as
SPARQL.
• Graph databases are capable of sophisticated fraud prevention. With graph databases, we can use relationships
to process financial and purchase transactions in near-real time. With fast graph queries, we are able to detect
that, for example, a potential purchaser is using the same email address and credit card as included in a known
fraud case.
• Graph databases can also help users easily detect relationship patterns such as multiple people associated with a personal email address or multiple people sharing the same IP address but residing at different physical addresses.
• Graph databases are a good choice for recommendation applications. With graph databases, we can store in a
graph relationships between information categories such as customer interests, friends and purchase history.
We can use a highly available graph database to make product recommendations to a user based on which
products are purchased by others who follow the same sport and have similar purchase history.
• Graph theory is probably the main method in social network analysis in the early history of the social network
concept. The approach is applied to social network analysis in order to determine important features of the
network such as the nodes and links (for example influencers and the followers).
• Influencers on social network have been identified as users that have impact on the activities or opinion of
other users by way of followership or influence on decision made by other users on the network as shown in
Fig. 1.2.1.
• Graph theory has proved to be very effective on large-scale datasets such as social network data. This is
because it is capable of by-passing the building of an actual visual representation of the data to run directly on
data matrices.
Audio, Video and Images
• The terms audio and video commonly refer to time-based media storage formats for sound/music and moving-picture information. Audio and video digital recordings, also referred to as audio and video codecs, can be uncompressed, lossless compressed or lossy compressed depending on the desired quality and use cases.
• It is important to remark that multimedia data is one of the most important sources of information and
knowledge; the integration, transformation and indexing of multimedia data bring significant challenges in data
management and analysis. Many challenges have to be addressed including big data, multidisciplinary nature
of Data Science and heterogeneity.
• Data Science is playing an important role to address these challenges in multimedia data. Multimedia data
usually contains various forms of media, such as text, image, video, geographic coordinates and even pulse
waveforms, which come from multiple sources. Data Science can be a key instrument covering big data,
machine learning and data mining solutions to store, handle and analyze such heterogeneous data.
Streaming Data
Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (order of Kilobytes).
• Streaming data includes a wide variety of data such as log files generated by customers using your mobile or
web applications, ecommerce purchases, in-game player activity, information from social networks, financial
trading floors or geospatial services and telemetry from connected devices or instrumentation in data centers
12. Draw and outline step by step activities in data science process. CO1 Apr/Ma
y 2024
Example
Suppose a retail company wants to increase its sales by identifying the factors that influence customer purchase
decisions. The data science team will identify the problem and plan the project by determining the data sources
(e.g., transaction data, customer data), the data collection process (e.g., data cleaning, data transformation), and
the analytical methods (e.g., regression analysis, decision trees) that will be used to analyze the data.
Example
In the retail company example, the data science team will collect data on customer demographics, transaction
history, and product information.
Example
In the retail company example, the data science team will remove any duplicate or missing data from the
customer and transaction datasets. They may also merge the datasets to create a single dataset that can be
analyzed.
Example
In the retail company example, the data science team may use regression analysis to identify the factors that
influence customer purchase decisions. They may also use clustering to segment customers based on their
purchase behavior.
Example
In the retail company example, the data science team may build a predictive model that can be used to predict
customer purchase behavior based on demographic and product information.
Example
In the retail company example, the data science team may test the predictive model using a validation dataset to
ensure that it accurately predicts customer purchase behavior.
Example
In the retail company example, the data science team may deploy the predictive model into the company’s
customer relationship management (CRM) system so that it can be used to make targeted marketing campaigns.
Part-C
Answer all the questions
1 Outline the purpose of data cleansing. How are missing and nullified attributes handled and modified during the preprocessing stage? CO1
Data cleansing, also known as data cleaning or data scrubbing, is a crucial process in data preprocessing that
aims to improve the quality and accuracy of data by correcting or removing erroneous, incomplete, or irrelevant
data. The purpose of data cleansing includes:
Improving Data Quality: Ensuring that the data is accurate, complete, and relevant to the analysis.
Enhancing Data Consistency: Ensuring that data is uniform across different sources and systems.
Removing Redundancy: Eliminating duplicate records to ensure efficiency in processing.
Facilitating Accurate Analysis: Clean data helps in producing more reliable insights during data analysis.
During preprocessing, missing and nullified attributes are handled through various techniques depending on the
nature of the data and the problem at hand. These techniques include:
While working with data it is a common scenario for data scientists to deal with missing values. Handling these missing values is very important, as most machine learning algorithms do not support missing values, and even for algorithms like KNN and Naive Bayes that can handle missing values, the results could be skewed. Hence handling them efficiently is very important, as it affects the performance of the model.
As with any other exploratory data analysis method, there is no one method that fits all. There are different approaches for different kinds of problems, such as time series, machine learning, and regression, so it is difficult to provide a general solution. Below we go through the types of missing values and ways of handling them.
Types of missing values
Missing values in a dataset can occur for various reasons, and understanding the types of missing values can
help in choosing appropriate strategies for handling them.
1. Missing Completely at Random (MCAR):
In this scenario missing values occur completely at random and there is no relationship between the missing data and any other values in the dataset; that is, there is no pattern. We can say that the probability of data being missing is the same for all observations.
E.g., Suppose that after the customer service call ends, we would be asked to give ratings to the customer
representative. Not all the customers would do this and any customer who decides to give feedback is
completely random irrespective of their experience or any other factors. In this case the missing values in the
feedback column would be an MCAR.
2. Missing at Random (MAR)
In this scenario, missing values do not occur at random but the pattern of missingness could be explained by
other observations. That is, the likelihood of a value missing in the dataset could possibly be due to some other
variables in the dataset.
For e.g., suppose a survey is taken at a dermatology clinic where the gender and their skincare routine is asked.
Assume that most of the females answer the survey whereas men are less likely to answer. So here, why the
data is missing could be explained by the factor, that is gender. In this case, the missing data in the skincare
routine column is MAR.
3. Missing Not at Random (MNAR):
In this case, the missing values are not random and cannot be explained by the observed data. This can be a challenging case, as the possible reasons for the missingness are related to the unobserved data.
For example, people with higher incomes may refuse to share exact information in a survey or questionnaire.
1. Deleting the column with missing data
No doubt this is one of the quickest techniques one can use to deal with missing data, but we also have to keep in mind that there is a loss of information. This technique should only be used when the majority of the values in a column are NULL.
2. Deleting the row with missing data
In this method we delete rows which have at least one NULL value. This is not the best practice, because data is information: even though the other values are non-null, we delete the entire row if there is at least one NULL value. For instance, if every row has some column value missing, you might end up deleting the whole dataset.
Out of the 891 rows, 177 rows have age as NULL and 2 rows have embark_town as NULL. On deletion of those rows we get 712 rows as a result.
3. Imputing missing values with mean/median
Columns in the dataset which have numeric continuous values can be replaced with the mean, median, or mode of the remaining values in the column. This method can prevent the loss of data compared to the earlier methods. Replacing missing values with these approximations is a statistical approach to handling missing values.
This approach is popularly used when there is a small number of missing values in the data. However, when there are many missing values, mean or median imputation can result in a loss of variation in the data.
Mean and median imputation can provide a good estimate of the missing values for normally distributed data and skewed data, respectively.
The downside of this approach is that it cannot be applied to categorical columns. Also, mean imputation is sensitive to outliers and may not be a good representation of the central tendency of the data.
# fillna returns a new Series; do not combine assignment with inplace=True
df['age'] = df['age'].fillna(df['age'].mean())
3.1 Imputing missing values with mean/median of group
We can fill the missing values using group level statistics in the following manner.
#Mean
df['age'] = df['age'].fillna(df.groupby('class')['age'].transform('mean'))
#Median
df['age'] = df['age'].fillna(df.groupby('class')['age'].transform('median'))
In this method we have filled the NULL values of age by taking the mean of age at the 'class' group level. We perform this method for filling the NULL values in the age column on the assumption that people in similar age groups would have booked a particular class of ticket on the ship.
4. Imputation method for categorical columns
When missing values are in categorical columns (string or numerical), they can be replaced with the most frequent category. If the number of missing values is very large, they can be replaced with a new category.
df['deck'].value_counts()
# if 'deck' is a pandas Categorical column, the new category must be added first,
# e.g. df['deck'] = df['deck'].cat.add_categories('H'), before filling
df['deck'] = df['deck'].fillna('H')
In our dataset the column deck has almost 688 values which are NULL. Hence we are creating a new category called 'H' and substituting it for the NULL values.
5. Forward Fill and Backward Fill
Forward fill (ffill) and backward fill (bfill) are methods used to fill missing values by carrying forward the last
observed non-missing value (for ffill) or by carrying backward the next observed non-missing value (for bfill).
If missing values should be filled with the most recent non-missing value, use ffill. If missing values should be
filled with the next non-missing value, use bfill.
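A minimal pandas sketch of forward and backward fill:
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])
print(s.ffill())   # 1, 1, 1, 4, 4   -> carries the last seen value forward
print(s.bfill())   # 1, 4, 4, 4, NaN -> carries the next seen value backward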
6. Interpolation
Interpolation is a technique used to fill missing values based on the values of adjacent data points. This technique is mainly used for time series data or in situations where the missing data points are expected to vary smoothly or follow a certain trend. It is also used for regularly sampled data.
Interpolation can be understood as a weighted average. The weights are inversely related to the distance to its
neighboring points.
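A minimal pandas sketch of linear interpolation:
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, np.nan, 40.0])
print(s.interpolate())   # 10, 20, 30, 40 - gaps estimated from the neighbours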
7. Model Based Imputation (Regression Model)
The earlier methods for handling missing values do not take advantage of the correlation between the variable containing the missing value and the other variables.
In this method we use predictive models to impute missing values based on other features in the dataset.
A regression or classification model can be used for the prediction of missing values, depending on the nature (continuous or categorical) of the feature having the missing value.
Here the 'Age' column contains missing values, so for predicting the null values the data is split into rows where 'Age' is known (used to train the model) and rows where 'Age' is missing (where the trained model predicts the value). A short sketch follows.
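A minimal sketch of this split-and-predict idea, assuming scikit-learn and a DataFrame with hypothetical 'age', 'fare' and 'pclass' columns:
from sklearn.linear_model import LinearRegression

def impute_age_by_regression(df, features=("fare", "pclass")):
    features = list(features)
    known = df[df["age"].notna()]       # rows used to train the model
    missing = df[df["age"].isna()]      # rows whose age will be predicted
    if missing.empty:
        return df
    model = LinearRegression().fit(known[features], known["age"])
    df.loc[df["age"].isna(), "age"] = model.predict(missing[features])
    return df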
8. Multiple Imputation
The Iterative Imputer is a method for imputing missing values in a dataset. It belongs to the scikit-learn library
and implements the Multiple Imputation by Chained Equations (MICE) algorithm. MICE is an iterative
imputation approach that imputes missing values one variable at a time, conditioned on the other variables.
Suppose the feature ‘age’ is well correlated with the feature ‘Fare’ such that people with lower fares are also
younger and people with higher fares are also older. In that case, it would make sense to impute low age for
low fare values and high age for high fare values. So here, we are taking multiple features into account by
following a multivariate approach.
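A minimal sketch of MICE-style imputation with scikit-learn's IterativeImputer; the tiny DataFrame below is invented for illustration.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the estimator)
from sklearn.impute import IterativeImputer

df = pd.DataFrame({"age": [22, np.nan, 35, 58], "fare": [7.3, 8.1, 53.0, 80.0]})
imputed = IterativeImputer(random_state=0).fit_transform(df)
print(pd.DataFrame(imputed, columns=df.columns))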
9. K-Nearest Neighbors Imputations (KNNImputer)
Imputing missing values using k-Nearest Neighbors (KNN) is a technique where missing values are estimated
based on the values of their nearest neighbors in the feature space.
The idea is to find the k nearest data points and use their values to impute the missing values.
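A minimal sketch with scikit-learn's KNNImputer on the same kind of toy data:
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [22, np.nan, 35, 58], "fare": [7.3, 8.1, 53.0, 80.0]})
imputer = KNNImputer(n_neighbors=2)   # missing 'age' estimated from the 2 nearest rows
print(pd.DataFrame(imputer.fit_transform(df), columns=df.columns))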
2 How would you adapt the data science process to analyze real-time Twitter data during a breaking news event (e.g., a natural disaster) to understand user sentiments and environmental patterns? CO1
Adapting the data science process to analyze real-time Twitter data during a breaking news event, such as a
natural disaster, involves the following steps:
1. Problem Definition
Goal: Understand user sentiments and identify environmental patterns during a natural disaster.
Key Questions:
o What are the prevailing sentiments (fear, anxiety, hope) in user tweets?
o How are environmental conditions (floods, fires, etc.) being reported by users?
o Can we identify emerging trends, keywords, or areas affected by the disaster?
Constraints:
o Real-time data collection and processing.
o Potential data noise and misinformation.
2. Data Collection
Real-time Twitter API: Use the Twitter Streaming API to collect live tweets.
o Filters: Use keywords or hashtags related to the event (e.g., #flood, #earthquake, #hurricane).
o Location-Based Data: Include geotagged tweets for location analysis to identify where users
are reporting from.
External Data: Supplement Twitter data with external environmental data sources, such as weather
reports, satellite imagery, or government agencies' feeds.
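A minimal collection sketch, assuming the Tweepy library (v4+) and a valid bearer token for the Twitter/X filtered-stream endpoint; access tiers, rate limits, and field availability vary, so this is illustrative rather than definitive:
import tweepy

BEARER_TOKEN = "YOUR_BEARER_TOKEN"   # placeholder credential (assumption)

class DisasterStream(tweepy.StreamingClient):
    def on_tweet(self, tweet):
        # In practice, push each tweet into a queue/database for downstream processing
        print(tweet.id, tweet.text)

stream = DisasterStream(BEARER_TOKEN)
# Keyword/hashtag rule for the event; adjust to the actual disaster being tracked
stream.add_rules(tweepy.StreamRule("#flood OR #earthquake OR #hurricane lang:en"))
stream.filter(tweet_fields=["created_at", "geo"])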
3. Data Preprocessing
Text Preprocessing (a minimal sketch follows this list):
o Cleaning: Remove unnecessary characters (URLs, hashtags, mentions).
o Tokenization: Break tweets into individual tokens (words).
o Stopwords Removal: Remove common words like "the" and "is."
o Lemmatization: Convert words to their base forms (e.g., "running" to "run").
Filtering Misinformation:
o Detect and filter out spam, rumors, or irrelevant information using pre-built classifiers or rules.
Geotagging:
o Enrich non-geotagged tweets by extracting location references from the text (e.g., "New
York," "coastal city").
Sentiment Distribution:
o Use sentiment analysis models (e.g., VADER, TextBlob, or transformer-based models like
BERT) to classify tweets as positive, negative, or neutral (see the sketch after this list).
o Visualize sentiment trends over time to understand shifts in emotions as the event unfolds.
Environmental Patterns:
o Use keyword extraction or topic modeling (e.g., LDA) to identify patterns in the data related to
the disaster (e.g., flood levels, wind damage).
o Correlate emerging topics with the location data to track the affected regions.
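A minimal sentiment-scoring sketch with VADER from NLTK (assumes the vader_lexicon resource is downloaded); the ±0.05 compound-score thresholds are the commonly used convention:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')   # one-time download of the VADER lexicon

sia = SentimentIntensityAnalyzer()

def label_sentiment(tweet_text):
    score = sia.polarity_scores(tweet_text)['compound']
    if score >= 0.05:
        return 'positive'
    if score <= -0.05:
        return 'negative'
    return 'neutral'

print(label_sentiment("Rescue teams arrived quickly, feeling hopeful"))
print(label_sentiment("The flood destroyed everything, we are scared"))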
7. Evaluation
Model Evaluation:
o Measure the performance of sentiment analysis and topic models using metrics like accuracy,
precision, recall, and F1-score.
Continuous Validation:
o Perform real-time error analysis by cross-referencing sentiment trends with external
reports (e.g., government alerts, official news).
8. Communication of Results
Reporting:
o Create summaries and insights for stakeholders (e.g., government agencies, disaster response
teams) that include real-time sentiment shifts, affected areas, and trending environmental
patterns.
Visualization:
o Use dynamic visualizations (graphs, heatmaps, word clouds) to make the data easily
interpretable and actionable.
By following these steps, you can develop a comprehensive, real-time system for understanding public
sentiment and environmental patterns during a natural disaster through Twitter data.
3 Explain the data analytics life cycle. Write briefly about regression analysis. CO1
Data Analytics Lifecycle :
The data analytics lifecycle is designed for Big Data problems and data science projects. The cycle is iterative,
reflecting how real projects unfold. To address the distinct requirements of performing analysis on Big Data, a
step-by-step methodology is needed to organize the activities and tasks involved in acquiring, processing,
analyzing, and repurposing data.
Phase 1: Discovery –
The data science team learns and investigates the problem.
Develop context and understanding.
Come to know about data sources needed and available for the project.
The team formulates the initial hypothesis that can be later tested with data.
Phase 2: Data Preparation –
Steps to explore, preprocess, and condition data before modeling and analysis.
It requires the presence of an analytic sandbox; the team extracts, loads, and transforms data to get it into
the sandbox.
Data preparation tasks are likely to be performed multiple times and not in a predefined order.
Several tools commonly used for this phase are – Hadoop, Alpine Miner, Open Refine, etc.
Phase 3: Model Planning –
The team explores the data to learn about the relationships between variables and subsequently selects the key
variables and the most suitable models.
In this phase, the data science team also develops data sets for training, testing, and production purposes.
The models themselves are then built and executed in the next phase, based on the work done here.
Several tools commonly used for this phase are MATLAB and STATISTICA.
Phase 4: Model Building –
The team develops datasets for testing, training, and production purposes.
The team also considers whether its existing tools will suffice for running the models or whether a more
robust environment is needed for executing them.
Free or open-source tools – R and PL/R, Octave, WEKA.
Commercial tools – MATLAB and STATISTICA.
Phase 5: Communication of Results –
After executing the models, the team needs to compare the outcomes of modeling to the criteria established for
success and failure.
The team considers how best to articulate the findings and outcomes to the various team members and
stakeholders, taking into account caveats and assumptions.
The team should identify key findings, quantify the business value, and develop a narrative to summarize and
convey the findings to stakeholders.
Phase 6: Operationalize –
The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work
in a controlled way before broadening it to the full enterprise of users.
This approach enables the team to learn about the performance and related constraints of the model in a
production environment on a small scale and to make adjustments before full deployment.
The team delivers final reports, briefings, and code.
Free or open source tools – Octave, WEKA, SQL, MADlib.
Regression Analysis:
Regression is a statistical technique that relates a dependent variable to one or more independent
variables.
A regression model is able to show whether changes observed in the dependent variable are associated
with changes in one or more of the independent variables.
It does this by essentially determining a best-fit line and seeing how the data is dispersed around this
line.
Regression helps economists and financial analysts in things ranging from asset valuation to making
predictions.
For regression results to be properly interpreted, several assumptions about the data and the model
itself must hold.
Two different types of regression:
The main difference between simple and multiple regression is the number of independent variables used:
Simple regression
Also known as linear regression, this technique uses one independent variable to predict a dependent
variable. For example, predicting weight based on height is a simple regression.
R2021 / CSBS / AD3491 – Fundamentals of Data Science and Analytics/ II YEAR / III SEM/QB/44
AKSHAYA COLLEGE OF ENGINEERING AND TECHNOLOGY
Multiple regression
This technique uses two or more independent variables to predict a dependent variable. For example,
predicting height based on age and weight is a multiple regression.
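A minimal scikit-learn sketch of both cases, using small made-up numbers purely for illustration:
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple regression: one independent variable (height) predicts weight
height = np.array([[150], [160], [170], [180]])          # cm
weight = np.array([50, 58, 66, 74])                       # kg
simple_model = LinearRegression().fit(height, weight)
print(simple_model.predict([[175]]))                      # predicted weight at 175 cm

# Multiple regression: two independent variables (age, weight) predict height
age_weight = np.array([[10, 32], [12, 40], [14, 50], [16, 60]])
heights = np.array([138, 150, 162, 172])
multiple_model = LinearRegression().fit(age_weight, heights)
print(multiple_model.predict([[13, 45]]))                 # predicted height at age 13, 45 kg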
4 How do you handle ordinal and nominal data in a dataset? Explain how to CO1
integrate two data frames.
In data science, handling categorical data and integrating data frames are crucial for effective analysis. Let's
explore these in the context of a data science workflow.
Categorical data must be transformed to ensure compatibility with machine learning models. These models
typically require numerical inputs, making it essential to handle ordinal and nominal data properly.
Ordinal data retains an inherent order, so preserving this order during data transformation is key.
Label encoding or custom mapping assigns numerical values in a way that reflects the rank or order of
the categories.
Example Use Case: In a customer satisfaction survey, the responses "poor", "average", and "excellent"
are ranked, so they can be encoded as 1, 2, and 3, respectively, preserving their order for analysis.
Nominal data has no inherent order, and transforming it requires methods that don’t impose any
artificial ranking.
One-hot encoding is widely used in data science to convert categorical variables into binary vectors.
This ensures that no ordinal relationship is implied between categories.
Example Use Case: If analyzing different car brands ("Ford", "BMW", "Toyota"), one-hot encoding
would create separate columns for each brand.
# Ordinal encoding: a custom mapping preserves the Low < Medium < High order
# (sklearn's LabelEncoder assigns labels alphabetically, which would break the order here)
import pandas as pd

data = pd.DataFrame({'Satisfaction': ['Low', 'Medium', 'High']})
satisfaction_order = {'Low': 1, 'Medium': 2, 'High': 3}
data['Satisfaction_encoded'] = data['Satisfaction'].map(satisfaction_order)

# One-hot encoding: nominal categories become binary columns with no implied order
data = pd.DataFrame({'Car_Brand': ['Ford', 'BMW', 'Toyota']})
one_hot_encoded_data = pd.get_dummies(data, columns=['Car_Brand'])
In data science, merging and concatenating data frames is essential when working with multiple data sources or
when splitting and aggregating datasets. Proper integration allows for a unified analysis.
1. Merging DataFrames
Merging is commonly used when datasets share a common key, and you want to combine them based on that
key. In data science projects, merging might involve linking demographic data with behavioral data on a
customer ID or combining sales data from multiple regions on a common region column.
Example: Merging two data sets on customer ID for further analysis of sales and demographics.
import pandas as pd

df_sales = pd.DataFrame({'CustomerID': [1, 2, 3], 'Sales': [100, 200, 300]})
df_demographics = pd.DataFrame({'CustomerID': [1, 2, 4], 'Age': [30, 40, 50]})
# An inner join keeps only the CustomerIDs present in both frames (1 and 2)
df_merged = pd.merge(df_sales, df_demographics, on='CustomerID', how='inner')
2. Concatenating DataFrames
Concatenation is used when stacking datasets either vertically or horizontally. In data science, this could mean
adding more observations (vertical stacking) or adding more features (horizontal stacking).
Vertical Concatenation: Used when two datasets have the same columns but different rows, such as
appending data from different months or years.
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
# axis=0 stacks the rows: same columns, more observations
df_combined = pd.concat([df1, df2], axis=0)
Horizontal Concatenation: Used when adding more features to the dataset (e.g., adding new attributes to the
existing data).
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
# axis=1 places the frames side by side: same rows, more features
df_combined = pd.concat([df1, df2], axis=1)
Typical use cases include merging customer sales data with web interaction logs on a customer ID, allowing you
to analyze sales trends alongside customer behavior, and concatenating monthly data from an e-commerce platform
row-wise to analyze overall trends across a year.
These steps ensure that categorical data is transformed correctly, and data frames are integrated efficiently,
allowing for more accurate and scalable analyses.