
AKSHAYA COLLEGE OF ENGINEERING AND TECHNOLOGY

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

AD3491 - FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS

QUESTION BANK

FOR IV SEMESTER

B.TECH. DEGREE COURSE

[REGULATION - 2021]

AKSHAYA COLLEGE OF ENGINEERING AND TECHNOLOGY

DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND DATA SCIENCE

QUESTION BANK

SUBJECT CODE : AD3491

SUBJECT NAME : FUNDAMENTALS OF DATA SCIENCE AND


ANALYTICS

REGULATION : 2021

ACADEMIC YEAR : 2024– 2025

YEAR/SEMESTER : II / IV

BATCH : 2023 - 2027

Prepared by:    Name:    Date:

Verified by:    Name:    Date:

Approved by:    Name:    Date:

VISION AND MISSION OF THE INSTITUTION


VISION

Emerge as a Premier Institute, producing globally competent engineers.


MISSION
Achieve Academic diligence through effective and innovative teaching-learning processes, using
ICT Tools.
Make students employable through rigorous career guidance and training programs.
Strengthen Industry Institute Interaction through MOUs and Collaborations.
Promote Research & Development by inculcating creative thinking through innovative projects
incubation.

VISION AND MISSION OF THE DEPARTMENT


VISION
To revolutionize the quality of AI technology by creating a state-of-the-art environment where
innovation, sustainability, and social impact converge.

MISSION
DM 1 : To develop and implement AI solutions that prioritize human values, ethics and social
responsibilities.
DM 2 : To provide cutting-edge AI research, driving innovation and advancing the state-of-the-art
in AI technology.
DM 3 : To provide industrial standards by means of collaborations for artificial intelligence and
data science.
DM 4 : To provide an excellent infrastructure that keeps up with modern trends and technologies
for professional entrepreneurship.

PROGRAM OUTCOMES (POS)


1 Engineering knowledge: Apply the knowledge of mathematics, science, engineering
fundamentals, and an engineering specialization to the solution of complex engineering
problems.


2 Problem analysis: Identify, formulate, review research literature, and analyze complex
engineering problems reaching substantiated conclusions using first principles of mathematics,
natural sciences, and engineering sciences.
3 Design/development of solutions: Design solutions for complex engineering problems and
design system components or processes that meet the specified needs with appropriate
consideration for the public health and safety, and the cultural, societal, and environmental
considerations.
4 Conduct investigations of complex problems: Use research-based knowledge and research
methods including design of experiments, analysis and interpretation of data, and synthesis of
the information to provide valid conclusions.
5 Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities
with an understanding of the limitations.
6 The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to
the professional engineering practice.
7 Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development.
8 Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
9 Individual and team work: Function effectively as an individual, and as a member or leader
in diverse teams, and in multidisciplinary settings.
10 Communication: Communicate effectively on complex engineering activities with the
engineering community and with society at large, such as, being able to comprehend and write
effective reports and design documentation, make effective presentations, and give and receive
clear instructions.

11 Project management and finance: Demonstrate knowledge and understanding of the


engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
12 Life-long learning: Recognize the need for, and have the preparation and ability to engage
in independent and life-long learning in the broadest context of technological changes.


PROGRAM EDUCATIONAL OBJECTIVES (PEOS)


PEO 1 : Apply the knowledge of basic sciences, mathematics, Artificial Intelligence, data science
and statistics to build systems that require analysis of huge volumes of data.
PEO 2 : Product Development: Design a model using Artificial Intelligence to solve critical
problems in the real world.
PEO 3 : Higher Studies: To enable the students to think logically, pursue life-long learning and
collaborate with an ethical attitude in a multidisciplinary team.

PROGRAMME SPECIFIC OUTCOME (PSOS)


PSO 1 : Create, select and apply the knowledge of AI and Data Science to solve societal problems.
PSO 2: Develop data analytics and data visualization skills, skills pertaining to knowledge
acquisition, knowledge representation and knowledge engineering, and hence be capable of
coordinating complex projects.


SYLLABUS
AD3491 FUNDAMENTALS OF DATA SCIENCE AND ANALYTICS    L T P C
                                                     3 0 0 3
COURSE OBJECTIVES:
 To understand the techniques and processes of data science
 To apply descriptive data analytics
 To visualize data for various applications
 To understand inferential data analytics
 To analyze and build predictive models from data
UNIT I INTRODUCTION TO DATA SCIENCE 08
Need for data science – benefits and uses – facets of data – data science process – setting the research
goal – retrieving data – cleansing, integrating, and transforming data – exploratory data analysis – build
the models – presenting and building applications.
UNIT II DESCRIPTIVE ANALYTICS 10
Frequency distributions – Outliers –interpreting distributions – graphs – averages – describing variability
– interquartile range – variability for qualitative and ranked data - Normal distributions – z scores
–correlation – scatter plots – regression – regression line – least squares regression line – standard error of
estimate – interpretation of r2 – multiple regression equations – regression toward the mean.
UNIT III INFERENTIAL STATISTICS 09
Populations – samples – random sampling – Sampling distribution- standard error of the mean -
Hypothesis testing – z-test – z-test procedure –decision rule – calculations – decisions – interpretations -
one-tailed and two-tailed tests – Estimation – point estimate – confidence interval – level of confidence –
effect of sample size.
UNIT IV ANALYSIS OF VARIANCE 09
t-test for one sample – sampling distribution of t – t-test procedure – t-test for two independent samples –
p-value – statistical significance – t-test for two related samples. F-test – ANOVA – Two-factor
experiments – three f-tests – two-factor ANOVA –Introduction to chi-square tests.
UNIT V PREDICTIVE ANALYTICS 09
Linear least squares – implementation – goodness of fit – testing a linear model – weighted resampling.
Regression using StatsModels – multiple regression – nonlinear relationships – logistic regression –
estimating parameters – Time series analysis – moving averages – missing values – serial correlation –
autocorrelation. Introduction to survival analysis.
TOTAL: 45 PERIODS

COURSE OUTCOMES:
Upon successful completion of this course, the students will be able to:
CO1: Explain the data analytics pipeline
CO2: Describe and visualize data
CO3: Perform statistical inferences from data
CO4: Analyze the variance in the data
CO5: Build models for predictive analytics


TEXT BOOKS
1. David Cielen, Arno D. B. Meysman, and Mohamed Ali, “Introducing Data Science”, Manning
Publications, 2016. (first two chapters for Unit I).
2. Robert S. Witte and John S. Witte, “Statistics”, Eleventh Edition, Wiley Publications, 2017.
3. Jake VanderPlas, “Python Data Science Handbook”, O’Reilly, 2016.

REFERENCES
1. Allen B. Downey, “Think Stats: Exploratory Data Analysis in Python”, Green Tea Press, 2014.
2. Sanjeev J. Wagh, Manisha S. Bhende, Anuradha D. Thakare, “Fundamentals of Data Science”,
CRC Press, 2022.
3. Chirag Shah, “A Hands-On Introduction to Data Science”, Cambridge University Press, 2020.
4. Vineet Raina, Srinath Krishnamurthy, “Building an Effective Data Science Practice: A Framework
to Bootstrap and Manage a Successful Data Science Practice”, Apress, 2021.

BLOOM'S TAXONOMY LEVELS (BTL)



LEVEL 1 – REMEMBERING (BTL1)

LEVEL 2 – UNDERSTANDING (BTL2)

LEVEL 3 – APPLYING (BTL3)

LEVEL 4 – ANALYZING (BTL4)

LEVEL 5 – EVALUATING (BTL5)

LEVEL 6 – CREATING (BTL6)


UNIT I INTRODUCTION TO DATA SCIENCE


Need for data science – benefits and uses – facets of data – data science process – setting the research goal –
retrieving data – cleansing, integrating, and transforming data – exploratory data analysis – build the models –
presenting and building applications.
PART - A
S.No | Question & Answer | CO | Univ QP (Month/Year)
1. What is Data Science and need for Data Science? CO1
Data – raw facts.
Data can be something simple and seemingly random and useless until it is organized. Data can be
anything: text, images, audio, video, etc. Example: 100 (a raw number, meaningless without context).
Data science uses mathematical algorithms, rules, and artificial intelligence in dealing with the
collection, refining, aligning, storage, manipulation, and utilization of data.
Data science is the ability to process and interpret data. This enables companies to make informed
decisions around growth, optimization, and performance.
2. Define the term Pattern and trends in data? CO1
A pattern means that the data (visual or not) are correlated, that they have a relationship, and that
they are predictable.
A trending quantity is a number that is generally increasing or decreasing.
Consider data on babies per woman in India from 1955-2015: the numbers are steadily decreasing
decade by decade, so this is a downward trend.
Consider US life expectancy from 1920-2000: the numbers are steadily increasing decade by decade,
so this is an upward trend.

3. Why statistics play a major role in Data Science? CO1


Statistics is a branch of mathematics that deals with collecting, analyzing and interpreting large
amounts of data. It allows us to derive knowledge from large datasets, and this knowledge can then
be used to make predictions, decisions, classifications, etc.

4. Difference between population and sample with an example CO1
A population is the entire group being studied; a sample is the specific subset of that group from
which the data are actually collected.
Population | Sample
Advertisements for IT jobs in the Netherlands | The top 50 search results for advertisements for IT jobs in the Netherlands on May 1, 2020
Songs from the Eurovision Song Contest | Winning songs from the Eurovision Song Contest that were performed in English
Undergraduate students in the Netherlands | 300 undergraduate students from three Dutch universities who volunteer for a psychology research study
All countries of the world | Countries with published data available on birth rates and GDP since 2000

5. What are the main phases of data science life cycle? CO1
 Discovery
 Data preparation
 Model planning
 Model building
 Operationalize
 Communicate results
6. What are the tools required for data science? CO1
 Data analysis tools: R, Python, Statistics, SAS, Jupyter, RStudio, MATLAB, Excel
 Data warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift
 Data visualization tools: R, Jupyter, Tableau, Cognos
 Machine learning tools: Spark, Mahout, Azure ML Studio
7. What is Data modeling? CO1
Data modeling is the step in which machine learning and statistical techniques are used to achieve
the project goal and predict future trends. By working with clustering algorithms, for example, we
can build models that uncover trends in the data that were not distinguishable in graphs and summary
statistics. These algorithms create groups of similar events (clusters) and express, more or less
explicitly, which features are decisive in these results.
8. What are the facets of data science? CO1
Identifying the structure of data
Cleaning, filtering, reorganizing, augmenting, and aggregating data
Visualizing data
Data analysis, statistics, and modeling
Machine Learning
9. List the process steps in a data science CO1
 Ideate
 Explore
 Model
 Validate
 Display
 Operate


10. What is meant by Data Smoothing? CO1


Data smoothing is a process used to remove noise from a dataset using algorithms. It highlights the
important features present in the dataset and helps in predicting patterns. When data is collected,
it can be processed to eliminate or reduce variance and other forms of noise.
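For illustration, a minimal Python/pandas sketch of smoothing with a moving average (the sample
values and the window size are assumptions, not taken from the prescribed text):

    import pandas as pd

    # hypothetical noisy series
    s = pd.Series([3, 8, 5, 12, 7, 15, 9, 18])
    # a centred moving average over a window of 3 points dampens short-term noise
    smoothed = s.rolling(window=3, center=True).mean()
    print(smoothed)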
11. What is exploratory data analysis? CO1
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and
summarize their main characteristics, often employing data visualization methods. It helps determine
how best to manipulate data sources to get the answers we need, making it easier for data scientists
to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
12. What is SMART goals framework? CO1
Specific: be clear on the what and the how
Measurable: identify the metrics that define success
Achievable: make it challenging but still reachable
Relevant: it should be important or interesting to you
Time-bound: give it a deadline
13. State the process of Retrieving data CO1
Information or data retrieval is often a continuous process during which we consider, reconsider
and refine the research problem, use various information resources, information retrieval techniques
and library services, and evaluate the information we find.
14. What is Data Cleaning? CO1
Data cleaning is the process of identifying and fixing incorrect data. Data can be in an incorrect
format, duplicated, corrupt, inaccurate, incomplete, or irrelevant. Various fixes can be made to the
data values representing incorrectness in the data. The data cleaning and validation steps undertaken
for any data science project are implemented using a data pipeline; each stage in a data pipeline
consumes input and produces output.
15. List out the common steps in the data cleaning process CO1
Remove duplicates
Remove irrelevant data
Standardize capitalization
Convert data types
Handle outliers
Fix errors
Translate language
Handle missing values
16. State the method of Handling Outliers. CO1
An outlier is a data point in statistics that dramatically deviates from other
observations. An outlier may reflect measurement variability, or it may point to an
experimental error; the latter is occasionally removed from the data set.
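For illustration, a minimal Python/pandas sketch of flagging outliers with the interquartile-range
(IQR) rule (the sample values and the 1.5 x IQR threshold are assumptions for illustration):

    import pandas as pd

    s = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])     # 95 is a suspicious value
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)  # points far outside the IQR
    print(s[mask])   # flagged outliers; drop or cap them depending on the context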
17. How do we handle missing values? CO1
During cleaning and munging in data science, handling missing values is one of the most common
tasks. Real-life data might contain missing values which need a fix before the data can be used for
analysis. We can handle missing values by either removing the records that have missing values, or
filling the missing values using a statistical technique or domain understanding of the data.
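For illustration, a minimal pandas sketch of the two options above (the column names and fill values
are assumptions):

    import pandas as pd
    import numpy as np

    df = pd.DataFrame({"age": [25, np.nan, 31, 40], "city": ["A", "B", None, "D"]})
    dropped = df.dropna()                          # option 1: remove records with missing values
    filled = df.fillna({"age": df["age"].mean(),   # option 2: fill numeric gaps with the mean
                        "city": "unknown"})        # and categorical gaps with a placeholder
    print(dropped)
    print(filled)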
18. List out the Data Cleaning Tools. CO1
1. Microsoft Excel (popular data cleaning tool)
2. Programming languages (Python, Ruby, SQL)
3. Data visualization tools (to spot errors in the dataset)
4. Proprietary software (OpenRefine, Trifacta, etc.)


19 Define Data Integration. CO1


Data integration is the process of merging data from several disparate sources. While performing
data integration, we must address data redundancy, inconsistency, duplicity, etc. In data mining,
data integration is a record preprocessing method that merges data from a couple of heterogeneous
data sources into coherent data, to retain and provide a unified perspective of the data.
20. Why is the Data Integration Important? CO1
One of the most common applications for data integration services and technologies
is market and consumer data collection. Data integration supports queries in these
vast datasets, benefiting from corporate intelligence and consumer data analytics to
stimulate real-time information delivery. Enterprise data integration feeds integrated
data into data centers to enable enterprise reporting, predictive analytics, and business
intelligence.
21. Define Data transformation. CO1
Data transformation is an essential data preprocessing technique that must be
performed on the data before data mining to provide patterns that are easier to
understand. Data transformation changes the format, structure, or values of the
data and converts them into clean, usable data. Data may be transformed at two
stages of the data pipeline for data analytics projects.
22. Difference between structured and unstructured data. CO1 Apr/May 2024

Basis of comparison | Structured data | Unstructured data
Technology | It is based on a relational database. | It is based on character and binary data.
Flexibility | Structured data is less flexible and schema-dependent. | There is an absence of schema, so it is more flexible.
Scalability | It is hard to scale the database schema. | It is more scalable.
Robustness | It is very robust. | It is less robust.
Nature | Structured data is quantitative, i.e., it consists of hard numbers or things that can be counted. | It is qualitative, as it cannot be processed and analyzed using conventional tools.

23. How does a confusion matrix define the performance of a classification algorithm? CO1 Apr/May 2024
A confusion matrix is a matrix that summarizes the performance of a machine learning model on a set
of test data. It is a means of displaying the number of accurate and inaccurate instances based on
the model’s predictions. It is often used to measure the performance of classification models, which
aim to predict a categorical label for each input instance.
A 2x2 confusion matrix is shown below for image recognition with a Dog image or a Not Dog image.

               | Predicted Dog       | Predicted Not Dog
Actual Dog     | True Positive (TP)  | False Negative (FN)
Actual Not Dog | False Positive (FP) | True Negative (TN)
 True Positive (TP): the total count of cases where both the predicted and actual values are Dog.
 True Negative (TN): the total count of cases where both the predicted and actual values are Not Dog.
 False Positive (FP): the total count of cases where the prediction is Dog while the actual value is Not Dog.
 False Negative (FN): the total count of cases where the prediction is Not Dog while the actual value is Dog.
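For illustration, a minimal scikit-learn sketch that computes this 2x2 matrix (the label lists are
hypothetical; 1 stands for Dog and 0 for Not Dog):

    from sklearn.metrics import confusion_matrix

    y_actual    = [1, 0, 1, 1, 0, 1, 0, 0]
    y_predicted = [1, 0, 1, 0, 0, 1, 1, 0]
    cm = confusion_matrix(y_actual, y_predicted, labels=[1, 0])
    tp, fn, fp, tn = cm.ravel()   # with labels=[1, 0] the matrix is [[TP, FN], [FP, TN]]
    print(cm)
    print(tp, fn, fp, tn)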

24. What is brushing and linking in exploratory data analysis? CO1 Apr/May 2023
Brushing and linking are interactive tools used in exploratory data analysis (EDA) to visualize data
and explore relationships between different visualizations:
 Brushing
Interactively select a subset of data in one visualization by dragging a mouse or
using a bounding shape. This can highlight the selected data in other
visualizations.
 Linking
Apply user interactions in one visualization to other visualizations. For example,
selecting observations in one visualization will highlight the same observations in
other visualizations.

25. Define confusion matrix.


A confusion matrix is a table that is used to define the performance of a
classification algorithm. A confusion matrix visualizes and summarizes the
performance of a classification algorithm.
26. What are the benefits and uses of data science?
1. Healthcare: Predictive Analytics, Medical Imaging, Personalized Medicine
2. Finance: Risk Management, Fraud Detection, Algorithmic Trading
3. Marketing: Customer Segmentation, Sentiment Analysis, Predictive Analytics
4. Retail: Inventory Management, Recommendation Systems, Price Optimization


5. Transportation: Route Optimization, Predictive Maintenance, Autonomous Vehicles
6. Education: Personalized Learning, Academic Analytics, Curriculum Development
7. Entertainment: Content Recommendation, Audience Analytics, Production Analytics
8. Manufacturing: Quality Control, Supply Chain Optimization, Process Automation
9. Government: Public Safety, Urban Planning, Policy Making

27. Differentiate Big Data and Data Science.

Aspect | Big Data | Data Science
Definition | Handling and processing vast amounts of data | Extracting insights and knowledge from data
Objective | Efficient storage, processing, and management of data | Analyzing data to inform decisions and predict trends
Focus | Volume, velocity, and variety of data | Analytical methods, models, and algorithms
Primary Tasks | Collection, storage, and processing of data | Data analysis, modeling, and interpretation
Tools/Technologies | Hadoop, Spark, NoSQL databases (e.g., MongoDB) | Python, R, TensorFlow, Scikit-Learn

PART B
1. Explain the data science process life cycle. CO1 Nov/Dec 2022
The life cycle of data science is summarized in the diagram below.


The main phases of data science life cycle are given below:

1. Discovery: The first phase is discovery, which involves asking the right questions. When we start
any data science project, we need to determine the basic requirements, priorities, and project
budget. In this phase, we need to determine all the requirements of the project such as the number
of people, technology, time, data, and an end goal, and then we can frame the business problem at a
first hypothesis level.

2. Data preparation: Data preparation is also known as data munging. In this phase, we need to
perform the following tasks:
o Data cleaning
o Data reduction
o Data integration
o Data transformation

After performing all the above tasks, we can easily use this data for our further processes.

3. Model Planning: In this phase, we need to determine the various methods and
techniques to establish the relation between input variables. We will apply
Exploratory data analytics (EDA) by using various statistical formula and
visualization tools to understand the relations between variable and to see what
data can inform us. Common tools used for model planning are:
 SQL Analysis Services
 R
 SAS
 Python

4. Model building: In this phase, the process of model building starts. We will create datasets for
training and testing purposes. We will apply different techniques such as association,
classification, and clustering to build the model.

Following are some common model building tools:
 SAS Enterprise Miner
 WEKA
 SPSS Modeler
 MATLAB

5. Operationalize: In this phase, we will deliver the final reports of the project, along with
briefings, code, and technical documents. This phase provides a clear overview of the complete
project performance and other components on a small scale before the full deployment.

6. Communicate results: In this phase, we will check whether we reached the goal we set in the
initial phase. We will communicate the findings and final results to the business team.
2. Elaborate the data science components. CO1
The main components of Data Science are given below:
1. Statistics: Statistics is one of the most important components of data science.


Statistics is a way to collect and analyze numerical data in large amounts and find meaningful
insights from it.
2. Domain Expertise: In data science, domain expertise binds data science together. Domain expertise
means specialized knowledge or skills of a particular area. In data science, there are various areas
for which we need domain experts.
3. Data engineering: Data engineering is a part of data science which involves acquiring, storing,
retrieving, and transforming the data. Data engineering also adds metadata (data about data) to the
data.
4. Visualization: Data visualization means representing data in a visual context so that people can
easily understand the significance of the data. Data visualization makes it easy to access huge
amounts of data in visual form.
5. Advanced computing: Advanced computing is the heavy lifting of data science. It involves
designing, writing, debugging, and maintaining the source code of computer programs.
6. Mathematics: Mathematics is a critical part of data science. It involves the study of quantity,
structure, space, and change. For a data scientist, a good knowledge of mathematics is essential.
7. Machine learning: Machine learning is the backbone of data science. Machine learning is all about
training a machine so that it can act like a human brain. In data science, we use various machine
learning algorithms to solve problems.
3. What are the facets of data? Explain in detail. CO1 Apr/May 2023

Facets of Data


Very large amounts of data are generated in big data and data science. These data are of various
types, and the main categories of data are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images

Structured Data
• Structured data is arranged in rows and column format. It helps for application to
retrieve and process data easily. Database management system is used for storing
structured data.
• The term structured data refers to data that is identifiable because it is organized
in a structure. The most common form of structured data or records is a database
where specific information is stored based on a methodology of columns and
rows.
• Structured data is also searchable by data type within content. Structured data is
understood by computers and is also efficiently organized for human readers.
• An Excel table is an example of structured data.

Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and columns are not used
for unstructured data, therefore it is difficult to retrieve the required information. Unstructured
data has no identifiable structure.
• The unstructured data can be in the form of Text: (Documents, email messages,
customer feedbacks), audio, video, images. Email is an example of unstructured
data.
• Even today in most of the organizations more than 80 % of the data are in
unstructured form. This carries lots of information. But extracting information
from these various sources is a very big challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restriction or sequence for unstructured data.
5. Since there is no structural binding for unstructured data, it is unpredictable in
nature.

Natural Language


• Natural language is a special type of unstructured data.


• Natural language processing enables machines to recognize characters, words
and sentences, then apply meaning and understanding to that information. This
helps machines to understand language as humans do.
• Natural language processing is the driving force behind machine intelligence in
many modern real-world applications. The natural language processing
community has had success in entity recognition, topic recognition,
summarization, text completion and sentiment analysis.
•For natural language processing to help machines understand human language, it
must go through speech recognition, natural language understanding and machine
translation. It is an iterative process comprised of several layers of text analysis.

Machine - Generated Data


• Machine-generated data is information that is created without human interaction as a result of a
computer process or application activity. This means that data entered manually by an end user is
not considered machine-generated.
• Machine data contains a definitive record of all activity and behavior of our
customers, users, transactions, applications, servers, networks, factory machinery
and so on.
• It's configuration data, data from APIs and message queues, change events, the
output of diagnostic commands and call detail records, sensor data from remote
equipment and more.
• Examples of machine data are web server logs, call detail records, network event
logs and telemetry.
• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions
generate machine data. Machine data is generated continuously by every
processor-based system, as well as many consumer-oriented systems.
• It can be either structured or unstructured. In recent years, the increase of
machine data has surged. The expansion of mobile devices, virtual servers and
desktops, as well as cloud- based services and RFID technologies, is making IT
infrastructures more complex.
Graph-based or Network Data
•Graphs are data structures to describe relationships and interactions between
entities in complex systems. In general, a graph contains a collection of entities
called nodes and another collection of interactions between a pair of nodes called
edges.


• Nodes represent entities, which can be of any object type that is relevant to our
problem domain. By connecting nodes with edges, we will end up with a graph
(network) of nodes.
• A graph database stores nodes and relationships instead of tables or documents.
Data is stored just like we might sketch ideas on a whiteboard. Our data is stored
without restricting it to a predefined model, allowing a very flexible way of
thinking about and using it.
• Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.
• Graph databases are capable of sophisticated fraud prevention. With graph
databases, we can use relationships to process financial and purchase transactions
in near-real time. With fast graph queries, we are able to detect that, for example,
a potential purchaser is using the same email address and credit card as included in
a known fraud case.
• Graph databases can also help user easily detect relationship patterns such as
multiple people associated with a personal email address or multiple people
sharing the same IP address but residing in different physical addresses.
• Graph databases are a good choice for recommendation applications. With graph
databases, we can store in a graph relationships between information categories
such as customer interests, friends and purchase history. We can use a highly
available graph database to make product recommendations to a user based on
which products are purchased by others who follow the same sport and have
similar purchase history.
• Graph theory is probably the main method in social network analysis in the early
history of the social network concept. The approach is applied to social network
analysis in order to determine important features of the network such as the nodes
and links (for example influencers and the followers).
• Influencers on social network have been identified as users that have impact on
the activities or opinion of other users by way of followership or influence on
decision made by other users on the network as shown in Fig. 1.2.1.


• Graph theory has proved to be very effective on large-scale datasets such as social network data.
This is because it is capable of bypassing the building of an actual visual representation of the
data to run directly on data matrices.

Audio, Image and Video


• Audio, image and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers.
• The terms audio and video commonly refer to the time-based media storage formats for sound/music
and moving-picture information. Audio and video digital recordings, also referred to as audio and
video codecs, can be uncompressed, lossless compressed or lossy compressed depending on the desired
quality and use cases.
• It is important to remark that multimedia data is one of the most important
sources of information and knowledge; the integration, transformation and
indexing of multimedia data bring significant challenges in data management and
analysis. Many challenges have to be addressed including big data,
multidisciplinary nature of Data Science and heterogeneity.
• Data Science is playing an important role to address these challenges in
multimedia data. Multimedia data usually contains various forms of media, such
as text, image, video, geographic coordinates and even pulse waveforms, which
come from multiple sources. Data Science can be a key instrument covering big


data, machine learning and data mining solutions to store, handle and analyze such
heterogeneous data.

Streaming Data
Streaming data is data that is generated continuously by thousands of data sources,
which typically send in the data records simultaneously and in small sizes (order
of Kilobytes).
• Streaming data includes a wide variety of data such as log files generated by
customers using your mobile or web applications, ecommerce purchases, in-game
player activity, information from social networks, financial trading floors or
geospatial services and telemetry from connected devices or instrumentation in
data centers.

Difference between Structured and Unstructured Data

4. Exemplify the steps of retrieving data in data analytics. CO1

Information retrieval can be described as a process that can be divided into


different stages (see figure below). The figure below implies that the stages follow
each other during the process, but in reality they are often active simultaneously
and may have to be repeated during the same information retrieval process.


Process in detail

1. Problem / topic: an information need occurs when more information is required


to solve a problem
2. Information retrieval plan: define the information need and choose the information resources,
retrieval techniques and search terms
3. Information retrieval: perform the planned information retrieval (information retrieval
techniques)
4. Evaluating the results: evaluate the results of the information retrieval (number and relevance
of search results)
5. Locating publications: find out where and how the required publication, e.g.
article, can be acquired
6. Using and evaluating the information: evaluate the final results of the process
(critical and ethical evaluation of the information and information resources)

The amount of time needed for each stage varies, but as a whole, information retrieval is a
time-consuming process, so we should start it as early as possible.

To avoid doing needless work and unnecessary mistakes, we should plan the information retrieval well
and make notes of the plan, including the different stages of the process. A good information
retrieval plan is especially important when the area of the research is broad. A well-documented
plan is easy to return to later on when needed.
7. Elucidate data cleaning, integrating and transforming data in detail. CO1 April/May 2023
Cleaning:
 Data cleansing is a subprocess of the data science process that focuses on removing errors in the
data so that it becomes a true and consistent representation of the processes it originates from.
 The first type of error is the interpretation error, such as incorrect use of terminologies, like
saying that a person’s age is greater than 300 years.
 The second type of error points to inconsistencies between data sources or against the company’s
standardized values. An example of this class of errors is putting “Female” in one table and “F” in
another when they represent the same thing: that the person is female.
 Steps to perform data cleaning
 Performing data cleaning involves a systematic process to identify and rectify errors,
inconsistencies, and inaccuracies in a dataset. The following are the essential steps.


 Data Cleaning
 Removal of Unwanted Observations: Identify and eliminate irrelevant or
redundant observations from the dataset. The step involves scrutinizing
data entries for duplicate records, irrelevant information, or data points
that do not contribute meaningfully to the analysis. Removing unwanted
observations streamlines the dataset, reducing noise and improving the
overall quality.
 Fixing Structure errors: Address structural issues in the dataset, such as
inconsistencies in data formats, naming conventions, or variable types.
Standardize formats, correct naming discrepancies, and ensure uniformity
in data representation. Fixing structure errors enhances data consistency
and facilitates accurate analysis and interpretation.
 Managing Unwanted outliers: Identify and manage outliers, which are
data points significantly deviating from the norm. Depending on the
context, decide whether to remove outliers or transform them to minimize
their impact on analysis. Managing outliers is crucial for obtaining more
accurate and reliable insights from the data.
 Handling Missing Data: Devise strategies to handle missing data
effectively. This may involve imputing missing values based on statistical
methods, removing records with missing values, or employing advanced
imputation techniques. Handling missing data ensures a more complete
dataset, preventing biases and maintaining the integrity of analyses.
Integrating:
 Combining data from different data sources.
 The data comes from several different places, and in this sub-step we focus on integrating these
different sources.
 We can perform two operations to combine information from different data sets: the first operation
is joining and the second operation is appending (or stacking).
Joining Tables:
 Joining tables allows us to combine the information of one observation found in one table with the
information that we find in another table.
Appending Tables:
 Appending or stacking tables is effectively adding observations from one table to another table.
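For illustration, a minimal pandas sketch of joining and appending tables (the table names and
columns are assumptions for illustration):

    import pandas as pd

    customers = pd.DataFrame({"id": [1, 2], "name": ["Asha", "Ravi"]})
    orders    = pd.DataFrame({"id": [1, 2], "amount": [250, 400]})
    joined = customers.merge(orders, on="id")        # joining: combine columns of matching observations

    more_customers = pd.DataFrame({"id": [3], "name": ["Mala"]})
    appended = pd.concat([customers, more_customers], ignore_index=True)  # appending: stack rows
    print(joined)
    print(appended)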
Transforming Data
Data transformation is the process of converting, cleansing, and structuring data
into a usable format that can be analyzed to support decision making processes,
and to propel the growth of an organization.
Data transformation techniques
There are several data transformation techniques that are used to clean data and
structure it before it is stored in a data warehouse or analyzed for business
intelligence. Not all of these techniques work with all types of data, and
sometimes more than one technique may be applied. Nine of the most common
techniques are:
1. Revising
Revising ensures the data supports its intended use by organizing it in the required
and correct way. It does this in a range of ways.
Dataset normalization revises data by eliminating redundancies in the data set. The
data model becomes more precise and legible while also occupying less space.


This process, however, does involve a lot of critical thinking, investigation and
reverse engineering.
Data cleansing ensures the formatting capability of data.
Format conversion changes the data types to ensure compatibility.
Key structuring converts values with built-in meanings to generic identifiers to be
used as unique keys.
Deduplication identifies and removes duplicates.
Data validation validates records and removes the ones that are incomplete.
Repeated and unused columns can be removed to improve overall performance
and legibility of the data set.
2. Manipulation
This involves creation of new values from existing ones or changing current data
through computation. Manipulation is also used to convert unstructured data into
structured data that can be used by machine learning algorithms.
Derivation, which is cross column calculations
Summarization that aggregates values
Pivoting which involves converting columns values into rows and vice versa
Sorting, ordering and indexing of data to enhance search performance
Scaling, normalization and standardization that helps in comparing dissimilar
numbers by putting them on a consistent scale
Vectorization which helps convert non-numerical data into number arrays that are
often used for machine learning applications
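For illustration, a minimal sketch of the scaling/normalization technique mentioned above, using
scikit-learn's MinMaxScaler (the feature values are assumptions):

    from sklearn.preprocessing import MinMaxScaler

    # ages and incomes are on very different scales, which distorts distance-based comparisons
    data = [[25, 30000], [40, 85000], [33, 52000]]
    scaled = MinMaxScaler().fit_transform(data)   # rescales each column to the range [0, 1]
    print(scaled)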
3. Separating
This involves dividing up the data values into its parts for granular analysis.
Splitting involves dividing up a single column with several values into separate
columns with each of those values. This allows for filtering on the basis of certain
values.
4. Combining/ integrating
Records from across tables and sources are combined to acquire a more holistic
view of activities and functions of an organization. It couples data from multiple
tables and datasets and combines records from multiple tables.
5. Data smoothing
This process removes meaningless, noisy, or distorted data from the data set. By
removing outliers, trends are most easily identified.
6. Data aggregation
This technique gathers raw data from multiple sources and turns it into a summary
form which can be used for analysis. An example is the raw data providing
statistics such as averages and sums.
7. Discretization
With the help of this technique, interval labels are created in continuous data in an
attempt to enhance its efficiency and easier analysis. The decision tree algorithms
are utilized by this process to transform large datasets into categorical data.
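For illustration, a minimal pandas sketch of discretization with interval labels (the ages, bin
edges and labels are assumptions):

    import pandas as pd

    ages = pd.Series([5, 17, 24, 36, 61, 78])
    # discretization: continuous ages become interval labels
    bins = pd.cut(ages, bins=[0, 18, 40, 65, 100],
                  labels=["child", "young adult", "adult", "senior"])
    print(bins)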
8. Generalization
Low level data attributes are transformed into high level attributes by using the
concept of hierarchies and creating layers of successive summary data. This helps
in creating clear data snapshots.


9. Attribute construction
In this technique, a new set of attributes is created from an existing set to facilitate
the mining process.
8. Detail the term Exploratory data analysis. CO1

During exploratory data analysis we take a deep dive into the data. Information becomes much easier
to grasp when shown in a picture, therefore we mainly use graphical techniques to gain an
understanding of the data and the interactions between variables. This phase is about exploring
data, so keeping an open mind and our eyes peeled is essential during the exploratory data analysis
phase. The goal isn’t to cleanse the data, but it’s common that we’ll still discover anomalies we
missed before, forcing us to take a step back and fix them.

Figure 2.14. Step 4: Data exploration

The visualization techniques we use in this phase range from simple line graphs or histograms to
more complex diagrams such as Sankey and network graphs. Sometimes it’s useful to compose a
composite graph from simple graphs to get even more insight into the data. Other times the graphs
can be animated or made interactive to make the exploration easier and, let’s admit it, way more fun.

A bar chart, a line plot, and a distribution are some of the graphs used in
exploratory analysis.


Figure 2.16. Drawing multiple plots together can help us understand the structure of our data over
multiple variables.


Overlaying several plots is common practice. In figure 2.17 we combine simple graphs into a Pareto
diagram, or 80-20 diagram.

Figure 2.17. A Pareto diagram is a combination of the values and a cumulative distribution. It’s
easy to see from this diagram that the first 50% of the countries contain slightly less than 80% of
the total amount. If this graph represented customer buying power and we sell expensive products, we
probably don’t need to spend our marketing budget in every country; we could start with the first 50%.

The above figure shows another technique: brushing and linking. With brushing and linking we combine
and link different graphs and tables (or views) so changes in one graph are automatically transferred
to the other graphs. This interactive exploration of data facilitates the discovery of new insights.


Linking and brushing allow us to select observations in one plot and highlight the same observations
in the other plots.

The above figure shows the average score per country for questions. Not only does this indicate a
high correlation between the answers, but it’s easy to see that when we select several points on a
subplot, the points will correspond to similar points on the other graphs. In this case the selected
points on the left graph correspond to points on the middle and right graphs, although they
correspond better in the middle and right graphs.

Example histogram: the number of people in age groups of 5-year intervals.

Figure 2.20. Example boxplot: each user category has a distribution of the appreciation each has for
a certain picture on a photography website.


In a histogram a variable is cut into discrete categories and the number of occurrences in each
category is summed up and shown in the graph. The boxplot, on the other hand, doesn’t show how many
observations are present but does offer an impression of the distribution within categories. It can
show the maximum, minimum, median, and other characterizing measures at the same time.
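For illustration, a minimal matplotlib sketch that draws a histogram and a boxplot side by side (the
generated values are assumptions):

    import matplotlib.pyplot as plt
    import numpy as np

    values = np.random.normal(loc=50, scale=10, size=200)   # hypothetical measurements
    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(values, bins=10)    # histogram: counts per discrete bin
    ax1.set_title("Histogram")
    ax2.boxplot(values)          # boxplot: median, quartiles, and possible outliers
    ax2.set_title("Boxplot")
    plt.show()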
9. Explain in detail how to build the model, with its components. CO1
1. Selection of a modeling technique and variables to enter in the model
2. Execution of the model
3. Diagnosis and model comparison

Step 5: Data modeling

Model building is a crucial phase in the data science process, where data is transformed into
actionable insights and predictions. The following is a step-by-step guide to model building,
covering the essential techniques needed to develop accurate and reliable predictive models.

Step 1: Define the Problem and Set Objectives:

Clearly define the problem you aim to solve and establish measurable objectives.


Understand the scope, constraints, and desired outcomes of your model. This step
ensures that your model aligns with the problem at hand and provides meaningful
insights.

Step 2: Gather and Prepare the Data:

Collect the relevant data required for model building. Clean and preprocess the
data, handling missing values, outliers, and inconsistencies. Perform feature
engineering and selection to extract meaningful predictors and ensure data quality.

Step 3: Split the Data:

Split your data into training and testing sets. The training set is used to train the
model, while the testing set serves as an unseen dataset for evaluating the model’s
performance. Consider techniques like cross-validation for robust model
assessment.
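For illustration, a minimal scikit-learn sketch of splitting data into training and testing sets
(the toy features, labels and the 25% test size are assumptions):

    from sklearn.model_selection import train_test_split

    X = [[1], [2], [3], [4], [5], [6], [7], [8]]   # hypothetical features
    y = [0, 0, 0, 1, 0, 1, 1, 1]                   # hypothetical labels
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42)     # 75% training, 25% held-out testing
    print(len(X_train), len(X_test))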

Step 4: Choose the Right Algorithm:

Select the appropriate machine learning algorithm based on your problem type
(e.g., classification, regression) and data characteristics. Consider popular
algorithms like linear regression, decision trees, random forests, support vector
machines, or deep learning models.

Step 5: Train the Model:

Fit the selected algorithm to the training data. Adjust the model’s parameters and
hyperparameters to optimize its performance. Use techniques like grid search or
Bayesian optimization to find the best parameter settings.
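For illustration, a minimal scikit-learn sketch of hyperparameter tuning with grid search (the toy
data, the DecisionTreeClassifier model and the max_depth grid are assumptions):

    from sklearn.model_selection import GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X = [[1], [2], [3], [4], [5], [6], [7], [8]]
    y = [0, 0, 0, 1, 0, 1, 1, 1]
    # try several values of max_depth and keep the best by cross-validated accuracy
    grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                        param_grid={"max_depth": [1, 2, 3]}, cv=2)
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)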

Step 6: Evaluate the Model:

Assess the model’s performance using appropriate evaluation metrics, such as accuracy, precision,
recall, or mean squared error. Compare the model’s predictions with the actual values in the testing
dataset. Consider additional techniques like ROC curves or confusion matrices for classification
problems.
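For illustration, a minimal scikit-learn sketch of computing such evaluation metrics on a held-out
test set (the label lists are hypothetical):

    from sklearn.metrics import accuracy_score, precision_score, recall_score

    y_test = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical actual labels
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical model predictions
    print("accuracy :", accuracy_score(y_test, y_pred))
    print("precision:", precision_score(y_test, y_pred))
    print("recall   :", recall_score(y_test, y_pred))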

Step 7: Fine-tune and Optimize:

Iteratively refine your model to enhance its performance. Experiment with different parameter
settings, feature selections, or ensemble techniques to improve accuracy and generalization.
Regularize the model to avoid overfitting and ensure it performs well on unseen data.

Step 8: Interpret the Results:

Understand and interpret the model’s output to gain insights into the underlying
patterns and relationships in the data. Analyze feature importances, coefficients, or
decision boundaries to explain the model’s behavior. Communicate the results
effectively to stakeholders.

Step 9: Deploy and Monitor:


Deploy your model in a production environment to make predictions or support decision-making
processes. Continuously monitor the model’s performance and assess its impact on business outcomes.
Update the model periodically as new data becomes available.
10. Explain in detail the benefits and uses of data science, with counterexamples. CO1 Apr/May 2024
Data science is the interdisciplinary field that uses scientific methods, algorithms, and systems to extract
insights and knowledge from structured and unstructured data. Its applications span multiple industries,
including healthcare, finance, marketing, and even entertainment. Here’s a detailed look at the benefits and
uses of data science, along with counterexamples to give a balanced perspective.

Benefits of Data Science

1. Improved Decision-Making
o
Use: Data science helps organizations make better decisions by analyzing historical data to
predict future trends. For example, companies can use customer data to anticipate demand and
optimize their inventory.
o
Benefit: Companies like Amazon use predictive analytics to recommend products to users
based on past behavior, leading to increased sales and customer satisfaction.
o
Counterexample: Over-reliance on data-driven decisions without considering external factors
or human insights can lead to poor outcomes. For instance, a stock trading algorithm that
solely focuses on historical trends might miss critical political events, leading to massive
losses.
2. Enhanced Customer Experiences
o
Use: Data science enables personalized recommendations and targeted marketing strategies.
Retailers can use customer data to create personalized shopping experiences.
o
Benefit: Netflix uses data to recommend shows and movies tailored to individual preferences,
significantly boosting user engagement.
o
Counterexample: If the data is biased or incomplete, personalization can fail. For example, a
recommendation system might suggest irrelevant products or content, annoying users and
driving them away.
3. Operational Efficiency
o
Use: Companies can optimize their operations by identifying bottlenecks or inefficiencies
through data analysis. For example, manufacturers can use data to improve production
schedules and reduce downtime.
o
Benefit: UPS uses data science to optimize delivery routes, saving millions of gallons of fuel
annually.
o
Counterexample: A purely data-driven focus might lead to short-term gains but long-term
problems. For example, a company might use data to cut costs excessively, leading to a decline
in employee satisfaction and productivity over time.
4. Fraud Detection and Risk Management
o
Use: Data science is used extensively in finance to detect fraudulent activities and assess credit
risks. Machine learning algorithms can spot patterns that humans might miss.
o
Benefit: Banks use data science to detect unusual transactions and prevent credit card fraud,
saving millions in potential losses.
o
Counterexample: Relying solely on algorithms for fraud detection might lead to false
positives. Genuine customers might have their accounts flagged or frozen erroneously, leading
to dissatisfaction.
5. Predictive Maintenance
o
Use: In industries such as manufacturing and aviation, data science can predict equipment
failures before they happen, allowing for timely maintenance.


o
Benefit: Airlines use predictive maintenance to keep planes running safely and efficiently by
analyzing data from engine sensors.
o
Counterexample: Predictive models might be inaccurate if the input data is poor or
incomplete, potentially leading to unnecessary maintenance costs or missed critical repairs.
6. Advances in Healthcare
o Use: Data science allows for personalized medicine, predictive diagnostics, and drug
discovery. By analyzing patient data, doctors can provide better, more personalized treatments.
o
Benefit: Cancer treatment can be tailored to a patient’s genetic profile, improving survival
rates and reducing side effects.
o
Counterexample: Misinterpretation of medical data or incorrect use of algorithms can have
dire consequences. For instance, over-reliance on AI-driven diagnostic tools might lead to
misdiagnoses, putting patients at risk.
7. Cost Reduction
o
Use: Organizations can identify areas to reduce costs by analyzing data on operational
inefficiencies, customer behavior, and more.
o
Benefit: Data science helps optimize supply chains, reduce waste, and streamline operations,
reducing overall costs for companies like Walmart.
o
Counterexample: Cost-cutting driven purely by data can sometimes be shortsighted. For
example, a company might focus on reducing employee benefits to cut costs, only to face high
turnover and decreased employee morale.
8. Scientific Discovery and Innovation
o
Use: Data science facilitates advancements in fields like genomics, climate science, and
physics by analyzing large datasets that were previously too vast for human analysis.
o
Benefit: Climate scientists use data models to predict future environmental conditions and plan
mitigation strategies.
o
Counterexample: Over-reliance on models can lead to inaccurate conclusions. For instance,
predictions based on incomplete or flawed climate data could lead to ineffective or misdirected
policy decisions.

Counterexamples Where Data Science Falls Short

1. Bias in Algorithms
o
Example: In hiring, some companies have used data-driven models to screen candidates. If the
data is biased (e.g., historical data that favored a particular demographic), the algorithm will
perpetuate that bias, resulting in discriminatory hiring practices.
2. Data Privacy Issues
o
Example: Companies using data science for targeted advertising may inadvertently violate
user privacy by collecting too much personal information. The Cambridge Analytica scandal,
where user data from Facebook was used without consent, is a prime example of data misuse.
3. Ethical Concerns
o
Example: Predictive policing, where law enforcement uses data to anticipate where crimes
might occur, raises concerns about racial profiling and the invasion of privacy. In some cases,
communities have been unfairly targeted based on biased data.
4. Overfitting Models
o
Example: In machine learning, if a model is too closely tailored to the training data, it may
perform poorly on new data (overfitting). For instance, a stock market model that perfectly
predicts past performance may fail when applied to future data because the market conditions
have changed.
5. Inability to Capture Complex Human Behavior
o
Example: Customer behavior is often influenced by emotional, social, and contextual factors
that data alone cannot capture. A retail company might use past purchasing data to predict
future behavior, but it might miss major shifts in preferences due to social trends, rendering the
model inaccurate.

11. Illustrate in detail the different facets of data, with examples. CO1 Apr/May 2024
Facets of Data

• Very large amounts of data are generated in big data and data science. These data come in many types, and the main categories are as follows:

a) Structured

b) Natural language

c) Graph-based

d) Streaming

e) Unstructured

f) Machine-generated

g) Audio, video and images

Structured Data
• Structured data is arranged in a row-and-column format, which makes it easy for applications to retrieve and process. A database management system is typically used to store structured data.

• The term structured data refers to data that is identifiable because it is organized in a structure. The most
common form of structured data or records is a database where specific information is stored based on a
methodology of columns and rows.

• Structured data is also searchable by data type within content. Structured data is understood by computers and
is also efficiently organized for human readers.

• An Excel table is an example of structured data.

Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data, so it is harder to retrieve the required information. Unstructured data has no identifiable structure.

• The unstructured data can be in the form of Text: (Documents, email messages, customer feedbacks), audio,
video, images. Email is an example of unstructured data.

• Even today, more than 80% of the data in most organizations is in unstructured form. It carries a great deal of information, but extracting that information from such varied sources is a major challenge.

• Characteristics of unstructured data:

1. There is no structural restriction or binding for the data.

2. Data can be of any type.

3. Unstructured data does not follow any structural rules.

4. There are no predefined formats, restriction or sequence for unstructured data.

5. Since there is no structural binding for unstructured data, it is unpredictable in nature.

Natural Language
• Natural language is a special type of unstructured data.

• Natural language processing enables machines to recognize characters, words and sentences, then apply
meaning and understanding to that information. This helps machines to understand language as humans do.

• Natural language processing is the driving force behind machine intelligence in many modern real-world
applications. The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion and sentiment analysis.

• For natural language processing to help machines understand human language, it must go through speech recognition, natural language understanding and machine translation. It is an iterative process comprised of several layers of text analysis.

Machine-Generated Data

• Machine-generated data is information that is created without human interaction, as a result of a computer process or application activity. This means that data entered manually by an end user is not considered machine-generated.

• Machine data contains a definitive record of all activity and behavior of our customers, users, transactions,
applications, servers, networks, factory machinery and so on.

• It's configuration data, data from APIs and message queues, change events, the output of diagnostic
commands and call detail records, sensor data from remote equipment and more.

• Examples of machine data are web server logs, call detail records, network event logs and telemetry.

• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions generate machine data.
Machine data is generated continuously by every processor-based system, as well as many consumer-oriented
systems.

• It can be either structured or unstructured. In recent years, the volume of machine data has surged. The expansion of mobile devices, virtual servers and desktops, as well as cloud-based services and RFID technologies, is making IT infrastructures more complex.

Graph-based or Network Data


• Graphs are data structures that describe relationships and interactions between entities in complex systems. In general, a graph contains a collection of entities called nodes and another collection of interactions between pairs of nodes called edges.

• Nodes represent entities, which can be of any object type that is relevant to our problem domain. By
connecting nodes with edges, we will end up with a graph (network) of nodes.

• A graph database stores nodes and relationships instead of tables or documents. Data is stored just like we
might sketch ideas on a whiteboard. Our data is stored without restricting it to a predefined model, allowing a

very flexible way of thinking about and using it.

• Graph databases are used to store graph-based data and are queried with specialized query languages such as
SPARQL.

• Graph databases are capable of sophisticated fraud prevention. With graph databases, we can use relationships
to process financial and purchase transactions in near-real time. With fast graph queries, we are able to detect
that, for example, a potential purchaser is using the same email address and credit card as included in a known
fraud case.

• Graph databases can also help user easily detect relationship patterns such as multiple people associated with
a personal email address or multiple people sharing the same IP address but residing in different physical
addresses.

• Graph databases are a good choice for recommendation applications. With graph databases, we can store in a
graph relationships between information categories such as customer interests, friends and purchase history.
We can use a highly available graph database to make product recommendations to a user based on which
products are purchased by others who follow the same sport and have similar purchase history.

• Graph theory is probably the main method in social network analysis in the early history of the social network
concept. The approach is applied to social network analysis in order to determine important features of the
network such as the nodes and links (for example influencers and the followers).

• Influencers on a social network are users that have an impact on the activities or opinions of other users, through followership or through influence on the decisions made by other users on the network, as shown in Fig. 1.2.1.

• Graph theory has proved to be very effective on large-scale datasets such as social network data. This is
because it is capable of by-passing the building of an actual visual representation of the data to run directly on
data matrices.
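
As a small illustrative sketch (not part of the original text), graph-based data can be represented in Python with the networkx library; the user names and "follows" edges below are hypothetical:

import networkx as nx

G = nx.DiGraph()  # directed graph: an edge u -> v means "u follows v"
G.add_edges_from([
    ("alice", "dana"), ("bob", "dana"), ("carol", "dana"),
    ("dana", "erin"), ("alice", "bob"),
])

# In-degree counts how many followers each node has - a simple influencer signal
followers = dict(G.in_degree())
print(followers)                   # dana has the most followers (in-degree 3)
print(nx.degree_centrality(G))     # normalized centrality scores for all nodes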

Audio, Image and Video


• Audio, image and video are data types that pose specific challenges to a data scientist. Tasks that are trivial
for humans, such as recognizing objects in pictures, turn out to be challenging for computers.

• The terms audio and video commonly refer to time-based media storage formats for sound/music and moving-picture information. Audio and video digital recordings, also referred to as audio and video codecs, can be uncompressed, losslessly compressed or lossy compressed depending on the desired quality and use cases.

• It is important to remark that multimedia data is one of the most important sources of information and
knowledge; the integration, transformation and indexing of multimedia data bring significant challenges in data
management and analysis. Many challenges have to be addressed including big data, multidisciplinary nature
of Data Science and heterogeneity.

• Data Science is playing an important role to address these challenges in multimedia data. Multimedia data
usually contains various forms of media, such as text, image, video, geographic coordinates and even pulse
waveforms, which come from multiple sources. Data Science can be a key instrument covering big data,
machine learning and data mining solutions to store, handle and analyze such heterogeneous data.

Streaming Data
• Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (on the order of kilobytes).

• Streaming data includes a wide variety of data such as log files generated by customers using your mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors or geospatial services, and telemetry from connected devices or instrumentation in data centers.
12. Draw and outline the step-by-step activities in the data science process. CO1 Apr/May 2024

Step 1: Problem Identification and Planning


The first step in the data science project life cycle is to identify the problem that needs to be solved. This
involves understanding the business requirements and the goals of the project. Once the problem has been
identified, the data science team will plan the project by determining the data sources, the data collection
process, and the analytical methods that will be used.

Example
Suppose a retail company wants to increase its sales by identifying the factors that influence customer purchase
decisions. The data science team will identify the problem and plan the project by determining the data sources
(e.g., transaction data, customer data), the data collection process (e.g., data cleaning, data transformation), and
the analytical methods (e.g., regression analysis, decision trees) that will be used to analyze the data.

Step 2: Data Collection


The second step in the data science project life cycle is data collection. This involves collecting the data that
will be used in the analysis. The data science team must ensure that the data is accurate, complete, and relevant
to the problem being solved.

Example
In the retail company example, the data science team will collect data on customer demographics, transaction
history, and product information.

Step 3: Data Preparation


The third step in the data science project life cycle is data preparation. This involves cleaning and transforming
the data to make it suitable for analysis. The data science team will remove any duplicates, missing values, or
irrelevant data from the dataset. They will also transform the data into a format that is suitable for analysis.

Example
In the retail company example, the data science team will remove any duplicate or missing data from the
customer and transaction datasets. They may also merge the datasets to create a single dataset that can be

analyzed.

Step 4: Data Analysis


The fourth step in the data science project life cycle is data analysis. This involves applying analytical methods
to the data to extract insights and patterns. The data science team may use techniques such as regression
analysis, clustering, or machine learning algorithms to analyze the data.

Example
In the retail company example, the data science team may use regression analysis to identify the factors that
influence customer purchase decisions. They may also use clustering to segment customers based on their
purchase behavior.

Step 5: Model Building


The fifth step in the data science project life cycle is model building. This involves building a predictive model
that can be used to make predictions based on the data analysis. The data science team will use the insights and
patterns from the data analysis to build a model that can predict future outcomes.

Example
In the retail company example, the data science team may build a predictive model that can be used to predict
customer purchase behavior based on demographic and product information.

Step 6: Model Evaluation


The sixth step in the data science project life cycle is model evaluation. This involves evaluating the
performance of the predictive model to ensure that it is accurate and reliable. The data science team will test the
model using a validation dataset to determine its accuracy and performance.

Example
In the retail company example, the data science team may test the predictive model using a validation dataset to
ensure that it accurately predicts customer purchase behavior.
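
As an illustrative sketch of Steps 5 and 6 (the column names and numbers below are hypothetical, not the retail company's actual data), model building and evaluation with scikit-learn might look like this:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical customer data: age, past purchases, and whether the customer bought again
df = pd.DataFrame({
    'age': [25, 34, 45, 23, 52, 40, 36, 29],
    'past_purchases': [1, 3, 10, 0, 8, 5, 2, 1],
    'bought_again': [0, 1, 1, 0, 1, 1, 0, 0],
})
X, y = df[['age', 'past_purchases']], df['bought_again']

# Hold out part of the data as a validation set for Step 6
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)        # Step 5: model building
print(accuracy_score(y_test, model.predict(X_test)))      # Step 6: model evaluation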

Step 7: Model Deployment


The final step in the data science project life cycle is model deployment. This involves deploying the predictive
model into production so that it can be used to make predictions in real-world scenarios. The deployment
process involves integrating the model into the existing business processes and systems to ensure that it can be
used effectively.

Example
In the retail company example, the data science team may deploy the predictive model into the company’s
customer relationship management (CRM) system so that it can be used to make targeted marketing campaigns.

Part-C
Answer all the questions
1 Outline the purpose of data cleansing. How are missing and nullified attributes CO1
handled and modified during the preprocessing stage?
Data cleansing, also known as data cleaning or data scrubbing, is a crucial process in data preprocessing that
aims to improve the quality and accuracy of data by correcting or removing erroneous, incomplete, or irrelevant
data. The purpose of data cleansing includes:

 Improving Data Quality: Ensuring that the data is accurate, complete, and relevant to the analysis.
 Enhancing Data Consistency: Ensuring that data is uniform across different sources and systems.
 Removing Redundancy: Eliminating duplicate records to ensure efficiency in processing.
 Facilitating Accurate Analysis: Clean data helps in producing more reliable insights during data

analysis or machine learning modeling.


 Mitigating Errors: Reducing the occurrence of mistakes in data-driven decisions.
Handling and Modifying Missing and Nullified Attributes During Preprocessing

During preprocessing, missing and nullified attributes are handled through various techniques depending on the
nature of the data and the problem at hand. These techniques include:

While working with data, it is common for data scientists to have to deal with missing values. Handling these missing values is important because most machine learning algorithms do not support them, and even for algorithms such as KNN and Naive Bayes that can tolerate missing values, the results could be skewed. Handling them efficiently therefore matters because it affects the performance of the model.
As with any other exploratory data analysis task, there is no single method that fits all cases. Different kinds of problems (time series, machine learning, regression, etc.) call for different approaches, so it is difficult to provide one general solution. In this section we go through the types of missing values and the ways of handling them.
Types of missing values
Missing values in a dataset can occur for various reasons, and understanding the types of missing values can
help in choosing appropriate strategies for handling them.
1. Missing Completely at Random (MCAR):
In this scenario, missing values occur completely at random and there is no relationship between the missing data and any other values in the dataset; that is, there is no pattern. We can say that the probability of data being missing is the same for all observations.
E.g., Suppose that after the customer service call ends, we would be asked to give ratings to the customer
representative. Not all the customers would do this and any customer who decides to give feedback is
completely random irrespective of their experience or any other factors. In this case the missing values in the
feedback column would be an MCAR.
2. Missing at Random (MAR)
In this scenario, missing values do not occur at random but the pattern of missingness could be explained by
other observations. That is, the likelihood of a value missing in the dataset could possibly be due to some other
variables in the dataset.
For e.g., suppose a survey is taken at a dermatology clinic where the gender and their skincare routine is asked.
Assume that most of the females answer the survey whereas men are less likely to answer. So here, why the
data is missing could be explained by the factor, that is gender. In this case, the missing data in the skincare
routine column is MAR.
3. Missing Not at Random (MNAR):
In this case, the missing values are not random and cannot be explained by the observed data. This is the most challenging case, because the possible reasons for the missingness are related to the unobserved data itself.
For example, people with higher incomes may refuse to share exact figures in a survey or questionnaire.

1. Deleting the column with missing data


If a certain column has many missing values, i.e., if the majority of the datapoints have a NULL value for that column, then we can simply drop the entire column.
In our example (the familiar Titanic passenger dataset), the deck column has 688 null values out of the total 891 datapoints. Since more than half of the values are null, we can simply choose to delete the column.
df = df.drop(['deck'], axis=1)   # drop the 'deck' column entirely
df.isnull().sum()                # verify the remaining null counts per column

It is certainly one of the quickest techniques for dealing with missing data, but we must keep in mind that it loses information. This technique should only be used when the majority of the values in a column are NULL.
2. Deleting the row with missing data
In this method we delete every row that has at least one NULL value. This is not the best practice, because data is information: even though the other values in a row are non-null, the entire row is deleted if a single NULL is present. For instance, if every row has some column value missing, you might end up deleting the whole dataset.

Out of the 891 rows, 177 rows have age as NULL and 2 rows have embark_town as NULL. After deleting those rows, 712 rows remain.
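A minimal pandas sketch of this row-wise deletion, assuming the same Titanic-style DataFrame df used above:

df_all = df.dropna(axis=0)                          # drop every row containing at least one NULL
df_sub = df.dropna(subset=['age', 'embark_town'])   # or check only these columns (drops the 177 + 2 rows above)
print(df_sub.shape)                                 # about 712 rows remain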
3. Imputing missing values with mean/median
Columns in the dataset that hold numeric continuous values can have their missing entries replaced with the mean, median, or mode of the remaining values in the column. This method prevents the loss of data compared to the earlier methods; replacing missing values with these approximations (mean, median) is a statistical approach to handling them.
This approach is popular when there is a small number of missing values in the data. However, when there are many missing values, mean or median imputation can cause a loss of variation in the data.
Mean and median imputation can provide a good estimate of the missing values, respectively for normally
distributed data, and skewed data.
The downside of this approach is that it cannot be applied for categorical columns. Also the mean imputation is
sensitive to outliers and may not be a good representation of the central tendency of the data.
df['age'] = df['age'].fillna(df['age'].mean())   # assign the result (do not combine assignment with inplace=True)
3.1 Imputing missing values with mean/median of group
We can fill the missing values using group level statistics in the following manner.
#Mean
df['age'] = df['age'].fillna(df.groupby('class')['age'].transform('mean'))
#Median
df['age'] = df['age'].fillna(df.groupby('class')['age'].transform('median'))
In this method we fill the NULL values of age with the mean of age at the ‘class’ group level. We use this approach on the assumption that passengers who booked the same class of ticket on the ship tend to belong to a similar age group.
4. Imputation method for categorical columns
When missing values is from categorical columns (string or numerical) then the missing values can be replaced
with the most frequent category. If the number of missing values is very large then it can be replaced with a
new category.

df['deck'].value_counts()                           # inspect the existing categories

# creating a new category 'H' because the number of missing values is very large
df['deck'] = df['deck'].cat.add_categories(['H'])
df['deck'] = df['deck'].fillna('H')
In our dataset the deck column has almost 688 NULL values, so we create a new category called ‘H’ and substitute it for the NULL values.
5. Forward Fill and Backward Fill
Forward fill (ffill) and backward fill (bfill) are methods used to fill missing values by carrying forward the last
observed non-missing value (for ffill) or by carrying backward the next observed non-missing value (for bfill).

If missing values should be filled with the most recent non-missing value, use ffill. If missing values should be
filled with the next non-missing value, use bfill.
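
A short sketch on a small, made-up Series:

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])
print(s.ffill())   # 1.0, 1.0, 1.0, 4.0, 4.0  -> carries the last observed value forward
print(s.bfill())   # 1.0, 4.0, 4.0, 4.0, NaN  -> carries the next observed value backward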
6. Interpolation
Interpolation is a technique used to fill missing values based on the values of adjacent datapoints. This
technique is mainly used in case of time series data or in situation where the missing data points are expected to
vary smoothly or follow a certain trend. It is also used for regularly sampled data.
Interpolation can be understood as a weighted average, where the weights are inversely related to the distance to its
neighboring points.
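For example, on a small, evenly spaced made-up series:

import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, np.nan, 40.0])
print(s.interpolate())    # 10.0, 20.0, 30.0, 40.0 (linear interpolation by default)
# For time-indexed data, s.interpolate(method='time') weights the fill by timestamp spacing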
7. Model Based Imputation (Regression Model)
The earlier methods for handling missing values do not take advantage of the correlation between the variable containing the missing values and the other variables.
In this method we use predictive models to impute missing values based on other features in the dataset. A regression or classification model is used to predict the missing values, depending on whether the feature with missing values is continuous or categorical.
Here the 'age' column contains missing values, so the data is split into rows where age is known (used to train the model) and rows where age is missing (for which the trained model predicts values).
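A minimal sketch of this idea, assuming numeric predictor columns such as 'fare' and 'pclass' are available (the choice of predictors is an assumption for illustration):

from sklearn.linear_model import LinearRegression

# split the rows by whether 'age' is known
known = df[df['age'].notnull()]
missing = df[df['age'].isnull()]

# train on rows with a known age, then predict age for the remaining rows
model = LinearRegression().fit(known[['fare', 'pclass']], known['age'])
df.loc[df['age'].isnull(), 'age'] = model.predict(missing[['fare', 'pclass']])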
8. Multiple Imputation
The Iterative Imputer is a method for imputing missing values in a dataset. It belongs to the scikit-learn library
and implements the Multiple Imputation by Chained Equations (MICE) algorithm. MICE is an iterative
imputation approach that imputes missing values one variable at a time, conditioned on the other variables.
Suppose the feature ‘age’ is well correlated with the feature ‘Fare’ such that people with lower fares are also
younger and people with higher fares are also older. In that case, it would make sense to impute low age for
low fare values and high age for high fare values. So here, we are taking multiple features into account by
following a multivariate approach.
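A minimal sketch with scikit-learn's IterativeImputer, applied to numeric columns only (the column list is illustrative):

from sklearn.experimental import enable_iterative_imputer  # noqa: F401 - enables IterativeImputer
from sklearn.impute import IterativeImputer

num_cols = ['age', 'fare', 'pclass']
imputer = IterativeImputer(max_iter=10, random_state=0)
df[num_cols] = imputer.fit_transform(df[num_cols])   # each column is imputed conditioned on the others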
9. K-Nearest Neighbors Imputations (KNNImputer)
Imputing missing values using k-Nearest Neighbors (KNN) is a technique where missing values are estimated
based on the values of their nearest neighbors in the feature space.
The idea is to find the k nearest data points and use their values to impute the missing values.
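A similar sketch with scikit-learn's KNNImputer:

from sklearn.impute import KNNImputer

num_cols = ['age', 'fare', 'pclass']         # numeric columns only (illustrative choice)
imputer = KNNImputer(n_neighbors=5)          # estimate each missing value from the 5 nearest rows
df[num_cols] = imputer.fit_transform(df[num_cols])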

2 How would you adapt the data science process to analyze real-time CO1
Twitter data during a breaking news event (e.g., a natural disaster) to understand
user sentiments and environmental patterns?
Adapting the data science process to analyze real-time Twitter data during a breaking news event, such as a
natural disaster, involves the following steps:

1. Problem Definition

 Goal: Understand user sentiments and identify environmental patterns during a natural disaster.
 Key Questions:
o
What are the prevailing sentiments (fear, anxiety, hope) in user tweets?
o
How are environmental conditions (floods, fires, etc.) being reported by users?
o
Can we identify emerging trends, keywords, or areas affected by the disaster?
 Constraints:
o
Real-time data collection and processing.
o
Potential data noise and misinformation.

2. Data Collection

 Real-time Twitter API: Use the Twitter Streaming API to collect live tweets.
o
Filters: Use keywords or hashtags related to the event (e.g., #flood, #earthquake, #hurricane).
o
Location-Based Data: Include geotagged tweets for location analysis to identify where users
are reporting from.
 External Data: Supplement Twitter data with external environmental data sources, such as weather
reports, satellite imagery, or government agencies’ feeds.

3. Data Preprocessing

 Text Preprocessing:
o
Cleaning: Remove unnecessary characters (URLs, hashtags, mentions).
o
Tokenization: Break tweets into individual tokens (words).

o
Stopwords Removal: Remove common words like "the" and "is."
o
Lemmatization: Convert words to their base forms (e.g., "running" to "run").
 Filtering Misinformation:
o
Detect and filter out spam, rumors, or irrelevant information using pre-built classifiers or rules.
 Geotagging:
o
Enrich non-geotagged tweets by extracting location references from the text (e.g., "New
York," "coastal city").

4. Exploratory Data Analysis (EDA)

 Sentiment Distribution:
o
Use sentiment analysis models (e.g., VADER, TextBlob, or transformers-based models like
BERT) to classify tweets as positive, negative, or neutral.
o
Visualize sentiment trends over time to understand shifts in emotions as the event unfolds.
 Environmental Patterns:
o
Use keyword extraction or topic modeling (e.g., LDA) to identify patterns in the data related to
the disaster (e.g., flood levels, wind damage).
o
Correlate emerging topics with the location data to track the affected regions.

5. Real-Time Data Processing & Monitoring

 Stream Processing Framework:


o
Use tools like Apache Kafka or Apache Spark Streaming to handle the continuous flow of data.
 Sentiment Dashboard:
o
Build a dashboard (e.g., using Streamlit or Plotly Dash) to visualize real-time sentiment
analysis, keywords, and geospatial distribution of tweets.
 Alerts & Insights:
o
Set up alerts based on specific conditions (e.g., a sudden spike in negative sentiments or
frequent mentions of critical areas).

6. Modeling and Analysis

 Sentiment Analysis Model:


o
Train a machine learning or deep learning model for sentiment classification (e.g., BERT,
LSTM).
o
Fine-tune the model on a labeled dataset of disaster-related tweets.
 Geospatial Analysis:
o
Use clustering algorithms (e.g., DBSCAN) on geotagged tweets to detect hotspots of activity
or areas where people report the most damage.
o
Visualize these clusters on interactive maps (e.g., using Folium or Plotly).
 Topic Modeling:
o
Apply techniques like Latent Dirichlet Allocation (LDA) or Non-Negative Matrix
Factorization (NMF) to detect trending topics related to the disaster.

7. Evaluation

 Model Evaluation:
o
Measure the performance of sentiment analysis and topic models using metrics like accuracy,
precision, recall, and F1-score.
 Continuous Validation:
o
Perform real-time error analysis by cross-referencing sentiment trends with external
reports (e.g., government alerts, official news).

8. Communication of Results

 Reporting:
o
Create summaries and insights for stakeholders (e.g., government agencies, disaster response
teams) that include real-time sentiment shifts, affected areas, and trending environmental
patterns.
 Visualization:
o
Use dynamic visualizations (graphs, heatmaps, word clouds) to make the data easily
interpretable and actionable.

9. Deployment & Feedback Loop

 Deploy the Analysis System:


o
Deploy the real-time data processing and visualization platform, ensuring the system is robust
enough to handle the data surge during breaking news events.
 Feedback Loop:
o
Continuously improve the models using new data from ongoing events, tuning them to enhance
accuracy and relevance.

Tools and Technologies

 Twitter API for real-time data collection.


 Natural Language Processing (NLP) libraries like NLTK, spaCy, and transformers for text
processing and sentiment analysis.
 Geospatial Libraries like Geopandas and Folium for mapping.
 Apache Kafka/Spark for real-time data streaming.
 Visualization Tools like Plotly Dash, Streamlit, or Tableau for dashboards.

By following these steps, you can develop a comprehensive, real-time system for understanding public
sentiment and environmental patterns during a natural disaster through Twitter data.
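
As one concrete illustration of the sentiment-analysis step above, a minimal sketch using NLTK's VADER analyzer (the sample tweets are invented):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')               # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

tweets = [
    "Floodwater is rising fast near the river, please stay safe everyone",
    "Rescue teams just arrived in our street, so relieved and grateful",
]
for text in tweets:
    scores = sia.polarity_scores(text)       # returns 'neg', 'neu', 'pos', 'compound'
    if scores['compound'] > 0.05:
        label = 'positive'
    elif scores['compound'] < -0.05:
        label = 'negative'
    else:
        label = 'neutral'
    print(label, round(scores['compound'], 3), text)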
3 Explain the data analytics life cycle. Write briefly about regression analysis. CO1
Data Analytics Lifecycle:
The data analytics lifecycle is designed for Big Data problems and data science projects. The cycle is iterative to reflect a real project. To address the distinct requirements of performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved in acquiring, processing, analyzing, and repurposing data.
Phase 1: Discovery –
The data science team learns and investigates the problem.
Develop context and understanding.
Come to know about data sources needed and available for the project.
The team formulates the initial hypothesis that can be later tested with data.
Phase 2: Data Preparation –
Steps to explore, preprocess, and condition data before modeling and analysis.
It requires the presence of an analytic sandbox, the team executes, loads, and transforms, to get data into
the sandbox.
Data preparation tasks are likely to be performed multiple times and not in predefined order.
Several tools commonly used for this phase are – Hadoop, Alpine Miner, Open Refine, etc.
Phase 3: Model Planning –
The team explores data to learn about relationships between variables and subsequently, selects key
variables and the most suitable models.
In this phase, the data science team develops data sets for training, testing, and production purposes.
Team builds and executes models based on the work done in the model planning phase.

Several tools commonly used for this phase are MATLAB and STATISTICA.
Phase 4: Model Building –
Team develops datasets for testing, training, and production purposes.
Team also considers whether its existing tools will suffice for running the models or if they need more
robust environment for executing models.
Free or open-source tools – R and PL/R, Octave, WEKA.
Commercial tools – MATLAB and STATISTICA.
Phase 5: Communicate Results –
After executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
The team considers how best to articulate findings and outcomes to various team members and stakeholders, taking into account warnings and assumptions.
The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey the findings to stakeholders.
Phase 6: Operationalize –
The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to the full enterprise of users.
This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale, and to make adjustments before full deployment.
The team delivers final reports, briefings, codes.
Free or open source tools – Octave, WEKA, SQL, MADlib.

Regression Analysis:
 Regression is a statistical technique that relates a dependent variable to one or more independent
variables.
 A regression model is able to show whether changes observed in the dependent variable are associated
with changes in one or more of the independent variables.
 It does this by essentially determining a best-fit line and seeing how the data is dispersed around this
line.
 Regression helps economists and financial analysts in things ranging from asset valuation to making
predictions.
 For regression results to be properly interpreted, several assumptions about the data and the model
itself must hold.
Two different types of regression:
The main difference between simple and multiple regression is the number of independent variables used:
 Simple regression
Also known as linear regression, this technique uses one independent variable to predict a dependent
variable. For example, predicting weight based on height is a simple regression.

 Multiple regression
This technique uses two or more independent variables to predict a dependent variable. For example,
predicting height based on age and weight is a multiple regression.
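
A short scikit-learn sketch of both kinds of regression (the height, weight and age numbers below are made up):

import numpy as np
from sklearn.linear_model import LinearRegression

# Simple regression: one independent variable (height) predicting weight
height = np.array([[150], [160], [170], [180], [190]])
weight = np.array([55, 62, 70, 78, 85])
simple = LinearRegression().fit(height, weight)
print(simple.coef_, simple.intercept_)          # slope and intercept of the best-fit line

# Multiple regression: two independent variables (age, weight) predicting height
X = np.array([[12, 40], [15, 55], [18, 68], [25, 72], [40, 80]])
y = np.array([150, 165, 175, 178, 176])
multiple = LinearRegression().fit(X, y)
print(multiple.predict([[20, 70]]))             # predicted height for a new person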

4 How do you handle ordinal and nominal data in a dataset? Explain how to CO1
integrate two data frames?
In data science, handling categorical data and integrating data frames are crucial for effective analysis. Let's
explore these in the context of a data science workflow.

Handling Ordinal and Nominal Data in Data Science

Categorical data must be transformed to ensure compatibility with machine learning models. These models
typically require numerical inputs, making it essential to handle ordinal and nominal data properly.

1. Handling Ordinal Data in Data Science

 Ordinal data retains an inherent order, so preserving this order during data transformation is key.
 Label encoding or custom mapping assigns numerical values in a way that reflects the rank or order of
the categories.
 Example Use Case: In a customer satisfaction survey, the responses "poor", "average", and "excellent"
are ranked, so they can be encoded as 1, 2, and 3, respectively, preserving their order for analysis.

2. Handling Nominal Data in Data Science

 Nominal data has no inherent order, and transforming it requires methods that don’t impose any
artificial ranking.
 One-hot encoding is widely used in data science to convert categorical variables into binary vectors.
This ensures that no ordinal relationship is implied between categories.
 Example Use Case: If analyzing different car brands ("Ford", "BMW", "Toyota"), one-hot encoding
would create separate columns for each brand.

Example Code for Encoding:

import pandas as pd

# Ordinal encoding: use an explicit mapping so the encoded values preserve the order
# (sklearn's LabelEncoder would encode alphabetically and lose the Low < Medium < High order)
data = pd.DataFrame({'Satisfaction': ['Low', 'Medium', 'High']})
order = {'Low': 1, 'Medium': 2, 'High': 3}
data['Satisfaction_encoded'] = data['Satisfaction'].map(order)

# One-hot encoding: one binary column per category, no order implied
data = pd.DataFrame({'Car_Brand': ['Ford', 'BMW', 'Toyota']})
one_hot_encoded_data = pd.get_dummies(data, columns=['Car_Brand'])

Integrating DataFrames in Data Science

In data science, merging and concatenating data frames is essential when working with multiple data sources or

when splitting and aggregating datasets. Proper integration allows for a unified analysis.

1. Merging DataFrames

Merging is commonly used when datasets share a common key, and you want to combine them based on that
key. In data science projects, merging might involve linking demographic data with behavioral data on a
customer ID or combining sales data from multiple regions on a common region column.

Example: Merging two data sets on customer ID for further analysis of sales and demographics.
df_sales = pd.DataFrame({'CustomerID': [1, 2, 3], 'Sales': [100, 200, 300]})
df_demographics = pd.DataFrame({'CustomerID': [1, 2, 4], 'Age': [30, 40, 50]})
df_merged = pd.merge(df_sales, df_demographics, on='CustomerID', how='inner')   # keeps IDs 1 and 2 only

2. Concatenating DataFrames

Concatenation is used when stacking datasets either vertically or horizontally. In data science, this could mean
adding more observations (vertical stacking) or adding more features (horizontal stacking).

Vertical Concatenation: Used when two datasets have the same columns but different rows, such as
appending data from different months or years.
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
df_combined = pd.concat([df1, df2], axis=0)   # stack rows: same columns, more observations

Horizontal Concatenation: Used when adding more features to the dataset (e.g., adding new attributes to the
existing data).
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': [3, 4]})
df_combined = pd.concat([df1, df2], axis=1)   # stack columns: same rows, more features

Practical Use Cases:

 Merging customer sales data with web interaction logs based on a customer ID, allowing you to
analyze sales trends with customer behavior.
 Concatenating monthly data from an e-commerce platform to analyze overall trends across a year by
stacking datasets row-wise.

These steps ensure that categorical data is transformed correctly, and data frames are integrated efficiently,
allowing for more accurate and scalable analyses.
