Datascience Sum.23sol
SETI-SEM4-IT-C2
ANS: Python is favoured in data science due to its readability, simplicity, and extensive libraries for
data manipulation, analysis, and visualization. Its versatility and ease of use allow data scientists to
focus on problem-solving rather than complex coding. Key libraries like Pandas, NumPy, and Scikit-
learn provide tools for data wrangling, numerical computation, and machine learning.
Key features:
Extensive Libraries
Versatility
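For illustration, here is a minimal sketch of Pandas and NumPy working together (the toy data is made up; Scikit-learn would then add machine-learning models on top of these arrays):

import numpy as np
import pandas as pd

# Pandas: data wrangling
df = pd.DataFrame({"product": ["A", "B", "A", "B"],
                   "sales": [100, 150, 120, None]})
df = df.dropna()                               # handle a missing value
print(df.groupby("product")["sales"].mean())   # quick summary by group

# NumPy: numerical computation
arr = np.array(df["sales"])
print(arr.mean(), arr.std())

# Scikit-learn would then supply machine-learning models
# (e.g. sklearn.linear_model.LinearRegression) trained on such arrays.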
ANS: Data science finds applications across numerous sectors, leveraging its analytical power to
drive better decision-making, improve efficiency, and enhance customer experience. Key areas
include healthcare, finance, marketing, logistics, retail, and technology.
Healthcare:
Data science enables advancements in diagnosis, treatment, drug discovery, and patient care.
Finance:
It's used for fraud detection, risk management, algorithmic trading, and personalized financial
advice.
Marketing:
Data science helps businesses understand customer behaviour, personalize marketing campaigns,
and improve advertising effectiveness.
Logistics:
Data-driven insights optimize routes, reduce costs, improve delivery times, and enhance overall efficiency.
Technology:
It powers recommendation systems, search engines, and product improvements across technology platforms.
In essence, data science provides valuable insights that businesses can use to:
Improve decision-making:
By analysing data, companies can make more informed decisions about their operations, products,
and services.
Increase efficiency:
Data science can help automate processes, optimize resources, and reduce costs.
Enhance customer experience:
Personalized recommendations, targeted advertising, and improved customer service are examples of how data science can enhance the customer experience.
(c) What is the process of data analytics? Explain each step in detail.
ANS: The data analytics process involves several key steps, starting with defining the problem,
collecting and preparing the data, conducting analysis, and interpreting and communicating
findings. Here's a more detailed look at each stage:
1. Define the problem: Clearly identify the question or objective the analysis should answer.
2. Collect the data: Determine the necessary data sources and gather the relevant information.
3. Prepare the data: Clean and prepare the data by addressing issues like missing values, inconsistencies, and duplicates.
4. Analyse the data: Perform exploratory data analysis (EDA) to understand the data's characteristics and identify potential patterns.
5. Interpret the results: Draw meaningful conclusions from the analysis and identify key insights.
6. Communicate the findings: Present the results clearly and effectively, often through data visualizations and reports.
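A small illustrative Pandas sketch of the preparation and exploration steps (the toy table, its column names, and the mention of pd.read_csv are assumptions for illustration):

import pandas as pd

# Collect the data (pd.read_csv("sales.csv") would load a real file; a toy table is used here)
df = pd.DataFrame({"region": ["N", "N", "S", "S"],
                   "revenue": [200.0, 200.0, None, 350.0]})

# Clean and prepare
df = df.drop_duplicates()                                     # remove duplicate rows
df["revenue"] = df["revenue"].fillna(df["revenue"].median())  # fill a missing value

# Exploratory data analysis
print(df.describe())                              # central tendency and spread
print(df.groupby("region")["revenue"].mean())     # a simple pattern across groups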
ANS: Prediction in data analysis refers to the process of using existing data to make informed
guesses or estimates about future or unknown outcomes. It involves applying statistical techniques
or machine learning models to analyse historical data patterns and then forecasting future values or
behaviours.
Key Points:
Purpose: To anticipate future trends, behaviours, or outcomes.
Used in: Business forecasting, weather prediction, stock market analysis, healthcare
diagnostics, etc.
Tools/Methods: Regression analysis, decision trees, neural networks, time series analysis,
etc.
Example:
If a company uses past sales data to estimate next month's sales, that’s prediction in data analysis.
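A minimal sketch of this sales example with scikit-learn (the monthly figures are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

months = np.array([[1], [2], [3], [4], [5]])       # past months
sales = np.array([200, 220, 260, 280, 310])        # past sales

model = LinearRegression().fit(months, sales)
print(model.predict([[6]]))                        # estimated sales for next month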
ANS: Predictive analytics uses historical data to forecast future outcomes, while prescriptive
analytics builds upon predictions to recommend specific actions to optimize outcomes. Predictive
techniques include regression analysis, decision trees, and neural networks, while prescriptive
techniques utilize optimization algorithms, simulation, and game theory.
Regression Analysis:
A statistical method used to model the relationship between variables. It can predict the value of a
dependent variable based on the values of independent variables, according to Google Cloud.
Decision Trees:
A flowchart-like structure that visually represents decisions and their possible outcomes. Each branch
represents a decision, and the end nodes represent potential results, as explained by Google Cloud.
Neural Networks:
Inspired by the human brain, these models use interconnected nodes (neurons) to learn from data
and make predictions. They are particularly useful for complex patterns and can be trained to predict
future outcomes, according to Google Cloud.
Optimization:
Identifying the best course of action within given constraints to achieve specific goals. This involves
mathematical models and algorithms to find the optimal solution, according to TechTarget.
Simulation:
Creating a model of a system or process to test different scenarios and see how they might impact
the future. This allows for exploration of potential outcomes and identification of the best
strategies, according to TechTarget.
Game Theory:
A framework for analyzing strategic decision-making in situations involving multiple actors with
conflicting or cooperative goals. It can help identify optimal strategies for each actor, according to
TechTarget.
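As an illustration of the optimization idea above, here is a minimal prescriptive-analytics sketch using SciPy's linear programming solver; the profit figures and resource limits are assumptions, not taken from the text:

from scipy.optimize import linprog

# Maximize profit 20*x + 30*y  ->  minimize the negative
c = [-20, -30]
A_ub = [[1, 1],        # labour hours:   x + y  <= 40
        [2, 1]]        # raw material: 2x + y  <= 60
b_ub = [40, 60]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
print(res.x)           # recommended production quantities
print(-res.fun)        # maximum profit under the constraints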
Quantitative techniques in EDA focus on numerical summaries and statistical measures that provide
insights into the data’s distribution, central tendency, spread, relationships, and patterns.
1. Descriptive Statistics
These are basic numerical summaries that describe the main features of a dataset.
Percentiles and quartiles: Break the data into 100 (percentiles) or 4 (quartiles) parts to understand the distribution.
2. Univariate Analysis
Techniques:
Frequency Tables
Summary Statistics
3. Bivariate Analysis
a) Covariance
b) Correlation
c) Cross-tabulation: Shows the frequency distribution of variables in matrix form (especially for categorical variables).
4. Outlier Detection
Techniques:
Z-score method: Data points with |Z| > 3 are usually considered outliers.
5. Normality Tests
Tests:
Shapiro-Wilk Test
Kolmogorov-Smirnov Test
Anderson-Darling Test
Although more advanced analysis comes later, some basic tests during EDA help identify early
relationships:
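A short illustrative sketch of the Z-score rule and the Shapiro-Wilk test with SciPy (the data is randomly generated purely for demonstration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=10, size=200)

# Outlier detection: |Z| > 3
z = (data - data.mean()) / data.std()
print("Outliers:", data[np.abs(z) > 3])

# Normality test: Shapiro-Wilk
stat, p = stats.shapiro(data)
print("Shapiro-Wilk p-value:", p)   # p > 0.05 suggests the data look normal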
While quantitative techniques summarize data using numbers, graphical techniques use
visualizations to detect patterns, trends, relationships, and outliers in data. These visual methods
help you understand the structure and distribution of the data quickly and intuitively.
1. Univariate Plots
a) Histogram
c) Bar Chart
d) Pie Chart
e) Density Plot
2. Bivariate Plots
a) Scatter Plot
b) Line Graph
3. Multivariate Plots
a) Heatmap
c) Bubble Chart
Like a scatter plot but includes the size of the bubbles to represent a third variable.
d) 3D Scatter Plot
4. Time Series Plots
Autocorrelation Plot: Visualizes the correlation between a time series and lagged versions of itself.
5. Advanced and Interactive Visuals (Tools like Plotly, Tableau, Power BI)
Violin Plots: Combines box plot and KDE for a richer distribution view.
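A minimal sketch of a few of these plots with Matplotlib and Seaborn (Seaborn's example "tips" dataset and the chosen columns are purely illustrative):

import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset("tips")

sns.histplot(df["total_bill"])                        # histogram of one variable
plt.show()

sns.scatterplot(data=df, x="total_bill", y="tip")     # relationship between two variables
plt.show()

sns.heatmap(df.select_dtypes("number").corr(), annot=True)  # correlation heatmap
plt.show()

sns.violinplot(data=df, x="day", y="total_bill")      # box plot + KDE in one view
plt.show()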
What is Pandas?
Pandas is an open-source library that provides Python users with high-performance data manipulation capabilities. Pandas is built on top of the NumPy package, which means that using Pandas always requires NumPy.
The name Pandas was originally derived from "Panel Data", an econometrics term for multidimensional data sets. Pandas was developed in 2008 by Wes McKinney to support data analysis in the Python language. Before Pandas came into the picture, Python was already capable of data preparation, but the overall support it provided for data analysis was very limited.
Pandas was therefore introduced to enhance Python's data analysis capabilities many times over. It performs five major steps to process and analyse the available data, irrespective of its origin. These five steps are loading, manipulation, preparation, modelling, and analysis.
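A tiny illustrative sketch of this loading, preparation, manipulation, and analysis flow in Pandas (the table and column names are made up):

import pandas as pd

# Loading (a small in-memory table here; pd.read_csv would load a real file)
df = pd.DataFrame({"department": ["IT", "IT", "HR"],
                   "salary": [50000, None, 42000]})

df = df.dropna(subset=["salary"])                    # preparation: drop a row with a missing salary
df["salary_k"] = df["salary"] / 1000                 # manipulation: derive a new column
print(df.groupby("department")["salary_k"].mean())   # analysis: average salary per department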
What is NumPy?
NumPy is mainly a Python extension module, written largely in C. It acts as a Python package that performs processing and numerical computations on single-dimensional and multi-dimensional array elements. Calculations with NumPy arrays are much faster than with standard Python lists.
Travis Oliphant created the NumPy package back in 2005 by combining the functionality of the ancestor Numeric module with that of another module named Numarray. NumPy can handle huge amounts of data and information, and it is also very convenient for data reshaping and matrix multiplication.
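A small sketch of NumPy's reshaping and matrix multiplication (the values are arbitrary):

import numpy as np

a = np.arange(12)            # 1-D array of 0..11
m = a.reshape(3, 4)          # reshape into a 3x4 matrix
n = np.ones((4, 2))          # 4x2 matrix of ones

print(m @ n)                 # matrix multiplication (3x4 . 4x2 -> 3x2)
print(m.sum(axis=0))         # vectorised column sums, no Python loop needed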
Q.4 (a) Differentiate quantitative data and qualitative data with an example.
ANS: Qualitative data describes qualities or characteristics, while quantitative data is numerical and
measurable. For example, qualitative data could include the color of a car or the taste of a fruit, while
quantitative data could include the number of cars in a parking lot or the height of a tree.
Elaboration:
Qualitative Data:
This type of data is descriptive and focuses on qualities, characteristics, and observations. It's often
collected through methods like interviews, observations, or open-ended surveys. Examples include:
Describing a person's appearance: "The person has brown hair and blue eyes".
Describing an event: "The party was lively and filled with music".
Quantitative Data:
This data is numerical and measurable. It's often collected through experiments, surveys, or other
methods that produce numerical results. Examples include:
Counting the number of cars in a parking lot.
Measuring the height of a tree.
(b) Explain histogram, skewness, and kurtosis in data analytics.
ANS: In data analytics, histograms, skewness, and kurtosis are essential tools for understanding data
distribution. Histograms visually represent data frequency, while skewness measures the asymmetry
of the distribution, and kurtosis quantifies the "tailedness" or peakedness compared to a normal
distribution.
1. Skewness:
Definition:
Skewness measures the asymmetry of a data distribution. It quantifies how much the distribution
deviates from a symmetrical bell curve.
Interpretation:
Positive Skewness: The right tail is longer, meaning there are more extreme values
on the higher end of the distribution. The mean is typically greater than the median.
Negative Skewness: The left tail is longer, indicating more extreme values on the
lower end of the distribution. The mean is typically less than the median.
Zero Skewness: Indicates a perfectly symmetrical distribution.
Significance:
Skewness tells you whether extreme values pull the distribution to one side, which affects how the mean should be interpreted and whether a data transformation may be needed.
2. Kurtosis:
Definition:
Kurtosis quantifies the "tailedness" or peakedness of a data distribution compared to a normal distribution.
Interpretation:
Positive Kurtosis (Leptokurtic): The distribution has heavier tails and a sharper peak
than a normal distribution, meaning there are more extreme values in the tails
(outliers).
Negative Kurtosis (Platykurtic): The distribution has lighter tails and a flatter peak
than a normal distribution, meaning there are fewer extreme values in the tails.
Zero Kurtosis (Mesokurtic): The distribution has tails and a peak similar to a normal
distribution.
Significance:
Kurtosis indicates how prone the data is to producing outliers, which matters when assessing risk and choosing statistical methods.
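A short sketch computing skewness and kurtosis with SciPy (the right-skewed sample is randomly generated purely for illustration):

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=1000)   # right-skewed sample

print("Skewness:", skew(data))                 # > 0  -> longer right tail
print("Kurtosis:", kurtosis(data))             # excess kurtosis; about 0 for a normal distribution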
ANS: A Pandas DataFrame is a two-dimensional, labelled data structure with the following key features:
Labeled axes: It has rows and columns, both of which have labels (row labels are called the
index).
Heterogeneous data: Each column can hold different data types (e.g., integers, strings,
floats).
Size mutable: You can add or delete columns and rows.
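A minimal sketch of these DataFrame properties (the names and values are made up):

import pandas as pd

df = pd.DataFrame({"name": ["Asha", "Ravi"],          # string column
                   "age": [21, 23]},                  # integer column (heterogeneous types)
                  index=["s1", "s2"])                 # labelled row index

df["grade"] = ["A", "B"]          # size mutable: add a column
df = df.drop("s2")                # size mutable: delete a row
print(df)
print(df.loc["s1", "age"])        # access by row and column labels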
(c OR) Explain Data Visualization with basic ideas and tools used for data visualization.
ANS: Data visualization is the process of converting raw data into graphical representations like
charts, graphs, and maps to make complex information easier to understand and analyse. This helps
identify trends, outliers, and patterns in data, facilitating better decision-making and
communication.
Basic Ideas:
Communication:
Data visualization makes data accessible to a wider audience, including those without technical
expertise, by presenting information in a visual and intuitive format.
Exploration:
It allows users to quickly explore large datasets, identify relationships between variables, and
uncover hidden insights that might be missed when analyzing raw data.
Actionable Insights:
By highlighting key trends and patterns, data visualization helps users make informed decisions based
on the insights revealed through the visual representations.
Tools Used:
Software:
Tableau: A widely used drag-and-drop tool for building interactive visualizations and dashboards.
Power BI: Another widely used tool, especially within the Microsoft ecosystem, for creating visualizations, reports, and dashboards.
Google Charts: A free and powerful tool for creating interactive charts for
embedding online, particularly for web developers and designers.
Google Data Studio (now Looker Studio): A versatile tool for creating reports and dashboards, part of the Google ecosystem.
Programming Languages/Libraries:
Python: Offers libraries like Matplotlib and Seaborn for creating a wide range of
visualizations, especially for data scientists and analysts.
R: Another strong programming language for data analysis and visualization, with
libraries like ggplot2, providing a rich ecosystem for creating visualizations.
Types of Visualizations:
Charts:
Bar charts, line charts, pie charts, histograms, and more, are used to represent data in various
ways.
Graphs:
Scatter plots, bubble charts, and network graphs are used to explore relationships between
variables.
Maps:
Geographic visualizations like heat maps and choropleth maps are used to display spatial data.
Infographics:
Visual representations that combine text and images to communicate complex information in a
concise and visually appealing way.
Dashboards:
Used to showcase dynamic data and allow users to explore data interactively.
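A minimal sketch of an interactive chart with Plotly Express (the bundled gapminder sample data and the chosen columns are purely illustrative):

import plotly.express as px

df = px.data.gapminder().query("year == 2007")
fig = px.scatter(df, x="gdpPercap", y="lifeExp",
                 size="pop", color="continent",
                 hover_name="country", log_x=True)
fig.show()   # opens an interactive chart: hover, zoom, and pan to explore the data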
Q.5 (a) Explain any one Python bitwise operator with an example. (03 Marks)
ANS: The bitwise AND operator (&) compares each bit of two numbers and returns 1 only if both
corresponding bits are 1; otherwise it returns 0.
Example:
a = 5          # binary 0101
b = 3          # binary 0011
result = a & b
print(result)  # Output: 1
Here,
  0101 (5)
& 0011 (3)
= 0001 (1)
ANS: Feature Generation is the process of creating new input features from existing ones to improve
the performance of machine learning models.
Examples:
From a "Date of Birth", generate features like Age, Day of the week, Month, etc.
From text data, generate features like word count, TF-IDF scores, etc.
In short, Feature Generation transforms raw data into better representations for machine learning
models.
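A small sketch of the two examples above in Pandas (the column names, dates, and the reference date are assumptions for illustration):

import pandas as pd

df = pd.DataFrame({"dob": ["1999-05-14", "2001-11-02"],
                   "review": ["good product", "not worth the price at all"]})

# From a date of birth: age, day of the week, month
df["dob"] = pd.to_datetime(df["dob"])
df["age"] = (pd.Timestamp("2023-06-01") - df["dob"]).dt.days // 365   # approximate age in years
df["dob_dayofweek"] = df["dob"].dt.day_name()
df["dob_month"] = df["dob"].dt.month

# From text: word count (TF-IDF scores would come from sklearn's TfidfVectorizer)
df["word_count"] = df["review"].str.split().str.len()
print(df)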
ANS: Feature Selection is the process of selecting the most important and relevant features (input
variables) from a dataset and removing irrelevant, redundant, or noisy data. It is a crucial step in the
preprocessing phase of building machine learning models.
2. Reduces Training Time: Fewer features mean faster computation and quicker model training.
3. Enhances Accuracy: By focusing only on useful features, the model can learn better.
Methods of Feature Selection:
1. Filter Methods: Select features using statistical measures (e.g., correlation with the target), independent of any specific model.
Examples:
o Correlation
o Chi-square test
o Information gain
2. Wrapper Methods: Evaluate different subsets of features by training a model on each subset and selecting the best-performing one.
Example: Recursive Feature Elimination (RFE) – removes the least important feature at each
step.
3. Embedded Methods: Feature selection happens during model training itself (e.g., LASSO regularization, which shrinks unimportant feature coefficients to zero).
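A minimal sketch of the wrapper approach mentioned in point 2, Recursive Feature Elimination with scikit-learn (the built-in iris dataset is used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X, y)
print(selector.support_)   # True for the features the wrapper keeps
print(selector.ranking_)   # 1 = selected, higher numbers were eliminated earlier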
Q.5 (b OR) Discuss the use of Data Science in the Agricultural Field. (04 Marks)
ANS: Data science plays a significant role in modern agriculture by helping farmers and researchers
make better decisions through data-driven insights. Here are some key uses:
1. Crop Yield Prediction:
Using historical data, weather patterns, and soil conditions, data science helps predict crop yields more accurately.
2. Precision Farming:
Uses data from sensors, drones, and satellites to monitor crops and soil health in real-time.
Farmers can apply fertilizers, water, or pesticides only where needed, reducing waste and
increasing efficiency.
3. Pest and Disease Detection:
Image processing and machine learning can identify early signs of pest attacks or plant diseases, helping in quick response and reduced crop damage.
4. Market Forecasting:
Data analysis helps predict market demand, pricing trends, and helps farmers decide what to
grow and when to sell for better profits.
5. Weather Forecasting:
Predicting rain, drought, or extreme temperatures helps in planning irrigation and protecting
crops.
Conclusion:
Data science in agriculture promotes sustainability, efficiency, and profitability by turning raw farm
data into actionable insights.
Q.5 (c OR) Explain Data Science with Different Ethical Issues in Detail.
Data science has great potential, but it also raises several ethical issues that must be addressed to
ensure fairness, privacy, and transparency.
1. Data Privacy:
Collecting and using personal or sensitive data (e.g., health records, location) without user
consent can violate privacy rights.
2. Bias and Discrimination:
Example: A loan approval model may deny loans to certain groups based on historical bias.
3. Lack of Transparency:
Many machine learning models are "black boxes" and difficult to interpret.
Users have the right to know how decisions affecting them are made.
4. Data Security:
Collected data must be protected from breaches, leaks, and unauthorized access.
5. Misuse of Data:
Data collected for one purpose can be misused for another without the user's knowledge.
Example: Social media data used for political manipulation without user awareness.
6. Informed Consent:
Users must be informed about what data is being collected and how it will be used.
Conclusion:
While data science offers powerful tools and insights, it must be practiced responsibly. Ethical data
use requires transparency, fairness, privacy protection, and respect for human rights. Following
guidelines and ethical frameworks is essential for building trust in data science solutions.
BE(Minor) - SEMESTER– IV EXAMINATION – WINTER 2023 SOLUTIONS
ANS: Data science is the study of data to extract meaningful insights for business. It is a
multidisciplinary approach that combines principles and practices from the fields of mathematics,
statistics, artificial intelligence, and computer engineering to analyse large amounts of data. This
analysis helps data scientists to ask and answer questions like what happened, why it happened,
what will happen, and what can be done with the results.
Data science is important because it combines tools, methods, and technology to generate meaning
from data. Modern organizations are inundated with data; there is a proliferation of devices that can
automatically collect and store information. Online systems and payment portals capture more data
in the fields of e-commerce, medicine, finance, and every other aspect of human life. We have text,
audio, video, and image data available in vast quantities.
ANS: Data analysis is a process of inspecting, cleansing, transforming, and modeling data to extract
meaningful insights, draw conclusions, and support decision-making. It involves several key steps,
from defining the problem to presenting the findings.
1. Define the Problem:
Clearly state the question to be answered. This sets the direction for the entire analysis and ensures the data collected and analyzed is relevant.
2. Collect the Data:
Gather the relevant data from the necessary sources.
3. Clean and Prepare the Data:
Address issues like missing values, inconsistencies, and errors in the data.
4. Analyze the Data:
Apply appropriate analytical techniques (e.g., descriptive statistics, regression analysis, data mining) to identify patterns, trends, and relationships.
Use tools and techniques to explore the data and extract meaningful information.
5. Interpret the Results:
Understand the implications of the findings and their relevance to the original question or problem.
6. Present the Findings:
Communicate the results in a clear and concise manner, often using visualizations like charts
and graphs.
Ensure the findings are easily understandable and actionable for the intended audience.
1. Simple and Readable Syntax:
o Python has a simple and clean syntax that mimics natural language, making it beginner-friendly.
2. Interpreted Language:
3. Dynamically Typed:
o You don't need to declare variable types; Python determines them at runtime.
4. Rich Standard Library:
o Python comes with a rich set of modules and functions for various tasks (like math, file I/O, web services, etc.).
5. Cross-platform Compatibility:
o Python code runs on different operating systems (Windows, macOS, Linux) without
modification.
7. Large Community Support:
o Python has a huge global community, so it's easy to find solutions, libraries, and frameworks for various applications.
8. Integration Capabilities:
o Python can easily integrate with other languages like C, C++, and Java, and can be
used for web and API development.
9. Rapid Development:
o Python allows quick prototyping and development, especially for startups and research projects.
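A tiny sketch of dynamic typing and the standard library in action:

import math          # part of Python's standard library, no installation needed

x = 10               # no type declaration needed: x is an int here
print(type(x))
x = "ten"            # the same name can later hold a string
print(type(x))

print(math.sqrt(16)) # standard-library function for a common numeric task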
OR
ANS: Prediction in data analysis refers to the process of using historical data to make informed
guesses or estimates about future outcomes or unknown values.
It involves using statistical techniques or machine learning models to analyze existing data patterns
and apply them to new data to forecast results.
Key Points:
Based on patterns: Predictions are made by identifying trends and patterns in past data.
Uses models: Predictive models like linear regression, decision trees, or neural networks are
commonly used.
Common applications:
o Predicting sales
o Forecasting weather
o Diagnosing diseases
ANS: A conclusion in data analytics is a summary of insights gained from analyzing data. After
processing and examining data (using statistics, visualizations, and models), analysts draw key
findings that explain what the data reveals. These conclusions help answer questions like:
What happened?
Example: If a company analyzes customer data and sees that sales are higher in summer, the
conclusion might be:
"Sales increase during summer due to seasonal demand."
Prediction:
A prediction uses past data to forecast future outcomes. It often involves machine learning models or
statistical techniques to estimate what is likely to happen next.
Example: Using historical sales data, a company might predict:
"We expect a 20% increase in sales next July."
Summary:
Conclusion: Summarizes what the data shows (e.g., "Sales increase during summer due to seasonal demand").
Prediction: Estimates future outcomes using data (e.g., "20% sales growth expected next July").
ANS: Data Science is used in many fields to extract insights, make predictions, and improve decision-
making. Here are some major applications of Data Science:
1. Healthcare
Personalized medicine
2. Finance
Fraud detection
Customer segmentation
Targeted advertising
5. Transportation
Route optimization
6. Manufacturing
7. Education
ANS: Feature generation, also known as feature construction or feature engineering, is the process
of creating new features from existing ones to improve the performance of a machine learning
model. It involves transforming and combining original features to generate new variables that better
capture relevant information for the target variable.
In brief:
Goal:
To enhance model accuracy and efficiency by creating features that are more informative and
relevant to the problem.
Process:
Transform and combine original features to generate new variables (examples are given in the paragraph below).
Benefits:
Combine existing features to reveal interactions or nonlinear relationships that might be missed by
the original features alone.
Eliminate redundant or irrelevant features while retaining important information, making the model
more efficient and easier to interpret.
Provide the model with more informative features that better correlate with the target variable,
leading to higher predictive accuracy.
For instance, if we have two categorical features, we can generate a new feature by combining them
(e.g., "gender_city"). Or, we can create polynomial features from existing numerical features (e.g.,
squaring or cubing the values) to capture quadratic or cubic relationships. Feature generation can be
a manual process, relying on domain knowledge and experience, or it can be automated using
techniques like ExploreKit.
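A small sketch of the combining and polynomial examples above (the column names and values are assumptions for illustration):

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"gender": ["M", "F"], "city": ["Pune", "Surat"],
                   "income": [40.0, 55.0]})

# Combine two categorical features into one
df["gender_city"] = df["gender"] + "_" + df["city"]

# Polynomial features from a numerical column (columns: 1, x, x^2)
poly = PolynomialFeatures(degree=2, include_bias=True)
print(poly.fit_transform(df[["income"]]))
print(df)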
ANS: Feature selection in machine learning is the process of choosing a subset of relevant features
(variables) from a larger set to improve model performance, reduce overfitting, and enhance model
interpretability. It involves identifying and retaining the most informative and predictive features
while discarding irrelevant or redundant ones. This process aims to optimize model accuracy, reduce
computational costs, and improve model generalization.
Improved Model Performance:
By focusing on the most relevant features, models can learn more accurately and effectively.
Reduced Overfitting:
Overfitting occurs when a model learns the training data too well, including noise and irrelevant
details. Feature selection helps prevent this by focusing on the most important features, leading to
better generalization on unseen data.
Improved Interpretability:
A model with fewer, more meaningful features is easier to understand and explain.
Reduced Computational Cost:
Using fewer features can significantly speed up model training and reduce memory requirements.
Noise Reduction:
Redundant or irrelevant features are discarded to avoid noise and unnecessary complexity.
Filter Methods:
These methods evaluate the features based on statistical properties, such as correlation with the
target variable, without using a specific machine learning algorithm.
Wrapper Methods:
These methods evaluate different combinations of features by training and evaluating models on
those subsets. The best performing subset is then selected.
Embedded Methods:
These methods incorporate feature selection into the model training process itself, allowing the
model to learn which features are most important during training.
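A minimal sketch of an embedded method, where L1 (Lasso) regularisation drives the coefficients of unhelpful features towards zero (scikit-learn's built-in diabetes dataset and the alpha value are purely illustrative):

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

selector = SelectFromModel(Lasso(alpha=0.5)).fit(X, y)
print(selector.get_support())        # which features the embedded method keeps
print(selector.transform(X).shape)   # dataset reduced to the selected features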