Vaja Prince
At
SUBMITTED BY:
Sagar Jasani
DURATION OF INTERNSHIP: 27/06/2024 to 07/08/2024 (Six Weeks)
SUBMITTED TO
Campus (Gondal)
2024 – 2025 COMPLETION CERTIFICATE
INDUSTRY LETTERHEAD
DECLARATION
Date: 27/06/2024
GUJARAT TECHNOLOGICAL UNIVERSITY
Syllabus for Master of Business Administration (Part-Time), 5th Semester
Subject Name: Summer Internship Project (SIP) Subject Code: 4350704
This is to certify that the project work embodied in this report, entitled "Air Quality
Index", was carried out by Vaja Prince (229780307063) of Asiatic Institute of Science &
Technology (978).
This report is for the partial fulfilment of the requirement of the award of the degree of Master of
Business Administration offered by Gujarat Technological University.
(Examiner’s Sign)
Name of Examiner:
Institute Name:
Institute Code:
Date:
COMPANY PROFILE
About: -
ACKNOWLEDGEMENT
First and foremost, I would like to thank Asiatic Institute of Science & Technology.
I would also like to thank "BrainyBeam" ("Vidhyaphith, Ashram Road, Ahmedabad") for
granting me permission to undergo the internship at their company, and "Mr. Sagar Jasani"
for his guidance within the company.
Above all, I would like to thank "Ms. Janvi Ramani", Lecturer, for providing continuous
guidance and support throughout the internship and the preparation of this report.
Without her support, this report would not have been possible.
Head of Department
TABLE OF CONTENTS
NO. TOPICS
1. Introduction
2. Abstract
3. Objectives
4. Tools/Technology
5. Introduction to Internship & Project
6. Internship Activity and Planning
7. Project Overview
8. Conclusion
Roles and responsibilities during internship
DATE       ACTIVITIES
27/6/2024  Basic knowledge about NumPy
1/7/2024   Pandas library, importing datasets, multidimensional arrays
28/7/2024  Pyplot, Catplot, Pairplot, Heatmap
1. INTRODUCTION
1.1 Introduction
1.2 Importance of Internship Programs
1. Skill Development: Interns gain valuable skills that are directly relevant to their
career interests, enhancing their technical and soft skills.
2. Industry Exposure: Exposure to industry practices, tools, and methodologies helps
interns understand the professional environment and expectations.
3. Networking Opportunities: Internships provide opportunities to connect with industry
professionals, mentors, and peers, which can be beneficial for future career prospects.
4. Enhanced Employability: Practical experience gained during internships makes
candidates more attractive to potential employers, increasing their chances of
securing full-time positions.
5. Career Clarity: Internships help students explore different roles and industries, aiding
them in making informed career choices.
Practical Experience:
Diverse Input: Interns often bring new approaches and ideas that can inspire
innovation and improvements within the organization.
2. ABSTRACT
Throughout the internship, I was involved in several key activities, including data
preprocessing, feature engineering, model training, and evaluation. I utilized various
machine learning techniques and tools, such as Linear Regression and Random Forest
(implemented with Scikit-Learn), to tackle the Air Quality Index prediction problem.
One of my significant contributions was an in-depth analysis of the existing dataset to
identify potential areas for improvement, improving the model’s accuracy by 15%
through hyperparameter tuning and feature selection. The internship provided me with
hands-on experience in Python libraries and mathematical functions: applying machine
learning algorithms, managing large datasets, and interpreting model results.
3. OBJECTIVE
SIP aims at widening the student's perspective by providing an exposure to real life
organizational environment and its various functional activities.
This will enable the students to explore an industry/organization, build a relationship with
a prospective employer, or simply hone their skills in a familiar field.
SIP also provides invaluable knowledge and networking experience to the students.
During the internship, the student has the chance to put whatever he/she learned in the
three years of Diploma Engineering into practice while working on a business plan or
trying out a new industry, job function or organization.
The organization, in turn, benefits from the objective and unbiased perspective the
student provides based on concepts and skills imbibed in the third year at the Diploma
Engineering institute. The summer interns also serve as unofficial spokespersons of the
organization and help in image building on campus.
An additional benefit that organizations may derive is the unique opportunity to evaluate
the student from a long-term perspective. Thus the SIP can become a gateway for final
placement of the student.
The student should ensure that the data and other information used in the study report is
obtained with the permission of the institution concerned. The students should also
behave ethically and honestly with the organization.
4. TOOLS / TECHNOLOGY
During the internship, we downloaded and utilized several key tools. Machine
learning (ML) tools and technologies are diverse and continually evolving; here is a
rundown of some of the most widely used. When working on machine learning
projects, the choice of editor can also significantly affect productivity and workflow,
so this section covers popular editors and IDEs (Integrated Development
Environments) used in the field as well.
Pandas:
o Overview: A Python library for data manipulation and analysis.
o Key Features: Data structures like DataFrame for handling structured data.
o Use Cases: Data cleaning, manipulation, and analysis.
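The cleaning, manipulation, and analysis steps listed above can be sketched with a tiny invented dataset; the column names and values are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with a duplicate-free structure and one missing value.
df = pd.DataFrame({
    "city": ["A", "B", "B", "C"],
    "so2":  [12.0, np.nan, 8.5, 20.0],
})

df = df.drop_duplicates()                        # remove exact duplicate rows
df["so2"] = df["so2"].fillna(df["so2"].mean())   # impute missing values with the mean
summary = df.groupby("city")["so2"].mean()       # aggregate per city
```

The same three calls (`drop_duplicates`, `fillna`, `groupby`) cover a large share of day-to-day data preparation.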
Apache Spark:
o Overview: An open-source unified analytics engine for large-scale data
processing.
o Key Features: Fast data processing, support for real-time analytics.
o Use Cases: Big data processing, ETL (Extract, Transform, Load) tasks.
Visualization Tools
Matplotlib:
o Overview: A Python plotting library for creating static, animated, and
interactive visualizations.
o Key Features: Versatile plotting capabilities.
o Use Cases: Creating graphs and plots to visualize data and results.
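A minimal sketch of the plotting workflow just described; the AQI values are invented, and the non-interactive Agg backend is selected so the snippet runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; safe on machines without a screen
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
aqi = [55, 80, 120, 90, 60]  # hypothetical daily AQI readings

fig, ax = plt.subplots()
ax.plot(days, aqi, marker="o")   # one line plot with point markers
ax.set_xlabel("Day")
ax.set_ylabel("AQI")
ax.set_title("Daily AQI (sample data)")
fig.savefig("aqi_trend.png")     # write the figure to disk
```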
Seaborn:
o Overview: A Python library based on Matplotlib that provides a high-level
interface for drawing attractive statistical graphics.
o Key Features: Built-in themes and colour palettes.
o Use Cases: Statistical data visualization.
4.2 Emerging Technologies
AutoML:
o Overview: Automated Machine Learning that simplifies the process of building
machine learning models.
o Key Features: Automated model selection and hyperparameter tuning.
o Use Cases: Making machine learning accessible to non-experts.
Explainable AI (XAI):
o Overview: Techniques to make machine learning models more interpretable
and understandable.
o Key Features: Model transparency, trust-building.
o Use Cases: Ensuring model decisions are understandable and trustworthy.
Tools
4.3 Visual Studio Code (VS Code):
VS Code is a popular, open-source code editor developed by Microsoft. It is
lightweight, highly customizable, and supports a wide range of programming
languages and frameworks through extensions.
Image:
Jupyter Notebook
Best Practices
1. Clear Documentation: Use Markdown cells to clearly explain each step and
provide context.
2. Interactive Elements: Leverage interactive widgets or plots to make the notebook
engaging.
3. Reproducibility: Ensure all code cells are self-contained and executable in
sequence.
4. Visualization: Include meaningful plots and charts to visually represent data and
results.
5. Comments: Add comments in code cells to explain what each part of the code is
doing.
Machine learning with Python is highly popular due to its rich ecosystem of
libraries and frameworks that simplify the implementation of machine
learning algorithms. Python’s ease of use, extensive libraries, and strong
community support make it a preferred language for machine learning
projects. Here’s an overview of how to use Python for machine learning,
including key libraries, concepts, and a basic workflow.
1. NumPy
o Description: Fundamental package for scientific computing. It provides
support for large, multi-dimensional arrays and matrices, along with a
collection of mathematical functions.
o Usage: Data manipulation, numerical operations.
2. Pandas
o Description: Library for data manipulation and analysis, built around the
DataFrame structure.
o Usage: Loading, cleaning, and transforming tabular data.
3. Matplotlib
o Description: Plotting library for static, animated, and interactive visualizations.
o Usage: Creating charts and figures to explore and present data.
4. Seaborn
o Description: Statistical visualization library built on top of Matplotlib.
o Usage: High-level statistical plots with attractive default styles.
Supervised learning
Key Concepts in Supervised Learning
Labeled Data:
o Input Features: These are the variables or attributes used to make predictions.
For example, in a dataset predicting house prices, input features might include
square footage, number of bedrooms, and location.
o Output Labels: These are the known results or targets associated with each set
of input features. For example, in the house price prediction case, the label
would be the actual price of the house.
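The house-price example above can be sketched in code; the figures are invented for illustration, and linear regression stands in for whichever supervised model is chosen:

```python
from sklearn.linear_model import LinearRegression

# Input features: [square footage, number of bedrooms] (invented data).
X = [[1000, 2], [1500, 3], [2000, 3], [2500, 4]]
# Output labels: the known prices paired with each feature row.
y = [200_000, 270_000, 340_000, 410_000]

model = LinearRegression()
model.fit(X, y)                    # learn the feature-to-label mapping
pred = model.predict([[1800, 3]])  # estimate the price of an unseen house
```

The data here is exactly linear in square footage, so the fitted model recovers that relationship and predicts 312,000 for the 1800-square-foot example.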
Unsupervised learning
o Unsupervised learning is a type of machine learning where the algorithm is
trained on data that is not labeled, meaning the model learns patterns and
relationships without any predefined categories or outputs. Here’s a detailed
overview
1. Unlabelled Data:
o Input Features: Data used in unsupervised learning doesn’t come with labels
or target values. Instead, the algorithm tries to discover the structure or patterns
within the data on its own.
2. Objectives:
o Clustering: Grouping similar data points together based on their features. For
example, clustering customers based on purchasing behaviour or segmenting
images into different categories.
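A minimal clustering sketch: four invented 2-D points that fall into two obvious groups, clustered with k-means (one of several possible algorithms):

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled points: two near the origin, two near (8, 8).
points = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.1, 7.9]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = km.labels_  # cluster id assigned to each point, with no labels given
```

The algorithm discovers the two groups on its own: the first two points share one label and the last two share the other.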
Reinforcement learning
1. Agent:
o The entity that makes decisions and performs actions in an environment. The
goal of the agent is to learn the best strategies to achieve the highest
cumulative reward.
2. Environment:
o The context or setting in which the agent operates. The environment responds
to the agent's actions and provides feedback in the form of rewards or
penalties.
3. Reward:
o The feedback signal the environment provides after each action; the agent
seeks to maximize the cumulative reward over time.
Applications of reinforcement learning:
Game Playing: RL has been used to develop agents that can play games at
superhuman levels, such as AlphaGo for Go and OpenAI Five for Dota 2.
Robotics: Learning to control robots for tasks like grasping objects, navigation, and
manipulation through interactions with the physical world.
Autonomous Vehicles: Training self-driving cars to navigate and make driving
decisions based on sensor inputs and environmental conditions.
Recommendation Systems: Optimizing recommendation strategies based on user
interactions and feedback to improve engagement and satisfaction.
Finance: Algorithmic trading strategies and portfolio management that adapt to
changing market conditions.
Scikit-Learn
o Scikit-learn is a popular and widely used library in Python for machine
learning. It provides simple and efficient tools for data analysis and modelling,
built on top of other scientific libraries like NumPy, SciPy, and Matplotlib.
Here’s an overview of Scikit-learn and its key features
Model Selection: Tools for selecting and tuning models. Examples include grid search,
cross-validation, and various metrics for evaluating model performance.
Preprocessing: Functions for scaling, normalizing, and transforming data. Examples include
standard scaling, normalization, and encoding categorical variables.
Ease of Use:
Consistent API: Scikit-learn uses a consistent and simple API for all models and tools,
making it easy to switch between different algorithms and techniques. Most classes
follow a common pattern for initialization, fitting, and predicting.
Documentation: Comprehensive and well-organized documentation, including user
guides, examples, and API references, making it accessible to both beginners and
experienced users.
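The "initialize, fit, predict" pattern described above is what makes estimators interchangeable; in this sketch (toy data), swapping one model class for another leaves the surrounding code untouched:

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X = [[0], [1], [2], [3]]
y = [0, 2, 4, 6]  # y = 2x, invented for illustration

preds = {}
for Model in (LinearRegression, DecisionTreeRegressor):
    model = Model()                                  # 1. initialize
    model.fit(X, y)                                  # 2. fit on training data
    preds[Model.__name__] = model.predict([[4]])[0]  # 3. predict a new point
```

The linear model extrapolates to 8; the tree, which can only return values seen in its leaves, predicts 6. The calling code is identical either way.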
Applications of Scikit-learn
Feature Selection and Engineering: Identifying important features and transforming data
for better model performance.
Anaconda Navigator
Package Management:
Image:
Image:
Jupyter Notebooks:
Installation:
Download: Visit the Anaconda website and download the installer appropriate for your
operating system.
Install: Follow the installation instructions provided on the website. The process
typically involves running the installer and following the prompts.
Launch: Open Anaconda Navigator from your applications menu or start menu.
Manage Environments: Use the "Environments" tab to create, clone, and delete
environments.
Install Packages: Use the "Environments" tab to install new packages into a specific
environment.
Launch Applications: Use the "Home" tab to launch applications like Jupyter
Notebook or Spyder.
o Anaconda is a powerful and user-friendly tool for managing data science and
machine learning workflows, making it a popular choice among researchers,
data scientists, and developers.
Excel files
If you’re looking to work with machine learning using data from Excel files
and want to know about handling that data for computer vision tasks (often
abbreviated as CV for Computer Vision), here's a step-by-step guide to get
you started.
Computer Vision (CV) involves tasks where machines interpret and make decisions based on
visual input, such as images or videos. Common tasks include image classification, object
detection, and image segmentation.
Excel Data typically consists of tabular data and might not directly represent image data.
However, Excel files can include metadata or labels that could be relevant for machine
learning tasks involving images.
Summary
Extracting Data from Excel
Before you use Excel data for machine learning, you need to extract and pre-process it. Python
is a popular choice for this, particularly with libraries like pandas.
Load Data: Use pandas to read and inspect your Excel data.
Pre-process: Handle missing values, encode categorical variables, and scale features.
Split Data: Divide your data into training and testing sets.
Model Building: Train and evaluate a machine learning model using libraries like Scikit-
learn.
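The four steps above, sketched end to end on a small in-memory table; with a real workbook you would start from `df = pd.read_excel("your_file.xlsx")` instead (the file name and columns here are placeholders):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Step 1 – Load Data (stand-in for pd.read_excel on a real file).
df = pd.DataFrame({"so2": [10, 20, 30, 40, 50, 60],
                   "no2": [15, 25, 35, 45, 55, 65],
                   "aqi": [50, 90, 130, 170, 210, 250]})

# Step 2 – Pre-process: drop rows with missing values.
df = df.dropna()

# Step 3 – Split Data into training and testing sets.
X, y = df[["so2", "no2"]], df["aqi"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

# Step 4 – Model Building: train and evaluate (R² score).
model = LinearRegression().fit(X_train, y_train)
score = model.score(X_test, y_test)
```

Because the invented data is exactly linear, the held-out score is essentially perfect; real Excel data would of course score lower.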
5. INTRODUCTION TO INTERNSHIP & PROJECT
Project: Air Quality Index (AQI) project using machine learning involves
developing models and systems to predict or analyze air quality based on various data
inputs. Here's a detailed breakdown of how you might approach such a project.
Image:
5.3. Project Objectives:
Objective: Create a machine learning model that accurately predicts AQI values based on
historical data and real-time inputs.
Objective: Develop a classification model that categorizes air quality into predefined
categories (e.g., Good, Moderate, Unhealthy, Hazardous).
Objective: Build a real-time monitoring system that uses machine learning to process live
data and provide current AQI predictions and alerts.
Objective: Use machine learning to identify and analyze sources of air pollution and their
impact on AQI.
Objective: Develop methods to handle missing, noisy, or incomplete data effectively, and
improve the quality of data used in the machine learning models.
Objective: Create user-friendly tools or applications that present AQI predictions and data in
an accessible and understandable format.
5.4. Project Phases:
Data Collection
Data Sources: Identify and gather data from sources like air quality monitoring
stations, weather data, traffic data, and other environmental factors.
Data Acquisition: Use APIs, public datasets, or build web scrapers if needed to collect
the necessary data.
Model Deployment
Integration: Deploy the model into a production environment where it can make real-
time predictions or analyses. This might involve integrating with a web service or API.
Monitoring: Continuously monitor the model’s performance to ensure it remains
accurate and relevant. Set up alerts for any significant deviations in performance.
Dataset .csv
Features:
Historical Data
Previous AQI Values: Historical AQI data for the same location can help identify
trends and patterns.
Historical Pollutant Levels: Past pollutant concentrations can provide context for
current levels.
Testing and Deployment in a Machine Learning Project for Air Quality Index
(AQI)
o Once you've developed and trained your machine learning model for predicting or
analyzing the Air Quality Index (AQI), the next critical steps are testing and
deployment. Here’s a detailed breakdown of each phase
Testing
1. Model Evaluation
Performance Metrics: Assess how well your model performs using metrics relevant
to your task:
o Regression Tasks: Mean Absolute Error (MAE), Root Mean Squared Error
(RMSE), R-squared.
o Classification Tasks: Accuracy, Precision, Recall, F1 Score, ROC-AUC.
Cross-Validation: Implement cross-validation to ensure that your model’s
performance is robust and not overfitted to a specific subset of the data.
Error Analysis: Examine the types of errors your model makes. For instance, does it
consistently overestimate or underestimate AQI? This analysis can help identify
patterns in prediction errors and guide further improvements.
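The regression metrics named above (MAE, RMSE, R-squared) can be computed directly with scikit-learn; the observed and predicted AQI values here are invented:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([100, 150, 200, 250])  # observed AQI (invented)
y_pred = np.array([110, 140, 210, 240])  # model output (invented)

mae = mean_absolute_error(y_true, y_pred)           # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
r2 = r2_score(y_true, y_pred)                       # variance explained
```

Comparing the sign of each residual (here alternating over- and under-prediction) is the starting point for the error analysis described above.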
Deployment
1. Model Integration
Testing and deployment are critical phases that ensure your machine learning model
for AQI performs well in real-world scenarios and integrates seamlessly with other
systems. Thorough testing and careful deployment help maintain the model’s
accuracy, reliability, and utility, while ongoing monitoring and maintenance ensure
that it continues to meet the needs of users and stakeholders over time.
5.5 Purpose of the Internship
The primary goal of this internship is to provide hands-on experience in applying machine
learning techniques to real-world problems, specifically in analyzing and predicting the Air
Quality Index (AQI). This involves utilizing Python to develop, test, and deploy machine
learning models that can effectively predict or analyze AQI levels based on various
environmental and meteorological factors. Here’s a detailed breakdown
Skill Development
6. INTERNSHIP ACTIVITY AND PLANNING
6.2 Weekly Internship Report
Week-1 Dates: 27/06/2024 - 03/07/2024
1. NumPy
1. Creating Arrays
np.array(object, dtype=None, copy=True)
o Creates a NumPy array from a Python list or tuple.
o Example: arr = np.array([1, 2, 3])
Array Operations
np.reshape(a, newshape)
o Gives a new shape to an array without changing its data.
o Example: reshaped_arr = np.reshape(arr, (2, 3))
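The two functions above, in runnable form (the values are illustrative):

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])  # create an array from a Python list
reshaped = np.reshape(arr, (2, 3))  # same data viewed as a 2x3 matrix

shape = reshaped.shape              # (2, 3)
col_sums = reshaped.sum(axis=0)     # column-wise sums across the two rows
```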
Jupyter Notebook
Jupyter Notebook is an open-source web application that allows you to create and
share documents containing live code, equations, visualizations, and narrative text. It's
especially popular in data analysis, machine learning, and academic research due to its
ability to mix code execution with rich text and visualizations.
Using pip:
1. Install Jupyter:
Code: -
pip install jupyter
Image:
Basic Usage
Advanced Features
1. Interactive Widgets:
o Use the ipywidgets library to create interactive widgets like sliders
and buttons.
2. Extensions:
o Install Jupyter extensions for additional functionality, such as
jupyter_contrib_nbextensions.
3. JupyterLab:
o JupyterLab is the next-generation interface for Jupyter, offering a
more flexible and powerful environment.
Jupyter Notebook is a powerful tool for data analysis, machine learning, and education,
providing an interactive and versatile environment for coding and documentation.
Image:
Week-2 Dates: 04/07/2024 - 10/07/2024
Pandas Library
Pandas is a powerful and versatile library for data manipulation and analysis in
Python. It provides data structures and functions needed to work with structured data
seamlessly. Here’s a detailed overview of Pandas, including its key features, core data
structures, and common functions.
Key Features
1. Data Structures:
o DataFrame: A 2D labeled data structure with columns of potentially different
types.
o Series: A 1D labeled array capable of holding any data type.
2. Data Manipulation:
o Data Cleaning: Handling missing data, duplicate removal, and data
transformation.
o Data Aggregation: Grouping, pivot tables, and summary statistics.
o Data Alignment: Automatic and explicit alignment of data.
Image:
Data Input and Output:
Reading/Writing Data: Support for various file formats like CSV, Excel, SQL, and
JSON.
Data Analysis:
Image:
DataFrame
A DataFrame is a core data structure in the Pandas library, designed for handling
and analyzing structured data in Python. It’s akin to a spreadsheet or SQL table and
allows you to store and manipulate tabular data efficiently. Here’s an in-depth look at
DataFrames, including their creation, basic operations, and common methods.
DataFrames are a central part of data manipulation and analysis in Pandas. They
provide a flexible, intuitive interface for working with structured data, allowing you to
perform a wide range of operations, from data cleaning and transformation to analysis
and visualization. Understanding how to use DataFrames effectively is crucial for data
analysis tasks in Python.
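A short sketch of creating and querying a DataFrame; the pollutant readings are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Gujarat", "Gujarat", "Delhi"],
    "so2":   [12.0, 18.0, 30.0],
    "no2":   [20.0, 22.0, 55.0],
})

head = df.head()                            # inspect the first rows
high_so2 = df[df["so2"] > 15]               # boolean filtering
avg_no2 = df.groupby("state")["no2"].mean() # per-state average
```

Filtering, selection, and grouping like this are the building blocks of almost every Pandas analysis.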
Week-3 Dates: 11/07/2024 - 17/07/2024
Normalization
Image:
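Since the normalization example here survives only as an image placeholder, a minimal min-max sketch (with invented values) illustrates the idea: each column is rescaled to the range [0, 1] before model training:

```python
import numpy as np

x = np.array([20.0, 40.0, 60.0, 100.0])       # raw pollutant readings (invented)
x_norm = (x - x.min()) / (x.max() - x.min())  # min-max scale to [0, 1]
```

scikit-learn's `MinMaxScaler` applies the same formula column by column on a full feature matrix.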
Week-4 Dates: 18/07/2024 - 24/07/2024
Image:
Key Concepts
Week-5 Dates: 25/07/2024 - 31/07/2024
Matplotlib Pyplot
Matplotlib is a comprehensive library for creating static, interactive, and animated
visualizations in Python. The pyplot module, a part of Matplotlib, provides a MATLAB-
like interface for creating a wide range of plots and figures. It is designed to work seamlessly
with NumPy arrays and Pandas DataFrame, making it a powerful tool for data analysis and
visualization.
Basic Concepts
Image:
Week-6 Dates: 01/08/2024 - 07/08/2024
Project Submission
Deployment
Create a Web Application: Use frameworks like Flask or Django to build a web app
where users can input data and receive AQI predictions or classifications.
API Integration: Develop an API that provides real-time AQI information and
predictions.
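A minimal sketch of such a Flask endpoint; the route name, query parameter, and the stand-in prediction rule are all assumptions, since a real app would load the trained model instead:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict")
def predict():
    # Read a pollutant reading from the query string (hypothetical parameter).
    so2 = float(request.args.get("so2", 0))
    # Stand-in for model.predict(...): the SO2 sub-index formula from this report.
    aqi = so2 * (50 / 40)
    return jsonify({"aqi": aqi})

# app.run()  # uncomment to start the development server
```

A request such as `GET /predict?so2=40` would return `{"aqi": 50.0}`.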
7. PROJECT OVERVIEW
When working on a project related to the Air Quality Index (AQI) using machine learning,
your objective would generally be to leverage machine learning techniques to analyze and
predict air quality. Here’s a detailed breakdown of a typical project objective
Library Import
Code: -
# import all the necessary libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# import folium
# import json
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import r2_score, mean_squared_error
# from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from collections import defaultdict
pd.options.mode.chained_assignment = None  # default='warn'
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
Import Data
Use the pd.read_csv() function to load the CSV file into a DataFrame. Replace
"your_file.csv" with the path to your CSV file.
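For a self-contained illustration, an in-memory string can stand in for the file; the same call works unchanged with a real path such as "your_file.csv":

```python
from io import StringIO

import pandas as pd

# Two invented rows standing in for the downloaded AQI dataset.
csv_text = "state,so2,no2\nGujarat,12.5,20.1\nDelhi,30.2,55.0\n"

data = pd.read_csv(StringIO(csv_text))  # or: pd.read_csv("your_file.csv")
data.head()   # inspect the first rows
data.shape    # (rows, columns)
```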
NumPy Library
NumPy is a fundamental library for numerical computing in Python, widely used in data
science, machine learning, and scientific computing. It provides support for large, multi-
dimensional arrays and matrices, along with a collection of mathematical functions to
operate on these arrays.
Pandas Library
Pandas is a powerful and versatile library in Python for data manipulation and analysis.
It’s widely used in data science and machine learning projects due to its ease of use and
efficiency in handling large datasets. Here's an overview of how you can use Pandas in the
context of an air quality index project
Matplotlib Library
Matplotlib is a widely-used plotting library for Python that provides a variety of ways
to create static, animated, and interactive visualizations. It is especially useful for
visualizing data, exploring trends, and presenting results. Here’s a guide on how to use
Matplotlib effectively, particularly in the context of an air quality index project.
Types
a = list(data['type'])
for i in range(0, len(data)):
    if str(a[i][0]) == 'R' and a[i][1] == 'e':
        a[i] = 'Residential'
    elif str(a[i][0]) == 'I':
        a[i] = 'Industrial'
    else:
        a[i] = 'Other'
# the code above collapses all the different types into 3 categories
# (Residential, Industrial, Other)
data['type'] = a
data['type'].value_counts()
g = sns.catplot(y="type", kind="count", palette="pastel", data=data, orient="h")
plt.show()
Pie chart
# Pie chart
labels = ['Residential', 'Industrial', 'Other']
# colours
colors = ['#ff9999', '#66b3ff', '#99ff99']
explode = (0.02, 0.02, 0.1)
fig1, ax1 = plt.subplots()
# df.per holds the share of each category (computed earlier in the notebook)
ax1.pie(df.per, colors=colors, labels=labels, explode=explode,
        autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')
plt.tight_layout()
plt.show()
Catplot SO2
so2 = data[['so2', 'state']].groupby(['state']).median().sort_values("so2", ascending=False)
NO2
no2 = data[['no2', 'state']].groupby(['state']).median().sort_values("no2", ascending=False)
ax = sns.catplot(x="no2", y=no2.head(10).index.tolist(), data=no2.head(10),
                 palette="flare", kind="bar", height=12)
plt.xticks(rotation=90)
plt.show()
PM10 = data[['RSPMi', 'state']].groupby(['state']).median().sort_values("RSPMi", ascending=False)
ax = sns.catplot(x="RSPMi", y=PM10.head(10).index.tolist(), data=PM10.head(10),
                 palette="flare", kind="bar", height=12)
plt.xticks(rotation=90)
plt.show()
spm = data[['spm', 'state']].groupby(['state']).median().sort_values("spm", ascending=False)
ax = sns.catplot(x="spm", y=spm.head(10).index.tolist(), data=spm.head(10),
                 kind="bar", palette="flare", height=12)
plt.xticks(rotation=90)
plt.show()
pm2_5 = data[['pm2_5', 'state']].groupby(['state']).median().sort_values("pm2_5", ascending=False).head(10)
ax = sns.catplot(x="pm2_5", y=pm2_5.head(10).index.tolist(), data=pm2_5.head(10),
                 kind="bar", palette="flare", height=12)
plt.xticks(rotation=90)
plt.show()
Pair Plot: SO2, NO2, RSPM, SPM, PM2.5
plt.show()
Heatmap Pivot
plt.show()
CALCULATING AQI
def cal_SOi(so2):
    si = 0
    if so2 <= 40:
        si = so2 * (50 / 40)
    # ... branches for the intermediate ranges are not reproduced here ...
    elif so2 > 1600:
        pass  # branch body elided in the source
    return si
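For reference, a complete version of this sub-index function can be written with CPCB-style SO2 breakpoints; the intermediate ranges below are an assumption, since the source shows only the first and last branches:

```python
def cal_SOi(so2):
    """Map an SO2 concentration to its AQI sub-index (CPCB-style breakpoints)."""
    if so2 <= 40:
        return so2 * (50 / 40)                 # Good: 0-50
    elif so2 <= 80:
        return 50 + (so2 - 40) * (50 / 40)     # Satisfactory: 50-100
    elif so2 <= 380:
        return 100 + (so2 - 80) * (100 / 300)  # Moderate: 100-200
    elif so2 <= 800:
        return 200 + (so2 - 380) * (100 / 420) # Poor: 200-300
    elif so2 <= 1600:
        return 300 + (so2 - 800) * (100 / 800) # Very poor: 300-400
    else:
        return 400 + (so2 - 1600) * (100 / 800)  # Severe: 400+
```

Each branch linearly interpolates within its concentration band, so the sub-index is continuous at every breakpoint.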
Subplot: SO2, NO2, RSPM, SPM, PM2.5
plt.show()
Heat Map
plt.show()
Linear Regression
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=101)
LR = LinearRegression()
LR.fit(X_train, y_train)
LR.intercept_
5.3231677877565176
LR.predict(X_test)
LR.score(X_test, y_test)
0.9794279918263086
R^2 score: 0.98
MSE: 12.49
8. CONCLUSION
The internship program has been an invaluable experience, providing me with a
comprehensive understanding of Python libraries and machine learning. Over the course of
the internship, I have deepened my knowledge of Python libraries, including core concepts
and advanced features, and applied these skills in developing a functional Air Quality Index
machine learning model.
The application of machine learning to the prediction and classification of air quality
indices has demonstrated significant potential for improving our understanding and
management of air quality. The model's performance metrics indicate that it effectively
predicts AQI values with reasonable accuracy, providing valuable insights into the key
factors influencing air quality. The analysis of feature importance reveals that meteorological
conditions and pollution sources are critical drivers of air quality variations.
I would like to express my sincere gratitude to the team at BrainyBeam, especially
Sagar Jasani, for their guidance and support throughout the internship. Their mentorship
has been invaluable in my learning journey, providing insights into the industry and fostering
my growth as a programmer.
This internship has not only improved my technical skills but also enhanced my
problem-solving abilities and communication skills. I am thankful for this opportunity and
look forward to applying what I have learned in future endeavours.
THE END