Lab Manual
Place: __________
Date: __________
Preface
The main motto of any laboratory/practical/field work is to enhance the required skills and to
create the ability amongst students to solve real-time problems by developing the relevant
competencies in the psychomotor domain. Keeping this in view, GTU has designed a competency-
focused, outcome-based curriculum for engineering degree programmes in which sufficient
weightage is given to practical work. This underlines the importance of skill enhancement amongst
students, and it encourages students, instructors, and faculty members to utilize every second of the
time allotted for practicals to achieve the relevant outcomes by performing the experiments,
rather than conducting merely study-type experiments. For effective implementation of a
competency-focused, outcome-based curriculum, it is essential that every practical is carefully
designed to serve as a tool to develop and enhance the relevant industry-required competency in
every student. Such psychomotor skills are very difficult to develop through the traditional chalk-
and-board content delivery method in the classroom. Accordingly, this lab manual is designed
to focus on industry-defined relevant outcomes, rather than the old practice of conducting
practicals merely to prove a concept or theory.
By using this lab manual, students can go through the relevant theory and procedure in advance
of the actual performance, which creates interest and gives students a basic idea prior to
performing the experiment. This in turn enhances the pre-determined outcomes amongst students.
Each experiment in this manual begins with the competency, industry-relevant skills, course
outcomes, and practical outcomes (objectives). Students will also learn the safety measures and
necessary precautions to be taken while performing the practical.
This manual also provides guidelines to faculty members to facilitate student-centric lab
activities through each experiment by arranging and managing the necessary resources, so that
the students follow the procedures with the required safety and necessary precautions to achieve
the outcomes. It also gives an idea of how students will be assessed, by providing rubrics.
Data Science is about data gathering, analysis, and decision-making: finding patterns in data
through analysis and making future predictions. By using Data Science, companies are able to
make better decisions, predictive analyses, and pattern discoveries.
Data Science is used in many industries today, e.g. banking, consultancy, healthcare,
and manufacturing. Python is an open-source, interpreted, high-level language that provides a
great approach to data science, machine learning, and research. It is one of the best languages
for data science and is used across a wide range of applications and projects. When it comes to
dealing with mathematical, statistical, and scientific functions, Python has great utility.
Utmost care has been taken while preparing this lab manual; however, there is always scope for
improvement. We therefore welcome constructive suggestions for improvement and reports of
any errors.
Practical – Course Outcome matrix
Sr. No. | Objective(s) of Experiment | CO1 | CO2 | CO3 | CO4 | CO5
The following industry-relevant competencies are expected to be developed in the student by
undertaking the practical work of this laboratory.
1. Programming Languages
2. Mathematics, Statistical Analysis, and Probability
3. Data Mining
4. Machine Learning and AI
5. Data Visualization
Experiment No: 1
Develop a program to understand the control structures of Python.
Date:
Objectives: (a) To learn and understand the different control structures in Python, such as loops,
conditional statements, and functions.
Theory:
Conditional statements: Conditional statements in Python allow you to execute certain blocks of
code based on whether a certain condition is true or false. The two main types of conditional
statements in Python are "if" statements and "if-else" statements.
Loops: Loops in Python allow you to repeat a block of code multiple times, either for a fixed
number of times or until a certain condition is met. The two main types of loops in Python are
"for" loops and "while" loops.
Functions: Functions in Python allow you to encapsulate blocks of code and reuse them
throughout your program. Functions can accept parameters and return values, making them a
powerful tool for organizing and structuring your code.
Scope: Scope in Python refers to the region of your program where a variable or function is
visible and accessible. Understanding scope is critical for avoiding errors and ensuring that your
code is organized and easy to maintain.
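As a brief illustration of these scoping levels, here is a minimal sketch; the variable name "message" is purely illustrative:

```python
# A minimal sketch of Python's scope levels (LEGB lookup order).
message = "global"              # global scope: visible throughout the module

def outer():
    message = "enclosing"       # enclosing scope: visible to nested functions

    def inner():
        message = "local"       # local scope: shadows enclosing and global names
        print(message)          # prints "local"

    inner()
    print(message)              # prints "enclosing"

outer()
print(message)                  # prints "global"
```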
Error handling: Error handling in Python involves detecting and responding to errors that may
occur during program execution. Proper error handling can help you avoid crashes and ensure that
your program continues to run smoothly.
Safety and necessary Precautions:
1. Data validation.
2. Check the data types.
3. Input sanitization.
4. Error Handling and Secure coding practices.
5. Use comments.
6. Test your code.
Procedure:
1. Plan the program structure and flow: Develop a plan for the program structure, including
the control structures that will be included, and the flow of the program logic.
2. Implement the control structures in Python: Write the code to implement the different
control structures in Python, including conditional statements, loops, and functions.
3. Test and debug the program: Conduct thorough testing of the program to ensure that it is
functioning correctly and identify and troubleshoot any errors or bugs.
4. Refine and optimize the program: Refine the program as needed to improve performance
and optimize its functionality, based on user feedback and testing results.
5. Deploy and maintain the program: Deploy the program for use by users, and maintain it by
addressing any issues or bugs that arise and providing updates and new features as needed.
Code:
```python
# Function to determine if a number is even or odd (conditional statements)
def check_even_odd(num):
    if num % 2 == 0:
        return "Even"
    else:
        return "Odd"

# Function to print the numbers from 1 to n (for loop)
def print_numbers(n):
    for i in range(1, n + 1):
        print(i, end=" ")
    print()

# Function to compute the factorial of n (while loop)
def factorial(n):
    result = 1
    while n > 1:
        result *= n
        n -= 1
    return result

# Main program (function calls and error handling)
def main():
    try:
        num = int(input("Enter a number: "))
        print(f"The number is {check_even_odd(num)}.")
        print_numbers(num)
        print(f"Factorial of {num} is {factorial(num)}.")
    except ValueError:
        print("Invalid input. Please enter an integer.")

main()
```
Observations:
Conclusion:
Quiz:
1. What is a conditional statement in Python?
➤ A conditional statement allows the program to execute certain code blocks based on
whether a condition is True or False, using if, elif, and else.
2. What is a loop in Python?
➤ A loop is used to execute a block of code repeatedly until a condition is met. Python
supports for and while loops.
3. What is the difference between a "for" loop and a "while" loop in Python?
➤ A for loop is used when the number of iterations is known or definite. A while loop is
used when the condition must be checked continuously and we do not know how many
times to loop.
4. What is a function in Python?
➤ A function is a block of reusable code that performs a specific task. It is defined using
the def keyword and can take arguments and return results.
5. What is scope in Python?
➤ Scope refers to the region of the program where a variable or function is recognized.
There are four types: Local, Enclosing, Global, and Built-in scope.
Suggested Reference:
1. https://docs.python.org/3/library/
2. https://www.tutorialspoint.com/python/
3. https://www.geeksforgeeks.org/
4. https://realpython.com/
5. https://www.w3schools.com/python/
Rubrics (criteria and marks):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Marks: 1: ____ 2: ____ 3: ____ 4: ____ 5: ____ Total: ____
Experiment No: 2
Develop a program that demonstrates the use and manipulation of Python sequences
and collections (strings, tuples, lists, dictionaries, and sets).
Date:
● Basic programming concepts: You should have a good grasp of basic programming
concepts such as variables, data types, conditional statements, loops, and functions.
● Python programming language: You should have a good understanding of Python syntax,
data structures, and standard library functions.
● Sequences: Sequences are ordered collections of elements that can be accessed by their
index or key. You should have a good understanding of the different types of sequences
such as string, tuple, list, dictionary, and set, and their respective properties.
● String manipulation: You should know how to manipulate strings using methods such as
slicing, concatenation, and formatting.
● Collection manipulation: Collections such as lists, tuples, dictionaries, and sets can be
manipulated using methods such as append, insert, remove, pop, and sort.
● Iteration: You should know how to use for loops and list comprehensions to iterate over
sequences.
● Conditional statements: You should know how to use conditional statements to check for
specific conditions in sequences.
● Functions: You should know how to define functions that operate on sequences and return
values.
Objectives: (a) To learn how to manipulate sequences and access their elements, iterate over
them, perform conditional operations on them, and use them in functions.
(b) To learn how to select the appropriate sequence type for a given task based on its properties
and performance characteristics.
Theory:
1. In Python programming language, there are four built-in sequence types: strings, lists,
tuples, and ranges. Additionally, Python includes the set and dictionary data structures,
which are implemented as unordered collections of unique and key-value pairs,
respectively.
2. The string data type in Python represents a sequence of characters and is immutable,
meaning its contents cannot be changed once it is created. Strings can be manipulated
using various methods such as slicing, concatenation, and formatting.
3. Lists and tuples are similar in many ways, but tuples are immutable, whereas lists are
mutable. Lists and tuples can hold elements of any data type and can be indexed and sliced
like strings. However, lists offer additional methods such as append, insert, remove, and
pop that allow for manipulation of the list's contents.
4. Dictionaries are another important built-in data structure in Python and are implemented as
collections of key-value pairs (insertion-ordered since Python 3.7). Each element in a
dictionary consists of a key and a corresponding value. Dictionaries can be used to store and
retrieve data quickly based on the key.
5. Sets are collections of unique elements that are unordered and mutable. Sets are often used
to perform set operations such as union, intersection, and difference.
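As a brief illustration of iterating over these types, here is a minimal sketch; all values used are purely illustrative:

```python
# Iterating over the sequence and collection types described above.
langs = ["Python", "Java", "C++"]

for lang in langs:                       # for loop over a list
    print(lang)

lengths = [len(lang) for lang in langs]  # list comprehension
print(lengths)                           # [6, 4, 3]

marks = {"maths": 90, "physics": 85}
for subject, score in marks.items():     # iterating key-value pairs
    print(subject, score)

vowels = {"a", "e", "i", "o", "u"}
print("e" in vowels)                     # membership test on a set: True
```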
Procedure:
1. Create a string variable using single or double quotes.
Use string methods like upper(), lower(), strip(), split(), join(), and replace() to manipulate the
string as needed.
Use indexing and slicing to access specific characters or substrings within the string.
2. Create a tuple variable using parentheses.
Use indexing and slicing to access specific elements or subsets within the tuple.
Tuples are immutable, so you cannot add, remove or modify elements once created.
3. Create a list variable using square brackets.
Use indexing and slicing to access specific elements or subsets within the list.
Use list methods like append(), insert(), remove(), pop(), extend(), and sort() to modify the list
as needed.
Lists are mutable, so you can add, remove or modify elements once created.
4. Create a dictionary variable using curly braces or the dict() constructor.
Use keys to access values within the dictionary.
Use dictionary methods like keys(), values(), and items() to access different parts of the
dictionary.
Use del or pop() to remove elements from the dictionary.
Use assignment to add or modify elements in the dictionary.
5. Create a set variable using curly braces or the set() constructor.
Use set methods like add(), remove(), pop(), union(), and intersection() to modify or perform
operations on the set.
Sets do not allow duplicate elements, so adding the same element multiple times will only add
it once.
Code:
# String operations
my_str = " Hello Python World! "
print("Original string:", my_str)
print("Uppercase:", my_str.upper())
print("Lowercase:", my_str.lower())
print("Stripped:", my_str.strip())
print("Split:", my_str.split())
print("Replace:", my_str.replace("Python", "Programming"))
# Tuple operations
my_tuple = (1, 2, 3, 4, 5)
print("Tuple:", my_tuple)
print("Tuple[2]:", my_tuple[2])
print("Tuple slice [1:4]:", my_tuple[1:4])
# List operations
my_list = [10, 20, 30]
my_list.append(40)
my_list.insert(1, 15)
my_list.remove(30)
print("List after operations:", my_list)
# Dictionary operations
my_dict = {"name": "Nishit", "branch": "ICT", "year": 3}
my_dict["college"] = "GEC Bhavnagar"
print("Dictionary:", my_dict)
print("Access by key:", my_dict["branch"])
my_dict.pop("year")
print("After popping 'year':", my_dict)
# Set operations
my_set = {1, 2, 3, 4}
my_set.add(5)
my_set.add(2) # Duplicate won't be added
my_set.remove(3)
print("Set after operations:", my_set)
Observations:
Conclusion:
In this experiment, we successfully explored the core Python data structures: strings, lists, tuples,
dictionaries, and sets. We performed operations such as indexing, slicing, addition, deletion, and
modification. Each structure was manipulated using built-in Python methods, providing hands-on
understanding of their properties and use cases. This practical experience will be useful in
selecting and using the appropriate data structure in real-world applications.
Quiz:
1. What method can you use to convert a string to uppercase in Python?
➤ The `upper()` method.
2. What is the difference between a tuple and a list in Python?
➤ A list is mutable (can be changed), while a tuple is immutable (cannot be changed).
3. How do you add an element to a list in Python?
➤ Use the `append()` or `insert()` method.
4. How do you access a value in a dictionary using its key in Python?
➤ By using square brackets with the key: `dict[key]`.
5. What is a set in Python?
➤ A set is an unordered collection of unique elements used for membership testing and
eliminating duplicates.
Suggested Reference:
1. https://docs.python.org/3/library/
2. https://www.tutorialspoint.com/python/
3. https://www.geeksforgeeks.org/
4. https://realpython.com/
5. https://www.w3schools.com/python/
Rubrics (criteria and marks):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Marks: 1: ____ 2: ____ 3: ____ 4: ____ 5: ____ Total: ____
Experiment No: 3
Develop a program that reads a .csv dataset file using the Pandas library and
displays the following content of the dataset: a) first five rows of the dataset,
b) complete data of the dataset, c) summary or metadata of the dataset.
Date:
● Knowledge of Python programming language and its libraries, particularly the Pandas
library.
● Understanding of the structure of .csv files and how to read and manipulate them using
Pandas.
● Familiarity with the different methods and functions available in Pandas for inspecting
DataFrames, such as "head()", "info()", and "describe()", along with "print()" and
"display()" for output.
● Ability to write and debug code, and troubleshoot errors that may arise when working with
datasets.
● Experience in working with datasets, including data cleaning, data wrangling, and data
analysis.
● Ability to understand the content and structure of datasets, and use them to derive insights
and information.
Practical skills:
● Writing code to load a .csv dataset file into a Pandas DataFrame using the "read_csv()"
function.
● Using the "head()" method to display the first five rows of the dataset.
● Using the "print()" function or "display()" method to display the complete data of the
dataset.
● Using the "info()" method or "describe()" method to display the summary or metadata of
the dataset.
● Handling errors and exceptions that may arise when working with datasets.
● Writing clean and efficient code that is easy to read and maintain.
● Testing the program with different datasets to ensure its accuracy and reliability.
Objectives: (a) To read and load the .csv dataset file into a Pandas DataFrame.
(b) To display the first five rows of the dataset using the "head()" method.
(c) To display the complete data of the dataset using the "print()" function or "display()" method.
(d) To display the summary or metadata of the dataset using the "info()" method or "describe()"
method.
Theory:
Pandas is a popular data manipulation library for Python, widely used in data science and
machine learning. It provides a powerful and flexible toolset for working with structured data,
for loading, manipulating, and analyzing datasets in various formats, including .csv files.
Procedure:
1. Import the Pandas library: To use the Pandas library in Python, it is essential to import it
into your program. You can do this by using the "import pandas as pd" statement.
2. Load the dataset: The next step is to load the dataset into a Pandas DataFrame using the
"read_csv()" function. This function takes the path to the .csv file as an argument and
returns a DataFrame object that contains the data from the file.
3. Display the first five rows: To display the first five rows of the dataset, you can use the
"head()" method. This method returns the first five rows of the DataFrame by default, but
you can specify the number of rows you want to display as an argument.
4. Display the complete data: To display the complete data of the dataset, you can use the
"print()" function or "display()" method. This will output the entire DataFrame to the
console or Jupyter Notebook.
5. Display summary or metadata: To display the summary or metadata of the dataset, you can
use the "info()" method or "describe()" method. The "info()" method provides information
about the DataFrame, including the number of rows and columns, data types, and memory
usage. The "describe()" method provides statistical summary of the dataset, including
count, mean, standard deviation, minimum, maximum, and quartiles for each column.
Code:

# Importing the pandas library
import pandas as pd

# Load the CSV file (replace 'data.csv' with your actual file name)
df = pd.read_csv('data.csv')

print(df.head())          # a) first five rows of the dataset
print(df.to_string())     # b) complete data of the dataset
df.info()                 # c) structure and metadata of the dataset

print("\nStatistical Summary:")
print(df.describe())
Observations:
Conclusion:
In this experiment, we successfully used the Pandas library to read a `.csv` file, view the first few
rows of the dataset using `head()`, examine the full dataset with `print()`/`to_string()`, and inspect
the metadata using `info()` and `describe()`. This helped build foundational skills in data handling
and analysis using Python and Pandas. Understanding how to load and analyze structured data is
essential for further learning in data science and machine learning.
Quiz:
1. What library should be used to read a .csv dataset file in Python?
➤ The Pandas library (`import pandas as pd`)
2. Which method is used to read a .csv file using Pandas library?
➤ `read_csv()` method.
3. How can you display the first five rows of the dataset using Pandas?
➤ Using the `head()` method.
4. How can you display the complete data of the dataset using Pandas?
➤ Using `print(df.to_string())` or `display(df)`.
5. How can you display the summary or metadata of the dataset using Pandas?
➤ Using the `info()` method for structure and `describe()` for statistical summary.
Suggested Reference:
1. Official Pandas documentation: https://pandas.pydata.org/docs/
2. "Python for Data Analysis" by Wes McKinney:
https://www.oreilly.com/library/view/python-for-data/9781491957653/
3. "Python Data Science Handbook" by Jake VanderPlas:
https://jakevdp.github.io/PythonDataScienceHandbook/
4. Pandas tutorial by DataCamp:
https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
Rubrics (criteria and marks):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Marks: 1: ____ 2: ____ 3: ____ 4: ____ 5: ____ Total: ____
Experiment No: 4
Develop a program that shows application of slicing and dicing over the rows
and columns of the dataset.
Date:
Objectives: (a) To gain insights into the dataset and extract meaningful information from it.
Theory:
Slicing and dicing are powerful operations that allow data analysts to manipulate data by selecting
specific subsets of data from a larger dataset. These operations are widely used in data analysis
and are a crucial aspect of data manipulation.
In the context of Python, slicing refers to extracting specific portions of data from a larger data
structure, such as a list, tuple, or DataFrame. Slicing is performed by specifying the start and end
indices of the portion of data to be extracted. For example, in a list of numbers, slicing can be
used to extract the first three numbers or the last five numbers. In a DataFrame, slicing can be
used to extract specific rows or columns based on specific conditions or criteria.
Dicing, on the other hand, refers to grouping and aggregating data based on specific criteria. This
involves dividing the data into smaller subsets based on specific categories or conditions and
performing aggregation functions on each subset. For example, in a dataset containing sales data,
dicing can be used to group the data by product type, region, or time period and calculate the total
sales for each group.
In Python, the Pandas library provides powerful tools for slicing and dicing data in a DataFrame.
The .loc and .iloc methods are used for slicing rows and columns based on specific conditions or
criteria. The .groupby method is used for grouping data based on specific categories, and
aggregation functions such as .sum(), .mean(), and .count() can be used to perform calculations
on each group. The .pivot_table method is used for creating pivot tables, which provide a
summarized view of the data by grouping and aggregating data based on specific categories.
Safety and necessary Precautions:
Procedure:
1. Load the dataset: Load the dataset into Python using the Pandas library's read_csv
function.
2. Explore the dataset: Use the head, tail, and info functions to explore the dataset and get a
sense of its structure and contents.
3. Slice and dice the data: Use the Pandas DataFrame's indexing and slicing operations to
select specific rows and columns of the dataset. Examples of slicing operations include
loc, iloc, and [ ].
4. Apply filtering: Use Boolean indexing to filter rows of the dataset based on specific
criteria.
5. Aggregate the data: Use the groupby function to group the data by specific columns and
apply aggregation functions such as sum, mean, and count.
6. Visualize the data: Use visualization libraries such as Matplotlib or Seaborn to create
visualizations of the sliced and diced data.
7. Refine and iterate: Refine the analysis and iterate as needed based on the insights gained
from the analysis.
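As a hedged sketch of steps 3 to 5 of this procedure: the file name 'data.csv' and the column names 'region' and 'sales' below are assumptions made for illustration, not part of any prescribed dataset.

```python
# Slicing and dicing with Pandas (assumed file and column names).
import pandas as pd

df = pd.read_csv('data.csv')

# Slicing: rows by position, columns by label
print(df.iloc[0:5])                        # first five rows
print(df.loc[:, ['region', 'sales']])      # selected columns

# Filtering with Boolean indexing
print(df[df['sales'] > 1000])

# Dicing: group by a category and aggregate
print(df.groupby('region')['sales'].sum())

# Pivot table: mean sales per region
print(df.pivot_table(values='sales', index='region', aggfunc='mean'))
```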
Suggested Reference:
1. "Python for Data Analysis" by Wes McKinney
2. "Python Data Science Handbook" by Jake VanderPlas
3. "Pandas User Guide" on the Pandas documentation website
4. "Data Wrangling with Pandas" course on DataCamp
5. "Data Manipulation with Pandas" course on Coursera
Rubrics (criteria and marks):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Marks: 1: ____ 2: ____ 3: ____ 4: ____ 5: ____ Total: ____
Experiment No: 5
Develop a program that shows usage of aggregate functions over the input
dataset: a) describe b) max c) min d) mean e) median f) count g) std h) corr
Date:
● Knowledge of the input dataset format (e.g. CSV, Excel, JSON) and how to load it into a
data structure in Python using libraries like Pandas.
● Understanding of the different aggregate functions available in Pandas, such as describe,
max, min, mean, median, count, std, and corr.
● Familiarity with the syntax of Pandas functions for applying aggregate functions, such as
groupby, apply, and agg.
● Ability to interpret and analyze the results of the aggregate functions to gain insights about
the dataset.
Practical skills:
Objectives: (a) To understand the concept of aggregate functions and their usage in data analysis.
Theory:
In data analysis, aggregate functions are used to calculate summary statistics over a dataset. These
functions are applied to columns or rows of a dataset to calculate values like the maximum,
minimum, mean, median, count, standard deviation, and correlation.
a) describe: This function generates descriptive statistics that summarize the central tendency,
dispersion, and shape of a dataset's distribution.
b) max: This function is used to find the maximum value of a column or row.
c) min: This function is used to find the minimum value of a column or row.
d) mean: This function is used to find the average value of a column or row.
e) median: This function is used to find the median value of a column or row.
f) count: This function is used to count the number of non-null values in a column or row.
g) std: This function is used to calculate the standard deviation of a column or row.
h) corr: This function is used to calculate the pairwise correlation between the numeric columns of a dataset.
In Python, these aggregate functions can be applied using the Pandas library. The groupby()
function is used to group data based on a specified column, and the aggregate functions can then
be applied to the grouped data.
Procedure:
1. Import necessary libraries: You will need to import Pandas library to load the dataset and
perform various operations on it.
2. Load the dataset: Load the dataset in a Pandas dataframe using the read_csv() function.
Make sure the dataset is in a CSV format and is saved in your working directory.
3. Check the dataset: Print the first few rows of the dataset using the head() function to check
if the dataset is loaded correctly.
4. Describe the dataset: Use the describe() function to get the summary statistics of the
dataset, such as count, mean, standard deviation, minimum, and maximum values.
5. Apply aggregate functions: Apply the aggregate functions such as max(), min(), mean(),
median(), count(), std(), and corr() on the dataset.
6. Display the results: Display the results of the aggregate functions to the user.
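A minimal sketch of this procedure follows; 'data.csv' is a placeholder for your actual dataset, which is assumed to contain numeric columns:

```python
# Applying the aggregate functions listed above.
import pandas as pd

df = pd.read_csv('data.csv')
print(df.head())                     # sanity-check the load

print(df.describe())                 # a) summary statistics
num = df.select_dtypes('number')     # restrict to numeric columns
print(num.max())                     # b) maximum of each column
print(num.min())                     # c) minimum
print(num.mean())                    # d) mean
print(num.median())                  # e) median
print(df.count())                    # f) non-null counts
print(num.std())                     # g) standard deviation
print(num.corr())                    # h) pairwise correlation matrix
```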
Suggested Reference:
1. https://pandas.pydata.org/docs/
2. https://numpy.org/doc/stable/
Rubrics (criteria and marks):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Marks: 1: ____ 2: ____ 3: ____ 4: ____ 5: ____ Total: ____
Experiment No: 6
Develop a program that applies split and merge operations on the datasets.
Date:
Practical skills:
Objectives: (a) To split large datasets into smaller ones for ease of handling and processing.
(b) To consolidate information and make it easier to analyze.
Theory:
Python provides several built-in functions and libraries for performing split and merge operations
on datasets. Here are some examples:
Splitting a Dataset:
Using the built-in split() method: Python's str.split() method splits a string into a list of
substrings based on a specified delimiter. This can be useful for splitting raw text records
into fields before building a dataset.
Using the numpy.array_split() function: The numpy.array_split() function can be used to split a
numpy array into smaller arrays of equal or nearly equal size.
Merging Datasets:
Using the pandas.concat() function: The pandas.concat() function can be used to concatenate
pandas dataframes along a specified axis.
Using the numpy concatenate() function: The concatenate() function can be used to merge two or
more arrays into a single array.
Procedure:
1. Define the input datasets: Determine the input datasets and their format. It could be CSV
files, Excel files, or other file types. Also, define the delimiter or separator character for
splitting the data.
2. Load the datasets: Load the datasets into the program using the appropriate libraries and
functions. Check that the data is loaded correctly and perform any necessary data cleaning
or formatting.
3. Split the datasets: Use the appropriate function or library to split the datasets into smaller
chunks. Specify the size or number of chunks to create and ensure that the resulting
datasets are consistent and valid.
4. Merge the datasets: Use the appropriate function or library to merge the datasets into a
single dataset. Specify the method of merging and ensure that the resulting dataset is
consistent and valid.
5. Handle missing or duplicate data: Check for any missing or duplicate data in the merged
dataset and handle them appropriately. You can choose to remove the records with missing
data or impute the missing values.
6. Perform calculations or analysis: Once the datasets are merged, you can perform any
necessary calculations or analysis on the resulting dataset. This could include aggregating
data, calculating averages, or performing statistical analysis.
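A hedged sketch of these steps follows; the file names and the 'id' key column are assumptions made for illustration:

```python
# Split and merge operations on datasets (assumed files and key column).
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')

# Split the dataframe into three roughly equal chunks
chunks = np.array_split(df, 3)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {len(chunk)} rows")

# Merge the chunks back together (row-wise concatenation)
combined = pd.concat(chunks, ignore_index=True)

# Merge two datasets on a common key column
left = pd.read_csv('customers.csv')
right = pd.read_csv('orders.csv')
merged = pd.merge(left, right, on='id', how='inner')

# Handle duplicates and missing values after merging
merged = merged.drop_duplicates().fillna(0)
print(merged.head())
```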
Suggested Reference:
1. https://docs.python.org/3/library/
2. "Python Data Science Handbook" by Jake VanderPlas.
3. "Python for Data Analysis" by Wes McKinney.
4. Pandas documentation.
5. NumPy documentation.
Rubrics (criteria and marks):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Marks: 1: ____ 2: ____ 3: ____ 4: ____ 5: ____ Total: ____
Experiment No: 7
Develop a program that shows the various data cleaning tasks over the dataset.
a) Identifying the null values. b) Identifying the empty values c) Identifying the
incorrect timestamp
Date:
Practical skills:
Objectives: (a) To identify and handle missing or incomplete data in the dataset.
(b) To identify and handle invalid or incorrect data in the dataset.
(c) To remove duplicate data in the dataset.
(d) To standardize data formats and values to ensure consistency across the dataset.
(e) To handle outliers and extreme values that may skew data analysis results.
(f) To ensure data accuracy and completeness for reliable data analysis.
(g) To improve data quality by reducing errors and inconsistencies in the dataset.
(h) To prepare the dataset for further analysis and modeling.
Theory:
Data cleaning is an essential step in the data preparation process that involves identifying and
handling missing, incorrect, or inconsistent data in the dataset. In Python, data cleaning is
typically performed using libraries such as NumPy and Pandas, which provide functions for data
manipulation and analysis.
The theory behind data cleaning in Python involves several key steps:
Importing data: The first step in data cleaning is to import the data into Python using the
appropriate library and data format. Common data formats include CSV, Excel, and JSON.
Identifying missing data: Once the data is imported, the next step is to identify missing data in the
dataset. This can be done using the isnull() function in Pandas, which returns a Boolean value
indicating whether a value is missing or not.
Handling missing data: Once missing data is identified, the next step is to handle it appropriately.
This can be done by either removing the rows or columns with missing values or imputing the
missing values with a suitable value such as the mean or median of the column.
Identifying incorrect data: After handling missing data, the next step is to identify incorrect data
in the dataset, such as values that are outside the expected range or format. This can be done
using statistical techniques such as data visualization and analysis.
Handling incorrect data: Once incorrect data is identified, the next step is to handle it
appropriately. This can be done by removing the outliers or replacing the incorrect values with a
suitable value such as the median or mode of the column.
Standardizing data formats and values: To ensure consistency across the dataset, it is often
necessary to standardize data formats and values. This can be done by converting data types,
renaming columns, or applying formatting rules.
Removing duplicates: Duplicate data can skew analysis results and should be removed from the
dataset. This can be done using the drop_duplicates() function in Pandas.
Quality control: The final step in data cleaning is to perform quality control checks to ensure that
the data is accurate, complete, and consistent. This involves comparing the cleaned dataset to the
original dataset and verifying that the data has been cleaned appropriately.
Safety and necessary Precautions:
1. Backup data.
2. Use secure and updated software.
3. Access control.
4. Data privacy.
5. Data encryption.
6. Error handling.
7. Test and validate.
Procedure:
1. Import the required libraries: Import the necessary libraries such as pandas, numpy, and
matplotlib to read, manipulate and visualize the dataset.
2. Load the dataset: Load the dataset into the program using a pandas dataframe.
3. Identify null values: Use the isnull() function to identify null values in the dataset. If any
null values are found, decide on a strategy to handle them. This could involve replacing
null values with a mean or median value, dropping the null values or imputing them with a
different value.
4. Identify empty values: Use a comparison such as (df == '') together with sum() to identify
empty values in the dataset. Empty values are those that contain nothing (for example an
empty string) but are not null, so isnull() does not catch them. If any empty values are found,
decide on a strategy to handle them. This could involve replacing empty values with a mean or
median value, dropping them, or imputing them with a different value.
5. Identify incorrect timestamps: Use the to_datetime() function (with errors='coerce') to
convert the timestamp column to datetime objects; invalid timestamps become NaT, which
identifies them. If any incorrect timestamp values are found, decide on a strategy to handle
them. This could involve dropping the rows with incorrect timestamp values or imputing them
with a different value.
6. Remove duplicates: Use the drop_duplicates() function to remove any duplicate rows in
the dataset.
7. Data normalization: Use the normalization technique to transform the data into a standard
format to make it more consistent and easier to analyze.
8. Data standardization: Use the standardization technique to transform the data into a
standard scale to make it more consistent and easier to analyze.
9. Save the cleaned dataset: Save the cleaned dataset to a new file for future use.
10. Visualize the cleaned dataset: Use matplotlib or other visualization libraries to create
visualizations of the cleaned dataset to better understand the data and identify any further
cleaning that may be required.
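A hedged sketch of the core cleaning steps follows; 'data.csv' and the 'timestamp' column name are placeholders for your actual dataset:

```python
# Data cleaning: nulls, empty values, bad timestamps, duplicates.
import pandas as pd

df = pd.read_csv('data.csv')

# a) Identify null values
print(df.isnull().sum())                 # null count per column

# b) Identify empty values (empty strings, which isnull() does not catch)
print((df == '').sum())

# c) Identify incorrect timestamps: invalid entries become NaT
df['timestamp'] = pd.to_datetime(df['timestamp'], errors='coerce')
print(df[df['timestamp'].isna()])        # rows with unparseable timestamps

# Remove duplicates and save the cleaned dataset
df = df.drop_duplicates()
df.to_csv('cleaned_data.csv', index=False)
```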
Suggested Reference:
1. "Data Cleaning with Python" course on DataCamp.
2. "Data Cleaning in Python: A Complete Guide" on Towards Data Science.
3. "Data Cleaning with Python and Pandas: Detecting Missing Values" on Real Python.
4. "Cleaning Data with Python" on Kaggle.
5. "Data Cleaning Techniques in Python" on Analytics Vidhya
Rubrics (criteria and marks):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Marks: 1: ____ 2: ____ 3: ____ 4: ____ 5: ____ Total: ____
Experiment No: 8
Develop a program that shows usage of the following NumPy array operations: a)
any() b) all() c) isnan() d) isinf() e) isfinite() f) zeros() g) isreal() h)
iscomplex() i) isscalar() j) less() k) greater() l) less_equal() m) greater_equal()
Date:
Objectives: (a) To perform complex mathematical and logical operations on large arrays and
matrices efficiently.
Theory:
NumPy is a popular Python library for scientific computing that provides efficient and powerful
array operations. It enables users to work with multidimensional arrays and perform a variety of
mathematical and logical operations on them.
Here are the explanations of some of the NumPy array operations mentioned in the question:
a) any(): It returns True if any of the elements of an array evaluate to True, and False otherwise.
b) all(): It returns True if all the elements of an array evaluate to True, and False otherwise.
c) isnan(): It returns an array of the same shape as the input array, with True where the
corresponding element of the input array is NaN (Not a Number), and False elsewhere.
d) isinf(): It returns an array of the same shape as the input array, with True where the
corresponding element of the input array is +/-inf (positive or negative infinity), and False
elsewhere.
e) isfinite(): It returns an array of the same shape as the input array, with True where the
corresponding element of the input array is finite (i.e., not NaN, +/-inf), and False elsewhere.
f) zeros(): It returns a new array of the specified shape and data type, filled with zeros.
g) isreal(): It returns an array of the same shape as the input array, with True where the
corresponding element of the input array is real, and False where it is complex.
h) iscomplex(): It returns an array of the same shape as the input array, with True where the
corresponding element of the input array has a non-zero imaginary part, and False where it is
real.
i) isscalar(): It returns True if the input is a scalar (i.e., a single value, not an array), and False
otherwise.
j) less(): It returns an array of the same shape as the input arrays, with True where the
corresponding element of the first input array is less than the corresponding element of the
second input array, and False otherwise.
k) greater(): It returns an array of the same shape as the input arrays, with True where the
corresponding element of the first input array is greater than the corresponding element of the
second input array, and False otherwise.
l) less_equal(): It returns an array of the same shape as the input arrays, with True where the
corresponding element of the first input array is less than or equal to the corresponding element
of the second input array, and False otherwise.
m) greater_equal(): It returns an array of the same shape as the input arrays, with True where the
corresponding element of the first input array is greater than or equal to the corresponding
element of the second input array, and False otherwise.
Procedure:
1. Import the NumPy library: To use NumPy array operations, you need to import the
NumPy library into your Python environment. You can do this using the import statement.
2. Create a NumPy array: You need to create a NumPy array to perform the various
operations. You can create an array using the np.array() function.
3. Use the array operations: Once you have created the array, you can use various NumPy
array operations such as any(), all(), isnan(), isinf(), isfinite(), zeros(), isreal(),
iscomplex(), isscalar(), less(), greater(), less_equal(), and greater_equal().
4. Print the output: After performing the operations, you should print the output to see the
results.
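A minimal sketch exercising each of the listed operations on small illustrative arrays:

```python
import numpy as np

a = np.array([1.0, np.nan, np.inf, 4.0])
b = np.array([2.0, 3.0, 1.0, 4.0])

print(np.any(a > 2))          # True: at least one element > 2
print(np.all(b > 0))          # True: every element > 0
print(np.isnan(a))            # [False  True False False]
print(np.isinf(a))            # [False False  True False]
print(np.isfinite(a))         # [ True False False False]
print(np.zeros((2, 3)))       # 2x3 array of zeros
print(np.isreal(np.array([1 + 0j, 2 + 3j])))     # [ True False]
print(np.iscomplex(np.array([1 + 0j, 2 + 3j])))  # [False  True]
print(np.isscalar(3.5))       # True: a single value, not an array
print(np.less(a, b))          # elementwise a < b
print(np.greater(a, b))       # elementwise a > b
print(np.less_equal(a, b))    # elementwise a <= b
print(np.greater_equal(a, b)) # elementwise a >= b
```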
Suggested Reference:
1. NumPy User Guide: https://numpy.org/doc/stable/user/index.html
2. NumPy Tutorial: https://www.tutorialspoint.com/numpy/index.htm
3. NumPy Cheat Sheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
4. NumPy Array Operations: https://www.geeksforgeeks.org/numpy-array-manipulation-python/
5. NumPy Array Operations and Functions: https://www.w3schools.com/python/numpy_array_operators.asp
Rubrics (criteria and marks):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Marks: 1: ____ 2: ____ 3: ____ 4: ____ 5: ____ Total: ____
Experiment No: 9
Develop a program that shows usage of the following NumPy library vector
functions: a) arange() b) reshape() c) linspace() d) randint() e) dot()
Date:
Theory:
Here is a brief theory for each of the NumPy vector functions:
a) arange(): This function is used to create a one-dimensional array with evenly spaced
values within a specified range. The function takes in three arguments: start (optional), stop, and
step (optional). The start argument is the starting value of the sequence (inclusive), the stop
argument is the ending value of the sequence (exclusive), and the step argument is the step size
between values. For example, np.arange(0, 10, 2) creates an array with values [0, 2, 4, 6, 8].
b) reshape(): This function is used to reshape an array into a new shape without changing its
data. The function takes in one argument: the new shape of the array, specified as a tuple of
integers. For example, np.reshape(my_array, (3, 4)) reshapes the array my_array into a 3x4
matrix.
c) linspace(): This function is used to create a one-dimensional array with evenly spaced
values between a specified range. The function takes in three arguments: start, stop, and num
(optional). The start argument is the starting value of the sequence, the stop argument is the
ending value of the sequence, and the num argument is the number of values to generate. For
example, np.linspace(0, 1, 5) creates an array with values [0., 0.25, 0.5, 0.75, 1.].
d) randint(): This function is used to generate an array of random integers within a specified
range. The function takes in three arguments: low (optional), high, and size (optional). The low
argument is the lower bound of the range (inclusive), the high argument is the upper bound of the
range (exclusive), and the size argument is the shape of the output array. For example,
np.random.randint(0, 10, size=(2, 3)) generates a 2x3 array of random integers between 0
(inclusive) and 10 (exclusive).
e) dot(): This function is used to perform matrix multiplication between two arrays. The
function takes in two arguments: the two arrays to be multiplied. The arrays must have
compatible shapes for matrix multiplication. For example, if A is a 2x3 array and B is a 3x2 array,
np.dot(A, B) performs matrix multiplication between A and B and returns a 2x2 array.
Overall, these NumPy vector functions are commonly used for manipulating and analyzing arrays
in scientific computing and data analysis. By using these functions in a program, you can
efficiently perform operations on large arrays and matrices in Python.
Procedure:
1. Import the NumPy library: Begin your program by importing the NumPy library using the
import statement.
2. Create an array: Create an array using one of the NumPy functions such as arange() or
linspace(). You can also create an array from an existing data source such as a CSV file.
3. Reshape the array: Use the reshape() function to reshape the array to the desired shape.
For example, you can reshape a one-dimensional array into a two-dimensional array.
4. Generate random numbers: Use the randint() function to generate an array of random
integers within a specified range.
5. Perform matrix multiplication: Use the dot() function to perform matrix multiplication
between two arrays.
6. Print the results: Print the resulting arrays to the console using the print() function.
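A minimal sketch of the five functions follows; note the spelling is np.arange, not "arrange":

```python
import numpy as np

a = np.arange(0, 12)             # [0, 1, ..., 11]
m = a.reshape(3, 4)              # reshape into a 3x4 matrix
print(m)

print(np.linspace(0, 1, 5))      # [0.  0.25 0.5  0.75 1.]

r = np.random.randint(0, 10, size=(2, 3))   # random ints in [0, 10)
print(r)

A = np.arange(6).reshape(2, 3)   # 2x3 matrix
B = np.arange(6).reshape(3, 2)   # 3x2 matrix
print(np.dot(A, B))              # 2x2 matrix product
```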
Suggested Reference:
1. https://numpy.org/doc/stable/
2. https://numpy.org/doc/stable/user/index.html
3. https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf
4. https://numpy.org/devdocs/user/quickstart.html
5. https://www.datacamp.com/community/tutorials/python-numpy-tutorial
Rubrics (criteria and marks):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Marks: 1: ____ 2: ____ 3: ____ 4: ____ 5: ____ Total: ____
Experiment No: 10
Write a program to display the plot below using the matplotlib library, for values
of X: [1, 2, 3, ..., 49] and values of Y (thrice of X): [3, 6, 9, 12, ..., 144, 147].
Date:
Practical skills:
● Ability to create different types of plots such as line plots, scatter plots, bar plots, etc.
● Ability to customize the appearance of plots including labels, colors, legends, and titles
● Ability to add text, annotations, and shapes to the plots
● Ability to work with multiple plots and subplots
● Ability to export plots in different file formats like png, pdf, svg, etc.
● Ability to integrate matplotlib with other Python libraries like NumPy and Pandas.
Objectives: (a) To create informative and visually appealing data visualizations that enable users
to explore, understand, and communicate complex data.
Theory:
Matplotlib is a Python library that provides a variety of tools for creating high-quality data
visualizations. It is one of the most popular data visualization libraries due to its ease of use and
versatility. The library is built on NumPy and provides a range of options for creating different
types of plots and graphs, including line plots, scatter plots, bar charts, histograms, and many
more.
pyplot module: This is the main module of Matplotlib, which provides a simple interface for
creating plots and charts. It is a collection of functions that allow users to create plots with
minimal coding.
Figure and Axes objects: The Figure object is the top-level container for all the plot elements. It
represents the entire plot and contains one or more Axes objects. The Axes object is the individual
plot area where data is plotted.
Plotting functions: Matplotlib provides a range of plotting functions that can be used to create
different types of plots and charts. These functions include plot(), scatter(), bar(), hist(), and many
more.
Customization options: Matplotlib allows users to customize the appearance of plots in various
ways, including changing the plot color, adding labels, titles, and legends, adjusting the axis
limits, and more.
To use Matplotlib, you first need to import the library and its pyplot module. Then, you can create
a figure object and one or more axes objects using the subplots() function. After that, you can use
the various plotting functions to create different types of plots and customize them as needed.
Overall, Matplotlib provides a powerful and flexible tool for creating data visualizations in
Python. With its wide range of options and customization features, it can be used for a variety of
data analysis and communication tasks.
Procedure:
1. Import the required libraries - Matplotlib and NumPy.
2. Create two NumPy arrays for X and Y values using np.arange() and multiplication.
3. Create a figure and an axis object using plt.subplots().
4. Use ax.plot() function to plot X and Y values as a line plot.
5. Customize the plot with axis labels and a title.
6. Display the plot using plt.show() function.
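A minimal sketch of this procedure, matching the required data (X = 1..49, Y = 3X); axis labels and the title are illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(1, 50)        # [1, 2, ..., 49]
y = 3 * x                   # [3, 6, ..., 147]

fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel("X values")
ax.set_ylabel("Y values (thrice of X)")
ax.set_title("Line plot of Y = 3X")
plt.show()
```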
Suggested Reference:
1. https://matplotlib.org/stable/index.html
2. https://realpython.com/python-matplotlib-guide/
3. Matplotlib Tutorial by Corey Schafer: https://www.youtube.com/playlist?list=PL-osiE80TeTvipOqomVEeZ1HRrcEvtZB_
4. Python Data Science Handbook by Jake VanderPlas:
https://jakevdp.github.io/PythonDataScienceHandbook/
5. Mastering Matplotlib by Duncan M. McGreggor and Paul Ivanov:
https://www.packtpub.com/product/mastering-matplotlib-second-edition/9781800565547
Rubrics (criteria and marks):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Marks: 1: ____ 2: ____ 3: ____ 4: ____ 5: ____ Total: ____
Experiment No: 11
Write a program to display the bar plot below using the matplotlib library, for the values:
Languages = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++']
Popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7]
Date:
Practical skills:
● Ability to create different types of plots such as line plots, scatter plots, bar plots, etc.
● Ability to customize the appearance of plots including labels, colors, legends, and titles
● Ability to add text, annotations, and shapes to the plots
● Ability to work with multiple plots and subplots
● Ability to export plots in different file formats like png, pdf, svg, etc.
● Ability to integrate matplotlib with other Python libraries like NumPy and Pandas.
Objectives: (a) To learn how to interpret and analyze data visualizations, and to use them to draw
insights and make informed decisions.
Theory:
A bar plot is a type of chart that displays data as rectangular bars. The length or height of each bar
is proportional to the value of the data it represents. Bar plots are useful for comparing the values
of different categories or groups.
Matplotlib is a popular data visualization library in Python that provides a wide range of functions
for creating different types of plots, including bar plots.
Use the bar() function to create the bar plot by passing the languages and popularity lists as
arguments. The bar() function automatically generates the rectangular bars for each category and
sets their lengths proportional to the values in the popularity list.
Procedure:
1. Define the data for the plot as lists or arrays.
2. Use the bar() function to create the plot, passing the data as arguments.
3. Customize the plot by changing the colors, labels, and other attributes.
4. Add a title and labels to the plot to provide context and improve its readability.
5. Display the plot using the show() function.
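A minimal sketch of this procedure for the given data; the color choice and label text are illustrative:

```python
import matplotlib.pyplot as plt

languages = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++']
popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7]

plt.bar(languages, popularity, color='steelblue')
plt.xlabel("Programming Language")
plt.ylabel("Popularity (%)")
plt.title("Popularity of Programming Languages")
plt.show()
```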
Suggested Reference:
1. https://matplotlib.org/stable/index.html
2. https://realpython.com/python-matplotlib-guide/
3. Matplotlib Tutorial by Corey Schafer: https://www.youtube.com/playlist?list=PL-osiE80TeTvipOqomVEeZ1HRrcEvtZB_
4. Python Data Science Handbook by Jake VanderPlas:
https://jakevdp.github.io/PythonDataScienceHandbook/
5. Mastering Matplotlib by Duncan M. McGreggor and Paul Ivanov:
https://www.packtpub.com/product/mastering-matplotlib-second-edition/9781800565547
Rubrics (criteria and marks):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Marks: 1: ____ 2: ____ 3: ____ 4: ____ 5: ____ Total: ____
Experiment No: 12
Write a program to display a pie plot using the matplotlib library for the data
below:
Languages = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++']
Popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7]
Colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728",
"#9467bd", "#8c564b"]
Date:
Practical skills:
● Ability to create different types of plots such as line plots, scatter plots, bar plots, etc.
● Ability to customize the appearance of plots including labels, colors, legends, and titles
● Ability to add text, annotations, and shapes to the plots
● Ability to work with multiple plots and subplots
● Ability to export plots in different file formats like png, pdf, svg, etc.
● Ability to integrate matplotlib with other Python libraries like NumPy and Pandas.
Objectives: (a) To learn how to interpret and analyze data visualizations, and to use them to draw
insights and make informed decisions.
Theory:
A pie plot (pie chart) is a circular chart divided into slices, where the size (angle) of each slice
is proportional to the value of the data it represents. Pie charts are useful for showing the
relative proportions of different categories within a whole.
Matplotlib is a popular data visualization library in Python that provides a wide range of functions
for creating different types of plots, including pie charts.
Use the pie() function to create the pie chart by passing the popularity list as data and the
languages list as labels. The pie() function sizes each slice proportionally to the values in the
popularity list; the slice colors can be set with the colors argument, and the autopct argument
can display the percentage value on each slice.
Procedure:
1. Import the necessary libraries (matplotlib.pyplot)
2. Define the data to be used (Languages, Popularity, Colors)
3. Create a figure object and set the figure size
4. Define the title of the plot and add the data to be displayed (Popularity) and their
corresponding labels (Languages)
5. Set the colors of the pie chart using the Colors list
6. Add a legend to the chart with the labels and colors used
7. Display the plot.
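A minimal sketch of this procedure using the given data and colors; the figure size and title are illustrative:

```python
import matplotlib.pyplot as plt

languages = ['Java', 'Python', 'PHP', 'JavaScript', 'C#', 'C++']
popularity = [22.2, 17.6, 8.8, 8, 7.7, 6.7]
colors = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728",
          "#9467bd", "#8c564b"]

plt.figure(figsize=(6, 6))
plt.pie(popularity, labels=languages, colors=colors,
        autopct='%1.1f%%', startangle=90)
plt.title("Popularity of Programming Languages")
plt.legend(languages, loc="best")
plt.show()
```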
Suggested Reference:
1. https://matplotlib.org/stable/index.html
2. https://realpython.com/python-matplotlib-guide/
3. Matplotlib Tutorial by Corey Schafer: https://www.youtube.com/playlist?list=PL-osiE80TeTvipOqomVEeZ1HRrcEvtZB_
4. Python Data Science Handbook by Jake VanderPlas:
https://jakevdp.github.io/PythonDataScienceHandbook/
5. Mastering Matplotlib by Duncan M. McGreggor and Paul Ivanov:
https://www.packtpub.com/product/mastering-matplotlib-second-edition/9781800565547
Rubrics (criteria and marks):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Marks: 1: ____ 2: ____ 3: ____ 4: ____ 5: ____ Total: ____
Experiment No: 13
Write a program to display a scatter plot using the matplotlib library, for 200
random points for both X and Y.
Date:
Practical skills:
● Ability to create different types of plots such as line plots, scatter plots, bar plots, etc.
● Ability to customize the appearance of plots including labels, colors, legends, and titles
● Ability to add text, annotations, and shapes to the plots
● Ability to work with multiple plots and subplots
● Ability to export plots in different file formats like png, pdf, svg, etc.
● Ability to integrate matplotlib with other Python libraries like NumPy and Pandas.
Objectives: (a) To learn how to interpret and analyze data visualizations, and to use them to draw
insights and make informed decisions.
Theory:
In Matplotlib, a scatter plot is a chart type that displays data as a collection of points with the
position determined by the values of two variables. Each point on the scatter plot represents an
observation, and the position of the point on the X-Y axis is determined by the values of the two
variables.
A scatter plot is useful for exploring the relationship between two continuous variables. It can be
used to identify patterns or trends in the data and to detect the presence of outliers or unusual
observations. Scatter plots can also be used to assess the correlation between the two variables.
Matplotlib provides the scatter() function for creating scatter plots. The function takes two arrays,
one for the X-axis data and one for the Y-axis data, as its input arguments. Additional parameters
can be used to customize the appearance of the scatter plot, such as the color, size, and
transparency of the points.
Procedure:
1. Import necessary libraries: We will need the Matplotlib and NumPy libraries for this task.
2. Generate random data for the X and Y axes: We can use the NumPy library to generate
random data for both the X and Y axes
3. Create a scatter plot: We can use the scatter method of the Matplotlib library to create a
scatter plot. We need to pass the X and Y data as arguments and specify the marker style
and color using the marker and c parameters, respectively
4. Add title and labels: We can add a title and labels for the X and Y axes using the title,
xlabel, and ylabel methods of the Matplotlib library.
5. Set axes limits: We can set the limits for the X and Y axes using the xlim and ylim
methods of the Matplotlib library.
6. Display the plot: We can display the plot using the show method of the Matplotlib library.
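A minimal sketch of this procedure; the marker style, color, and axis limits are illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.random.rand(200)     # 200 random values in [0, 1)
y = np.random.rand(200)

plt.scatter(x, y, marker='o', c='teal', alpha=0.6)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatter plot of 200 random points")
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.show()
```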
Suggested Reference:
1. https://matplotlib.org/stable/index.html
2. https://realpython.com/python-matplotlib-guide/
3. Matplotlib Tutorial by Corey Schafer: https://www.youtube.com/playlist?list=PL-osiE80TeTvipOqomVEeZ1HRrcEvtZB_
4. Python Data Science Handbook by Jake VanderPlas:
https://jakevdp.github.io/PythonDataScienceHandbook/
5. Mastering Matplotlib by Duncan M. McGreggor and Paul Ivanov:
https://www.packtpub.com/product/mastering-matplotlib-second-edition/9781800565547
Rubrics (criteria and marks):
1. Knowledge of subject: Good (2) / Average (1)
2. Programming Skill: Good (2) / Average (1)
3. Team work: Good (2) / Satisfactory (1)
4. Communication Skill: Good (2) / Satisfactory (1)
5. Ethics: Good (2) / Average (1)
Marks: 1: ____ 2: ____ 3: ____ 4: ____ 5: ____ Total: ____
Experiment No: 14
Develop a program that reads the dataset stored in the file at the URL below and
plots the data:
(https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=true)
Date:
Practical skills:
Objectives: (a) To analyze and visualize the data in an efficient and effective way.
(b) To identify patterns, trends, and outliers in the data.
Theory:
Reading a .csv file from a URL and plotting the data is a common data analysis and visualization
task in many fields. Here are the main steps involved in this process:
Importing the necessary libraries: To read and plot the .csv file, we typically use the pandas and
matplotlib libraries. We need to import them at the beginning of our program.
Loading the data from the URL: We can use the pandas library's read_csv function to read the
data from the URL. We need to provide the URL of the .csv file as an argument to this function.
Data cleaning and preparation: Once we have loaded the data, we may need to clean and prepare
it for visualization. This may include dropping unnecessary columns, filling missing values, and
transforming the data.
Data visualization: Once the data is cleaned and prepared, we can use matplotlib's various plotting
functions to create visualizations such as line plots, scatter plots, bar plots, and more. We can
customize the plot with various parameters such as colors, labels, titles, and more.
Displaying the plot: After creating the plot, we need to display it using the show function
provided by the matplotlib library.
Safety and necessary Precautions:
1. Validate inputs.
2. Handle errors.
3. Secure the program.
4. Optimize performance.
5. Test and review.
Procedure:
1. Import the necessary libraries: You will need the pandas library to read the .csv file, and
matplotlib library to create the plot.
2. Read the .csv file from the URL: Use the pandas library to read the .csv file from the URL
and store it as a DataFrame object.
3. Preprocess the data: Preprocess the data as required. This may involve cleaning the data,
removing duplicates, handling missing values, and converting data types.
4. Visualize the data: Use the matplotlib library to create a visualization of the data. You can
create scatter plots, line graphs, histograms, and other types of visualizations based on the
data.
5. Save or display the visualization: Save the visualization to a file or display it on the
screen, depending on the user requirements.
6. Test and validate the program: Test the program thoroughly to ensure that it works as
expected for various input datasets. Validate the results against the expected output and fix
any issues or errors.
7. Document the program: Document the program by providing clear and concise comments
in the code and a user manual that explains how to use the program.
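A sketch of the whole program under these steps. Because the linked sample file is an Excel workbook, read_excel (with the openpyxl package installed) is used here; the column names "name" and "ext price" are assumptions about this particular sample dataset and should be adapted to the actual data.

import pandas as pd
import matplotlib.pyplot as plt

# Step 2: read the file from the URL into a DataFrame.
url = ("https://github.com/chris1610/pbpython/blob/master/data/"
       "sample-salesv3.xlsx?raw=true")
df = pd.read_excel(url)

# Step 3: light preprocessing -- drop duplicates and rows with missing values.
df = df.drop_duplicates().dropna()

# Step 4: visualize -- total of a numeric column grouped by a category.
# "name" and "ext price" are assumed column names for this sample dataset.
totals = df.groupby("name")["ext price"].sum().sort_values()
totals.plot(kind="barh", title="Total sales by customer")
plt.xlabel("Total sales")
plt.tight_layout()

# Step 5: display the plot (or save it with plt.savefig("sales.png")).
plt.show()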
Suggested Reference:
1. Pandas documentation on reading a CSV file from a URL:
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-csv-files
2. Matplotlib documentation on creating plots:
https://matplotlib.org/stable/tutorials/introductory/pyplot.html
3. Real Python tutorial on reading and writing CSV files in Python:
https://realpython.com/python-csv/
4. DataCamp tutorial on data visualization with Matplotlib:
https://www.datacamp.com/community/tutorials/matplotlib-tutorial-python
5. Towards Data Science tutorial on creating visualizations with Pandas and Matplotlib:
https://towardsdatascience.com/data-visualization-with-pandas-and-matplotlib-8dadc69f2f79
Rubrics (marks per rubric):
Rubric 1: Good (2) / Average (1)
Rubric 2: Good (2) / Average (1)
Rubric 3: Good (2) / Satisfactory (1)
Rubric 4: Good (2) / Satisfactory (1)
Rubric 5: Good (2) / Average (1)
Total Marks: __________
Experiment No: 15
Write a text classification pipeline using a custom preprocessor and
CharNGramAnalyzer using data from Wikipedia articles as a training set.
Evaluate the performance on some held-out test sets.
Date:
Practical skills:
Objectives: (a) To develop a machine learning model that can accurately classify text
documents into predefined categories, for use in applications such as sentiment analysis,
spam detection, and topic modeling.
Theory:
Text classification is the task of assigning predefined categories or labels to text documents based
on their content. A text classification pipeline typically consists of several stages, including data
preprocessing, feature extraction, model training, and evaluation.
In the context of Wikipedia articles, the first step in building a text classification pipeline is to
collect a dataset of articles with their corresponding labels. These labels can be either manually
assigned or obtained from existing metadata such as categories or tags.
Once a dataset is obtained, the next step is data preprocessing. This typically involves text
normalization, tokenization, stop word removal, and stemming/lemmatization. The goal of data
preprocessing is to clean the text and reduce its dimensionality while retaining the relevant
information for classification.
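As a minimal sketch of these preprocessing steps with NLTK (it assumes the tokenizer and stopword resources have already been fetched via nltk.download):

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess(text):
    tokens = nltk.word_tokenize(text.lower())   # normalize case + tokenize
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    # Keep alphabetic, non-stopword tokens, reduced to their stems.
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop]

print(preprocess("The cats were running quickly through the gardens."))
# -> ['cat', 'run', 'quickli', 'garden']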
After preprocessing, the text is converted into numerical features that can be used as input to a
machine learning model. A popular technique for feature extraction is the bag-of-words model,
which represents each document as a vector of word frequencies. However, this approach may not
capture the semantic meaning of words and their relationships in the text.
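A tiny illustration of the bag-of-words representation with scikit-learn's CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]
vect = CountVectorizer()
X = vect.fit_transform(docs)

print(vect.get_feature_names_out())  # ['cat' 'mat' 'on' 'sat' 'the']
print(X.toarray())                   # [[1 0 0 1 1]
                                     #  [1 1 1 1 2]]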
The final stage in the text classification pipeline is model training and evaluation. A common
approach is to use supervised learning algorithms such as Naive Bayes, Logistic Regression, or
Support Vector Machines. The performance of the model is evaluated using metrics such as
accuracy, precision, recall, and F1 score on held-out test sets.
Precautions:
1. Data privacy.
2. Bias and fairness.
3. Model accuracy and reliability.
4. Ethical considerations.
5. Test and review.
Procedure:
Collect and preprocess the data: Download a set of Wikipedia articles that represent the different
categories you want to classify (e.g., sports, politics, entertainment). Preprocess the data by
removing any unnecessary characters, converting all text to lowercase, and removing any stop
words.
Split the data: Split the preprocessed data into two sets: training and test sets. The training set will
be used to train the model, while the test set will be used to evaluate the model's performance.
Feature extraction: Extract the features from the preprocessed text using CharNGramAnalyzer.
This will convert each text document into a vector of features that can be used as input to the
classification model.
Train the model: Train a text classification model using the extracted features and the training set.
You can use any machine learning algorithm, such as Naive Bayes, SVM, or Neural Networks.
Evaluate the model: Use the trained model to classify the test set and evaluate its performance
using metrics such as accuracy, precision, recall, and F1-score.
Tune the model: If the model's performance is not satisfactory, you can tune the hyperparameters
of the algorithm or try different algorithms to improve its performance.
Deploy the model: Once you are satisfied with the model's performance, you can deploy it in
production to classify new text documents.
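CharNGramAnalyzer comes from an early scikit-learn tutorial exercise; in current scikit-learn the same idea is expressed with the analyzer="char_wb" option of TfidfVectorizer. Below is a minimal sketch of the pipeline under that assumption, with six stand-in sentences in place of real Wikipedia articles:

import re
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

def preprocess(text):
    # Custom preprocessor: lowercase and keep letters and whitespace only.
    return re.sub(r"[^a-z\s]", " ", text.lower())

# Stand-in corpus; replace with Wikipedia article texts and category labels.
texts = [
    "The striker scored twice in the final match",
    "The league published the fixtures for next season",
    "Parliament passed the new budget bill today",
    "The minister resigned after the election results",
    "The film premiere drew a huge celebrity crowd",
    "The band announced a world tour next summer",
]
labels = ["sports", "sports", "politics", "politics",
          "entertainment", "entertainment"]

# analyzer="char_wb" builds character n-grams within word boundaries --
# the modern counterpart of the older CharNGramAnalyzer.
pipeline = Pipeline([
    ("vect", TfidfVectorizer(preprocessor=preprocess,
                             analyzer="char_wb", ngram_range=(2, 5))),
    ("clf", LogisticRegression(max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=3, stratify=labels, random_state=0)
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test), zero_division=0))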
Suggested Reference:
1. "Building a Text Classification Pipeline with Python" by Dipanjan Sarkar: This article
provides a step-by-step guide on how to build a text classification pipeline using Python
and scikit-learn library. It covers preprocessing techniques, feature extraction, model
selection, and evaluation.
2. "Text Classification with NLTK and Scikit-Learn" by Ahmed Besbes: This tutorial
provides a detailed guide on how to perform text classification using Python and two
popular libraries, NLTK and scikit-learn. It covers data preprocessing, feature extraction,
and model training and evaluation.
3. "Using Wikipedia Articles for Text Classification" by Nikolay Krylov: This article
demonstrates how to use Wikipedia articles as a training set for text classification. It
covers data collection, preprocessing, feature extraction using TF-IDF and
CharNGramAnalyzer, model training, and evaluation.
4. "Text Classification with Python and Scikit-Learn" by Sebastian Raschka: This book
chapter provides a comprehensive guide on how to perform text classification using
Python and scikit-learn. It covers data preprocessing, feature extraction, model training,
and evaluation, as well as advanced topics such as model selection and parameter tuning.
5. "A Complete Tutorial on Text Classification using Naive Bayes Algorithm" by Divya
Gupta: This tutorial provides a detailed guide on how to perform text classification using
Naive Bayes algorithm in Python. It covers data preprocessing, feature extraction, model
training and evaluation, as well as parameter tuning.
Rubrics (marks per rubric):
Rubric 1: Good (2) / Average (1)
Rubric 2: Good (2) / Average (1)
Rubric 3: Good (2) / Satisfactory (1)
Rubric 4: Good (2) / Satisfactory (1)
Rubric 5: Good (2) / Average (1)
Total Marks: __________
Experiment No: 16
Write a text classification pipeline to classify movie reviews as either positive or
negative.
Find a good set of parameters using grid search.
Evaluate the performance on a held-out test set.
Date:
Theory:
The theory behind writing a text classification pipeline to classify movie reviews as either
positive or negative involves several key steps:
Data preprocessing: This step involves cleaning and preparing the raw text data by removing stop
words, converting text to lowercase, and performing stemming or lemmatization.
Feature extraction: This step involves converting the preprocessed text data into a numerical
representation that can be used as input to a machine learning algorithm. Common techniques
include Bag-of-Words, TF-IDF, and Word Embeddings.
Model selection and training: This step involves selecting an appropriate machine learning
algorithm and training it on the preprocessed and transformed data. Popular algorithms include
Naive Bayes, Support Vector Machines, and Neural Networks.
Hyperparameter tuning: This step involves selecting the optimal hyperparameters for the chosen
machine learning algorithm. This can be done using techniques such as grid search or random
search.
Evaluation: This step involves evaluating the performance of the trained model on a held-out test
set. This can be done using metrics such as accuracy, precision, recall, and F1-score.
Deployment: This step involves deploying the trained model in a production environment, where
it can be used to classify new movie reviews.
Grid search is a hyperparameter tuning technique that involves searching for the optimal set of
hyperparameters for a given machine learning algorithm by exhaustively trying all possible
combinations of hyperparameter values. This can be done by training and evaluating the model
with different combinations of hyperparameters on a validation set, and selecting the combination
that yields the best performance.
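For instance, with two hyperparameters the search space multiplies out; scikit-learn's ParameterGrid makes the combinatorics explicit (the parameter names below are illustrative pipeline settings):

from sklearn.model_selection import ParameterGrid

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # 2 options
    "clf__C": [0.1, 1, 10],                  # 3 options
}
# Grid search tries every combination exhaustively: 2 x 3 = 6 candidate
# models, each evaluated on every cross-validation fold.
print(len(ParameterGrid(param_grid)))  # 6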
Evaluating the performance of the trained model on a held-out test set is important to ensure that
the model generalizes well to new, unseen data. This helps to avoid overfitting, where the model
performs well on the training data but poorly on new data.
Overall, the theory behind writing a text classification pipeline to classify movie reviews as either
positive or negative involves a combination of data preprocessing, feature extraction, model
selection and training, hyperparameter tuning, evaluation, and deployment.
1. Data preprocessing
2. Feature extraction
3. Model selection
4. Hyperparameter tuning
5. Evaluation
Procedure:
1. Preprocess the data: Preprocess the movie review data by cleaning the text, removing stop
words, and performing stemming or lemmatization to reduce the dimensionality of the
feature space.
2. Split the data: Split the preprocessed data into training, validation, and test sets. The
training set will be used to train the model, the validation set will be used to tune the
hyperparameters, and the test set will be used to evaluate the final performance of the
model.
3. Extract features: Extract features from the preprocessed text using techniques such as Bag-
of-Words, TF-IDF, or Word Embeddings. This will convert the text data into a numerical
representation that can be used as input to a machine learning algorithm.
4. Select a model: Choose a suitable machine learning algorithm, such as Naive Bayes,
Support Vector Machines, or Neural Networks, and train it on the preprocessed and
transformed data.
5. Hyperparameter tuning: Use grid search to find the best set of hyperparameters for the
chosen machine learning algorithm. This involves training and evaluating the model with
different combinations of hyperparameters on the validation set, and selecting the
combination that yields the best performance.
6. Evaluate the model: Evaluate the performance of the trained model on the held-out test set
using metrics such as accuracy, precision, recall, and F1-score.
7. Deploy the model: Deploy the trained model in a production environment, where it can be
used to classify new movie reviews.
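A condensed sketch of steps 1-6 follows. It assumes, purely for illustration, that the reviews sit on disk in one folder per class (e.g. reviews/pos and reviews/neg), the layout scikit-learn's load_files expects; the pipeline and grid values are illustrative choices.

from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Steps 1-2: load the reviews and hold out 20% as a final test set.
data = load_files("reviews", encoding="utf-8")  # hypothetical folder layout
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# Steps 3-4: TF-IDF features feeding a logistic regression classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Step 5: grid search with cross-validation on the training portion only,
# so the held-out test set stays untouched until the end.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)

# Step 6: final evaluation on the held-out test set.
print(classification_report(y_test, search.predict(X_test),
                            target_names=data.target_names))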
Suggested Reference:
1. "Introduction to Machine Learning with Python" by Andreas C. Müller and Sarah Guido -
This book provides a comprehensive introduction to machine learning and includes a
section on text classification. It covers topics such as preprocessing text data, feature
extraction, and model evaluation.
2. "Natural Language Processing with Python" by Steven Bird, Ewan Klein, and Edward
Loper - This book provides an introduction to natural language processing and includes a
section on text classification. It covers topics such as feature selection, training classifiers,
and evaluation metrics.
4. "Text Classification in Python using spaCy" by Dipanjan Sarkar - This tutorial provides an
introduction to text classification using spaCy, a popular NLP library in Python. It covers
topics such as preprocessing text data, feature extraction, model selection, and
hyperparameter tuning.
Rubrics (marks per rubric):
Rubric 1: Good (2) / Average (1)
Rubric 2: Good (2) / Average (1)
Rubric 3: Good (2) / Satisfactory (1)
Rubric 4: Good (2) / Satisfactory (1)
Rubric 5: Good (2) / Average (1)
Total Marks: __________