PWP Project Done

The micro-project titled 'Remove Duplicate In Dataset' aims to clean datasets by identifying and removing duplicate entries using Python's pandas library. The project outlines a structured methodology, including a timeline for completion and the benefits of improved data quality and processing efficiency. It emphasizes the importance of data cleaning in ensuring accurate analysis and model performance in data science and machine learning.


SHREEYASH PRATISHTHAN’S

SHREEYASH COLLEGE OF ENGINEERING AND TECHNOLOGY


(POLYTECHNIC), CHH. SAMBHAJINAGAR

MICRO-PROJECT REPORT

NAME OF DEPARTMENT:- COMPUTER ENGINEERING


ACADEMIC YEAR:- 2024-25
SEMESTER:- 6TH
COURSE NAME:- Programming with Python
COURSE CODE:- 22616
MICRO-PROJECT TITLE:- Remove Duplicate in Dataset
BY:-
1) KALPESH RAVASAHEB KACHOLE EN. NO. 2210920112

UNDER THE GUIDANCE OF:- P.P.Angadi


MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION, MUMBAI

CERTIFICATE
This is to certify that Mr. KALPESH RAVASAHEB KACHOLE of 6TH Semester of
Diploma in COMPUTER ENGINEERING of Institute SHREEYASH COLLEGE OF
ENGINEERING AND TECHNOLOGY
has successfully completed Micro-Project Work in the Course PWP (22616) for the academic year
2024-25 as prescribed in the I-Scheme Curriculum.

Date:- Enrollment No:- 2210920112


Place:- Chh. Sambhajinagar Exam Seat No.:- 464660

Signature Signature Signature


Guide HOD Principal
P.P.Angadi A.C.Naik S.S.Khandagale
ACKNOWLEDGEMENT
We wish to express our profound gratitude to our guide
Prof. P.P.ANGADI who guided us endlessly in framing and completion of
Micro-Project. He/she guided us on all the main points of the Micro-Project.
We are indebted to his/her constant encouragement, cooperation, and help. It was
his/her enthusiastic support that helped us overcome various obstacles in
the Micro-Project.
We are also thankful to our Principal, HOD, Faculty Members
and classmates for extending their support and motivation in the completion of
this Micro-Project.

1) KALPESH RAVASAHEB KACHOLE EN. NO. 2210920112


Annexure-1
Micro-Project Proposal
(Format for Micro-Project Proposal, about 1-2 pages)

Title of Micro-Project:- Remove Duplicate in Dataset

1.0 Aims/Benefits of the Micro-Project (minimum 30-50 words)


The aim of this micro-project is to clean a dataset by identifying and removing duplicate
records using Python's pandas library. Duplicate entries are common in real-world data and
can lead to incorrect analysis, skewed results, and poor model performance. Automating their
removal improves data quality and processing efficiency, reduces storage requirements, and
prepares the dataset for accurate analysis in data science and machine learning tasks.

2.0 Course Outcomes Addressed


a) Develop Python programs using standard libraries such as pandas.
b) Perform file input/output operations (reading and writing CSV files).
c) Apply built-in data-handling functions to detect and remove duplicate records.
d) Interpret and present program output for data-cleaning tasks.
3.0 Proposed Methodology (Procedure in brief that will be followed to do the micro-project, in
about 100 to 200 words)

The project uses Python's pandas library to load, inspect, and clean a dataset. The script
reads a CSV file into a DataFrame, detects duplicate rows using the built-in duplicated()
function, and removes them with drop_duplicates(), optionally logging the removed rows to a
separate file for reference. The cleaned dataset is then saved as a new CSV file for further
use.
The work proceeds in stages: finalizing the topic, reviewing literature on data cleaning,
collecting a sample dataset, writing and testing the Python script, verifying the result by
comparing row counts before and after cleaning, and preparing the project report. Duplicate
removal is a foundational step in any data preprocessing pipeline, since redundant records can
distort statistics and degrade model performance; the script is designed to be reusable on any
dataset supplied in CSV format.
4.0 Action Plan (Sequence and time required for major activities. The following is for
reference; activities can be added / reduced / modified.)

Sr. No. | Week  | Details of activity                                | Planned start | Planned finish | Responsible team member
1       | 1 & 2 | Discussion & finalization of topic                 | 14/1/2025     | 15/1/2025      | Kalpesh Kachole
2       | 3     | Preparation of the abstract                        | 20/1/2025     | 21/1/2025      | Kalpesh Kachole
3       | 4     | Literature review                                  | 25/1/2025     | 5/2/2025       | Kalpesh Kachole
4       | 5     | Submission of micro-project proposal (Annexure-I)  | 7/2/2025      | 10/2/2025      | Kalpesh Kachole
5       | 6     | Collection of information about the topic          | 11/2/2025     | 13/2/2025      | Kalpesh Kachole
6       | 7     | Collection of relevant content / materials for the execution of the micro-project | 20/2/2025 | 21/2/2025 | Kalpesh Kachole
7       | 8     | Discussion and submission of outline of the micro-project | 14/3/2025 | 16/3/2025     | Kalpesh Kachole
8       | 9     | Analysis / execution of collected data and preparation of prototypes / drawings / photos / charts / graphs / tables / circuits / models / programs etc. | 22/3/2025 | 25/3/2025 | Kalpesh Kachole
9       | 10    | Completion of contents of project report           | 2/4/2025      | 3/4/2025       | Kalpesh Kachole
10      | 11    | Completion of weekly progress report               | 5/4/2025      | 7/4/2025       | Kalpesh Kachole
11      | 12    | Completion of project report (Annexure-II)         | 20/4/2025     | 22/4/2025      | Kalpesh Kachole
12      | 13    | Viva voce / delivery of presentation               | 7/5/2025      | 10/5/2025      | Kalpesh Kachole

5.0 Resources Required (major resources such as raw material, some machining facility, software, etc.)

Sr. No. | Name of Resources / Materials | Specification | Qty | Remarks

Names of Team Members with En. Nos.

1. KALPESH RAVASAHEB KACHOLE En.No :- 2210920112

INDEX

SR. NO | TOPIC
1      | Information
2      | Advantages
3      | Specification
4      | Purpose
5      | Project Code
6      | Example Output
7      | Future Scope
8      | Conclusion
9      | References

Information
This project aims to clean a dataset by identifying and removing
duplicate records using Python. Duplicate data entries are common in
real-world datasets and can lead to incorrect analysis, skewed results,
and poor model performance in data science and machine learning
tasks. Therefore, data cleaning is a critical step in any data processing
pipeline.
In this project, the Python pandas library is used to load, inspect, and
manipulate the dataset. The script reads a CSV file, detects duplicate
rows using built-in functions, and removes them while optionally
logging the duplicates for reference. The cleaned dataset is then saved
as a new CSV file for further use.
By automating the process of duplicate removal, this project ensures
improved data quality and prepares the dataset for accurate analysis. It
is a foundational step in data preprocessing and is especially useful in
fields like business analytics, data science, and machine learning.
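The detection and removal steps described above can be seen on a small in-memory example (the rows here are made up for illustration and are not part of the project's dataset):

```python
import pandas as pd

# Three illustrative rows; row 2 is an exact repeat of row 0.
df = pd.DataFrame({
    "Name": ["John", "Jane", "John"],
    "Email": ["john@example.com", "jane@example.com", "john@example.com"],
})

# duplicated() marks every repeat of an earlier row as True;
# drop_duplicates() keeps only the first occurrence of each row.
print("Duplicate rows:", df.duplicated().sum())   # 1
df_clean = df.drop_duplicates()
print("Cleaned shape:", df_clean.shape)           # (2, 2)
```

The same two calls scale unchanged from this toy frame to a full CSV loaded with pd.read_csv().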

Advantages
Improved Data Quality
Removing duplicates ensures your dataset is clean, accurate, and
reliable for analysis or modeling.
Faster Processing
With fewer redundant entries, data processing, analysis, and training
models become faster and more efficient.
Reduces Data Size
Removing duplicate rows reduces the overall size of the dataset,
saving storage space and memory.
Accurate Insights
Clean data leads to more accurate statistical analysis and
visualizations, preventing misleading conclusions.
Easy Implementation
Using Python and pandas makes the process quick and simple, even
for beginners in data science.

Specification
1. Functional Requirements:
 Input:
o The script accepts a CSV file as input, which contains the
dataset to be processed.
o The file should have one or more columns with data that
may contain duplicates.
 Processing:
o The script reads the dataset into a pandas DataFrame.
o It identifies duplicate rows based on all columns or
specific columns (if specified).
o The duplicates are removed, leaving only unique rows in
the dataset.
o An optional log of removed duplicates can be created for
auditing purposes.
 Output:
o A cleaned CSV file containing the dataset without
duplicates.
o A log file (optional) with details of the removed duplicate
rows.
2. Technical Specifications:
 Programming Language: Python
 Libraries:
o pandas: For handling and manipulating the dataset.
o numpy (optional): For advanced data operations (if
needed).
 Input File Format: CSV (Comma-Separated Values)

o Example columns: Name, Email, Address, Phone Number,
etc.
 Output File Format: CSV
o The cleaned dataset will be saved as a new CSV file
(cleaned_data.csv).
o A log of duplicates (if requested) will be saved as a
separate CSV (duplicates_log.csv).
3. Features:
 Detects Duplicates:
o Finds rows with identical data across all columns or
specific columns.
 Drop Duplicates:
o Removes duplicate entries, retaining only the first
occurrence.
o Option to retain the last occurrence or both if specified.
 Logging:
o Logs the removed duplicates in a separate file for
transparency (optional).
 Reusable:
o The script can be reused on different datasets without any
code changes, provided they are in CSV format.
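As a sketch of the "first / last / both" behaviour listed above: the keep parameter of drop_duplicates() controls which occurrence survives, and subset restricts the comparison to chosen columns. The rows below are hypothetical examples:

```python
import pandas as pd

# Hypothetical records: rows 0 and 1 share an Email but differ in Name.
df = pd.DataFrame({
    "Name":  ["John", "J. Doe", "Jane"],
    "Email": ["john@example.com", "john@example.com", "jane@example.com"],
})

# Compare rows on the Email column only, not on all columns.
first   = df.drop_duplicates(subset=["Email"])               # keep first occurrence
last    = df.drop_duplicates(subset=["Email"], keep="last")  # keep last occurrence
neither = df.drop_duplicates(subset=["Email"], keep=False)   # drop every duplicated row

print(first["Name"].tolist())    # ['John', 'Jane']
print(last["Name"].tolist())     # ['J. Doe', 'Jane']
print(neither["Name"].tolist())  # ['Jane']
```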

Purpose
The purpose of this project is to provide a simple and efficient
solution for cleaning datasets by identifying and removing duplicate
entries. Duplicate records can significantly affect the quality and
reliability of data analysis and machine learning models. This project
leverages Python and the pandas library to automate the process of
cleaning datasets, ensuring that the resulting data is free from
redundancy and ready for accurate analysis or model training.
By using this script, users can quickly:
 Identify duplicate rows in a dataset.
 Remove redundant entries, improving data quality.
 Save the cleaned dataset for further processing or analysis.
This project is aimed at:
 Data analysts and data scientists looking to clean their datasets
as part of the data preprocessing workflow.
 Business analysts who need to prepare clean data for reporting
or decision-making.
 Beginners learning data cleaning and Python programming,
providing them with a useful tool to enhance their skills.

Project Code
import pandas as pd

# Step 1: Load the dataset


# Make sure to replace 'data.csv' with the path to your dataset.
file_path = 'data.csv' # Provide the correct file path
df = pd.read_csv(file_path)

# Step 2: Display basic info about the dataset


print("Original Dataset Shape:", df.shape)
print("First 5 Rows of the Dataset:\n", df.head())

# Step 3: Find and display duplicates


duplicates = df[df.duplicated()]
print("\nDuplicate Rows Found:\n", duplicates)
print("\nNumber of Duplicate Rows:", df.duplicated().sum())

# Step 4: Remove duplicates (default keeps the first occurrence)


df_cleaned = df.drop_duplicates()

# Step 5: Show new shape after removing duplicates


print("\nCleaned Dataset Shape:", df_cleaned.shape)

# Step 6: Save the cleaned dataset to a new CSV file


df_cleaned.to_csv('cleaned_data.csv', index=False)
print("\nCleaned dataset saved as 'cleaned_data.csv'.")

# (Optional) Save the duplicates to a separate file for auditing purposes
if not duplicates.empty:
    duplicates.to_csv('duplicates_log.csv', index=False)
    print("Duplicate rows saved in 'duplicates_log.csv'.")

Example Output
Original Dataset Shape: (100, 5)
First 5 Rows of the Dataset:
Name Email Address Phone Number
0 John john@example.com 1234 Elm St 555-1234
1 Jane jane@example.com 5678 Oak St 555-5678
2 John john@example.com 1234 Elm St 555-1234
3 Jane jane@example.com 5678 Oak St 555-5678

Duplicate Rows Found:


Name Email Address Phone Number
2 John john@example.com 1234 Elm St 555-1234
3 Jane jane@example.com 5678 Oak St 555-5678

Number of Duplicate Rows: 2

Cleaned Dataset Shape: (98, 5)

Cleaned dataset saved as 'cleaned_data.csv'.


Duplicate rows saved in 'duplicates_log.csv'.
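A quick way to validate output like the above is to confirm that the number of rows removed equals the number of duplicates reported. A minimal sketch, rebuilding only the four example rows shown (not the full 100-row dataset):

```python
import pandas as pd

# The four rows from the example output above.
df = pd.DataFrame({
    "Name":  ["John", "Jane", "John", "Jane"],
    "Email": ["john@example.com", "jane@example.com",
              "john@example.com", "jane@example.com"],
})

n_dups = df.duplicated().sum()
df_cleaned = df.drop_duplicates()

# Rows removed must equal duplicates found.
assert len(df) - len(df_cleaned) == n_dups
print("Duplicates:", n_dups, "Cleaned rows:", len(df_cleaned))  # 2 and 2
```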

Future Scope
 The project can be expanded to support multiple file formats
such as Excel, JSON, or databases.
 Advanced duplicate detection techniques, such as fuzzy
matching, could be added to handle nearly identical but not
exact duplicate records.
 The project can be integrated into a larger data preprocessing
pipeline for more complex data analysis or machine learning
workflows.
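One possible sketch of the fuzzy-matching idea, using only difflib from the Python standard library. The sample names, the 0.85 similarity threshold, and the naive pairwise pass are illustrative assumptions, not part of the project:

```python
import pandas as pd
from difflib import SequenceMatcher

def is_near_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    # Two strings count as near-duplicates when their similarity
    # ratio (0.0 to 1.0) meets the chosen threshold.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

df = pd.DataFrame({"Name": ["Jon Smith", "John Smith", "Jane Doe"]})

# Naive O(n^2) pass: keep a row only if its Name does not closely
# match any Name already kept.
keep = []
seen = []
for name in df["Name"]:
    if any(is_near_duplicate(name, s) for s in seen):
        keep.append(False)
    else:
        keep.append(True)
        seen.append(name)

df_fuzzy_clean = df[keep]
print(df_fuzzy_clean["Name"].tolist())  # ['Jon Smith', 'Jane Doe']
```

For large datasets this pairwise approach becomes slow; dedicated libraries with blocking or indexing strategies would be a more scalable extension.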

Conclusion
In this project, we successfully developed a Python script that
automates the process of identifying and removing duplicate entries
from datasets. Using the powerful pandas library, the script reads a
dataset, identifies duplicate rows, removes them, and then saves the
cleaned dataset for further analysis. By offering the option to log the
removed duplicates, the script also provides transparency and
auditability.
The removal of duplicates is a crucial step in the data preprocessing
pipeline, as duplicate data can negatively impact the accuracy of
analysis and model predictions. This project not only demonstrates
how to handle such data issues effectively but also highlights the
importance of data cleaning in ensuring high-quality datasets for real-
world applications.

References
https://docs.python.org/3/
https://www.w3schools.com/python/pandas/default.asp
https://stackoverflow.com/
https://realpython.com/python-data-cleaning-numpy-pandas/

Name of Student:- KALPESH RAVASAHEB KACHOLE   En. No.:- 2210920112
Name of Program:- COMPUTER ENGINEERING   Semester:- 6TH
Course Name:- PWP   Course Code:- 22616
Title of the Micro-Project:- Remove Duplicate in Dataset
Course Outcomes Achieved:-

Sr. No. | Characteristic to be assessed | Poor (Marks 1-3) | Average (Marks 4-5) | Good (Marks 6-8) | Excellent (Marks 9-10) | Sub Total

(A) Process and Product Assessment (convert below total marks out of 6 marks)
1 | Relevance to the course
2 | Literature review / information collection
3 | Completion of the target as per project proposal
4 | Analysis of data and representation
5 | Quality of prototype / model
6 | Report preparation

(B) Individual Presentation / Viva (convert below total marks out of 4 marks)
7 | Presentation
8 | Viva

(A) Process and Product Assessment (6 marks) | (B) Individual Presentation / Viva (4 marks) | Total Marks (10)

Comments/Suggestions about team work/leadership/inter-personal communication (if any)

Name of Course Teacher:- P.P.Angadi

Dated Signature:-
