Network Traffic Data Cleaning: UCM FibIoT 2024 Dataset

This repository contains the data cleaning and preprocessing pipeline for the UCM FibIoT 2024 dataset. The project focuses on preparing raw network packet capture (pcap) data for machine learning tasks, specifically aimed at detecting Distributed Denial of Service (DDoS) attacks in IoT environments.

Project Overview

The primary goal of this notebook (data_cleaning.ipynb) is to transform raw, noisy CSV files exported from Wireshark into a standardized, "analysis-ready" format. It handles three specific types of network flood attacks:

Attack Dataset	Layer	Description
`HTTP_flood.csv`	Application	Layer 7 DDoS attack targeting web servers via HTTP requests.
`ICMP_flood.csv`	Network	Layer 3 flood using Echo Request packets (Ping flood).
`SYN_flood.csv`	Transport	Layer 4 TCP attack exploiting the TCP three-way handshake.

Cleaning Pipeline

The notebook implements a cleaning function that performs the following 11 steps:

1. Standardize Column Names: Converts all headers to lowercase and replaces spaces/dots with underscores (e.g., Source Port -> source_port).
2. Drop Indexing Columns: Removes the Wireshark capture index (no) as it does not contribute to statistical analysis.
3. Parse Datetime: Converts time strings into high-precision Pandas datetime objects.
4. Handle Missing Ports:
   - For portless protocols (ARP, ICMP, LLDP), missing ports are filled with a -1 sentinel value.
   - For port-based protocols (TCP, UDP), rows with missing ports are dropped.
5. Remove Duplicates: Identifies and removes exact duplicate packet entries.
6. Whitespace Removal: Strips leading/trailing spaces from string values in Source, Destination, and Protocol columns.
7. Length Validation: Filters out packets with invalid (non-positive) lengths.
8. Port Range Validation: Ensures ports are within the valid 0–65535 range.
9. Memory Optimization: Converts the protocol column into a Categorical type to significantly reduce memory footprint.
10. Chronological Sorting: Reorders all logs by their packet arrival time.
11. Labeling: Adds an attack_type column to facilitate supervised multi-class classification.
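
The steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `clean_capture`, the constants, and the raw column names (`No.`, `Time`, `Source`, `Source Port`, `Destination`, `Destination Port`, `Protocol`, `Length`) are assumptions. It also assumes that every protocol outside the portless set is treated as port-based, and that the -1 sentinel is exempt from the port range check.

```python
import pandas as pd

# Assumed names; the notebook's real identifiers may differ.
PORTLESS_PROTOCOLS = {"ARP", "ICMP", "LLDP"}
PORT_COLUMNS = ["source_port", "destination_port"]
TEXT_COLUMNS = ["source", "destination", "protocol"]


def clean_capture(df: pd.DataFrame, attack_type: str) -> pd.DataFrame:
    """Illustrative sketch of the 11 steps, not the notebook's actual code."""
    df = df.copy()

    # 1. Standardize column names (trailing "_" stripped so "No." -> "no").
    df.columns = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"[ .]+", "_", regex=True)
        .str.strip("_")
    )

    # 2. Drop the Wireshark capture index.
    df = df.drop(columns=["no"], errors="ignore")

    # 3. Parse timestamps; unparseable strings become NaT.
    df["time"] = pd.to_datetime(df["time"], errors="coerce")

    # 4. Missing ports: keep portless rows (filled with -1), drop the others.
    for col in PORT_COLUMNS:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    portless = df["protocol"].isin(PORTLESS_PROTOCOLS)
    missing_port = df[PORT_COLUMNS].isna().any(axis=1)
    df = df.loc[portless | ~missing_port].copy()
    df[PORT_COLUMNS] = df[PORT_COLUMNS].fillna(-1).astype("int32")

    # 5. Remove exact duplicate rows.
    df = df.drop_duplicates()

    # 6. Strip leading/trailing whitespace from the text columns.
    for col in TEXT_COLUMNS:
        df[col] = df[col].astype("string").str.strip()

    # 7. Keep only packets with a positive length.
    df["length"] = pd.to_numeric(df["length"], errors="coerce")
    df = df.loc[df["length"] > 0]

    # 8. Keep ports in 0..65535 (assumption: the -1 sentinel is allowed).
    ports = df[PORT_COLUMNS]
    valid = ((ports >= 0) & (ports <= 65535)) | (ports == -1)
    df = df.loc[valid.all(axis=1)].copy()

    # 9. Store the protocol as a categorical to save memory.
    df["protocol"] = df["protocol"].astype("category")

    # 10. Sort by packet arrival time.
    df = df.sort_values("time", kind="stable").reset_index(drop=True)

    # 11. Label every row with its attack type.
    df["attack_type"] = attack_type
    return df


# Tiny made-up capture that exercises the drop and fill rules.
raw = pd.DataFrame({
    "No.": [1, 2, 3, 4, 5, 6],
    "Time": ["2024-01-01 00:00:02", "2024-01-01 00:00:01", "2024-01-01 00:00:03",
             "2024-01-01 00:00:04", "2024-01-01 00:00:05", "2024-01-01 00:00:02"],
    "Source": [" 10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5", " 10.0.0.1"],
    "Source Port": [1234, None, None, 80, 70000, 1234],
    "Destination": ["10.0.0.9"] * 6,
    "Destination Port": [80, None, 443, 80, 80, 80],
    "Protocol": ["TCP", "ICMP", "TCP", "TCP", "UDP", "TCP"],
    "Length": [60, 98, 60, 0, 60, 60],
})
cleaned = clean_capture(raw, attack_type="SYN_flood")
print(cleaned)
```

In this toy input, only the first TCP packet and the ICMP packet survive: the other rows are a TCP packet with a missing port, a zero-length packet, a packet with an out-of-range port, and an exact duplicate.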

Dataset Statistics

The raw dataset contains 44,783,403 rows across the three files. Cleaning removes only 615 of them (about 0.0014%, well under 0.01%):

Dataset	Raw Rows	Cleaned Rows	Rows Removed
HTTP Flood	10,799,707	10,799,526	181
ICMP Flood	11,203,031	11,202,929	102
SYN Flood	22,780,665	22,780,333	332
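
The row-loss figure can be checked directly from the table. The snippet below is only a small arithmetic check; the dictionary layout is an illustration, while the counts themselves are copied from the table above.

```python
# (raw rows, cleaned rows) per dataset, copied from the table above.
counts = {
    "HTTP Flood": (10_799_707, 10_799_526),
    "ICMP Flood": (11_203_031, 11_202_929),
    "SYN Flood": (22_780_665, 22_780_333),
}

total_raw = sum(raw for raw, _ in counts.values())
total_removed = sum(raw - clean for raw, clean in counts.values())
loss_pct = 100 * total_removed / total_raw

for name, (raw, clean) in counts.items():
    print(f"{name}: removed {raw - clean} rows ({100 * (raw - clean) / raw:.4f}%)")
print(f"Total: {total_raw} raw rows, {total_removed} removed ({loss_pct:.4f}%)")
```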

Getting Started

Prerequisites

Python 3.8+
Pandas
NumPy
Matplotlib / Seaborn (for visualizations)

Usage

1. Place your raw data in the Dataset_UCM_FibIoT2024/ directory.
2. Open the data_cleaning.ipynb notebook in Jupyter or VS Code.
3. Run the cells sequentially to perform the exploratory data analysis and execute the cleaning function.
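
For orientation, loading the raw files might look like the sketch below. It assumes the three CSV files sit directly inside Dataset_UCM_FibIoT2024/; the helper `load_raw`, the `FILES` mapping, and the `nrows` parameter are hypothetical and not taken from the notebook.

```python
from pathlib import Path

import pandas as pd

# Directory named in the usage steps; the flat layout inside it is an assumption.
DATA_DIR = Path("Dataset_UCM_FibIoT2024")

# Attack label -> CSV file name (file names as listed in the project overview).
FILES = {
    "HTTP_flood": "HTTP_flood.csv",
    "ICMP_flood": "ICMP_flood.csv",
    "SYN_flood": "SYN_flood.csv",
}


def load_raw(data_dir=DATA_DIR, nrows=None):
    """Read each raw CSV into a DataFrame, keyed by attack label.

    Pass nrows to read only the first rows of each file, which is useful
    for a quick look before loading tens of millions of packets.
    """
    frames = {}
    for label, filename in FILES.items():
        path = Path(data_dir) / filename
        if not path.exists():
            raise FileNotFoundError(f"Expected raw file at {path}")
        frames[label] = pd.read_csv(path, nrows=nrows, low_memory=False)
    return frames
```

Calling `load_raw(nrows=100_000)` first is a cheap way to confirm the column layout before committing memory to the full files.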

Visualizations Included

The notebook generates several plots to verify data quality:

Missing Value Analysis: Charts showing null percentages before cleaning.
Protocol Distribution: Summary of the most frequent protocols found in each attack type (TCP, ICMP, MDNS, etc.).
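
The two checks above can be approximated as follows. This is a hedged sketch rather than the notebook's plotting code: the function names are hypothetical, the raw column name `Protocol` is assumed, and Matplotlib is imported only inside the plotting helper so the summary functions work without it.

```python
import pandas as pd


def null_percentages(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, highest first."""
    return df.isna().mean().mul(100).sort_values(ascending=False)


def top_protocols(df: pd.DataFrame, column: str = "Protocol", n: int = 10) -> pd.Series:
    """Counts of the n most frequent protocols (column name is an assumption)."""
    return df[column].value_counts().head(n)


def plot_quality_checks(df: pd.DataFrame, title: str):
    """Draw both summaries side by side as bar charts."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    fig, (ax_nulls, ax_protocols) = plt.subplots(1, 2, figsize=(12, 4))
    null_percentages(df).plot.bar(ax=ax_nulls, title=f"{title}: % missing per column")
    top_protocols(df).plot.bar(ax=ax_protocols, title=f"{title}: top protocols")
    fig.tight_layout()
    return fig


# Small made-up frame to show the summaries.
sample = pd.DataFrame({
    "Source Port": [1234, None, None, 80],
    "Protocol": ["TCP", "ICMP", "TCP", "TCP"],
})
print(null_percentages(sample))
print(top_protocols(sample))
```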

This project was developed as part of a Data Science investigation into IoT security and network traffic analysis.
