This repository contains the data cleaning and preprocessing pipeline for the UCM FibIoT 2024 dataset. The project focuses on preparing raw network packet capture (pcap) data for machine learning tasks, specifically aimed at detecting Distributed Denial of Service (DDoS) attacks in IoT environments.
The primary goal of this notebook (`data_cleaning.ipynb`) is to transform raw, noisy CSV files exported from Wireshark into a standardized, "analysis-ready" format. It handles three specific types of network flood attacks:
| Attack Dataset | Layer | Description |
|---|---|---|
| `HTTP_flood.csv` | Application | Layer 7 DDoS attack targeting web servers via HTTP requests. |
| `ICMP_flood.csv` | Network | Layer 3 flood using Echo Request packets (Ping flood). |
| `SYN_flood.csv` | Transport | Layer 4 TCP attack exploiting the TCP three-way handshake. |
The notebook implements a robust cleaning function that performs the following 11 steps to ensure data integrity:
- Standardize Column Names: Converts all headers to lowercase and replaces spaces/dots with underscores (e.g., `Source Port` -> `source_port`).
- Drop Indexing Columns: Removes the Wireshark capture index (`no`) as it does not contribute to statistical analysis.
- Parse Datetime: Converts time strings into high-precision Pandas datetime objects.
- Handle Missing Ports:
  - For portless protocols (ARP, ICMP, LLDP), missing ports are filled with a `-1` sentinel value.
  - For port-based protocols (TCP, UDP), rows with missing ports are dropped.
- Remove Duplicates: Identifies and removes exact duplicate packet entries.
- Whitespace Removal: Strips leading/trailing spaces from string values in Source, Destination, and Protocol columns.
- Length Validation: Filters out packets with invalid (non-positive) lengths.
- Port Range Validation: Ensures ports are within the valid 0–65535 range.
- Memory Optimization: Converts the `protocol` column into a Categorical type to significantly reduce memory footprint.
- Chronological Sorting: Reorders all logs by their packet arrival time.
- Labeling: Adds an `attack_type` column to facilitate supervised multi-class classification.
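The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `clean_capture` is hypothetical, and the standardized column names (`time`, `source`, `destination`, `protocol`, `length`, `source_port`, `destination_port`) are assumed from a typical Wireshark CSV export. The sketch also assumes that the `-1` sentinel is exempt from the port range check and that rows of other protocols with missing ports are left as they are.

```python
import pandas as pd

# Protocol groups taken from the step list; everything else is left untouched.
PORTLESS_PROTOCOLS = {"ARP", "ICMP", "LLDP"}
PORT_BASED_PROTOCOLS = {"TCP", "UDP"}
PORT_COLS = ["source_port", "destination_port"]  # assumed column names


def clean_capture(df: pd.DataFrame, attack_type: str) -> pd.DataFrame:
    """Hypothetical sketch of the 11 cleaning steps, in the listed order."""
    df = df.copy()

    # 1. Standardize column names: lowercase, spaces/dots -> underscores.
    df.columns = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"[ .]+", "_", regex=True)
        .str.strip("_")
    )

    # 2. Drop the Wireshark capture index if it is present.
    df = df.drop(columns=["no"], errors="ignore")

    # 3. Parse time strings into pandas datetimes.
    df["time"] = pd.to_datetime(df["time"])

    # 4. Missing ports: fill -1 for portless protocols, drop for TCP/UDP.
    proto = df["protocol"].str.strip()
    portless = proto.isin(PORTLESS_PROTOCOLS)
    port_based = proto.isin(PORT_BASED_PROTOCOLS)
    missing_port = df[PORT_COLS].isna().any(axis=1)
    df.loc[portless, PORT_COLS] = df.loc[portless, PORT_COLS].fillna(-1)
    df = df[~(port_based & missing_port)]

    # 5. Remove exact duplicate rows.
    df = df.drop_duplicates().copy()

    # 6. Strip leading/trailing whitespace from the string columns.
    for col in ["source", "destination", "protocol"]:
        df[col] = df[col].str.strip()

    # 7. Keep only packets with a positive length.
    df = df[df["length"] > 0]

    # 8. Port range check (assumption: -1 sentinel and NaN are allowed).
    ports_ok = pd.Series(True, index=df.index)
    for col in PORT_COLS:
        ports_ok &= df[col].isna() | df[col].eq(-1) | df[col].between(0, 65535)
    df = df[ports_ok].copy()

    # 9. Store the protocol as a categorical to save memory.
    df["protocol"] = df["protocol"].astype("category")

    # 10. Sort by packet arrival time.
    df = df.sort_values("time", kind="stable").reset_index(drop=True)

    # 11. Add the class label.
    df["attack_type"] = attack_type
    return df
```

A call such as `clean_capture(raw_df, "SYN_flood")` would then return the cleaned, labeled frame for one capture; the label string is whatever naming the downstream classifier expects.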
The raw dataset contains over 44.7 million rows. After processing, the pipeline maintains high data integrity with minimal row loss (less than 0.01%):
| Dataset | Raw Rows | Cleaned Rows | Rows Removed |
|---|---|---|---|
| HTTP Flood | 10,799,707 | 10,799,526 | 181 |
| ICMP Flood | 11,203,031 | 11,202,929 | 102 |
| SYN Flood | 22,780,665 | 22,780,333 | 332 |
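The row-loss claim can be checked directly from the figures in the table. The snippet below is only an arithmetic sanity check on those published counts; the variable names are illustrative.

```python
# (raw_rows, cleaned_rows) per dataset, copied from the table above.
counts = {
    "HTTP Flood": (10_799_707, 10_799_526),
    "ICMP Flood": (11_203_031, 11_202_929),
    "SYN Flood": (22_780_665, 22_780_333),
}

# Rows removed and percentage lost for each dataset.
per_dataset = {
    name: (raw - cleaned, 100 * (raw - cleaned) / raw)
    for name, (raw, cleaned) in counts.items()
}

total_raw = sum(raw for raw, _ in counts.values())
total_removed = sum(removed for removed, _ in per_dataset.values())
overall_loss_pct = 100 * total_removed / total_raw

for name, (removed, pct) in per_dataset.items():
    print(f"{name}: {removed} rows removed ({pct:.5f}%)")
print(f"Total: {total_removed} of {total_raw} rows removed ({overall_loss_pct:.5f}%)")
```

The totals come to 615 rows removed out of 44,783,403, which is consistent with the stated loss of less than 0.01%.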
- Python 3.8+
- Pandas
- NumPy
- Matplotlib / Seaborn (for visualizations)
- Place your raw data in the following directory structure: `Dataset_UCM_FibIoT2024/`.
- Open the `data_cleaning.ipynb` notebook in Jupyter or VS Code.
- Run the cells sequentially to perform the exploratory data analysis and execute the cleaning function.
The notebook generates several plots to verify data quality:
- Missing Value Analysis: Charts showing null percentages before cleaning.
- Protocol Distribution: Summary of the most frequent protocols found in each attack type (TCP, ICMP, MDNS, etc.).
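The numbers behind these two plots can be computed with plain pandas. The sketch below is an assumption about how such summaries might be derived, not the notebook's plotting code: the helper names `null_percentages` and `top_protocols` are hypothetical, and the `protocol` column name assumes the standardized headers.

```python
import pandas as pd


def null_percentages(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column (input for a null-value chart)."""
    return df.isna().mean().mul(100)


def top_protocols(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Packet counts of the n most frequent protocols, most frequent first."""
    return df["protocol"].value_counts().head(n)
```

Either result is a `Series`, so it could be passed to a bar chart in Matplotlib or Seaborn once per attack dataset.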
This project was developed as part of a Data Science investigation into IoT security and network traffic analysis.