This repository contains a machine learning project aimed at predicting passenger transport in the Spaceship Titanic competition hosted on Kaggle. The primary model used is TabNetClassifier, a high-performance neural network designed for tabular data.
The goal of this project is to accurately predict whether a passenger on the Spaceship Titanic was transported to an alternate dimension or not. The project leverages various Python libraries for data analysis, preprocessing, model training, and evaluation, including:
- pandas: Data manipulation and analysis.
- numpy: Numerical operations.
- seaborn: Data visualization.
- sklearn: Machine learning algorithms and tools.
- pytorch_tabnet: Implementation of the TabNetClassifier model.
The provided code outlines a comprehensive data preprocessing pipeline. The key steps are listed below, followed by a minimal sketch of the pipeline:
- Imputation: Missing values are filled with the most frequent value in each column.
- Feature Engineering: The Cabin feature is split into three separate features: Cabin, Seat, and Type.
- Encoding: Categorical features are encoded using LabelEncoder.
- Data Splitting: The data is split into training and testing sets for model evaluation.
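A minimal sketch of this pipeline, assuming the standard Kaggle column names (`Cabin`, `Transported`, `PassengerId`, `Name`) and scikit-learn's `SimpleImputer`; the notebook's exact column handling and ordering may differ:

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

train = pd.read_csv("train.csv")

# Split the original Cabin column ("deck/num/side") into the three
# features described above (names mirror the description in this README).
train[["Cabin", "Seat", "Type"]] = train["Cabin"].str.split("/", expand=True)

X = train.drop(columns=["Transported", "PassengerId", "Name"])
y = train["Transported"].astype(int).values

# Fill missing values with the most frequent value of each column.
imputer = SimpleImputer(strategy="most_frequent")
X = pd.DataFrame(imputer.fit_transform(X), columns=X.columns).infer_objects()

# Label-encode the remaining categorical (object-dtype) columns.
for col in X.select_dtypes(include="object").columns:
    X[col] = LabelEncoder().fit_transform(X[col].astype(str))

# Hold out a validation split for model evaluation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X.values.astype("float32"), y, test_size=0.2, random_state=42
)
```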
The code demonstrates the training process for the TabNetClassifier model with a fixed set of hyperparameters. After training, the model's performance on the held-out split is evaluated using accuracy and a classification report.
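A sketch of this training and evaluation step, using the pytorch_tabnet API with illustrative hyperparameters (not necessarily those used in the notebook) and the `X_train`/`X_valid` split from the preprocessing sketch above:

```python
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.metrics import accuracy_score, classification_report

# Hyperparameter values here are illustrative only.
clf = TabNetClassifier(n_d=16, n_a=16, n_steps=4, seed=42)

clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    eval_metric=["accuracy"],
    max_epochs=100,
    patience=20,
    batch_size=1024,
    virtual_batch_size=128,
)

# Evaluate on the held-out split.
preds = clf.predict(X_valid)
print("Accuracy:", accuracy_score(y_valid, preds))
print(classification_report(y_valid, preds))
```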
The repository includes a function to generate a submission file in the correct format for Kaggle submission. This function utilizes the trained TabNetClassifier model to predict passenger transport on the test dataset.
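A sketch of the submission step; `preprocess` is a hypothetical helper standing in for the same preprocessing applied to the training data, and `clf` is the trained classifier from above:

```python
import pandas as pd

test = pd.read_csv("test.csv")

# Apply the same preprocessing used for training (hypothetical helper).
X_test = preprocess(test)

# Predict and write the submission file in the format Kaggle expects.
test_preds = clf.predict(X_test).astype(bool)
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Transported": test_preds,
})
submission.to_csv("submit.csv", index=False)
```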
The notebook further provides visualizations exploring relationships between features and the target variable. These visualizations can offer valuable insights into the data and potentially inform further feature engineering or model selection.
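As one example of this kind of plot, a single seaborn count plot of a categorical feature against the target might look like the following (the notebook's actual visualizations may differ):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train = pd.read_csv("train.csv")

# Transport outcome broken down by a categorical feature.
sns.countplot(data=train, x="HomePlanet", hue="Transported")
plt.title("Transported by HomePlanet")
plt.show()
```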
This project serves as a baseline for passenger transport prediction on the Spaceship Titanic dataset. Several potential directions for future work include:
- Hyperparameter Optimization: Exploring different hyperparameter combinations to improve the model's performance (a simple grid-search sketch follows this list).
- Model Ensembling: Combining predictions from multiple models, such as TabNet and other algorithms, to achieve better generalization.
- Feature Engineering: Investigating additional feature engineering techniques to capture more complex relationships within the data.
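As one illustration of the hyperparameter-optimization direction, a simple manual grid over a couple of TabNet parameters could be run against the validation split from the preprocessing sketch; the parameter values below are illustrative only:

```python
from itertools import product
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.metrics import accuracy_score

best_score, best_params = 0.0, None
for n_d, n_steps in product([8, 16, 32], [3, 4, 5]):
    model = TabNetClassifier(n_d=n_d, n_a=n_d, n_steps=n_steps, seed=42, verbose=0)
    model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)],
              max_epochs=50, patience=10)
    score = accuracy_score(y_valid, model.predict(X_valid))
    if score > best_score:
        best_score, best_params = score, {"n_d": n_d, "n_steps": n_steps}

print(best_params, best_score)
```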
- train.csv: Training dataset.
- test.csv: Testing dataset.
- submit.csv: Generated submission file.
- *.ipynb: Jupyter notebook containing the code and analysis.