This repository contains a data analysis project for the Jane Street Market Prediction Kaggle competition. The goal of this competition is to create a model that predicts whether to accept or reject a trade.
The project is structured in three Jupyter notebooks that cover the same analysis:
- janes-stock.ipynb: Main notebook with the complete analysis.
- Clustering+PCA: Notebook focused on clustering and PCA.
- feature+pca: Notebook focused on feature engineering and PCA.
The analysis includes the following steps:
- Data Loading and Optimization: The training data is loaded from train.csv, and its memory usage is optimized by converting float64 columns to float32.
- Feature Engineering:
  - Clustering: K-Means clustering is used to group similar features together.
  - PCA: Principal Component Analysis (PCA) is applied to reduce the dimensionality of the feature space.
- Correlation Analysis: The correlation between the principal components and the target variables (action, weight, and resp) is analyzed.
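
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code. It runs on a small synthetic frame instead of train.csv, and the helper names (`downcast_floats`, `cluster_features`, `pca_components`), the number of clusters and components, the standardization step, and the definition of action as resp > 0 are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def downcast_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every float64 column converted to float32."""
    float_cols = df.select_dtypes(include="float64").columns
    return df.astype({col: "float32" for col in float_cols})


def cluster_features(features: pd.DataFrame, n_clusters: int, seed: int = 0) -> pd.Series:
    """Group feature columns with K-Means by treating each column as one sample."""
    scaled = StandardScaler().fit_transform(features)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = model.fit_predict(scaled.T)  # transpose: rows are now features
    return pd.Series(labels, index=features.columns, name="cluster")


def pca_components(features: pd.DataFrame, n_components: int) -> pd.DataFrame:
    """Project standardized features onto their first principal components."""
    scaled = StandardScaler().fit_transform(features)
    projected = PCA(n_components=n_components).fit_transform(scaled)
    names = [f"pc_{i}" for i in range(n_components)]
    return pd.DataFrame(projected, columns=names, index=features.index)


# Synthetic stand-in for train.csv (hypothetical data, not the competition file).
rng = np.random.default_rng(0)
n_rows = 500
feature_cols = [f"feature_{i}" for i in range(6)]
raw = pd.DataFrame(rng.normal(size=(n_rows, 6)), columns=feature_cols)
raw["weight"] = rng.uniform(0.0, 1.0, size=n_rows)
raw["resp"] = 0.5 * raw["feature_0"] + rng.normal(scale=0.1, size=n_rows)
raw["action"] = (raw["resp"] > 0).astype("int8")  # assumed definition of action

# 1. Memory optimization: float64 -> float32.
small = downcast_floats(raw)

# 2. Feature engineering: cluster the features, then reduce them with PCA.
clusters = cluster_features(small[feature_cols], n_clusters=2)
pcs = pca_components(small[feature_cols], n_components=3)

# 3. Correlation between the principal components and the target variables.
targets = ["action", "weight", "resp"]
corr = pd.concat([pcs, small[targets]], axis=1).corr().loc[pcs.columns, targets]

print(clusters)
print(corr.round(3))
```

With the real data, `raw` would come from `pd.read_csv("train.csv")`. scikit-learn's KMeans and PCA do not accept missing values, so any NaNs in the features would need to be imputed or dropped before steps 2 and 3.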
To run the analysis, you need Python 3 and Jupyter Notebook, along with the following libraries:
- numpy
- pandas
- matplotlib
- scikit-learn
You can install these libraries using pip:
pip install numpy pandas matplotlib scikit-learn
Once you have installed the dependencies, you can run the Jupyter notebooks in this repository.