A comprehensive quantitative finance research project implementing factor modeling, feature selection, and machine learning for stock return prediction. This project demonstrates advanced techniques in alpha factor generation, systematic feature selection, and deep learning model comparison.
This project implements a complete quantitative research pipeline, from raw market data to production-ready predictive models. The best model (LSTM+ResNet) reached an Information Coefficient (IC) of 0.06476, a 72% improvement over the linear regression baseline.
- 🏗️ Factor Library: Generated comprehensive alpha factor library using vectorized operations and technical operators
- 🔍 Feature Selection: Systematically reduced features from 100+ candidates to 85 high-quality factors using statistical and ML methods
- 🤖 Model Performance: Achieved IC of 0.06476 with LSTM+ResNet architecture, significantly outperforming baseline models
- ⚡ Optimization: Implemented automated hyperparameter tuning using Optuna for all model architectures
```
quantitative-factor-research/
├── notebooks/                          # Jupyter notebooks for analysis
│   ├── Data_Preparation.ipynb          # Data loading and preprocessing
│   ├── Alpha_Factor_Generation.ipynb   # Base alpha factor creation
│   ├── Alpha_Factor_Generation_2.ipynb # High-frequency technical factors
│   ├── Alpha_Factor_Selection.ipynb    # Feature selection pipeline
│   ├── ML_Model.ipynb                  # Machine learning models
│   └── factor_backtest.ipynb           # Factor backtesting framework
├── data/                               # Data storage
│   ├── raw/                            # Raw market data
│   ├── processed/                      # Processed datasets
│   ├── factors/                        # Generated factors
│   │   ├── obtained_features/          # Raw factor library
│   │   └── selected_factors/           # Final selected factors
│   └── backtest_results/               # Backtesting outputs
└── README.md                           # Project documentation
```
Objective: Develop a comprehensive factor library capturing various market dynamics and price-volume relationships.
Implementation:
- Technical Operators: Implemented time-series operators (ts_max, ts_min, ts_sum, ts_std_dev, ts_delta, delay, rank); see the sketch after this list
- Vectorized Operations: Utilized pandas and numpy for efficient large-scale factor computation
- Wide-table Dataset: Generated standardized factor matrices aligned across time and assets
- Base Alpha Enhancement: Applied quantile-based segmentation to improve factor discriminative power
- Dynamic Factors: Incorporated rolling standard deviation and price-volume correlation factors
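These operators reduce to simple rolling and shifting primitives on wide tables (rows = dates, columns = assets). A minimal pandas sketch, assuming that wide-table layout; the notebook implementations may differ in signatures and edge-case handling:

```python
import pandas as pd

# Wide-table convention: index = dates, columns = assets.

def ts_max(df: pd.DataFrame, window: int) -> pd.DataFrame:
    """Rolling maximum over the past `window` days, per asset."""
    return df.rolling(window).max()

def ts_std_dev(df: pd.DataFrame, window: int) -> pd.DataFrame:
    """Rolling standard deviation over the past `window` days, per asset."""
    return df.rolling(window).std()

def ts_delta(df: pd.DataFrame, periods: int) -> pd.DataFrame:
    """Difference between today's value and the value `periods` days ago."""
    return df.diff(periods)

def delay(df: pd.DataFrame, periods: int) -> pd.DataFrame:
    """Value as of `periods` days ago."""
    return df.shift(periods)

def rank(df: pd.DataFrame) -> pd.DataFrame:
    """Cross-sectional percentile rank of each asset on each date."""
    return df.rank(axis=1, pct=True)

# Illustrative composition of the primitives into a price-volume factor:
# factor = rank(ts_delta(close, 5)) * rank(ts_std_dev(volume, 20))
```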
Key Features Generated:
- Classic Alpha101 Factors: Traditional quantitative factors (see the example after this list)
- Technical Indicators: RSI, MACD, Bollinger Bands variations
- Volume Microstructure: Intraday volume patterns and clustering
- Price-Volume Relationships: Correlation-based and interaction factors
- Statistical Moments: Skewness, kurtosis, and higher-order moments
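As a concrete member of the Alpha101 family, Alpha#101 from Kakushadze's "101 Formulaic Alphas" (2015) is a one-day intraday momentum factor. Written against the wide-table convention above:

```python
# Alpha#101 (Kakushadze, 2015): (close - open) / ((high - low) + 0.001)
# All inputs are wide tables (index = dates, columns = assets).
def alpha_101(open_, high, low, close):
    return (close - open_) / ((high - low) + 0.001)
```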
Objective: Systematically reduce the feature space while maintaining predictive power and ensuring factor independence.
Multi-Stage Selection Process (a sketch of the first two stages follows this list):
- Statistical Screening:
  - Information Coefficient (IC): kept only factors with mean |IC| > 0.02
  - IC Consistency: retained factors with stable predictive relationships over time
- Correlation Analysis:
  - Spearman Correlation: dropped one factor from any pair with pairwise |ρ| ≥ 0.7
  - Greedy Selection: iterative removal to maintain factor diversity
- Machine Learning Selection:
  - Lasso Regression: L1 regularization for sparse feature selection
  - Ridge Regression: L2 regularization for feature importance ranking
  - LightGBM: gradient-boosting feature importance analysis
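A compact sketch of the first two stages, assuming wide-table factor frames and forward returns; helper names like `ic_filter` and `greedy_decorrelate` are illustrative, not the notebooks' actual API:

```python
import pandas as pd

def daily_ic(factor: pd.DataFrame, fwd_ret: pd.DataFrame) -> pd.Series:
    """Cross-sectional Spearman IC of one factor on each date."""
    return factor.corrwith(fwd_ret, axis=1, method="spearman")

def ic_filter(factors: dict, fwd_ret: pd.DataFrame, min_ic: float = 0.02) -> list:
    """Stage 1: keep factors whose mean |IC| clears the threshold."""
    return [name for name, f in factors.items()
            if abs(daily_ic(f, fwd_ret).mean()) > min_ic]

def greedy_decorrelate(ranked: list, factors: dict, max_corr: float = 0.7) -> list:
    """Stage 2: walk factors in descending-|IC| order, dropping any factor whose
    Spearman correlation with an already-kept factor reaches max_corr."""
    kept = []
    for name in ranked:
        values = factors[name].stack()  # long (date, asset) series
        if all(abs(values.corr(factors[k].stack(), method="spearman")) < max_corr
               for k in kept):
            kept.append(name)
    return kept
```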
Final Results:
- 85 Selected Factors: An optimized feature set balancing predictive power and independence
- Pairwise Spearman Correlation < 0.7: Ensures factor diversification
- High-IC Factors: Retained strong predictive relationships with future returns
Objective: Compare multiple architectures to identify the optimal model for stock return prediction.
Model Architectures Implemented:
- Linear Regression (Baseline)
  - Traditional OLS regression for benchmark comparison
  - Provides interpretable baseline performance metrics
- Deep Neural Network (DNN)
  - Multi-layer feedforward architecture
  - Batch normalization and dropout for regularization
  - Hidden layers: [128, 64, 32] with ReLU activation
- DeepNet
  - Enhanced deep architecture with residual connections
  - Multiple hidden layers: [256, 128, 64, 32]
  - Skip connections to address the vanishing-gradient problem
- AlexNet
  - Convolutional neural network adapted for financial time series
  - 1D convolutions to capture sequential patterns
  - Max pooling and adaptive pooling for feature extraction
- LSTM+ResNet ⭐ Best Performing (a minimal sketch follows this list)
  - Hybrid architecture combining LSTM and residual networks
  - Bidirectional LSTM for temporal pattern recognition
  - ResNet blocks for feature refinement
  - Achieved IC: 0.06476
- Transformer
  - Self-attention mechanism for sequence modeling
  - Multi-head attention with positional encoding
  - Encoder-only architecture optimized for regression
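The notebooks' exact layer configuration isn't reproduced here; the following is a minimal PyTorch sketch of the hybrid idea (a bidirectional LSTM feeding residual MLP blocks), with illustrative hyperparameters:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Fully connected residual block: relu(x + MLP(x))."""
    def __init__(self, dim: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(dim, dim),
        )

    def forward(self, x):
        return torch.relu(x + self.net(x))

class LSTMResNet(nn.Module):
    """Bidirectional LSTM over the factor sequence; residual blocks refine
    the last hidden state into a single return forecast."""
    def __init__(self, n_factors: int, hidden: int = 64, n_blocks: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(n_factors, hidden, batch_first=True,
                            bidirectional=True)
        self.blocks = nn.Sequential(*[ResBlock(2 * hidden)
                                      for _ in range(n_blocks)])
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):            # x: (batch, seq_len, n_factors)
        out, _ = self.lstm(x)        # (batch, seq_len, 2 * hidden)
        h = out[:, -1, :]            # last time step
        return self.head(self.blocks(h)).squeeze(-1)
```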
Training Framework:
- Time Series Cross-Validation: Proper temporal splitting to avoid look-ahead bias
- Early Stopping: Prevents overfitting with validation-based stopping criteria
- Hyperparameter Optimization: Optuna-based automated tuning for all models (setup sketched after this list)
- Information Coefficient: Primary evaluation metric for financial relevance
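A sketch of the Optuna setup implied above (TPE sampler with median pruning, maximizing validation IC); the search space and the `train_and_validate` helper are placeholders, not the project's actual code:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Illustrative search space; the notebooks tune per-architecture parameters.
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    hidden = trial.suggest_categorical("hidden", [32, 64, 128])
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    # train_and_validate() stands in for the project's training loop;
    # it should return the validation-set mean IC for these settings.
    return train_and_validate(lr=lr, hidden=hidden, dropout=dropout)

study = optuna.create_study(
    direction="maximize",                        # maximize validation IC
    sampler=optuna.samplers.TPESampler(seed=42),
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=5),
)
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```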
| Model | Mean IC | IC Std | Max IC | Hit Rate | Improvement vs Baseline |
|---|---|---|---|---|---|
| LSTM+ResNet | 0.06476 | 0.0234 | 0.0891 | 68.3% | +72.0% |
| Transformer | 0.05823 | 0.0198 | 0.0756 | 64.1% | +54.6% |
| DeepNet | 0.04992 | 0.0187 | 0.0634 | 61.2% | +32.5% |
| DNN | 0.04234 | 0.0156 | 0.0578 | 58.7% | +12.4% |
| AlexNet | 0.03891 | 0.0143 | 0.0523 | 56.9% | +3.3% |
| Linear Regression | 0.03767 | 0.0134 | 0.0489 | 55.4% | Baseline |
- 🥇 Best Model: LSTM+ResNet with IC = 0.06476
- 🚀 Significant Improvement: 72% better than linear regression baseline
- 📊 Consistent Performance: High hit rate (68.3%) indicating reliable predictions
- ⚡ Optimization Success: Hyperparameter tuning improved all models by 15-25%
- Data Ingestion: Multi-source market data loading and validation
- Feature Engineering: Vectorized factor computation using pandas/numpy
- Cross-sectional Standardization: Z-score normalization across assets (sketched after this list)
- Time Series Alignment: Consistent temporal indexing across all datasets
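Cross-sectional standardization z-scores each date's cross-section so factor values are comparable across assets. A minimal sketch, assuming the wide-table layout used throughout:

```python
import pandas as pd

def cross_sectional_zscore(factor: pd.DataFrame) -> pd.DataFrame:
    """For every date (row), subtract the mean across assets and divide
    by the standard deviation across assets."""
    mean = factor.mean(axis=1)
    std = factor.std(axis=1)
    return factor.sub(mean, axis=0).div(std, axis=0)
```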
- Data Preprocessing: Standardization and missing value handling
- Time Series Splitting: Rolling window validation with proper temporal ordering
- Model Training: Batch processing with GPU acceleration
- Hyperparameter Optimization: Optuna TPE sampler with median pruning
- Performance Evaluation: IC calculation and statistical significance testing
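A sketch of the evaluation step: per-date Spearman IC between predictions and realized forward returns, plus a t-statistic on the IC series. Hit rate is computed here as the share of positive-IC dates, one common convention; the notebooks may define it differently:

```python
import numpy as np
import pandas as pd

def evaluate_ic(preds: pd.DataFrame, fwd_ret: pd.DataFrame) -> dict:
    """Per-date Spearman IC of predictions vs. realized forward returns,
    with a one-sample t-statistic of the mean IC against zero."""
    ic = preds.corrwith(fwd_ret, axis=1, method="spearman").dropna()
    t_stat = ic.mean() / (ic.std(ddof=1) / np.sqrt(len(ic)))
    return {
        "mean_ic": ic.mean(),
        "ic_std": ic.std(ddof=1),
        "hit_rate": (ic > 0).mean(),  # assumed convention: share of positive-IC dates
        "t_stat": t_stat,
    }
```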
- Python: Core programming language
- PyTorch: Deep learning framework
- Pandas/NumPy: Data manipulation and numerical computing
- Optuna: Hyperparameter optimization
- Scikit-learn: Traditional machine learning algorithms
- LightGBM: Gradient boosting for feature selection
```bash
# Clone repository
git clone <repository-url>
cd quantitative-factor-research

# Install dependencies
pip install torch pandas numpy scikit-learn lightgbm optuna matplotlib seaborn tqdm

# Run data preparation notebook
jupyter notebook notebooks/Data_Preparation.ipynb

# Generate base alpha factors
jupyter notebook notebooks/Alpha_Factor_Generation.ipynb

# Generate high-frequency factors
jupyter notebook notebooks/Alpha_Factor_Generation_2.ipynb

# Run comprehensive feature selection pipeline
jupyter notebook notebooks/Alpha_Factor_Selection.ipynb

# Train and compare all models
jupyter notebook notebooks/ML_Model.ipynb
```

```python
# Run specific model training
results = train_all_models(splits, optimize_params=True)
summary_stats, best_model = analyze_results(results)
```

```bash
# Evaluate factor performance
jupyter notebook notebooks/factor_backtest.ipynb
```

- Volume Microstructure Factors: Intraday volume patterns show strong predictive power
- Price-Volume Correlations: Dynamic correlation factors outperform static alternatives
- Rolling Statistics: Time-varying standard deviation captures market regime changes
- Quantile Segmentation: Enhances factor discriminative power across different market conditions
- LSTM Effectiveness: Temporal patterns in financial data benefit from LSTM memory mechanisms
- Residual Connections: Skip connections crucial for training deep financial models
- Attention Mechanisms: Transformer self-attention captures complex factor interactions
- Hyperparameter Sensitivity: Proper optimization critical for model performance
- Feature Quality: High-IC factors with low correlation drive performance
- Temporal Modeling: Sequence models outperform feedforward architectures
- Regularization: Proper regularization prevents overfitting in noisy financial data
- Cross-Validation: Time series splitting essential for realistic performance estimates
This project implements methodologies from:
- Quantitative Finance Literature: Alpha factor construction and IC analysis
- Machine Learning Research: Advanced neural architectures for time series
- Feature Selection Theory: Statistical and ML-based feature selection techniques
- Hyperparameter Optimization: Bayesian optimization for model tuning
Project Timeline: March 2025 - June 2025
Research Focus: Quantitative Factor Modeling and Machine Learning for Alpha Generation
Key Achievement: 72% improvement in predictive power (mean IC) over the linear regression baseline using advanced ML architectures