A comprehensive quantitative finance research project implementing factor modeling, feature selection, and machine learning for stock return prediction. This project demonstrates advanced techniques in alpha factor generation, systematic feature selection, and deep learning model comparison.
This project implements a complete quantitative research pipeline, from raw market data to production-ready predictive models. The best model (LSTM+ResNet) reached an Information Coefficient (IC) of 0.06476, a 72% improvement over the linear regression baseline.
- 🏗️ Factor Library: Generated comprehensive alpha factor library using vectorized operations and technical operators
- 🔍 Feature Selection: Systematically reduced features from 100+ candidates to 85 high-quality factors using statistical and ML methods
- 🤖 Model Performance: Achieved IC of 0.06476 with LSTM+ResNet architecture, significantly outperforming baseline models
- ⚡ Optimization: Implemented automated hyperparameter tuning using Optuna for all model architectures
```
quantitative-factor-research/
├── notebooks/                          # Jupyter notebooks for analysis
│   ├── Data_Preparation.ipynb          # Data loading and preprocessing
│   ├── Alpha_Factor_Generation.ipynb   # Base alpha factor creation
│   ├── Alpha_Factor_Generation_2.ipynb # High-frequency technical factors
│   ├── Alpha_Factor_Selection.ipynb    # Feature selection pipeline
│   ├── ML_Model.ipynb                  # Machine learning models
│   └── factor_backtest.ipynb           # Factor backtesting framework
├── data/                               # Data storage
│   ├── raw/                            # Raw market data
│   ├── processed/                      # Processed datasets
│   ├── factors/                        # Generated factors
│   │   ├── obtained_features/          # Raw factor library
│   │   └── selected_factors/           # Final selected factors
│   └── backtest_results/               # Backtesting outputs
└── README.md                           # Project documentation
```
Objective: Develop a comprehensive factor library capturing various market dynamics and price-volume relationships.
Implementation:
- Technical Operators: Implemented time-series operators (ts_max, ts_min, ts_sum, ts_std_dev, ts_delta, delay, rank); see the sketch after this list
- Vectorized Operations: Utilized pandas and numpy for efficient large-scale factor computation
- Wide-table Dataset: Generated standardized factor matrices aligned across time and assets
- Base Alpha Enhancement: Applied quantile-based segmentation to improve factor discriminative power
- Dynamic Factors: Incorporated rolling standard deviation and price-volume correlation factors
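These operators reduce to simple rolling and shifting primitives on wide tables (rows = dates, columns = assets). A minimal pandas sketch, assuming that wide-table layout; the notebook implementations may differ in signatures and edge-case handling:

```python
import pandas as pd

# Wide-table convention: index = dates, columns = assets.

def ts_max(df: pd.DataFrame, window: int) -> pd.DataFrame:
    """Rolling maximum over the past `window` days, per asset."""
    return df.rolling(window).max()

def ts_std_dev(df: pd.DataFrame, window: int) -> pd.DataFrame:
    """Rolling standard deviation over the past `window` days, per asset."""
    return df.rolling(window).std()

def ts_delta(df: pd.DataFrame, periods: int) -> pd.DataFrame:
    """Difference between today's value and the value `periods` days ago."""
    return df.diff(periods)

def delay(df: pd.DataFrame, periods: int) -> pd.DataFrame:
    """Value as of `periods` days ago."""
    return df.shift(periods)

def rank(df: pd.DataFrame) -> pd.DataFrame:
    """Cross-sectional percentile rank of each asset on each date."""
    return df.rank(axis=1, pct=True)

# Illustrative composition of the primitives into a price-volume factor:
# factor = rank(ts_delta(close, 5)) * rank(ts_std_dev(volume, 20))
```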
Key Features Generated:
- Classic Alpha101 Factors: Traditional quantitative factors (see the example after this list)
- Technical Indicators: RSI, MACD, Bollinger Bands variations
- Volume Microstructure: Intraday volume patterns and clustering
- Price-Volume Relationships: Correlation-based and interaction factors
- Statistical Moments: Skewness, kurtosis, and higher-order moments
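As a concrete member of the Alpha101 family, Alpha#101 from Kakushadze's "101 Formulaic Alphas" (2015) is a one-day intraday momentum factor. Written against the wide-table convention above:

```python
# Alpha#101 (Kakushadze, 2015): (close - open) / ((high - low) + 0.001)
# All inputs are wide tables (index = dates, columns = assets).
def alpha_101(open_, high, low, close):
    return (close - open_) / ((high - low) + 0.001)
```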
Objective: Systematically reduce the feature space while maintaining predictive power and ensuring factor independence.
Multi-Stage Selection Process (a sketch of the first two stages follows this list):
- Statistical Screening:
  - Information Coefficient (IC): kept only factors with mean |IC| > 0.02
  - IC Consistency: retained factors with stable predictive relationships over time
- Correlation Analysis:
  - Spearman Correlation: dropped one factor from any pair with pairwise |ρ| ≥ 0.7
  - Greedy Selection: iterative removal to maintain factor diversity
- Machine Learning Selection:
  - Lasso Regression: L1 regularization for sparse feature selection
  - Ridge Regression: L2 regularization for feature importance ranking
  - LightGBM: gradient-boosting feature importance analysis
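A compact sketch of the first two stages, assuming wide-table factor frames and forward returns; helper names like `ic_filter` and `greedy_decorrelate` are illustrative, not the notebooks' actual API:

```python
import pandas as pd

def daily_ic(factor: pd.DataFrame, fwd_ret: pd.DataFrame) -> pd.Series:
    """Cross-sectional Spearman IC of one factor on each date."""
    return factor.corrwith(fwd_ret, axis=1, method="spearman")

def ic_filter(factors: dict, fwd_ret: pd.DataFrame, min_ic: float = 0.02) -> list:
    """Stage 1: keep factors whose mean |IC| clears the threshold."""
    return [name for name, f in factors.items()
            if abs(daily_ic(f, fwd_ret).mean()) > min_ic]

def greedy_decorrelate(ranked: list, factors: dict, max_corr: float = 0.7) -> list:
    """Stage 2: walk factors in descending-|IC| order, dropping any factor whose
    Spearman correlation with an already-kept factor reaches max_corr."""
    kept = []
    for name in ranked:
        values = factors[name].stack()  # long (date, asset) series
        if all(abs(values.corr(factors[k].stack(), method="spearman")) < max_corr
               for k in kept):
            kept.append(name)
    return kept
```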
Final Results:
- 85 Selected Factors: An optimized feature set balancing predictive power and independence
- Pairwise Spearman Correlation < 0.7: Ensures factor diversification
- High-IC Factors: Retained strong predictive relationships with future returns
Objective: Compare multiple architectures to identify the optimal model for stock return prediction.
Model Architectures Implemented:
- Linear Regression (Baseline)
  - Traditional OLS regression for benchmark comparison
  - Provides interpretable baseline performance metrics
- Deep Neural Network (DNN)
  - Multi-layer feedforward architecture
  - Batch normalization and dropout for regularization
  - Hidden layers: [128, 64, 32] with ReLU activation
- DeepNet
  - Enhanced deep architecture with residual connections
  - Multiple hidden layers: [256, 128, 64, 32]
  - Skip connections to address the vanishing-gradient problem
- AlexNet
  - Convolutional neural network adapted for financial time series
  - 1D convolutions to capture sequential patterns
  - Max pooling and adaptive pooling for feature extraction
- LSTM+ResNet ⭐ Best Performing (a minimal sketch follows this list)
  - Hybrid architecture combining LSTM and residual networks
  - Bidirectional LSTM for temporal pattern recognition
  - ResNet blocks for feature refinement
  - Achieved IC: 0.06476
- Transformer
  - Self-attention mechanism for sequence modeling
  - Multi-head attention with positional encoding
  - Encoder-only architecture optimized for regression
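The notebooks' exact layer configuration isn't reproduced here; the following is a minimal PyTorch sketch of the hybrid idea (a bidirectional LSTM feeding residual MLP blocks), with illustrative hyperparameters:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Fully connected residual block: relu(x + MLP(x))."""
    def __init__(self, dim: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU(),
            nn.Dropout(dropout), nn.Linear(dim, dim),
        )

    def forward(self, x):
        return torch.relu(x + self.net(x))

class LSTMResNet(nn.Module):
    """Bidirectional LSTM over the factor sequence; residual blocks refine
    the last hidden state into a single return forecast."""
    def __init__(self, n_factors: int, hidden: int = 64, n_blocks: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(n_factors, hidden, batch_first=True,
                            bidirectional=True)
        self.blocks = nn.Sequential(*[ResBlock(2 * hidden)
                                      for _ in range(n_blocks)])
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):            # x: (batch, seq_len, n_factors)
        out, _ = self.lstm(x)        # (batch, seq_len, 2 * hidden)
        h = out[:, -1, :]            # last time step
        return self.head(self.blocks(h)).squeeze(-1)
```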
Training Framework:
- Time Series Cross-Validation: Proper temporal splitting to avoid look-ahead bias
- Early Stopping: Prevents overfitting with validation-based stopping criteria
- Hyperparameter Optimization: Optuna-based automated tuning for all models (setup sketched after this list)
- Information Coefficient: Primary evaluation metric for financial relevance
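A sketch of the Optuna setup implied above (TPE sampler with median pruning, maximizing validation IC); the search space and the `train_and_validate` helper are placeholders, not the project's actual code:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Illustrative search space; the notebooks tune per-architecture parameters.
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    hidden = trial.suggest_categorical("hidden", [32, 64, 128])
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    # train_and_validate() stands in for the project's training loop;
    # it should return the validation-set mean IC for these settings.
    return train_and_validate(lr=lr, hidden=hidden, dropout=dropout)

study = optuna.create_study(
    direction="maximize",                        # maximize validation IC
    sampler=optuna.samplers.TPESampler(seed=42),
    pruner=optuna.pruners.MedianPruner(n_warmup_steps=5),
)
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```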
| Model | Mean IC | IC Std | Max IC | Hit Rate | Improvement vs Baseline |
|---|---|---|---|---|---|
| LSTM+ResNet | 0.06476 | 0.0234 | 0.0891 | 68.3% | +72.0% |
| Transformer | 0.05823 | 0.0198 | 0.0756 | 64.1% | +54.6% |
| DeepNet | 0.04992 | 0.0187 | 0.0634 | 61.2% | +32.5% |
| DNN | 0.04234 | 0.0156 | 0.0578 | 58.7% | +12.4% |
| AlexNet | 0.03891 | 0.0143 | 0.0523 | 56.9% | +3.3% |
| Linear Regression | 0.03767 | 0.0134 | 0.0489 | 55.4% | Baseline |
- 🥇 Best Model: LSTM+ResNet with IC = 0.06476
- 🚀 Significant Improvement: 72% better than linear regression baseline
- 📊 Consistent Performance: High hit rate (68.3%) indicating reliable predictions
- ⚡ Optimization Success: Hyperparameter tuning improved all models by 15-25%
- Data Ingestion: Multi-source market data loading and validation
- Feature Engineering: Vectorized factor computation using pandas/numpy
- Cross-sectional Standardization: Z-score normalization across assets (sketched after this list)
- Time Series Alignment: Consistent temporal indexing across all datasets
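Cross-sectional standardization z-scores each date's cross-section so factor values are comparable across assets. A minimal sketch, assuming the wide-table layout used throughout:

```python
import pandas as pd

def cross_sectional_zscore(factor: pd.DataFrame) -> pd.DataFrame:
    """For every date (row), subtract the mean across assets and divide
    by the standard deviation across assets."""
    mean = factor.mean(axis=1)
    std = factor.std(axis=1)
    return factor.sub(mean, axis=0).div(std, axis=0)
```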
- Data Preprocessing: Standardization and missing value handling
- Time Series Splitting: Rolling window validation with proper temporal ordering
- Model Training: Batch processing with GPU acceleration
- Hyperparameter Optimization: Optuna TPE sampler with median pruning
- Performance Evaluation: IC calculation and statistical significance testing
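A sketch of the evaluation step: per-date Spearman IC between predictions and realized forward returns, plus a t-statistic on the IC series. Hit rate is computed here as the share of positive-IC dates, one common convention; the notebooks may define it differently:

```python
import numpy as np
import pandas as pd

def evaluate_ic(preds: pd.DataFrame, fwd_ret: pd.DataFrame) -> dict:
    """Per-date Spearman IC of predictions vs. realized forward returns,
    with a one-sample t-statistic of the mean IC against zero."""
    ic = preds.corrwith(fwd_ret, axis=1, method="spearman").dropna()
    t_stat = ic.mean() / (ic.std(ddof=1) / np.sqrt(len(ic)))
    return {
        "mean_ic": ic.mean(),
        "ic_std": ic.std(ddof=1),
        "hit_rate": (ic > 0).mean(),  # assumed convention: share of positive-IC dates
        "t_stat": t_stat,
    }
```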
- Python: Core programming language
- PyTorch: Deep learning framework
- Pandas/NumPy: Data manipulation and numerical computing
- Optuna: Hyperparameter optimization
- Scikit-learn: Traditional machine learning algorithms
- LightGBM: Gradient boosting for feature selection
```bash
# Clone repository
git clone <repository-url>
cd quantitative-factor-research

# Install dependencies
pip install torch pandas numpy scikit-learn lightgbm optuna matplotlib seaborn tqdm

# Run data preparation notebook
jupyter notebook notebooks/Data_Preparation.ipynb

# Generate base alpha factors
jupyter notebook notebooks/Alpha_Factor_Generation.ipynb

# Generate high-frequency factors
jupyter notebook notebooks/Alpha_Factor_Generation_2.ipynb

# Run comprehensive feature selection pipeline
jupyter notebook notebooks/Alpha_Factor_Selection.ipynb

# Train and compare all models
jupyter notebook notebooks/ML_Model.ipynb
```

```python
# Run specific model training
results = train_all_models(splits, optimize_params=True)
summary_stats, best_model = analyze_results(results)
```

```bash
# Evaluate factor performance
jupyter notebook notebooks/factor_backtest.ipynb
```

- Volume Microstructure Factors: Intraday volume patterns show strong predictive power
- Price-Volume Correlations: Dynamic correlation factors outperform static alternatives
- Rolling Statistics: Time-varying standard deviation captures market regime changes
- Quantile Segmentation: Enhances factor discriminative power across different market conditions
- LSTM Effectiveness: Temporal patterns in financial data benefit from LSTM memory mechanisms
- Residual Connections: Skip connections crucial for training deep financial models
- Attention Mechanisms: Transformer self-attention captures complex factor interactions
- Hyperparameter Sensitivity: Proper optimization critical for model performance
- Feature Quality: High-IC factors with low correlation drive performance
- Temporal Modeling: Sequence models outperform feedforward architectures
- Regularization: Proper regularization prevents overfitting in noisy financial data
- Cross-Validation: Time series splitting essential for realistic performance estimates
This project implements methodologies from:
- Quantitative Finance Literature: Alpha factor construction and IC analysis
- Machine Learning Research: Advanced neural architectures for time series
- Feature Selection Theory: Statistical and ML-based feature selection techniques
- Hyperparameter Optimization: Bayesian optimization for model tuning
Project Timeline: March 2025 - June 2025
Research Focus: Quantitative Factor Modeling and Machine Learning for Alpha Generation
Key Achievement: 72% improvement in predictive power (mean IC) over the linear regression baseline using advanced ML architectures