Housing Regression MLE is an end-to-end machine learning pipeline for predicting housing prices using XGBoost. The project follows ML engineering best practices with modular pipelines, experiment tracking via MLflow, containerization, AWS cloud deployment, and comprehensive testing. The system includes both a REST API and a Streamlit dashboard for interactive predictions.
The codebase is organized into distinct pipelines following the flow:
Load → Preprocess → Feature Engineering → Train → Tune → Evaluate → Inference → Batch → Serve
- `src/feature_pipeline/`: Data loading, preprocessing, and feature engineering
  - `load.py`: Time-aware data splitting (train <2020, eval 2020-21, holdout ≥2022)
  - `preprocess.py`: City normalization, deduplication, outlier removal
  - `feature_engineering.py`: Date features, frequency encoding (zipcode), target encoding (city_full)
- `src/training_pipeline/`: Model training and hyperparameter optimization
  - `train.py`: Baseline XGBoost training with configurable parameters
  - `tune.py`: Optuna-based hyperparameter tuning with MLflow integration
  - `eval.py`: Model evaluation and metrics calculation
- `src/inference_pipeline/`: Production inference
  - `inference.py`: Applies the same preprocessing/encoding transformations using saved encoders
- `src/batch/`: Batch prediction processing
  - `run_monthly.py`: Generates monthly predictions on holdout data
- `src/api/`: FastAPI web service
  - `main.py`: REST API with S3 integration, health checks, prediction endpoints, and batch processing
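The two categorical encodings named above (frequency encoding for `zipcode`, target encoding for `city_full`) can be sketched roughly as follows. This is a minimal illustration with made-up data and an assumed smoothing constant; the actual implementation in `feature_engineering.py` may differ:

```python
import pandas as pd

def fit_frequency_encoding(train: pd.Series) -> dict:
    """Map each category to its relative frequency in the training split."""
    return train.value_counts(normalize=True).to_dict()

def fit_target_encoding(train: pd.DataFrame, col: str, target: str,
                        smoothing: float = 10.0) -> dict:
    """Smoothed mean-target encoding, fitted on training data only."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return smoothed.to_dict()

train = pd.DataFrame({
    "zipcode": ["98101", "98101", "98052", "98004"],
    "city_full": ["Seattle", "Seattle", "Redmond", "Bellevue"],
    "price": [500_000, 520_000, 700_000, 900_000],
})
freq = fit_frequency_encoding(train["zipcode"])
tgt = fit_target_encoding(train, "city_full", "price")
# Unseen categories at inference time fall back to a default (e.g. 0.0).
train["zipcode_freq"] = train["zipcode"].map(freq).fillna(0.0)
```

Fitting both mappings only on the training split (and persisting them for inference) is what keeps these encodings leakage-free.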
`app.py`: Streamlit dashboard for interactive housing price predictions
- Real-time predictions via FastAPI integration
- Interactive filtering by year, month, and region
- Visualization of predictions vs actuals with metrics (MAE, RMSE, % Error)
- Yearly trend analysis with highlighted selected periods
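The dashboard's error metrics can be computed along these lines. A sketch with numpy, not the dashboard's actual code; "% Error" is assumed here to mean mean absolute percentage error:

```python
import numpy as np

def prediction_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """MAE, RMSE, and mean absolute percentage error for a filtered slice."""
    err = y_pred - y_true
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "% Error": float(np.mean(np.abs(err) / y_true) * 100),
    }

m = prediction_metrics(
    np.array([200_000.0, 400_000.0]),  # actual sale prices
    np.array([220_000.0, 380_000.0]),  # model predictions
)
```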
- AWS S3 Integration: Data and model storage in the `housing-regression-data` bucket
- Amazon ECR: Container registry for Docker images
- Amazon ECS: Container orchestration with Fargate
- Application Load Balancer: Traffic distribution and routing
- CI/CD Pipeline: Automated deployment via GitHub Actions
- `housing-api-service`: FastAPI backend (port 8000, 1024 CPU units, 3072 MB memory)
- `housing-streamlit-service`: Streamlit dashboard (port 8501, 512 CPU units, 1024 MB memory)
The project implements strict data leakage prevention:
- Time-based splits (not random)
- Encoders fitted only on training data
- Leakage-prone columns dropped before training
- Schema alignment enforced between train/eval/inference
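The time-based split above can be sketched as follows (a minimal illustration assuming a `date` column; the real cutoff logic lives in `load.py`):

```python
import pandas as pd

def time_split(df: pd.DataFrame, date_col: str = "date"):
    """Split by year: train <2020, eval 2020-21, holdout >=2022 (no shuffling)."""
    year = pd.to_datetime(df[date_col]).dt.year
    return (
        df[year < 2020],
        df[(year >= 2020) & (year <= 2021)],
        df[year >= 2022],
    )

df = pd.DataFrame({
    "date": ["2018-05-01", "2020-07-15", "2021-03-02", "2022-01-10"],
    "price": [1, 2, 3, 4],
})
train, eval_df, holdout = time_split(df)
```

Because every evaluation row postdates every training row, the model can never be scored on data it implicitly saw during fitting.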
```bash
# Install dependencies using uv
uv sync
```

```bash
# Run all tests
pytest

# Run specific test modules
pytest tests/test_features.py
pytest tests/test_training.py
pytest tests/test_inference.py

# Run with verbose output
pytest -v
```

```bash
# 1. Load and split raw data
python src/feature_pipeline/load.py

# 2. Preprocess splits
python -m src.feature_pipeline.preprocess

# 3. Feature engineering
python -m src.feature_pipeline.feature_engineering
```

```bash
# Train baseline model
python src/training_pipeline/train.py

# Hyperparameter tuning with MLflow
python src/training_pipeline/tune.py

# Model evaluation
python src/training_pipeline/eval.py
```

```bash
# Single inference
python src/inference_pipeline/inference.py --input data/raw/holdout.csv --output predictions.csv

# Batch monthly predictions
python src/batch/run_monthly.py
```

```bash
# Start FastAPI server locally
uv run uvicorn src.api.main:app --host 0.0.0.0 --port 8000

# Start Streamlit dashboard locally
streamlit run app.py --server.port 8501 --server.address 0.0.0.0
```

```bash
# Build API container
docker build -t housing-regression .

# Build Streamlit container
docker build -t housing-streamlit -f Dockerfile.streamlit .

# Run API container
docker run -p 8000:8000 housing-regression

# Run Streamlit container
docker run -p 8501:8501 housing-streamlit
```

```bash
# Start MLflow UI (view experiments)
mlflow ui
```

Each pipeline component can be run independently with consistent interfaces. All modules accept configurable input/output paths for testing isolation.
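The injectable-path interface might look like this sketch (the function name `run_preprocess` and its body are illustrative, not the modules' actual API):

```python
import tempfile
from pathlib import Path

import pandas as pd

def run_preprocess(input_path: Path, output_path: Path) -> Path:
    """Read a split, apply cleaning, write the result; paths are injectable for tests."""
    df = pd.read_csv(input_path)
    df = df.drop_duplicates()
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_path, index=False)
    return output_path

# Demo against a throwaway directory rather than the real data/ tree.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "train.csv"
    pd.DataFrame({"city": ["seattle", "seattle", "tacoma"]}).to_csv(raw, index=False)
    cleaned = run_preprocess(raw, Path(tmp) / "processed" / "train.csv")
    n_rows = len(pd.read_csv(cleaned))
```

Passing paths in rather than hard-coding them is what lets the test suite point every component at temporary directories.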
- S3-First Storage: Models and data automatically sync from S3 buckets
- Containerized Services: Both API and dashboard run in Docker containers
- Auto-scaling Infrastructure: ECS Fargate provides serverless container scaling
- Environment-based Configuration: Separate configs for local development and production
Frequency and target encoders are saved as pickle files during training and loaded during inference to ensure consistent transformations.
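A minimal sketch of that save/load round-trip (paths and encoder contents are illustrative; the repo stores its encoders under `models/`):

```python
import pickle
import tempfile
from pathlib import Path

# Toy encoder mappings standing in for the fitted frequency/target encoders.
encoders = {
    "zipcode_freq": {"98101": 0.5, "98052": 0.25},
    "city_target": {"Seattle": 510_000.0},
}

with tempfile.TemporaryDirectory() as tmp:
    enc_path = Path(tmp) / "encoders.pkl"
    with open(enc_path, "wb") as f:      # training side: persist fitted encoders
        pickle.dump(encoders, f)
    with open(enc_path, "rb") as f:      # inference side: reload the same mappings
        loaded = pickle.load(f)
```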
Model parameters, file paths, and pipeline settings use sensible defaults but can be overridden through function parameters or environment variables. Production deployments use AWS environment variables.
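The environment-variable override pattern might look like this (the setting name `HOUSING_MODEL_PATH` and the default path are hypothetical, not the project's actual configuration keys):

```python
import os

def get_setting(name: str, default: str) -> str:
    """Environment variables (e.g. from the ECS task definition) override defaults."""
    return os.environ.get(name, default)

os.environ.pop("HOUSING_MODEL_PATH", None)  # ensure a clean slate for the demo
default_path = get_setting("HOUSING_MODEL_PATH", "models/model.pkl")

os.environ["HOUSING_MODEL_PATH"] = "/opt/models/prod.pkl"  # production override
overridden = get_setting("HOUSING_MODEL_PATH", "models/model.pkl")
```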
- Unit tests for individual pipeline components
- Integration tests for end-to-end pipeline flows
- Smoke tests for inference pipeline
- All tests use temporary directories to avoid touching production data
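The temporary-directory pattern, sketched with pytest's `tmp_path` fixture (the test body is illustrative, not taken from the suite):

```python
import pandas as pd

def test_preprocess_dedup(tmp_path):
    """Write a tiny fixture, clean it, and check the result — all inside tmp_path."""
    raw = tmp_path / "train.csv"
    pd.DataFrame({"city": ["Seattle", "Seattle"]}).to_csv(raw, index=False)

    df = pd.read_csv(raw).drop_duplicates()
    out = tmp_path / "clean.csv"
    df.to_csv(out, index=False)

    # The production data/ directory is never touched.
    assert len(pd.read_csv(out)) == 1
```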
Key production dependencies (see `pyproject.toml`):
- ML/Data: `xgboost==3.0.4`, `scikit-learn`, `pandas==2.1.1`, `numpy==1.26.4`
- API: `fastapi`, `uvicorn`
- Dashboard: `streamlit`, `plotly`
- Cloud: `boto3` (AWS integration)
- Experimentation: `mlflow`, `optuna`
- Quality: `great-expectations`, `evidently`
- `data/`: Raw, processed, and prediction data (time-structured, S3-synced)
- `models/`: Trained models and encoders (pkl files, S3-synced)
- `mlruns/`: MLflow experiment tracking data
- `configs/`: YAML configuration files
- `notebooks/`: Jupyter notebooks for EDA and experimentation
- `tests/`: Comprehensive test suite with sample data
- AWS Task Definitions: `housing-api-task-def.json`, `streamlit-task-def.json`
- CI/CD: `.github/workflows/ci.yml` for automated deployment