Box Office Prediction

This project predicts box office performance for movies using machine learning. The pipeline covers data retrieval, preprocessing, feature engineering, and model training and evaluation.

Project Structure

Data Retrieval
- Located in code/1_data_retrieval/
- Retrieves movie data from The Movie Database (TMDb) API
- Handles rate limiting and error recovery
- Stores data in CSV files
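
The rate limiting and error recovery listed above could be handled along the following lines. This is a minimal sketch, not the project's actual retrieval code: `TransientError`, `fetch_with_retry`, and `fetch_all` are hypothetical names, and the `fetch` coroutine is passed in as a parameter so the example runs without network access.

```python
import asyncio
import random


class TransientError(Exception):
    """Hypothetical marker for recoverable failures (e.g. HTTP 429 or 5xx)."""


async def fetch_with_retry(fetch, movie_id, semaphore, max_retries=4, base_delay=0.5):
    """Call `fetch(movie_id)`, retrying transient failures with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            # The semaphore caps the number of requests in flight at once.
            async with semaphore:
                return await fetch(movie_id)
        except TransientError:
            if attempt == max_retries:
                raise
            # Exponential backoff with a little jitter to avoid synchronized retries.
            delay = base_delay * (2 ** attempt) * (1 + random.random() * 0.1)
            await asyncio.sleep(delay)


async def fetch_all(fetch, movie_ids, max_concurrency=5, **retry_kwargs):
    """Fetch many movies concurrently; results keep the order of `movie_ids`."""
    semaphore = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_with_retry(fetch, m, semaphore, **retry_kwargs) for m in movie_ids]
    return await asyncio.gather(*tasks)
```

Note that a semaphore limits concurrency, not requests per second; a stricter rate limit would need an additional delay or token bucket. In practice `fetch` would wrap an HTTP client call and raise `TransientError` for retryable responses.
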
Data Preprocessing
- Located in code/2_data_preprocessing/
- Processes raw data into a format suitable for machine learning
- Includes data cleaning, merging, and initial feature creation
Feature Engineering
- Located in code/3_feature_eng/
- Creates advanced features from the preprocessed data
- Includes complex KPI features, socioeconomic features, and holiday-related features
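
As an illustration of the holiday-related features, the sketch below derives a few values from a release date. The holiday list, the 7-day window, and the function and feature names are assumptions made for this example; the project's own definitions may differ.

```python
from datetime import date

# Illustrative fixed-date holidays as (month, day); a real list would depend on the market.
FIXED_HOLIDAYS = [(1, 1), (7, 4), (12, 25)]


def days_to_nearest_holiday(release: date, holidays=FIXED_HOLIDAYS) -> int:
    """Return the absolute distance in days from `release` to the closest holiday."""
    # Check the neighbouring years too, so late-December releases see January 1.
    candidates = [
        date(year, month, day)
        for year in (release.year - 1, release.year, release.year + 1)
        for month, day in holidays
    ]
    return min(abs((release - h).days) for h in candidates)


def holiday_features(release: date, window: int = 7) -> dict:
    """Build a small feature dict for one release date."""
    distance = days_to_nearest_holiday(release)
    return {
        "days_to_nearest_holiday": distance,
        "is_holiday_window": int(distance <= window),
        "release_month": release.month,
        "release_weekday": release.weekday(),  # Monday == 0
    }
```

For example, `holiday_features(date(2019, 12, 20))` reports a distance of 5 days (to December 25) and flags the release as falling inside the holiday window.
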
Machine Learning
- Located in code/4_machine_learning/
- Implements various machine learning models for box office prediction
- Includes model training, hyperparameter tuning, and evaluation
Infrastructure
- Located in infra/
- Contains Terraform configurations for compute instances and web scraping

Key Features

- Asynchronous data retrieval from TMDb API
- Comprehensive feature engineering, including:
  - Production company performance
  - Cast and crew performance metrics
  - Genre and keyword analysis
  - Socioeconomic indicators
  - Release date analysis (including holidays)
- Support for both regression and classification tasks
- Multiple machine learning models, including XGBoost, LightGBM, and neural networks
- Hyperparameter tuning using random search and grid search
- Detailed model evaluation and logging
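
To make the difference between the two tuning strategies concrete, here is a minimal, library-free sketch. The function names and the `score_fn` callback are hypothetical; a real run would score each candidate by training and validating a model, for instance with scikit-learn's `GridSearchCV` or `RandomizedSearchCV`.

```python
import itertools
import random


def grid_search(param_grid, score_fn):
    """Evaluate every combination in `param_grid`; return (best_params, best_score)."""
    names = list(param_grid)
    candidates = [
        dict(zip(names, values))
        for values in itertools.product(*param_grid.values())
    ]
    return _best(candidates, score_fn)


def random_search(param_grid, score_fn, n_iter=10, seed=0):
    """Evaluate `n_iter` randomly sampled combinations (sampled with replacement)."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(values) for name, values in param_grid.items()}
        for _ in range(n_iter)
    ]
    return _best(candidates, score_fn)


def _best(candidates, score_fn):
    """Pick the candidate with the highest score."""
    scored = [(score_fn(params), params) for params in candidates]
    best_score, best_params = max(scored, key=lambda pair: pair[0])
    return best_params, best_score
```

Grid search is exhaustive, so its cost grows with the product of all option counts, while random search evaluates a fixed budget of `n_iter` candidates and may miss the best combination.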

Setup and Usage

Install the required dependencies:
```
pip install -r requirements.txt
```
Set up your TMDb API token as an environment variable:
```
export TMDB_API_TOKEN=your_token_here
```
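
A script can then read the token from the environment at startup. The helper below is a sketch under assumptions: `tmdb_headers` is a hypothetical name, and it assumes the token is a TMDb API Read Access Token sent as a Bearer token, which may not match how the retrieval script authenticates.

```python
import os


def tmdb_headers(env=os.environ):
    """Build request headers from the TMDB_API_TOKEN environment variable."""
    token = env.get("TMDB_API_TOKEN")
    if not token:
        raise RuntimeError("TMDB_API_TOKEN is not set; export it before running the scripts.")
    # Assumption: the token is a TMDb API Read Access Token sent as a Bearer token.
    return {"Authorization": f"Bearer {token}", "accept": "application/json"}
```

Failing early with a clear message avoids a run of unauthorized requests when the variable is missing.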

Run the data retrieval script:
```
python code/1_data_retrieval/tmdb_retrieval.py
```
Run the preprocessing script:
```
python code/2_data_preprocessing/main_preprocessing.py
```

Run the feature engineering and machine learning scripts (modify as needed):
```
python code/4_machine_learning/ml_logic.py
```

Data

The project uses data from various sources:

- The Movie Database (TMDb)
- Box Office Mojo (via web scraping)
- World Bank and OECD for socioeconomic indicators

Data is stored in the data/ directory, with subdirectories for raw, processed, and machine learning-ready data.

Models

The project supports various machine learning models, including:

- Logistic Regression
- Random Forest
- XGBoost
- LightGBM
- Neural Networks

Models can be configured for both regression (predicting box office revenue) and classification (predicting success categories) tasks.
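
For the classification task, revenue has to be mapped to discrete labels. The sketch below shows one way to do that; the thresholds and label names are purely illustrative assumptions and are not taken from the project.

```python
from bisect import bisect_right

# Hypothetical revenue thresholds in USD; the project's real category boundaries may differ.
THRESHOLDS = [1_000_000, 10_000_000, 100_000_000]
LABELS = ["flop", "modest", "hit", "blockbuster"]


def success_category(revenue: float) -> str:
    """Map a revenue figure to a success label for the classification task."""
    # bisect_right counts how many thresholds are <= revenue, which is the label index.
    return LABELS[bisect_right(THRESHOLDS, revenue)]
```

A revenue exactly on a threshold falls into the higher category, because `bisect_right` places equal values to the right of the boundary.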

Evaluation

Model performance is evaluated using various metrics, including:

- For regression: MSE, MAPE, MAE, RMSE, R²
- For classification: Accuracy, Precision, Recall, F1 Score, ROC AUC
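
For reference, the regression metrics can be computed directly from paired true and predicted values. This is a plain-Python sketch with a hypothetical function name; MAPE is returned as a fraction, and rows with a true value of zero are skipped for MAPE as a simplifying assumption.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, MAPE (as a fraction) and R² for paired values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when a true value is zero, so those rows are skipped here.
    pct = [abs(e / t) for e, t in zip(errors, y_true) if t != 0]
    mape = sum(pct) / len(pct) if pct else float("nan")
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    # R² compares the residual error with the variance of the true values.
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "mape": mape, "r2": r2}
```

Because box office revenue spans several orders of magnitude, MSE and RMSE are dominated by the largest films, while MAPE weights every film's relative error equally.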

Results are logged and stored in the logs/ and metadata/ directories.
