This project uses machine learning techniques to predict housing prices based on various features. The dataset is fetched, prepared, and analyzed in a Jupyter Notebook, with Random Forest Regression used as the primary model. Cross-validation is implemented to evaluate the model's performance.
- Project Overview
- Dataset
- Project Structure
- Dependencies
- Running the Project
- Results
- Credits and Acknowledgments
This project demonstrates:
- Fetching and extracting data from a remote source.
- Data cleaning, feature engineering, and exploratory data analysis.
- Building and evaluating a machine learning model using Random Forest Regression.
- Performing cross-validation to measure model performance.
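The cleaning and feature-engineering step can be illustrated with a short sketch. It is not the notebook's actual code: the sample rows, the column names, and the derived rooms_per_household feature are assumptions chosen for illustration. It shows one common scikit-learn pattern, which fills missing numeric values with the median, scales numeric columns, and one-hot encodes a categorical column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up sample; the column names are illustrative assumptions.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],  # one missing value
    "households": [126.0, 1138.0, 177.0, 219.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# Feature engineering: derive a ratio feature from two raw columns.
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]

numeric_cols = ["total_rooms", "total_bedrooms", "households", "rooms_per_household"]
categorical_cols = ["ocean_proximity"]

# Numeric columns: fill gaps with the column median, then standardize.
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [
        ("num", numeric_pipeline, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,
)

housing_prepared = preprocess.fit_transform(housing)
print(housing_prepared.shape)
```

Wrapping the steps in a pipeline means the same transformations, fitted on the training data, can be reapplied unchanged to any held-out data.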
The dataset used for this project is the 'California Housing Prices' dataset from the StatLib repository, which is based on data from the 1990 California census. It has been added to this GitHub repository, and the housing data, stored in a CSV file, is downloaded and extracted automatically when the notebook runs.
The dataset contains:
- Various housing-related features such as location, price, size, and more.
- Labels (target variable) representing housing prices.
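The automatic download and extraction described above could look roughly like the following. This is a minimal sketch, not the notebook's code: the function name fetch_and_load, the archive file name housing.tgz, and the CSV name housing.csv are hypothetical, and the archive URL is left as a parameter because the exact address the notebook uses is not stated here.

```python
import tarfile
import urllib.request
from pathlib import Path

import pandas as pd


def fetch_and_load(archive_url, dataset_dir="dataset", csv_name="housing.csv"):
    """Download a tar archive, extract it, and load one CSV file from it.

    All names here are illustrative assumptions, not the notebook's own.
    """
    dataset_path = Path(dataset_dir)
    dataset_path.mkdir(parents=True, exist_ok=True)

    # Save the archive next to the extracted files.
    archive_path = dataset_path / "housing.tgz"
    urllib.request.urlretrieve(archive_url, str(archive_path))

    # Extract every member of the archive into the dataset directory.
    with tarfile.open(archive_path) as archive:
        archive.extractall(path=dataset_path)

    return pd.read_csv(dataset_path / csv_name)
```

Calling fetch_and_load with the archive's URL would return a pandas DataFrame, and housing.info() or housing.describe() then gives a quick overview of the columns and any missing values.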
The project is organized as follows:
- Jupyter Notebook: Contains all the code for data loading, preprocessing, model building, and evaluation.
- Dataset: Downloaded and extracted automatically into the dataset directory of your current workspace.
To run this project, ensure you have the following installed:
- Python 3.8+
- Jupyter Notebook
- NumPy
- Pandas
- Scikit-learn
- tarfile (standard library)
- six (for compatibility)
Install dependencies using pip:
pip install numpy pandas scikit-learn notebook six
- Clone the GitHub repository:
    git clone https://github.com/kitkat1424/housing-price-prediction-ML.git
- Navigate to the project directory:
    cd housing-price-prediction-ML
- Open the Jupyter Notebook:
    jupyter notebook housing_prices.ipynb
- Run all cells in the notebook to:
    - Fetch and extract the dataset.
    - Perform data preprocessing.
    - Train and evaluate the model.
The project compares Linear Regression, a Decision Tree Regressor, and Random Forest Regression on the training set, and selects Random Forest Regression to predict housing prices. Model performance is evaluated using:
- RMSE (Root Mean Squared Error): Evaluated on both training data and cross-validation folds.
- Cross-validation: Gives a more robust estimate by splitting the training data into several folds, training on all but one fold, and validating on the held-out fold. The mean RMSE across folds is then computed.
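The evaluation described above can be sketched as follows. This is an illustration under stated assumptions rather than the notebook's code: it uses synthetic data from make_regression in place of the housing data, and the fold count (10) and forest size (50 trees) are arbitrary choices. For each model it computes the RMSE on the training data and the mean RMSE across cross-validation folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the prepared housing features and labels.
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    # RMSE on the same data the model was trained on (optimistic).
    model.fit(X, y)
    train_rmse = np.sqrt(mean_squared_error(y, model.predict(X)))

    # scikit-learn scorers follow "higher is better", so MSE is returned
    # negated; flip the sign before taking the square root.
    neg_mse = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10)
    cv_rmse = np.sqrt(-neg_mse)

    results[name] = (train_rmse, cv_rmse.mean(), cv_rmse.std())
    print(f"{name}: train RMSE={train_rmse:.2f}, "
          f"CV RMSE={cv_rmse.mean():.2f} (+/- {cv_rmse.std():.2f})")
```

A training RMSE near zero paired with a much larger cross-validation RMSE, as a fully grown decision tree typically shows, is a sign of overfitting. Because the synthetic data here is linear by construction, the ranking of the three models in this sketch says nothing about their ranking on the housing data.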
- Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd ed.). O'Reilly Media.
- Libraries such as Scikit-learn, Pandas, and NumPy were instrumental in building this project.