This repository showcases a Google Cloud AutoML machine learning model for predicting house prices. The "House Prices - Advanced Regression Techniques" Kaggle dataset was used for training and evaluation.
- Getting started
- Submitting training jobs
- Model deployment
- Evaluation and model selection
- Making predictions
- Contributing
To get your environment set up and start using this model, please follow the step-by-step instructions provided below.
First, you'll need to clone this repository to your local machine or development environment. Open your terminal, navigate to the directory where you want to clone the repository, and run the following command:
```bash
git clone <repository-url>
```

Replace `<repository-url>` with the actual URL of this repository. Once cloned, navigate into the repository's directory with `cd <repository-name>`.
Within the root directory of the cloned repository, create a .env file to store your project configurations. This file should include the following environment variables tailored to your project:
```
PROJECT_ID=<project_id>
REGION=<region> # example: us-central1
```
Make sure to replace <project_id> and <region> with your specific project details.
To interact with Google Cloud resources, you need to install the Google Cloud CLI (gcloud) on your system. Follow the detailed installation instructions in the official documentation: https://cloud.google.com/sdk/docs/install
To authenticate to Google Cloud services from your development environment, configure Application Default Credentials (ADC) by following the official guide: https://cloud.google.com/docs/authentication/provide-credentials-adc
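For local development, one common way to set up ADC (once the gcloud CLI is installed) is to run:

```bash
gcloud auth application-default login
```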
The training process is orchestrated using the training_jobs.py script. This script facilitates the training of machine learning models by leveraging Google Cloud's AutoML service. With it, you can easily specify your dataset, target column for prediction, and any columns to omit during training. The script supports various configurations to tailor the training process to your needs, including setting the optimization objective, splitting the dataset, and defining the computational budget.
```bash
python demo3/training_jobs.py --gcs_source "gs://your-source-bucket/your-source-file" \
    --model_name your-model-name \
    --target_column your-target-column \
    --omit_columns your-omit-columns
```
This command initiates the training of your machine learning model using the specified dataset. It automatically handles the creation of a tabular dataset within Google Cloud, executes an AutoML training job based on your specified parameters, and outputs a trained model ready for evaluation and deployment.
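For context, the script builds on the Vertex AI Python SDK (`google-cloud-aiplatform`). A minimal sketch of the equivalent SDK calls, with placeholder dataset name, target column, and budget values (illustrative only, not the script's exact implementation), might look like this:

```python
from google.cloud import aiplatform

# Initialize the SDK with the project and region from your .env file.
aiplatform.init(project="<project_id>", location="<region>")

# Create a tabular dataset from the source file in Cloud Storage.
dataset = aiplatform.TabularDataset.create(
    display_name="house-prices-dataset",
    gcs_source="gs://your-source-bucket/your-source-file",
)

# Configure an AutoML tabular regression job with an optimization objective.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="house-prices-training",
    optimization_prediction_type="regression",
    optimization_objective="minimize-rmse",
)

# Run training; the budget is expressed in milli node hours (1000 = 1 node hour).
model = job.run(
    dataset=dataset,
    target_column="your-target-column",
    model_display_name="your-model-name",
    budget_milli_node_hours=1000,
)
```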
The model deployment process utilizes the serving.py script, which is responsible for deploying your machine learning models to Google Cloud's Vertex AI endpoints with dedicated resources.
```bash
python demo3/serving.py --model_id "your-model-id" \
    --machine_type "machine-type" \
    --endpoint_name "your-endpoint-name" \
    --model_name "your-model-name"
```
This command will create a new endpoint and deploy your model to it, making it ready for serving online predictions.
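As a point of reference, deploying a model to a new endpoint with dedicated resources via the Vertex AI Python SDK typically involves calls like the following (the machine type and display names below are placeholders, not values required by serving.py):

```python
from google.cloud import aiplatform

aiplatform.init(project="<project_id>", location="<region>")

# Look up the trained model by its ID.
model = aiplatform.Model(model_name="your-model-id")

# Create a new endpoint and deploy the model to it with dedicated compute.
endpoint = aiplatform.Endpoint.create(display_name="your-endpoint-name")
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="your-model-name",
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=1,
)
```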
The evaluation process is handled by the evaluation.py script, which evaluates the performance of machine learning models deployed on Google Cloud's Vertex AI. This script fetches the latest model evaluation metrics and can help in identifying the "champion" model version based on specific criteria.
```bash
python demo3/evaluation.py --model_id "your-model-id"
```
This command retrieves the latest evaluation metrics for your model.
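For reference, the underlying SDK exposes evaluations directly on the model resource; a minimal sketch (the metric keys mentioned are typical for AutoML regression models, not guaranteed output of evaluation.py) could be:

```python
from google.cloud import aiplatform

aiplatform.init(project="<project_id>", location="<region>")

model = aiplatform.Model(model_name="your-model-id")

# Each evaluation carries regression metrics such as rootMeanSquaredError
# and meanAbsoluteError, which can drive champion-model selection.
for evaluation in model.list_model_evaluations():
    print(evaluation.metrics)
```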
The prediction process is facilitated through the prediction.py script, which supports both online and batch prediction modes for models deployed on Google Cloud's Vertex AI.
```bash
python demo3/prediction.py --online \
    --endpoint_id "your-endpoint-id" \
    --input_file "sample_data/your_input_file.json"
```
This command performs real-time, online predictions by sending input data to a deployed model's endpoint. It's ideal for applications requiring immediate inference.
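A rough equivalent of an online prediction request using the Vertex AI Python SDK is shown below; the endpoint ID and feature names are illustrative placeholders drawn from the Kaggle dataset's columns:

```python
from google.cloud import aiplatform

aiplatform.init(project="<project_id>", location="<region>")

# Reference the endpoint the model was deployed to.
endpoint = aiplatform.Endpoint("your-endpoint-id")

# AutoML tabular models expect each instance as a dict of feature values.
instances = [{"LotArea": "8450", "YearBuilt": "2003", "OverallQual": "7"}]
response = endpoint.predict(instances=instances)
print(response.predictions)
```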
```bash
python demo3/prediction.py --batch \
    --model_id "your-model-id" \
    --gcs_batch_pred_source "your-test-data-uri" \
    --gcs_destination "your-results-destination-uri"
```
This command initiates a batch prediction job, allowing for the processing of large volumes of data. The results are stored in a specified Google Cloud Storage location, making it suitable for asynchronous prediction tasks on bulk data.
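For reference, a batch prediction job can also be submitted directly through the Vertex AI Python SDK along these lines (the bucket URIs, job display name, and CSV input format are assumptions for illustration):

```python
from google.cloud import aiplatform

aiplatform.init(project="<project_id>", location="<region>")

model = aiplatform.Model(model_name="your-model-id")

# Reads input from Cloud Storage and writes results under the destination prefix.
batch_job = model.batch_predict(
    job_display_name="house-prices-batch-prediction",
    gcs_source="gs://your-bucket/your-test-data.csv",
    gcs_destination_prefix="gs://your-bucket/batch-results/",
    instances_format="csv",
    sync=True,
)
print(batch_job.state)
```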
Our project embraces a streamlined workflow that ensures high-quality software development and efficient collaboration among team members. To maintain this standard, we follow a specific branching strategy and commit convention outlined in our CONTRIBUTING.md file.
We highly encourage all contributors to familiarize themselves with these guidelines. Adhering to the outlined practices helps us keep our codebase organized, facilitates easier code reviews, and accelerates the development process. For detailed information on our branching strategy and how we commit changes, please refer to the CONTRIBUTING.md file.