Congratulations! You've been offered an exciting and lucrative new position. You're going to be a machine learning engineer at a small startup company called GourmetGram. They are developing an online photo sharing community focused on food.
On your first day at work, you are given the following brief:
When someone uploads a photo of food to our site, it should be tagged and categorized automatically, right?
Your predecessor had already implemented a machine learning model that classified photos into different categories: Bread, Dairy product, Dessert, Egg, Fried food, Meat, Noodles/Pasta, Rice, Seafood, Soup, and Vegetable/Fruit. And they built a simple web app around the upload and classification process, as well as a deployment "recipe" for the web app.
But there were problems: first, the categorization was not very accurate. Second, it took too long. The classification happens in real time, when the photo is uploaded and before the user sees the next page (with the photo and info about it, including its assigned category), so it can't take too long.
Of course, we can fix the timing issue to some extent by throwing more compute resources at the problem. But we are also concerned with cost, so we don't want to pay for compute resources that are sitting idle when load is low.
You can fix both of these problems, right? (That's why we hired you!)
You are expected to deliver the following:
- an evaluation of your predecessor's model and deployment "recipe" (both in terms of optimizing metrics like accuracy, and operational metrics like inference time and resource usage)
- a new model and deployment "recipe" to optimize the accuracy of the model. (Of course, you don't want to sacrifice on operational metrics unnecessarily - don't make choices that increase inference time or resource usage but don't improve accuracy. But, when there is a tradeoff between accuracy and latency, prioritize accuracy.)
- a new model and deployment "recipe" to optimize the inference latency of the model. (Of course, you don't want to sacrifice on accuracy unnecessarily - don't make choices that reduce accuracy but don't improve inference time or resource usage. But, when there is a tradeoff between latency and accuracy, prioritize latency.)
- a PDF set of slides that your manager can use to present your overall findings to the senior executives at GourmetGram.
This sequence of exercises will introduce you to some basic principles of cloud computing, in the context of deploying a machine learning model to the cloud. You will learn the following:
Deploy an image classification model as a web service. After completing this section, you should be able to:
- describe advantages of deploying a web service using containers
- incorporate a machine learning model into a web application using Flask
- build, deploy, and administer containers to serve a web application
Deploy a service using container orchestration (Kubernetes). After completing this section, you should be able to:
- describe benefits of using a container orchestration framework, versus directly deploying containers
- configure, deploy, and test a service deployed on a Kubernetes cluster
Deploy a load balanced service on Kubernetes. After completing this section, you should be able to:
- explain tradeoffs of load balancing with respect to service time and resource usage
- configure, deploy, and test a load balanced service on Kubernetes
Deploy a service with dynamic scaling. After completing this section, you should be able to:
- explain tradeoffs of dynamic scaling with respect to service time and resource usage
- configure, deploy, and test a service with dynamic scaling on Kubernetes
This sequence assumes that you have already deployed a Kubernetes cluster on Chameleon, using this recipe, and that you can SSH into the primary node in your cluster.
On the primary node in your cluster, "node-0", run
git clone --single-branch --branch gh-pages https://github.com/teaching-on-testbeds/k8s-ml.git
to get the materials you will need for these exercises.
We will start by deploying an image classification model as a web service. Users can upload an image to a basic web app (built with Flask) in their browser; the app will then:
- resize the image to the correct input dimensions
- pass it as input to an image classification model that has been fine-tuned for food classification
- and return the most likely class (and estimated confidence), as well as the wall time of the `model.predict` call.
To make it easy to deploy our basic application, we will containerize it - package the source code together with an operating system, software dependencies, and anything else it needs to run. This is a convenient way to distribute the code along with everything it needs. This will also make it much easier to deploy multiple copies of the application (to handle heavier load) in future exercises.
The source code for our basic application is inside the `app` subdirectory of `k8s-ml`. It has the following materials:
- `app`
  - `instance`
  - `static`
  - `templates`
  - `app.py`
  - `Dockerfile`
  - `requirements.txt`
  - `model.keras`
You can browse these files in the web interface here.
Note that a saved model - `model.keras` - is inside this directory. Then, inside `app.py`:

- when `app.py` runs, it loads the saved model:

```python
model = load_model("model.keras")
```

- when a user uploads an image file to the app, or when a special "test" path in the URL is used, a `model_predict` function is called that returns the predicted class of the image:

```python
def model_predict(img_path, model):
    im = Image.open(img_path).convert('RGB')
    image_resized = im.resize(target_size, Image.BICUBIC)
    test_sample = np.array(image_resized)/255.0
    test_sample = test_sample.reshape(1, target_size[0], target_size[1], 3)
    classes = np.array(["Bread", "Dairy product", "Dessert", "Egg", "Fried food", "Meat",
                        "Noodles/Pasta", "Rice", "Seafood", "Soup", "Vegetable/Fruit"])
    test_probs = model.predict(test_sample)
    most_likely_classes = np.argmax(test_probs.squeeze())
    return classes[most_likely_classes], test_probs.squeeze()[most_likely_classes]
```
The `app` directory also includes a Dockerfile. This file describes how to build a container for this application, including:

- what "base" container should be used (we'll use a Python 3.9 base container)
- what needs to run inside the container when it starts (we'll install the libraries listed in `requirements.txt` using `pip`)
- what files should be copied to the container (everything in `app`)
- and what command to run in the container when it is ready (we will run `python app.py` to start our web app)
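To make this concrete, here is a minimal sketch of a Dockerfile along these lines - the actual Dockerfile in the repository may use a different base image tag, file layout, or additional options:

```dockerfile
# Sketch of a Dockerfile matching the description above
# (details may differ from the actual Dockerfile in the repository)
FROM python:3.9

# copy the application source (everything in the app directory) into the container
COPY . /app
WORKDIR /app

# install the Python libraries listed in requirements.txt
RUN pip install -r requirements.txt

# the Flask app listens on port 5000
EXPOSE 5000

# start the web app when the container runs
CMD ["python", "app.py"]
```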
We will use this Dockerfile to build the container (naming it `ml-app`) and then push it to a local distribution "registry" of containers (which is already running on node-0, on port 5000).
In an SSH session on "node-0", run
docker build -t ml-app:0.0.1 ~/k8s-ml/app
docker tag ml-app:0.0.1 node-0:5000/ml-app:0.0.1
docker push node-0:5000/ml-app:0.0.1
Now that we have containerized our application, we can run it! Let's run it now, indicating that we want incoming requests on port 32000 to be passed to port 5000 on the container (where our Flask application is listening). In an SSH session on "node-0", run
docker run -d -p 32000:5000 node-0:5000/ml-app:0.0.1
Here,

- `-d` is for detach mode - so we can leave it running in the background. (If you need to run a container in the foreground, for debugging purposes, you would omit this argument.)
- `-p` assigns the mapping between the "incoming request port" (32000) and the "container port" (5000).
You can see the list of running containers with

docker ps

which will show containers related to the Docker registry and the Kubernetes deployment, but will also show one running container using the `node-0:5000/ml-app:0.0.1` image. To restrict the output to just this image, we can use

docker ps -f ancestor=node-0:5000/ml-app:0.0.1
The firewall on the node will block access to our app from outside, so let us temporarily disable it:
sudo systemctl stop firewalld.service
Now we can visit our web service and try it out! Run this command on "node-0" to get the URL to use:
echo http://$(curl -s ifconfig.me/ip):32000
Then, open your browser, paste this URL into the address bar, and hit Enter.
When the web app has loaded, upload an image to your classification service, and check its prediction.
You can also see the resource usage - in terms of CPU and memory - of your container, with
docker stats $(docker ps -q -f ancestor=node-0:5000/ml-app:0.0.1)
(which uses command substitution to get the ID of any running container using our `ml-app` image, then gets its statistics). Use Ctrl+C to stop this live display.
When you are finished, you can stop the container by running:
docker stop $(docker ps -q -f ancestor=node-0:5000/ml-app:0.0.1)
(which uses command substitution to get the ID of any running container using our `ml-app` image, then stops it).
In the previous exercise, we deployed a container by running it directly. Now, we will deploy the same container using Kubernetes, a platform for container orchestration.
What are some benefits of using a container orchestration framework like Kubernetes, rather than deploying containers directly?
- Container Orchestration: Kubernetes helps to automate the deployment, scaling and management of containers.
- Self-healing: If a container fails, Kubernetes can automatically detect it and replace it with a functional container.
- Load balancing: Kubernetes can distribute traffic for an application across multiple instances of running containers. (We'll try this in the next exercise.)
- Resource management: Kubernetes allows you to set resource limits for containers, to ensure that the application has the resources to run efficiently.
Pods are the basic components in Kubernetes, and are used to deploy and manage containerized applications in a scalable and efficient way. They are designed to be ephemeral, meaning they can be created, destroyed, and recreated as needed. They can also be replicated, which allows for load balancing and high availability.
Although we will eventually deploy pods across all three of our "worker" nodes, our deployment will be managed from the "controller" node, which is "node-0".
To deploy an app on a Kubernetes cluster, we use a manifest file, which describes our deployment. For this exercise, we will use the "deployment_k8s.yaml" file inside the "~/k8s-ml/deploy_k8s" directory, which you can see here.
This manifest file defines a Kubernetes service named "ml-kube-service" and a Kubernetes deployment named "ml-kube-app".
- Inside the service definition, we create a service of `type: NodePort`. This service passes incoming requests on a specified port to a (different) port on the pod.
- Inside the deployment definition, you can see that
  - the "ml-app" container you built earlier will be retrieved from the local registry (`node-0:5000/ml-app:0.0.1`),
  - the deployment will include just a single copy of our pod (`replicas: 1`),
  - and there is a "readiness" probe defined - the container is considered "Ready", and requests will be forwarded to it, only when it responds with a success code to 3 HTTP requests on the `/test` endpoint.
It also defines the resource requirements of the container, in terms of CPU cores and memory. The "request" defines the minimum resource a container may get, and the "limit" defines the maximum resource a container may get.
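Putting these pieces together, the manifest looks roughly like the sketch below. The service and deployment names, image, replica count, and readiness endpoint follow the description above, and the port numbers match the service output you will see shortly; the probe interval and the specific CPU and memory values are illustrative placeholders, so refer to the actual `deployment_k8s.yaml` for the real settings:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ml-kube-service
spec:
  type: NodePort
  selector:
    app: ml-kube-app
  ports:
    - port: 6000        # port the service listens on inside the cluster
      targetPort: 5000  # port the Flask app listens on in the pod
      nodePort: 32000   # port exposed on the node for incoming requests
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-kube-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ml-kube-app
  template:
    metadata:
      labels:
        app: ml-kube-app
    spec:
      containers:
        - name: ml-kube-app
          image: node-0:5000/ml-app:0.0.1   # pulled from the local registry
          ports:
            - containerPort: 5000
          readinessProbe:
            httpGet:
              path: /test
              port: 5000
            successThreshold: 3   # "Ready" only after 3 successful probes
            periodSeconds: 5      # illustrative probe interval
          resources:
            requests:             # minimum resources (illustrative values)
              cpu: "1"
              memory: "512Mi"
            limits:               # maximum resources (illustrative values)
              cpu: "2"
              memory: "1Gi"
```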
To start this deployment, we will run:
kubectl apply -f ~/k8s-ml/deploy_k8s/deployment_k8s.yaml
and make sure the following output appears:
service/ml-kube-service created
deployment.apps/ml-kube-app created
Let's check the status of the service. Run the command:
kubectl get svc -o wide
The output will include a line similar to
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
ml-kube-service NodePort 10.233.37.25 <none> 6000:32000/TCP 12m app=ml-kube-app
It may take a few minutes for the pod to start running. To check the status of pods, run:
kubectl get pods -o wide
The output may include a line similar to
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
ml-kube-app-7b4c8648c6-r8zvv 0/1 ContainerCreating 0 22s <none> node-2 <none> <none>
In this example, the status of the pod is `ContainerCreating`, which means the container is getting ready. When it reaches the `Running` state, the pod is healthy and running. When it shows "1/1" in the "Ready" column, it is ready to accept requests according to the readiness probe we set up.
(As before, if your model is large, it may take a while before it is ready to accept requests.)
Once the pod is ready, check the resource usage (CPU and memory) of the pod with
kubectl top pod
Note that the resource usage varies depending on whether or not the pod is currently serving a request!
Get the URL of the service - run
echo http://$(curl -s ifconfig.me/ip):32000
copy and paste this URL into your browser's address bar, and verify that your app is up and running there.
To test the deployment under load, we will use Siege, a command-line tool for testing and analyzing the performance of web servers. It can generate a significant amount of traffic to test how a web server responds under load.
Install Siege on node-0:
sudo apt-get update; sudo apt-get -y install siege
You will need two SSH sessions on node-0. In one, run

watch -n 5 kubectl top pod

to monitor the pod's resource usage in real time (the display updates every 5 seconds). In the second SSH session, run
siege -c 10 -t 30s http://$(curl -s ifconfig.me/ip):32000/test
Here Siege will generate traffic to a "test" endpoint on your website, which requests a prediction for a pre-saved image, for 30 seconds with a concurrency level of 10 users. After it finishes execution, make a note of key results - how many transactions were served successfully, how many failed, the transaction rate, and what the average response time was (note that this includes inference time, as well as several other elements).
Note: you may see some instances of
[error] socket: unable to connect sock.c:249: Connection refused
in the Siege output. This is an indication that in the tests, some of the connections failed entirely (rather than just waiting a long time!) because the server is under such heavy load.
When you are done with your experiment, make sure to delete the deployment and service. To delete, run the command:
kubectl delete -f ~/k8s-ml/deploy_k8s/deployment_k8s.yaml
and look for output like
service "ml-kube-service" deleted
deployment.apps "ml-kube-app" deleted
Use
kubectl get pods -o wide
and verify that (eventually) no pods are running your app.
In the previous exercise, we deployed a single replica of a Kubernetes pod. But if the load on the service is high, the single pod will have slow response times. We can address this by deploying multiple "replicas" of the pod, and distributing the traffic across them by assigning each incoming request to a pod. This is called load balancing.
The manifest file for deploying a load balanced service is named "deployment_lb.yaml", and it is inside the "~/k8s-ml/deploy_lb" directory. You can see it here.
This manifest file defines a Kubernetes service of type `LoadBalancer` named "ml-kube-service" and a Kubernetes deployment named "ml-kube-app". There are two major differences between this deployment and the previous one:

- in this one, we specify that the service is of `type: LoadBalancer`, i.e. instead of passing incoming requests directly to one pod, Kubernetes will place a load balancer service in "front" of the pods, which will distribute the requests across them.
- in this one, we specify `replicas: 5` where previously we used `replicas: 1`.
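As a rough sketch (reusing the names and port numbers from the previous exercise), the service stanza might look like the following; the deployment stanza is unchanged apart from the replica count:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ml-kube-service
spec:
  type: LoadBalancer    # distribute incoming requests across all "Ready" pods
  selector:
    app: ml-kube-app
  ports:
    - port: 6000
      targetPort: 5000
      nodePort: 32000
```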
To start this deployment, we will run:
kubectl apply -f ~/k8s-ml/deploy_lb/deployment_lb.yaml
and make sure the following output appears:
service/ml-kube-service created
deployment.apps/ml-kube-app created
It will take a few minutes for the pods to start running. To check the status of deployment, run:
kubectl get pods -o wide
and wait until the output shows that all pods are in the "Running" state and show as "1/1" in the "Ready" column. Note that the pods will be deployed across all of the nodes in the cluster. Also note that if some pods are "Ready" but not others, requests will be sent only to the pods that are "Ready".
Once the pods are ready, you can see the resource usage (CPU and memory) of the pods with
kubectl top pod
Get the URL of the service - run
echo http://$(curl -s ifconfig.me/ip):32000
copy and paste this URL into your browser's address bar, and verify that your app is up and running there.
As before, we will test this deployment under load. You will need two SSH sessions on node-0. In one, run

watch -n 5 kubectl top pod

to monitor the pods' resource usage in real time (the display updates every 5 seconds). In the second SSH session, run
siege -c 10 -t 30s http://$(curl -s ifconfig.me/ip):32000/test
After it finishes execution, make a note of key results - how many transactions were served successfully, how many failed, the transaction rate, and what the average response time was (note that this includes inference time, as well as several other elements).
When you are done with your experiment, make sure to delete the deployment and service. To delete, run the command:
kubectl delete -f ~/k8s-ml/deploy_lb/deployment_lb.yaml
and look for output like
service "ml-kube-service" deleted
deployment.apps "ml-kube-app" deleted
Use
kubectl get pods -o wide
and verify that (eventually) no pods are running your app.
When we used load balancing to distribute incoming traffic across multiple pods, the response time under load was much faster. But during time intervals when the load is not heavy, it may be wasteful to deploy so many pods. (The application is loaded and uses memory and some CPU even when there are no requests!)
To address this issue, we can use scaling - where the deployed resources change in response to load on the service. In this exercise, we specifically use horizontal scaling, which adds more pods/replicas to handle increasing levels of work, and removes pods when they are not needed. (This is in contrast to vertical scaling, which would increase the resources assigned to each pod - CPU and memory - to handle increasing levels of work.)
The manifest file for deploying a service with scaling is "deployment_hpa.yaml", and it is inside the "~/k8s-ml/deploy_hpa" directory. You can see it here.
There are two differences between this deployment and the previous deployment:
- in this one, we add a `HorizontalPodAutoscaler`. We specify the minimum and maximum number of replicas we want to have in our deployment, and the condition under which to increase the number of replicas. (This `HorizontalPodAutoscaler` is in addition to the `LoadBalancer` service, which is still in place.)
- in this one, we specify `replicas: 1` again in the deployment - the initial deployment has 1 replica, but it may be scaled up to 5 by the autoscaler.
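A `HorizontalPodAutoscaler` along these lines might look like the sketch below. The 1-to-5 replica range follows the description above, while the resource name, the API version, and the 50% CPU utilization target are assumptions for illustration - refer to the actual `deployment_hpa.yaml` for the real scaling condition:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-kube-hpa               # hypothetical name
spec:
  scaleTargetRef:                 # the deployment whose replica count is adjusted
    apiVersion: apps/v1
    kind: Deployment
    name: ml-kube-app
  minReplicas: 1                  # start with a single replica
  maxReplicas: 5                  # scale out to at most five replicas
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50  # illustrative scale-up threshold
```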
To start this deployment, we will run:
kubectl apply -f ~/k8s-ml/deploy_hpa/deployment_hpa.yaml
Let's check the status of the service. Run the command:
kubectl get svc -o wide
and then
kubectl get pods -o wide
Initially, you will see one pod in the deployment. Wait until the pod is "Running" and has a "1/1" in the "Ready" column.
Get the URL of the service - run
echo http://$(curl -s ifconfig.me/ip):32000
copy and paste this URL into your browser's address bar, and verify that your app is up and running there.
You will need two SSH sessions on node-0. In one, run
kubectl get hpa --watch
to see the current state of the autoscaler.
In the second SSH session, run
siege -c 10 -t 360s http://$(curl -s ifconfig.me/ip):32000/test
Note that this test is of a longer duration, so that you will have time to observe additional replicas being brought up and becoming ready to use.
When you are done with your experiment, make sure to delete the deployment and services. To delete, run the command:
kubectl delete -f ~/k8s-ml/deploy_hpa/deployment_hpa.yaml
Use
kubectl get pods -o wide
and verify that (eventually) no pods are running your app.
Also, re-enable the firewall:
sudo systemctl start firewalld.service