The purpose of this project is to create a scalable and power-efficient LLM (Large Language Model) inference service using Kubernetes. The service will utilize a custom power capping operator that accepts a Custom Resource Definition (CRD) to specify the power capping limit. The operator will use KEDA (Kubernetes Event-Driven Autoscaling) to scale the LLM inference service deployment based on the specified power cap. Kepler, a power monitoring tool, will be used to monitor the power consumption of CPU and GPU resources on the server.
Please see BENEFITS for a detailed description of the motivations behind this project, and Problem Statement for a detailed statement of the power/performance optimization problem.
- Power Capping Operator: A custom Kubernetes operator that manages the power capping functionality of the LLM inference service.
- Custom Resource Definition (CRD): Defines the power capping limit and other configuration parameters for the LLM inference service.
- KEDA: Kubernetes Event-Driven Autoscaling tool that scales the LLM inference service deployment based on the power consumption metrics.
- LLM Inference Service: A Kubernetes deployment that runs the LLM inference workload.
- Kepler: A power monitoring tool that measures the power consumption of CPU and GPU resources on the server.
graph TD
A[Power Capping Operator] -->|Reads| B(Power Capping CRD)
A -->|Configures| C(KEDA)
C -->|Scales| D(LLM Inference Service Deployment)
E[Kepler] -->|Monitors| F(Server CPU/GPU Power Usage)
F -->|Provides Metrics| C
- Reads the power capping CRD to obtain the power capping limit and Prometheus parameters.
- Configures the referenced KEDA `ScaledObject` to scale the LLM inference service deployment based on the power consumption metrics provided by Kepler.
- Continuously monitors the power consumption metrics and adjusts the scaling configuration if necessary (a sketch of the CRD read step follows this list).
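As a rough sketch of the first step, the operator could list PowerCappingConfig objects with the Kubernetes Python client. The API group and version match the CRD example that follows; the plural name `powercappingconfigs` and the in-cluster configuration are assumptions made for illustration.

```python
# Minimal sketch: list PowerCappingConfig objects with the Kubernetes Python client.
# The plural name "powercappingconfigs" is assumed from the Kind; adjust to the real CRD.
from kubernetes import client, config

def load_power_capping_configs():
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    api = client.CustomObjectsApi()
    configs = api.list_cluster_custom_object(
        group="powercapping.climatik-project.ai",
        version="v1",
        plural="powercappingconfigs",
    )
    for item in configs.get("items", []):
        spec = item["spec"]
        yield {
            "name": item["metadata"]["name"],
            "power_cap_limit": spec["powerCapLimit"],
            "scaled_objects": spec.get("scaleObjectRef", []),
            "metrics": spec.get("metrics", []),
        }
```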
apiVersion: powercapping.climatik-project.ai/v1
kind: PowerCappingConfig
metadata:
name: llm-inference-power-cap
spec:
powerCapLimit: <power_cap_limit_in_watts>
deploymentName: <llm_inference_service_deployment_name>
scaleObjectRef:
- apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: <scale_object_name_1>
- apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: <scale_object_name_2>
metrics:
- type: Prometheus
prometheusAddress: <prometheus_server_address>
query: <prometheus_query_for_power_consumption>
threshold: <power_consumption_threshold>
- KEDA will be configured to scale the LLM inference service deployment based on the power consumption metrics provided by Kepler.
- The scaling configuration will be managed by the power capping operator.
- KEDA will ensure that the number of replicas stays within the specified minimum and maximum limits.
- A standard Kubernetes deployment that runs the LLM inference workload.
- The deployment will be scaled by KEDA based on the power consumption metrics.
- Kepler will be deployed on the server to monitor the power consumption of CPU and GPU resources.
- Kepler will expose the power consumption metrics to the power capping operator via Prometheus; a query sketch follows below.
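As an illustration, the operator could read the current power draw of a deployment's pods from Prometheus using the Kepler energy counter referenced later in this document. The Prometheus address and the Kepler label names used in the filter are assumptions to adapt to the actual deployment.

```python
# Minimal sketch: read the current power draw of a deployment's pods from Prometheus.
# The Prometheus address and the "container_namespace"/"pod_name" label names are
# assumptions; match them to the labels your Kepler version actually exposes.
import requests

PROMETHEUS = "http://prometheus-server"

def current_power_watts(namespace: str, pod_regex: str) -> float:
    # Joules are cumulative, so the per-second rate of the counter gives watts.
    query = (
        f'sum(irate(kepler_container_joules_total{{'
        f'container_namespace="{namespace}", pod_name=~"{pod_regex}"}}[1m]))'
    )
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0
```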
This section demonstrates the integration of the power capping operator with KServe, a standardized Serverless ML Inference Platform on Kubernetes. KServe creates deployments for serving LLM inference and associated KEDA ScaledObjects. The power capping operator then updates the CRD to manage the power capping configuration.
- KServe creates a deployment for serving LLM inference.
- KServe creates an associated KEDA ScaledObject for the deployment.
- The power capping operator watches for changes in the KServe deployments and ScaledObjects.
- The power capping operator updates the PowerCappingConfig CRD with the ScaledObject references.
- The power capping operator monitors the power consumption metrics and adjusts the scaling configuration for the new ScaledObject if necessary.
graph TD
A[KServe] -->|Creates| B(LLM Inference Deployment)
A -->|Creates| C(KEDA ScaledObject)
D[Power Capping Operator] -->|Watches| B
D -->|Watches| C
D -->|Updates| E(PowerCappingConfig CRD)
D -->|Monitors| F(Power Consumption Metrics)
D -->|Adjusts| C
- KServe creates a deployment for serving LLM inference using the InferenceService resource:

  ```yaml
  apiVersion: serving.kserve.io/v1beta1
  kind: InferenceService
  metadata:
    name: llm-inference-service
  spec:
    predictor:
      serviceAccountName: sa
      containers:
        - image: llm-inference-service:latest
          name: llm-inference-service
  ```

- KServe creates an associated KEDA ScaledObject for the deployment:

  ```yaml
  apiVersion: keda.sh/v1alpha1
  kind: ScaledObject
  metadata:
    name: llm-inference-scaledobject
  spec:
    scaleTargetRef:
      apiVersion: apps/v1
      kind: Deployment
      name: llm-inference-service
    pollingInterval: 15
    cooldownPeriod: 30
    minReplicaCount: 1
    maxReplicaCount: 10
    triggers:
      - type: prometheus
        metadata:
          serverAddress: http://prometheus-server
          metricName: average_token_per_second
          query: average_token_per_second[1m]
          threshold: "500"
  ```

- The power capping operator watches for changes in the KServe deployments and ScaledObjects.

- The power capping operator updates the PowerCappingConfig CRD with the ScaledObject references:

  ```yaml
  apiVersion: powercapping.climatik-project.ai/v1
  kind: PowerCappingConfig
  metadata:
    name: llm-inference-power-cap
  spec:
    powerCapLimit: 1000
    scaleObjectRef:
      - apiVersion: keda.sh/v1alpha1
        kind: ScaledObject
        metadata:
          name: llm-inference-scaledobject
  ```

- The power capping operator monitors the power consumption metrics and adjusts the scaling configuration if necessary.
This integration allows the power capping operator to seamlessly work with KServe deployments and manage their power capping configuration using KEDA ScaledObjects.
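A hypothetical sketch of the update step: when the operator observes a ScaledObject created by KServe, it appends a reference to the PowerCappingConfig. The namespace-scoped calls and the plural name `powercappingconfigs` are assumptions for this sketch, made under the premise that the CRD is namespaced.

```python
# Sketch: append a newly observed KEDA ScaledObject to the PowerCappingConfig's
# scaleObjectRef list. The plural "powercappingconfigs" and namespacing are assumptions.
from kubernetes import client, config

def register_scaled_object(config_name: str, namespace: str, scaled_object_name: str):
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    cfg = api.get_namespaced_custom_object(
        group="powercapping.climatik-project.ai", version="v1",
        namespace=namespace, plural="powercappingconfigs", name=config_name,
    )
    refs = cfg["spec"].setdefault("scaleObjectRef", [])
    if not any(r["metadata"]["name"] == scaled_object_name for r in refs):
        refs.append({
            "apiVersion": "keda.sh/v1alpha1",
            "kind": "ScaledObject",
            "metadata": {"name": scaled_object_name},
        })
        api.patch_namespaced_custom_object(
            group="powercapping.climatik-project.ai", version="v1",
            namespace=namespace, plural="powercappingconfigs",
            name=config_name, body=cfg,
        )
```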
The power capping operator can also be integrated with vLLM, a framework for serving large language models. vLLM provides a memory-efficient and scalable solution for deploying and serving LLMs.
vLLM creates deployments for serving LLM inference. Each vLLM deployment is associated with a KEDA ScaledObject that defines the scaling behavior based on the incoming workload.
Here's an example of a vLLM deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-deployment
spec:
replicas: 1
selector:
matchLabels:
app: vllm
template:
metadata:
labels:
app: vllm
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
ports:
- containerPort: 8000
The KEDA ScaledObject associated with the vLLM deployment defines the scaling rules based on the incoming requests and the desired target metrics.
Here's an example of a KEDA ScaledObject for vLLM:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: vllm-scaledobject
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-deployment
pollingInterval: 15
cooldownPeriod: 30
minReplicaCount: 1
maxReplicaCount: 10
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus-server
metricName: http_requests_total
threshold: "100"
query: average_token_throughput_per_second[1m]
The power capping operator integrates with vLLM deployments in the same way as it does with KServe. It watches for changes in the vLLM deployments and their associated KEDA ScaledObjects.
The power capping operator performs the following steps:
- Monitors the power consumption metrics from Kepler for the vLLM deployments.
- Retrieves the KEDA ScaledObject associated with each vLLM deployment.
- Adjusts the `maxReplicaCount` of the KEDA ScaledObject based on the power consumption metrics and the defined power capping rules (see the patch sketch after this list).
- Updates the KEDA ScaledObject to enforce the power capping limits.
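A minimal sketch of the adjustment step, assuming the operator has already computed the desired replica ceiling from the power metrics. KEDA's `keda.sh/v1alpha1` API group and `scaledobjects` plural are standard; everything else is illustrative.

```python
# Sketch: patch a KEDA ScaledObject's maxReplicaCount via a merge patch.
from kubernetes import client, config

def set_max_replicas(namespace: str, scaled_object: str, max_replicas: int):
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    api.patch_namespaced_custom_object(
        group="keda.sh", version="v1alpha1",
        namespace=namespace, plural="scaledobjects", name=scaled_object,
        body={"spec": {"maxReplicaCount": max_replicas}},
    )
```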
The integration with vLLM ensures that the power capping operator can effectively manage the power consumption of vLLM deployments, similar to how it manages KServe deployments.
graph TD
A[Power Capping Operator] -->|Monitors| B(vLLM Deployment)
A -->|Monitors| C(KEDA ScaledObject)
A -->|Adjusts maxReplicaCount| C
C -->|Scales| B
The diagram illustrates the integration flow between the power capping operator, vLLM deployment, and KEDA ScaledObject.
The power capping operator monitors the vLLM deployment and its associated KEDA ScaledObject, adjusts
the maxReplicaCount
based on the power consumption metrics, and updates the KEDA ScaledObject to enforce the power
capping limits.
By integrating with vLLM, the power capping operator extends its capabilities to manage the power consumption of LLM inference deployments across multiple frameworks, providing a comprehensive solution for power-efficient and scalable LLM serving.
To integrate real-time carbon intensity for dynamic power capping and achieve the target carbon cap, we modify the power capping operator to fetch carbon intensity data from an external source and adjust the power cap accordingly.
In this integration, we enhance the power capping operator to utilize real-time carbon intensity data for dynamic power capping. The goal is to achieve a target carbon capping by adjusting the power cap based on the current carbon intensity.
To obtain real-time carbon intensity data, we can use an external API or data source that provides this information. For example, we can use the Carbon Intensity API provided by the National Grid ESO in the UK. This API offers real-time and forecasted carbon intensity data for the UK electricity grid.
To calculate the carbon emission, we multiply the current power usage by the carbon intensity. The power usage can be obtained from the Kepler Prometheus metrics, as described in the previous sections. This calculation omits the details of PUE (Power Usage Effectiveness) and other factors that may affect the carbon emission estimate.
The power capping operator can dynamically adjust the power cap based on the current carbon intensity to achieve the target carbon capping. When the carbon intensity is high, the power cap is reduced to limit the carbon emission. Conversely, when the carbon intensity is low, the power cap can be increased to allow higher power usage.
To integrate the carbon intensity-based power capping into the existing power capping operator, we need to modify the `monitor_power_usage` function to include the following steps (a code sketch follows this list):
- Fetch the current carbon intensity.
- Calculate the carbon emission.
- Adjust the power cap based on the current carbon intensity and target carbon cap.
- Update the power capping configuration with the adjusted power cap.
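A rough sketch of how these steps might look, assuming the public National Grid ESO Carbon Intensity API and an illustrative policy that keeps the estimated emission rate under a target. The `target_emission_g_per_h` parameter and the proportional cap formula are assumptions, not a fixed design decision.

```python
# Sketch of the carbon-aware steps for monitor_power_usage. The endpoint is the public
# National Grid ESO Carbon Intensity API; the cap formula is an illustrative policy
# that keeps power * intensity under a target emission rate.
import requests

CARBON_API = "https://api.carbonintensity.org.uk/intensity"

def fetch_carbon_intensity() -> float:
    """Return the current grid carbon intensity in gCO2/kWh."""
    data = requests.get(CARBON_API, timeout=10).json()["data"][0]["intensity"]
    return float(data["actual"] or data["forecast"])

def carbon_emission_rate(power_watts: float) -> float:
    """Current emission rate in gCO2/h, given power in watts."""
    return (power_watts / 1000.0) * fetch_carbon_intensity()

def adjust_power_cap(base_cap_watts: float, target_emission_g_per_h: float) -> float:
    intensity = fetch_carbon_intensity()            # gCO2/kWh
    # Power (kW) * intensity (gCO2/kWh) = emission rate (gCO2/h).
    max_power_kw = target_emission_g_per_h / intensity
    # Never raise the cap above the configured base power cap.
    return min(base_cap_watts, max_power_kw * 1000.0)
```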
Here's a diagram illustrating the integration of real-time carbon intensity with the power capping operator:
graph LR
A[Power Capping Operator] --> B(Fetch Carbon Intensity)
B --> C(Calculate Carbon Emission)
C --> D(Adjust Power Cap)
D --> E(Update Power Capping Configuration)
E --> A
In this diagram:
- The power capping operator fetches the current carbon intensity from the external data source.
- It calculates the carbon emission based on the current power usage and carbon intensity.
- The power cap is adjusted based on the carbon intensity and target carbon cap.
- The power capping configuration is updated with the adjusted power cap.
- The process continues in a loop, with the power capping operator continuously monitoring and adjusting the power cap based on the real-time carbon intensity.
By integrating real-time carbon intensity into the power capping operator, we can dynamically adjust the power cap to achieve the target carbon capping. This allows for more environmentally-friendly operation of the system while still maintaining the desired performance characteristics.
In this integration, we leverage the Kubernetes Vertical Pod Autoscaler (VPA) to dynamically adjust the resource requirements of pods based on the workload demands and resource availability. VPA complements the horizontal scaling capabilities of KEDA by optimizing the resource allocation for each pod.
To enable VPA for the LLM inference workloads, we need to create a VPA resource that specifies the target deployments and the desired resource recommendations. Here's an example VPA configuration:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: llm-inference-vpa
spec:
targetRef:
apiVersion: "apps/v1"
kind: Deployment
name: llm-inference-deployment
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: "*"
minAllowed:
cpu: 100m
memory: 256Mi
nvidia.com/gpu: 1
maxAllowed:
cpu: 4
memory: 16Gi
nvidia.com/gpu: 8
controlledResources: ["cpu", "memory", "nvidia.com/gpu"]
In this configuration, we specify the target deployment (`llm-inference-deployment`) and define the resource policy for the containers. The `minAllowed` and `maxAllowed` fields set the minimum and maximum resource limits for CPU, memory, and GPU. VPA will recommend resource adjustments within these boundaries based on the observed workload requirements.
To integrate VPA with KEDA, we need to ensure that the resource recommendations made by VPA are considered during the scaling process. KEDA can be configured to use the VPA-recommended resource values when scaling the pods.
Here's an example KEDA ScaledObject that incorporates VPA recommendations:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: llm-inference-scaledobject
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: llm-inference-deployment
pollingInterval: 15
cooldownPeriod: 30
minReplicaCount: 1
maxReplicaCount: 10
advanced:
restoreToOriginalReplicaCount: true
verticalPodAutoscalerConfig:
behavior:
scaleDown:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 100
periodSeconds: 15
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus-server
metricName: average_token_per_second
threshold: 100
NOTE: KEDA does not yet support VPA; see KEDA issue #5435.
VPA can be particularly useful when dealing with different LLM model sizes. Larger models may require more GPU/CPU resources, while smaller models can operate with fewer resources. VPA can automatically adjust the resource claims based on the model size and the observed resource utilization.
For example, if a larger LLM model is deployed, VPA can increase the GPU/CPU resource claim to ensure optimal performance. Conversely, if a smaller model is used, VPA can reduce the GPU/CPU resource claim to avoid over-allocation and improve resource and power consumption efficiency.
In situations where the Kubernetes cluster experiences GPU resource fragmentation, VPA can help optimize the resource allocation. VPA can recommend adjusting the GPU/CPU resource claims of pods to fit the available GPU resources more efficiently.
For instance, if a pod requires 2 GPUs but the cluster has fragmented GPU resources with 1 GPU available on multiple nodes, VPA can recommend reducing the GPU resource claim to 1 GPU per pod. This allows the pods to be scheduled on nodes with available GPU resources, thereby improving overall utilization and reducing fragmentation.
Here's a diagram illustrating the integration of VPA with KEDA and the power capping operator:
graph TD
A[Power Capping Operator] --> B(KEDA)
B --> C[VPA]
C --> D[LLM Inference Deployment]
D --> E[Prometheus]
E --> A
In this diagram:
- The power capping operator interacts with KEDA to manage the scaling of the LLM inference deployment.
- KEDA integrates with VPA to obtain resource recommendations based on the workload requirements and resource availability.
- VPA analyzes the resource utilization of the LLM inference deployment and provides recommendations for resource adjustments.
- The LLM inference deployment is scaled and its resources are adjusted based on the recommendations from VPA and the scaling policies defined in KEDA.
- Prometheus monitors the LLM inference deployment and provides metrics to the power capping operator for decision-making.
By integrating VPA with KEDA and the power capping operator, we can achieve more efficient resource utilization and improved performance for LLM inference workloads. VPA ensures that the pods are allocated the appropriate amount of resources based on the workload demands and resource availability, while KEDA handles the horizontal scaling of the pods. The power capping operator can then make informed decisions based on the resource utilization and power consumption metrics to maintain the desired power limits.
This section illustrates how the power capping operator works in a real-world scenario. The operator continuously monitors the power consumption metrics provided by Kepler and makes adjustments to the KEDA ScaledObjects based on the current power usage and the defined power cap limit.
The power capping operator periodically retrieves the power consumption metrics from Kepler. It calculates the total power being used by the LLM inference deployments at any given time. This power usage is then compared against the power cap limit specified in the PowerCappingConfig CRD.
Based on the current power usage and the power cap limit, the power capping operator adjusts the `maxReplicaCount` of the KEDA ScaledObjects associated with the LLM inference deployments. The following scenarios describe how the operator handles different power usage levels:
- Power usage below the power cap limit:
  - If the current power usage is below the power cap limit, the operator makes no changes to the KEDA ScaledObjects.
  - The LLM inference deployments can scale up or down based on their configured scaling rules.
- Power usage at 80% of the power cap limit:
  - If the current power usage reaches 80% of the power cap limit, the operator sets the `maxReplicaCount` of the KEDA ScaledObjects to one above the current number of replicas.
  - This allows a small buffer for scaling up while preventing excessive power consumption.
- Power usage at 95% of the power cap limit:
  - If the current power usage reaches 95% of the power cap limit, the operator sets the `maxReplicaCount` of the KEDA ScaledObjects to the current number of replicas.
  - This prevents any further scaling up of the LLM inference deployments, ensuring the power usage stays within the power cap limit.
graph TD
A[Power Capping Operator] -->|Monitors| B(Kepler Metrics)
B -->|Power Usage| C{Check Power Usage}
C -->|Below Power Cap| D[No Changes to ScaledObjects]
C -->|80% of Power Cap| E[Set maxReplicaCount to Current Replicas + 1]
C -->|95% of Power Cap| F[Set maxReplicaCount to Current Replicas]
The flowchart above illustrates the decision-making process of the power capping operator based on the current power usage:
- The power capping operator monitors the Kepler metrics to obtain the current power usage.
- The operator checks the power usage against the power cap limit.
- If the power usage is below the power cap limit, no changes are made to the KEDA ScaledObjects.
- If the power usage reaches 80% of the power cap limit, the operator sets the `maxReplicaCount` of the KEDA ScaledObjects to one above the current number of replicas.
- If the power usage reaches 95% of the power cap limit, the operator sets the `maxReplicaCount` of the KEDA ScaledObjects to the current number of replicas (a compact sketch of this decision logic follows).
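This decision logic can be condensed into a small function. The 80% and 95% thresholds follow the scenarios above; the fallback to the configured maximum below 80% is an assumption about how the operator restores normal scaling.

```python
# Sketch of the scaling policy described above: leave the ScaledObject alone below
# 80% of the cap, allow one extra replica between 80% and 95%, and freeze at 95%.
def desired_max_replicas(power_watts: float, power_cap_watts: float,
                         current_replicas: int, configured_max: int) -> int:
    usage = power_watts / power_cap_watts
    if usage < 0.8:
        return configured_max            # no change: normal KEDA scaling rules apply
    if usage < 0.95:
        return current_replicas + 1      # small buffer for scaling up
    return current_replicas              # at or above 95%: freeze further scale-up
```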
By continuously monitoring the power usage and adjusting the KEDA ScaledObjects accordingly, the power capping operator ensures that the LLM inference deployments operate within the defined power cap limit. This prevents excessive power consumption and helps maintain the overall stability and efficiency of the data center.
This section discusses potential enhancements to the power capping operator and the LLM inference system to further optimize power efficiency and performance.
One enhancement to the LLM inference system is to introduce power efficiency aware routing using a Layer 7 router, such as Envoy or vLLM Router. The idea is to route LLM prompts to the LLM inference services that have the highest token/watts ratio, indicating better power efficiency.
The token/watts metric represents the number of tokens processed per watt of power consumed by an LLM inference service. This metric provides a measure of power efficiency, with higher values indicating more efficient processing.
The token/watts metric per Deployment can be calculated as follows:
token/watts = average_token_throughput_per_second / sum(irate(kepler_container_joules_total[1m]))
This metric is exposed by the LLM inference services and collected by Prometheus, making it available for the power capping operator and the Layer 7 router.
A Layer 7 router is introduced to handle the routing of LLM prompts to the most power-efficient LLM inference services. The router considers the token/watts metric when making routing decisions.
The Layer 7 router performs the following steps:
- Receives an LLM prompt from a client.
- Retrieves the token/watts metrics for all available LLM inference services from Prometheus.
- Selects the LLM inference service with the highest token/watts ratio.
- Routes the LLM prompt to the selected LLM inference service for processing.
By routing prompts to the most power-efficient services, the Layer 7 router optimizes the overall power efficiency of the LLM inference system.
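A sketch of the selection step, assuming the token/watts ratio is published to Prometheus under a recording rule named `llm_token_per_watt` with a `service` label; both names are assumptions for illustration.

```python
# Sketch: pick the inference service with the best token/watts ratio from Prometheus.
# The metric name "llm_token_per_watt" and the "service" label are assumptions; in
# practice they would come from a recording rule built on the formula above.
from typing import Optional
import requests

PROMETHEUS = "http://prometheus-server"

def most_efficient_service() -> Optional[str]:
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query",
        params={"query": "topk(1, llm_token_per_watt)"},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        return None
    return result[0]["metric"].get("service")  # route the prompt to this backend
```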
The power capping operator can be enhanced to consider the token/watts metric when adjusting the `maxReplicaCount` of the KEDA ScaledObjects associated with the LLM inference deployments.
The enhanced power capping operator performs the following steps:
- Monitors the power consumption metrics from Kepler and the token/watts metrics from Prometheus.
- Identifies the LLM inference deployments with higher token/watts ratios.
- Prioritizes the deployments with higher token/watts ratios by allowing a higher number of maximum replicas compared to less efficient deployments.
- Adjusts the `maxReplicaCount` of the KEDA ScaledObjects based on the power usage and the priority assigned to each deployment.
By selectively allowing a higher number of replicas for more power-efficient deployments, the power capping operator ensures that the overall power efficiency of the LLM inference system is optimized while still adhering to the power cap limit.
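One way to express this prioritization is to split a total replica budget in proportion to each deployment's token/watts ratio. This is an illustrative policy sketch, not the operator's fixed algorithm.

```python
# Illustrative policy: distribute a total replica budget across deployments in
# proportion to their token/watts ratios, so more efficient deployments may scale
# higher under the same power cap.
def allocate_max_replicas(token_per_watt: dict[str, float],
                          total_replica_budget: int,
                          floor: int = 1) -> dict[str, int]:
    total_efficiency = sum(token_per_watt.values()) or 1.0
    allocation = {}
    for name, efficiency in token_per_watt.items():
        share = efficiency / total_efficiency
        allocation[name] = max(floor, int(share * total_replica_budget))
    return allocation

# Example: a budget of 10 replicas split as {"deploy-a": 2, "deploy-b": 8} when
# deploy-b is roughly four times more token/watt efficient than deploy-a.
```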
graph TD
A[Layer 7 Router] -->|Receives LLM Prompt| B(Retrieve Token/Watts Metrics)
B -->|Selects Highest Token/Watts| C[Route to Selected LLM Inference Service]
D[Power Capping Operator] -->|Monitors| E(Power Consumption Metrics)
D -->|Monitors| F(Token/Watts Metrics)
D -->|Prioritizes Higher Token/Watts| G[Adjust maxReplicaCount]
The flowchart above illustrates the power efficiency aware LLM inference routing enhancement:
- The Layer 7 router receives an LLM prompt.
- It retrieves the token/watts metrics for available LLM inference services.
- The router selects the LLM inference service with the highest token/watts ratio.
- The LLM prompt is routed to the selected LLM inference service for processing.
- The power capping operator monitors the power consumption metrics and token/watts metrics.
- It prioritizes the deployments with higher token/watts ratios.
- The operator adjusts the `maxReplicaCount` of the KEDA ScaledObjects based on the power usage and the assigned priorities.
By incorporating power efficiency aware routing and enhancing the power capping operator, the LLM inference system can optimize its power efficiency while maintaining the desired performance levels and adhering to the power cap limit.
Another enhancement to improve the power efficiency and performance of the LLM inference system is to introduce a GPU frequency tuning mechanism. This enhancement involves creating an external Kubernetes job that adjusts the GPU frequency to optimize the token/watts ratio and the maximum number of replicas while ensuring the power cap is not violated.
The GPU frequency tuning job is a Kubernetes job that runs periodically or can be triggered based on certain events. The job performs the following tasks:
- Retrieves the current power consumption metrics from Kepler and the token/watts metrics from Prometheus for each LLM inference deployment.
- Analyzes the metrics to determine if adjusting the GPU frequency can improve the token/watts ratio and the maximum number of replicas.
- Calculates the optimal GPU frequency for each LLM inference deployment based on the metrics and the power cap limit.
- Applies the new GPU frequency settings to the LLM inference deployments using the appropriate GPU management tools or APIs.
By tuning the GPU frequency, the job aims to find the sweet spot where the token/watts ratio is maximized while allowing for a higher number of replicas, ultimately improving the overall token throughput of the LLM inference system.
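A rough sketch of such a tuning pass, assuming the job can lock GPU clocks with `nvidia-smi` and read the token/watts ratio from Prometheus between trials. The candidate frequencies, settle time, and exact `nvidia-smi` invocation are assumptions to validate against the target GPUs and driver version.

```python
# Sketch of a GPU frequency tuning pass: try a few candidate clock ceilings, let the
# workload settle, and keep the frequency with the best observed token/watts ratio.
import subprocess
import time
import requests

PROMETHEUS = "http://prometheus-server"
CANDIDATE_MHZ = [1980, 1710, 1410, 1110]   # illustrative clock ceilings
SETTLE_SECONDS = 120

def token_per_watt() -> float:
    query = ("sum(average_token_throughput_per_second) / "
             "sum(irate(kepler_container_joules_total[1m]))")
    result = requests.get(f"{PROMETHEUS}/api/v1/query",
                          params={"query": query}, timeout=10).json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

def tune_gpu_frequency() -> int:
    best_mhz, best_ratio = CANDIDATE_MHZ[0], 0.0
    for mhz in CANDIDATE_MHZ:
        # Lock clocks to at most `mhz`; supported clock ranges vary by GPU and driver.
        subprocess.run(["nvidia-smi", "-lgc", f"0,{mhz}"], check=True)
        time.sleep(SETTLE_SECONDS)               # let throughput and power stabilize
        ratio = token_per_watt()
        if ratio > best_ratio:
            best_mhz, best_ratio = mhz, ratio
    subprocess.run(["nvidia-smi", "-lgc", f"0,{best_mhz}"], check=True)
    return best_mhz
```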
The power capping operator can be extended to interact with the GPU frequency tuning job. The operator can trigger the job when certain conditions are met, such as when the power usage approaches the power cap limit or when there is a significant change in the token/watts metrics.
The power capping operator performs the following steps:
- Monitors the power consumption metrics from Kepler and the token/watts metrics from Prometheus.
- Analyzes the metrics to determine if GPU frequency tuning is required.
- Triggers the GPU frequency tuning job with the necessary parameters and configurations.
- Waits for the job to complete and receives the updated GPU frequency settings.
- Updates the `maxReplicaCount` of the KEDA ScaledObjects based on the new GPU frequency settings and the power cap limit.
By integrating the GPU frequency tuning job with the power capping operator, the LLM inference system can dynamically adjust the GPU frequency to optimize power efficiency and performance while adhering to the power cap limit.
graph TD
A[Power Capping Operator] -->|Monitors| B(Power Consumption Metrics)
A -->|Monitors| C(Token/Watts Metrics)
A -->|Triggers| D[GPU Frequency Tuning Job]
D -->|Retrieves Metrics| B
D -->|Retrieves Metrics| C
D -->|Calculates Optimal Frequency| E[Apply GPU Frequency Settings]
A -->|Updates maxReplicaCount| G[KEDA ScaledObjects]
The flowchart above illustrates the GPU frequency tuning enhancement:
- The power capping operator monitors the power consumption metrics and token/watts metrics.
- It analyzes the metrics to determine if GPU frequency tuning is required.
- The operator triggers the GPU frequency tuning job with the necessary parameters.
- The GPU frequency tuning job retrieves the metrics and calculates the optimal GPU frequency settings.
- The power capping operator updates the `maxReplicaCount` of the KEDA ScaledObjects based on the updated GPU frequency settings and the power cap limit.
By incorporating GPU frequency tuning, the LLM inference system can further optimize its power efficiency and performance, maximizing the token throughput while operating within the power cap limit. This enhancement complements the power efficiency aware routing and the power capping operator, providing a comprehensive solution for efficient and scalable LLM inference in a Kubernetes environment.
In this enhancement, we investigate how to apply the insights from the research on power utilization in Facebook datacenters to our power capping operator and KEDA. The main idea is to leverage the heterogeneity of power consumption patterns among different services to re-shape the power profile of each power node by re-distributing services. By grouping services with asynchronous peak times under the same power node, we can reduce the peak power of each node, creating more power headroom to allow more servers to be hosted, achieving higher throughput.
The power capping operator is modified to include a workload-aware service placement component. This component analyzes the power consumption patterns of different LLM inference services and systematically spreads the service instances with synchronous power patterns evenly under the power supply tree. The placement is optimized to reduce the peak power draw at power nodes.
The power capping operator is enhanced to dynamically reshape the power profile of each power node by utilizing the headroom unlocked by the workload-aware service placement. It continuously monitors the power consumption patterns and adjusts the service placement and resource allocation accordingly, aiming to maximize the utilization of the available power headroom while ensuring the power cap is not exceeded.
KEDA is extended to consider the power consumption patterns and the workload-aware service placement when scaling the LLM inference services. The scaling rules are modified to take into account the power headroom available at each power node, and the scaling behavior is adjusted to distribute the workload evenly across power nodes with asynchronous peak times. KEDA collaborates with the power capping operator to ensure the scaling actions align with the power usage smoothing strategy.
The monitoring capabilities of the power capping operator are enhanced to collect and analyze power consumption patterns of LLM inference services. It integrates with Kepler and Prometheus to gather real-time power usage data and performs data analysis to identify synchronous and asynchronous power consumption patterns among services. The insights gained from the analysis are used to inform the workload-aware service placement and dynamic power profile reshaping.
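A simplified sketch of the placement idea: estimate how strongly each pair of services' power profiles move together, then co-locate the least correlated pairs under the same power node so their peaks are unlikely to coincide. The pairwise correlation measure and the greedy pairing are illustrative simplifications; a real placement would respect the full power supply tree.

```python
# Simplified sketch of workload-aware placement: greedily pair services whose power
# time series are least correlated, so their peaks are unlikely to coincide on the
# same power node.
import numpy as np

def pair_least_correlated(power_series: dict[str, np.ndarray]) -> list[tuple[str, str]]:
    names = list(power_series)
    corr = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            corr[(a, b)] = float(np.corrcoef(power_series[a], power_series[b])[0, 1])
    pairs, used = [], set()
    # Take the most anti-correlated (lowest correlation) pairs first.
    for (a, b), _ in sorted(corr.items(), key=lambda kv: kv[1]):
        if a not in used and b not in used:
            pairs.append((a, b))
            used.update((a, b))
    return pairs
```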
graph TD
A[Power Capping Operator] -->|Collects and Analyzes| B(Power Consumption Patterns)
A -->|Implements| C[Workload-Aware Service Placement]
A -->|Implements| D[Dynamic Power Profile Reshaping]
C -->|Spreads Service Instances| E[Power Supply Tree]
D -->|Utilizes| F[Unlocked Power Headroom]
A -->|Collaborates| G[KEDA]
G -->|Considers| C
G -->|Adjusts Scaling| H[LLM Inference Services]
A -->|Monitors and Analyzes| I[Real-time Power Usage Data]
I -->|Informs| C
I -->|Informs| D
The flowchart above illustrates the power usage smoothing enhancement:
- The power capping operator collects and analyzes power consumption patterns of LLM inference services.
- It implements workload-aware service placement to spread service instances with synchronous power patterns evenly under the power supply tree.
- The operator also implements dynamic power profile reshaping to utilize the unlocked power headroom.
- KEDA collaborates with the power capping operator, considering the workload-aware service placement and adjusting the scaling behavior of LLM inference services.
- The power capping operator continuously monitors and analyzes real-time power usage data to inform the ongoing optimization and adjustment of the power usage smoothing strategy.
By incorporating power usage smoothing into our power capping operator and KEDA, we can significantly improve power utilization efficiency, increase throughput, and enhance the scalability of the LLM inference system within the constraints of the existing power infrastructure.