Use Intel® SGX Virtual Machines in Azure Kubernetes* Services (AKS) Deployments
When deploying machine learning models into production, it is critical to ensure the application’s security, scalability, and availability. Using the Kubernetes* service on the Microsoft Azure* confidential computing platform is one of the most effective ways to ensure these requirements are met.
In this tutorial, build and deploy a highly available and scalable XGBoost pipeline on Azure. This reference solution is a part of the Intel® Cloud Optimization Modules for Azure, a set of cloud-native open source reference architectures that are designed to facilitate building and deploying Intel-optimized AI solutions on leading cloud providers, including Amazon Web Services (AWS)*, Azure, and Google Cloud Platform* service. Each module, or reference architecture, includes a complete instruction set and all source code published on GitHub*.
You can download the code for this module on GitHub.
This solution architecture uses Docker* for application containerization and stores the image in an Azure container registry (ACR). The application is then deployed on a cluster managed by Azure Kubernetes* Services (AKS).
The cluster runs on confidential computing virtual machines leveraging Intel® Software Guard Extensions (Intel® SGX). Use a mounted Azure file share for persistent data and model storage. An Azure load balancer is provisioned by the Kubernetes service that the client uses to interact with your application. These Azure resources are managed in an Azure resource group.
The deployed machine learning pipeline builds a loan default risk prediction system using Intel® Optimization for XGBoost*, Intel® oneAPI Data Analytics Library (oneDAL), and Intel® Extension for Scikit-Learn* to accelerate model training and inference. This also demonstrates how to use incremental training of the XGBoost model as new data becomes available, which aims to tackle challenges such as data shift and very large datasets. Each pod is assigned to an Intel SGX node when the pipeline is run. Figure 1 shows the cloud solution architecture.
Configure Azure* Services
Before beginning this tutorial, ensure that you have downloaded and installed the prerequisites for the module. Then, in a new terminal window, use the following command to log in to your Azure account interactively with the Azure command-line interface.
az login
Next, create a resource group that holds the Azure resources for the solution. Call the resource group intel-sgx-loan-default-app and set the location to eastus.
# Set the names of the Resource Group and Location
export RG=intel-sgx-loan-default-app
export LOC=eastus
# Create the Azure Resource Group
az group create -n $RG -l $LOC
Use an Azure file share for persistent volume storage of the application’s data and model objects. To create the file share, first create an Azure storage account named loanappstorage using the following command:
# Set the name of the Azure storage account
export STORAGE_NAME=loanappstorage
# Create an Azure storage account
az storage account create \
  --resource-group $RG \
  --name $STORAGE_NAME \
  --kind StorageV2 \
  --sku Standard_LRS \
  --enable-large-file-share \
  --allow-blob-public-access false
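Storage account names must be globally unique across Azure and may contain only 3 to 24 lowercase letters and digits, so the example name loanappstorage may already be taken. The following is a minimal sketch of a pre-check you could run before creating the account. The helper names valid_storage_name and storage_name_available are hypothetical and are not part of the module.

```shell
# Hypothetical helpers, not part of the module's repository.

# Local format check: 3-24 characters, lowercase letters and digits only.
valid_storage_name() {
  printf '%s' "$1" | LC_ALL=C grep -Eq '^[a-z0-9]{3,24}$'
}

# Ask Azure whether the name is still free (requires a prior `az login`).
storage_name_available() {
  [ "$(az storage account check-name --name "$1" --query nameAvailable -o json)" = "true" ]
}

# Usage:
#   valid_storage_name "$STORAGE_NAME" && storage_name_available "$STORAGE_NAME" \
#     || echo "Pick a different STORAGE_NAME"
```

If either check fails, set STORAGE_NAME to a different value before running the create command.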
Next, create a new file share in your storage account named loan-app-file-share with a quota of 1,024 GB.
# Create an Azure file share
az storage share-rm create \
  --resource-group $RG \
  --storage-account $STORAGE_NAME \
  --name loan-app-file-share \
  --quota 1024
Create an Azure container registry to build, store, and manage the container image for the application. The following command creates a new container registry named loandefaultapp:
# Set the name of the Azure container registry
export ACR=loandefaultapp
# Create an Azure container registry
az acr create --resource-group $RG \
  --name $ACR \
  --sku Standard
Log in to the registry using the following command:
az acr login -n $ACR
Next, build an application image from the Dockerfile provided in this repository and push it to the container registry. Name the image loan-default-app with the tag latest. Ensure the Dockerfile is in your working directory before running the following command:
az acr build --image loan-default-app:latest --registry $ACR -g $RG --file Dockerfile .
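Before deploying the cluster, it can help to confirm that the image actually landed in the registry. The sketch below wraps az acr repository show-tags in a hypothetical helper named acr_tag_exists, which is not part of the module.

```shell
# Hypothetical helper, not part of the module's repository.
# Succeeds if the given tag exists in the given repository of the registry.
acr_tag_exists() {  # args: registry repository tag
  az acr repository show-tags --name "$1" --repository "$2" -o tsv \
    | grep -qxF -- "$3"
}

# Usage:
#   acr_tag_exists "$ACR" loan-default-app latest && echo "image found"
```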
Now, you can deploy the AKS cluster with confidential computing nodes that use Intel SGX virtual machines (VM).
Intel SGX VMs allow you to run sensitive workloads and containers within a hardware-based Trusted Execution Environment (TEE). TEEs allow user-level code from containers to allocate private memory regions to execute the code with the CPU directly.
These private memory regions that execute directly with the CPU are called enclaves. Enclaves help protect data confidentiality, data integrity, and code integrity from other processes running on the same nodes and from the Azure operator. These machines are powered by third-generation Intel® Xeon® Scalable processors and use Intel® Turbo Boost Max Technology 3.0 to reach 3.5 GHz.
To set up the confidential computing node pool, first create an AKS cluster with the confidential computing add-on, confcom, enabled. This creates a system node pool that hosts the AKS system pods, such as CoreDNS and the metrics server.
Executing the following command creates a node pool with a Standard_D4_v5 virtual machine. The Kubernetes version used for this tutorial is 1.25.5. Provision a standard Azure load balancer for the cluster and attach the container registry you created in the previous step, which allows the cluster to pull images from the registry.
# Set the name of the AKS cluster
export AKS=aks-intel-sgx-loan-app
# Create the AKS cluster
az aks create --resource-group $RG \
  --name $AKS \
  --node-count 1 \
  --node-vm-size Standard_D4_v5 \
  --kubernetes-version 1.25.5 \
  --enable-managed-identity \
  --generate-ssh-keys -l $LOC \
  --load-balancer-sku standard \
  --enable-addons confcom \
  --attach-acr $ACR
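AKS retires Kubernetes versions over time, so the version pinned in this tutorial may no longer be offered in your region. As a hedged sketch, the hypothetical helper below (not part of the module) searches the table printed by az aks get-versions for a given version string. It assumes the table output lists each offered patch version, which may vary between Azure CLI releases.

```shell
# Hypothetical helper, not part of the module's repository.
# Succeeds if the version string appears in the list AKS offers for a location.
aks_version_offered() {  # args: version location
  az aks get-versions --location "$2" -o table | grep -qwF -- "$1"
}

# Usage:
#   aks_version_offered 1.25.5 "$LOC" \
#     || echo "Choose a version from: az aks get-versions --location $LOC -o table"
```

If the check fails, substitute a currently offered version for the --kubernetes-version value.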
Once the system node pool has been deployed, add the Intel SGX node pool to the cluster using an instance of the DCSv3 series. The name of the confidential node pool is intelsgx, which is referenced in the Kubernetes deployment manifest for scheduling application pods. Enable cluster autoscaling for this node pool with a minimum of one node and a maximum of five nodes.
# Add the Intel SGX node pool to the AKS cluster
az aks nodepool add --resource-group $RG \
  --name intelsgx \
  --cluster-name $AKS \
  --node-count 1 \
  --node-vm-size Standard_DC4s_v3 \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 5
Once the Intel SGX node pool has been added, obtain the cluster access credentials and merge them into your local .kube/config file using the following command:
az aks get-credentials -n $AKS -g $RG
To ensure that the Intel SGX VM nodes were created successfully, run the following:
kubectl get nodes
You should see two agent nodes running that begin with the name aks-intelsgx.
To ensure that the DaemonSet was created successfully, run:
kubectl get pods -A
In the kube-system namespace, you should see two pods running that begin with the name sgx-plugin. The confidential node pool has been created successfully if you see the previous pods and nodes running.
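If you prefer to script these two checks instead of reading the full listings, a minimal sketch follows. It relies on the agentpool label that AKS applies to each node and on the sgx-plugin name prefix mentioned above. The helper names are hypothetical and not part of the module.

```shell
# Hypothetical helpers, not part of the module's repository.

# Count the nodes that belong to the intelsgx node pool.
sgx_node_count() {
  kubectl get nodes -l agentpool=intelsgx -o name | wc -l | tr -d ' '
}

# Count the SGX device plugin pods in the kube-system namespace.
sgx_plugin_pod_count() {
  kubectl get pods -n kube-system -o name \
    | grep '^pod/sgx-plugin' | wc -l | tr -d ' '
}

# Usage:
#   echo "SGX nodes: $(sgx_node_count), plugin pods: $(sgx_plugin_pod_count)"
```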
Set Up the Kubernetes Resources
Now that the AKS cluster has been deployed, use the following command to set up a Kubernetes namespace for the cluster resources called intel-sgx-loan-app:
export NS=intel-sgx-loan-app
kubectl create namespace $NS
Next, create a Kubernetes secret with the Azure storage account name and account key so that the loan-app-file-share can be mounted to the application pods.
export STORAGE_KEY=$(az storage account keys list -g $RG -n $STORAGE_NAME --query "[0].value" -o tsv)
kubectl create secret generic azure-secret \
  --from-literal azurestorageaccountname=$STORAGE_NAME \
  --from-literal azurestorageaccountkey=$STORAGE_KEY \
  --type=Opaque
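Kubernetes stores secret values base64-encoded, so you can confirm that the secret holds the expected account name without printing the key. The sketch below uses a hypothetical helper named secret_account_name, which is not part of the module. Because the create command above does not pass a namespace, the helper also reads from the current namespace.

```shell
# Hypothetical helper, not part of the module's repository.
# Print the storage account name stored in azure-secret (never the key).
secret_account_name() {
  kubectl get secret azure-secret \
    -o jsonpath='{.data.azurestorageaccountname}' | base64 --decode
}

# Usage:
#   [ "$(secret_account_name)" = "$STORAGE_NAME" ] && echo "secret looks right"
```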
Now that you have created a secret for your Azure storage account, you can set up persistent volume storage for the loan default risk prediction system. A persistent volume (PV) enables the application data to persist beyond the pods’ lifecycle and allows each pod in your deployment to access and store data in the loan-app-file-share.
First, create the persistent volume using the pv-azure.yaml file in the Kubernetes directory of the repository. This creates a storage resource in the cluster backed by your Azure file share. Then, create the persistent volume claim (PVC) using the pvc-azure.yaml file, which requests 20 GB of storage from the azurefile-csi storage class with ReadWriteMany access. Once the claim matches the volume, they are bound together in a one-to-one mapping.
# Change working directory to the kubernetes directory
cd kubernetes
# Create the Kubernetes persistent volume and persistent volume claim
kubectl create -f pv-azure.yaml -n $NS
kubectl create -f pvc-azure.yaml -n $NS
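For orientation, the following sketch shows what a statically provisioned Azure Files volume and a matching claim typically look like. It is not a copy of the repository's pv-azure.yaml and pvc-azure.yaml, which may differ. The resource names loan-app-pv and loan-app-pvc, the volumeHandle value, and the helper name write_pv_pvc_sketch are assumptions made for illustration. The helper writes to a path you choose so that it does not overwrite the repository files.

```shell
# Sketch only: the repository's pv-azure.yaml and pvc-azure.yaml may differ.
# Resource names and volumeHandle below are illustrative assumptions.
write_pv_pvc_sketch() {  # arg: output file path
  cat > "$1" <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: loan-app-pv                # assumed name
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: azurefile-csi
  csi:
    driver: file.csi.azure.com
    volumeHandle: loan-app-file-share-pv   # assumed; must be unique in the cluster
    volumeAttributes:
      shareName: loan-app-file-share
    nodeStageSecretRef:
      name: azure-secret
      namespace: default           # namespace where azure-secret was created
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: loan-app-pvc               # assumed name
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: azurefile-csi
  volumeName: loan-app-pv
  resources:
    requests:
      storage: 20Gi
EOF
}

# Usage (validates the manifest without creating anything):
#   write_pv_pvc_sketch /tmp/pv-pvc-sketch.yaml
#   kubectl apply --dry-run=client -f /tmp/pv-pvc-sketch.yaml -n "$NS"
```

The claim binds to the volume because the storage class, access mode, and requested size match, and volumeName pins the claim to that specific volume.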
Next, create a Kubernetes external load balancer to distribute incoming API calls evenly across the available pods in the cluster and ensure that the application remains highly available and scalable. Running the following command configures an Azure load balancer for the cluster with a new public IP address. Clients send requests to this IP address to reach the application’s API endpoints for data processing, model training, and inference.
kubectl create -f loadbalancer.yaml -n $NS
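Azure can take a minute or two to assign the public IP address. As a minimal sketch, the hypothetical helper below (not part of the module) reads the address from the service status. It assumes the service is named loan-app-load-balancer, as in the sample output shown later in this tutorial, and prints nothing until the address has been assigned.

```shell
# Hypothetical helper, not part of the module's repository.
# Print the public IP of the load balancer service in the given namespace.
external_ip() {  # arg: namespace
  kubectl get service loan-app-load-balancer -n "$1" \
    -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
}

# Usage:
#   export EXTERNAL_IP=$(external_ip "$NS")
#   echo "Send API requests to $EXTERNAL_IP:8080"
```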
Now you are ready to deploy the first application pod. To create and manage our application pods, use a Kubernetes deployment. A Kubernetes deployment is a Kubernetes resource that allows you to declaratively manage a set of replica pods for a given application, ensuring that the desired number of pods are running and available at all times.
In the pod template of the deployment spec, specify that the pods can only be scheduled on a node in the Intel SGX node pool. If an SGX node is unavailable, the cluster automatically scales up to add an additional SGX node.
Also, specify the directory in your pod where the volume is mounted, which is the /loan_app/azure-fileshare directory passed as az_file_path in the API calls later in this tutorial. When the pod is scheduled, the cluster inspects the persistent volume claim to find the bound volume and mounts the Azure file share to the pod.
Note: Before executing the following command, update the container image field in the deployment.yaml file with the name of the Azure container registry, repository, and tag. For example, loandefaultapp.azurecr.io/loan-default-app:latest.
kubectl create -f deployment.yaml -n $NS
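To make the pod template settings described above concrete, the following sketch writes an illustrative deployment manifest. It is not a copy of the repository's deployment.yaml, which may differ. The labels, the claim name loan-app-pvc, the container port, the CPU request, and the helper name write_deployment_sketch are assumptions. The mount path matches the az_file_path value used in the API calls later in this tutorial, and the sketch does not model any SGX enclave resource settings.

```shell
# Sketch only: the repository's deployment.yaml may differ.
# Labels, claimName, containerPort, and the CPU request are assumptions.
write_deployment_sketch() {  # arg: output file path
  cat > "$1" <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sgx-loan-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sgx-loan-app            # assumed label
  template:
    metadata:
      labels:
        app: sgx-loan-app
    spec:
      nodeSelector:
        agentpool: intelsgx        # schedule only on the Intel SGX node pool
      containers:
        - name: sgx-loan-app
          image: loandefaultapp.azurecr.io/loan-default-app:latest
          ports:
            - containerPort: 8080  # assumed to match the service port
          resources:
            requests:
              cpu: 500m            # assumed; CPU-based autoscaling needs a request
          volumeMounts:
            - name: azure-file
              mountPath: /loan_app/azure-fileshare
      volumes:
        - name: azure-file
          persistentVolumeClaim:
            claimName: loan-app-pvc  # assumed PVC name
EOF
}

# Usage (validates the manifest without creating anything):
#   write_deployment_sketch /tmp/deployment-sketch.yaml
#   kubectl apply --dry-run=client -f /tmp/deployment-sketch.yaml -n "$NS"
```

The nodeSelector is what restricts scheduling to the intelsgx pool. When no node in that pool has room, the pod stays pending and the cluster autoscaler adds a node, up to the pool's maximum.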
To check that the deployment was created successfully, you can use the kubectl get command:
kubectl get all -n $NS
Your output should be similar to:
NAME READY STATUS RESTARTS AGE
pod/sgx-loan-app-5948f49746-zl6cf 1/1 Running 0 81s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/loan-app-load-balancer LoadBalancer 10.0.83.32 20.xxx.xxx.xxx 8080:30242/TCP 3m40s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/sgx-loan-app 1/1 1 1 81s
NAME DESIRED CURRENT READY AGE
replicaset.apps/sgx-loan-app-5948f49746 1 1 1 81s
Finally, create a Kubernetes horizontal pod autoscaler (HPA) for your application that maintains between one and five replicas of the pods. The HPA controller increases and decreases the number of replicas (by updating the deployment) to maintain an average CPU use of 50%. Once CPU use by the pods goes above 50%, a new pod is automatically scheduled.
kubectl create -f hpa.yaml -n $NS
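The HPA controller computes the replica count as the ceiling of currentReplicas × currentMetric / targetMetric, then clamps the result to the configured minimum and maximum. The sketch below reproduces that arithmetic for the 50% CPU target and the one-to-five replica range used here. The helper name hpa_desired_replicas is hypothetical, and the sketch ignores the controller's default tolerance band of about 10% around the target as well as its stabilization delays.

```shell
# Hypothetical helper, not part of the module's repository.
# desired = ceil(current_replicas * current_cpu / target_cpu), clamped to [min, max]
hpa_desired_replicas() {  # args: current_replicas current_cpu_pct target_cpu_pct min max
  _desired=$(( ($1 * $2 + $3 - 1) / $3 ))   # integer ceiling division
  if [ "$_desired" -lt "$4" ]; then _desired=$4; fi
  if [ "$_desired" -gt "$5" ]; then _desired=$5; fi
  echo "$_desired"
}

# Usage: one pod averaging 120% of its CPU request, target 50%, range 1-5
#   hpa_desired_replicas 1 120 50 1 5
```

CPU use is measured relative to each pod's CPU request, so one pod running at 120% of its request scales to three replicas rather than two.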
Deploy the Application
Now that the Azure resources and Kubernetes infrastructure have been set up, you can send requests to the application’s API endpoints. The three main endpoints in our application are data processing, model training, and inference. The credit risk data set used in the application is obtained from Kaggle* and synthetically augmented for testing and benchmarking purposes.
Data Processing
The data processing endpoint receives the original credit risk CSV file, creates a data preprocessing pipeline, and generates training and test sets of the size specified. The train and test data and our preprocessor are stored in the Azure file share.
The data processing endpoint accepts the following four parameters:
- az_file_path: The volume mount path to the Azure file share for object storage.
- data_directory: The directory in the file share where processed data should be stored.
- file: The original credit risk CSV file, located in the working directory.
- size: The desired size of the final dataset, default 4M rows.
You can call this endpoint using the following command with the external IP address created by the Azure load balancer.
curl <external-IP>:8080/data_processing \
  -H "Content-Type: multipart/form-data" \
  -F az_file_path=/loan_app/azure-fileshare \
  -F data_directory=data \
  -F file=@credit_risk_dataset.csv \
  -F size=4000000 | jq
In the loan-app-file-share, you should see a new directory named data with the processed training and test sets and the data preprocessor.
Model Training
Now you are ready to train the XGBoost model. The model training endpoint begins training the model or continues training the model, depending on the value of the continue_training parameter. The XGBoost classifier and the model validation results are then stored in the Azure file share.
This endpoint accepts the following six parameters:
- az_file_path: The volume mount path to the Azure file share.
- data_directory: The directory in the file share where processed data is stored.
- model_directory: The directory in the file share for model storage.
- model_name: The name under which to store the model object.
- continue_training: The XGBoost model parameter for training continuation.
- size: The size of the processed dataset.
You can make a call to the model training endpoint using the following command:
curl <external-IP>:8080/train \
  -H "Content-Type: multipart/form-data" \
  -F az_file_path=/loan_app/azure-fileshare \
  -F data_directory=data \
  -F model_directory=models \
  -F model_name=XGBoost \
  -F continue_training=False \
  -F size=4000000 | jq
In the loan-app-file-share, you should see a new directory named models with the XGBoost model saved as a .joblib object and the model performance results.
Continue Training the XGBoost Model
To demonstrate training continuation of the XGBoost model, first process a new batch of 1,000,000 rows of data using the following command:
curl <external-IP>:8080/data_processing \
  -H "Content-Type: multipart/form-data" \
  -F az_file_path=/loan_app/azure-fileshare \
  -F data_directory=data \
  -F file=@credit_risk_dataset.csv \
  -F size=1000000 | jq
Then, call the model training endpoint with the continue_training parameter set to True.
curl <external-IP>:8080/train \
  -H "Content-Type: multipart/form-data" \
  -F az_file_path=/loan_app/azure-fileshare \
  -F data_directory=data \
  -F model_directory=models \
  -F model_name=XGBoost \
  -F continue_training=True \
  -F size=1000000 | jq
Inference
In the final step of the pipeline, the trained XGBoost classifier is used to predict the likelihood of a loan default in the inference endpoint. This endpoint retrieves the trained XGBoost model from the Azure file share and converts it into a daal4py format to perform model inference.
The inference results are then stored in the loan-app-file-share, containing the predicted probabilities and a corresponding label of True when the probability is greater than 0.5 or False otherwise.
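The labeling rule is a strict threshold: a probability of exactly 0.5 is labeled False. The sketch below illustrates the rule with a hypothetical helper named default_label. The application applies the rule inside its own code, so this is only an illustration and not the module's implementation.

```shell
# Hypothetical helper, not part of the module's repository.
# Print True when the predicted probability is greater than 0.5, else False.
default_label() {  # arg: predicted probability
  awk -v p="$1" 'BEGIN { print ((p > 0.5) ? "True" : "False") }'
}

# Usage:
#   default_label 0.73
```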
The inference endpoint accepts a CSV file with sample data in the same format as the credit risk data, in addition to the following parameters:
- file: The sample CSV file for model inference.
- az_file_path: The volume mount path to the Azure file share.
- data_directory: The directory in the file share where the data preprocessor is stored.
- model_directory: The directory in the file share where the trained XGBoost model is stored.
- model_name: The name of the stored model object.
- sample_directory: The directory in the file share to save the prediction results.
You can make a call to the inference endpoint using the following command:
curl <external-IP>:8080/predict \
  -H "Content-Type: multipart/form-data" \
  -F file=@sample.csv \
  -F az_file_path=/loan_app/azure-fileshare \
  -F data_directory=data \
  -F model_directory=models \
  -F model_name=XGBoost \
  -F sample_directory=samples | jq
In the loan-app-file-share, you should see a new directory named samples with the daal4py prediction results saved in a CSV file.
Summary
This tutorial demonstrated building a highly available and scalable Kubernetes application on the Azure cloud using Intel SGX confidential computing nodes. The solution architecture implemented the Loan Default Risk Prediction AI Reference Kit, which built an XGBoost classifier capable of predicting the probability that a loan results in default and used Intel optimizations in XGBoost and oneDAL.
The tutorial further demonstrated how an XGBoost classifier can be updated with new data and trained incrementally, which aims to tackle challenges such as data shift and very large datasets.
Next Steps
- Access the full source code on GitHub.
- Register for support during office hours for help with your implementation.
- Learn more about all of the Intel Cloud Optimization Modules.
- Come chat with us on the Intel DevHub Discord* server to keep interacting with fellow developers.
View the full performance results of the Loan Default Risk Prediction AI Reference Kit.
References
This work was republished for Medium from Kelli Belcher’s article found here.
Notices and Disclaimers
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Intel technologies may require enabled hardware, software, or service activation.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
Build Secure, Scalable, and Accelerated Machine Learning Pipelines was originally published in Better Programming on Medium.