Cost-efficient data platform in Azure, harnessing Terraform, Containers, Polars, and Blob Storage
Let’s face it: data platforms in the cloud can be enormously expensive if not carefully and consciously designed. The cost tends to scale roughly linearly with how high up the tech stack you decide to go.
So, what does it mean to be high up in the tech stack? Think of the managed services offered by cloud providers to make it easier to get started with your workload and use case (e.g., AWS Glue/Lambda/SageMaker/Athena/Lake Formation and Azure Data Factory/Functions/Data Explorer/Synapse Analytics). They require little to no code to get started, and the Time To Value (TTV) is just around the corner. However, scaling these services can result in unexpectedly high invoices and may not offer enough flexibility for later customization and requirements.
What’s the alternative, then? Go further down the tech stack! There, we find the core components on which these managed cloud services are built (e.g., AWS EC2/Fargate/S3 and Azure Virtual Machines/Container Instances/Blob Storage).
This article will go through how to build a robust, cost-efficient, and flexible data platform in Azure that scales. It is built around Azure Container Instances for compute, Azure Blob Storage for data storage, and Terraform with Azure Resource Manager for Infrastructure-as-Code deployments.
The article is split into six sections to make it a bit easier to follow along.
- Architecture
- Deployment
- Data jobs containers
- Container triggers
- Cost analysis
- Local data exploration in Jupyter Notebook
Let’s jump into it!
The full code can be found here: https://github.com/martinkarlssonio/azure-dataplatform
1) Architecture
Let’s look at the components of this architecture.
Controlling the cloud resources (Azure Resource Manager)
Azure Resource Manager is a management layer for creating, updating, and deleting resources in your Azure account. It includes management features like access control, tags, cost management, etc. Deployment is done via Infrastructure as Code (IaC) templates.
Data jobs (Azure Container Instances)
By running your workloads in Azure Container Instances (ACI), you can focus on designing and building your applications instead of managing the infrastructure that runs them.
The three example containers use the Azure SDK packages to connect to Blob Storage and Polars to load the data into data frames. Polars is a high-performance data frame library written in Rust.
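As a rough illustration (not the exact code from the repository), a container can download a Parquet blob with azure-storage-blob and load it into Polars like this. The environment variable name, container name, and blob path are placeholders:

```python
import io
import os

import polars as pl
from azure.storage.blob import BlobServiceClient

# The storage connection string is assumed to be passed to the container as an
# environment variable (variable name is a placeholder, not the repo's exact name).
blob_service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)

# Download one Parquet blob from the raw container and load it into Polars.
container = blob_service.get_container_client("mocked-raw")
payload = container.download_blob(
    "Open_Year=2023/Open_Month=01/Open_Day=01/data.parquet"  # placeholder blob path
).readall()
df = pl.read_parquet(io.BytesIO(payload))
print(df.shape)
```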
Network (Azure Virtual Network)
Creates a virtual network in the cloud for all the containers. The purpose is to control how the containers can be accessed. In this example, they are fully isolated from external access, as the network is specified as private in the Terraform deployment.
Container images (Azure Container Registry)
Storage and registry for all container images. The container images are built locally and then pushed to this private container registry.
Triggers (Azure Logic Apps)
Logic Apps allow for custom workflows. Here, we use scheduled triggers to start each container group. It is not possible to trigger an individual container, only a container group; therefore, we place the containers in three groups (Raw/Enriched/Curated). One container group can host up to 60 containers.
Data exploration & Data science (Jupyter Notebook)
Instead of using cloud computing during development and data exploration, this solution connects to Azure Blob Storage directly from a local Jupyter Notebook. It puts the computing on the developer’s local machine.
2) Deployment
The deployment is built around Terraform, an Infrastructure-as-Code (IaC) tool. It ensures consistency when deploying across different environments, increases development speed, and enables robust CI/CD pipelines.
There are two deployment scripts, one for Windows (winRun.ps1) and one for Linux (linuxRun.sh), both of which follow these three steps:
- Deploy Core resources (Blob Storage Account & Container Registry)
- Build and Push container images to Container Registry (Raw, Enriched, Curated)
- Deploy Containers resources (Virtual Network, Container Instances, Logic Apps)
At the time of writing, Terraform does not fully support deploying Azure Logic Apps. Therefore, they are deployed at the end with the Azure CLI (final.sh).
Note: The final Azure CLI script is automatically executed inside the container during Containers resources deployment.
3) Data jobs containers
If we look inside the containers deployed to the cloud (mocked-raw/enriched/curated), we find Polars, a high-performance data frame library written in Rust. It keeps the execution time to a minimum and makes efficient use of CPU and memory.
You can read more about Polars on the official website: https://www.pola.rs/
Container: Mocked Raw
In a real scenario, the raw containers would fetch data from the sources and ensure it follows the expected data schema, then add it to the Raw Blob Storage.
In this example, the raw container creates a mocked data frame with IT ticket data, which will be used to demonstrate a real-world scenario.
The Polars data frame contains 100,000 rows and has columns such as Priority, Category, Status, and Timestamps (Open/Closed).
For partitioning purposes, it creates Open_Year/Month/Day columns. It is always a good practice to verify the data against some kind of data contract and ensure good partitioning already in the Raw layer. We do not want to end up with a ‘Data Swamp’.
The data frame is partitioned by Year/Month/Day, written to Apache Parquet files, and then uploaded to Azure Blob Storage.
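A condensed sketch of what the raw job might look like is shown below. The column names follow the article’s description, the dataset is shrunk for brevity, and the actual upload to Blob Storage (shown in the earlier snippet) is left out:

```python
from datetime import datetime, timedelta
import random

import polars as pl

# Mock a small IT-ticket dataset (the real job generates 100,000 rows;
# column names here follow the article's description and may differ in the repo).
n_rows = 1_000
open_ts = [
    datetime(2023, 1, 1) + timedelta(minutes=random.randint(0, 60 * 24 * 30))
    for _ in range(n_rows)
]
df = pl.DataFrame({
    "Priority": [random.choice(["Low", "Medium", "High"]) for _ in range(n_rows)],
    "Category": [random.choice(["Hardware", "Software", "Network"]) for _ in range(n_rows)],
    "Status": [random.choice(["Open", "Closed"]) for _ in range(n_rows)],
    "Open_Timestamp": open_ts,
    "Closed_Timestamp": [ts + timedelta(minutes=random.randint(5, 2_000)) for ts in open_ts],
})

# Derive the partition columns from the open timestamp.
df = df.with_columns([
    pl.col("Open_Timestamp").dt.year().alias("Open_Year"),
    pl.col("Open_Timestamp").dt.month().alias("Open_Month"),
    pl.col("Open_Timestamp").dt.day().alias("Open_Day"),
])

# Write one Parquet file per Year/Month/Day partition; each file would then be
# uploaded to the mocked-raw container with azure-storage-blob (upload omitted here).
for (year, month, day), partition in df.group_by(["Open_Year", "Open_Month", "Open_Day"]):
    blob_path = f"Open_Year={year}/Open_Month={month:02d}/Open_Day={day:02d}/data.parquet"
    partition.write_parquet("data.parquet")
    print(f"would upload data.parquet to mocked-raw/{blob_path}")
```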
If we take a look at the Raw Blob Storage, we can see neatly structured data in the following format:
mocked-raw/Open_Year=YYYY/Open_Month=MM/Open_Day=DD
With this structure, we can load only a specific date range when accessing the data instead of loading everything. This reduces compute, network traffic, execution time, and cost.
Container: Mocked Enriched
The container fetches the raw Parquet data from Blob Storage and loads it into a Polars data frame. It then enriches the dataset with a new column, Minutes_To_Close, containing the number of minutes it took to close each ticket.
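A minimal sketch of this step is shown below. It also illustrates the point from the Raw layer above: thanks to the partition prefix, only one month of blobs needs to be downloaded. The connection string, prefix, and column names are placeholders, and a reasonably recent Polars version is assumed for `dt.total_minutes()`:

```python
import io

import polars as pl
from azure.storage.blob import BlobServiceClient

# Placeholder connection string; in the repo the credentials are injected at deploy time.
blob_service = BlobServiceClient.from_connection_string("<connection-string>")
raw = blob_service.get_container_client("mocked-raw")

# Thanks to the Year/Month/Day partitioning, only the blobs under a given
# prefix (here: January 2023) need to be downloaded.
frames = []
for blob in raw.list_blobs(name_starts_with="Open_Year=2023/Open_Month=01/"):
    payload = raw.download_blob(blob.name).readall()
    frames.append(pl.read_parquet(io.BytesIO(payload)))
df = pl.concat(frames)

# Enrich the data with the number of minutes it took to close each ticket.
df = df.with_columns(
    (pl.col("Closed_Timestamp") - pl.col("Open_Timestamp"))
    .dt.total_minutes()
    .alias("Minutes_To_Close")
)
```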
Container: Mocked Curated
In the curated container, the data is read from Enriched Blob Storage. Then it removes all the rows with the category “Hardware” because we are not interested in that data in this fictitious scenario.
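The filtering itself is a one-liner in Polars. A small sketch, assuming the enriched data has already been loaded as in the previous snippets (file names are placeholders):

```python
import polars as pl

# Assume `df` is the enriched data frame read from the Enriched Blob Storage,
# loaded the same way as in the previous sketch.
df = pl.read_parquet("mocked_enriched_sample.parquet")  # placeholder local file

# Drop the "Hardware" category; the remaining rows form the curated dataset
# that is written back to the Curated Blob Storage.
curated = df.filter(pl.col("Category") != "Hardware")
curated.write_parquet("mocked_curated.parquet")
```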
4) Container triggers
Deploying the triggers was certainly not straightforward. At the time of writing, Terraform unfortunately does not fully support deploying Azure Logic Apps; hence, a CLI script at the end of the deployment takes care of this.
The key here is the files named containers/logicApp{stage}.json, where the triggers and workflows are defined.
Inside this file, we can see a trigger that recurs every day at 02:01 UTC. It triggers the action start_curated_containergroup, whose target container group is referenced under path.
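The file itself is not reproduced here, but a heavily simplified sketch of what such a workflow definition can look like is shown below. The connection reference, subscription, resource group, and container group names are placeholders, and the actual logicAppCurated.json in the repo will differ in its details:

```json
{
  "definition": {
    "$schema": "https://schema.management.azure.com/providers/Microsoft.Logic/schemas/2016-06-01/workflowdefinition.json#",
    "triggers": {
      "Recurrence": {
        "type": "Recurrence",
        "recurrence": {
          "frequency": "Day",
          "interval": 1,
          "schedule": { "hours": [2], "minutes": [1] },
          "timeZone": "UTC"
        }
      }
    },
    "actions": {
      "start_curated_containergroup": {
        "type": "ApiConnection",
        "runAfter": {},
        "inputs": {
          "host": {
            "connection": { "name": "@parameters('$connections')['aci']['connectionId']" }
          },
          "method": "post",
          "path": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ContainerInstance/containerGroups/<curated-container-group>/start"
        }
      }
    },
    "outputs": {}
  }
}
```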
If we go into the Azure portal, navigate to Logic Apps, choose logicAppCurated, and then enter the Logic app designer, we can see a graphical visualization of this workflow. Here, we can also see the workflow JSON by pressing {} View Code.
Container logs
In Azure, you can navigate into Container Instances and select your container group and a specific container. Apart from metrics, events, and properties, there is a tab to access the runtime logs. This can be very useful, and it is important to know how to access it!
5) Cost analysis
Looking at the Cost Analysis in Azure, we can see that this solution, with mocked data, is very low in price.
Technical details
- Container execution time: 3 × 30–50 s per day
- Container CPU: 3 × 2 vCPU
- Container memory: 3 × 2 GB
- Data size: 3 × 40 MB per day
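Plugging the numbers above into a quick back-of-envelope calculation shows why the compute part of the bill is negligible. The per-second prices below are assumptions based on approximate ACI pay-as-you-go list prices; check the current pricing page before relying on them:

```python
# Rough, assumed ACI pay-as-you-go prices (approximate; check the current pricing page).
PRICE_PER_VCPU_SECOND = 0.0000125   # USD
PRICE_PER_GB_SECOND = 0.0000014     # USD

containers = 3
vcpus = 2
memory_gb = 2
seconds_per_run = 50      # upper end of the 30-50 s range
runs_per_month = 30       # one scheduled run per day

vcpu_seconds = containers * vcpus * seconds_per_run * runs_per_month
gb_seconds = containers * memory_gb * seconds_per_run * runs_per_month

monthly_compute_cost = (
    vcpu_seconds * PRICE_PER_VCPU_SECOND + gb_seconds * PRICE_PER_GB_SECOND
)
print(f"~${monthly_compute_cost:.2f} per month for container compute")
# ~$0.13 per month for compute; storage and "others" make up most of the
# remaining couple of dollars.
```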
Cost details
The cost of running this solution with mocked data is very low, a couple of US dollars per month. In real scenarios, you can expect longer execution times, more data storage, and so on. The point remains the same, though: the solution is very cost-efficient!
- Blob Storage: 60% of total cost
- Containers: 35% of total cost
- Others: 5% of total cost
6) Local data exploration in Jupyter Notebook
Instead of deploying Jupyter Notebook instances in the cloud and paying a hefty amount for utilizing cloud computing, this solution suggests using local computing instead. It’s free of charge!
The repository contains an example Notebook that connects to Azure Blob Storage and performs some data exploration on the data.
The notebook first establishes the proper access rights and then downloads the data of interest. Once the Parquet files are downloaded, they are loaded into a Polars DataFrame.
With the data in place, we can start doing some computing. Let’s aggregate, visualize, and train a simple machine learning anomaly model!
The data is grouped by category and summarized using Polars, then visualized with Matplotlib.
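A sketch of the kind of aggregation and plot the notebook produces; the exact groupings, file name, and column names are assumptions based on the description above:

```python
import matplotlib.pyplot as plt
import polars as pl

# Assume `df` is the curated data frame loaded from Blob Storage as shown earlier
# (file name and column names are placeholders based on the article's description).
df = pl.read_parquet("mocked_curated_sample.parquet")

# Aggregate the total minutes-to-close per week and category.
weekly = (
    df.with_columns(pl.col("Open_Timestamp").dt.week().alias("Week"))
      .group_by(["Week", "Category"])
      .agg(pl.col("Minutes_To_Close").sum().alias("Weekly_Minutes"))
      .sort("Week")
)

# One line per category.
for (category,), per_cat in weekly.group_by(["Category"]):
    per_cat = per_cat.sort("Week")
    plt.plot(per_cat["Week"].to_list(), per_cat["Weekly_Minutes"].to_list(), label=category)
plt.xlabel("Week")
plt.ylabel("Minutes to close (sum)")
plt.legend()
plt.show()
```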
Using IsolationForest (scikit-learn), we can train a simple machine learning model, compute an anomaly score for the data, and plot it right under the weekly minutes aggregation. The example also adds trend lines using NumPy.
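A minimal sketch of that step, assuming a weekly aggregation of Minutes_To_Close like the one above (model parameters and column names are illustrative, not the notebook’s exact values):

```python
import numpy as np
import polars as pl
from sklearn.ensemble import IsolationForest

# Assume `df` holds the curated tickets; aggregate total minutes-to-close per week.
df = pl.read_parquet("mocked_curated_sample.parquet")  # placeholder local file
weekly = (
    df.with_columns(pl.col("Open_Timestamp").dt.week().alias("Week"))
      .group_by("Week")
      .agg(pl.col("Minutes_To_Close").sum().alias("Weekly_Minutes"))
      .sort("Week")
)

# Fit a simple anomaly model on the weekly totals and score each week.
X = weekly["Weekly_Minutes"].to_numpy().reshape(-1, 1)
model = IsolationForest(contamination=0.1, random_state=42)
scores = model.fit(X).decision_function(X)   # lower score = more anomalous

# Fit a linear trend line over the weeks with NumPy.
weeks = weekly["Week"].to_numpy()
slope, intercept = np.polyfit(weeks, weekly["Weekly_Minutes"].to_numpy(), deg=1)
trend = slope * weeks + intercept
```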
There we go — a cost-efficient data platform with data exploration on a local machine!
The intention is for this to serve as a boilerplate to continue to build upon — without the business case failing due to high cloud costs!
The full code can be found here: https://github.com/martinkarlssonio/azure-dataplatform