Introduction to Apache Iceberg | PySpark

Pavan Kumar
7 min read · Apr 25, 2024

The Story Behind a Data Lake

In a world saturated with data, it seems there’s no escape from its grasp. Want to buy something online? Your credit card info becomes data. Craving a cup of coffee? You’re prompted to input your details. Data is the currency for nearly every transaction. But amidst this sea of information, have you ever stopped to ponder where it all resides? How vast is this ocean of data, and what are the costs associated with storing such colossal amounts? Let’s embark on a journey to explore these questions in depth within this article.

What is a Data Lake?
A data lake is a centralized repository engineered to efficiently store, process, and safeguard extensive volumes of structured, semi-structured, and unstructured data. It accommodates data in its original form, irrespective of format, and manages diverse types of data without being constrained by size limits.

While numerous data lake solutions flood the market, this article aims to delve deep into Apache Iceberg.


What is the entire story all about? (TLDR)

  1. Understanding how Apache Iceberg works.
  2. Hands-On Journey: Exploring Apache Iceberg with Spark.

Story Resources

  1. GitHub Link: https://github.com/pavan-kumar-99/medium-manifests
  2. GitHub Branch: spark-iceberg

Prerequisites

  1. A laptop :).

What is Apache Iceberg?

Apache Iceberg is an open table format for huge analytic datasets. Yes, you read that right: Apache Iceberg is not a database. It is an open table format designed for huge, petabyte-scale tables. Iceberg is used in production where a single table can contain tens of petabytes of data, and even tables that large can be read without a distributed SQL engine.

Some of the outstanding capabilities of Iceberg are:

a) Multiple concurrent writers (a great plus for distributed compute engines like Spark)

b) Works with any cloud object store (great cost savings :) )

c) Table changes are atomic and readers never see partial or uncommitted changes... and many more.

Iceberg’s Architecture

Iceberg Architecture

This table format tracks individual data files in a table instead of directories. This allows writers to create data files in place and only add files to the table in an explicit commit.

a) Metadata Layer (Metadata file): The table state is maintained in metadata files. Every change to the table state creates a new metadata file and replaces the old metadata with an atomic swap. The table metadata file tracks the table schema, partitioning config, custom properties, and snapshots of the table contents.
s0, s1: A snapshot represents the state of a table at some time and is used to access the complete set of data files in the table.

a.1) Manifest List: The manifests that make up a snapshot are stored in a manifest list file. Each manifest list stores metadata about manifests, including partition stats and data file counts. These stats are used to avoid reading manifests that are not required for an operation.

a.2) Manifest Files: Data files in snapshots are tracked by one or more manifest files that contain a row for each data file in the table, the file’s partition data, and its metrics. The data in a snapshot is the union of all files in its manifests. Manifest files are reused across snapshots to avoid rewriting metadata that is slow-changing. Manifests can track data files with any subset of a table and are not associated with partitions.

b) Data file: This is the physical file where the actual data is stored. In Iceberg, data files are typically Parquet, ORC, or Avro files.
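Once a table exists, you can see this layering for yourself: Iceberg exposes snapshots, manifests, and data files as queryable metadata tables. Here is a minimal PySpark sketch, assuming a session started with the same Iceberg "demo" catalog configuration we use later in this article, against the demo.h1b.info table we create below:

from pyspark.sql import SparkSession

# Assumes a Spark session that already has the Iceberg "demo" catalog configured
# (see the spark-sql / spark-submit configuration later in this article).
spark = SparkSession.builder.appName("iceberg-metadata-inspection").getOrCreate()

# Snapshots: one row per committed table state (the s0, s1, ... in the diagram)
spark.sql("SELECT snapshot_id, committed_at, operation FROM demo.h1b.info.snapshots").show()

# Manifest files referenced by the current snapshot, with data file counts
spark.sql("SELECT path, added_data_files_count FROM demo.h1b.info.manifests").show()

# Individual data files and their metrics
spark.sql("SELECT file_path, record_count, file_size_in_bytes FROM demo.h1b.info.files").show(truncate=False)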

Alright, we now have a solid understanding of what Iceberg is. Let us get into action by deploying it. There are multiple ways to create a data lake locally. For this article, we will use MinIO, a high-performance, S3-compatible object store, as the storage backend for our data lake.

MinIO can be installed in several ways. Since I am going to run my data lake locally on my Mac, I will simply install MinIO with Homebrew:

brew install minio/stable/minio

Once MinIO is installed, the server can be started by creating a data directory and running:

mkdir minio 

minio server ./minio

Once the server starts, it prints the URLs for both API and GUI access, along with the username and password. In my case, the API and GUI URLs looked something like this:

API: http://10.0.0.185:9000

GUI: http://10.0.0.185:59819

From the GUI, I created a bucket called openlake; we will use this bucket as the storage backend for our data lake.

MinIO Bucket

Let us also create an access key and a secret key to authenticate with MinIO.

MinIO Access Key
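If you prefer scripting the bucket creation instead of clicking through the GUI, here is a minimal sketch using the MinIO Python SDK (pip install minio). The endpoint and credentials below are placeholders; use the values printed by your own MinIO server and the keys you just created:

from minio import Minio

# Placeholder endpoint and credentials: replace with your own MinIO values
client = Minio(
    "10.0.0.185:9000",
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
    secure=False,  # the local MinIO server runs over plain HTTP by default
)

# Create the bucket that will back our Iceberg warehouse, if it does not exist yet
if not client.bucket_exists("openlake"):
    client.make_bucket("openlake")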

With all the prerequisites in place, it’s time to take action. Let’s launch a Spark application designed to ingest and store three million records into Apache Iceberg. This step marks the commencement of our journey into harnessing Iceberg’s and Spark’s power for robust data management.

Environment Configuration:

OS: macOS

Spark and Scala Version:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.2.3
      /_/

Using Scala version 2.12.15, OpenJDK 64-Bit Server VM, 1.8.0_402
Branch HEAD
Compiled by user sunchao on 2022-11-14T17:20:20Z
Revision b53c341e0fefbb33d115ab630369a18765b7763d
Url https://github.com/apache/spark
Type --help for more information.

Spark Example:

In the example below, we use Spark to load the dataset from a CSV file and write it into an Apache Iceberg table stored on MinIO.

Before proceeding, it's essential to customize certain parameters within the code to align with your specific environment. Please make the following replacements:

  • Replace Line 11 with your DataLake name.
  • Update Line 12 with your MinIO API endpoint.
  • Modify Lines 16 and 17 with the Secret and Access Key created for your MinIO instance.
  • Adjust Line 18 with the MinIO API Endpoint.
  • Update Line 28 with the location of your dataset.

For the scope of this article, I will be using the H-1B Visa Petitions 2011–2016 dataset from Kaggle.
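The actual spark-medium-h1b.py lives in the repository linked above; for orientation, here is a stripped-down sketch of the same idea. The bucket name, endpoint, credentials, and dataset path are exactly the placeholders the list above asks you to replace:

from pyspark.sql import SparkSession

# Placeholders: replace with your bucket, MinIO endpoint, keys, and dataset location
WAREHOUSE = "s3a://openlake/warehouse/"
ENDPOINT = "http://10.0.0.185:9000"
ACCESS_KEY = "YOUR_ACCESS_KEY"
SECRET_KEY = "YOUR_SECRET_KEY"
DATASET = "/path/to/h1b_dataset.csv"

spark = (
    SparkSession.builder.appName("h1b-to-iceberg")
    # Register an Iceberg catalog named "demo" backed by the MinIO warehouse
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", WAREHOUSE)
    # Point the S3A filesystem at MinIO instead of AWS S3
    .config("spark.hadoop.fs.s3a.endpoint", ENDPOINT)
    .config("spark.hadoop.fs.s3a.access.key", ACCESS_KEY)
    .config("spark.hadoop.fs.s3a.secret.key", SECRET_KEY)
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the Kaggle CSV and write it into an Iceberg table named demo.h1b.info
df = spark.read.option("header", "true").csv(DATASET)
df.writeTo("demo.h1b.info").using("iceberg").createOrReplace()

If you run a script like this yourself, submit it with the same --packages list shown in the spark-submit command below so the Iceberg and S3A jars end up on the classpath.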

Before we submit the Spark application, let us clone my repository to get the updated code:

$ git clone https://github.com/pavan-kumar-99/medium-manifests.git -b spark-iceberg

$ cd medium-manifests/

Let us now run the Spark Application.

spark-submit \
--packages org.mongodb.spark:mongo-spark-connector_2.12:10.2.2,org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.4.3,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901 \
--num-executors 5 \
spark-medium-h1b.py
Iceberg tables

That's it. Your data is committed to the Iceberg table in 501 ms (3 million records in total).

Let us query this data. We could use Trino (a fast distributed SQL query engine for big data analytics) to view the data; however, that is not covered in this article. Instead, we will use spark-sql to query the data from the Iceberg catalog.

spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:1.4.3,org.apache.hadoop:hadoop-aws:3.3.1,com.amazonaws:aws-java-sdk-bundle:1.11.901 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.demo.warehouse=s3a://openlake/warehouse/ \
--conf spark.sql.catalog.demo.s3.endpoint=http://10.0.0.185:9000/ \
--conf spark.sql.defaultCatalog=demo \
--conf spark.sql.catalogImplementation=in-memory \
--conf spark.sql.catalog.demo.type=hadoop \
--conf spark.executor.heartbeatInterval=300000 \
--conf spark.network.timeout=400000 \
--conf spark.hadoop.fs.s3a.access.key=AAxigFYtrxtvmegk8OVx \
--conf spark.hadoop.fs.s3a.secret.key=N0W4msgzlFNlnpyTy4u4qYicNJOX5Fb9iQ7Jyzys \
--conf spark.hadoop.fs.s3a.endpoint=http://10.0.0.185:9000/

Be sure to replace values such as the secret key, access key, warehouse, and endpoint as per your configuration.

Data from Iceberg

I am going to select the first 10 rows of the h1b.info table.

select * from demo.h1b.info LIMIT 10;
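Because every commit to the table is kept as a snapshot, you can also read the table as it existed at an earlier point in time. Here is a minimal PySpark sketch of Iceberg's time travel; the snapshot ID and timestamp below are placeholders, so pick real values from the snapshots metadata table shown earlier:

from pyspark.sql import SparkSession

# Assumes the same Iceberg "demo" catalog configuration used above
spark = SparkSession.builder.appName("iceberg-time-travel").getOrCreate()

# List the snapshots this table has accumulated so far
spark.sql("SELECT snapshot_id, committed_at FROM demo.h1b.info.snapshots").show()

# Read the table as of a specific snapshot ID (placeholder value)
spark.read.option("snapshot-id", "1234567890123456789").table("demo.h1b.info").show(10)

# Or as of a point in time, given in milliseconds since the epoch (placeholder value)
spark.read.option("as-of-timestamp", "1714000000000").table("demo.h1b.info").show(10)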

Honestly, this is the most rewarding aspect for many Data Engineers. It took me weeks of dedication to stitch all these intricate parts together. But now, seeing everything seamlessly integrated and functioning smoothly brings an immense sense of fulfillment and satisfaction :)

Cost Comparisons:

Ahh, how could I end the article without doing a cost comparison?

Let’s delve into the pricing disparity between two approaches: running a traditional SQL Server on AWS RDS versus constructing a Data Lake with Apache Iceberg on AWS S3 Bucket as the storage backend.

RDS (m5.4xlarge) instance and 100 GB of data storage ($11,296 per month):

RDS monthly cost

S3 Bucket ($2.86 per month):

Closing Thoughts

Apache Iceberg is an open table format designed for huge, petabyte-scale analytic datasets. It is used in production where a single table can contain tens of petabytes of data, and even tables that large can be read without a distributed SQL engine. Combining it with Spark gives you a strong foundation for big data workloads. If you're interested in delving further into its capabilities, feel free to reach out. It's important to note that the setup outlined in this article is not intended for production environments.

Until next time…..


Pavan Kumar

Senior Cloud DevOps Engineer || CKA | CKS | CSA | CRO | AWS | ISTIO | AZURE | GCP | DEVOPS
LinkedIn: https://www.linkedin.com/in/pavankumar1999/