The Snowflake Data Cloud continues to stand out as a pioneer, consistently introducing innovative features to simplify and optimize data storage and compute workloads. One recent addition is support for the Apache Iceberg table format, which is currently in public preview for all Snowflake customers.
In this blog, we’ll explain the architecture of Apache Iceberg tables, the types of Iceberg tables supported in Snowflake, and how they perform compared to native and external Snowflake tables. We’ll also cover when to use Iceberg tables and some limitations of the current version.
What is Apache Iceberg?
Apache Iceberg is a high-performance open table format designed to manage huge datasets at scale. A table format defines how the data files that make up a table, stored in open file formats, are managed, organized, and tracked.
Architecture Overview
The architecture of Apache Iceberg consists of three different layers:
Catalog
Metadata layer
Data layer
Catalog
The catalog manages a collection of tables and tracks each table’s current metadata. It also supports atomic operations to update the current metadata pointer. Hive Metastore, AWS Glue, Nessie, a JDBC database, Snowflake, and others can all be used as catalogs for Iceberg tables.
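To make the atomic pointer update concrete, here is a minimal sketch in Python. It is a toy in-memory stand-in for a catalog, not how any real catalog is implemented; the class `InMemoryCatalog`, the exception `CommitConflict`, and the metadata paths are all hypothetical names chosen for illustration.

```python
import threading
from typing import Dict, Optional


class CommitConflict(Exception):
    """Raised when another writer has already moved the table pointer."""


class InMemoryCatalog:
    """Toy catalog: maps a table name to the path of its current metadata file."""

    def __init__(self) -> None:
        self._pointers: Dict[str, str] = {}
        self._lock = threading.Lock()

    def current_metadata(self, table: str) -> Optional[str]:
        return self._pointers.get(table)

    def commit(self, table: str, expected: Optional[str], new: str) -> None:
        """Swap the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table}: pointer moved, retry the commit")
            self._pointers[table] = new


catalog = InMemoryCatalog()
catalog.commit("db.orders", None, "metadata/v1.metadata.json")
catalog.commit("db.orders", "metadata/v1.metadata.json", "metadata/v2.metadata.json")

try:
    # A second writer that still believes v1 is current must be rejected.
    catalog.commit("db.orders", "metadata/v1.metadata.json", "metadata/v2-other.metadata.json")
except CommitConflict as err:
    print(f"commit rejected: {err}")
```

The compare-and-swap in `commit` is what lets several writers share a table safely: a writer whose view of the table is stale fails and has to retry against the newer metadata.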
Metadata File
The metadata file stores the metadata of a table at a given timestamp in JSON format. This includes details about table snapshots, manifest lists, schema, partition spec, etc. Every time there is a change in table data or metadata, a new metadata file is created.
Manifest List
The manifest list is a file that lists the manifest files linked to a snapshot. It is normally an Avro file that stores details about each manifest file, including statistics.
Manifest File
A manifest file is also an Avro file. It contains a list of data files, with column statistics and partition information for each data file.
Data Files
Data files are the actual files storing data in open file formats: Parquet, ORC, and Avro.
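The layers described above form a tree that a query engine walks from the top down. The sketch below models that tree with plain Python dataclasses and shows how the column statistics kept in manifest files let a reader skip data files. It is a simplified illustration, not the Iceberg specification: all class names, field names, and file paths are hypothetical, and real metadata carries many more fields.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class DataFile:
    path: str                            # e.g. a Parquet file in object storage
    bounds: Dict[str, Tuple[int, int]]   # per-column (min, max) statistics


@dataclass
class ManifestFile:
    path: str
    data_files: List[DataFile]


@dataclass
class ManifestList:
    path: str
    manifests: List[ManifestFile]


@dataclass
class Snapshot:
    snapshot_id: int
    manifest_list: ManifestList


@dataclass
class TableMetadata:
    current_snapshot_id: int
    snapshots: List[Snapshot]


def plan_scan(metadata: TableMetadata, column: str, value: int) -> List[str]:
    """Return the data files whose statistics allow `column == value`."""
    snapshot = next(
        s for s in metadata.snapshots
        if s.snapshot_id == metadata.current_snapshot_id
    )
    selected = []
    for manifest in snapshot.manifest_list.manifests:
        for data_file in manifest.data_files:
            low, high = data_file.bounds[column]
            if low <= value <= high:
                selected.append(data_file.path)
    return selected


# Hypothetical table with three data files spread over two manifests.
m1 = ManifestFile("m1.avro", [
    DataFile("data/a.parquet", {"order_id": (1, 100)}),
    DataFile("data/b.parquet", {"order_id": (101, 200)}),
])
m2 = ManifestFile("m2.avro", [
    DataFile("data/c.parquet", {"order_id": (201, 300)}),
])
metadata = TableMetadata(
    current_snapshot_id=1,
    snapshots=[Snapshot(1, ManifestList("snap-1.avro", [m1, m2]))],
)

files_to_read = plan_scan(metadata, "order_id", 150)
print(files_to_read)
```

Only the file whose `order_id` range covers 150 is selected, so the other two files are never opened.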
Iceberg Tables in Snowflake
Iceberg tables in Snowflake are a new type of table where the actual data is stored outside of Snowflake. The data is stored in a public cloud object storage location (Amazon S3, Google Cloud Storage, or Azure Storage) in the Apache Iceberg table format. Snowflake accesses the data through two new objects: external volumes and catalog integrations.
Snowflake uses its native query semantics with Iceberg specifications and libraries to read data from and write data to the cloud object storage. The picture below shows the difference between external tables, Snowflake native tables, and Iceberg tables.
External Volume
An external volume is an account-level object that stores the identity and access details of the external cloud storage where Iceberg table data is stored. A single external volume can be used to create multiple Iceberg tables.
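As a rough illustration, the Python helper below renders a `CREATE EXTERNAL VOLUME` statement for an S3 location. The helper name, the volume name, the bucket, and the role ARN are all placeholders we made up for this sketch, and the values are not escaped. Because the feature is in preview, check the exact parameters against the current Snowflake documentation before running the generated SQL.

```python
def external_volume_ddl(volume_name: str, location_name: str,
                        base_url: str, role_arn: str) -> str:
    """Render a CREATE EXTERNAL VOLUME statement for a single S3 location."""
    return (
        f"CREATE EXTERNAL VOLUME {volume_name}\n"
        "  STORAGE_LOCATIONS = (\n"
        "    (\n"
        f"      NAME = '{location_name}'\n"
        "      STORAGE_PROVIDER = 'S3'\n"
        f"      STORAGE_BASE_URL = '{base_url}'\n"
        f"      STORAGE_AWS_ROLE_ARN = '{role_arn}'\n"
        "    )\n"
        "  );"
    )


# All values below are hypothetical placeholders.
volume_sql = external_volume_ddl(
    volume_name="iceberg_vol",
    location_name="lake-us-east-1",
    base_url="s3://example-bucket/iceberg/",
    role_arn="arn:aws:iam::123456789012:role/example-snowflake-role",
)
print(volume_sql)
```

Since one external volume can back many Iceberg tables, a statement like this is typically run once per storage location rather than once per table.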
Catalog Integration
A catalog integration is an account-level object that defines the source of metadata for Iceberg tables in Snowflake when Snowflake is not used as the catalog.
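For example, when AWS Glue is the catalog, the integration could be created with a statement like the one this Python helper renders. This is a sketch under assumptions: the helper name and every argument value are hypothetical placeholders, values are not escaped, and the parameter list should be verified against the current Snowflake documentation.

```python
def glue_catalog_integration_ddl(name: str, namespace: str, role_arn: str,
                                 catalog_id: str, region: str) -> str:
    """Render a CREATE CATALOG INTEGRATION statement for AWS Glue."""
    return (
        f"CREATE CATALOG INTEGRATION {name}\n"
        "  CATALOG_SOURCE = GLUE\n"
        f"  CATALOG_NAMESPACE = '{namespace}'\n"
        "  TABLE_FORMAT = ICEBERG\n"
        f"  GLUE_AWS_ROLE_ARN = '{role_arn}'\n"
        f"  GLUE_CATALOG_ID = '{catalog_id}'\n"
        f"  GLUE_REGION = '{region}'\n"
        "  ENABLED = TRUE;"
    )


# All values below are hypothetical placeholders.
integration_sql = glue_catalog_integration_ddl(
    name="glue_catalog_int",
    namespace="analytics_db",
    role_arn="arn:aws:iam::123456789012:role/example-glue-role",
    catalog_id="123456789012",
    region="us-east-1",
)
print(integration_sql)
```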
The Iceberg tables in Snowflake support features such as:
ACID transactions
Schema evolution
Time travel (Snapshot-based reads)
Hidden partitioning
Multiple query engine support
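Snapshot-based time travel follows from the fact that every change produces a new snapshot while older ones stay readable. A minimal Python sketch of the lookup, assuming a simplified snapshot log (the `SnapshotEntry` class and the timestamps are hypothetical):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SnapshotEntry:
    snapshot_id: int
    timestamp_ms: int   # commit time of the snapshot


def snapshot_as_of(log: List[SnapshotEntry], as_of_ms: int) -> Optional[SnapshotEntry]:
    """Return the latest snapshot committed at or before `as_of_ms`, if any."""
    candidates = [s for s in log if s.timestamp_ms <= as_of_ms]
    return max(candidates, key=lambda s: s.timestamp_ms, default=None)


# Hypothetical log with three commits.
snapshot_log = [
    SnapshotEntry(snapshot_id=1, timestamp_ms=1_000),
    SnapshotEntry(snapshot_id=2, timestamp_ms=2_000),
    SnapshotEntry(snapshot_id=3, timestamp_ms=3_000),
]

chosen = snapshot_as_of(snapshot_log, as_of_ms=2_500)
print(chosen)
```

A read "as of" a point in time then scans the manifest list of the chosen snapshot instead of the current one. A timestamp earlier than the first commit yields no snapshot.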
Key Differences Between Snowflake Native Tables and Iceberg Tables
Table metadata is stored in Iceberg format in public cloud storage and optionally in Snowflake if Snowflake is used as a catalog.
Data is stored in Parquet format in public cloud storage, not in Snowflake.
Both Snowflake and external compute engines like Spark can read data from cloud storage and write back to the same location.
Iceberg Table Types in Snowflake
Depending on where the catalog is managed for Iceberg tables, Snowflake can have two types of Iceberg tables:
Snowflake Managed Iceberg Tables – Snowflake manages the metadata and catalog for these tables. These tables can support all Snowflake features with read and write access.
Externally Managed Iceberg Tables – An external system such as AWS Glue manages the metadata and catalog. These tables can support read-only access in Snowflake.
Feature | Snowflake Managed Iceberg tables | Externally Managed Iceberg tables |
---|---|---|
Read access from Snowflake | ✔️ | ✔️ |
Write access from Snowflake | ✔️ | ❌ |
Use of warehouse cache for queries | ✔️ | ✔️ |
Automatic metadata refresh in Snowflake | ✔️ | ❌ |
Snowflake platform features: time travel, row/column masking, etc. | All features | Limited features |
Nested datatype support | ✔️ | ✔️ |
Support of table clustering | ✔️ | ❌ |
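
In SQL, the two table types differ mainly in what the `CATALOG` parameter points to. The Python helpers below render one statement of each kind, plus the manual refresh that an externally managed table needs because its metadata is not refreshed automatically. Treat this as a sketch: the helper names, table names, columns, volume, and integration names are hypothetical, values are not escaped, and the syntax should be checked against the current Snowflake documentation.

```python
from typing import List, Tuple


def managed_iceberg_table_ddl(table: str, columns: List[Tuple[str, str]],
                              external_volume: str, base_location: str) -> str:
    """Snowflake-managed table: Snowflake itself acts as the catalog."""
    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE ICEBERG TABLE {table} ({column_list})\n"
        "  CATALOG = 'SNOWFLAKE'\n"
        f"  EXTERNAL_VOLUME = '{external_volume}'\n"
        f"  BASE_LOCATION = '{base_location}';"
    )


def external_iceberg_table_ddl(table: str, external_volume: str,
                               catalog_integration: str,
                               catalog_table_name: str) -> str:
    """Externally managed table: schema and metadata come from the external catalog."""
    return (
        f"CREATE ICEBERG TABLE {table}\n"
        f"  EXTERNAL_VOLUME = '{external_volume}'\n"
        f"  CATALOG = '{catalog_integration}'\n"
        f"  CATALOG_TABLE_NAME = '{catalog_table_name}';"
    )


def refresh_ddl(table: str) -> str:
    """Manual metadata refresh for an externally managed table."""
    return f"ALTER ICEBERG TABLE {table} REFRESH;"


# All object names below are hypothetical placeholders.
managed_sql = managed_iceberg_table_ddl(
    "orders_managed",
    [("order_id", "INT"), ("amount", "NUMBER(10,2)")],
    external_volume="iceberg_vol",
    base_location="orders_managed/",
)
external_sql = external_iceberg_table_ddl(
    "orders_external",
    external_volume="iceberg_vol",
    catalog_integration="glue_catalog_int",
    catalog_table_name="orders",
)
refresh_sql = refresh_ddl("orders_external")

for statement in (managed_sql, external_sql, refresh_sql):
    print(statement, end="\n\n")
```

Note that the externally managed statement has no column list, since the external catalog's metadata already defines the schema.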
Iceberg Table Performance
The picture below shows the performance of different table types in Snowflake.
Externally managed Iceberg tables perform 2x better than Snowflake external tables. Snowflake caches their data in its SSD cache and reuses it across queries, and its highly optimized Parquet scanner reads the Parquet files of Iceberg tables faster.
Snowflake-managed Iceberg tables perform better still because Snowflake writes Parquet files efficiently, with full statistics. Their performance is almost the same as that of Snowflake native tables.
When to Use Iceberg Tables in Snowflake
The data needs to be stored in a public cloud using an open file format like Parquet.
Multiple teams are using different compute engines (Snowflake, Spark, etc.) to analyze the same data without storing it redundantly.
Data is already stored in the public cloud in an Iceberg table format, and one does not want to migrate data to Snowflake native tables.
Limitations of Iceberg Tables in Snowflake
Snowflake supports Iceberg tables only when the external volume is in the same cloud and region as the Snowflake account. Cross-cloud and cross-region Iceberg tables are not supported.
Only the Parquet format is supported for data files.
Only permanent Iceberg tables can be created; transient and temporary Iceberg tables are not supported.
Time travel in Spark is not supported for Snowflake-managed Iceberg tables.
Snowflake Iceberg tables cannot be cloned or replicated.
Third-party clients cannot modify data in Snowflake Iceberg tables.
Conclusion
With the incorporation of Iceberg tables, Snowflake has become more flexible, enabling customers to leverage Snowflake’s performance and features for data they prefer to store outside the platform. By using Iceberg tables, customers can reduce storage costs and allow various teams within an enterprise to utilize different technologies for processing and analyzing data without the need to copy it to their respective platforms.
Need assistance getting started with Iceberg tables in Snowflake?
As the 2023 Snowflake Partner of the Year, phData can help your organization get the most out of Snowflake.
FAQs
The performance of Snowflake-managed Iceberg tables is on par with Snowflake native tables, even though the data is stored in public cloud storage. This is ideal when the data is already stored in a data lake and you do not intend to load it into Snowflake but still need Snowflake's features and performance.
While external tables are read-only tables created to access data files stored outside Snowflake, Iceberg tables are read/write-capable tables that store data in public cloud storage using an open table format. Iceberg tables support many features, such as time travel, schema evolution, hidden partitioning, and multiple query engine support.