Organizations often store data in Avro files to exchange data between programs written in different languages without sacrificing efficiency. This can make processing challenging for many traditional databases, but not for the Snowflake Data Cloud.
In this blog, we will help you master handling Avro files to allow for seamless data operations with Snowflake.
What is Avro?
Avro is a data serialization framework used to exchange data between systems and languages. Avro files are typically used in distributed computing systems and are efficient for storing large datasets due to their compact binary nature.
Avro stores the data definition in JSON format, which makes it easy to read and interpret. Avro files include markers that can be used to split large data sets into subsets.
The key feature of Avro is its excellent support for data schemas that change over time, which is called schema evolution. Avro effortlessly handles schema changes like missing fields, added fields, and changed fields. Data stored using Avro can be passed between programs written in different languages, even from a compiled language like C to a scripting language like Apache Pig.
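To illustrate schema evolution, here is a hypothetical evolved version of the small example schema shown later in this post, where an email field has been added with a null default so that readers using the new schema can still process records written with the old one:
{
  "type": "record",
  "name": "example",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}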
Two Types of Avro File Extensions
file_name.avro: These files contain serialized data records along with the Avro schema, which describes the structure of the data, in a compact binary format.
file_name.avsc: These files contain Avro schemas. Avro schemas are written in JSON format and define the structure of the data stored in Avro files. They describe the fields, data types, and structure of the data. .avsc files hold these schema definitions in a human-readable format, separate from the data itself. The purpose of keeping a separate schema file is to maintain version control, ease of modification, and reusability across different datasets or applications using Avro serialization.
Sample Avro Data: user.avro
Obj
{"type":"record","name":"example","fields":[{"name":"id","type":"int"},{"name":"name","type":"string"}]}
1Alice2Bob
What Does It Mean?
Obj denotes the Avro data file in binary format. The following section
{
"type":"record",
"name":"example",
"fields":[
{"name":"id", "type":"int"},
{"name":"name", "type":"string"}
]
}
is the Avro schema header.
The data records (1Alice and 2Bob) correspond to the actual data. In this case, they represent two records: one with id=1 and name=Alice, and another with id=2 and name=Bob.
How to Load Avro Files Into Snowflake Tables
In this example, we’ll say you have a file named part-1.avro in an S3 bucket, and your task is to load the Avro file into a Snowflake table.
Step 1: Create a storage integration in Snowflake as per the Snowflake documentation.
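As a reference, a minimal storage integration sketch might look like the following; the IAM role ARN is a placeholder, and the integration name s3_int matches the stage definition in Step 2:
CREATE STORAGE INTEGRATION s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  -- Placeholder ARN: replace with the IAM role that grants Snowflake access to your bucket
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::000000000000:role/snowflake-access-role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://phdata-blog/avro-files/');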
Step 2: Create an external stage in Snowflake, which will point to the bucket where your Avro file resides.
CREATE STAGE s3_dev_stg
STORAGE_INTEGRATION = s3_int
URL = 's3://phdata-blog/avro-files/'
FILE_FORMAT = (TYPE = 'AVRO');
In the above code, we’ve provided the file format inline. You can choose to include the file format at the time of stage creation or to create a separate, reusable file format for your Avro files.
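If you go the separate-file-format route, a minimal sketch (using an assumed name, avro_ff) could look like this:
CREATE FILE FORMAT avro_ff
  TYPE = 'AVRO';
-- The stage (or the COPY command) can then reference the named format, e.g. FILE_FORMAT = avro_ff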
Step 3: Load the data into a table.
Consider the table definition as follows:
CREATE TABLE dev_db.phdata_blog.customer (
  id NUMBER(38,0),
  name VARCHAR(100),
  load_date DATE
);
Running the COPY INTO command:
COPY INTO dev_db.phdata_blog.customer FROM (
SELECT
$1:id::NUMBER,
$1:name::VARCHAR,
current_date AS load_date
FROM @dev_db.phdata.s3_dev_stg/customer/
)
Scenarios Where More Granularity May Be Required
Scenario 1: Getting Date from Filename in a Snowflake Table
Suppose a vendor performs daily file dumps in S3 and appends a date at the end of each filename. The organization should keep track of each file's date in the load_date column of the table.
Filenames may look as follows: Part-1-20231201.avro, Part-1-20231202.avro, etc.
Here’s what the copy command would look like:
COPY INTO dev_db.phdata_blog.customer FROM (
SELECT
$1:id::NUMBER,
$1:name::VARCHAR,
TO_DATE(SUBSTR(METADATA$FILENAME, -13, 8), 'YYYYMMDD')
FROM @dev_db.phdata.s3_dev_stg/customer/
)
METADATA$FILENAME will extract the entire file path from S3, and SUBSTR will help us get the date part out of that.
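As a quick sanity check on the substring math, here is a standalone example using a hypothetical staged path from this scenario:
-- 'Part-1-20231201.avro' ends with 13 characters ('20231201.avro'); taking 8 of them yields the date string
SELECT SUBSTR('avro-files/customer/Part-1-20231201.avro', -13, 8);                        -- '20231201'
SELECT TO_DATE(SUBSTR('avro-files/customer/Part-1-20231201.avro', -13, 8), 'YYYYMMDD');   -- 2023-12-01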
Scenario 2: Filtering Only Avro Files
In some instances, an organization may find .SUCCESS files accompanying Avro files. These .SUCCESS files serve as indicators, signaling the successful completion of the data import process for specific datasets, as demonstrated below:
Part-1-20231201.avro
Part-1-20231202.avro
.SUCCESS (0-byte file)
In this case, the above copy command will fail because it will encounter the 0-byte .SUCCESS file.
To fix this, modify the copy command with a PATTERN option so that only Avro files are loaded:
COPY INTO dev_db.phdata_blog.customer FROM (
SELECT
$1:id::NUMBER,
$1:name::VARCHAR,
TO_DATE(SUBSTR(METADATA$FILENAME, -13, 8), 'YYYYMMDD')
FROM @dev_db.phdata.s3_dev_stg/customer/
)
PATTERN = '.*[.]avro'
Scenario 3: Loading Avro Data into Variant Columns for Direct Querying in Snowflake
We are using the userdata.avro example file. The schema looks like the following:
{
"type":"record",
"name":"User",
"fields":[
{"name":"birthdate", "type":"string"},
{"name":"cc", "type":"long"},
{"name":"comments", "type":"string"},
{"name":"country", "type":"string"},
{"name":"email", "type":"string"},
{"name":"first_name", "type":"string"},
{"name":"gender", "type":"string"},
{"name":"id", "type":"int"},
{"name":"ip_address", "type":"string"},
{"name":"last_name", "type":"string"},
{"name":"registration_dttm", "type":"string"},
{"name":"salary", "type":"double"},
{"name":"title", "type":"string"}
]
}
Below, we created a table of user data with one VARIANT column, user_info, and loaded the data using the copy command (a sketch of both steps follows). The file format is Avro, which is defined at the stage level.
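A minimal sketch of those two steps, assuming a table named user_data_variant in the same database and a userdata/ prefix under the stage, might look like this:
CREATE TABLE dev_db.phdata_blog.user_data_variant (
  user_info VARIANT
);

-- No column-level transformation is needed; each Avro record lands in the single VARIANT column
COPY INTO dev_db.phdata_blog.user_data_variant
FROM @dev_db.phdata.s3_dev_stg/userdata/;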
Snowflake has read the Avro user data into a single VARIANT column. Now, we can query the data in the VARIANT column user_info just as we would JSON data or any other semi-structured data.
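For example, a query against the hypothetical user_data_variant table sketched above might look like this:
SELECT
  user_info:first_name::STRING AS first_name,
  user_info:country::STRING AS country,
  user_info:salary::FLOAT AS salary
FROM dev_db.phdata_blog.user_data_variant
WHERE user_info:gender::STRING = 'Female';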
Best Practices
The field names referenced in the COPY INTO command should match the Avro schema file or the Avro file header; otherwise, it will cause an error.
Snowflake does provide a column-matching option for loading Avro files (shown below), but it doesn’t work when you run the copy command with a transformation.
MATCH_BY_COLUMN_NAME = CASE_SENSITIVE | CASE_INSENSITIVE
Let’s take the same userdata.avro file, as mentioned in Scenario 3, and create a new table as per the user data schema.
CREATE TABLE user_data (
birthdate STRING,
cc BIGINT,
comments STRING,
country STRING,
email STRING,
first_name STRING,
gender STRING,
id INT,
ip_address STRING,
last_name STRING,
registration_dttm STRING,
salary FLOAT,
title STRING
);
Running the copy command without any transformation, using the MATCH_BY_COLUMN_NAME option, works smoothly (see the sketch below).
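A minimal sketch of that non-transforming copy, assuming the userdata.avro file sits under a userdata/ prefix on the stage:
COPY INTO user_data
FROM @dev_db.phdata.s3_dev_stg/userdata/
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;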
But, if we use the copy command with transformation, it will throw an error.
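For illustration, here is a hedged sketch of such a transforming copy; the extra filename column and the stage path are assumptions, and it is the combination of a SELECT transformation with MATCH_BY_COLUMN_NAME that triggers the error:
COPY INTO user_data FROM (
  SELECT
    $1:birthdate::VARCHAR,
    $1:cc::BIGINT,
    $1:comments::VARCHAR,
    $1:country::VARCHAR,
    $1:email::VARCHAR,
    $1:first_name::VARCHAR,
    $1:gender::VARCHAR,
    $1:id::INT,
    $1:ip_address::VARCHAR,
    $1:last_name::VARCHAR,
    $1:registration_dttm::VARCHAR,
    $1:salary::FLOAT,
    $1:title::VARCHAR,
    METADATA$FILENAME  -- populates the assumed extra filename column
  FROM @dev_db.phdata.s3_dev_stg/userdata/
)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;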
Here, even $1:birthdate::VARCHAR is a transformation for Snowflake. We have also added an extra column, filename, to the table and populated it using the METADATA$FILENAME feature available in Snowflake. The above command would work smoothly if we omitted MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE.
The file format can be declared at the time of stage creation, at file format creation, or even when running the COPY INTO command. It’s totally up to you.
Conclusion
Snowflake’s support for Avro is powerful, allowing for effortless data loading, querying, and analysis. The native compatibility and optimized performance with Avro files empower organizations to manage complex datasets efficiently within Snowflake’s cloud data platform. Snowflake can handle different Avro file processing scenarios with ease.