In our role as Solution Architects, we engage in various discussions with clients regarding data ingestion, transformation, and related topics. During a recent meeting focused on data ingestion, a client highlighted a recurring challenge: they receive data in the form of flat files, and the source systems frequently alter the layout of these files.
The dynamic nature of these changes, involving the addition or removal of columns, necessitates additional efforts from the development team. Specifically, they must inspect the file, adjust the table schema, and subsequently load the data.
This process introduces considerable time and effort into the overall data ingestion workflow, delaying the availability of data to end consumers. Fortunately, the client has opted for Snowflake Data Cloud as their target data warehouse.
Snowflake’s COPY command includes schema detection and evolution features, which prove invaluable in addressing this challenge. This functionality eliminates the need for manual schema adjustments, streamlining the data ingestion process and ensuring quicker access to data for their consumers.
Where to Use Schema Evolution?
It is incredibly simple to use the INFER_SCHEMA and SCHEMA EVOLUTION features to speed up data ingestion into Snowflake. Schema evolution works with both batch COPY and Snowpipe (see the sketch below). If the layout of incoming data files changes often, this is a great way to automate the data load. Schema evolution supports common file formats such as CSV, Parquet, and JSON.
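As a rough sketch of the Snowpipe case, a pipe simply wraps the same kind of COPY statement used later in this post. MY_PIPE, MY_TABLE, and MY_STAGE are hypothetical names, the file format and copy options mirror the COPY example further down, and the target table must already have schema evolution enabled:
CREATE PIPE MY_PIPE AS
COPY INTO MY_TABLE
FROM @MY_STAGE
FILE_FORMAT = (FORMAT_NAME='CSV_FORMAT' ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;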
Benefits:
Reduced Development Effort – As Snowflake automates the schema detection and evolution process, it significantly reduces the effort required from the development team. There’s no need for developers or analysts to manually adjust table schemas or modify ETL (Extract, Transform, Load) processes whenever the source data structure changes.
Time Efficiency – The automated schema detection and evolution features contribute to faster data availability. Without the need for manual adjustments, the data ingestion process becomes more efficient, enabling quicker insights for end consumers.
Compatibility with Diverse Data Sources – Snowflake’s schema detection and evolution features are compatible with various data sources, including flat files such as CSV as well as semi-structured formats like JSON and Parquet. This adaptability enhances the platform’s versatility in handling diverse data inputs.
Enhanced Data Warehousing Experience – By automating schema-related tasks, Snowflake contributes to a more seamless and user-friendly data warehousing experience. Data Analysts and Scientists can focus on analyzing and deriving insights from data rather than dealing with the complexities of schema modifications.
In this blog, we will look at how you can use Snowflake’s schema inference and evolution features to continuously adjust the schema as you load the data from source systems. We will use simple CSV files for this blog.
For this setup, you need the following.
A Snowflake account set up with a demo database and schema to load data into.
Sample CSV files (download files here)
Step 1: Load Sample CSV Files Into the Internal Stage Location
Open the SQL worksheet and create a stage if it doesn’t exist.
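A minimal sketch of that stage creation, assuming an internal named stage called csv_stage (the name used in the rest of this post):
CREATE STAGE IF NOT EXISTS csv_stage;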
Once the stage is created, navigate to the stage in Snowsight using the menu path listed below.
From the homepage: Data > Databases > select your database and schema > Stages.
Click the + Files button to upload the sample files.
Go back to the SQL worksheet and verify that the files exist.
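For example, listing the stage contents confirms the upload:
LIST @csv_stage;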
Step 2: Infer Schema on Initial File and Create Table
Snowflake provides functions to infer the schema of staged files and generate the DDL for you. This is incredibly useful for both Data Engineers and Data Scientists. During the development phase, Data Engineers can quickly use INFER_SCHEMA to scan text files and generate DDLs. This is very useful when hundreds of different layouts need to be loaded into Snowflake. Data Scientists, on the other hand, can use these functions to quickly infer the schema and start analyzing the data without waiting for the data to be loaded by the Engineering team.
Let’s infer the schema from the initial CSV file. This requires a FILE FORMAT to be created with the PARSE_HEADER option enabled. Once the file format is created, run the INFER_SCHEMA function on the CSV file.
CREATE OR REPLACE FILE FORMAT CSV_FORMAT TYPE=CSV PARSE_HEADER=TRUE FIELD_OPTIONALLY_ENCLOSED_BY='"';
SELECT *
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@csv_stage/customer-initial-data.csv'
, FILE_FORMAT=>'csv_format'
)
);
Note the option PARSE_HEADER=TRUE in the file format definition; INFER_SCHEMA uses this to generate the field names from the header row. The output of this function includes the inferred column name, data type, nullability, and expression, along with the source file name and column order.
The data types generated depend on the number of rows parsed from the file. If you use a relatively small sample, the inferred data types may be inaccurate.
Step 3: Load the Initial File Into the Snowflake Table
If you are satisfied with the data types, create the table using the TEMPLATE feature of the CREATE TABLE command.
CREATE TABLE CUSTOMER_TABLE
USING TEMPLATE (
SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*))
WITHIN GROUP (ORDER BY order_id)
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@csv_stage/customer-initial-data.csv',
FILE_FORMAT=>'csv_format'
)
));
This creates CUSTOMER_TABLE with the definition generated by INFER_SCHEMA. Once the table is created, loading the data is as simple as running the COPY command. Note that we are using the MATCH_BY_COLUMN_NAME option to automatically parse the header and match it against the table columns to load the data.
COPY INTO CUSTOMER_TABLE
FROM @csv_stage/customer-initial-data.csv
FILE_FORMAT = (FORMAT_NAME='CSV_FORMAT')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
Verify that the data is loaded by querying the table.
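For example, a quick query against the new table is enough to confirm the load:
SELECT * FROM CUSTOMER_TABLE;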
Step 4: Load the Second File with Schema Evolution Enabled
Now that the initial file is loaded, let’s look at how schema evolution works in Snowflake. Schema evolution is a feature where Snowflake scans the incoming data as part of the COPY command, identifies any new columns, and adds them to the table before loading the data.
For validation, let’s run INFER_SCHEMA on the second file to see the difference.
SELECT *
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@csv_stage/customer-updated-data.csv'
, FILE_FORMAT=>'csv_format'
)
);
We can see that there are two new fields – Email and Subscription date.
When we run the COPY command, Snowflake should add the two new columns and then load the data. As a first step, enable schema evolution on the table.
ALTER TABLE CUSTOMER_TABLE SET ENABLE_SCHEMA_EVOLUTION = TRUE;
Note that this requires SECURITYADMIN, or any role that manages grants, to grant the EVOLVE SCHEMA privilege on the table to the role that manages this table load.
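A minimal sketch of that grant, where LOAD_ROLE is a hypothetical placeholder for the role that runs the load:
GRANT EVOLVE SCHEMA ON TABLE CUSTOMER_TABLE TO ROLE LOAD_ROLE;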
Run the COPY command on the new file, as shown below. This COPY command requires ERROR_ON_COLUMN_COUNT_MISMATCH to be set to FALSE; otherwise, the COPY command will error out with a column count mismatch message.
COPY INTO CUSTOMER_TABLE
FROM @csv_stage/customer-updated-data.csv
FILE_FORMAT = (FORMAT_NAME='CSV_FORMAT' ERROR_ON_COLUMN_COUNT_MISMATCH=FALSE)
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
The COPY command now succeeds and loads the data into CUSTOMER_TABLE with two additional columns. For the older records, the new columns will be set to NULL. Validate the data by querying the table.
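For example, DESCRIBE TABLE confirms the two new columns were added, and a simple query shows NULL values in those columns for the rows loaded from the first file:
DESCRIBE TABLE CUSTOMER_TABLE;
SELECT * FROM CUSTOMER_TABLE;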
In the next blog, we will see how to use Schema evolution with Kafka to automate the creation of fields as the data is ingested.
Closing
This overview provides a high-level understanding of the Schema Detection and Evolution process. However, the intricacies of implementation in a real production environment involve additional steps. Structuring the data pipeline to dynamically accommodate schema changes in production and implementing notifications for such changes are crucial aspects.
At phData, our team of highly skilled data engineers specializes in ETL/ELT processes across various cloud environments. If you require assistance or expertise in data engineering, don’t hesitate to connect with our knowledgeable professionals at phData. We’re here to empower your data solutions and ensure their seamless evolution in dynamic production environments.