This blog was co-written by Sanket Tawer & Surya Chandra Balina.
External tables offer a flexible and efficient approach to accessing and integrating data from external sources within the Snowflake Data Cloud ecosystem.
They simplify data loading, support data integration, enable data sharing, and provide cost-effective data archival capabilities, making them valuable features for data management and analysis tasks.
Snowflake’s external tables can be applied across various industries and sectors where data integration, analysis, and data sharing are crucial. Some of the industries and sectors where Snowflake’s external tables find relevance include:
- Financial Services
- Retail/E-commerce
- Healthcare
- Manufacturing/Supply Chain
- Technology/Software
- Media/Entertainment
- Government/Public Sector
- Research/Academia
In this blog, we’ll cover what external tables and directory tables are in Snowflake and why they are important for your business.
What are External Tables in Snowflake?
An external table is a Snowflake feature in which the table's data lives outside the database, in files such as text-based delimited files or fixed-length format files. It lets you keep data outside the database while retaining the ability to query that data.
These files need to be in one of the Snowflake-supported cloud systems: Amazon S3, Google Cloud Storage, or Microsoft Azure Blob storage.
These are Snowflake objects that overlay a table structure on top of files stored in an EXTERNAL STAGE. They provide a “read-only” level of access for data within these remote files straight from the object store.
These tables store metadata (name, path, version identifier, etc.) to facilitate this type of access, which is made available through VIEWs and TABLEs in the INFORMATION_SCHEMA.
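For example, the external tables defined in a database can be inspected through the INFORMATION_SCHEMA (the database name below is a placeholder):

```sql
-- Inspect metadata about external tables via the INFORMATION_SCHEMA
SELECT *
FROM MY_DATABASE.INFORMATION_SCHEMA.EXTERNAL_TABLES;
```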
Why External Tables are Important
- Data Ingestion: External tables allow you to easily load data into Snowflake from various external data sources without the need to first stage the data within Snowflake.
- Data Integration: Snowflake supports seamless integration with other data processing systems and data lakes. External tables provide a way to access and query data that resides in external systems or formats.
- Cost Efficiency: Storing data in Snowflake’s native storage is typically more expensive than storing data in cloud storage services like Amazon S3 or Azure Blob Storage. By using external tables, you can keep your cold or infrequently accessed data in cheaper storage tiers while still being able to query and analyze the data as if it were stored within Snowflake. This helps optimize your storage costs while maintaining data accessibility.
- Data Sharing: Snowflake’s data sharing feature allows you to securely share data with other accounts or organizations. External tables play a crucial role in data sharing by allowing you to grant access to specific external tables stored in your cloud storage.
- Data Archival: External tables are often used for long-term data archival purposes. As data ages and becomes less frequently accessed, you can move it to cheaper storage systems while preserving the ability to query it through external tables.
Overall, external tables in Snowflake offer flexibility, efficiency, and seamless integration with external data sources, enabling you to ingest, integrate, and analyze data from various locations and formats while optimizing storage costs.
How to Use External Tables in Snowflake
Let’s say you have a CSV file stored in an Amazon S3 bucket that contains customer information, and you want to query and analyze that data using an external table in Snowflake.
Step 1: Set up the External Stage
First, you need to set up an external stage in Snowflake that points to the location of your data file in Amazon S3. You can do this using the following command:
CREATE STAGE my_stage
URL='s3://my-bucket/my-data-folder/'
CREDENTIALS=(AWS_KEY_ID='your_aws_key_id' AWS_SECRET_KEY='your_aws_secret_key');
Replace my_stage with the name you want to assign to your stage, s3://my-bucket/my-data-folder/ with the actual path of your data file in Amazon S3, and provide your AWS credentials (your_aws_key_id and your_aws_secret_key).
Another option is to provide credentials using a storage integration; see CREATE STORAGE INTEGRATION in the Snowflake documentation.
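A storage integration avoids embedding AWS keys in the stage definition. A minimal sketch (the integration name, IAM role ARN, and bucket path below are placeholders you must replace with your own):

```sql
-- Create a storage integration that delegates authentication to an IAM role
CREATE STORAGE INTEGRATION my_s3_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my_snowflake_role'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-bucket/my-data-folder/');

-- Create the stage using the integration instead of raw credentials
CREATE STAGE my_stage
  URL = 's3://my-bucket/my-data-folder/'
  STORAGE_INTEGRATION = my_s3_integration;
```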
Step 2: Create the External Table
Next, you can create the external table referencing the data file in your external stage. Here’s an example:
CREATE EXTERNAL TABLE my_external_table (
customer_id INT AS (VALUE:c1::INT),
first_name STRING AS (VALUE:c2::STRING),
last_name STRING AS (VALUE:c3::STRING),
email STRING AS (VALUE:c4::STRING)
)
LOCATION = @my_stage
FILE_FORMAT = (TYPE = CSV);
In this example, my_external_table is the name of the external table, and it has four columns: customer_id, first_name, last_name, and email. Because external table columns must be defined as expressions over the VALUE variant column that Snowflake exposes for each file row, each column is cast from the corresponding CSV field (c1 through c4). The LOCATION parameter is set to @my_stage, which refers to the external stage you created earlier. The FILE_FORMAT parameter is set to CSV since the data file is in CSV format.
Step 3: Query the External Table
Once the external table is created, you can query and analyze the data using standard SQL queries in Snowflake. For example, you can retrieve all the customer records from the external table:
SELECT * FROM my_external_table;
You can also apply filtering, aggregations, and joins to the external table, just like a regular table in Snowflake.
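For instance, building on the my_external_table defined above (the filter value is illustrative):

```sql
-- Filter rows just like a regular table
SELECT customer_id, email
FROM my_external_table
WHERE last_name = 'Smith';

-- Aggregate across all the underlying files
SELECT COUNT(*) AS customer_count
FROM my_external_table;
```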
Step 4: Data Loading and Updates
If you add or update the data file in the Amazon S3 bucket, you can refresh the external table to reflect the changes. This can be done using the ALTER EXTERNAL TABLE command:
ALTER EXTERNAL TABLE my_external_table REFRESH;
Snowflake will detect the changes in the data file and update the metadata associated with the external table accordingly.
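If the underlying files change frequently, refreshing can also be automated: Snowflake supports auto-refreshing external table metadata via your cloud provider's event notifications. A minimal sketch, assuming the notification setup (e.g., S3 event notifications) is already configured for the bucket:

```sql
-- AUTO_REFRESH relies on cloud event notifications being configured;
-- the single column definition here is illustrative.
CREATE EXTERNAL TABLE my_auto_refreshed_table (
  customer_id INT AS (VALUE:c1::INT)
)
LOCATION = @my_stage
FILE_FORMAT = (TYPE = CSV)
AUTO_REFRESH = TRUE;
```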
External tables in Snowflake enable seamless integration and analysis of data stored in a data lake. They simplify the data exploration process, reduce data movement, and provide cost efficiency, allowing organizations to unlock insights from their existing data lake infrastructure using Snowflake’s powerful analytics capabilities.
What are Directory Tables in Snowflake?
Directory tables are used to store metadata about the staged files. Users with proper privileges can query directory tables to retrieve file URLs to access the staged files. Using pre-signed URLs and directory tables, the users can access the file securely without needing direct cloud provider access.
A directory table is not a separate database object but an implicit object layered on a stage.
Why Directory Tables are Important
Consider a common real-world scenario: a source system feeds a file to an external stage, and that file is consumed into the Snowflake database using the COPY command.
However, due to compliance requirements, other Snowflake users are not authorized to log in directly to the cloud provider (AWS/GCP/Azure) account that hosts the staged files.
To address this situation, a possible solution is to provide a mechanism for the parallel users to download and access the file without requiring direct access to the cloud provider. One approach could be combining Snowflake’s GET_PRESIGNED_URL function and DIRECTORY tables. Here’s an overview of the process:
- Use the GET_PRESIGNED_URL function: This function generates a pre-signed URL for a specific file in a stage. It ensures secure access to the file for a limited time.
- Query the DIRECTORY table: The DIRECTORY table in Snowflake contains metadata about the files in a stage. You can query this table to obtain information about the files, such as their names, sizes, and other attributes.
- Combine the information: By combining the results from the GET_PRESIGNED_URL function and the DIRECTORY table, you can obtain a pre-signed URL specific to the file you want to download or access. This URL can be used in Snowsight or any web browser to retrieve the file’s content.
How to Create Directory Tables
A directory table can be added explicitly to a location when creating the stage using the “CREATE STAGE” command or, at a later point, using the “ALTER STAGE” command.
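For example, enabling a directory table on a stage that already exists can be sketched as (the stage name is illustrative):

```sql
-- Enable a directory table on an existing stage
ALTER STAGE my_existing_stage SET DIRECTORY = ( ENABLE = TRUE );

-- Populate the directory table's metadata for files already in the stage
ALTER STAGE my_existing_stage REFRESH;
```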
Directory tables store file-level metadata about the data files in a stage, which includes the following fields:

| Column | Data Type | Description |
|---|---|---|
| RELATIVE_PATH | TEXT | Path to the files to access using the file URL. |
| SIZE | NUMBER | Size of the file (in bytes). |
| LAST_MODIFIED | TIMESTAMP_LTZ | Timestamp when the file was last updated in the stage. |
| MD5 | HEX | MD5 checksum for the file. |
| ETAG | HEX | ETag header for the file. |
| FILE_URL | TEXT | Snowflake-hosted file URL to the file. |
How to Create a Directory Table on the Named Internal Stage
CREATE OR REPLACE STAGE MY_INTERNAL_STAGE
DIRECTORY = ( ENABLE = TRUE)
FILE_FORMAT= (TYPE = 'CSV'
FIELD_DELIMITER = ','
SKIP_HEADER = 1)
ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE') ;
How to Create a Directory Table on the External Stage
CREATE STAGE MY_EXTERNAL_STAGE
URL='s3://my-bucket/my-data-folder/'
CREDENTIALS=(AWS_KEY_ID='your_aws_key_id' AWS_SECRET_KEY='your_aws_secret_key')
DIRECTORY = ( ENABLE = TRUE);
How to Add Files to the Named Internal Stage
PUT file:///path/to/your_file.csv @MY_INTERNAL_STAGE;
Replace file:///path/to/your_file.csv with the local path to your data file (the path shown here is a placeholder). Note that PUT is typically run from a client such as SnowSQL or a driver rather than from the web worksheet UI.
Query a Directory Table
SELECT * FROM DIRECTORY( @MY_INTERNAL_STAGE );
If the above query returns no rows or stale results, the directory table metadata needs to be refreshed. A refresh is required whenever the staged files are modified (i.e., files are added, updated, or removed).
How to Refresh the Directory Table Metadata
- When a new file is added/removed/updated in an external/internal stage, it is required to refresh the directory table to synchronize the metadata with the latest set of associated files in the stage and path.
- It is possible to refresh the metadata automatically for directory tables on external stages using the event messaging service for your cloud storage service.
- Automatic metadata refreshing is not supported for directory tables located on internal stages. We must manually refresh the directory table metadata for internal stages.
- Below is an example of refreshing the directory table on an internal stage manually.
How to Manually Refresh Directory Table Metadata
Use the ALTER STAGE command to refresh the metadata in a directory table on the external/internal stage.
ALTER STAGE MY_INTERNAL_STAGE REFRESH;
Next, select from the directory table.
How to Access Staged Files Using Pre-Signed URL & Directory Table
The GET_PRESIGNED_URL function generates a pre-signed URL to a staged file, using the stage name and the relative file path as inputs.
Files in a stage can be accessed by navigating directly to the pre-signed URL in a web browser. This allows for direct retrieval and viewing of the files stored in the stage.
Syntax:
GET_PRESIGNED_URL( @<stage_name> , '<relative_file_path>' , [ <expiration_time> ] )
stage_name: Name of the internal or external stage where the file is stored.
relative_file_path: Path and filename of the file relative to its location in the stage.
expiration_time: Length of time (in seconds) after which the short-term access token expires. Default value: 3600 (60 minutes).
Use the directory table and the GET_PRESIGNED_URL function to generate a pre-signed URL for a file in the stage, then copy and paste that URL into a browser to download the file to your computer. The generated URL is valid only for the duration given as expiration_time; for example, if expiration_time is set to 60, the URL expires after 60 seconds.
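As a sketch, assuming the MY_INTERNAL_STAGE stage created earlier, the directory table and GET_PRESIGNED_URL can be combined like this:

```sql
-- Generate a 60-second pre-signed URL for every file in the stage
SELECT
    RELATIVE_PATH,
    GET_PRESIGNED_URL(@MY_INTERNAL_STAGE, RELATIVE_PATH, 60) AS presigned_url
FROM DIRECTORY(@MY_INTERNAL_STAGE);
```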
Use Case
Directory tables can also be used with standard (delta) streams to monitor the addition or removal of files within the specified cloud storage location.
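As a sketch of this pattern (the stream name is illustrative, and the stage is assumed to have its directory table enabled), a standard stream can be created directly on the stage:

```sql
-- Create a stream that tracks file-level changes in the stage's directory table
CREATE OR REPLACE STREAM my_stage_stream ON STAGE MY_INTERNAL_STAGE;

-- Refresh the directory table, then consume records for added/removed files
ALTER STAGE MY_INTERNAL_STAGE REFRESH;
SELECT * FROM my_stage_stream;
```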
Closing
To download or access content in a specific file within a stage through Snowsight, you can utilize the combination of Snowflake’s GET_PRESIGNED_URL function and DIRECTORY tables.
By leveraging the GET_PRESIGNED_URL function and DIRECTORY tables, you can effectively manage and access the content of individual files within a stage using Snowsight or any other means compatible with pre-signed URLs.
If you have any questions about using External and Directory tables in Snowflake, contact our team of experts!